
Cf. NLTK GitHub issue #1214: there are quite a few alternative tokenizers in NLTK =)

E.g., using the NLTK port of @jonsafari's toktok tokenizer:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True
>>> from nltk.tokenize.toktok import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> sent = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> toktok.tokenize(sent)
[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?', u'\xa1Hola', u'!', u'\xbf', u'D\xf3nde', u'estoy', u'?']
>>> print " ".join(toktok.tokenize(sent))
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ?
>>> from nltk import sent_tokenize
>>> sentences = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')]
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
>>> print '\n'.join([' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')])
¿ Quién eres tú ?
¡Hola !
¿ Dónde estoy ?
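For readers without the NLTK data packages installed, the sentence-splitting step above can be approximated with a stdlib-only sketch. Note this is a crude stand-in for `sent_tokenize`, not its actual (Punkt-based) algorithm, and `naive_sent_split` is a hypothetical helper name:

```python
import re

# Crude stand-in for nltk.sent_tokenize (illustrative only): split after a
# terminal ?, ! or ., keeping the punctuation with its sentence.
def naive_sent_split(text):
    return [s.strip() for s in re.split(r"(?<=[?!.])\s+", text) if s.strip()]

print(naive_sent_split(u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"))
# ['¿Quién eres tú?', '¡Hola!', '¿Dónde estoy?']
```

This works for the example above because every sentence ends in `?` or `!`; a real sentence tokenizer also has to handle abbreviations, ellipses, etc., which is exactly what the Punkt models behind `sent_tokenize` are trained for.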

If you hack the code a little and add u'\xa1' in https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51 , you should be able to get:


[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1', u'Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
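The effect of that patch (splitting the inverted exclamation mark off as its own token, the way toktok already handles ¿) can be mimicked without touching NLTK at all. Below is a hypothetical stdlib-only sketch, not NLTK's code; `split_inverted_punct` is an invented name:

```python
import re

# Hypothetical sketch (not NLTK code): pad inverted and terminal punctuation
# (¿ ¡ ? !) with spaces, then split on whitespace, so each mark becomes a
# standalone token, matching the patched toktok output above.
def split_inverted_punct(sent):
    return re.sub(r"([¿¡?!])", r" \1 ", sent).split()

print([split_inverted_punct(s)
       for s in [u"¿Quién eres tú?", u"¡Hola!", u"¿Dónde estoy?"]])
# [['¿', 'Quién', 'eres', 'tú', '?'], ['¡', 'Hola', '!'], ['¿', 'Dónde', 'estoy', '?']]
```

Of course, this toy regex knows nothing about abbreviations, clitics, or multi-character punctuation, which is why the real ToktokTokenizer carries a much longer list of substitution rules.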

