API Reference

Blob Classes

Wrappers for various units of text.

This includes the main TextBlobDE, Word, and WordList classes.

Whenever possible, classes are inherited from the main TextBlob library. In many cases, however, the models for German have to be initialised here in textblob_de.blob, resulting in a lot of duplicate code. The main reason is the Word objects: if they are generated from an inherited class, they use the English models (e.g. for pluralize/singularize) of the main library.

Example usage:

>>> from textblob_de import TextBlobDE
>>> b = TextBlobDE("Einfach ist besser als kompliziert.")
>>> b.tags
[('Einfach', 'RB'), ('ist', 'VB'), ('besser', 'RB'), ('als', 'IN'), ('kompliziert', 'JJ')]
>>> b.noun_phrases
WordList([])
>>> b.words
WordList(['Einfach', 'ist', 'besser', 'als', 'kompliziert'])
class textblob_de.blob.BaseBlob(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]

BaseBlob class initialised with German default models:

An abstract base class that all textblob classes inherit from. Includes words, POS tag, noun phrase, and word count properties, as well as basic dunder and string methods that make blob objects behave like Python strings.

Parameters:
  • text (str) – A string.
  • tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
  • np_extractor – (optional) An NPExtractor instance. If None, defaults to PatternParserNPExtractor().
  • pos_tagger – (optional) A Tagger instance. If None, defaults to PatternTagger.
  • analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
  • classifier – (optional) A classifier.

Changed in version 0.6.0: clean_html parameter deprecated, as it was in NLTK.

classify()[source]

Classify the blob using the blob’s classifier.

correct()[source]

Attempt to correct the spelling of a blob.

New in version 0.6.0: (textblob)

Return type:BaseBlob
detect_language()[source]

Detect the blob’s language using the Google Translate API.

Requires an internet connection.

Usage:

>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference:
https://developers.google.com/translate/v2/using_rest#language-params

New in version 0.5.0.

Return type:str
ends_with(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)

Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)

Like blob.find() but raise ValueError when the substring is not found.

join(iterable)

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()

Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)[source]

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:List of WordLists
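What ngrams(n) computes can be sketched in plain Python, without the library's WordList wrapper (the sample sentence is the one from the example above):

```python
# Plain-Python sketch of n-gram extraction: every run of n successive words.
words = "Einfach ist besser als kompliziert".split()
n = 3
ngrams = [words[i:i + n] for i in range(len(words) - n + 1)]
```

The real method wraps each n-gram in a WordList instead of a plain list.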
noun_phrases

Returns a list of noun phrases for this blob.

np_counts

Dictionary of noun phrase frequencies in this text.

parse(parser=None)[source]

Parse the text.

Parameters:parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

New in version 0.6.0.

polarity

Return the polarity score as a float within the range [-1.0, 1.0]

Return type:float
pos_tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
replace(old, new, count=9223372036854775807)

Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)

Like blob.rfind() but raise ValueError when substring is not found.

sentiment

Return a tuple of the form (polarity, subjectivity) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Return type:namedtuple of the form Sentiment(polarity, subjectivity)
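The shape of the returned value can be sketched with a plain namedtuple (the library constructs this for you; the scores below are made up for illustration):

```python
from collections import namedtuple

# Sketch of the Sentiment result shape; the field names match the
# documented return type, the values are invented.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
s = Sentiment(polarity=0.7, subjectivity=0.8)

# Fields are accessible by name or by index, like any namedtuple.
print(s.polarity, s[1])
```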
sentiment_assessments

Return a tuple of the form (polarity, subjectivity, assessments) where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective, and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:namedtuple of the form Sentiment(polarity, subjectivity, assessments)
split(sep=None, maxsplit=9223372036854775807)[source]

Behaves like the built-in str.split() except returns a WordList.

Return type:WordList
starts_with(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

strip(chars=None)

Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:float
tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
title()

Returns a blob object with the text in title-case.

tokenize(tokenizer=None)[source]

Return a list of tokens, using tokenizer.

Parameters:tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.
tokens

Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

translate(from_lang=None, to='de')[source]

Translate the blob to another language.

upper()

Like str.upper(), returns new object with all upper-cased characters.

word_counts

Dictionary of word frequencies in this text.
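A word-frequency dictionary of this kind can be sketched with collections.Counter; whether the real word_counts case-folds exactly like this is an assumption:

```python
from collections import Counter

# Toy sketch of a word-frequency dictionary (case-folded here).
tokens = ["Das", "Wetter", "ist", "gut", "und", "das", "Essen", "ist", "gut"]
word_counts = Counter(t.lower() for t in tokens)
```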

words

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:A WordList of word tokens.
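The words/tokens distinction can be sketched in plain Python: tokens keep punctuation, words drop it. The library's actual tokenizers are considerably more sophisticated than this filter.

```python
import string

# Toy sketch: words are the tokens minus standalone punctuation.
tokens = ["Einfach", "ist", "besser", "als", "kompliziert", "."]
words = [t for t in tokens if t not in string.punctuation]
```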
class textblob_de.blob.BlobberDE(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]

A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.

Usage:

>>> from textblob_de import BlobberDE
>>> from textblob_de.taggers import PatternTagger
>>> from textblob_de.tokenizers import PatternTokenizer
>>> tb = BlobberDE(pos_tagger=PatternTagger(), tokenizer=PatternTokenizer())
>>> blob1 = tb("Das ist ein Blob.")
>>> blob2 = tb("Dieser Blob benutzt die selben Tagger und Tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True
Parameters:
  • tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
  • np_extractor – (optional) An NPExtractor instance. If None, defaults to PatternParserNPExtractor().
  • pos_tagger – (optional) A Tagger instance. If None, defaults to PatternTagger.
  • analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
  • classifier – (optional) A classifier.

New in version 0.4.0: (textblob)
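The factory pattern behind BlobberDE can be sketched in a few lines: build the (potentially expensive) components once and hand every produced blob a reference to the same instances. The class and attribute names below are illustrative, not the textblob-de internals.

```python
# Conceptual sketch of a Blobber-style factory (names are made up).
class SharedTagger:
    """Stand-in for a POS tagger that is costly to initialise."""
    pass

class BlobFactory:
    def __init__(self, pos_tagger=None):
        # Created once, shared by all blobs this factory produces.
        self.pos_tagger = pos_tagger or SharedTagger()

    def __call__(self, text):
        # Every call returns a new "blob" holding the shared tagger.
        return {"text": text, "pos_tagger": self.pos_tagger}

tb = BlobFactory()
blob1 = tb("Das ist ein Blob.")
blob2 = tb("Noch ein Blob.")
assert blob1["pos_tagger"] is blob2["pos_tagger"]  # same instance, as above
```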

class textblob_de.blob.Sentence(sentence, start_index=0, end_index=None, *args, **kwargs)[source]

A sentence within a TextBlob. Inherits from BaseBlob.

Parameters:
  • sentence – A string, the raw sentence.
  • start_index – An int, the index where this sentence begins in a TextBlob. If not given, defaults to 0.
  • end_index – An int, the index where this sentence ends in a TextBlob. If not given, defaults to the length of the sentence - 1.
classify()

Classify the blob using the blob’s classifier.

correct()

Attempt to correct the spelling of a blob.

New in version 0.6.0: (textblob)

Return type:BaseBlob
detect_language()

Detect the blob’s language using the Google Translate API.

Requires an internet connection.

Usage:

>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference:
https://developers.google.com/translate/v2/using_rest#language-params

New in version 0.5.0.

Return type:str
dict

The dict representation of this sentence.

end = None

The end index within a TextBlob

end_index = None

The end index within a TextBlob

ends_with(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)

Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)

Like blob.find() but raise ValueError when the substring is not found.

join(iterable)

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()

Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:List of WordLists
noun_phrases

Returns a list of noun phrases for this blob.

np_counts

Dictionary of noun phrase frequencies in this text.

parse(parser=None)

Parse the text.

Parameters:parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

New in version 0.6.0.

polarity

Return the polarity score as a float within the range [-1.0, 1.0]

Return type:float
pos_tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
replace(old, new, count=9223372036854775807)

Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)

Like blob.rfind() but raise ValueError when substring is not found.

sentiment

Return a tuple of the form (polarity, subjectivity) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Return type:namedtuple of the form Sentiment(polarity, subjectivity)
sentiment_assessments

Return a tuple of the form (polarity, subjectivity, assessments) where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective, and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:namedtuple of the form Sentiment(polarity, subjectivity, assessments)
split(sep=None, maxsplit=9223372036854775807)

Behaves like the built-in str.split() except returns a WordList.

Return type:WordList
start = None

The start index within a TextBlob

start_index = None

The start index within a TextBlob

starts_with(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

strip(chars=None)

Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:float
tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
title()

Returns a blob object with the text in title-case.

tokenize(tokenizer=None)

Return a list of tokens, using tokenizer.

Parameters:tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.
tokens

Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

translate(from_lang=None, to='de')

Translate the blob to another language.

upper()

Like str.upper(), returns new object with all upper-cased characters.

word_counts

Dictionary of word frequencies in this text.

words

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:A WordList of word tokens.
class textblob_de.blob.TextBlobDE(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]

TextBlob class initialised with German default models:

Parameters:
  • text (str) – A string.
  • tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
  • np_extractor – (optional) An NPExtractor instance. If None, defaults to PatternParserNPExtractor().
  • pos_tagger – (optional) A Tagger instance. If None, defaults to PatternTagger.
  • analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
  • classifier – (optional) A classifier.
classify()

Classify the blob using the blob’s classifier.

correct()

Attempt to correct the spelling of a blob.

New in version 0.6.0: (textblob)

Return type:BaseBlob
detect_language()

Detect the blob’s language using the Google Translate API.

Requires an internet connection.

Usage:

>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference:
https://developers.google.com/translate/v2/using_rest#language-params

New in version 0.5.0.

Return type:str
ends_with(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)

Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)

Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)

Like blob.find() but raise ValueError when the substring is not found.

join(iterable)

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

json

The json representation of this blob.

Changed in version 0.5.1: Made json a property instead of a method to restore backwards compatibility that was broken after version 0.4.0.

lower()

Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:List of WordLists
noun_phrases

Returns a list of noun phrases for this blob.

np_counts

Dictionary of noun phrase frequencies in this text.

parse(parser=None)

Parse the text.

Parameters:parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

New in version 0.6.0.

polarity

Return the polarity score as a float within the range [-1.0, 1.0]

Return type:float
pos_tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
raw_sentences

List of strings, the raw sentences in the blob.

replace(old, new, count=9223372036854775807)

Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)

Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)

Like blob.rfind() but raise ValueError when substring is not found.

sentences

Return list of Sentence objects.

sentiment

Return a tuple of the form (polarity, subjectivity) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Return type:namedtuple of the form Sentiment(polarity, subjectivity)
sentiment_assessments

Return a tuple of the form (polarity, subjectivity, assessments) where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective, and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:namedtuple of the form Sentiment(polarity, subjectivity, assessments)
serialized

Returns a list of each sentence’s dict representation.

split(sep=None, maxsplit=9223372036854775807)

Behaves like the built-in str.split() except returns a WordList.

Return type:WordList
starts_with(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)

Returns True if the blob starts with the given prefix.

strip(chars=None)

Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:float
tags

Returns a list of tuples of the form (word, POS tag).

Example:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
        ('Thursday', 'NNP'), ('morning', 'NN')]
Return type:list of tuples
title()

Returns a blob object with the text in title-case.

to_json(*args, **kwargs)[source]

Return a json representation (str) of this blob. Takes the same arguments as json.dumps.

New in version 0.5.1: (textblob)

tokenize(tokenizer=None)

Return a list of tokens, using tokenizer.

Parameters:tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.
tokens

Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

translate(from_lang=None, to='de')

Translate the blob to another language.

upper()

Like str.upper(), returns new object with all upper-cased characters.

word_counts

Dictionary of word frequencies in this text.

words

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:A WordList of word tokens.
class textblob_de.blob.Word(string, pos_tag=None)[source]

A simple word representation.

Includes methods for inflection, translation, and WordNet integration.

capitalize() → unicode

Return a capitalized version of S, i.e. make the first character have upper case and the rest lower case.

center(width[, fillchar]) → unicode

Return S centered in a Unicode string of length width. Padding is done using the specified fill character (default is a space)

correct()[source]

Correct the spelling of the word. Returns the word with the highest confidence using the spelling corrector.

New in version 0.6.0: (textblob)

count(sub[, start[, end]]) → int

Return the number of non-overlapping occurrences of substring sub in Unicode string S[start:end]. Optional arguments start and end are interpreted as in slice notation.

decode([encoding[, errors]]) → string or unicode

Decodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeDecodeError. Other possible values are ‘ignore’ and ‘replace’ as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors.

define(pos=None)[source]

Return a list of definitions for this word. Each definition corresponds to a synset for this word.

Parameters:pos – A part-of-speech tag to filter upon. If None, definitions for all parts of speech will be loaded.
Return type:List of strings

New in version 0.7.0: (textblob)

definitions

The list of definitions for this word. Each definition corresponds to a synset.

New in version 0.7.0: (textblob)

detect_language()[source]

Detect the word’s language using Google’s Translate API.

New in version 0.5.0: (textblob)

encode([encoding[, errors]]) → string or unicode

Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.

endswith(suffix[, start[, end]]) → bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs([tabsize]) → unicode

Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → unicode

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

get_synsets(pos=None)[source]

Return a list of Synset objects for this word.

Parameters:pos – A part-of-speech tag to filter upon. If None, all synsets for all parts of speech will be loaded.
Return type:list of Synsets

New in version 0.7.0: (textblob)

index(sub[, start[, end]]) → int

Like S.find() but raise ValueError when the substring is not found.

isalnum() → bool

Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.

isalpha() → bool

Return True if all characters in S are alphabetic and there is at least one character in S, False otherwise.

isdecimal() → bool

Return True if there are only decimal characters in S, False otherwise.

isdigit() → bool

Return True if all characters in S are digits and there is at least one character in S, False otherwise.

islower() → bool

Return True if all cased characters in S are lowercase and there is at least one cased character in S, False otherwise.

isnumeric() → bool

Return True if there are only numeric characters in S, False otherwise.

isspace() → bool

Return True if all characters in S are whitespace and there is at least one character in S, False otherwise.

istitle() → bool

Return True if S is a titlecased string and there is at least one character in S, i.e. upper- and titlecase characters may only follow uncased characters and lowercase characters only cased ones. Return False otherwise.

isupper() → bool

Return True if all cased characters in S are uppercase and there is at least one cased character in S, False otherwise.

join(iterable) → unicode

Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.

lemma

Return the lemma of this word using Wordnet’s morphy function.

lemmatize(**kwargs)[source]

Return the lemma for a word using WordNet’s morphy function.

Parameters:pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.

New in version 0.8.1: (textblob)

ljust(width[, fillchar]) → int

Return S left-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).

lower() → unicode

Return a copy of the string S converted to lowercase.

lstrip([chars]) → unicode

Return a copy of the string S with leading whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping

partition(sep) -> (head, sep, tail)

Search for the separator sep in S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return S and two empty strings.

pluralize()[source]

Return the plural version of the word as a string.

replace(old, new[, count]) → unicode

Return a copy of S with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int

Like S.rfind() but raise ValueError when the substring is not found.

rjust(width[, fillchar]) → unicode

Return S right-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).

rpartition(sep) -> (head, sep, tail)

Search for the separator sep in S, starting at the end of S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return two empty strings and S.

rsplit([sep[, maxsplit]]) → list of strings

Return a list of the words in S, using sep as the delimiter string, starting at the end of the string and working to the front. If maxsplit is given, at most maxsplit splits are done. If sep is not specified, any whitespace string is a separator.

rstrip([chars]) → unicode

Return a copy of the string S with trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping

singularize()[source]

Return the singular version of the word as a string.

spellcheck()[source]

Return a list of (word, confidence) tuples of spelling corrections.

Based on: Peter Norvig, “How to Write a Spelling Corrector” (http://norvig.com/spell-correct.html) as implemented in the pattern library.

New in version 0.6.0: (textblob)
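The Norvig approach behind spellcheck() can be sketched in a few lines: generate edit-distance-1 candidates and rank them by corpus frequency. The frequency table, the function names, and the alphabet below are toy stand-ins, not the library's implementation (which also handles transpositions and larger edit distances).

```python
# Toy Norvig-style corrector: edit-distance-1 candidates, ranked by a
# (made-up) frequency table. Transpositions are omitted for brevity.
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyzäöüß"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + replaces + inserts)

freq = {"haus": 50, "maus": 10}  # toy corpus counts

def spellcheck(word):
    """Return (candidate, confidence) tuples, best first."""
    candidates = [w for w in edits1(word) if w in freq] or [word]
    total = sum(freq.get(w, 1) for w in candidates)
    return sorted(((w, freq.get(w, 1) / total) for w in candidates),
                  key=lambda t: -t[1])
```

If no known candidate exists, the word itself is returned with confidence 1.0, mirroring the "best guess" behaviour described above.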

split([sep[, maxsplit]]) → list of strings

Return a list of the words in S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.

splitlines(keepends=False) → list of strings

Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

startswith(prefix[, start[, end]]) → bool

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

strip([chars]) → unicode

Return a copy of the string S with leading and trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping

swapcase() → unicode

Return a copy of S with uppercase characters converted to lowercase and vice versa.

synsets

The list of Synset objects for this Word.

Return type:list of Synsets

New in version 0.7.0: (textblob)

title() → unicode

Return a titlecased version of S, i.e. words start with title case characters, all remaining cased characters have lower case.

translate(from_lang=None, to='de')[source]

Translate the word to another language using Google’s Translate API.

New in version 0.5.0: (textblob)

upper() → unicode

Return a copy of S converted to uppercase.

zfill(width) → unicode

Pad a numeric string S with zeros on the left, to fill a field of the specified width. The string S is never truncated.

class textblob_de.blob.WordList(collection)[source]

A list-like collection of words.

append(obj)[source]

Append an object to the end of the list. If the object is a string, appends a Word object.

count(strg, case_sensitive=False, *args, **kwargs)[source]

Get the count of a word or phrase strg within this WordList.

Parameters:
  • strg – The string to count.
  • case_sensitive – A boolean, whether or not the search is case-sensitive.
extend(iterable)[source]

Extend WordList by appending elements from iterable.

If an element is a string, appends a Word object.

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

insert()

L.insert(index, object) – insert object before index

lemmatize()[source]

Return the lemma of each word in this WordList.

Currently using NLTKPunktTokenizer() for all lemmatization tasks. This might cause slightly different tokenization results compared to the TextBlob.words property.

lower()[source]

Return a new WordList with each word lower-cased.

pluralize()[source]

Return the plural version of each word in this WordList.

pop([index]) → item -- remove and return item at index (default last).

Raises IndexError if list is empty or index is out of range.

remove()

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse()

L.reverse() – reverse IN PLACE

singularize()[source]

Return the singular version of each word in this WordList.

sort()

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

upper()[source]

Return a new WordList with each word upper-cased.
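The core idea of WordList — a list subclass that wraps appended strings in Word objects so word methods stay available — can be sketched in plain Python. The classes below are simplified stand-ins, and the naive pluralize is purely illustrative (real German pluralization lives in pattern.de):

```python
class Word(str):
    """Toy stand-in for textblob_de.blob.Word."""
    def pluralize(self):
        # Naive illustration only; not real German morphology.
        return Word(self + "e")

class WordList(list):
    """Toy stand-in: a list that coerces strings to Word objects."""
    def __init__(self, collection):
        super().__init__(Word(w) if isinstance(w, str) else w
                         for w in collection)

    def append(self, obj):
        super().append(Word(obj) if isinstance(obj, str) else obj)

    def lower(self):
        return WordList(w.lower() for w in self)

wl = WordList(["Einfach", "ist"])
wl.append("besser")  # stored as a Word, not a plain str
```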

Base Classes

Extensions to Abstract base classes in textblob.base

class textblob_de.base.BaseLemmatizer[source]

Abstract base class from which all lemmatizer classes inherit. Descendant classes must implement a lemmatize(text) method that returns a WordList of Word objects with updated lemma properties.

New in version 0.2.3: (textblob_de)

lemmatize(text)[source]

Return a list of (lemma, tag) tuples.

Tokenizers

Various tokenizer implementations.

class textblob_de.tokenizers.NLTKPunktTokenizer[source]

Tokenizer included in nltk.tokenize.punkt package.

This is the default tokenizer in textblob-de.

PROs:

  • trained model available for German
  • deals with many abbreviations and common German tokenization problems out of the box

CONs:

  • not very flexible (model has to be re-trained on your own corpus)
itokenize(text, *args, **kwargs)

Return a generator that generates tokens “on-demand”.

New in version 0.6.0.

Return type:generator
sent_tokenize(**kwargs)[source]

NLTK’s sentence tokenizer (currently PunktSentenceTokenizer).

Uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences, then uses that to find sentence boundaries.

tokenize(text, include_punc=True, nested=False)[source]

Return a list of word tokens.

Parameters:
  • text – string of text.
  • include_punc – (optional) whether to include punctuation as separate tokens. Defaults to True.
  • nested – (optional) whether to return tokens as nested lists of sentences. Defaults to False.
word_tokenize(text, include_punc=True)[source]

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.

It assumes that the text has already been segmented into sentences, e.g. using self.sent_tokenize().

This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll
  • treat most punctuation characters as separate tokens
  • split off commas and single quotes, when followed by whitespace
  • separate periods that appear at the end of line

Source: NLTK’s docstring of TreebankWordTokenizer (accessed: 02/10/2014)
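The steps above can be approximated with a few regular expressions (a simplified sketch; NLTK's real TreebankWordTokenizer applies a longer rule cascade):

```python
import re

def treebank_like(text):
    # Simplified sketch of the Treebank rules listed above.
    text = re.sub(r"n't\b", " n't", text)                 # don't -> do n't
    text = re.sub(r"'(ll|re|ve|s|d|m)\b", r" '\1", text)  # they'll -> they 'll
    text = re.sub(r"([,;:?!])", r" \1 ", text)            # punctuation as tokens
    text = re.sub(r"\.\s*$", " .", text)                  # split off final period
    return text.split()

print(treebank_like("They'll go, but don't wait."))
# ['They', "'ll", 'go', ',', 'but', 'do', "n't", 'wait', '.']
```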

class textblob_de.tokenizers.PatternTokenizer[source]

Tokenizer included in pattern.de package.

PROs:

  • handling of emoticons
  • flexible implementations of abbreviations
  • can be adapted very easily

CONs:

  • ordinal numbers cause sentence breaks
  • indices of Sentence() objects cannot be computed
itokenize(text, *args, **kwargs)

Return a generator that generates tokens “on-demand”.

New in version 0.6.0.

Return type:generator
sent_tokenize(text, **kwargs)[source]

Returns a list of sentences.

Each sentence is a space-separated string of tokens (words). Handles common cases of abbreviations (e.g., etc., …). Punctuation marks are split from other words. Periods (or ? / !) mark the end of a sentence. For headings without a terminating period, sentence boundaries are inferred from line breaks.
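The boundary rules described above can be sketched naively (an illustration with an assumed abbreviation list, not the pattern implementation):

```python
# Known abbreviations that should not end a sentence (toy list).
ABBREVIATIONS = {"z.B.", "usw.", "etc.", "Dr."}

def naive_sent_tokenize(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A period, ? or ! ends the sentence, unless the token is a
        # known abbreviation.
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive_sent_tokenize("Dr. Meier kommt. Er bringt CDs usw. mit."))
# ['Dr. Meier kommt.', 'Er bringt CDs usw. mit.']
```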

tokenize(text, include_punc=True, nested=False)[source]

Return a list of word tokens.

Parameters:
  • text – string of text.
  • include_punc – (optional) whether to include punctuation as separate tokens. Defaults to True.
class textblob_de.tokenizers.SentenceTokenizer(tokenizer=None, *args, **kwargs)[source]

Generic sentence tokenization class, using tokenizer specified in TextBlobDE() instance.

Enables the SentenceTokenizer().itokenize generator, which would otherwise be lost.

Aim: Not to break core API of the main TextBlob library.

Parameters:tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
itokenize(text, *args, **kwargs)

Return a generator that generates tokens “on-demand”.

New in version 0.6.0.

Return type:generator
sent_tokenize(text, **kwargs)[source]

Compatibility method for the tokenizers included in textblob-de.

tokenize(text, **kwargs)[source]

Return a list of word tokens.

Parameters:
  • text – string of text.
  • include_punc – (optional) whether to include punctuation as separate tokens. Defaults to True.
class textblob_de.tokenizers.WordTokenizer(tokenizer=None, *args, **kwargs)[source]

Generic word tokenization class, using tokenizer specified in TextBlobDE() instance.

You can also submit the tokenizer as keyword argument: WordTokenizer(tokenizer=NLTKPunktTokenizer())

Enables the WordTokenizer().itokenize generator, which would otherwise be lost.

Default: NLTKPunktTokenizer().word_tokenize(text, include_punc=True)

Aim: Not to break core API of the main TextBlob library.

Parameters:tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
itokenize(text, *args, **kwargs)

Return a generator that generates tokens “on-demand”.

New in version 0.6.0.

Return type:generator
tokenize(text, include_punc=True, **kwargs)[source]

Return a list of word tokens.

Parameters:
  • text – string of text.
  • include_punc – (optional) whether to include punctuation as separate tokens. Defaults to True.
word_tokenize(text, include_punc=True)[source]

Compatibility method for the tokenizers included in textblob-de.

textblob_de.tokenizers.sent_tokenize(text, tokenizer=None)[source]

Convenience function for tokenizing sentences.

If tokenizer is not specified, the default tokenizer NLTKPunktTokenizer() is used (same behaviour as in the main TextBlob library).

This function returns the sentences as a generator object.

textblob_de.tokenizers.word_tokenize(text, tokenizer=None, include_punc=True, *args, **kwargs)[source]

Convenience function for tokenizing text into words.

NOTE: NLTK’s word tokenizer expects sentences as input, so the text will be tokenized to sentences before being tokenized to words.

This function returns an itertools chain object (generator).
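The two-stage, lazy tokenization described above can be sketched with itertools.chain (the helper tokenizers here are naive stand-ins, not the NLTK implementations):

```python
from itertools import chain

def split_sentences(text):
    # Naive stand-in for sentence tokenization.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def split_words(sentence):
    # Naive stand-in for word tokenization; keeps the period as a token.
    return sentence.replace(".", " .").split()

def lazy_word_tokenize(text):
    # One lazy iterator over the words of all sentences.
    return chain.from_iterable(split_words(s) for s in split_sentences(text))

print(list(lazy_word_tokenize("Einfach ist besser. Als kompliziert.")))
# ['Einfach', 'ist', 'besser', '.', 'Als', 'kompliziert', '.']
```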

POS Taggers

Default taggers for German.

>>> from textblob_de.taggers import PatternTagger

or

>>> from textblob_de import PatternTagger
class textblob_de.taggers.PatternTagger(tokenizer=None, include_punc=False, encoding='utf-8', tagset=None)[source]

Tagger that uses the implementation in Tom de Smedt’s pattern library (http://www.clips.ua.ac.be/pattern).

Parameters:
  • tokenizer – (optional) A tokenizer instance. If None, defaults to PatternTokenizer().
  • include_punc – (optional) whether to include punctuation as separate tokens. Defaults to False.
  • encoding – (optional) Input string encoding. (Default utf-8)
  • tagset – (optional) The tagset to use: 'penn' (Penn Treebank II, default), 'universal', or 'stts'.
tag(sentence, tokenize=True)[source]

Tag a string sentence.

Parameters:
  • sentence (str or list) – A string or a list of sentence strings.
  • tokenize – (optional) If False, the string must already be tokenized (space-separated tokens).

Noun Phrase Extractors

Various noun phrase extractor implementations.


class textblob_de.np_extractors.PatternParserNPExtractor(tokenizer=None)[source]

Extract noun phrases (NP) from PatternParser() output.

A very naïve and resource-hungry approach:

  • get parser output
  • try to correct as many obvious parser errors as you can (e.g. eliminate wrongly tagged verbs)
  • filter insignificant words
Parameters:tokenizer – (optional) A tokenizer instance. If None, defaults to PatternTokenizer().
extract(text)[source]

Return a list of noun phrases (strings) for a body of text.

Parameters:text (str) – A string.

Sentiment Analyzers

German sentiment analysis implementations.

The main resource is de-sentiment.xml.

class textblob_de.sentiments.PatternAnalyzer(tokenizer=None, lemmatizer=None, lemmatize=True)[source]

Sentiment analyzer that uses the same implementation as the pattern library. Returns results as a tuple of the form:

(polarity, subjectivity)

RETURN_TYPE

Return type declaration

alias of Sentiment

analyze(text)[source]

Return the sentiment as a tuple of the form: (polarity, subjectivity)

Parameters:text (str) – A string.
kind = 'co'

Enhancement Issue #2 adapted from ‘textblob.en.sentiments.py’

class textblob_de.sentiments.Sentiment(path=u'', language=None, synset=None, confidence=None, **kwargs)[source]
annotate(word, pos=None, polarity=0.0, subjectivity=0.0, intensity=1.0, label=None)[source]

Annotates the given word with polarity, subjectivity and intensity scores, and optionally a semantic label (e.g., MOOD for emoticons, IRONY for “(!)”).

assessments(words=[], negation=True)[source]

Returns a list of (chunk, polarity, subjectivity, label)-tuples for the given list of words, where chunk is a list of successive words: a known word optionally preceded by a modifier (“very good”) or a negation (“not good”).
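A stripped-down sketch of that chunking logic, with a tiny made-up lexicon in place of de-sentiment.xml (the -0.5 negation factor follows pattern's convention; everything else is illustrative):

```python
LEXICON = {"gut": 0.8, "schlecht": -0.8}   # toy polarity scores
MODIFIERS = {"sehr": 1.5}                  # toy intensity factors
NEGATIONS = {"nicht"}

def toy_assessments(words):
    results, intensity, negated, chunk = [], 1.0, False, []
    for w in words:
        lw = w.lower()
        if lw in NEGATIONS:
            negated, chunk = True, [w]
        elif lw in MODIFIERS:
            intensity = MODIFIERS[lw]
            chunk.append(w)
        elif lw in LEXICON:
            polarity = LEXICON[lw] * intensity
            if negated:
                polarity *= -0.5           # negation flips and dampens
            results.append((chunk + [w], round(polarity, 2)))
            intensity, negated, chunk = 1.0, False, []
    return results

print(toy_assessments("Das ist nicht sehr gut".split()))
# [(['nicht', 'sehr', 'gut'], -0.6)]
```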

clear() → None. Remove all items from D.
copy() → a shallow copy of D
fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a

2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
synset(id, pos=u'JJ')[source]

Returns a (polarity, subjectivity)-tuple for the given synset id. For example, the adjective “horrible” has id 193480 in WordNet: Sentiment.synset(193480, pos="JJ") => (-0.6, 1.0).

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values

Parsers

Default parsers for German.

>>> from textblob_de.parsers import PatternParser

or

>>> from textblob_de import PatternParser
class textblob_de.parsers.PatternParser(tokenizer=None, tokenize=True, pprint=False, tags=True, chunks=True, relations=False, lemmata=False, encoding='utf-8', tagset=None)[source]

Parser that uses the implementation in Tom de Smedt’s pattern library. http://www.clips.ua.ac.be/pages/pattern-de#parser

Parameters:
  • tokenizer – (optional) A tokenizer instance. If None, defaults to PatternTokenizer().
  • tokenize – (optional) Split punctuation marks from words? (Default True)
  • pprint – (optional) Use pattern’s pprint function to display parse trees (Default False)
  • tags – (optional) Parse part-of-speech tags? (NN, JJ, …) (Default True)
  • chunks – (optional) Parse chunks? (NP, VP, PNP, …) (Default True)
  • relations – (optional) Parse chunk relations? (-SBJ, -OBJ, …) (Default False)
  • lemmata – (optional) Parse lemmata? (schönes => schön) (Default False)
  • encoding – (optional) Input string encoding. (Default utf-8)
  • tagset – (optional) The tagset to use: 'penn' (Penn Treebank II, default), 'universal', or 'stts'.
parse(text)[source]

Parses the text.

Keyword arguments for pattern.de.parse() can be passed to the parser instance; they are documented in the main docstring of PatternParser().

Parameters:text (str) – A string.
parsetree(text)[source]

Returns a parsed pattern Text object from the given string.

Classifiers (from TextBlob main package)

Various classifier implementations. Also includes basic feature extractor methods.

Example Usage:

>>> from textblob import TextBlob
>>> from textblob.classifiers import NaiveBayesClassifier
>>> train = [
...     ('I love this sandwich.', 'pos'),
...     ('This is an amazing place!', 'pos'),
...     ('I feel very good about these beers.', 'pos'),
...     ('I do not like this restaurant', 'neg'),
...     ('I am tired of this stuff.', 'neg'),
...     ("I can't deal with this", 'neg'),
...     ("My boss is horrible.", "neg")
... ]
>>> cl = NaiveBayesClassifier(train)
>>> cl.classify("I feel amazing!")
'pos'
>>> blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
>>> for s in blob.sentences:
...     print(s)
...     print(s.classify())
...
The beer is good.
pos
But the hangover is horrible.
neg

New in version 0.6.0.

class textblob.classifiers.BaseClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]

Abstract classifier class from which all classifiers inherit. At a minimum, descendant classes must implement a classify method and have a classifier property.

Parameters:
  • train_set – The training set, either a list of tuples of the form (text, classification) or a file-like object. text may be either a string or an iterable.
  • feature_extractor (callable) – A feature extractor function that takes one or two arguments: document and train_set.
  • format (str) – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
  • kwargs – Additional keyword arguments are passed to the constructor of the Format class used to read the data. Only applies when a file-like object is passed as train_set.

New in version 0.6.0.

classifier

The classifier object.

classify(text)[source]

Classifies a string of text.

extract_features(text)[source]

Extracts features from a body of text.

Return type:dictionary of features
labels()[source]

Returns an iterable containing the possible labels.

train(labeled_featureset)[source]

Trains the classifier.

class textblob.classifiers.DecisionTreeClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]

A classifier based on the decision tree algorithm, as implemented in NLTK.

Parameters:
  • train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
  • feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
  • format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

New in version 0.6.2.

accuracy(test_set, format=None)

Compute the accuracy on a test set.

Parameters:
  • test_set – A list of tuples of the form (text, label), or a file pointer.
  • format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
classifier

The classifier.

classify(text)

Classifies the text.

Parameters:text (str) – A string of text.
extract_features(text)

Extracts features from a body of text.

Return type:dictionary of features
labels()

Return an iterable of possible labels.

nltk_class

alias of nltk.classify.decisiontree.DecisionTreeClassifier

pprint(*args, **kwargs)

Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.

Return type:str
pretty_format(*args, **kwargs)[source]

Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.

Return type:str
pseudocode(*args, **kwargs)[source]

Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.

Return type:str
train(*args, **kwargs)

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

New in version 0.6.2.

Return type:A classifier
update(new_data, *args, **kwargs)

Update the classifier with new training data and re-train it.

Parameters:new_data – New data as a list of tuples of the form (text, label).
class textblob.classifiers.MaxEntClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                          dotprod(weights, encode(fs,label))
prob(fs|label) = ---------------------------------------------------
                 sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
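The equation above transcribes directly into Python; the toy encoding and weights below are assumptions for illustration only:

```python
def dotprod(a, b):
    return sum(x * y for x, y in zip(a, b))

def prob(weights, encode, fs, label, labels):
    # Direct transcription of the equation above.
    score = dotprod(weights, encode(fs, label))
    total = sum(dotprod(weights, encode(fs, l)) for l in labels)
    return score / total

# Toy two-feature encoding: one slot per label (assumption).
def encode(fs, label):
    return [fs["good"], 0.0] if label == "pos" else [0.0, fs["good"]]

weights = [0.5, 1.0]
p = prob(weights, encode, {"good": 1.0}, "pos", ["pos", "neg"])
print(round(p, 3))  # 0.333
```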
accuracy(test_set, format=None)

Compute the accuracy on a test set.

Parameters:
  • test_set – A list of tuples of the form (text, label), or a file pointer.
  • format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
classifier

The classifier.

classify(text)

Classifies the text.

Parameters:text (str) – A string of text.
extract_features(text)

Extracts features from a body of text.

Return type:dictionary of features
labels()

Return an iterable of possible labels.

nltk_class

alias of nltk.classify.maxent.MaxentClassifier

prob_classify(text)[source]

Return the label probability distribution for classifying a string of text.

Example:

>>> classifier = MaxEntClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7
Return type:nltk.probability.DictionaryProbDist
train(*args, **kwargs)

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

New in version 0.6.2.

Return type:A classifier
update(new_data, *args, **kwargs)

Update the classifier with new training data and re-train it.

Parameters:new_data – New data as a list of tuples of the form (text, label).
class textblob.classifiers.NLTKClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]

An abstract class that wraps around the nltk.classify module.

Expects that descendant classes include a class variable nltk_class which is the class in the nltk.classify module to be wrapped.

Example:

class MyClassifier(NLTKClassifier):
    nltk_class = nltk.classify.svm.SvmClassifier
accuracy(test_set, format=None)[source]

Compute the accuracy on a test set.

Parameters:
  • test_set – A list of tuples of the form (text, label), or a file pointer.
  • format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
classifier

The classifier.

classify(text)[source]

Classifies the text.

Parameters:text (str) – A string of text.
extract_features(text)

Extracts features from a body of text.

Return type:dictionary of features
labels()[source]

Return an iterable of possible labels.

nltk_class = None

The NLTK class to be wrapped. Must be a class within nltk.classify

train(*args, **kwargs)[source]

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

New in version 0.6.2.

Return type:A classifier
update(new_data, *args, **kwargs)[source]

Update the classifier with new training data and re-train it.

Parameters:new_data – New data as a list of tuples of the form (text, label).
class textblob.classifiers.NaiveBayesClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]

A classifier based on the Naive Bayes algorithm, as implemented in NLTK.

Parameters:
  • train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
  • feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
  • format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

New in version 0.6.0.

accuracy(test_set, format=None)

Compute the accuracy on a test set.

Parameters:
  • test_set – A list of tuples of the form (text, label), or a file pointer.
  • format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
classifier

The classifier.

classify(text)

Classifies the text.

Parameters:text (str) – A string of text.
extract_features(text)

Extracts features from a body of text.

Return type:dictionary of features
informative_features(*args, **kwargs)[source]

Return the most informative features as a list of tuples of the form (feature_name, feature_value).

Return type:list
labels()

Return an iterable of possible labels.

nltk_class

alias of nltk.classify.naivebayes.NaiveBayesClassifier

prob_classify(text)[source]

Return the label probability distribution for classifying a string of text.

Example:

>>> classifier = NaiveBayesClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7
Return type:nltk.probability.DictionaryProbDist
show_informative_features(*args, **kwargs)[source]

Displays a listing of the most informative features for this classifier.

Return type:None
train(*args, **kwargs)

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

New in version 0.6.2.

Return type:A classifier
update(new_data, *args, **kwargs)

Update the classifier with new training data and re-train it.

Parameters:new_data – New data as a list of tuples of the form (text, label).
class textblob.classifiers.PositiveNaiveBayesClassifier(positive_set, unlabeled_set, feature_extractor=<function contains_extractor>, positive_prob_prior=0.5, **kwargs)[source]

A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets, i.e. when only one class is labeled and the other is not. Assuming a prior distribution on the two labels, uses the unlabeled set to estimate the frequencies of the features.

Example usage:

>>> from textblob.classifiers import PositiveNaiveBayesClassifier
>>> sports_sentences = ['The team dominated the game',
...                   'They lost the ball',
...                   'The game was intense',
...                   'The goalkeeper catched the ball',
...                   'The other team controlled the ball']
>>> various_sentences = ['The President did not comment',
...                        'I lost the keys',
...                        'The team won the game',
...                        'Sara has two kids',
...                        'The ball went off the court',
...                        'They had the ball for the whole game',
...                        'The show is over']
>>> classifier = PositiveNaiveBayesClassifier(positive_set=sports_sentences,
...                                           unlabeled_set=various_sentences)
>>> classifier.classify("My team lost the game")
True
>>> classifier.classify("And now for something completely different.")
False
Parameters:
  • positive_set – A collection of strings that have the positive label.
  • unlabeled_set – A collection of unlabeled strings.
  • feature_extractor – A feature extractor function.
  • positive_prob_prior – A prior estimate of the probability of the label True.

New in version 0.7.0.

accuracy(test_set, format=None)

Compute the accuracy on a test set.

Parameters:
  • test_set – A list of tuples of the form (text, label), or a file pointer.
  • format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
classifier

The classifier.

classify(text)

Classifies the text.

Parameters:text (str) – A string of text.
extract_features(text)

Extracts features from a body of text.

Return type:dictionary of features
labels()

Return an iterable of possible labels.

train(*args, **kwargs)[source]

Train the classifier with labeled and unlabeled feature sets and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Return type:A classifier
update(new_positive_data=None, new_unlabeled_data=None, positive_prob_prior=0.5, *args, **kwargs)[source]

Update the classifier with new data and re-train it.

Parameters:
  • new_positive_data – List of new, labeled strings.
  • new_unlabeled_data – List of new, unlabeled strings.
textblob.classifiers.basic_extractor(document, train_set)[source]

A basic document feature extractor that returns a dict indicating what words in train_set are contained in document.

Parameters:
  • document – The text to extract features from. Can be a string or an iterable.
  • train_set (list) – Training data set, a list of tuples of the form (words, label) OR an iterable of strings.
textblob.classifiers.contains_extractor(document)[source]

A basic document feature extractor that returns a dict of words that the document contains.
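Both extractors can be sketched in a few lines of plain Python (simplified: the real versions also accept iterables of words, and this is an illustration rather than the textblob source):

```python
def basic_extractor(document, train_set):
    # Features indicate which training-set words occur in the document.
    word_features = {w for text, _label in train_set for w in text.split()}
    tokens = set(document.split())
    return {"contains({0})".format(w): (w in tokens) for w in word_features}

def contains_extractor(document):
    # Features only for the words the document actually contains.
    return {"contains({0})".format(w): True for w in set(document.split())}

train = [("I love it", "pos"), ("I hate it", "neg")]
features = basic_extractor("I love pizza", train)
print(features["contains(love)"], features["contains(hate)"])  # True False
```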

Blobber

class textblob_de.blob.BlobberDE(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]

A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.

Usage:

>>> from textblob_de import BlobberDE
>>> from textblob_de.taggers import PatternTagger
>>> from textblob_de.tokenizers import PatternTokenizer
>>> tb = BlobberDE(pos_tagger=PatternTagger(), tokenizer=PatternTokenizer())
>>> blob1 = tb("Das ist ein Blob.")
>>> blob2 = tb("Dieser Blob benutzt die selben Tagger und Tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True
Parameters:
  • tokenizer – (optional) A tokenizer instance. If None, defaults to NLTKPunktTokenizer().
  • np_extractor – (optional) An NPExtractor instance. If None, defaults to PatternParserNPExtractor().
  • pos_tagger – (optional) A Tagger instance. If None, defaults to PatternTagger.
  • analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
  • classifier – (optional) A classifier.

New in version 0.4.0: (textblob)

__call__(text)[source]

Return a new TextBlob object with this Blobber’s np_extractor, pos_tagger, tokenizer, analyzer, and classifier.

Returns:A new TextBlob.
__init__(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

__repr__()[source]

x.__repr__() <==> repr(x)

__str__()

x.__repr__() <==> repr(x)

File Formats (from TextBlob main package)

File formats for training and testing data.

Includes a registry of valid file formats. New file formats can be added to the registry like so:

from textblob import formats

class PipeDelimitedFormat(formats.DelimitedFormat):
    delimiter = '|'

formats.register('psv', PipeDelimitedFormat)

Once a format has been registered, classifiers will be able to read data files with that format.

from textblob.classifiers import NaiveBayesClassifier

with open('training_data.psv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format='psv')
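Internally, such a registry can be as simple as an ordered mapping from name to class. A minimal sketch (the names and classes here are assumptions, not textblob.formats internals):

```python
from collections import OrderedDict

_registry = OrderedDict()

def register(name, format_class):
    # Later registrations with the same name overwrite earlier ones.
    _registry[name] = format_class

def get_registry():
    return _registry

class CSVFormat:
    delimiter = ","

class PSVFormat(CSVFormat):
    delimiter = "|"

register("csv", CSVFormat)
register("psv", PSVFormat)
print(list(get_registry()))  # ['csv', 'psv']
```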
class textblob.formats.BaseFormat(fp, **kwargs)[source]

Interface for format classes. Individual formats can decide on the composition and meaning of **kwargs.

Parameters:fp (File) – A file-like object.

Changed in version 0.9.0: Constructor receives a file pointer rather than a file path.

classmethod detect(stream)[source]

Detect the file format of a stream. Return True if the stream is in this file format.

Changed in version 0.9.0: Changed from a static method to a class method.

to_iterable()[source]

Return an iterable object from the data.

class textblob.formats.CSV(fp, **kwargs)[source]

CSV format. Assumes each row is of the form text,label.

Today is a good day,pos
I hate this car.,neg
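Reading rows of that shape into (text, label) tuples, roughly what to_iterable() yields (a simplified sketch using the standard-library csv module, not the textblob source):

```python
import csv
import io

def csv_to_iterable(fp):
    # Each row becomes a (text, label) tuple.
    return [(row[0], row[1]) for row in csv.reader(fp)]

data = io.StringIO("Today is a good day,pos\nI hate this car.,neg\n")
print(csv_to_iterable(data))
# [('Today is a good day', 'pos'), ('I hate this car.', 'neg')]
```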
classmethod detect(stream)

Return True if stream is valid.

to_iterable()

Return an iterable object from the data.

class textblob.formats.DelimitedFormat(fp, **kwargs)[source]

A general character-delimited format.

classmethod detect(stream)[source]

Return True if stream is valid.

to_iterable()[source]

Return an iterable object from the data.

class textblob.formats.JSON(fp, **kwargs)[source]

JSON format.

Assumes that JSON is formatted as an array of objects with text and label properties.

[
    {"text": "Today is a good day.", "label": "pos"},
    {"text": "I hate this car.", "label": "neg"}
]
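Converting that JSON shape into (text, label) tuples, roughly what to_iterable() yields (a simplified sketch, not the textblob source):

```python
import io
import json

def json_to_iterable(fp):
    # Each object becomes a (text, label) tuple.
    return [(d["text"], d["label"]) for d in json.load(fp)]

data = io.StringIO('[{"text": "Today is a good day.", "label": "pos"}]')
print(json_to_iterable(data))  # [('Today is a good day.', 'pos')]
```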
classmethod detect(stream)[source]

Return True if stream is valid JSON.

to_iterable()[source]

Return an iterable object from the JSON data.

class textblob.formats.TSV(fp, **kwargs)[source]

TSV format. Assumes each row is of the form text      label.

classmethod detect(stream)

Return True if stream is valid.

to_iterable()

Return an iterable object from the data.

textblob.formats.detect(fp, max_read=1024)[source]

Attempt to detect a file’s format, trying each of the supported formats. Return the format class that was detected. If no format is detected, return None.

textblob.formats.get_registry()[source]

Return a dictionary of registered formats.

textblob.formats.register(name, format_class)[source]

Register a new format.

Parameters:
  • name (str) – The name that will be used to refer to the format, e.g. ‘csv’
  • format_class (type) – The format class to register.

Exceptions (from TextBlob main package)

textblob.exceptions.MissingCorpusException

alias of textblob.exceptions.MissingCorpusError