Python API Reference

Sentence Splitter

class icu_tokenizer.sent_splitter.SentSplitter(lang: str = 'en')

ICU sentence splitter.

Usage:

>>> splitter = SentSplitter(lang)
>>> sents: List[str] = splitter.split(paragraph)
__init__(lang: str = 'en')

Initialize the sentence splitter for the given language.

Parameters: lang (str, optional) – Language identifier. Defaults to 'en'.

split(text: str) → List[str]

Split text into sentences with the ICU sentence splitter.

Parameters: text (str) – Input text.
Returns: List of sentences.
Return type: List[str]
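
A slightly fuller, runnable sketch (assuming the icu_tokenizer package is installed; the sample paragraph is illustrative, and the exact splits follow ICU's sentence-break rules for the chosen locale):

>>> from icu_tokenizer.sent_splitter import SentSplitter
>>> splitter = SentSplitter(lang='en')  # any ICU locale identifier should work here
>>> paragraph = "Mr. Smith went to Washington. He arrived at 5 p.m. and left at 6."
>>> for sent in splitter.split(paragraph):
...     print(sent)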

Normalizer

class icu_tokenizer.normalizer.Normalizer(lang: str = 'en', norm_puncts: bool = False)

Normalizer based on Unicode character information.

Does the following:

  • Ensure NFKC format

  • Handle pseudo-spaces (for numbers)

  • Normalize by Unicode categories (https://www.fileformat.info/info/unicode/category/index.htm); see the snippet after this list:

    • [C*|So|Z*] → ' '
    • [Pc] → _
    • [Pd] → -
    • [Pf|Pi] → " (except for ')
    • [Ps] → ( (except for {, [)
    • [Pe] → ) (except for }, ])
  • Normalize Nd (Numbers)

  • Account for some outliers

  • Remove non-printable characters

  • Normalize whitespace characters

  • Perform language-specific normalization
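
The bracketed codes above are standard Unicode general categories. A minimal sketch using only the standard library's unicodedata module (independent of icu_tokenizer itself) shows which characters the rules target:

>>> import unicodedata
>>> unicodedata.category('\u2013')  # EN DASH: a Pd character, normalized to '-'
'Pd'
>>> unicodedata.category('\u201c')  # LEFT DOUBLE QUOTATION MARK: Pi, normalized to '"'
'Pi'
>>> unicodedata.category('\uff08')  # FULLWIDTH LEFT PARENTHESIS: Ps, normalized to '('
'Ps'
>>> unicodedata.category('\xa0')  # NO-BREAK SPACE: Zs, normalized to ' '
'Zs'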

Usage:

>>> normalizer = Normalizer(lang, norm_puncts=True)
>>> norm_text: str = normalizer.normalize(text)
__init__(lang: str = 'en', norm_puncts: bool = False)

Initialize the normalizer.

Parameters:
  • lang (str, optional) – Language identifier. Defaults to 'en'.
  • norm_puncts (bool, optional) – Normalize punctuation? Defaults to False.
normalize(text: str) → str

Perform normalization.

Parameters: text (str) – Input text.
Returns: Normalized text.
Return type: str
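
A minimal end-to-end sketch, assuming the icu_tokenizer package is installed (the input string is illustrative; per the category rules above, the curly quotes, en dash, and no-break space should come out as ", -, and a plain space):

>>> from icu_tokenizer.normalizer import Normalizer
>>> normalizer = Normalizer(lang='en', norm_puncts=True)
>>> text = '\u201cNa\u00efve\u201d caf\u00e9 \u2013 100\xa0km'
>>> norm_text: str = normalizer.normalize(text)
>>> print(norm_text)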

Tokenizer

class icu_tokenizer.tokenizer.Tokenizer(lang: str = 'en', annotate_hyphens: bool = False, protect_emails_urls: bool = False, extra_protected_patterns: List[Union[str, re.Pattern]] = [])

ICU based tokenizer with additional functionality to protect sequences.

Usage:

>>> tokenizer = Tokenizer(
...     lang='en',
...     annotate_hyphens=False,
...     protect_emails_urls=False,
...     extra_protected_patterns=[],
... )
>>> tokens: List[str] = tokenizer.tokenize(text)
__init__(lang: str = 'en', annotate_hyphens: bool = False, protect_emails_urls: bool = False, extra_protected_patterns: List[Union[str, re.Pattern]] = [])

Initialize the tokenizer.

Parameters:
  • lang (str, optional) – Language identifier. Defaults to 'en'.
  • annotate_hyphens (bool, optional) – Annotate hyphens? Defaults to False.
  • protect_emails_urls (bool, optional) – Protect emails and URLs from tokenization? Defaults to False.
  • extra_protected_patterns (List[Union[str, re.Pattern]], optional) – A list of regex patterns to protect. Defaults to [].
tokenize(text: str) → List[str]

Tokenize text into a list of tokens.

Parameters: text (str) – Raw input text.
Returns: List of tokens.
Return type: List[str]
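
A short sketch of the protection hooks, assuming the package is installed (the sample text is illustrative, and the @-handle regex is a hypothetical extra pattern):

>>> import re
>>> from typing import List
>>> from icu_tokenizer.tokenizer import Tokenizer
>>> tokenizer = Tokenizer(
...     lang='en',
...     protect_emails_urls=True,  # keep emails and URLs as single tokens
...     extra_protected_patterns=[re.compile(r'@\w+')],  # hypothetical: also protect @handles
... )
>>> tokens: List[str] = tokenizer.tokenize(
...     'Mail support@example.com or ping @support; docs at https://example.com.'
... )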