Python API Reference
Sentence Splitter
Normalizer
class icu_tokenizer.normalizer.Normalizer(lang: str = 'en', norm_puncts: bool = False)[source]

Unicode information based normalizer.
Does the following:

- Ensure NFKC format
- Handle pseudo-spaces (for numbers)
- Normalize by Unicode categories (https://www.fileformat.info/info/unicode/category/index.htm); see the sketch after this list:
  - [C*|So|Z*] → ' ' (space)
  - [Pc] → _
  - [Pd] → -
  - [Pf|Pi] → " (except for ’)
  - [Ps] → ( (except for { and [)
  - [Pe] → ) (except for } and ])
- Normalize Nd (numbers)
- Account for some outliers
- Remove non-printable characters
- Normalize whitespace characters
- Perform language specific normalization
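The category mapping above can be sketched with Python's built-in unicodedata module. This is an illustrative sketch of the technique only, not the library's actual implementation; the helper name map_by_category is made up for the example:

    import unicodedata

    def map_by_category(ch: str) -> str:
        cat = unicodedata.category(ch)
        if cat.startswith(("C", "Z")) or cat == "So":  # controls, separators, other symbols
            return " "
        if cat == "Pc":                                # connector punctuation
            return "_"
        if cat == "Pd":                                # dash punctuation, e.g. em dash
            return "-"
        if cat in ("Pi", "Pf") and ch != "\u2019":     # quotation marks, except ’
            return '"'
        if cat == "Ps" and ch not in "{[":             # opening brackets, except { and [
            return "("
        if cat == "Pe" and ch not in "}]":             # closing brackets, except } and ]
            return ")"
        if cat == "Nd":                                # decimal digits -> ASCII equivalents
            return str(unicodedata.digit(ch))
        return ch

    text = unicodedata.normalize("NFKC", "\u201cHello\u201d \u2014 world")
    print("".join(map_by_category(c) for c in text))   # "Hello" - world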
Usage:

>>> normalizer = Normalizer(lang, norm_puncts=True)
>>> norm_text: str = normalizer.normalize(text)
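A concrete call, assuming the package is installed; the exact output may vary by version:

    from icu_tokenizer.normalizer import Normalizer

    normalizer = Normalizer(lang="en", norm_puncts=True)
    # Curly quotes and the em dash should be mapped per the category table above.
    print(normalizer.normalize("The \u201cquick\u201d fox \u2014 jumps!"))
    # Expected along the lines of: The "quick" fox - jumps!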
Tokenizer
class icu_tokenizer.tokenizer.Tokenizer(lang: str = 'en', annotate_hyphens: bool = False, protect_emails_urls: bool = False, extra_protected_patterns: List[Union[str, re.Pattern]] = [])[source]

ICU based tokenizer with additional functionality to protect sequences.
Usage:

>>> tokenizer = Tokenizer(
...     lang,
...     annotate_hyphens=annotate_hyphens,
...     protect_emails_urls=protect_emails_urls,
...     extra_protected_patterns=extra_protected_patterns,
... )
>>> tokens: List[str] = tokenizer.tokenize(text)
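A filled-in sketch with assumed inputs; the token boundaries shown are indicative, not guaranteed:

    import re
    from icu_tokenizer.tokenizer import Tokenizer

    tokenizer = Tokenizer(
        lang="en",
        protect_emails_urls=True,                        # keep emails and URLs whole
        extra_protected_patterns=[re.compile(r"#\w+")],  # e.g. also keep hashtags whole
    )
    print(tokenizer.tokenize("Email me at user@example.com #nlp"))
    # Something like: ['Email', 'me', 'at', 'user@example.com', '#nlp']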
__init__(lang: str = 'en', annotate_hyphens: bool = False, protect_emails_urls: bool = False, extra_protected_patterns: List[Union[str, re.Pattern]] = [])[source]

Tokenizer.

Keyword Arguments:
- lang {str} -- Language code (default: {'en'})
- annotate_hyphens {bool} -- Annotate hyphens (default: {False})
- protect_emails_urls {bool} -- Protect emails and urls (default: {False})
- extra_protected_patterns {List[Union[str, re.Pattern]]} -- A list of regex patterns (default: {[]})
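Since extra_protected_patterns is typed List[Union[str, re.Pattern]], both raw strings and precompiled patterns should be accepted; the patterns below are assumptions chosen for illustration:

    import re
    from icu_tokenizer.tokenizer import Tokenizer

    tokenizer = Tokenizer(
        lang="en",
        extra_protected_patterns=[
            r"v\d+\.\d+\.\d+",              # version strings, passed as a raw string
            re.compile(r"[A-Z]{2,}-\d+"),   # ticket IDs, passed precompiled
        ],
    )
    tokens = tokenizer.tokenize("Fixed in v1.2.3, see JIRA-42.")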