Commandline Tools
ICU-Tokenizer exposes its full functionality through the command line. The command-line tools can be accessed by calling the module as a script.
python -m icu_tokenizer
Sentence Splitting
Split lines containing multiple sentences.
usage: python3 -m icu_tokenizer split [-h] [-l LANG] [-i INPUTS [INPUTS ...]]
[-o OUTPUT] [-j NUM_WORKERS]
[--show-pbar] [--verbose]
Named Arguments
-l, --lang | Language identifier. Default: "en"
-i, --inputs | Input files. Default: stdin
-o, --output | Output file. Default: stdout
-j, --num-workers | Number of processes to use. Default: 0
--show-pbar | Show a progress bar. Default: False
--verbose | Print splits to stderr. Default: False
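The options above can be assembled programmatically when the splitter is driven from a script. The following is a minimal sketch, assuming the module is invoked as icu_tokenizer as in the introduction; the helper build_split_command and the file names are hypothetical and only illustrate how the documented flags combine.

```python
import shlex
import sys


def build_split_command(inputs=None, output=None, lang="en",
                        num_workers=0, show_pbar=False, verbose=False):
    """Build an argv list for the documented ``split`` subcommand.

    Hypothetical helper: it only assembles flags listed in this section
    and does not import or call icu_tokenizer itself.
    """
    cmd = [sys.executable, "-m", "icu_tokenizer", "split", "-l", lang]
    if inputs:
        # -i accepts one or more input files; omitted means stdin.
        cmd += ["-i", *inputs]
    if output:
        # -o names the output file; omitted means stdout.
        cmd += ["-o", output]
    if num_workers:
        cmd += ["-j", str(num_workers)]
    if show_pbar:
        cmd.append("--show-pbar")
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = build_split_command(["news.en.txt"], "news.en.sent.txt",
                              num_workers=4, show_pbar=True)
    # Print the equivalent shell command instead of running it.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where the package is installed.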
Normalize
Normalize text using Unicode properties.
usage: python3 -m icu_tokenizer normalize [-h] [-l LANG] [-p] [-lc]
[-i INPUTS [INPUTS ...]]
[-o OUTPUT] [-j NUM_WORKERS]
[--show-pbar]
Named Arguments
-l, --lang | Language identifier. Default: "en"
-p, --norm-puncts | Normalize punctuation. Default: False
-lc, --lowercase | Cast all characters to lowercase. Default: False
-i, --inputs | Input files. Default: stdin
-o, --output | Output file. Default: stdout
-j, --num-workers | Number of processes to use. Default: 0
--show-pbar | Show a progress bar. Default: False
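Because inputs default to stdin and output to stdout, the normalizer can also be used as a filter from another program. The following is a rough sketch, assuming the module is invoked as icu_tokenizer as in the introduction; build_normalize_command and normalize_text are hypothetical helpers, and normalize_text only works where the package is installed.

```python
import subprocess
import sys


def build_normalize_command(lang="en", norm_puncts=False,
                            lowercase=False, num_workers=0):
    """Build an argv list for the documented ``normalize`` subcommand.

    Hypothetical helper: no -i/-o flags are added, so the tool reads
    stdin and writes stdout (its documented defaults).
    """
    cmd = [sys.executable, "-m", "icu_tokenizer", "normalize", "-l", lang]
    if norm_puncts:
        cmd.append("-p")
    if lowercase:
        cmd.append("-lc")
    if num_workers:
        cmd += ["-j", str(num_workers)]
    return cmd


def normalize_text(text, **options):
    """Pipe ``text`` through the CLI and return whatever it prints.

    Assumes icu_tokenizer is installed in the current interpreter;
    raises CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_normalize_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

A call such as normalize_text("Some text\n", norm_puncts=True, lowercase=True) would then correspond to running the subcommand with -p and -lc on piped input.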
Tokenize
Tokenize text using Unicode properties.
usage: python3 -m icu_tokenizer tokenize [-h] [-i INPUTS [INPUTS ...]]
[-o OUTPUT] [-l LANG] [-a] [-url]
[-j NUM_WORKERS] [--show-pbar]
Named Arguments
-i, --inputs | Input files. Default: stdin
-o, --output | Output file. Default: stdout
-l, --lang | Language identifier. Default: "en"
-a, --annotate-hyphens | Annotate hyphens, similar to Moses. Default: False
-url, --protect-urls | Protect URL patterns. Default: False
-j, --num-workers | Number of processes to use. Default: 0
--show-pbar | Show a progress bar. Default: False
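When tokenizing several files, for example the two sides of a parallel corpus, the same flags are repeated with a different language identifier each time. The following is a small sketch, assuming the module is invoked as icu_tokenizer as in the introduction; build_tokenize_command and the corpus file names are hypothetical.

```python
import shlex
import sys


def build_tokenize_command(input_path, output_path, lang="en",
                           annotate_hyphens=False, protect_urls=False,
                           num_workers=0, show_pbar=False):
    """Build an argv list for the documented ``tokenize`` subcommand.

    Hypothetical helper: it only assembles flags listed in this section.
    """
    cmd = [sys.executable, "-m", "icu_tokenizer", "tokenize",
           "-i", input_path, "-o", output_path, "-l", lang]
    if annotate_hyphens:
        cmd.append("-a")
    if protect_urls:
        cmd.append("-url")
    if num_workers:
        cmd += ["-j", str(num_workers)]
    if show_pbar:
        cmd.append("--show-pbar")
    return cmd


if __name__ == "__main__":
    # Hypothetical corpus files, one per language side.
    for lang in ("en", "de"):
        cmd = build_tokenize_command(
            f"corpus.{lang}", f"corpus.tok.{lang}",
            lang=lang, protect_urls=True, num_workers=2,
        )
        # Print the equivalent shell command instead of running it.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the flags before running them with subprocess.run or pasting them into a shell.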