Command-line Tools

ICU-Tokenizer exposes its full functionality through the command line. The command-line tools are accessed by running the module as a script:

python -m icu_tokenizer
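
When driving these tools from another Python program, the invocation above can be assembled as an argument list. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of ICU-Tokenizer, and it covers only the options that all three subcommands documented here share (`-l`, `-i`, `-o`, `-j`, `--show-pbar`).

```python
import sys


def build_command(subcommand, lang="en", inputs=(), output=None,
                  num_workers=0, show_pbar=False):
    """Build an argv list for `python -m icu_tokenizer <subcommand>`.

    Hypothetical helper for illustration; only flags listed in this
    document are emitted.
    """
    cmd = [sys.executable, "-m", "icu_tokenizer", subcommand,
           "-l", lang, "-j", str(num_workers)]
    if inputs:
        # -i accepts one or more input files; omitted means stdin.
        cmd += ["-i", *inputs]
    if output is not None:
        # -o takes a single output file; omitted means stdout.
        cmd += ["-o", output]
    if show_pbar:
        cmd.append("--show-pbar")
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` would run the tool, assuming the package is installed for the current interpreter.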

Sentence Splitting

Split lines containing multiple sentences.

usage: python3 -m icu_tokenizer split [-h] [-l LANG] [-i INPUTS [INPUTS ...]]
                                      [-o OUTPUT] [-j NUM_WORKERS]
                                      [--show-pbar] [--verbose]
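
The usage line above maps naturally onto `argparse`. The following is a minimal sketch of how an equivalent parser could be declared; it is an assumption about the shape of the interface, not the project's actual source, and it keeps file arguments as plain paths rather than opened streams.

```python
import argparse


def make_split_parser():
    """Declare a parser mirroring the documented `split` options (sketch)."""
    parser = argparse.ArgumentParser(
        description="Split lines containing multiple sentences.")
    parser.add_argument("-l", "--lang", default="en",
                        help="Language identifier")
    # nargs="+" gives the `INPUTS [INPUTS ...]` form; None stands for stdin.
    parser.add_argument("-i", "--inputs", nargs="+", default=None,
                        help="Input files. Defaults to stdin.")
    # None stands for stdout in this sketch.
    parser.add_argument("-o", "--output", default=None,
                        help="Output file. Defaults to stdout.")
    parser.add_argument("-j", "--num-workers", type=int, default=0,
                        help="Number of processes to use")
    parser.add_argument("--show-pbar", action="store_true",
                        help="Show a progress bar")
    parser.add_argument("--verbose", action="store_true",
                        help="Print splits to stderr")
    return parser
```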

Named Arguments

-l, --lang

Language identifier

Default: "en"

-i, --inputs

Input files (one or more paths).

Default: stdin

-o, --output

Output file.

Default: stdout

-j, --num-workers

Number of processes to use

Default: 0

--show-pbar

Show a progress bar

Default: False

--verbose

Print splits to stderr

Default: False
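
To make the behaviour of `split` and `--verbose` concrete, here is a rough stand-in written with a regular expression. It is not how ICU-Tokenizer finds sentence boundaries (the rule used here is an assumption and will mis-split abbreviations such as "Dr. Smith"); `split_line` and the ` ||| ` separator are hypothetical names chosen for the example.

```python
import re
import sys

# Assumed boundary rule: whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_line(line, verbose=False):
    """Split one input line into a list of sentences (naive sketch)."""
    sentences = [s for s in _BOUNDARY.split(line.strip()) if s]
    if verbose and len(sentences) > 1:
        # Mirror the idea of --verbose: report each split on stderr.
        print(" ||| ".join(sentences), file=sys.stderr)
    return sentences
```

Each returned sentence would then be written on its own output line.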

Normalize

Normalize text using Unicode properties.

usage: python3 -m icu_tokenizer normalize [-h] [-l LANG] [-p] [-lc]
                                          [-i INPUTS [INPUTS ...]]
                                          [-o OUTPUT] [-j NUM_WORKERS]
                                          [--show-pbar]

Named Arguments

-l, --lang

Language identifier

Default: "en"

-p, --norm-puncts

Normalize punctuation

Default: False

-lc, --lowercase

Cast all characters to lowercase

Default: False

-i, --inputs

Input files (one or more paths).

Default: stdin

-o, --output

Output file.

Default: stdout

-j, --num-workers

Number of processes to use

Default: 0

--show-pbar

Show a progress bar

Default: False
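
As a rough illustration of what property-based normalization with the `-p` and `-lc` switches might look like, the sketch below uses only the standard-library `unicodedata` module. The concrete rules are assumptions made for the example (NFKC normalization as the baseline, and mapping every character in the dash category `Pd` to an ASCII hyphen); the real tool may apply different or additional rules.

```python
import unicodedata


def normalize(text, norm_puncts=False, lowercase=False):
    """Normalize text using Unicode properties (illustrative sketch)."""
    # Assumed baseline: fold compatibility forms such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    if norm_puncts:
        # Assumed rule: any dash punctuation (category Pd) becomes '-'.
        text = "".join(
            "-" if unicodedata.category(ch) == "Pd" else ch
            for ch in text
        )
    if lowercase:
        text = text.lower()
    return text
```

Keeping each switch as an independent step means the flags can be combined freely, as on the command line.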

Tokenize

Tokenize text using Unicode properties.

usage: python3 -m icu_tokenizer tokenize [-h] [-i INPUTS [INPUTS ...]]
                                         [-o OUTPUT] [-l LANG] [-a] [-url]
                                         [-j NUM_WORKERS] [--show-pbar]

Named Arguments

-i, --inputs

Input files (one or more paths).

Default: stdin

-o, --output

Output file.

Default: stdout

-l, --lang

Language identifier

Default: "en"

-a, --annotate-hyphens

Annotate hyphens in the style of Moses

Default: False

-url, --protect-urls

Protect URL patterns

Default: False

-j, --num-workers

Number of processes to use

Default: 0

--show-pbar

Show a progress bar

Default: False
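
The effect of `-a` and `-url` can be pictured with a small regex-based tokenizer. This is a simplified stand-in, not the ICU-based implementation: the URL pattern, the token pattern, and the function name `tokenize` are assumptions for the example, and the `@-@` marker follows the convention used by the Moses tokenizer for hyphens inside words.

```python
import re

# Assumed patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(line, annotate_hyphens=False, protect_urls=False):
    """Split a line into word and punctuation tokens (sketch)."""
    tokens = []
    for chunk in line.split():
        if protect_urls and URL_RE.fullmatch(chunk):
            # Keep the whole URL as a single token.
            tokens.append(chunk)
            continue
        pieces = TOKEN_RE.findall(chunk)
        for i, tok in enumerate(pieces):
            inner = 0 < i < len(pieces) - 1
            if (annotate_hyphens and tok == "-" and inner
                    and pieces[i - 1][-1].isalnum()
                    and pieces[i + 1][0].isalnum()):
                # Mark a word-internal hyphen so it can be rejoined later.
                tok = "@-@"
            tokens.append(tok)
    return tokens
```

Marking word-internal hyphens makes the split reversible at detokenization time, while URL protection avoids breaking an address into many punctuation tokens.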