Tokenize [String]
Description
Separates a string into tokens. The tokenization method is set with the Tokenization parameter (e.g., split only on spaces, or also on all punctuation).
Input
SOURCE [STRING]
: a list of strings. Each string is tokenized.
Output
RESULT [STRING]
: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
PAIR [STRING, STRING]
: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
Parameters
Tokenization
: the method to tokenize the input strings.
    None
    : perform no tokenization
    Spaces
    : all valid Unicode space characters
    Spaces/Punctuation
    : Spaces + all valid Unicode punctuation characters
    Spaces/Punctuation/Digits
    : Spaces/Punctuation + all valid Unicode digit characters
    Spaces/Punctuation/Digits/Symbols
    : Spaces/Punctuation/Digits + all valid Unicode symbol characters
    Custom Regular Expression
    : any regular expression
Min token length
: tokens shorter than this number of characters are discarded
Gram type
: the unit from which n-grams are built.
    Word (default)
    : each token is composed of UTF-8 word n-grams
    Character
    : each token is composed of UTF-8 character n-grams
Grams
: the size n of the extracted n-gram tokens (default is 1)
Stemming
: tokens can be stemmed for a specific language or left as they are
Case-sensitive
: if set to false, upper/lower case is ignored
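The cumulative delimiter classes of the Tokenization parameter can be illustrated with a minimal Python sketch. It is not the operator's implementation: the function name `tokenize`, the lowercase mode strings, and the mapping of each option to Unicode general categories (`Z*`, `P*`, `Nd`, `S*`) are assumptions made for illustration. The sketch also assumes that a custom regular expression matches the delimiters rather than the tokens, which the description above does not specify.

```python
import re
import unicodedata

# Hypothetical mode names mirroring the Tokenization options.
# Each mode maps to Unicode general-category prefixes treated as delimiters (an assumption).
MODES = {
    "spaces": ("Z",),
    "spaces/punctuation": ("Z", "P"),
    "spaces/punctuation/digits": ("Z", "P", "Nd"),
    "spaces/punctuation/digits/symbols": ("Z", "P", "Nd", "S"),
}


def is_delimiter(ch, prefixes):
    # Tab and newline are category "Cc", so str.isspace() is checked as well.
    if ch.isspace():
        return True
    return unicodedata.category(ch).startswith(prefixes)


def tokenize(text, mode="spaces", pattern=None):
    if mode == "none":
        return [text]  # no tokenization: the whole string is one token
    if mode == "custom":
        # Assumption: the regular expression describes delimiters.
        return [t for t in re.split(pattern, text) if t]
    prefixes = MODES[mode]
    tokens, current = [], []
    for ch in text:
        if is_delimiter(ch, prefixes):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


sample = "Hello, world! 42€"
print(tokenize(sample, "spaces"))                     # ['Hello,', 'world!', '42€']
print(tokenize(sample, "spaces/punctuation"))         # ['Hello', 'world', '42€']
print(tokenize(sample, "spaces/punctuation/digits"))  # ['Hello', 'world', '€']
```

Each mode adds one delimiter class to the previous one, so every step can only split the input further or drop more characters.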
Output scores can be aggregated and/or normalized.
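How Gram type, Grams, Min token length and Case-sensitive might interact is sketched below in Python. The function name `ngrams`, the order of the steps (case folding, then n-gram construction, then length filtering), joining word n-grams with a single space, and building character n-grams within each token are all assumptions for illustration; the operator may behave differently. Stemming and score aggregation are left out.

```python
def ngrams(tokens, n=1, gram_type="word", min_length=1, case_sensitive=True):
    # Case-sensitive = false: fold case before building n-grams (assumed order).
    if not case_sensitive:
        tokens = [t.casefold() for t in tokens]

    if gram_type == "word":
        # Sliding window of n consecutive tokens, joined by a space (assumption).
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    else:
        # Sliding window of n consecutive characters inside each token (assumption).
        grams = [t[i:i + n] for t in tokens for i in range(len(t) - n + 1)]

    # Min token length: discard results shorter than the threshold.
    return [g for g in grams if len(g) >= min_length]


words = ["The", "quick", "fox"]
print(ngrams(words, n=2))  # ['The quick', 'quick fox']
print(ngrams(["The", "fox"], n=2, gram_type="character", case_sensitive=False))
# ['th', 'he', 'fo', 'ox']
```

With the default of 1 for Grams and the word gram type, the output is the input token list after case folding and length filtering.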