Tokenize [String]
Description
Separates a string into tokens. The tokenization method is set with the Tokenization parameter (e.g., split only on spaces, or on spaces and all punctuation).
Input
SOURCE [STRING]: a list of strings. Each string is tokenized.
Output
RESULT [STRING]: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
PAIR [STRING, STRING]: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
Parameters
Tokenization: the method to tokenize the input strings.
  None: perform no tokenization
  Spaces: all valid Unicode space characters
  Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  Custom Regular Expression: any regular expression
Min token length: tokens whose character length is shorter than this value are discarded
Gram type:
  Word (default): each token is composed of UTF-8 word n-grams
  Character: each token is composed of UTF-8 character n-grams
Grams: the size n of the n-gram tokens to extract (default is 1)
Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Output scores can be aggregated and/or normalized.
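
The following Python sketch illustrates how the Tokenization, Min token length, Gram type, Grams, and Case-sensitive parameters might interact. It is not the component's implementation. The function names, the mapping of each tokenization option to Unicode general categories, and the order in which the steps are applied (case folding, splitting, length filter, n-grams) are all assumptions made for illustration. Stemming, score aggregation, and custom regular expressions are left out.

```python
import unicodedata

# Assumed mapping from each Tokenization option to the Unicode general
# categories treated as delimiters ("Z" = separators, "P" = punctuation,
# "Nd" = decimal digits, "S" = symbols). Hypothetical, for illustration only.
DELIMITERS = {
    "Spaces": {"Z"},
    "Spaces/Punctuation": {"Z", "P"},
    "Spaces/Punctuation/Digits": {"Z", "P", "Nd"},
    "Spaces/Punctuation/Digits/Symbols": {"Z", "P", "Nd", "S"},
}


def is_delimiter(ch, categories):
    """Return True if ch should split tokens under the given categories."""
    cat = unicodedata.category(ch)
    # str.isspace() also covers tab and newline, which are not in category Z.
    return ch.isspace() or cat in categories or cat[0] in categories


def split(text, method):
    """Split text on the delimiter characters selected by method."""
    if method == "None":
        return [text]
    categories = DELIMITERS[method]
    tokens, current = [], []
    for ch in text:
        if is_delimiter(ch, categories):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize(text, method="Spaces/Punctuation", min_length=1,
             grams=1, gram_type="word", case_sensitive=True):
    """Hypothetical end-to-end tokenization under the assumptions above."""
    if not case_sensitive:
        text = text.lower()
    # Assumption: the length filter is applied before n-grams are built.
    words = [w for w in split(text, method) if len(w) >= min_length]
    if gram_type == "word":
        # Word n-grams: sliding window over consecutive tokens.
        return [" ".join(words[i:i + grams])
                for i in range(len(words) - grams + 1)]
    # Character n-grams: sliding window inside each token.
    return [w[i:i + grams] for w in words
            for i in range(len(w) - grams + 1)]


if __name__ == "__main__":
    sample = "Hello, World! 42 times"
    print(tokenize(sample, case_sensitive=False))
    print(tokenize(sample, method="Spaces/Punctuation/Digits"))
    print(tokenize(sample, grams=2))
    print(tokenize("abcd", method="Spaces", grams=3, gram_type="character"))
```

In this sketch, choosing Spaces/Punctuation/Digits removes "42" from the sample because digits act as delimiters, while setting Grams to 2 with the Word gram type joins each pair of adjacent tokens with a single space.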