Tokenize [Obj,String]
Description
Separates a string into tokens. The tokenization method is set by the Tokenization parameter (e.g., split on spaces only, or on spaces and all punctuation).
Input
SOURCE [OBJ,STRING]
: a list of object-string pairs. Each string is tokenized.
Output
PAIR [OBJ,STRING]
: a result pair contains an object from the input source and a token from the tokenized string. Thus, each token from the string is returned as a separate result pair.
RESULT [STRING]
: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled. Note that the reference to the object each token came from is lost.
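The difference between the two outputs can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `tokenize_pairs` and `tokenize_result` are hypothetical, plain whitespace splitting stands in for whichever tokenization method is configured, and score aggregation is left out.

```python
def tokenize_pairs(source):
    """PAIR output: one (object, token) pair per token of each input string."""
    # Whitespace splitting is a stand-in for the configured tokenization method.
    return [(obj, token) for obj, text in source for token in text.split()]


def tokenize_result(source):
    """RESULT output: the tokens alone; the originating object is dropped."""
    return [token for _, token in tokenize_pairs(source)]


source = [("doc1", "red apple"), ("doc2", "green apple")]

pairs = tokenize_pairs(source)
# [('doc1', 'red'), ('doc1', 'apple'), ('doc2', 'green'), ('doc2', 'apple')]

result = tokenize_result(source)
# ['red', 'apple', 'green', 'apple']
```

In the RESULT list, the two occurrences of `apple` can no longer be traced back to `doc1` and `doc2`, which is why the aggregation setting matters for that output.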
Parameters
Tokenization
: the method to tokenize the input strings.
    None
    : perform no tokenization
    Spaces
    : all valid Unicode space characters
    Spaces/Punctuation
    : Spaces + all valid Unicode punctuation characters
    Spaces/Punctuation/Digits
    : Spaces/Punctuation + all valid Unicode digit characters
    Spaces/Punctuation/Digits/Symbols
    : Spaces/Punctuation/Digits + all valid Unicode symbol characters
    Custom Regular Expression
    : any regular expression
Min token length
: tokens whose character length is shorter than this value are discarded
Gram type
    Word (default)
    : each token is composed of UTF-8 word n-grams
    Character
    : each token is composed of UTF-8 character n-grams
Grams
: the size n of the n-gram tokens to extract (default is 1)
Stemming
: tokens can be stemmed for a specific language or left as they are
Case-sensitive
: if set to false, upper/lower case is ignored
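How the parameters above might interact can be sketched in Python. This is a rough model, not the operator's actual implementation. Several points are assumptions: the mapping of each Tokenization option to Unicode general categories, the reading of a custom regular expression as a separator pattern, the order of the steps (case folding, splitting, length filter, n-grams), and the function names `split`, `ngrams` and `tokenize`. Stemming is omitted.

```python
import re
import unicodedata

# Assumed mapping from each Tokenization option to the Unicode general
# categories treated as separators: Z* = spaces, P* = punctuation,
# Nd = decimal digits, S* = symbols.
SEPARATORS = {
    "Spaces": ("Z",),
    "Spaces/Punctuation": ("Z", "P"),
    "Spaces/Punctuation/Digits": ("Z", "P", "Nd"),
    "Spaces/Punctuation/Digits/Symbols": ("Z", "P", "Nd", "S"),
}


def split(text, method="Spaces", pattern=None):
    """Split text according to one of the Tokenization options."""
    if method == "None":
        return [text]
    if method == "Custom Regular Expression":
        # Assumption: the expression matches separators, not tokens.
        return [t for t in re.split(pattern, text) if t]
    prefixes = SEPARATORS[method]
    # Replace every separator character with a plain space, then split.
    # str.isspace() also catches tabs and newlines (category Cc).
    cleaned = "".join(
        " " if ch.isspace() or unicodedata.category(ch).startswith(prefixes) else ch
        for ch in text
    )
    return cleaned.split()


def ngrams(items, n):
    """All contiguous runs of n items (works for lists and for strings)."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]


def tokenize(text, method="Spaces", pattern=None, min_length=1,
             gram_type="Word", grams=1, case_sensitive=True):
    if not case_sensitive:
        text = text.lower()
    # Assumption: the length filter is applied before n-grams are built.
    tokens = [t for t in split(text, method, pattern) if len(t) >= min_length]
    if gram_type == "Word":
        return [" ".join(gram) for gram in ngrams(tokens, grams)]
    # Character n-grams are taken inside each token.
    return [gram for token in tokens for gram in ngrams(token, grams)]


print(tokenize("Hello, World! 42", method="Spaces/Punctuation", case_sensitive=False))
# ['hello', 'world', '42']
print(tokenize("a quick brown fox", min_length=2, grams=2))
# ['quick brown', 'brown fox']
print(tokenize("token", gram_type="Character", grams=3))
# ['tok', 'oke', 'ken']
```

With `grams` left at its default of 1, word n-grams reduce to the plain tokens, which matches the documented default behaviour.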
Output scores can be aggregated and/or normalized.
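The page does not list the available aggregation or normalization modes, so the following Python sketch only illustrates the general idea. The modes `sum`, `max` and `avg`, the per-occurrence score of 1.0, the division by the highest score, and the function names `aggregate` and `normalize` are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate(scored_tokens, how="sum"):
    """Merge repeated tokens into one entry per token (hypothetical modes)."""
    grouped = defaultdict(list)
    for token, score in scored_tokens:
        grouped[token].append(score)
    combine = {
        "sum": sum,
        "max": max,
        "avg": lambda scores: sum(scores) / len(scores),
    }[how]
    return {token: combine(scores) for token, scores in grouped.items()}


def normalize(scores):
    """Scale scores so that the highest one becomes 1.0 (assumed scheme)."""
    top = max(scores.values(), default=0.0)
    if top == 0:
        return dict(scores)
    return {token: score / top for token, score in scores.items()}


# Assumption: every token occurrence starts with a score of 1.0.
occurrences = [("apple", 1.0), ("red", 1.0), ("apple", 1.0)]

totals = aggregate(occurrences)
# {'apple': 2.0, 'red': 1.0}

scaled = normalize(totals)
# {'apple': 1.0, 'red': 0.5}
```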