Tokenize [Obj,String]

Description

Separates a string into tokens. The tokenization method is defined by the Tokenization parameter (e.g., tokenize only at spaces, or also at all punctuation characters).
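
For illustration, the following sketch approximates spaces-only and spaces-plus-punctuation tokenization in plain Python; the function name and the ASCII-only character classes are simplifying assumptions, as the operator itself works on full Unicode categories:

  import re

  def tokenize(text, mode="spaces"):
      # Illustrative delimiter patterns: ASCII approximations of the
      # Unicode space/punctuation classes the operator actually uses.
      patterns = {
          "none": None,                              # whole string stays one token
          "spaces": r"\s+",                          # split at whitespace only
          "spaces_punct": r"[\s!-/:-@\[-`{-~]+",     # whitespace + ASCII punctuation/symbols
      }
      pattern = patterns[mode]
      if pattern is None:
          return [text]
      return [tok for tok in re.split(pattern, text) if tok]

  print(tokenize("Hello, world!"))                         # ['Hello,', 'world!']
  print(tokenize("Hello, world!", mode="spaces_punct"))    # ['Hello', 'world']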

Input

  • SOURCE [OBJ,STRING]: a list of object-string pairs. Each string is tokenized.

Output

  • PAIR [OBJ,STRING]: each result pair contains an object from the input source and one token from its tokenized string; each token is therefore returned as a separate result pair.
  • RESULT [STRING]: the extracted tokens. Use the score aggregation parameter to define how multiple occurrences of the same token are handled. Note that the reference to the object each token came from is lost; see the sketch after this list.
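
The sketch below shows how the two outputs differ for a small source list; the pair/result structures and the counting used for aggregation are simplified assumptions, not the operator's internal representation:

  from collections import Counter

  source = [("obj1", "red apple"), ("obj2", "green apple")]

  def tokens_of(text):
      return text.split()    # spaces-only tokenization for brevity

  # PAIR output: every token keeps the reference to the object it came from.
  pair = [(obj, tok) for obj, text in source for tok in tokens_of(text)]
  # -> [('obj1', 'red'), ('obj1', 'apple'), ('obj2', 'green'), ('obj2', 'apple')]

  # RESULT output: only the tokens; identical tokens are merged (plain counting
  # stands in for the score aggregation parameter) and the object reference is lost.
  result = Counter(tok for _, tok in pair)
  # -> Counter({'apple': 2, 'red': 1, 'green': 1})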

Parameters

  • Tokenization: the method to tokenize the input strings.
    • None: perform no tokenization
    • Spaces: all valid Unicode space characters
    • Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
    • Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
    • Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
    • Custom Regular Expression: any regular expression
  • Min token length: tokens whose character length is shorter than this value are discarded
  • Gram type:
    • Word (default): each token is composed of UTF-8 word n-grams
    • Character: each token is composed of UTF-8 character n-grams
  • Grams: the size n of the n-gram tokens to extract (default is 1); see the sketch after this list
  • Stemming: tokens can be stemmed for a specific language or left as they are
  • Case-sensitive: if set to false, upper/lower case is ignored
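
The following sketch shows how min token length, gram type, gram size, and case sensitivity interact; stemming and the regular-expression modes are omitted, and all names are illustrative assumptions rather than the operator's actual code:

  def tokenize(text, min_len=1, gram_type="word", grams=1, case_sensitive=True):
      if not case_sensitive:
          text = text.lower()
      # Spaces-only tokenization, then drop tokens shorter than the minimum length.
      words = [w for w in text.split() if len(w) >= min_len]
      if gram_type == "word":
          # Word n-grams: sliding window of `grams` consecutive words.
          return [" ".join(words[i:i + grams]) for i in range(len(words) - grams + 1)]
      # Character n-grams: sliding window of `grams` characters inside each word.
      return [w[i:i + grams] for w in words for i in range(len(w) - grams + 1)]

  print(tokenize("New York City"))                 # ['New', 'York', 'City']
  print(tokenize("New York City", grams=2))        # ['New York', 'York City']
  print(tokenize("New York City", gram_type="character", grams=3, case_sensitive=False))
  # -> ['new', 'yor', 'ork', 'cit', 'ity']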

Output scores can be aggregated and/or normalized.
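
As an illustration, the sketch below aggregates duplicate tokens by counting them and then normalizes the scores against the maximum; both choices are assumptions standing in for whichever aggregation and normalization options are configured:

  from collections import Counter

  tokens = ["apple", "red", "apple", "green"]

  # Aggregate: count the occurrences of each distinct token as its score.
  scores = Counter(tokens)            # Counter({'apple': 2, 'red': 1, 'green': 1})

  # Normalize: divide by the largest score so all values fall into (0, 1].
  top = max(scores.values())
  normalized = {tok: s / top for tok, s in scores.items()}
  # -> {'apple': 1.0, 'red': 0.5, 'green': 0.5}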