Tokenize [String]

Description

Splits each input string into tokens. The Tokenization parameter selects which characters act as delimiters (e.g., spaces only, or spaces plus all punctuation).

Input

  • SOURCE [STRING]: a list of strings. Each string is tokenized.

Output

  • RESULT [STRING]: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
  • PAIR [STRING, STRING]: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.

Parameters

  • Tokenization: the method used to split the input strings into tokens. Each mode names the characters that act as delimiters.
    • None: perform no tokenization
    • Spaces: split on all valid Unicode space characters
    • Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
    • Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
    • Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
    • Custom Regular Expression: any user-supplied regular expression
  • Min token length: tokens with fewer characters than this value are discarded
  • Gram type:
    • Word (default): each token is composed of word n-grams
    • Character: each token is composed of character n-grams
  • Grams: the n-gram size n used when extracting tokens (default is 1)
  • Stemming: tokens can be stemmed for a specific language or left as they are
  • Case-sensitive: if set to false, upper/lower case is ignored
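
The following Python sketch illustrates how the parameters above could interact. It is not the operator's actual implementation: the function names, the mode keys, the mapping of modes to Unicode general categories, the treatment of a custom regular expression as a delimiter pattern, and the order in which the steps are applied (case folding, splitting, length filter, n-grams) are all assumptions made for illustration. Stemming is left out because it would require a language-specific stemmer.

```python
import re
import unicodedata

# Hypothetical mode keys mapped to the Unicode general categories that are
# treated as delimiters: "Z" = separators, "P" = punctuation,
# "Nd" = decimal digits, "S" = symbols.
DELIMITERS = {
    "none": (),
    "spaces": ("Z",),
    "spaces_punct": ("Z", "P"),
    "spaces_punct_digits": ("Z", "P", "Nd"),
    "spaces_punct_digits_symbols": ("Z", "P", "Nd", "S"),
}


def is_delimiter(ch, categories):
    """Return True if ch separates tokens under the given categories."""
    if not categories:
        return False  # mode "none": nothing splits the string
    if ch.isspace():
        return True  # also covers tab and newline, which are category Cc
    return unicodedata.category(ch).startswith(categories)


def split_tokens(text, mode="spaces", pattern=None):
    """Split text on delimiter characters, or on a custom regex if given."""
    if pattern is not None:
        # Assumption: the custom regular expression describes the delimiter.
        return [t for t in re.split(pattern, text) if t]
    categories = DELIMITERS[mode]
    tokens, current = [], []
    for ch in text:
        if is_delimiter(ch, categories):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def word_ngrams(tokens, n):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n):
    """Return every substring of length n of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def tokenize(text, mode="spaces", pattern=None, min_length=1,
             gram_type="word", grams=1, case_sensitive=True):
    """Apply case folding, splitting, length filtering, then n-grams."""
    if not case_sensitive:
        text = text.casefold()
    tokens = [t for t in split_tokens(text, mode, pattern)
              if len(t) >= min_length]
    if gram_type == "word":
        return word_ngrams(tokens, grams)
    return [g for t in tokens for g in char_ngrams(t, grams)]
```

Under these assumptions, tokenize("Hello, World! 42", mode="spaces_punct") yields ["Hello", "World", "42"], while tokenize("a b c", grams=2) yields the word bigrams ["a b", "b c"].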

Output scores can be aggregated and/or normalized.