Rank by Text TF-IDF

Description

Ranks objects in SOURCE [OBJ,STRING] by the relevance score of each STRING against the keywords in QTERMS [STRING]. The relevance score is computed with a generic Vector Space Model framework, which can be customized to implement several weighting schemes. The default weighting scheme is tf-idf.
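
The default behaviour described above can be illustrated with a minimal sketch of tf-idf ranking in a vector space model. It is not the block's actual implementation: the function names (tokenize, rank_tfidf), the smoothed idf formula, the cosine similarity, and the choice to drop objects with a zero score are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercasing and whitespace splitting stand in for the
    # block's configurable tokenization settings.
    return text.lower().split()


def rank_tfidf(source, qterms):
    """Rank (obj, string) pairs against query keywords, best match first."""
    docs = [(obj, Counter(tokenize(s))) for obj, s in source]
    n_docs = len(docs)

    # Document frequency: the number of strings that contain each term.
    df = Counter()
    for _, tf in docs:
        df.update(tf.keys())

    # Smoothed idf (an assumption), so that a term occurring in every
    # string still gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    def vectorize(tf):
        # Raw term frequency times idf; terms unseen in SOURCE are skipped.
        return {t: c * idf[t] for t, c in tf.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vectorize(Counter(tok for q in qterms for tok in tokenize(q)))
    scored = [(cosine(q_vec, vectorize(tf)), obj) for obj, tf in docs]
    # Assumption: objects that share no term with the query are dropped.
    scored = [(score, obj) for score, obj in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [obj for _, obj in scored]


source = [("a", "red apple pie"), ("b", "green apple"), ("c", "blue sky")]
print(rank_tfidf(source, ["apple", "pie"]))  # prints ['a', 'b']
```

Object "a" ranks first because it matches both keywords, and "c" is absent because it shares no term with the query.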

Inputs

  • SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block
  • QTERMS [STRING]: a list of keywords to rank SOURCE objects against

Outputs

  • RETRIEVE [OBJ]: a list of ranked objects

Parameters

Note that not all parameter combinations are expected to work well. Also, some methods inherently normalize scores, while others do not.

  • Stemming: tokens can be stemmed for a specific language or left as they are
  • Case-sensitive: if set to false, upper/lower case is ignored
  • Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
  • Tokenization: the method used to split the input strings into tokens
    • None: perform no tokenization
    • Spaces: split on all valid Unicode space characters
    • Spaces/Punctuation: as Spaces, plus all valid Unicode punctuation characters
    • Spaces/Punctuation/Digits: as Spaces/Punctuation, plus all valid Unicode digit characters
    • Spaces/Punctuation/Digits/Symbols: as Spaces/Punctuation/Digits, plus all valid Unicode symbol characters
    • Custom Regular Expression: split on any user-supplied regular expression
  • Min token length: tokens whose character length is shorter than this value are discarded
  • Gram type:
    • Word (default): each token is composed of UTF-8 word n-grams
    • Character: each token is composed of UTF-8 character n-grams
  • Grams: the size n of the extracted n-gram tokens (default is 1)
  • All query terms must match: if set to true, a string in SOURCE is considered a match only if it matches all tokens in QTERMS
  • Document TF: term frequency weight for documents in SOURCE
    • BNRY: binary, only encodes term occurrence, ignoring the number of occurrences
    • FREQ: frequency, encodes term frequency (number of occurrences)
    • LOGA: logarithmic (aka log normalization)
    • LOGN: normalized logarithmic (aka average log normalization)
    • ANTF05: augmented normalized (aka double normalization 0.5)
    • BM25: Okapi BM-25 term frequency
      • k1: controls non-linear term frequency normalization (saturation). Lower values saturate sooner, so each additional occurrence of a term contributes less to the score
      • b: degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
  • Document IDF: inverse document frequency weight for documents in SOURCE
  • Document normalization:
  • Query TF: term frequency weight for the query terms in QTERMS
    • (same options as for Document TF)
  • Query IDF: inverse document frequency weight for the query terms in QTERMS
    • (same options as for Document IDF)
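
The preprocessing parameters (Case-sensitive, Normalize diacritics, Tokenization, Min token length, Gram type, Grams) can be pictured as a small pipeline. The sketch below is only an approximation under stated assumptions: all function and argument names are hypothetical, the regular expression roughly imitates a Spaces/Punctuation/Symbols split, and stripping combining marks after NFKD decomposition is a simplification of true transliteration (it does not handle characters such as "ø" or "ß"). Stemming is left out.

```python
import re
import unicodedata


def strip_diacritics(text):
    # NFKD decomposition separates base letters from combining marks,
    # which are then dropped. This only approximates transliteration.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, case_sensitive=False, normalize_diacritics=True,
               split_pattern=r"[\W_]+", min_token_length=1):
    """Turn a string into a list of tokens (hypothetical parameter names)."""
    if not case_sensitive:
        text = text.casefold()
    if normalize_diacritics:
        text = strip_diacritics(text)
    # Split on runs of non-word characters, then drop empty pieces.
    tokens = [t for t in re.split(split_pattern, text) if t]
    # Discard tokens shorter than the minimum length.
    return [t for t in tokens if len(t) >= min_token_length]


def word_ngrams(tokens, n=1):
    # Word n-grams: sliding window of n consecutive tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n=1):
    # Character n-grams: sliding window of n consecutive characters.
    return [token[i:i + n] for i in range(len(token) - n + 1)]


tokens = preprocess("Crème Brûlée, s'il vous plaît!", min_token_length=3)
print(tokens)                  # prints ['creme', 'brulee', 'vous', 'plait']
print(word_ngrams(tokens, 2))  # prints ['creme brulee', 'brulee vous', 'vous plait']
print(char_ngrams("york", 3))  # prints ['yor', 'ork']
```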
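
The TF options listed above correspond to weighting schemes that are well known in the information retrieval literature. The sketch below uses the usual textbook formulas for each name. Whether the block computes them exactly this way is an assumption, as are the function name tf_weight, its argument names, and the default values k1=1.2 and b=0.75 (common conventions, not documented defaults of this block).

```python
import math


def tf_weight(scheme, tf, max_tf=1, avg_tf=1.0,
              doc_len=1, avg_doc_len=1.0, k1=1.2, b=0.75):
    """Weight of a term occurring tf times in a document (textbook formulas).

    max_tf and avg_tf are the maximum and average term frequency within the
    document; doc_len and avg_doc_len are token counts (all hypothetical names).
    """
    if tf <= 0:
        return 0.0
    if scheme == "BNRY":    # binary: occurrence only
        return 1.0
    if scheme == "FREQ":    # raw number of occurrences
        return float(tf)
    if scheme == "LOGA":    # log normalization
        return 1.0 + math.log(tf)
    if scheme == "LOGN":    # average log normalization
        return (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf))
    if scheme == "ANTF05":  # double normalization 0.5
        return 0.5 + 0.5 * tf / max_tf
    if scheme == "BM25":    # Okapi BM25 term frequency component
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        return tf * (k1 + 1.0) / (tf + k1 * length_norm)
    raise ValueError(f"unknown scheme: {scheme}")


# Saturation: with b=0 the BM25 weight approaches k1 + 1 as tf grows,
# and a lower k1 gets close to that ceiling with fewer occurrences.
for k1 in (0.5, 2.0):
    fraction = tf_weight("BM25", 10, k1=k1, b=0.0) / (k1 + 1.0)
    print(k1, round(fraction, 3))
```

With b=1 a document twice as long as the average receives a lower weight for the same term frequency, while with b=0 document length has no effect.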

Output scores can be normalized.