Rank by Text LM

Description

Ranks the objects in SOURCE [OBJ,STRING] by the relevance score of each STRING with respect to the keywords in QTERMS [STRING]. Relevance is computed with the language modeling ranking method. Implemented smoothing variants: Jelinek-Mercer, Dirichlet, Dirichlet parameter-free.

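- Jelinek-Mercer: param `λ`, linear interpolation between the foreground document model and the background collection model
- Dirichlet: param `μ`, equivalent to Jelinek-Mercer with `λ = μ / (μ + |D|)`
- Dirichlet (param-free): `μ = avg(|D|)`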
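
The three smoothing variants can be illustrated with a minimal sketch of smoothed query-likelihood scoring. This is not the block's actual implementation, which is not shown here. The function name `rank_by_lm`, the mode names `"jm"`, `"dirichlet"` and `"dirichlet_pf"`, the use of log-probabilities, and the choice to skip query terms that never occur in the collection are all assumptions made for illustration. Documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def rank_by_lm(docs, query_terms, smoothing="dirichlet", lam=0.5, mu=2000.0):
    """Rank tokenized documents by smoothed query log-likelihood.

    docs: dict mapping an object id to its list of tokens.
    smoothing: "jm", "dirichlet" or "dirichlet_pf" (hypothetical names).
    Returns (ranked object ids, dict of scores).
    """
    if not docs:
        return [], {}

    doc_tf = {obj: Counter(tokens) for obj, tokens in docs.items()}
    doc_len = {obj: len(tokens) for obj, tokens in docs.items()}

    # Background model: term counts over the whole collection.
    coll_tf = Counter()
    for tf in doc_tf.values():
        coll_tf.update(tf)
    coll_len = sum(doc_len.values())

    if smoothing == "dirichlet_pf":
        mu = coll_len / len(docs)  # parameter-free: mu = avg(|D|)

    scores = {}
    for obj, tf in doc_tf.items():
        n = doc_len[obj]
        # Weight of the background model (lambda in Jelinek-Mercer terms).
        if smoothing == "jm":
            bg_weight = lam
        elif mu + n > 0:
            bg_weight = mu / (mu + n)  # Dirichlet as Jelinek-Mercer
        else:
            bg_weight = 1.0

        score = 0.0
        for term in query_terms:
            if coll_tf[term] == 0:
                continue  # assumption: ignore terms unseen in the collection
            p_bg = coll_tf[term] / coll_len
            p_fg = tf[term] / n if n else 0.0
            p = (1.0 - bg_weight) * p_fg + bg_weight * p_bg
            if p <= 0.0:
                score = float("-inf")  # no smoothing and term absent from doc
                break
            score += math.log(p)
        scores[obj] = score

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores
```

Calling `rank_by_lm(docs, ["apple"], smoothing="dirichlet", mu=3.0)` on a document of 3 tokens gives that document the same score as Jelinek-Mercer with `lam=0.5`, since `3 / (3 + 3) = 0.5`. Because `λ` depends on `|D|`, Dirichlet smoothing leans more on the background model for short documents than for long ones.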

Inputs

  • SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block
  • QTERMS [STRING]: a list of keywords to rank SOURCE objects against

Outputs

  • RETRIEVE [OBJ]: a list of ranked objects

Parameters

  • Stemming: tokens can be stemmed for a specific language or left as they are
  • Case-sensitive: if set to false, upper/lower case is ignored
  • Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
  • Tokenization: how the input strings are split into tokens. The predefined options name the character classes used as separators.
    • None: perform no tokenization
    • Spaces: all valid Unicode space characters
    • Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
    • Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
    • Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
    • Custom Regular Expression: any regular expression
  • Min token length: tokens whose character length is shorter than this value are discarded
  • Gram type:
    • Word (default): each token is composed of UTF-8 word n-grams
    • Character: each token is composed of UTF-8 character n-grams
  • Grams: the size n of the extracted n-gram tokens (default is 1)
  • Smoothing: smoothing method
    • Jelinek-Mercer: linear interpolation between foreground document model and background collection model
      • λ: 0 = only foreground, 1 = only background
    • Dirichlet: equivalent to Jelinek-Mercer with λ = μ / (μ + |D|), where |D| is the length of the document
      • μ: collection- and query-specific parameter. 0 = only foreground, 2000 = generic default.
    • Dirichlet (param-free): Dirichlet with μ = avg(|D|), the average document length
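
The text preprocessing parameters above (Normalize diacritics, Case-sensitive, Tokenization, Min token length, Gram type, Grams) can be pictured as a small pipeline. The sketch below is only an approximation under stated assumptions: all function names and mode names are hypothetical, separators are identified through Unicode general categories (`P*` for punctuation, `Nd` for digits, `S*` for symbols), diacritics are removed by NFKD decomposition, and word n-grams are joined with a single space. The block itself may differ on each of these points.

```python
import unicodedata

# Hypothetical mode names mapped to Unicode category prefixes that act as
# separators in addition to whitespace.
SEPARATOR_CATEGORIES = {
    "spaces": (),
    "spaces_punct": ("P",),
    "spaces_punct_digits": ("P", "Nd"),
    "spaces_punct_digits_symbols": ("P", "Nd", "S"),
}


def strip_diacritics(text):
    """Decompose characters and drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, mode="spaces_punct", case_sensitive=False,
             normalize=True, min_len=1):
    """Split text on the separator classes of `mode`, then filter by length."""
    if normalize:
        text = strip_diacritics(text)
    if not case_sensitive:
        text = text.lower()

    prefixes = SEPARATOR_CATEGORIES[mode]
    tokens, current = [], []
    for ch in text:
        # str.startswith with an empty tuple is always False, so the
        # "spaces" mode splits on whitespace only.
        if ch.isspace() or unicodedata.category(ch).startswith(prefixes):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if len(t) >= min_len]


def word_ngrams(tokens, n=1):
    """Sliding window of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(tokens, n=1):
    """Sliding window of n consecutive characters inside each token."""
    return [t[i:i + n] for t in tokens for i in range(len(t) - n + 1)]
```

For example, `tokenize("Café-au-lait, 2 cups!")` yields `["cafe", "au", "lait", "2", "cups"]` under these assumptions, and switching to `mode="spaces_punct_digits"` drops the `"2"`. With Grams set to 1, word n-grams are just the tokens themselves.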

Output scores can be normalized.