Rank by Text LM
Description
Ranks the objects in SOURCE [OBJ,STRING] by the relevance score of each STRING with respect to the keywords in QUERY [STRING].
Relevance is computed with the language-modelling ranking method.
Smoothing variants implemented: Jelinek-Mercer, Dirichlet, Dirichlet parameter-free.
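- Jelinek-Mercer: param λ, linear interpolation between the foreground document model and the background collection model
- Dirichlet: param μ, equivalent to Jelinek-Mercer with `λ = μ / (μ + |D|)`
- Dirichlet (param-free): `μ = avg(|D|)`
Inputs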
- SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block
- QTERMS [STRING]: a list of keywords to rank SOURCE objects against
Outputs
- RETRIEVE [OBJ]: a list of ranked objects
Parameters
- Stemming: tokens can be stemmed for a specific language or left as they are
- Case-sensitive: if set to `false`, upper/lower case is ignored
- Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
- Tokenization: the method used to tokenize the input strings
  - None: perform no tokenization
  - Spaces: all valid Unicode space characters
  - Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  - Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  - Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  - Custom Regular Expression: any regular expression
- Min token length: tokens whose character length is shorter than this value are discarded
- Gram type:
  - Word (default): each token is composed of UTF-8 word n-grams
  - Character: each token is composed of UTF-8 character n-grams
- Grams: the size n of the n-gram tokens to extract (default is 1)
- Smoothing: smoothing method
  - Jelinek-Mercer: linear interpolation between the foreground document model and the background collection model
    - λ: `0` = only foreground, `1` = only background
  - Dirichlet: equivalent to Jelinek-Mercer where `λ = μ / (μ + |D|)`
    - μ: collection- and query-specific parameter. `0` = only foreground, `2000` = generic default.
  - Dirichlet (param-free): Dirichlet with `μ = avg(|D|)`
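To make the text-processing parameters more concrete, the following Python sketch approximates the Spaces/Punctuation tokenization together with the Case-sensitive, Normalize diacritics, Min token length and Grams (word n-gram) options. It is an illustration only, not the block's actual implementation: the function names, the use of Unicode general categories to detect punctuation, and the choice to apply the length filter before building n-grams are all assumptions. Stemming is left out.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Approximate 'Normalize diacritics': decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_separator(ch: str) -> bool:
    """Assumed reading of 'Spaces/Punctuation': whitespace or category P*."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def tokenize(text, case_sensitive=False, normalize_diacritics=True,
             min_token_length=1, grams=1):
    """Hypothetical helper mirroring a subset of the block's parameters."""
    if normalize_diacritics:
        text = strip_diacritics(text)
    if not case_sensitive:
        text = text.lower()
    # Turn every separator into a space, then split on whitespace.
    words = "".join(" " if is_separator(ch) else ch for ch in text).split()
    # Assumption: short tokens are discarded before n-grams are built.
    words = [w for w in words if len(w) >= min_token_length]
    # Word n-grams: each token is a run of `grams` consecutive words.
    return [" ".join(words[i:i + grams]) for i in range(len(words) - grams + 1)]


print(tokenize("Café au-lait, s'il vous plaît!", min_token_length=2))
print(tokenize("a b c", grams=2))
```

With `grams=1` the function returns plain word tokens; larger values yield overlapping word n-grams, and an input with fewer words than `grams` yields an empty list.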
Output scores can be normalized.
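The relationship between the three smoothing variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the block's actual code: documents are assumed to be already tokenized, the score is assumed to be the sum of log-probabilities of the query terms, and the names `lm_score` and `rank` are hypothetical. Output normalization is not shown.

```python
import math
from collections import Counter


def lm_score(query, doc, coll_tf, coll_len, lam):
    """Log-likelihood of the query; `lam` is the background weight
    (0 = only foreground, 1 = only background)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen and no smoothing mass
        score += math.log(p)
    return score


def rank(docs, query, smoothing="dirichlet", lam=0.5, mu=2000.0):
    """docs maps an object id to its token list; returns ids, best first."""
    coll_tf = Counter(t for d in docs.values() for t in d)
    coll_len = sum(len(d) for d in docs.values())
    if smoothing == "dirichlet-param-free":
        mu = coll_len / len(docs)  # mu = avg(|D|)
    scores = {}
    for obj, doc in docs.items():
        if smoothing == "jelinek-mercer":
            weight = lam
        else:
            # Dirichlet as Jelinek-Mercer with lambda = mu / (mu + |D|)
            total = mu + len(doc)
            weight = mu / total if total > 0 else 0.0
        scores[obj] = lm_score(query, doc, coll_tf, coll_len, weight)
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": ["cat", "sat", "mat"],
    "b": ["dog", "sat", "log", "fog"],
    "c": ["cat", "cat", "hat"],
}
print(rank(docs, ["cat"], smoothing="jelinek-mercer"))  # ['c', 'a', 'b']
```

Because Dirichlet smoothing is expressed here as a document-dependent λ, longer documents rely less on the background collection model than shorter ones, whereas Jelinek-Mercer applies the same λ to every document.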