Rank by Text LM
Description
Ranks objects in SOURCE [OBJ,STRING] according to the relevance score of each STRING with keywords in QUERY [STRING].
Relevance is computed using the language modelling ranking method.
Smoothing variants implemented:
- Jelinek-Mercer: param λ
- Dirichlet: param μ, equivalent to Jelinek-Mercer with `λ = μ / (μ + |D|)`
- Dirichlet (param-free): `μ = avg(|D|)`
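
The equivalence between the Dirichlet and Jelinek-Mercer variants can be checked in a few lines. The notation below is an assumption, since this page does not spell out the formulas: `tf(t,D)` is the count of term `t` in document `D`, `|D|` is the document length, and `P(t|C)` is the background collection model. λ is taken as the weight of the background model, which matches the parameter description (0 = only foreground).

```latex
\begin{align*}
P_{\mathrm{JM}}(t \mid D)  &= (1-\lambda)\,\frac{\mathrm{tf}(t,D)}{|D|} + \lambda\,P(t \mid C) \\
P_{\mathrm{Dir}}(t \mid D) &= \frac{\mathrm{tf}(t,D) + \mu\,P(t \mid C)}{|D| + \mu} \\
                           &= \frac{|D|}{|D|+\mu}\cdot\frac{\mathrm{tf}(t,D)}{|D|}
                              + \frac{\mu}{|D|+\mu}\,P(t \mid C)
\end{align*}
```

Setting `λ = μ / (μ + |D|)` gives `1 − λ = |D| / (μ + |D|)`, so the last line is the Jelinek-Mercer form with a document-dependent λ: short documents lean more on the background model than long ones.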
Inputs
SOURCE [OBJ,STRING]
: a 2-column input with an object-string pair. Typically obtained with the Extract string block
QTERMS [STRING]
: a list of keywords to rank SOURCE objects against
Outputs
RETRIEVE [OBJ]
: a list of ranked objects
Parameters
Stemming
: tokens can be stemmed for a specific language or left as they are
Case-sensitive
: if set to `false`, upper/lower case is ignored
Normalize diacritics
: transliterates non-ASCII characters into their closest ASCII form
Tokenization
: the method to tokenize the input strings.
  - None: perform no tokenization
  - Spaces: all valid Unicode space characters
  - Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  - Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  - Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  - Custom Regular Expression: any regular expression
Min token length
: tokens whose character length is shorter than this value are discarded
Gram type
: the kind of n-grams that tokens are built from.
  - Word (default): each token is composed of UTF-8 word n-grams
  - Character: each token is composed of UTF-8 character n-grams
Grams
: allows extracting n-gram tokens (default is 1)
Smoothing
: smoothing method
  - Jelinek-Mercer: linear interpolation between the foreground document model and the background collection model
    - λ: 0 = only foreground, 1 = only background
  - Dirichlet: equivalent to Jelinek-Mercer where `λ = μ / (μ + |D|)`
    - μ: collection- and query-specific parameter. 0 = only foreground, 2000 = generic default.
  - Dirichlet (param-free): Dirichlet with `μ = avg(|D|)`
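
The three smoothing options can be illustrated with a minimal Python sketch of query-likelihood ranking. This is not the block's actual implementation: the function name `rank_by_text_lm`, its argument names, the use of summed log-probabilities as the score, and the choice to skip query terms that never occur in the collection are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_by_text_lm(source, qterms, smoothing="Dirichlet", lam=0.5, mu=2000.0):
    """Rank objects by the smoothed log-likelihood of the query terms.

    source: list of (obj, tokens) pairs, with tokens already tokenized
    qterms: list of query tokens
    Returns a list of (obj, score) pairs, best first.
    """
    if not source:
        return []
    docs = [(obj, Counter(tokens), len(tokens)) for obj, tokens in source]

    # Background model: term counts over the whole collection.
    collection = Counter()
    for _, counts, _ in docs:
        collection.update(counts)
    collection_len = sum(collection.values())

    if smoothing == "Dirichlet (param-free)":
        mu = collection_len / len(docs)  # mu = avg(|D|)

    ranked = []
    for obj, counts, doc_len in docs:
        if smoothing == "Jelinek-Mercer":
            weight = lam  # fixed background weight
        else:
            # Dirichlet expressed as Jelinek-Mercer with a per-document weight.
            weight = mu / (mu + doc_len) if (mu + doc_len) > 0 else 1.0
        score = 0.0
        for term in qterms:
            p_bg = collection[term] / collection_len if collection_len else 0.0
            if p_bg == 0.0:
                continue  # assumption: terms unseen in the collection are skipped
            p_fg = counts[term] / doc_len if doc_len else 0.0
            p = (1.0 - weight) * p_fg + weight * p_bg
            score += math.log(p) if p > 0.0 else float("-inf")
        ranked.append((obj, score))

    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

In this sketch, calling `rank_by_text_lm(source, ["apple"], smoothing="Jelinek-Mercer", lam=0.5)` returns the objects ordered by score; dropping the scores from the pairs gives a ranked object list of the same shape as RETRIEVE [OBJ].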
Output scores can be normalized.
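
The text-processing parameters (Case-sensitive, Normalize diacritics, Tokenization, Min token length, Gram type, Grams) can be pictured with the rough Python sketch below. It is only an approximation under stated assumptions: the helper names `normalize`, `tokenize`, and `ngrams` are hypothetical, Unicode general categories (`P`, `Nd`, `S`) stand in for "punctuation", "digit", and "symbol" characters, diacritics are removed via NFKD decomposition rather than full transliteration, and character n-grams are taken within each token. Stemming and custom regular expressions are left out.

```python
import unicodedata

# Assumed mapping from Tokenization options to the extra Unicode
# general-category prefixes treated as separators (spaces always split).
EXTRA_SEPARATORS = {
    "Spaces": (),
    "Spaces/Punctuation": ("P",),
    "Spaces/Punctuation/Digits": ("P", "Nd"),
    "Spaces/Punctuation/Digits/Symbols": ("P", "Nd", "S"),
}


def normalize(text, case_sensitive=False, normalize_diacritics=True):
    """Optionally strip diacritics and fold case."""
    if normalize_diacritics:
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text if case_sensitive else text.lower()


def tokenize(text, mode="Spaces", min_token_length=1):
    """Split text on the separator classes selected by mode."""
    if mode == "None":
        tokens = [text]  # no tokenization: the whole string is one token
    else:
        extra = EXTRA_SEPARATORS[mode]
        cleaned = "".join(
            " " if ch.isspace() or unicodedata.category(ch).startswith(extra) else ch
            for ch in text
        )
        tokens = cleaned.split()
    # Discard tokens shorter than the minimum length.
    return [t for t in tokens if len(t) >= min_token_length]


def ngrams(tokens, n=1, gram_type="Word"):
    """Build word n-grams across tokens, or character n-grams within tokens."""
    if gram_type == "Word":
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)]
```

For instance, under these assumptions `tokenize("Hello, world! 42", "Spaces/Punctuation")` splits off the comma and exclamation mark, while the `"Spaces"` mode keeps them attached to the neighbouring words.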