Rank by Text BM25
Description
Ranks objects in SOURCE [OBJ,STRING] according to the relevance score of each STRING with respect to the keywords in QTERMS [STRING].
The relevance is computed using the Okapi BM25 ranking method.
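As a rough illustration of how Okapi BM25 turns term matches into a relevance score, here is a minimal Python sketch. It uses one common smoothed-IDF variant of the formula. The exact variant used by this block is not documented here, so the function name `bm25_scores`, the IDF smoothing, and the default values `k1=1.2` and `b=0.75` are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper: k1 and b defaults are conventional values,
    not necessarily the defaults of this block.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (one common variant; always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b=0 disables it, b=1 applies it fully.
            norm = 1 - b + b * len(d) / avg_len
            # Term-frequency saturation: this factor never exceeds k1 + 1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick quick fox jumps over the lazy dog".split(),
]
scores = bm25_scores(docs, ["quick", "fox"])
# Indices of documents, best match first.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this sketch a document that contains none of the query terms scores zero, and setting `b=0` removes the penalty that longer documents otherwise receive.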
Inputs
SOURCE [OBJ,STRING]
: a 2-column input with an object-string pair. Typically obtained with the Extract string block
QTERMS [STRING]
: a list of keywords to rank SOURCE objects against
Outputs
RETRIEVE [OBJ]
: a list of ranked objects
Parameters
Stemming
: tokens can be stemmed for a specific language or left as they are
Case-sensitive
: if set to false, upper/lower case is ignored
Normalize diacritics
: transliterates non-ASCII characters into their closest ASCII form
Tokenization
: the method to tokenize the input strings.
  None
  : perform no tokenization
  Spaces
  : all valid Unicode space characters
  Spaces/Punctuation
  : Spaces + all valid Unicode punctuation characters
  Spaces/Punctuation/Digits
  : Spaces/Punctuation + all valid Unicode digit characters
  Spaces/Punctuation/Digits/Symbols
  : Spaces/Punctuation/Digits + all valid Unicode symbol characters
  Custom Regular Expression
  : any regular expression
Min token length
: tokens whose character length is shorter than this value are discarded
Gram type
: the unit used to build n-grams.
  Word (default)
  : each token is composed of UTF-8 word n-grams
  Character
  : each token is composed of UTF-8 character n-grams
Grams
: the n-gram size used when extracting tokens (default is 1)
All query terms must match
: if set to true, only candidates where all tokens in a QTERMS entry match a string in SOURCE are considered a match (AND logic for terms)
One query per QTERMS row
: if set to true, each row in QTERMS is considered a separate query. The contributions of all queries are summed (OR logic for queries)
k1
: controls non-linear term-frequency normalization (saturation). A lower value means quicker saturation: repeated occurrences of a term stop adding to the score sooner
b
: degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
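
The text-processing parameters above (Case-sensitive, Normalize diacritics, Tokenization, Min token length, Gram type, Grams) can be pictured as a small pipeline. The Python sketch below is only an approximation under stated assumptions: the mapping of tokenization options to Unicode general categories, the NFKD-based diacritic stripping, and building character n-grams within each token are guesses for illustration, and all function names are hypothetical. Stemming and the Custom Regular Expression option are left out.

```python
import unicodedata

# Assumed mapping from Tokenization options to Unicode general-category
# prefixes: Z = separators, P = punctuation, Nd = decimal digits, S = symbols.
MODES = {
    "Spaces": ("Z",),
    "Spaces/Punctuation": ("Z", "P"),
    "Spaces/Punctuation/Digits": ("Z", "P", "Nd"),
    "Spaces/Punctuation/Digits/Symbols": ("Z", "P", "Nd", "S"),
}

def strip_diacritics(text):
    # Decompose characters, then drop combining marks ("è" becomes "e").
    # This approximates transliteration; letters such as "ø" are unchanged.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def tokenize(text, mode="Spaces/Punctuation", case_sensitive=False,
             normalize_diacritics=True, min_length=1):
    if not case_sensitive:
        text = text.casefold()
    if normalize_diacritics:
        text = strip_diacritics(text)
    if mode == "None":
        tokens = [text]
    else:
        prefixes = MODES[mode]
        # Replace every separator character with a plain space, then split.
        # Tabs and newlines have category Cc, so isspace() is checked too.
        cleaned = "".join(
            " " if c.isspace() or unicodedata.category(c).startswith(prefixes)
            else c
            for c in text
        )
        tokens = cleaned.split()
    # Min token length: discard tokens shorter than the threshold.
    return [t for t in tokens if len(t) >= min_length]

def word_ngrams(tokens, n=1):
    # Gram type "Word": each n-gram joins n consecutive tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(tokens, n=1):
    # Gram type "Character": n-grams are taken inside each token (assumption).
    return [t[i:i + n] for t in tokens for i in range(len(t) - n + 1)]

text = "Crème brûlée, 2 servings!"
default_tokens = tokenize(text)
no_digits = tokenize(text, mode="Spaces/Punctuation/Digits")
bigrams = word_ngrams(no_digits, 2)
trigrams = char_ngrams(["quick"], 3)
```

With the sample string, the default settings keep the digit `2` as a token, while the Spaces/Punctuation/Digits mode treats it as a separator and drops it.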
Output scores can be normalized.
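
The interaction between the parameters All query terms must match and One query per QTERMS row can be sketched as a per-query filter followed by a sum over queries. The Python below is a hypothetical illustration, not the block's implementation: `rank` and `toy_score` are invented names, `toy_score` is a simple stand-in for the BM25 score, and the input is assumed to be already tokenized.

```python
def toy_score(tokens, query):
    # Stand-in for BM25: counts occurrences of query terms in the document.
    return float(sum(tokens.count(t) for t in query))

def rank(source, queries, score_fn, all_terms_must_match=False):
    """source: list of (obj, tokens) pairs; queries: list of token lists."""
    totals = {}
    for obj, tokens in source:
        present = set(tokens)
        total = 0.0
        matched = False
        for query in queries:
            hits = [t for t in query if t in present]
            if not hits:
                continue  # this query does not match the object at all
            if all_terms_must_match and len(hits) < len(query):
                continue  # AND logic: every term of the query is required
            # OR logic across queries: contributions are summed.
            total += score_fn(tokens, query)
            matched = True
        if matched:
            totals[obj] = total
    return sorted(totals, key=totals.get, reverse=True)

source = [
    ("doc1", ["quick", "brown", "fox"]),
    ("doc2", ["lazy", "dog"]),
    ("doc3", ["quick", "dog", "quick"]),
]
# Two QTERMS rows, each treated as a separate query.
queries = [["quick", "fox"], ["dog"]]

any_match = rank(source, queries, toy_score)
all_match = rank(source, queries, toy_score, all_terms_must_match=True)
```

In this example `doc3` ranks first when partial matches are allowed, because it collects score from both queries. With the AND option enabled, its partial match on the first query is discarded and `doc1` moves to the top.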