Match by BM25
Description
This is a multi-query BM25 block: it takes multiple lists of query keywords instead of a single one.
It is in effect equivalent to a matching operation.
It finds matches between the STRING-columns in the inputs by calculating the BM25 relevance score.
Input
Because this was originally a retrieval block, the notation SOURCE / QTERMS is used instead of the A / B notation of other matching blocks.
SOURCE [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
QTERMS [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
Output
RESULT [OBJ,OBJ]: the matched objects from SOURCE and QTERMS
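As a rough illustration of how the two inputs relate to the output, the sketch below treats every QTERMS row as a query and pairs it with each SOURCE row that scores above zero. The names `match_rows` and `token_overlap` are hypothetical, the overlap score is only a stand-in for the BM25 score, and the rule "score greater than zero means match" is an assumption, since the block's exact acceptance criterion is not stated here.

```python
from typing import Callable, Hashable, List, Tuple

Row = Tuple[Hashable, str]  # one (OBJ, STRING) row


def token_overlap(query: str, doc: str) -> float:
    """Stand-in relevance score: count of shared lowercase whitespace tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def match_rows(source: List[Row], qterms: List[Row],
               score: Callable[[str, str], float] = token_overlap
               ) -> List[Tuple[Hashable, Hashable]]:
    """Return (SOURCE obj, QTERMS obj) pairs whose score is positive (assumed rule)."""
    result = []
    for q_obj, q_text in qterms:          # each QTERMS row acts as one query
        for s_obj, s_text in source:      # compared against every SOURCE row
            if score(q_text, s_text) > 0:
                result.append((s_obj, q_obj))
    return result
```

Swapping `token_overlap` for a BM25 scorer would not change the shape of the result: one [OBJ,OBJ] pair per matched SOURCE/QTERMS combination.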
Parameters
Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
Tokenization: the method to tokenize the input strings.
  None: perform no tokenization
  Spaces: all valid Unicode space characters
  Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  Custom Regular Expression: any regular expression
Min token length: tokens whose character length is shorter than this value are discarded
Gram type:
  Word (default): each token is composed of UTF-8 word n-grams
  Character: each token is composed of UTF-8 character n-grams
Grams: the size n of the n-gram tokens to extract (default is 1)
All query terms must match: if set to true, only candidates where all tokens in QTERMS match a string in SOURCE are considered a match
k1: controls non-linear term frequency normalization (saturation). A lower value means quicker saturation (additional occurrences of a term stop mattering sooner)
b: the degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
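The parameters above can be read against a minimal Python sketch of preprocessing plus BM25 scoring. Everything in it is an assumption made for illustration rather than the block's actual implementation: the function names `tokenize` and `bm25_scores`, the defaults `k1=1.2` and `b=0.75` (common textbook values, not taken from this page), the non-negative IDF variant, and the simplified tokenizer, which splits on any run of non-alphanumeric characters and only strips combining accents instead of doing full transliteration.

```python
import math
import re
import unicodedata
from collections import Counter
from typing import List


def tokenize(text: str, case_sensitive: bool = False,
             normalize_diacritics: bool = True,
             min_token_length: int = 1, grams: int = 1) -> List[str]:
    """Hypothetical preprocessing: split into words, then build word n-grams."""
    if not case_sensitive:
        text = text.lower()
    if normalize_diacritics:
        # Decompose accented characters and drop the combining marks.
        text = "".join(c for c in unicodedata.normalize("NFKD", text)
                       if not unicodedata.combining(c))
    # Approximation of a spaces/punctuation split: any non-alphanumeric run.
    words = [w for w in re.split(r"[\W_]+", text) if len(w) >= min_token_length]
    return [" ".join(words[i:i + grams]) for i in range(len(words) - grams + 1)]


def bm25_scores(docs: List[List[str]], query: List[str], k1: float = 1.2,
                b: float = 0.75, require_all: bool = False) -> List[float]:
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        if require_all and any(t not in tf for t in query):
            scores.append(0.0)  # "All query terms must match" behaviour
            continue
        # b=0 ignores document length, b=1 normalizes by it fully.
        norm = 1.0 - b + b * (len(d) / avg_len) if avg_len else 1.0
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Lower k1 makes this fraction saturate sooner as tf grows.
            score += idf * tf[t] * (k1 + 1.0) / (tf[t] + k1 * norm)
        scores.append(score)
    return scores
```

In this sketch, each SOURCE string would be tokenized once into `docs`, and each QTERMS string would be tokenized and scored as a separate `query`.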