Match by BM25
Description
This is a multi-query BM25 block: it takes multiple lists of query keywords instead of a single one.
In effect, it is equivalent to a matching operation.
It finds matches between the STRING-columns of the two inputs by calculating the BM25 relevance score between them.
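As a rough illustration of how BM25 scoring can act as a matching operation, here is a minimal Python sketch. It is not the block's actual implementation. The function name `bm25_match`, the smoothed IDF variant, the defaults `k1=1.2` and `b=0.75`, and the rule that any pair with a score above zero counts as a match are all assumptions made for the example. Inputs are assumed to be already tokenized.

```python
import math
from collections import Counter


def bm25_match(source, qterms, k1=1.2, b=0.75):
    """Return (source_obj, qterms_obj, score) for every pair scoring above 0.

    `source` and `qterms` are lists of (obj, tokens) pairs, where `tokens`
    is an already tokenized list of strings (hypothetical input shape).
    """
    if not source:
        return []
    n_docs = len(source)
    avg_len = sum(len(tokens) for _, tokens in source) / n_docs
    # Document frequency: number of SOURCE rows that contain each token.
    doc_freq = Counter(tok for _, tokens in source for tok in set(tokens))

    results = []
    for q_obj, q_tokens in qterms:
        for s_obj, s_tokens in source:
            term_freq = Counter(s_tokens)
            # Document-length normalization: b=0 disables it, b=1 applies it fully.
            length_norm = 1 - b + b * len(s_tokens) / avg_len
            score = 0.0
            for tok in set(q_tokens):
                tf = term_freq.get(tok, 0)
                if tf == 0:
                    continue
                # Smoothed IDF (the "+ 1" keeps it positive); an assumed variant.
                df = doc_freq[tok]
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
                # k1 controls how quickly repeated occurrences saturate.
                score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
            if score > 0:
                results.append((s_obj, q_obj, score))
    return results


source = [
    ("s1", ["red", "apple", "pie"]),
    ("s2", ["green", "apple"]),
    ("s3", ["blue", "cheese"]),
]
qterms = [("q1", ["apple", "pie"]), ("q2", ["cheese"])]

matches = bm25_match(source, qterms)
for s_obj, q_obj, score in matches:
    print(s_obj, q_obj, round(score, 3))
```

In this toy data, `q1` matches both `s1` and `s2`, with `s1` scoring higher because it contains both query tokens. The score is returned here only for inspection; the documented output of the block contains the matched object pairs.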
Input
Because this was originally a retrieval block, the inputs are named SOURCE / QTERMS rather than A / B as in other matching blocks.
SOURCE [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
QTERMS [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
Output
RESULT [OBJ,OBJ]: the matched objects from SOURCE and QTERMS
Parameters
Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
Tokenization: the method used to tokenize the input strings
    None: perform no tokenization
    Spaces: all valid Unicode space characters
    Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
    Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
    Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
    Custom Regular Expression: any regular expression
Min token length: tokens whose character length is shorter than this value are discarded
Gram type:
    Word (default): each token is composed of UTF-8 word n-grams
    Character: each token is composed of UTF-8 character n-grams
Grams: the size n of the n-gram tokens to extract (default is 1)
All query terms must match: if set to true, only candidates where all tokens in QTERMS match a string in SOURCE are considered a match
k1: controls non-linear term frequency normalization (saturation). A lower value means quicker saturation, so repeated occurrences of a term stop adding to the score sooner
b: degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
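To make the text-processing parameters more concrete, the following Python sketch approximates them with the standard library. It is an illustration under assumptions, not the block's implementation. The function names, the mode strings, the order of the steps, and the use of Unicode general categories (P for punctuation, N for digits, S for symbols) are all hypothetical choices. Stemming and the custom regular expression option are left out, and stripping combining marks is only a partial stand-in for full transliteration.

```python
import unicodedata

# Unicode general-category prefixes treated as separators, per mode (assumed mapping).
SEPARATOR_PREFIXES = {
    "spaces": (),
    "spaces/punctuation": ("P",),
    "spaces/punctuation/digits": ("P", "N"),
    "spaces/punctuation/digits/symbols": ("P", "N", "S"),
}


def strip_diacritics(text):
    """Decompose characters and drop combining marks (e.g. 'è' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, mode="spaces/punctuation"):
    """Split text on whitespace plus the separator classes of the given mode."""
    if mode == "none":
        return [text]
    prefixes = SEPARATOR_PREFIXES[mode]
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith(prefixes):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def ngrams(tokens, n=1, gram_type="word"):
    """Build word n-grams across tokens, or character n-grams within each token."""
    if gram_type == "word":
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)]


def preprocess(text, case_sensitive=False, normalize_diacritics=True,
               mode="spaces/punctuation", min_token_length=1,
               n=1, gram_type="word"):
    """Apply case folding, diacritic removal, tokenization, length filter, n-grams."""
    if not case_sensitive:
        text = text.lower()
    if normalize_diacritics:
        text = strip_diacritics(text)
    tokens = [t for t in tokenize(text, mode) if len(t) >= min_token_length]
    return ngrams(tokens, n, gram_type)


print(preprocess("Crème brûlée, 2 cups!"))
print(preprocess("Crème brûlée, 2 cups!", min_token_length=2, n=2))
print(preprocess("creme", n=3, gram_type="character"))
```

With these settings, the first call yields single-word tokens, the second drops the one-character token `2` before building word bigrams, and the third produces character trigrams of a single token.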