Rank by Text TF-IDF
Description
Ranks objects in SOURCE [OBJ,STRING] according to the relevance score of each STRING with keywords in QUERY [STRING].
The relevance score is computed using a generic Vector Space Model framework, which can be customized to implement several weighting schemes.
The default weighting scheme is tf-idf.
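As a rough illustration of the default scheme, the sketch below ranks object-string pairs by the summed tf-idf weight of the query terms they contain. It is not the block's implementation: the function name `rank_tfidf`, the lowercase whitespace tokenization, the unsmoothed `log(N / df)` idf, and the choice to drop objects with a zero score are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_tfidf(source, qterms):
    """Rank (obj, string) pairs by the summed tf-idf weight of the query terms.

    Hypothetical helper for illustration; tokenization and weighting are
    simplified assumptions, not the block's actual behaviour.
    """
    # Term counts per object, using lowercase whitespace tokens.
    docs = [(obj, Counter(text.lower().split())) for obj, text in source]
    n_docs = len(docs)

    # Document frequency: number of strings that contain each term.
    df = Counter()
    for _, counts in docs:
        df.update(counts.keys())

    query = [t.lower() for t in qterms]
    scored = []
    for obj, counts in docs:
        # Raw term frequency times unsmoothed idf, summed over matching terms.
        score = sum(
            counts[t] * math.log(n_docs / df[t]) for t in query if counts[t] > 0
        )
        if score > 0:  # assumption: objects without any relevance are dropped
            scored.append((obj, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


source = [
    ("a", "red apple pie"),
    ("b", "green apple"),
    ("c", "red red wine"),
]
ranking = rank_tfidf(source, ["red", "wine"])
print([obj for obj, _ in ranking])  # ['c', 'a']
```

Object `c` ranks first because it contains both query terms and repeats one of them; `b` contains neither and is left out.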
Inputs
- SOURCE [OBJ,STRING]: a 2-column input with an object-string pair. Typically obtained with the Extract string block
- QTERMS [STRING]: a list of keywords to rank SOURCE objects against
Outputs
RETRIEVE [OBJ]: a list of ranked objects
Parameters
Note that not all parameter combinations are expected to work well. Also, some methods inherently normalize scores, while others do not.
- Stemming: tokens can be stemmed for a specific language or left as they are
- Case-sensitive: if set to false, upper/lower case is ignored
- Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
- Tokenization: the method used to tokenize the input strings
  - None: perform no tokenization
  - Spaces: all valid Unicode space characters
  - Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  - Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  - Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  - Custom Regular Expression: any regular expression
- Min token length: tokens whose character length is shorter than this value are discarded
- Gram type:
  - Word (default): each token is composed of UTF-8 word n-grams
  - Character: each token is composed of UTF-8 character n-grams
- Grams: the n-gram size used to extract tokens (default is 1)
- All query terms must match: if set to true, only candidates where all tokens in QTERMS match a string in SOURCE are considered a match
- Document TF: term frequency weight for documents in SOURCE
  - BNRY: binary, only encodes term occurrence, ignoring the number of occurrences
  - FREQ: frequency, encodes term frequency (number of occurrences)
  - LOGA: logarithmic (aka log normalization)
  - LOGN: normalized logarithmic (aka average log normalization)
  - ANTF05: augmented normalized (aka double normalization 0.5)
  - BM25: Okapi BM-25 term frequency
    - k1: controls non-linear term frequency normalization (saturation). A lower value means quicker saturation (additional occurrences of a term stop mattering sooner)
    - b: degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
- Document IDF: inverse document frequency weight for documents in SOURCE
  - NONE: unary (constant 1)
  - IDFB: inverse document frequency
  - IDFP: smoothed probabilistic inverse document frequency
  - BM25: Okapi BM-25 inverse document frequency
- Document normalization:
  - NONE: no normalization
  - DL: document-length normalization (longer = smaller prior)
  - PUQN: pivoted unique document length normalization
    - Slope: tunable parameter for PUQN
- Query TF: term frequency weight for documents in QUERY
  - (same options as for Document TF)
- Query IDF: inverse document frequency weight for documents in QUERY
  - (same options as for Document IDF)
Output scores can be normalized.
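To make the TF and IDF options more concrete, the sketch below implements common textbook definitions for several of them. The formulas are assumptions: this page does not state the exact expressions the block uses, so the block's results may differ. The function names `tf_weight` and `idf_weight` and the defaults `k1=1.2` and `b=0.75` are hypothetical, and LOGN and IDFP are left out because their definitions vary between sources.

```python
import math


def tf_weight(scheme, tf, max_tf=1, doc_len=1, avg_doc_len=1.0, k1=1.2, b=0.75):
    """Textbook term-frequency weights (assumed; the block may differ)."""
    if tf <= 0:
        return 0.0
    if scheme == "BNRY":    # occurrence only
        return 1.0
    if scheme == "FREQ":    # raw number of occurrences
        return float(tf)
    if scheme == "LOGA":    # logarithmic damping
        return 1.0 + math.log(tf)
    if scheme == "ANTF05":  # scaled by the most frequent term in the document
        return 0.5 + 0.5 * tf / max_tf
    if scheme == "BM25":    # saturating, with document-length normalization
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        return tf * (k1 + 1.0) / (tf + k1 * length_norm)
    raise ValueError(f"unsupported TF scheme: {scheme}")


def idf_weight(scheme, n_docs, df):
    """Textbook inverse-document-frequency weights (assumed; the block may differ)."""
    if scheme == "NONE":    # constant 1
        return 1.0
    if scheme == "IDFB":    # plain inverse document frequency
        return math.log(n_docs / df)
    if scheme == "BM25":    # non-negative BM25 variant
        return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    raise ValueError(f"unsupported IDF scheme: {scheme}")


# Effect of k1 with length normalization switched off (b=0):
# the weight approaches its ceiling of k1 + 1 faster when k1 is small.
for k1 in (0.5, 2.0):
    weights = [round(tf_weight("BM25", tf, k1=k1, b=0.0), 3) for tf in (1, 2, 5, 10)]
    print(k1, weights)
```

With `b=0` the document length has no effect, which isolates the saturation behaviour that `k1` controls; setting `b=1` makes the weight shrink for documents longer than average.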