Rank by Text TF-IDF
Description
Ranks objects in SOURCE [OBJ,STRING] according to the relevance score of each STRING with keywords in QUERY [STRING].
The relevance score is computed using a generic Vector Space Model framework, which can be customized to implement several weighting schemes.
The default weighting scheme is tf-idf.
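As an illustration of the default scheme, the following is a minimal Python sketch of tf-idf ranking over object-string pairs. It is not the block's actual implementation: the function names `tokenize` and `rank_tfidf`, the lowercase whitespace tokenization, and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption for this sketch: lowercase and split on whitespace.
    return text.lower().split()


def rank_tfidf(source, qterms):
    """Rank (obj, string) pairs against query keywords with a basic tf-idf score."""
    docs = [(obj, Counter(tokenize(text))) for obj, text in source]
    n_docs = len(docs)

    # Document frequency: number of strings that contain each token.
    df = Counter()
    for _, counts in docs:
        df.update(counts.keys())

    query_tokens = [t for q in qterms for t in tokenize(q)]
    scored = []
    for obj, counts in docs:
        score = sum(
            counts[t] * math.log(n_docs / df[t])
            for t in query_tokens
            if t in counts
        )
        # Objects with a zero score (no match, or only tokens that occur
        # in every string) are left out of the ranking.
        if score > 0:
            scored.append((obj, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


source = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog"),
    ("doc3", "quick quick fox jumps"),
]
ranking = rank_tfidf(source, ["quick fox"])
print([obj for obj, _ in ranking])  # ['doc3', 'doc1']
```

Here `doc3` ranks first because "quick" occurs twice in it, while `doc2` contains none of the query tokens and is dropped.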
Inputs
- SOURCE [OBJ,STRING]: a 2-column input with an object-string pair. Typically obtained with the Extract string block
- QTERMS [STRING]: a list of keywords to rank SOURCE objects against
Outputs
- RETRIEVE [OBJ]: a list of ranked objects
Parameters
Note that not all combinations of parameters are expected to work well. Also, some methods inherently normalize scores, while others do not.
- Stemming: tokens can be stemmed for a specific language or left as they are
- Case-sensitive: if set to false, upper/lower case is ignored
- Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
- Tokenization: the method to tokenize the input strings
  - None: perform no tokenization
  - Spaces: all valid Unicode space characters
  - Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
  - Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
  - Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
  - Custom Regular Expression: any regular expression
- Min token length: tokens whose character length is shorter than this value are discarded
- Gram type:
  - Word (default): each token is composed of UTF-8 word n-grams
  - Character: each token is composed of UTF-8 character n-grams
- Grams: the size n of the n-gram tokens to extract (default is 1)
- All query terms must match: if set to true, only candidates where all tokens in QTERMS match a string in SOURCE are considered a match
- Document TF: term frequency weight for documents in SOURCE
  - BNRY: binary, only encodes term occurrence, ignoring the number of occurrences
  - FREQ: frequency, encodes term frequency (number of occurrences)
  - LOGA: logarithmic (aka log normalization)
  - LOGN: normalized logarithmic (aka average log normalization)
  - ANTF05: augmented normalized (aka double normalization 0.5)
  - BM25: Okapi BM-25 term frequency
    - k1: controls non-linear term frequency normalization (saturation). A lower value means quicker saturation (additional occurrences of a term stop mattering sooner)
    - b: degree of document-length normalization applied. 0 = no normalization, 1 = full normalization
- Document IDF: inverse document frequency weight for documents in SOURCE
  - NONE: unary (constant 1)
  - IDFB: inverse document frequency
  - IDFP: smoothed probabilistic inverse document frequency
  - BM25: Okapi BM-25 inverse document frequency
- Document normalization:
  - NONE: no normalization
  - DL: document-length normalization (longer = smaller prior)
  - PUQN: pivoted unique document length normalization
    - Slope: tunable parameter for PUQN
- Query TF: term frequency weight for documents in QUERY
  - (same options as for Document TF)
- Query IDF: inverse document frequency weight for documents in QUERY
  - (same options as for Document IDF)
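The text-processing parameters above (case sensitivity, diacritics normalization, tokenization, minimum token length, and word or character n-grams) can be sketched in Python as follows. This is only an approximation under stated assumptions: all function names are hypothetical, diacritics are handled by stripping combining marks (which does not transliterate characters such as "ß" or "ø"), and the regular expression splits on any non-word character, which is close to, but not the same as, the Spaces/Punctuation/Symbols options.

```python
import re
import unicodedata


def normalize_diacritics(text):
    # Decompose characters and drop combining marks, e.g. "é" becomes "e".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, case_sensitive=False, min_token_length=1):
    # Assumption: split on runs of non-word characters and underscores.
    if not case_sensitive:
        text = text.lower()
    text = normalize_diacritics(text)
    tokens = re.split(r"[\W_]+", text)
    # Drop empty strings and tokens shorter than the minimum length.
    return [t for t in tokens if t and len(t) >= min_token_length]


def word_ngrams(tokens, n):
    # Word n-grams: sequences of n consecutive tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n):
    # Character n-grams: sequences of n consecutive characters.
    return [token[i:i + n] for i in range(len(token) - n + 1)]


tokens = tokenize("Crème brûlée, à la carte!", min_token_length=2)
print(tokens)                   # ['creme', 'brulee', 'la', 'carte']
print(word_ngrams(tokens, 2))   # ['creme brulee', 'brulee la', 'la carte']
print(char_ngrams("carte", 3))  # ['car', 'art', 'rte']
```

The single-character token "a" is discarded because of the minimum token length of 2.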
Output scores can be normalized.
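To make the term frequency options listed under Parameters more concrete, here is a minimal Python sketch of some of them, using the textbook formulas these names usually refer to. The exact formulas used by the block are not stated in this document, so everything below is an assumption: the function name `tf_weight`, the `log(1 + tf)` form for LOGA, and the default values `k1=1.2` and `b=0.75` (common defaults in the literature, not necessarily the block's). LOGN is omitted.

```python
import math


def tf_weight(scheme, tf, max_tf=1, doc_len=1, avg_doc_len=1.0, k1=1.2, b=0.75):
    """Term frequency weight for a term occurring `tf` times in a document."""
    if tf <= 0:
        return 0.0
    if scheme == "BNRY":
        # Binary: the term occurs or it does not.
        return 1.0
    if scheme == "FREQ":
        # Raw number of occurrences.
        return float(tf)
    if scheme == "LOGA":
        # Log normalization (assumed form: log(1 + tf)).
        return math.log(1 + tf)
    if scheme == "ANTF05":
        # Double normalization 0.5, relative to the most frequent term.
        return 0.5 + 0.5 * tf / max_tf
    if scheme == "BM25":
        # b scales how strongly document length affects the weight;
        # k1 controls how quickly repeated occurrences saturate.
        length_norm = 1 - b + b * doc_len / avg_doc_len
        return tf * (k1 + 1) / (tf + k1 * length_norm)
    raise ValueError(f"unknown scheme: {scheme}")


# With b=0 the document length is ignored; the weight approaches k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_weight("BM25", tf, b=0.0), 3))
```

With `b=0`, the BM25 weight grows with `tf` but never exceeds `k1 + 1`, which is the saturation behaviour that `k1` controls. With `b=1`, a document twice the average length receives a noticeably lower weight for the same `tf`.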