Match by Word Overlap
Description
Finds matches between the STRING-columns of the two inputs by calculating the Jaccard index of each pair of strings: the number of words in their intersection divided by the number of words in their union.
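The score can be illustrated with a minimal Python sketch. It is not the component's actual implementation: the function name `jaccard_index`, the whitespace tokenization, and the score of 0 for two empty strings are all assumptions made for illustration.

```python
def jaccard_index(text_a: str, text_b: str) -> float:
    """Jaccard index over the sets of words in two strings.

    Assumption: words are separated by whitespace; the real
    component may tokenize differently.
    """
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty strings have no overlap to measure.
        return 0.0
    return len(words_a & words_b) / len(union)


# "the" and "brown" are shared (2 words); 6 distinct words occur overall.
score = jaccard_index("the quick brown fox", "the lazy brown dog")
print(round(score, 3))  # 0.333
```

Identical word sets score 1, and strings with no word in common score 0.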
Input
A [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
B [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
Output
RESULT [OBJ,OBJ]: the matched objects from A and B
NOTA [OBJ]: the objects from A that did not match with an item from B
NOTB [OBJ]: the objects from B that did not match with an item from A
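How the two inputs map onto the three outputs might be sketched as follows. This is a hypothetical illustration only: the function name `match_by_word_overlap`, the pairwise comparison of every row, and in particular the `min_score` threshold are assumptions, since this page does not state when a score counts as a match.

```python
from typing import Hashable, List, Tuple

Row = Tuple[Hashable, str]  # one (OBJ, STRING) row


def _jaccard(text_a: str, text_b: str) -> float:
    # Assumption: whitespace-separated words.
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def match_by_word_overlap(
    a_rows: List[Row],
    b_rows: List[Row],
    min_score: float = 0.5,  # hypothetical threshold, not documented here
):
    result = []                        # RESULT [OBJ,OBJ]
    matched_a, matched_b = set(), set()
    for obj_a, text_a in a_rows:
        for obj_b, text_b in b_rows:
            if _jaccard(text_a, text_b) >= min_score:
                result.append((obj_a, obj_b))
                matched_a.add(obj_a)
                matched_b.add(obj_b)
    # NOTA / NOTB: objects that took part in no match at all.
    not_a = [obj for obj, _ in a_rows if obj not in matched_a]
    not_b = [obj for obj, _ in b_rows if obj not in matched_b]
    return result, not_a, not_b


a = [("a1", "red apple pie"), ("a2", "green tea")]
b = [("b1", "apple pie red"), ("b2", "black coffee")]
result, not_a, not_b = match_by_word_overlap(a, b)
print(result, not_a, not_b)  # [('a1', 'b1')] ['a2'] ['b2']
```

Word order does not affect the score, so "red apple pie" and "apple pie red" match in full, while the remaining rows share no words and end up in NOTA and NOTB.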
Parameters
Case-sensitive: if set to false, upper/lower case is ignored
Consider term frequency: whether the number of occurrences of a term in a text should influence the score
Exclude self-matches: whether to suppress a match when the objects from A and B are the same. Mostly useful when A and B come from the same source
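The effect of the three parameters can be sketched in Python. The names `overlap_score` and `keep_pair` are hypothetical, and treating term frequency as a multiset Jaccard index (sum of minimum counts divided by sum of maximum counts) is one plausible interpretation, not a confirmed description of the component.

```python
from collections import Counter
from typing import Hashable


def overlap_score(
    text_a: str,
    text_b: str,
    case_sensitive: bool = False,
    term_frequency: bool = False,
) -> float:
    if not case_sensitive:
        text_a, text_b = text_a.lower(), text_b.lower()
    # Assumption: whitespace-separated words.
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    if not term_frequency:
        # Collapse every count to 1 so repeated words are ignored.
        counts_a = Counter(set(counts_a))
        counts_b = Counter(set(counts_b))
    # Counter "&" keeps the minimum count, "|" the maximum count.
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 0.0


def keep_pair(obj_a: Hashable, obj_b: Hashable, exclude_self_matches: bool) -> bool:
    """Return False for a pair that should be dropped as a self-match."""
    return not (exclude_self_matches and obj_a == obj_b)


print(overlap_score("Apple pie", "apple PIE"))                       # 1.0
print(overlap_score("Apple pie", "apple PIE", case_sensitive=True))  # 0.0
print(overlap_score("tea tea milk", "tea milk"))                     # 1.0
print(keep_pair("x1", "x1", exclude_self_matches=True))              # False
```

With `term_frequency=True`, the pair "tea tea milk" and "tea milk" scores 2/3 instead of 1, because the second "tea" has no counterpart in the other string.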