Match by Word Overlap
Description
Finds matches between the STRING-columns of the two inputs by calculating the Jaccard index of their words: the number of words the two strings have in common divided by the number of distinct words that occur in either string.
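As a rough illustration of the score, here is a minimal Python sketch of a word-level Jaccard index. The function name `jaccard_index`, the whitespace tokenization, and the score of 0 for two empty strings are assumptions made for the example; the component's actual tokenization is not documented here.

```python
def jaccard_index(a: str, b: str) -> float:
    """Word-level Jaccard index: |intersection| / |union| of the word sets."""
    # Assumption: words are separated by whitespace.
    words_a = set(a.split())
    words_b = set(b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty strings share nothing, so score 0.
        return 0.0
    return len(words_a & words_b) / len(union)


# "the" and "brown" are shared; six distinct words occur overall.
score = jaccard_index("the quick brown fox", "the lazy brown dog")
print(round(score, 3))  # 0.333
```

A score of 1 means both strings contain exactly the same set of words, and 0 means they share none.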
Input
A [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
B [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
Output
RESULT [OBJ,OBJ]: the matched objects from A and B
NOTA [OBJ]: the objects from A that did not match with an item from B
NOTB [OBJ]: the objects from B that did not match with an item from A
Parameters
Case-sensitive: if set to false, upper/lower case is ignored
Consider term frequency: whether the number of occurrences of a term in a text should influence the score
Exclude self-matches: whether to suppress a match when the object in A and the object in B are the same. Mostly useful when A and B come from the same source
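The following Python sketch shows one way these parameters could interact with the inputs and outputs. It is an illustration under stated assumptions, not the component's implementation: the names `overlap_score` and `match`, the whitespace tokenization, the multiset (min/max count) reading of "Consider term frequency", and the `min_score` threshold are all hypothetical, since the page does not say how a score is turned into a match.

```python
from collections import Counter


def overlap_score(a, b, case_sensitive=True, consider_term_frequency=False):
    """Jaccard-style word overlap of two strings."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    if not consider_term_frequency:
        # Collapse every word count to 1, giving the plain set-based index.
        counts_a, counts_b = Counter(set(counts_a)), Counter(set(counts_b))
    # Assumption: with term frequency, use the multiset form
    # (sum of minimum counts / sum of maximum counts).
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 0.0


def match(a_rows, b_rows, min_score=0.5, exclude_self_matches=False, **options):
    """Return (RESULT, NOTA, NOTB) for lists of (obj, string) rows.

    min_score is a hypothetical threshold, not a documented parameter.
    """
    result, matched_a, matched_b = [], set(), set()
    for obj_a, text_a in a_rows:
        for obj_b, text_b in b_rows:
            if exclude_self_matches and obj_a == obj_b:
                continue  # skip pairs where both sides are the same object
            if overlap_score(text_a, text_b, **options) >= min_score:
                result.append((obj_a, obj_b))
                matched_a.add(obj_a)
                matched_b.add(obj_b)
    nota = [obj for obj, _ in a_rows if obj not in matched_a]
    notb = [obj for obj, _ in b_rows if obj not in matched_b]
    return result, nota, notb


A = [("a1", "Red Apple Pie"), ("a2", "green pear")]
B = [("b1", "red apple pie recipe"), ("b2", "blue cheese")]

print(match(A, B, case_sensitive=False))
# ([('a1', 'b1')], ['a2'], ['b2'])
print(match(A, B, case_sensitive=True))
# ([], ['a1', 'a2'], ['b1', 'b2'])
```

With case ignored, "Red Apple Pie" and "red apple pie recipe" share three of four distinct words (0.75) and match. With case respected, no word is identical, so every object ends up in NOTA or NOTB.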