Match by Word Overlap

Description

Finds matches between the STRING-columns of the two inputs by calculating the Jaccard index of their words: the number of words the two strings have in common divided by the number of distinct words that occur in either string.
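
As a rough illustration of the score described above, the following Python sketch computes the Jaccard index of two strings. The function name jaccard_index, the whitespace tokenization, and the score of 0.0 for two empty strings are assumptions made for this example; how the operator itself splits text into words is not specified here.

```python
def jaccard_index(a: str, b: str) -> float:
    """Jaccard index of the word sets of a and b.

    Words are obtained by splitting on whitespace (an assumption).
    """
    words_a = set(a.split())
    words_b = set(b.split())
    union = words_a | words_b
    if not union:
        # Neither string contains a word: treat the score as 0.0 (an assumption).
        return 0.0
    return len(words_a & words_b) / len(union)


# "apple" and "pie" are shared; "red", "green", "apple", "pie" are the distinct words.
print(jaccard_index("red apple pie", "green apple pie"))  # 2 / 4 = 0.5
```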

Input

  • A [OBJ,STRING]: a list of candidates; the STRING-column is used for comparison and the OBJ-column is returned in the output
  • B [OBJ,STRING]: a list of candidates; the STRING-column is used for comparison and the OBJ-column is returned in the output

Output

  • RESULT [OBJ,OBJ]: the matched objects from A and B
  • NOTA [OBJ]: the objects from A that did not match with an item from B
  • NOTB [OBJ]: the objects from B that did not match with an item from A
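
The relationship between the three outputs can be sketched as follows. This is a minimal illustration, not the operator's implementation: the function name match_by_word_overlap, the threshold argument, and the rule that a pair matches when its score reaches that threshold are all hypothetical, since the text above does not state when a score counts as a match. Objects are assumed to be hashable.

```python
def match_by_word_overlap(a_rows, b_rows, threshold=0.5):
    """a_rows and b_rows are lists of (obj, string) pairs.

    Returns (result, nota, notb). The threshold is a hypothetical
    parameter introduced only for this sketch.
    """
    result = []
    matched_a, matched_b = set(), set()
    for obj_a, text_a in a_rows:
        words_a = set(text_a.split())
        for obj_b, text_b in b_rows:
            words_b = set(text_b.split())
            union = words_a | words_b
            score = len(words_a & words_b) / len(union) if union else 0.0
            if score >= threshold:
                result.append((obj_a, obj_b))
                matched_a.add(obj_a)
                matched_b.add(obj_b)
    # Objects that never appeared in a matched pair.
    nota = [obj for obj, _ in a_rows if obj not in matched_a]
    notb = [obj for obj, _ in b_rows if obj not in matched_b]
    return result, nota, notb


a = [(1, "red apple pie"), (2, "blue cheese")]
b = [(10, "green apple pie"), (20, "fish soup")]
result, nota, notb = match_by_word_overlap(a, b)
print(result)  # [(1, 10)]
print(nota)    # [2]
print(notb)    # [20]
```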

Parameters

  • Case-sensitive: if set to false, upper/lower case is ignored
  • Consider term frequency: if set to true, the number of times a term occurs in a text influences the score
  • Exclude self-matches: if set to true, a match is not emitted when the object from A and the object from B are the same. Mostly useful when A and B come from the same source
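
The effect of the three parameters might look like the sketch below. The names overlap_score and keep_pair are hypothetical, and the treatment of term frequency as a multiset (bag-of-words) Jaccard index is an assumption; the operator may weight repeated terms differently.

```python
from collections import Counter


def overlap_score(a: str, b: str, case_sensitive: bool = True,
                  term_frequency: bool = False) -> float:
    """Word-overlap score of two strings under the given settings."""
    if not case_sensitive:
        # Ignore upper/lower case by comparing lowercased text.
        a, b = a.lower(), b.lower()
    if term_frequency:
        # Assumed multiset variant: a repeated word counts once per occurrence.
        count_a, count_b = Counter(a.split()), Counter(b.split())
        shared = sum((count_a & count_b).values())  # minimum count per word
        total = sum((count_a | count_b).values())   # maximum count per word
    else:
        set_a, set_b = set(a.split()), set(b.split())
        shared, total = len(set_a & set_b), len(set_a | set_b)
    return shared / total if total else 0.0


def keep_pair(obj_a, obj_b, exclude_self_matches: bool) -> bool:
    """Return False for a pair of identical objects when self-matches are excluded."""
    return not (exclude_self_matches and obj_a == obj_b)


print(overlap_score("Apple pie", "apple pie", case_sensitive=False))  # 1.0
print(overlap_score("a a b", "a b"))                                  # 1.0
print(keep_pair(7, 7, exclude_self_matches=True))                     # False
```

With term_frequency=True, the strings "a a b" and "a b" no longer score 1.0 in this sketch, because the second occurrence of "a" has no counterpart in the other string.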