Match Strings

Description

Finds matches between the STRING columns of the two inputs. Several comparison options are available: equal, contains, containsWholeWord, startsWith, endsWith, prefix, levenshtein (edit distance) and jaro-winkler. The result provides the matched string pairs, as well as the strings from each input that did not produce a match.

It is strongly recommended to deduplicate the inputs beforehand. This block does not deduplicate them, and duplicates can increase computation time.
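
As an illustration of the deduplication recommended above, the following is a minimal Python sketch of an order-preserving deduplication step. The function name `deduplicate` is hypothetical and not part of this block; in practice a dedicated deduplication step would be placed before the inputs.

```python
def deduplicate(strings):
    """Remove repeated strings while keeping the first occurrence of each.

    Hypothetical helper for illustration only; it is not part of the block.
    """
    # dict preserves insertion order, so the original order is kept.
    return list(dict.fromkeys(strings))


candidates = ["apple", "pear", "apple", "plum", "pear"]
unique_candidates = deduplicate(candidates)
```

With duplicates removed, a pairwise comparison of A against B has fewer pairs to evaluate.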

Input

  • A [STRING]: a list of candidates
  • B [STRING]: a list of candidates

Output

  • RESULT [STRING,STRING]: the matched strings from A and B
  • NOTA [STRING]: the strings from A that did not match any string from B
  • NOTB [STRING]: the strings from B that did not match any string from A
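
The relation between the three outputs can be sketched in Python as follows. This is only an illustration under assumptions: the function `match_strings` and its `compare` argument are hypothetical names, the naive pairwise loop is not necessarily how the block is implemented, and exact equality is used as the default comparison.

```python
def match_strings(a, b, compare=lambda x, y: x == y):
    """Return (result, not_a, not_b) for two lists of strings.

    Hypothetical sketch: compares every string in A with every string in B.
    """
    # RESULT: every (A, B) pair accepted by the comparison function.
    result = [(x, y) for x in a for y in b if compare(x, y)]

    # NOTA / NOTB: strings that appear in no matched pair.
    matched_a = {x for x, _ in result}
    matched_b = {y for _, y in result}
    not_a = [x for x in a if x not in matched_a]
    not_b = [y for y in b if y not in matched_b]
    return result, not_a, not_b


result, not_a, not_b = match_strings(["apple", "pear"], ["pear", "plum"])
```

In this example `result` holds the single pair ("pear", "pear"), `not_a` holds "apple" and `not_b` holds "plum".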

Parameters

  • Comparison: Comparison function to use
    • equal: the strings must be equal
    • contains: the string in B must be contained in A
    • containsWholeWord: the string in B must be contained in A, as a whole word (only punctuation/spaces around)
    • startsWith: the string in A must start with the string in B
    • endsWith: the string in A must end with the string in B
    • prefix: strings in A and B share a prefix of a given length
    • levenshtein: the string in A may differ from the string in B by at most Max edit-distance edits (character insertions, deletions or substitutions).
    • jaro-winkler: the strings in A and B must have a Jaro-Winkler similarity score not smaller than Min similarity.
  • Case-sensitive: if set to false, upper/lower case is ignored
  • Exclude self-matches: if set to true, no match is emitted when the objects in A and B are the same. Mostly useful when A and B come from the same source
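
To make the comparison options more concrete, here is a hedged Python sketch of most of them. It is not the block's implementation: the names `make_comparator`, `prefix_length` and `max_distance` are hypothetical, the whole-word check assumes that "word" means a run of letters, digits or underscores, and the edit distance follows the standard Levenshtein definition (insertions, deletions and substitutions). Jaro-Winkler is omitted for brevity.

```python
import re


def levenshtein(s, t):
    """Standard Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def make_comparator(name, case_sensitive=True, prefix_length=3, max_distance=1):
    """Build a function compare(a, b) for the named comparison (hypothetical API)."""
    def whole_word(a, b):
        # b must not be directly preceded or followed by a word character.
        pattern = r"(?<!\w)" + re.escape(b) + r"(?!\w)"
        return re.search(pattern, a) is not None

    def shared_prefix(a, b):
        # Assumption: strings shorter than prefix_length never match.
        if len(a) < prefix_length or len(b) < prefix_length:
            return False
        return a[:prefix_length] == b[:prefix_length]

    checks = {
        "equal": lambda a, b: a == b,
        "contains": lambda a, b: b in a,
        "containsWholeWord": whole_word,
        "startsWith": lambda a, b: a.startswith(b),
        "endsWith": lambda a, b: a.endswith(b),
        "prefix": shared_prefix,
        "levenshtein": lambda a, b: levenshtein(a, b) <= max_distance,
    }
    check = checks[name]

    def compare(a, b):
        # With case_sensitive=False both strings are lower-cased first.
        if not case_sensitive:
            a, b = a.lower(), b.lower()
        return check(a, b)

    return compare


contains = make_comparator("contains")
whole = make_comparator("containsWholeWord")
fuzzy = make_comparator("levenshtein", max_distance=1)
```

For example, `contains("red apple", "app")` holds while `whole("red apple", "app")` does not, because "app" is only part of the word "apple". Note the direction of the asymmetric comparisons: the string from B is the one searched for inside the string from A.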