Filter/Match by String

Description

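Finds matches between the STRING-columns of the two inputs. The comparison function can be chosen: exact equality, containment (optionally as a whole word), startsWith, endsWith, or approximate matching based on Levenshtein edit distance or Jaro-Winkler similarity.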

Input

  • A [OBJ,STRING]: a list of candidates, in which the STRING-column will be used for comparison and the OBJ-column will be the result
  • B [STRING]: a list of candidate strings, to be used for comparison

Output

  • FILTER [OBJ,STRING]: the filtered [OBJ,STRING] from A
  • MATCH [OBJ,STRING,OBJ]: the matched [OBJ,STRING] from A and [STRING] from B
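
As an illustration of how the two outputs differ, the following Python sketch assumes that FILTER keeps each row of A that matches at least one string in B, and that MATCH holds one row per matching pair. Neither assumption is confirmed by this page; filter_and_match, the sample rows and the compare callback are hypothetical names, and the third MATCH column is shown here simply as the matching string from B.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the block's row types (not an actual API).
RowA = Tuple[str, str]            # (OBJ, STRING) from input A
MatchRow = Tuple[str, str, str]   # (OBJ, STRING) from A plus the string from B


def filter_and_match(
    a: List[RowA],
    b: List[str],
    compare: Callable[[str, str], bool],
) -> Tuple[List[RowA], List[MatchRow]]:
    """Return (FILTER, MATCH) under the assumptions stated above."""
    filtered: List[RowA] = []
    matched: List[MatchRow] = []
    for obj, text in a:
        # All strings in B that the chosen comparison accepts for this row.
        hits = [s for s in b if compare(text, s)]
        if hits:
            # Assumed FILTER semantics: the row appears once, however many hits.
            filtered.append((obj, text))
        # Assumed MATCH semantics: one row per (A row, B string) pair.
        matched.extend((obj, text, s) for s in hits)
    return filtered, matched


# Example with a "contains"-style comparison.
a_rows = [("doc1", "Amsterdam"), ("doc2", "Rotterdam"), ("doc3", "Utrecht")]
b_strings = ["dam", "Rotter"]
filtered, matched = filter_and_match(a_rows, b_strings, lambda x, y: y in x)
```

In this example "Rotterdam" matches both strings in B, so it would appear once in the filtered list and twice in the matched list, while "Utrecht" appears in neither.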

Parameters

  • Comparison: Comparison function to use
    • equal: the strings must be equal
    • contains: the string in A must contain the string in B
    • containsWholeWord: the string in A must contain the string in B as a whole word (with only punctuation or spaces around it)
    • startsWith: the string in A must start with the string in B
    • endsWith: the string in A must end with the string in B
    • levenshtein: the edit distance between the strings in A and B (the number of character insertions, deletions or substitutions) must not exceed Max edit-distance. The distance does not affect the score of the match.
    • jaro-winkler: the Jaro-Winkler similarity between the strings in A and B must be at least Min similarity. The similarity does not affect the score of the match.
  • Case-sensitive: if set to false, upper/lower case is ignored
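
The comparison options above can be approximated in a few lines of Python. This is a minimal sketch, not the block's actual implementation: the function names, the max_distance argument (standing in for Max edit-distance) and the use of casefold() for case-insensitive matching are assumptions. The whole-word test uses the regex notion of a word character, which also treats digits and underscores as part of a word, so it may differ from the block in edge cases. Jaro-Winkler is left out to keep the sketch short.

```python
import re


def levenshtein(s: str, t: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def contains_whole_word(a: str, b: str) -> bool:
    """True if b occurs in a with no word character directly before or after."""
    pattern = r"(?<!\w)" + re.escape(b) + r"(?!\w)"
    return re.search(pattern, a) is not None


def compare(a: str, b: str, comparison: str,
            case_sensitive: bool = True, max_distance: int = 1) -> bool:
    """Apply one of the comparison options to a string from A and one from B."""
    if not case_sensitive:
        # Assumption: ignoring case is modeled with str.casefold().
        a, b = a.casefold(), b.casefold()
    if comparison == "equal":
        return a == b
    if comparison == "contains":
        return b in a
    if comparison == "containsWholeWord":
        return contains_whole_word(a, b)
    if comparison == "startsWith":
        return a.startswith(b)
    if comparison == "endsWith":
        return a.endswith(b)
    if comparison == "levenshtein":
        return levenshtein(a, b) <= max_distance
    raise ValueError(f"unsupported comparison: {comparison}")
```

For example, compare("concatenate", "cat", "contains") holds but the same pair fails under "containsWholeWord", and "kitten" versus "sitting" passes the "levenshtein" option only when max_distance is 3 or more.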

Output scores can be aggregated and/or normalized.