Replace with RegEx

Description

Transforms strings in a [OBJ,STRING] input using a regular expression replacement.

Input

  • SOURCE [OBJ,STRING]: a two-column input of object-string pairs, typically obtained with the Extract string block.

Output

  • RESULT [OBJ,STRING]: the pairs from SOURCE, where the string has been modified
  • STRINGS [STRING]: the modified strings, without the object they were paired to

Parameters

  • Pattern RegEx: the regular expression matched against each string in SOURCE.
  • Replacement RegEx: the replacement text substituted for each match to produce RESULT. It may reference capturing groups from the pattern.
  • Occurrences:
    • First: replace only the first occurrence in each input string
    • All: replace every occurrence in each input string
  • Case-sensitive: if set to false, upper/lower case is ignored
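The way the parameters combine can be sketched in Python. This is only an illustration, not the block's implementation: the function name replace_with_regex and its arguments are hypothetical, and Python's re module stands in for the PCRE engine.

```python
import re

def replace_with_regex(source, pattern, replacement,
                       occurrences="All", case_sensitive=True):
    """Hypothetical stand-in for the block.

    source is a list of (obj, string) pairs, mirroring the SOURCE input.
    Returns (result, strings), mirroring the RESULT and STRINGS outputs.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    # In re.sub, count=0 means "replace every occurrence".
    count = 1 if occurrences == "First" else 0
    compiled = re.compile(pattern, flags)
    result = [(obj, compiled.sub(replacement, text, count=count))
              for obj, text in source]
    strings = [text for _, text in result]
    return result, strings

source = [("doc1", "Foo foo FOO"), ("doc2", "bar")]

# First occurrence only, ignoring case: only the leading "Foo" is replaced.
first_result, first_strings = replace_with_regex(
    source, r"foo", "x", occurrences="First", case_sensitive=False)

# All occurrences, case-sensitive: only the lowercase "foo" matches.
all_result, all_strings = replace_with_regex(source, r"foo", "x")
```

Strings that do not match the pattern, such as "bar" here, pass through unchanged in this sketch.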

Output scores can be aggregated and/or normalized.

Regular Expressions

Regular expressions are internally evaluated by a PCRE engine. For a syntax reference, see this page. For a 1-page syntax reference, see this cheat-sheet.

Some of the Most Common Questions and Mistakes

  • Regular expressions are different from glob patterns using wildcards. In particular, * does NOT mean "anything", .* does.
  • All special characters (. * + ? | \ ( ) [ ] ^ $) must be escaped (prefixed with \) in the Pattern RegEx when they are meant literally. In the Replacement RegEx they are always meant literally (thus, no escaping!), except for group references (see below).
  • Capturing groups are indicated by parentheses, and back-references by either \n or $n, with n being the n-th group in the pattern.
  • Parentheses can also be used to group sub-expressions together, for example in choices: (one|two|three). To use parentheses only for grouping and not capturing, use the ?: prefix, as in (?:one|two|three).
  • ^ indicates the beginning of an input text, or negation when it is the first character inside a character class (e.g., [^\d-_]). $ indicates the end of an input text.
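  • \b indicates a word boundary: the position between a word character (letter, digit or underscore) and a non-word character (space, punctuation, etc.) or the start/end of the text.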
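A few of the points above can be checked quickly with Python's re module. Treat this as a sketch only: Python's engine is not PCRE, although the constructs shown here behave the same way, and back-references in a Python replacement are written \1 rather than $1. The sample strings are made up for illustration.

```python
import re

# "*" is a quantifier, not a wildcard: "ab*" means "a" followed by zero or more "b".
star_only = re.fullmatch(r"ab*", "abcd")      # no match: "cd" is not covered
dot_star = re.fullmatch(r"ab.*", "abcd")      # match: ".*" covers "cd"

# An unescaped "." matches any character; "\." matches a literal dot.
unescaped = re.sub(r"1.5", "X", "1.5 and 1x5")
escaped = re.sub(r"1\.5", "X", "1.5 and 1x5")

# (?:...) groups without capturing, so the number is group 1.
number_only = re.sub(r"(?:one|two|three) (\d+)", r"\1", "two 42")

# "^" as the first character of a character class negates it.
# The hyphen is placed last so that it is read as a literal character.
kept = re.sub(r"[^\d_-]", "", "a1-b2_c3")

# "\b" matches only at word boundaries, so "concat" is left alone.
bounded = re.sub(r"\bcat\b", "dog", "cat concat cat.")
```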

Examples

  • Normalize spaces (with Occurrences = All)
    • Pattern RegEx: \s+
    • Replacement RegEx: (a single space)
  • Turn Smith, John into John Smith:
    • Pattern RegEx: ^([^,]+)\s*,\s*(.+)$
    • Replacement RegEx: $2 $1
  • Extract any day of the week (with Case-sensitive = false):
    • Pattern RegEx: .*\b((?:mon|tues|wednes|thurs|fri|satur|sun)day)\b.*
    • Replacement RegEx: $1
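The three examples can be tried outside the block with a short Python script. This is a sketch under two assumptions: Python's re module is used in place of the PCRE engine, and the replacements use Python's \1 syntax where the block accepts $1. The input strings are invented for illustration. In the day-of-week pattern the alternation spells out tues and satur, so that "tuesday" and "saturday" match.

```python
import re

# Normalize spaces (Occurrences = All): every run of whitespace becomes one space.
normalized = re.sub(r"\s+", " ", "a   b\t\nc")

# Turn "Smith, John" into "John Smith" by swapping the two captured groups.
swapped = re.sub(r"^([^,]+)\s*,\s*(.+)$", r"\2 \1", "Smith, John")

# Extract a day of the week (Case-sensitive = false).
# The surrounding ".*" consume the rest of the text, so only group 1 remains.
day_pattern = r".*\b((?:mon|tues|wednes|thurs|fri|satur|sun)day)\b.*"
day = re.sub(day_pattern, r"\1", "See you next Tuesday at noon",
             flags=re.IGNORECASE)

# A string without a day name does not match and is returned unchanged.
no_day = re.sub(day_pattern, r"\1", "See you soon", flags=re.IGNORECASE)
```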