Text Transformations for Search Term Matching

Search terms may contain special characters, diacritics, case differences, or language-specific variations, while your indexed data may not (or vice versa). When building a search experience in SPARQUE Desk, you need to ensure that incoming search terms (keywords) can match the terms stored in your dataset.

To achieve this, SPARQUE Desk provides several text transformation blocks. These blocks help normalize both user search queries and dataset entries so they can match consistently.

Scenarios

To understand why text transformations are needed, consider these scenarios:

A user searches for “St.-Veit-Straße”, but your data only contains “St.-Veit-Strasse”.
A user searches for “Busch-Jäger”, but your index contains “Busch-Jaeger”.
A user searches for “John Smith”, but the data stores it as “Smith, John”.

Without transformations, these searches fail to match. By applying normalization steps, you ensure that both sides of the comparison use the same string representation.

Available Transformation Blocks

This section covers the text processing blocks you can use in SPARQUE Desk.

Normalize Diacritics

Purpose: Convert accented or special characters into their ASCII equivalents.

Example:

Nguyễn Tấn Dũng → Nguyen Tan Dung
St.-Veit-Straße → St.-Veit-Strasse

This ensures that diacritics do not prevent matching when the dataset stores only plain ASCII versions. For details, refer to Normalize Diacritics.
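
The following Python sketch illustrates the general idea behind this kind of normalization. It is not SPARQUE Desk's implementation: the function name normalize_diacritics and the EXTRA_MAPPINGS table are assumptions made for illustration. It decomposes each character with Unicode NFKD and drops the combining marks, and it handles ß explicitly because that letter has no Unicode decomposition.

```python
import unicodedata

# Hypothetical table for characters that do not decompose under NFKD.
EXTRA_MAPPINGS = {"ß": "ss"}


def normalize_diacritics(text):
    """Return text with accents removed (illustrative sketch only)."""
    # Replace characters that have no Unicode decomposition.
    replaced = "".join(EXTRA_MAPPINGS.get(ch, ch) for ch in text)
    # Split each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_diacritics("Nguyễn Tấn Dũng"))  # Nguyen Tan Dung
print(normalize_diacritics("St.-Veit-Straße"))  # St.-Veit-Strasse
```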

Language Transliteration

Purpose: Handle language-specific transliterations for special characters.

Example (German):

ä → ae (Busch-Jäger → Busch-Jaeger)
ö → oe
ü → ue

Example (Swedish):

å → aa (Håkan → Haakan)

This block is useful when your data follows language-aware spelling rules. For details, refer to Language Transliteration and Language Transliteration [Strings].
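
As a rough illustration of language-aware transliteration, the Python sketch below applies a per-language replacement table. The table contents mirror the examples above; the names TABLES and transliterate, the language codes, and the uppercase mappings are assumptions, and the actual block may cover more characters and languages.

```python
import unicodedata

# Hypothetical per-language tables, based on the examples in this section.
TABLES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
    "sv": {"å": "aa", "Å": "Aa"},
}


def transliterate(text, language):
    """Replace language-specific characters using the table for `language`."""
    # Normalize keys and input to NFC so precomposed characters line up.
    table = str.maketrans(
        {unicodedata.normalize("NFC", key): value
         for key, value in TABLES[language].items()}
    )
    return unicodedata.normalize("NFC", text).translate(table)


print(transliterate("Busch-Jäger", "de"))  # Busch-Jaeger
print(transliterate("Håkan", "sv"))        # Haakan
```

Note the difference from plain accent removal, which would turn "Busch-Jäger" into "Busch-Jager" and therefore would not match "Busch-Jaeger".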

String Fingerprint

Purpose: Create a simplified “fingerprint” of a string by:

Lowercasing
Removing accents
Tokenizing and sorting tokens

Example:

The Big House → bighousethe
House, Big The → bighousethe

Both strings result in the same fingerprint, so they match regardless of word order, casing, punctuation, or accents. For details, refer to String Fingerprint.
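
The steps listed above can be approximated in a few lines of Python. This is a sketch under assumptions, not the block's actual algorithm: the function name fingerprint is hypothetical, punctuation is dropped by keeping only word characters, and the sorted tokens are joined without a separator to match the example output.

```python
import re
import unicodedata


def fingerprint(text):
    """Build an order-insensitive key for a string (illustrative sketch)."""
    lowered = text.lower()
    # Remove accents: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Tokenize on runs of word characters, which also discards punctuation.
    tokens = re.findall(r"\w+", unaccented)
    # Sort the tokens so that word order no longer matters, then join them.
    return "".join(sorted(tokens))


print(fingerprint("The Big House"))   # bighousethe
print(fingerprint("House, Big The"))  # bighousethe
```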

Change Case

Purpose: Standardize letter casing.

Example:

SPARQUE DESK → sparque desk (lower-case)
sparque desk → SPARQUE DESK (upper-case)

When the same casing is applied to both the keyword and the dataset terms, matching becomes case-insensitive. For details, refer to Change Case.

Stem

Purpose: Reduce words to their root (stem) form.

Example (English):

running, runs → run

Example (Dutch):

lopend, lopen → loop

Stemming is especially useful when you want different inflected forms of a word to match. For details, refer to Stem.
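
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is a toy, not the algorithm used by the Stem block: the function name toy_stem and its two rules are assumptions chosen only to cover the English example, and real stemmers apply far more elaborate, language-specific rule sets. Rule-based stemmers of this kind typically leave irregular forms such as "ran" unchanged.

```python
def toy_stem(word):
    """Strip a few English suffixes (toy example, not a real stemmer)."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


print(toy_stem("running"))  # run
print(toy_stem("runs"))     # run
print(toy_stem("ran"))      # ran
```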

Replace with RegEx

Purpose: Use regular expressions to replace text patterns. This is the most flexible block and can handle custom formatting issues.

Examples:

Normalize multiple spaces:
Pattern: \s+ → Replacement: " "
John    Smith → John Smith
Swap names:
Pattern: ^([^,]+)\s*,\s*(.+)$
Replacement: $2 $1
Smith, John → John Smith
Extract a day of the week (case-insensitive):
Pattern: .*\b((?:mon|tues|wednes|thurs|fri|satur|sun)day)\b.*
Replacement: $1
Next Monday Morning → Monday

For details, refer to Replace with RegEx and Replace with RegEx [Strings].
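
The three replacements above can be tried out with Python's re module, as sketched below. The function names are hypothetical, and Python writes group references as \1 and \2 where the examples above use $1 and $2; the syntax accepted by the block itself may differ. The day-of-week pattern spells out "tues" and "satur" so that Tuesday and Saturday also match.

```python
import re


def collapse_spaces(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def swap_names(text):
    """Turn 'Last, First' into 'First Last'."""
    return re.sub(r"^([^,]+)\s*,\s*(.+)$", r"\2 \1", text)


# Compiled once; IGNORECASE provides the case-insensitive behaviour.
DAY_PATTERN = re.compile(
    r".*\b((?:mon|tues|wednes|thurs|fri|satur|sun)day)\b.*",
    re.IGNORECASE,
)


def extract_day(text):
    """Reduce the text to the day of the week it mentions, if any."""
    return DAY_PATTERN.sub(r"\1", text)


print(collapse_spaces("John    Smith"))    # John Smith
print(swap_names("Smith, John"))           # John Smith
print(extract_day("Next Monday Morning"))  # Monday
```

Text that contains no day name is returned unchanged by extract_day, because the pattern does not match and nothing is replaced.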

Best Practices

When working with text transformations, consider the following best practices:

Normalize both sides:
Apply transformations to both the incoming keyword and the dataset entries to ensure that they are comparable.
Keep transformations consistent:
If you lowercase the dataset terms, also lowercase incoming queries.
Use fingerprints for messy data:
If your data varies in order or spacing, use String Fingerprint.
Start simple, extend as needed:
Begin with Normalize Diacritics and Change Case. Add Language Transliteration, Stem, or Replace with RegEx for specific use cases.

Example Workflow

Normalize the terms in your dataset. For example, apply Normalize Diacritics and Change Case.
Repeat these steps for the incoming parameter (often keyword).
Compare the transformed parameter against the transformed dataset terms.
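
Outside SPARQUE Desk, the same three steps could be sketched in Python as follows. Everything here is illustrative: the functions normalize, build_index, and search, and the sample dataset, are assumptions and do not correspond to actual blocks or APIs. The point is only that one shared normalization function is applied to both sides before comparing.

```python
import unicodedata


def normalize(term):
    """Apply the same steps to any term: strip diacritics, then lowercase."""
    term = term.replace("ß", "ss")  # ß has no Unicode decomposition
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


def build_index(terms):
    """Step 1: map each normalized dataset term to its original spellings."""
    index = {}
    for term in terms:
        index.setdefault(normalize(term), []).append(term)
    return index


def search(keyword, index):
    """Steps 2 and 3: normalize the keyword, then compare."""
    return index.get(normalize(keyword), [])


dataset = ["St.-Veit-Strasse", "Nguyen Tan Dung"]  # hypothetical sample data
index = build_index(dataset)

print(search("st.-veit-straße", index))  # ['St.-Veit-Strasse']
print(search("Nguyễn Tấn Dũng", index))  # ['Nguyen Tan Dung']
```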

With these transformations, your SPARQUE Desk search engine can handle special characters, accents, case differences, and language-specific spelling differences, leading to more accurate and user-friendly results.