Data Type-Specific Settings
Each data type has its own settings when you add a data source. They are described below.
CSV
The settings for CSV files are:
- Character encoding:
Specifies the character encoding used in the file. (default='UTF-8')
- Column separator character:
Specifies the column separator used in the file. Must be a single character. Use '\t' for tab. (default=,)
- Contains header row:
Specifies whether the first row (after the skipped rows) is the list of column names. (default=true)
- Batch size (internal):
Number of rows to emit in a single XML fragment. Increasing this number usually improves performance but also increases memory usage. A batch size of 256 is usually fine. Make sure that your indexer template can handle multiple rows in a single XML fragment. (default=1)
- Rows to skip (internal):
Specifies how many initial rows to skip. (default=0)
- Trim whitespace from values:
Specifies whether to strip leading and trailing whitespace from values in the CSV file. This is particularly useful for data exported from older databases that pad values with spaces to a fixed width. (default=false)
- Method to escape strings:
Specifies how a literal quoting character is escaped. Two modes are supported: DOUBLING (to express a literal double quote, write two double quotes), or BACKSLASH, which uses a preceding '\' to escape the quoting character (\" to write a "). (default=BACKSLASH)
- Allow rows to be spread out over multiple rows:
Specifies whether quoted text fields may have line breaks in them. (default=false)
- Character to use for quoting a string:
Specifies which character encloses text values. Usually a single or a double quote. (default=")
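To make the interplay of these settings more concrete, the following is a minimal sketch that mimics them with Python's standard csv module. It is not the product's own parser: the function names read_csv_rows and batches and all parameter names are hypothetical, chosen only to mirror the setting labels above.

```python
import csv
import io


def read_csv_rows(text, separator=",", quote_char='"', escape_method="BACKSLASH",
                  has_header=True, rows_to_skip=0, trim_whitespace=False):
    """Parse CSV text roughly the way the settings above describe (illustrative only)."""
    if escape_method == "DOUBLING":
        # A literal quote is written as two quotes: "say ""hi"""
        escaping = {"doublequote": True, "escapechar": None}
    else:
        # BACKSLASH: a literal quote is written as \"
        escaping = {"doublequote": False, "escapechar": "\\"}

    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=separator, quotechar=quote_char, **escaping)
    rows = list(reader)[rows_to_skip:]          # "Rows to skip"

    if trim_whitespace:                         # "Trim whitespace from values"
        rows = [[value.strip() for value in row] for row in rows]

    if has_header and rows:                     # "Contains header row"
        header, rows = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in rows]
    return rows


def batches(rows, batch_size=1):
    """Group rows into lists of at most batch_size items ("Batch size")."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


sample = 'generated by export\nname,quote\nAda  ,"say \\"hi\\""\nBob,plain\n'
rows = read_csv_rows(sample, rows_to_skip=1, trim_whitespace=True)
print(rows[0])  # {'name': 'Ada', 'quote': 'say "hi"'}
```

With a batch size greater than 1, each group produced by batches would correspond to one XML fragment containing several rows, which is why the indexer template has to be able to handle more than one row per fragment.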
JSON
The settings for JSON files are:
- One JSON object per line
- Allowed size (internal)
- Try to sanitize the JSON (solves some issues with malformed JSON)
- Simplify XML representation (show the JSON keys directly as element names, for example <x> instead of <property name='x'/>)
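As a rough illustration of the "One JSON object per line" and "Simplify XML representation" settings, the sketch below reads line-delimited JSON and renders each object in both styles. Only the <x> versus <property name='x'/> distinction comes from the description above. The wrapping <object> element, the function names, and the handling of values are assumptions made for the example, not the product's actual output format.

```python
import json
import xml.etree.ElementTree as ET


def parse_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_xml(obj, simplify):
    """Render a flat JSON object as XML (hypothetical <object> wrapper element)."""
    root = ET.Element("object")
    for key, value in obj.items():
        if simplify:
            # Simplified form: the JSON key becomes the element name.
            # Note: this only works for keys that are valid XML names.
            child = ET.SubElement(root, key)
        else:
            # Generic form: the JSON key is carried in a name attribute.
            child = ET.SubElement(root, "property", name=key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


objects = parse_json_lines('{"x": 1}\n{"x": 2}\n')
print(to_xml(objects[0], simplify=True))   # <object><x>1</x></object>
print(to_xml(objects[0], simplify=False))  # <object><property name="x">1</property></object>
```

The simplified form is shorter and easier to address in a template, while the generic form can represent keys that are not valid XML element names.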
RDF
The settings for RDF files are:
- Graph name:
Name of the graph (optional)
- Base URI:
Prefix of the relative URIs in the content (optional)
- Batch size:
Triples to read in a single batch (default=1024)
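The effect of a base URI and a batch size can be sketched as follows. This example uses standard URI reference resolution from Python's urllib, which matches simple prefixing when the base URI ends in a slash and the relative URI has no leading slash. Whether the product resolves URIs in exactly this way is an assumption, and the function names and example URIs are hypothetical.

```python
from itertools import islice
from urllib.parse import urljoin


def resolve(base_uri, uri):
    """Resolve a possibly relative URI against the base URI; absolute URIs pass through."""
    return urljoin(base_uri, uri)


def batched(triples, batch_size=1024):
    """Yield lists of at most batch_size triples from any iterable."""
    iterator = iter(triples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


base = "http://example.org/data/"
print(resolve(base, "item/1"))              # http://example.org/data/item/1
print(resolve(base, "http://other.org/x"))  # http://other.org/x
```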
XLSX
The settings for XLSX files are:
- Worksheet
- Start at row
- Locale
- Strip whitespace around values
- Timezone
- Batch size
- Multi-row header size
XML
The settings for XML files are:
- Root tag:
Identifies the tag at which the source file is split into fragments. Splitting the file makes the indexing process more performant.
- Max number of nodes per fragment:
Specifies how many nodes to process at once. More nodes means more capacity to index.
- Batching:
Groups fragments under a specified tag. When set, fragments are processed in batches.
- Maximum number of items in a batch:
Specifies the maximum number of items in a batch. This setting only applies if a batch tag is set.
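The idea behind the root tag and batching settings can be sketched with Python's standard library: the file is streamed, every element with the root tag becomes one fragment, and fragments are optionally grouped under a batch tag. This is only an illustration of the concept. The function names, the tag names item and batch, and the exact shape of a batch are assumptions, and nested elements with the same tag are not handled.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(xml_text, root_tag):
    """Stream the document and yield each element named root_tag as its own fragment."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == root_tag:
            elem.tail = None                      # drop whitespace that follows the element
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()                          # release the element's children and text


def iter_batches(fragments, batch_tag, max_items):
    """Group fragments under batch_tag, with at most max_items fragments per batch."""
    batch = []
    for fragment in fragments:
        batch.append(fragment)
        if len(batch) == max_items:
            yield f"<{batch_tag}>{''.join(batch)}</{batch_tag}>"
            batch = []
    if batch:                                     # emit the final, possibly smaller batch
        yield f"<{batch_tag}>{''.join(batch)}</{batch_tag}>"


document = "<catalog>\n<item>a</item>\n<item>b</item>\n<item>c</item>\n</catalog>"
for batch in iter_batches(iter_fragments(document, "item"), "batch", 2):
    print(batch)
# <batch><item>a</item><item>b</item></batch>
# <batch><item>c</item></batch>
```

Processing one fragment or batch at a time keeps memory use bounded by the fragment size rather than by the size of the whole source file.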