Data Type-Specific Settings

Each data type has its own settings when you add a data source. They are described below.

CSV

The settings for CSV files are:

Character encoding:
Specifies the character encoding used in the file. (default=’UTF-8’)
Column separator character:
Specifies the column separator used in the file. Must be a single character. Use '\t' for tab. (default=,)
Contains header row:
Specifies whether the first row (after the skipped rows) is the list of column names. (default=true)
Batch size (internal):
Number of rows to emit in a single XML fragment. A larger batch usually improves performance but also increases memory usage; a batch size of 256 usually works well. Make sure that your indexer template can handle multiple rows in a single XML fragment. (default=1)
Rows to skip (internal):
Specifies how many initial rows to skip. (default=0)
Trim whitespace from values:
Specifies whether to strip leading and trailing whitespace from values in the CSV file. This is particularly useful for exports from older databases that pad values with spaces to a fixed width. (default=false)
Method to escape strings:
Specifies how a literal quoting character is escaped. Two modes are supported: DOUBLING (to express a literal double quote, write two double quotes), or BACKSLASH (precede the quoting character with '\', for example \" to write a literal "). (default=BACKSLASH)
Allow rows to be spread out over multiple rows:
Specifies whether quoted text fields may have line breaks in them. (default=false)
Character to use for quoting a string:
Specifies the character that encloses text values. Usually a single or a double quote. (default=")
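
As a rough illustration of how the CSV settings interact, the following sketch maps them onto Python's standard csv module. It is not the product's parser: the function name read_csv_batches and its parameter names are hypothetical, and columns without a header are given made-up names (col0, col1, ...). Python's csv module always accepts line breaks inside quoted fields, so the multi-row setting is not modeled here.

```python
import csv
import io
from itertools import islice


def read_csv_batches(text, separator=",", quote_char='"',
                     escape_method="BACKSLASH", has_header=True,
                     skip_rows=0, trim=False, batch_size=1):
    """Yield lists of row dicts, loosely mirroring the CSV settings (hypothetical helper)."""
    if escape_method == "DOUBLING":
        # "" inside a quoted field stands for one literal quote
        escaping = dict(doublequote=True, escapechar=None)
    else:
        # \" inside a quoted field stands for one literal quote
        escaping = dict(doublequote=False, escapechar="\\")

    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=separator, quotechar=quote_char, **escaping)

    # Skip the initial rows first, then read the header (if any).
    rows = islice(reader, skip_rows, None)
    header = next(rows, None) if has_header else None

    batch = []
    for row in rows:
        if trim:
            row = [value.strip() for value in row]
        names = header if header is not None else [f"col{i}" for i in range(len(row))]
        batch.append(dict(zip(names, row)))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

For example, calling read_csv_batches(data, separator=";", skip_rows=1, trim=True, batch_size=256) would skip one leading line, treat the next line as the header, strip padding from each value, and hand over up to 256 rows at a time.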

JSON

The settings for JSON files are:

One JSON object per line
Allowed size (internal)
Try to sanitize the JSON (solves some issues with malformatted JSON)
Simplify XML representation (show the JSON keys directly as element names, example: <x> instead of <property name='x'/>)
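
To make the "one JSON object per line" and "simplify XML representation" options more concrete, here is a minimal sketch. The helper names read_json_lines and to_xml are hypothetical, only flat objects are handled, and the exact XML the product emits may differ; the sketch only contrasts the two element styles mentioned above. The simplified form assumes every key is a valid XML element name.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def read_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_xml(obj, simplify):
    """Render a flat JSON object as XML elements in one of the two styles."""
    parts = []
    for key, value in obj.items():
        body = escape(str(value))
        if simplify:
            # Key used directly as the element name
            parts.append(f"<{key}>{body}</{key}>")
        else:
            # Key carried in a name attribute of a generic element
            parts.append(f"<property name={quoteattr(key)}>{body}</property>")
    return "".join(parts)
```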

RDF

The settings for RDF files are:

Graph name:
Name of the graph (optional)
Base URI:
Prefix of the relative URIs in the content (optional)
Batch size:
Triples to read in a single batch (default=1024)
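
The effect of a base URI can be sketched with Python's standard URL resolution. This assumes relative URIs are resolved against the base in the usual RFC 3986 manner; the product may instead apply the base as a plain prefix. The function name resolve and the example.org addresses are placeholders.

```python
from urllib.parse import urljoin


def resolve(base_uri, reference):
    """Resolve a possibly relative URI against the base URI."""
    return urljoin(base_uri, reference)


base = "http://example.org/data/"  # hypothetical Base URI setting
print(resolve(base, "people/alice"))            # relative: base is applied
print(resolve(base, "http://other.example/x"))  # absolute: left unchanged
```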

XLSX

The settings for XLSX files are:

Worksheet
Start at row
Locale
Strip whitespace around values
Timezone
Batch size
Multi-row header size

XML

The settings for XML files are:

Root tag:
Sets the root tag of a fragment. The source file is split into fragments at this tag, which makes indexing more efficient.
Max number of nodes per fragment:
Specifies how many nodes to process at once. A higher limit allows larger fragments to be indexed.
Batching:
Groups fragments under a specified tag. When a batch tag is set, fragments are processed in batches.
Maximum number of items in a batch:
Specifies the maximum number of items in a batch. This setting only applies if a batch tag is set.
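
The splitting and batching described above can be pictured with the following sketch, which uses Python's xml.etree.ElementTree as a stand-in for the product's own XML reader. The function names iter_fragments and iter_batches are hypothetical, and the node limit per fragment is not modeled.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(source, root_tag):
    """Yield each <root_tag> element of the source as a serialized fragment."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == root_tag:
            elem.tail = None  # drop whitespace that follows the element
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()      # free the subtree that was just emitted


def iter_batches(fragments, batch_tag, max_items):
    """Wrap up to max_items fragments at a time in a <batch_tag> element."""
    batch = []
    for fragment in fragments:
        batch.append(fragment)
        if len(batch) == max_items:
            yield f"<{batch_tag}>{''.join(batch)}</{batch_tag}>"
            batch = []
    if batch:
        yield f"<{batch_tag}>{''.join(batch)}</{batch_tag}>"
```

With a root tag of item, a batch tag of batch, and a maximum of 2 items, a file containing three item elements would yield two batches: one with two fragments and one with the remaining fragment.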