Data Type-Specific Settings

Each data type has its own settings when you add a data source. They are described below.

CSV

The settings for CSV files are:

  • Character encoding:
    Specifies the character encoding used in the file. (default='UTF-8')
  • Column separator character:
    Specifies the separator in the file. Must be a single character. Use '\t' for a tab. (default=,)
  • Contains header row:
    Specifies whether the first row (after the skipped rows) is the list of column names. (default=true)
  • Batch size (internal):
    Number of rows to emit in a single XML fragment. Increasing this number usually improves performance, but also affects memory usage. Usually a batch size of 256 is fine. Make sure that your indexer template can handle multiple rows in a single XML fragment. (default=1)
  • Rows to skip (internal):
    Specifies how many initial rows to skip. (default=0)
  • Trim whitespace from values:
    Specifies whether to strip leading and trailing whitespace from values in the CSV file. This is particularly useful when parsing exports from older databases that pad values with spaces to a fixed width. (default=false)
  • Method to escape strings:
    Specifies how a literal quoting character is escaped. Two modes are supported: DOUBLING (to express a literal double quote, write two double quotes) and BACKSLASH (precede the quoting character with '\', for example \" to write a "). (default=BACKSLASH)
  • Allow rows to be spread out over multiple rows:
    Specifies whether quoted text fields may have line breaks in them. (default=false)
  • Character to use for quoting a string:
    Specifies how text is identified. Usually a single or a double quote. (default=")
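
The effect of the separator, quoting, escaping, skip, header, and trim settings can be illustrated with Python's standard csv module. This is only a sketch of the behavior described above, not the product's own parser; the function name read_rows and its parameter names are hypothetical stand-ins for the settings.

```python
import csv
import io

def read_rows(text, separator=",", quote_char='"', escape_mode="BACKSLASH",
              has_header=True, rows_to_skip=0, trim=False):
    """Parse CSV text with options that mirror the settings listed above.

    All parameter names are hypothetical; they only imitate the settings.
    """
    if escape_mode == "DOUBLING":
        # A literal quote is written as two quote characters.
        escaping = dict(doublequote=True, escapechar=None)
    else:
        # BACKSLASH: a literal quote is written as \" inside a quoted value.
        escaping = dict(doublequote=False, escapechar="\\")

    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=separator, quotechar=quote_char, **escaping)
    rows = list(reader)[rows_to_skip:]   # drop the initial rows
    if trim:
        rows = [[value.strip() for value in row] for row in rows]
    if not has_header or not rows:
        return rows
    header, *data = rows                 # first remaining row holds column names
    return [dict(zip(header, row)) for row in data]
```

For example, read_rows(text, separator="\t", rows_to_skip=1, trim=True) skips one leading row, splits on tabs, and strips padding from every value before the header row is applied.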

JSON

The settings for JSON files are:

  • One JSON object per line
  • Allowed size (internal)
  • Try to sanitize the JSON (solves some issues with malformatted JSON)
  • Simplify XML representation (show the JSON keys directly as element names, example: <x> instead of <property name='x'/>)
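
The difference between the two XML representations can be sketched as follows for a flat JSON object. The wrapper element name "object" and the function name json_line_to_xml are assumptions made for illustration; the XML that the product actually emits may differ.

```python
import json
import xml.etree.ElementTree as ET

def json_line_to_xml(line, simplify=False):
    """Convert one flat JSON object (one line of input) to an XML string."""
    record = json.loads(line)
    root = ET.Element("object")  # hypothetical wrapper element name
    for key, value in record.items():
        if simplify:
            # Simplified form: the JSON key becomes the element name, e.g. <x>.
            child = ET.SubElement(root, key)
        else:
            # Default form: the key is carried in a name attribute.
            child = ET.SubElement(root, "property", name=key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

def json_lines_to_xml(text, simplify=False):
    """Handle input with one JSON object per line, ignoring blank lines."""
    return [json_line_to_xml(line, simplify)
            for line in text.splitlines() if line.strip()]
```

Note that the simplified form only works when every JSON key is also a valid XML element name.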

RDF

The settings for RDF files are:

  • Graph name:
    Name of the graph (optional)
  • Base URI:
    Prefix of the relative URIs in the content (optional)
  • Batch size:
    Triples to read in a single batch (default=1024)

XLSX

The settings for XLSX files are:

  • Worksheet
  • Start at row
  • Locale
  • Strip whitespace around values
  • Timezone
  • Batch size
  • Multi-row header size

XML

The settings for XML files are:

  • Root tag:
    Sets the root tag. The source file is then split into fragments, which speeds up indexing.
  • Max number of nodes per fragment:
    Specifies how many nodes to process at once. More nodes means more capacity to index.
  • Batching:
    Groups fragments under a specified tag. When a batch tag is set, fragments are processed in batches.
  • Maximum number of items in a batch:
    Specifies the maximum number of items in a batch. This setting only applies if a batch tag is set.
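
The interplay between the root tag, the batch tag, and the maximum batch size can be sketched with Python's standard library. This sketch assumes that every element matching the root tag becomes one fragment and that batching wraps up to the maximum number of fragments in the batch tag. The names iter_fragments and iter_batches are hypothetical, and the product's actual splitting logic may differ.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice

def iter_fragments(xml_text, root_tag):
    """Yield each element named root_tag as its own serialized fragment."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == root_tag:
            yield ET.tostring(elem, encoding="unicode").strip()
            elem.clear()  # release the element's children to keep memory flat

def iter_batches(fragments, batch_tag, max_items):
    """Wrap up to max_items fragments at a time in a batch_tag element."""
    fragments = iter(fragments)
    while True:
        chunk = list(islice(fragments, max_items))
        if not chunk:
            return
        yield f"<{batch_tag}>{''.join(chunk)}</{batch_tag}>"
```

With a root tag of "item", a batch tag of "batch", and a maximum of 2 items, a file that contains three item elements yields two batches: one with two fragments and one with the remaining fragment.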