Configuring Database Indexes

1. Overview

In this section, we discuss the types of database indexes used by eXist, as well as how they are created, configured and maintained. It assumes readers have a basic understanding of XML and XQuery.

Database indexes are used extensively by eXist to facilitate efficient querying of the database. This is accomplished both by system-generated and user-configured database indexes. The current version of eXist by default includes the following types of indexes:

  1. Structural Indexes : These index the nodal structure, elements (tags) and attributes, of the documents in a collection.
  2. Range Indexes : These map specific text nodes and attributes of the documents in a collection to typed values.
  3. Legacy Full Text Indexes : These map specific text nodes and attributes of the documents in a collection to text tokens.
  4. New Full Text Indexes (eXist 1.4) : A new full text indexing module that provides faster, customizable full text indexing by transparently integrating Lucene into the XQuery engine. Prefer this index over the old built-in implementation.
  5. NGram Indexes : These map specific text nodes and attributes of the documents in a collection to tokens of n characters (where n = 3 by default). They are very efficient for exact substring searches and for queries on scripts (mostly non-European ones) that cannot easily be split into whitespace-separated tokens and are thus a bad match for the full text index.
  6. Spatial Indexes (Experimental): These map elements of the documents in a collection containing georeferenced geometries to dedicated data structures that allow efficient spatial queries.

Note

Currently, comments and processing instruction nodes are not indexed. Whether they should be or not, and how some processing instructions could "hint" the indexing process is still being considered.

Since version 1.2, eXist features a new modularized indexing architecture. Most types of indexes have been moved out of the database core and are now maintained as pluggable extensions. The full text, ngram and spatial indexes fall under this category. Please refer to the blog article which introduced the new architecture.

2. Built-in Indexes

This section describes the features of those indexes that are part of the eXist distribution. Some of those indexes (n-gram, spatial) may need to be enabled first (see below).

2.1. Structural index

This index keeps track of the elements (tags), attributes, and nodal structure for all XML documents in a collection. It is created and maintained automatically in eXist, and can neither be reconfigured nor disabled by the user. The structural index is required for nearly all XPath and XQuery expressions in eXist (with the exception of wildcard-only expressions such as "//*"). This index is stored in the database file elements.dbx.

Technically, the structural index maps every element and attribute qname (or qualified name) in a document collection to a list of <documentId, nodeId> pairs. This mapping is used by the query engine to resolve queries for a given XPath expression.

For example, given the following query:

//book/section

eXist uses two index lookups: the first for the <book> node, and the second for the <section> node. It then computes the structural join between these node sets to determine which <section> elements are in fact children of <book> elements.
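
The two lookups and the join can be pictured with a small Python model. This is only a conceptual sketch, not eXist's storage format or join algorithm: the dictionary layout, the function names and the tuple-shaped node ids (where a child's id is its parent's id plus one more step) are assumptions made for illustration.

```python
# Conceptual model only: not eXist's actual storage format or join algorithm.
from collections import defaultdict

# Hypothetical node ids: the tuple (1, 2) means "second child of the first
# top-level node", so a parent's id is its child's id without the last step.
structural_index = defaultdict(list)


def add_node(qname, doc_id, node_id):
    """Record one element occurrence under its qualified name."""
    structural_index[qname].append((doc_id, node_id))


def child_join(parent_qname, child_qname):
    """Return the child_qname entries whose parent is a parent_qname node."""
    parents = set(structural_index[parent_qname])   # first index lookup
    children = structural_index[child_qname]        # second index lookup
    return [(doc, nid) for doc, nid in children
            if (doc, nid[:-1]) in parents]


# Document 1: <book><section/><section/></book>
# Document 2: <article><section/></article>
add_node("book", 1, (1,))
add_node("section", 1, (1, 1))
add_node("section", 1, (1, 2))
add_node("article", 2, (1,))
add_node("section", 2, (1, 1))

matches = child_join("book", "section")
print(matches)  # [(1, (1, 1)), (1, (1, 2))]
```

The <section> in document 2 is found by the second lookup but dropped by the join, because its parent is not in the <book> node set.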

2.2. Range index

Range indexes provide a shortcut for the database to directly select nodes based on their typed values. They are used when matching or comparing nodes by way of standard XPath operators and functions. Without a range index, comparison operators like =, > or < will default to a "brute-force" inspection of the DOM, which can be extremely slow if eXist has to search through potentially millions of nodes: each node has to be loaded and cast to the target type.

To see how range indexes work, consider the following fragment:

Example: List Entry

<items>
    <item n="1">
       <name>Tall Bookcase</name>
       <price>299.99</price>
    </item>
    <item n="2">
       <name>Short Bookcase</name>
       <price>199.99</price>
    </item>
</items>

With this short inventory, the text nodes of the <price> elements hold dollar values expressed as floating-point numbers (e.g. "299.99"), which correspond to the XML Schema Definition (XSD) data type xs:double. Using this built-in type to define a range index, we can improve the efficiency of searches for <price> values. (Instructions on how to configure range indexes using configuration files are provided under the Configuring Indexes section below.) During indexing, eXist will apply this data type selection by attempting to cast all <price> values as double floating point numbers, and add appropriate values to the index. Values that cannot be cast as double floating point numbers are therefore ignored. This range index will then be used by any expression that compares <price> to an xs:double value - for instance:

//item[price > 100.0]

For non-string data types, the range index provides the query engine with a more efficient method of data conversion. Instead of retrieving the value of each selected element and casting it to xs:double, the engine can evaluate the expression by using the range index as a lookup index. Without an index, eXist has to do a full scan over all <price> elements, retrieve the string values of their text nodes and cast them to double numbers. This is a time-consuming process which also scales very badly with growing data sets. With a proper index, eXist needs just a single index lookup to evaluate price = 100.0. The range expression price > 100.0 is processed with an index scan starting at 100.
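
The difference between a single lookup and an index scan can be sketched in Python. This is a simplified model under stated assumptions: the RangeIndex class and its method names are hypothetical, and a sorted in-memory list stands in for whatever persistent structure eXist really uses.

```python
# Conceptual model only; the class and method names are hypothetical.
import bisect


class RangeIndex:
    """Sorted (typed value -> node) index for values castable to a double."""

    def __init__(self, values):
        entries = []
        for node_id, text in values:
            try:
                entries.append((float(text), node_id))  # cast once, at indexing time
            except ValueError:
                continue  # not castable: the value is left out of the index
        entries.sort()
        self.keys = [key for key, _ in entries]
        self.nodes = [node for _, node in entries]

    def equal(self, key):
        """price = key: one binary-search lookup."""
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return self.nodes[lo:hi]

    def greater_than(self, key):
        """price > key: scan the index starting just after key."""
        return self.nodes[bisect.bisect_right(self.keys, key):]


price_index = RangeIndex([("item1", "299.99"), ("item2", "199.99"), ("item3", "n/a")])
print(price_index.greater_than(100.0))  # ['item2', 'item1']
print(price_index.equal(299.99))        # ['item1']
```

The cast to a number happens once, when the index is built. At query time no node has to be loaded or converted.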

For string data, the index will also be used by the standard functions fn:contains(), fn:starts-with(), fn:ends-with() and fn:matches().

To illustrate this functionality, let's return to the previous example. If you define a range index of type xs:string for element <name> , a query on this element that selects tall bookcases using fn:matches() will be supported by this index:

//item[fn:matches(name, '[Tt]all\s[Bb]')]

Note that fn:matches will by default try to match the regular expression anywhere in the string. We can thus speed up the query dramatically by using "^" to restrict the match to the start of the string:

//item[fn:matches(name, '^[Tt]all\s[Bb]')]

Also, if you really need to search for an exact substring in a longer text sequence, it is often better to use the NGram index instead of the range index, i.e. use ngram:contains() instead of fn:contains(). Unfortunately, there's no equivalent NGram function for fn:matches() yet, but we may add one in the future as it could help to increase performance dramatically.

In general, three conditions must be met in order to optimize a search using a range index:

  1. The range index must be defined on all items in the input sequence.

    For example, suppose you have two collections in the database: C1 and C2. If you have a range index defined for collection C1, but your query happens to operate on both C1 and C2, then the range index would not be used. The query optimizer selects an optimization strategy based on the entire input sequence of the query. Since, in this example, only nodes in C1 have a range index, no range index optimization would be applied.

  2. The index data type (first argument type) must match the test data type (second argument type).

    In other words, with range indexes, there is no promotion of data types (i.e. no data type precedes or replaces another data type). For example, if you defined an index of type xs:double on <price> , a query that compares this element's value with a string literal would not use a range index, for instance:

    //item[price = '1000.0']

    In order to apply the range index, you would need to cast the value as a type xs:double, i.e.:

    //item[price = xs:double($price)] (where $price is any test value)

    Similarly, when we compare xs:double values with xs:integer values, as in, for instance:

    //item[price = 1000]

    the range index would again not be used since the <price> data type differs from the test value type, although this conflict might not seem as obvious as it is with string values.

  3. The right-hand argument has no dependencies on the current context item.

    That is, the test or conditional value must not depend on the value against which it is being tested. For example, range indexes will not be applied given the following expression:

    //item[price = self]
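
The three conditions can be summarized as a single check. The Python sketch below only restates the rules listed above. The function name, its parameters and the way collections and types are represented are hypothetical and do not reflect the optimizer's real interface.

```python
# Illustrative decision helper; names and data layout are hypothetical.
def can_use_range_index(indexed_collections, index_type, query_collections,
                        operand_type, operand_depends_on_context):
    """Return True only if all three conditions described above hold."""
    # Condition 1: the index must cover every collection in the input sequence.
    covers_input = set(query_collections) <= set(indexed_collections)
    # Condition 2: index type and operand type must be identical (no promotion).
    types_match = index_type == operand_type
    # Condition 3: the right-hand operand must not depend on the context item.
    context_free = not operand_depends_on_context
    return covers_input and types_match and context_free


# //item[price = xs:double($price)] over C1 only, with an xs:double index on C1
print(can_use_range_index({"C1"}, "xs:double", ["C1"], "xs:double", False))        # True
# The same query over C1 and C2, where C2 has no index
print(can_use_range_index({"C1"}, "xs:double", ["C1", "C2"], "xs:double", False))  # False
# //item[price = 1000]: an xs:integer literal against an xs:double index
print(can_use_range_index({"C1"}, "xs:double", ["C1"], "xs:integer", False))       # False
```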

Concerning range indexes on strings, there is another restriction to consider: up to version 1.3, range indexes on strings can only be used with the default Unicode collation. Also, string indexes are always case-sensitive (while n-gram and full text indexes are not). It is not yet possible to define a string index on a different collation (e.g. for German or French) or to make it case-insensitive. This is a limitation we plan to address in the next release.

2.3. Spatial Index

A working proof-of-concept index, which listens for spatial geometries described through the Geography Markup Language (GML). A detailed description of the implementation can be found in the Developer's Guide to Modularized Indexes.

3. Optional Index Modules

eXist features a modularized indexing architecture, which allows arbitrary indexes to be plugged into an indexing pipeline. Consequently, some indexes were moved out of the database core and are now available as plugins. For the DB core, those indexes are a black box: they handle their own creation, configuration, destruction etc.

While the structural and the range index are always available, the optional indexes can be enabled or disabled on a given database instance. Optional modules are:

N-Gram Index

N-gram indexes are optimized for exact substring queries (like contains). Substring searches are nearly as fast as with the full text index. However, the n-gram index also preserves whitespace and punctuation, and is case-insensitive by default.

New Full Text Index

Faster, better configurable, more feature rich and reliable than eXist's old index.

Legacy Full Text Index

Old full text index. Deprecated, though still supported for backwards compatibility.

Spatial Index

A working proof-of-concept index, which listens for spatial geometries described through the Geography Markup Language (GML). A detailed description of the implementation can be found in the Developer's Guide to Modularized Indexes.
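
To make the n-gram idea from the N-Gram Index entry above more concrete, here is a minimal Python sketch of a trigram index used for case-insensitive substring search. It is a conceptual model only. The function names and the in-memory dictionary are assumptions and say nothing about how eXist's NGram module actually stores its data.

```python
# Conceptual model only; not the data layout of eXist's NGram module.
from collections import defaultdict

N = 3  # gram length; 3 is the default mentioned in the overview


def ngrams(text, n=N):
    """Split text into overlapping, lower-cased tokens of n characters.

    Whitespace and punctuation are kept as ordinary characters.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(nodes):
    """Map each n-gram to the set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in nodes.items():
        for gram in ngrams(text):
            index[gram].add(node_id)
    return index


def contains(index, nodes, query):
    """Substring search: intersect the postings, then verify each candidate."""
    grams = ngrams(query)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(nodes)  # query shorter than N: check every node
    needle = query.lower()
    return sorted(n for n in candidates if needle in nodes[n].lower())


names = {"n1": "Tall Bookcase", "n2": "Short Bookcase"}
name_index = build_ngram_index(names)
print(ngrams("Tall"))                           # ['tal', 'all']
print(contains(name_index, names, "l book"))    # ['n1']
print(contains(name_index, names, "BOOKCASE"))  # ['n1', 'n2']
```

The query "l book" spans a space, which a whitespace-based tokenizer would split. Because the n-grams keep the space, the index can still narrow the search to a single candidate.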

4. Enabling Index Modules

While a few indexes (n-gram, full text) are already pre-built in the standard eXist distribution, other modules may need to be enabled first. For example, the spatial index depends on several external libraries which do not ship with eXist. However, enabling the spatial index is a simple process:

  1. Copy the properties file extensions/indexes/build.properties and store it as local.properties in the same directory.

  2. Edit extensions/indexes/local.properties:

    Example: local.properties

    # N-gram module
    include.index.ngram = true
    
    # Spatial module
    include.index.spatial = false
                            

    To include an index, change the corresponding property to "true".

  3. Call the Ant build system once to regenerate the eXist libraries:

    build.sh

    or

    build.bat

The build process should create a jar file for every index implementation in directory lib/extensions. For example, the spatial index is packaged into the jar exist-spatial-module.jar.

Once the index module has been built, it can be announced to eXist. To activate an index plugin, it needs to be added to the <modules> section within the global configuration file conf.xml:

Example: Index Plugin Configuration in conf.xml

<modules>
    <module id="ngram-index" class="org.exist.indexing.ngram.NGramIndex"
        file="ngram.dbx" n="3"/>
    <!-- The full text index is always required and should
         not be disabled. We still have some dependencies on
         this index in the database core. These will be removed
         once the redesign has been completed. -->
    <module id="ft-legacy-index" class="org.exist.fulltext.FTIndex"/>
</modules>

Every <module> element needs at least an id and class attribute. The class attribute contains the name of the plugin class, which has to be an implementation of org.exist.indexing.Index.

All other attributes or nested configuration elements below the <module> element are specific to the implementation and will differ between indexes. They should be documented by the index implementor.

If an index implementation cannot be loaded from the specified class, the entry will simply be ignored. A warning will be written to the logs, which should provide more information on the issue that caused the configuration to fail.

5. Configuring Indexes

eXist has no "create index" command. Instead, indexes are configured in collection-specific configuration files. These files are stored as standard XML documents in the system collection: /db/system/config, which can be accessed like any other document (e.g. using the Admin interface or Java Client). In addition to defining index settings for collections, the configuration document specifies other collection-specific settings such as triggers or default permissions.

The contents of the system collection (/db/system/config) mirror the hierarchical structure of the main collection. Configurations are shared by descendants in the hierarchy unless they have their own configuration (i.e. the configuration settings for the child collection override those set for the parent). If no collection-specific configuration file is created for any document, the global settings in the main configuration file, conf.xml, will apply by default. That being said, the conf.xml file should only define the default global index creation policy.

To configure a given collection - e.g. /db/foo - create a file collection.xconf and store it as /db/system/config/db/foo/collection.xconf. Note the replication of the /db/foo hierarchy inside /db/system/config/. Subcollections which do not have a collection.xconf file of their own will be governed by the configuration policy specified for the closest ancestor collection which does have such a file, so you are not required to specify a configuration for every collection. Note, however, that configuration settings do not cascade. If you choose to deploy a collection.xconf file in a subcollection, you must specify in that file all the configuration options you wish to have applied to that subcollection (and any lower-level subcollections without collection.xconf files of their own).
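
The lookup rule described above (closest ancestor wins, nothing is merged) can be sketched as follows. The function name governing_config and the set of stored configuration paths are hypothetical. The sketch only models the rule and does not call any eXist API.

```python
# Sketch of the lookup rule described above; the function name is hypothetical.
import posixpath

CONFIG_ROOT = "/db/system/config"


def governing_config(collection, stored_configs):
    """Return the path of the collection.xconf that governs `collection`.

    Walks from the collection up through its ancestors and returns the first
    configuration found. Settings are not merged along the way: the closest
    file applies as a whole. Returns None if no file is found.
    """
    current = collection
    while True:
        candidate = CONFIG_ROOT + current + "/collection.xconf"
        if candidate in stored_configs:
            return candidate
        if current in ("/db", "/", ""):
            return None  # no collection-level file: the defaults in conf.xml apply
        current = posixpath.dirname(current)


stored = {"/db/system/config/db/foo/collection.xconf"}
print(governing_config("/db/foo", stored))          # /db/system/config/db/foo/collection.xconf
print(governing_config("/db/foo/bar/baz", stored))  # /db/system/config/db/foo/collection.xconf
print(governing_config("/db/other", stored))        # None
```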

Note

Due to backward compatibility concerns, the file does not have to be called collection.xconf, which is now the preferred file name, but it must have the .xconf extension.

You can only have one .xconf file at each level.

5.1. Maintaining Indexes and Re-indexing

The eXist index system automatically maintains and updates indexes defined by the user. You therefore do not need to update an index when you update a database document or collection. eXist will even update indexes following partial document updates via XUpdate or XQuery Update expressions.

The only exception to eXist's automatic update occurs when you add a new index definition to an existing database collection. In this case, the new index settings will only apply to new data added to this collection, or any of its sub-collections, and not to previously existing data. To apply the new settings to the entire collection, you need to trigger a "manual reindex" of the collection being updated. You can re-index collections using the Java Admin Client. From the Admin menu, select File » Reindex Collection.

5.2. General Configuration Structure and Syntax

Index configuration files are standard XML documents that have their elements and attributes defined by the eXist namespace:

http://exist-db.org/collection-config/1.0

The following example shows a configuration document:

Example: Configuration Document

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Old full text index configuration. Deprecated. -->
        <fulltext default="none" attributes="false">
            <!-- Full text indexes -->
            <create qname="author"/>
            <create qname="title" content="mixed"/>
            <!-- "old" context-dependant configuration using the path attribute: -->
            <include path="booktitle"/>
        </fulltext>
        <!-- New full text index based on Lucene -->
        <lucene>
            <text qname="SPEECH">
                <ignore qname="SPEAKER"/>
            </text>
            <text qname="TITLE"/>
        </lucene>
        <!-- Range indexes -->
        <create qname="title" type="xs:string"/>
        <create qname="author" type="xs:string"/>
        <create qname="year" type="xs:int"/>
        <!-- "old" context-dependant configuration using the path attribute: -->
        <create path="//booktitle" type="xs:string"/>

        <!-- N-gram indexes -->
        <ngram qname="author"/>
        <ngram qname="title"/>
    </index>
</collection>

All configuration documents have the <collection> root element (in the http://exist-db.org/collection-config/1.0 namespace). These documents also have an <index> element directly below the root element, which encloses the index configuration. Only one <index> element is permitted in a document. Apart from the index configuration, the document may also contain non index-related settings, e.g. for triggers, which will not be covered here.

In the <index> element are elements that define the various index types. Each index type can add its own configuration elements, which are directly forwarded to the corresponding index implementation. The example above configures three different types of indexes: full text, range and ngram.

Namespaces

If the document to be indexed uses namespaces, you should add an xmlns attribute for each of the required namespaces to the <index> element:

Example: Using Namespaces

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:atom="http://www.w3.org/2005/Atom">
        <fulltext default="none" attributes="no">
            <create qname="atom:title"/>
        </fulltext>
        <create qname="atom:id" type="xs:string"/>
    </index>
</collection>

The example configuration above creates two indexes on a collection of atom documents. The two elements which should be indexed are both in the atom namespace and we thus need to declare a mapping for this namespace. Please note that the xmlns namespace attributes have to be specified on the <index> element, not the <create> or <fulltext> elements.

Range index configuration

Example: Range Index Configuration

<!-- Range indexes -->
<create qname="title" type="xs:string"/>
<create qname="author" type="xs:string"/>
<create qname="year" type="xs:int"/>
<!-- "old" context-dependant configuration using the path attribute: -->
<create path="//booktitle" type="xs:string"/>

A range index is configured by adding a <create> element directly below the root <index> node. The node to be indexed is specified through either a path or a qname attribute; the difference between the two is explained below.

As range indexes are type specific, the type attribute is always required. The type should be one of the atomic XML schema types, currently including xs:string, xs:integer and its derived types, xs:double, xs:float, xs:boolean and xs:dateTime. Further types may be added in the future. If the name of the type is unknown, the index configuration will be ignored and you will get a warning written into the logs.

Please note that the index configuration will only apply to the node specified via the path or qname attribute, not to descendants of that node. Consider a mixed content element like:

Example: Mixed Content Element

<mixed><s>un</s><s>even</s></mixed>

If an index is defined on <mixed> , the key for the index is built from the concatenated text nodes of element <mixed> and all its descendants, i.e. "uneven". The created index will only be used to evaluate queries on <mixed> , but not for queries on <s> . However, you can create an additional index on <s> without getting into conflict with the existing index on <mixed> .
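
The way the key is assembled can be reproduced with Python's standard ElementTree module. This is only an illustration of the concatenation rule. It does not use eXist, and the variable names are made up for the example.

```python
# Illustration of how the index key for a mixed-content element is formed.
import xml.etree.ElementTree as ET

mixed = ET.fromstring("<mixed><s>un</s><s>even</s></mixed>")

# Concatenate the text nodes of <mixed> and all of its descendants,
# in document order.
key_for_mixed = "".join(mixed.itertext())
print(key_for_mixed)  # uneven

# An additional index on <s> would hold one key per <s> element.
keys_for_s = ["".join(s.itertext()) for s in mixed.iter("s")]
print(keys_for_s)  # ['un', 'even']
```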

Configuration by path vs. configuration by qname

It is important to note the difference between the path and qname attributes used throughout the examples above. Both attributes are used to define the elements or attributes to which the index should be applied. However, the path attribute creates context-dependent indexes, while the qname attribute does not. The path attribute takes a simple path expression:

<create path="//book/title" type="xs:string"/>

The path expression looks like XPath, but it's really not. Index path syntax uses the following components to construct paths:

The example above creates a range index of type string on all <title> elements which are children of <book> elements, which may occur at an arbitrary position in the document tree. All other <title> elements, e.g. those being children of <section> nodes, are not indexed. The path expression thus defines a selective index, which is also context-dependent: we always need to look at the context of each <title> node before we can determine whether this particular title is to be indexed or not.

This kind of context-dependent index definition helps to keep the index small. Unfortunately, it also makes it hard for the query optimizer to properly rewrite the expression tree without missing some nodes. The optimizer needs to make an optimization decision at compile time, where the context of an expression is unknown or at least not exactly known (read the blog article to get the whole picture). This means that some of the highly efficient optimization techniques cannot be applied to context-dependent indexes!

We thus had to introduce an alternative configuration method which is not context-dependent. To keep things simple, we decided to define the index on the qname of the element or attribute alone and to ignore the context altogether:

<create qname="title" type="xs:string"/>

This results in an index being created on every <title> element found in the document node tree. Section titles will be indexed as well as chapter or book titles. Indexes on attributes are defined as above by prepending "@" to the attribute's name, e.g.:

<create qname="@type" type="xs:string"/>

defines an index on all attributes named "type", but not on elements with the same name.

Defining indexes on qnames may result in a considerably larger index, but it also allows the query engine to apply all available optimization techniques, which can improve query times by an order of magnitude. As so often, there's a trade-off between performance and storage space. In many cases, the performance win can be dramatic enough to justify an increase in index size.
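
The difference in coverage between the two attributes can be shown on a tiny document. The Python sketch below uses ElementTree to select the elements each definition would cover. The sample document and the two helper functions are invented for illustration and only model which nodes are selected, not how eXist evaluates index paths.

```python
# Which <title> elements each kind of index definition would cover.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<library>"
    "<book><title>Book title</title>"
    "<section><title>Section title</title></section>"
    "</book>"
    "</library>"
)


def titles_by_path(root):
    """Model of path="//book/title": only <title> children of a <book>."""
    return [t.text for book in root.iter("book") for t in book.findall("title")]


def titles_by_qname(root):
    """Model of qname="title": every <title>, whatever its parent."""
    return [t.text for t in root.iter("title")]


print(titles_by_path(doc))   # ['Book title']
print(titles_by_qname(doc))  # ['Book title', 'Section title']
```

The path-based selection has to inspect each title's parent, which is exactly the context dependency discussed above. The qname-based selection needs the element name alone.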

Important

To be on the safe side and to benefit from current and future improvements in the query engine, you should prefer qname over path - unless you really need to exclude certain nodes from indexing.

New Full Text Index

Please refer to the separate documentation.

6. Check Index Usage

It is sometimes a bit difficult to see if a range index is correctly defined or not. The simplest way to get some information on index usage is to set the priority for eXist's standard logger to TRACE. For example, change the <root> category in log4j.xml as follows:

Example: Configure log4j to Display Trace Output

<root>
    <priority value="trace"/>
    <appender-ref ref="console"/>
</root>

This enables trace and sends all log output to the console instead of the log files. For expressions that can benefit from a range index, you should now see messages like "Checking if range index can be used ..." or "Using range index for key...". Consider, for example, the following query:

//city[population > 100000]

Without further query optimizations, this query produces the following trace output:

Example: TRACE Output


04 Jan 2008 14:17:51,767 [main] TRACE (GeneralComparison.java [quickNodeSetCompare]:543) - found an index of type: xs:integer
04 Jan 2008 14:17:51,768 [main] TRACE (GeneralComparison.java [quickNodeSetCompare]:587) - Checking if range index can be used for key: 10000
04 Jan 2008 14:17:51,768 [main] TRACE (GeneralComparison.java [quickNodeSetCompare]:592) - Using range index for key: 10000

If you enabled query rewriting (by setting enable-query-rewriting="yes" in conf.xml), you will see additional messages from the optimizer:

Example: TRACE Output with Additional Query Optimization


04 Jan 2008 14:26:24,851 [main] TRACE (Optimize.java [visitGeneralComparison]:177) - exist:optimize: found optimizable: org.exist.xquery.GeneralComparison
04 Jan 2008 14:26:24,856 [main] TRACE (Optimize.java [before]:198) - exist:optimize: context step: descendant-or-self::country[child::population > 10000]
04 Jan 2008 14:26:24,856 [main] TRACE (GeneralComparison.java [preSelect]:260) - Using QName index on type xs:integer
04 Jan 2008 14:26:24,856 [main] TRACE (GeneralComparison.java [preSelect]:292) - Using QName range index for key: 10000
04 Jan 2008 14:26:24,865 [main] TRACE (Optimize.java [eval]:89) - exist:optimize: pre-selection: 1483
04 Jan 2008 14:26:24,866 [main] TRACE (Optimize.java [eval]:116) - Ancestor selection took 1
04 Jan 2008 14:26:24,866 [main] TRACE (Optimize.java [eval]:117) - Found: 49

Another possibility to see what's in your index is to use the util:index-keys function (for range indexes):

Example: Query to List Index Contents

declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
   <entry>
     <term>{$term}</term>
     <frequency>{$data[1]}</frequency>
     <documents>{$data[2]}</documents>
     <position>{$data[3]}</position>
   </entry>
};

util:index-keys(//name, "A", util:function(xs:QName("local:term-callback"), 2), 1000)

This query will show you the first 1000 keys (starting with the letter 'A') indexed for the element selected by the path expression //name, together with some information about each key. The first argument to index-keys specifies the node set for which index entries should be listed. The second argument contains a start value which also determines the index type. For example, if the start value is a string, eXist will only search for indexes configured with type xs:string; if it is an integer number, only indexes with type xs:integer will be considered. For a string index, you may also pass the empty string to retrieve all keys in the index.

Finally, to view the contents of the full text index instead of the range index, a different function needs to be used, text:index-terms :

Example: Query to List Full Text Index Contents

declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
   <entry>
     <term>{$term}</term>
     <frequency>{$data[1]}</frequency>
     <documents>{$data[2]}</documents>
     <position>{$data[3]}</position>
   </entry>
};

text:index-terms(//name, "A", util:function(xs:QName("local:term-callback"), 2), 1000)
                
November 2009
The eXist Project