Tuning the Database

1. Memory settings

Java always limits the maximum amount of memory available to a process, so eXist will not automatically use all of the memory available on your machine. The default limit is 512MB.

The maximum amount of memory Java will allocate is determined by the -Xmx parameter passed to Java on the command line. If you launch eXist via one of the shell or batch scripts, you need to change -Xmx in there.

On a Unix system, edit EXIST_HOME/bin/functions.d/eXist-settings.sh. Search for the JAVA_OPTIONS variable, which sets -Xmx for the server. The CLIENT_JAVA_OPTIONS variable does the same for the Java admin client. Instead of directly editing eXist-settings.sh, you may also override those variables globally in your own shell.

On Windows, the -Xmx setting is made in the main .bat files, e.g. EXIST_HOME\bin\startup.bat.

If you launch eXist as a service, all Java settings will be controlled by the service wrapper. In this case, the file to edit is EXIST_HOME/tools/wrapper/conf/wrapper.conf. Search for the following lines:

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=128

2. Cache settings

Each of the core database files and indexes has a page cache. The main purpose of this cache is to make sure that the most frequently used pages of the db files are kept in memory. If a file's cache becomes too small, eXist may start to unload pages just to reload them a few moments later. This "thrashing effect" results in an immediate performance drop, in particular while indexing documents.

All caches share a single memory pool, whose size is determined by the attribute cacheSize in the <db-connection> section of conf.xml. The global cache manager will dynamically grant more memory to caches while they are under load and free the memory used by idle caches.

Example: cacheSize parameter in conf.xml

<db-connection cacheSize="48M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="-1">

The default setting for cacheSize is very conservative (48M). It is adequate for smaller databases, but you may soon experience a performance drop when indexing more than a few hundred megabytes of XML data. Consider increasing cacheSize up to approximately 1/3 of the main memory available to Java (determined by the -Xmx parameter passed on the Java command line). If you are running eXist with other web applications in the same servlet engine, you may need to choose a smaller setting. Running out of memory will crash the database, so please be careful.
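
As a rough worked example of the 1/3 rule: assuming Java is started with -Xmx1024m (an assumed value, not a recommendation), one third of the heap is about 341 MB, so a cacheSize of around 340M would be the upper bound. All other attributes are copied unchanged from the example above.

```xml
<!-- Sketch only: assumes -Xmx1024m; 1024 MB / 3 is roughly 341 MB -->
<db-connection cacheSize="340M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="-1">
    <!-- child elements stay as they are in your conf.xml -->
</db-connection>
```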

The cacheSize is mainly relevant when storing/updating data. The effect on query speed should not be that big, unless some of the index caches are really much too small.

If you continue to experience performance issues while storing data, you may need to revisit your index configuration. Removing unused indexes will give more room to the other indexes. In particular, the full text index can grow very fast until it becomes a bottleneck. Try to disable the default full text index (see below).

The nodesBuffer attribute can be used to set eXist's temporary internal buffer to a fixed size. The buffer is used during indexing to cache nodes before they are flushed to disk. The default setting (nodesBuffer="-1") is to use as much memory as is available, but this can be problematic if you have to store large documents in a multi-user environment. For a production server, a good recommendation would be to set nodesBuffer to 1000 or less if there are many concurrent write operations.
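
For a production server with many concurrent writers, the recommendation above might look as follows in conf.xml. This is only a sketch: 1000 is the upper end of the suggested range, and every attribute other than nodesBuffer is simply the default shown earlier.

```xml
<!-- Sketch only: fixed-size nodes buffer instead of the default nodesBuffer="-1" -->
<db-connection cacheSize="48M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="1000">
    <!-- child elements stay as they are in your conf.xml -->
</db-connection>
```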

3. Index configuration

3.1. Don't rely on the default behaviour

eXist does NOT index any element or attribute values by default. It may create a full text index (see below), but this won't help with standard comparison operators or functions. Thus, when evaluating an expression like

//SPEECH[SPEAKER = "HAMLET"]

the query engine will fall back to a full scan over all <SPEAKER> elements in the db. This is very slow and limits concurrency. You should at least create a global index definition (in /db/system/config/db/collection.xconf) and add range indexes for the most frequently used comparisons.
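
A minimal sketch of such a global index definition, stored as /db/system/config/db/collection.xconf. The choice of SPEAKER and xs:string is an assumption based on the query above; use the elements and types your own queries actually compare against.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Range index so that SPEAKER = "HAMLET" can be answered by an index lookup -->
        <create qname="SPEAKER" type="xs:string"/>
    </index>
</collection>
```

Documents stored before the index was defined generally have to be reindexed before the index can be used.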

3.2. Disable any default indexes (pre 1.4)

If no other index configuration is found for a database collection, eXist will use the default settings specified in conf.xml. For older eXist versions, the default is to create a full text index on ALL elements and attributes in the database. The problem with this is that

  1. maintaining the default index costs performance and memory, which could be better used for other indexes. The index may grow very fast, which can be a destabilizing factor.
  2. the index is unspecific. The query engine cannot use it as efficiently as a dedicated index on a set of named elements or attributes (see below).

If you experience memory issues or observe a constantly decreasing performance while loading documents, tuning your indexes should be one of the first steps.
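
A sketch of a collection configuration that switches off the default full text index and defines a dedicated index instead. It assumes the <fulltext> element of the pre-1.4 collection configuration with its default and attributes attributes; please verify the exact attribute values against the documentation for your version. The @id range index is only a hypothetical placeholder for whatever your queries need.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Assumed pre-1.4 syntax: disable the default full text index
             on elements and attributes -->
        <fulltext default="none" attributes="false"/>
        <!-- Hypothetical dedicated range index replacing the default index -->
        <create qname="@id" type="xs:string"/>
    </index>
</collection>
```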

3.3. Prefer simple index definitions

Keeping your index definitions simple makes it easier for the query optimizer to resolve dependencies. In particular, avoid context-dependent index definitions unless you really have a reason to use them. A context-dependent index is defined on a path like /book/chapter/title, while general indexes are defined on a simple element or attribute qname:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Range indexes by qname -->
        <create qname="title" type="xs:string"/>
        <create qname="@ID" type="xs:string"/>

        <!-- context-dependent configuration using the path attribute: -->
        <create path="/book/title" type="xs:string"/>
    </index>
</collection>

Defining indexes on qnames may result in a larger index, but it also allows the query engine to apply all available optimization techniques, which can improve query times by an order of magnitude. Replacing a context-dependent index by a simple index on qname can thus result in a performance boost, thanks to eXist's new query-rewriting optimizer. Older versions of eXist did not offer those possibilities.

3.4. Use range indexes on strongly typed data or short strings

Range indexes work with the standard XQuery operators and string functions. Querying for something like

//book[year = 2000]

will always be slow without an index. As long as no index is defined, eXist has to scan over every year element in the db, casting its string value to an integer.

For queries on string content, range indexes work well for exact comparisons (author = 'Joe Doe') or regular expressions (matches(author, "^Joe.*")), though you may also consider using a full text index in the latter case. Note, however, that range indexes on strings are case-sensitive or, more precisely, sensitive to the default collation. If you need case-insensitive queries, consider an n-gram index.
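
A sketch of range index definitions matching the two examples above. The element names year and author come from the queries; the types are assumptions about the data, in particular that all year values are valid integers.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Typed index: supports numeric comparisons such as year = 2000 -->
        <create qname="year" type="xs:integer"/>
        <!-- String index: supports author = 'Joe Doe' and matches(author, "^Joe.*") -->
        <create qname="author" type="xs:string"/>
    </index>
</collection>
```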

3.5. Consider an n-gram index for exact substring queries on longer text sequences

While range indexes tend to become slow for substring queries (like contains(title, "XSLT 2.0")), an n-gram index is nearly as fast as a full text index and, unlike the latter, also indexes whitespace and punctuation. ngram:contains(title, "XSLT 2.0") will only match titles containing the exact phrase "XSLT 2.0". N-gram indexes are case-insensitive.
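
A minimal sketch of an n-gram index definition for the title example above. The qname title is taken from the query; adapt it to the elements you want to search.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- N-gram index on title, used by ngram:contains(title, "XSLT 2.0") -->
        <ngram qname="title"/>
    </index>
</collection>
```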

3.6. Choose a full text index for tokenizable text where whitespace/punctuation is mostly irrelevant

The full text index is fast and should be used whenever you need to query for a sequence of separate words or tokens in a longer text. It can sometimes even be faster to post-process the returned node set and filter out wrong matches than using a much slower regular expression.

eXist 1.4 offers a new full text index which is based on Apache Lucene. It provides better performance and overall stability than the built-in index.
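
A sketch of a Lucene index definition for eXist 1.4. The element name SPEECH is an assumption borrowed from the earlier Shakespeare example, and the analyzer shown is Lucene's general-purpose StandardAnalyzer; other analyzers may suit your data better.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <!-- StandardAnalyzer is Lucene's general-purpose analyzer -->
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- Index the text content of every SPEECH element -->
            <text qname="SPEECH"/>
        </lucene>
    </index>
</collection>
```

With this index in place, a query such as //SPEECH[ft:query(., 'love')] would be answered from the Lucene index through the ft:query function.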

4. Writing Queries

4.1. Prefer short paths

eXist uses indexes to directly locate an element or attribute by its name. It doesn't need to traverse the entire document tree. This means that the direct selection of a node through a single descendant step is faster than walking down the child axis. For example:

a/b/c/d/e/f

will be slower than

a//f

The first expression requires 6 (!) index lookups while the second just needs two. The same rules apply to the ancestor axis, e.g. f/ancestor::a.

4.2. Always process the most selective filter/expression first

If you need multiple steps to select certain nodes from a larger node set, try to process the most selective steps first. The earlier you can reduce the node set to be processed, the faster your query will run. For example, assume we have to find publications written by "Bjarne Stroustrup" after the year 2000:

/dblp/*[year > 2000][author = 'Bjarne Stroustrup']

The database has 568824 records matching year > 2000, but only 41 of them were written by Stroustrup. Moving the filter on the author to the front of the expression should thus result in better performance:

/dblp/*[author = 'Bjarne Stroustrup'][year > 2000]

It would certainly be nice if eXist could do this kind of optimization automatically. We are working on it. eXist recognizes more and more cases for intelligent query rewritings. For example, it already transforms the boolean expression

/dblp/*[author = 'Bjarne Stroustrup' and year > 2000]

into a multi filter step as shown above.

4.3. Allow eXist to process large node sets in one step

The query engine is optimized to process a path expression in a single operation. For instance, the XPath:

//A/*[B = 'C']

is evaluated in a single operation for all context items. It doesn't make a difference if the input set comes from a single large document, includes all the documents in a specific collection or even the entire database. The logic of the operation remains the same.

However, "bad" queries can force the query engine to partition the input sequence and process it in item-by-item mode. Several examples of bad uses of FLWOR expressions are given below. Those should be easy to understand. Other cases are not so obvious. For example, most function calls will also force the query engine into item-by-item mode:

//A/*[f:process(B) = 'C']

The function has to be called once for every instance of B. Normally, eXist would try to evaluate the general comparison in a single step (assuming there's a usable index on B). However, it now needs to call a (non-optimized) function for each B and will thus need to process the entire comparison once for every context item.

There are functions to which the above does not apply. This includes most functions which operate on indexes, e.g. contains, matches, starts-with, ngram:contains, and the like. They are optimized so eXist only needs to call them once to process the entire context set. For example, using ngram:contains as below is perfectly ok:

//A/*[ngram:contains(B, 'C')]

while

//A/*[ngram:contains(f:process(B), 'C')]

will again force eXist into step-by-step evaluation.

4.4. Prefer XPath Predicates Over Where Expressions

This is a variation of the problems discussed above. Many users tend to formulate SQL-style queries using an explicit "where" clause:

Example:

for $e in //entry 
where $e/@type = 'subject'
return $e

could be rewritten as:

Example: Equivalent query using XPath predicate

for $e in //entry[@type = 'subject'] 
return $e

The "for"..."where" expression forces the query engine into a step-by-step iteration over the input sequence, testing each instance of $e against the where expression. Possible optimizations are lost.

Contrary to this, the XPath predicate expression can be processed in one single step, making best use of any available indexes. Sure, there are use cases which cannot be handled without using "where", e.g. joins between multiple documents. That's ok. However, you shouldn't use "where" if you can replace it by a simple XPath.

Internally, the query engine will always try to process a "where" clause like an equivalent XPath with predicate. However, it only detects the simple cases.

4.5. Use general comparisons to compare an item to a list of alternatives

General comparisons are very handy if you need to compare a given item to several alternative values. For example, you could use an "or" to find all <a> elements that have a <b> child whose string value is either "c" or "d".

//a[b eq 'c' or b eq 'd']

A shorter way to express this is:

//a[b = ('c', 'd')]

The comparison will be true if b's string value matches one of the strings in the right-hand sequence. If an index is defined on <b>, eXist will need only one index lookup to find all <b> elements matching the comparison. The equivalent "or" expression needs 2 separate index lookups.

4.6. Querying Multiple Collections

If you need to query multiple collections which are on the same level of the collection hierarchy, you could use a for loop to iterate over the collection paths. However, this forces the query engine to process the remaining expression once for each collection. It is thus better to construct the initial node set once and use it as input for the main expression. For example:

Example: Nested for loop

for $path in ('/db/a', '/db/b')
for $result in collection($path)//test[...]
return
    ...

will be less efficient than:

Example: Single loop over initial node set

let $docs :=
    for $path in ('/db/a', '/db/b') return collection($path)
for $result in $docs//test[...]
return
    ...

4.7. Use the ancestor or parent axis instead of a top-down approach

eXist can navigate the ancestor axis as fast as the descendant axis. It can thus be more efficient to build a query bottom-up instead of top-down. Here's a top-down example:

Example: Top-down query using nested for

for $section in collection("/db/articles")//section
for $match in $section//p[contains(., "XML")]
return
    <match>
        <section>{$section/title/text()}</section>
        {$match}
    </match>

This query walks through a set of sections and queries each of them for paragraphs containing the string "XML". For every match it outputs the title of the section, followed by the matching paragraph. Note that sections without any matching paragraph produce no output, since the inner for clause binds nothing for them.

The nested for loop again forces the query engine into a step-by-step iteration over the section elements. We can avoid this by using a bottom-up approach:

Example: Bottom-up query using ancestor axis

for $match in collection("/db/articles")//section//p[contains(., "XML")]
return
    <match>
        <section>{$match/ancestor::section/title/text()}</section>
        {$match}
    </match>

The second query should be several times faster than the first one.

4.8. Match regular expressions against the start of a string

Function fn:matches returns true if any substring of its argument string matches the regular expression. The query engine thus needs to scan all index entries as the match could be at any position of an entry.

You can reduce the range of entries to be scanned by anchoring your pattern at the start of a string (where applicable):

fn:matches($str, "^XQuery")

4.9. Use fn:id to look up xml:id attributes

eXist automatically indexes all xml:id attributes and other attributes with type ID as declared in a DTD (only if validation is enabled). This automatic index is used by the standard id functions and provides a fast way to look up an element. For example,

id("sect1")/head

locates the element with id "sect1" and returns its <head> child. This is done through a fast index lookup. However, please note that the equivalent expression

//section[@xml:id = 'sect1']/head

will NOT use the id index (you will need to declare an extra range index for that).
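
If you need the predicate form, a sketch of the extra range index could look as follows. It assumes the ids are indexed as plain strings; the xml prefix is predefined and needs no namespace declaration.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Range index so that @xml:id = 'sect1' can be answered by an index lookup -->
        <create qname="@xml:id" type="xs:string"/>
    </index>
</collection>
```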

Please also be aware that long xml:id values cost performance, as some users working with large databases have reported.

4.10. Defer output generation until really needed

When working with large result sets within a query, it is important to understand the differences between stored nodes and in-memory XML: if a node set consists of nodes which are stored in the database, eXist will usually never load those nodes into memory. Instead, it uses lightweight references for most processing steps. This way, even large node sets do not consume too much memory.

However, all new XML nodes created within an XQuery will reside in memory and you should be aware that the constructed XML fragments need to fit into the memory available to the Java VM. If a query generates too many nodes, the XQuery watchdog (if enabled) may step in and kill it.

A typical scenario: a query selects a large number of documents from the database and then iterates through each to generate some HTML output for display. However, only the first 10 results are really returned to the user, the rest is stored into an HTTP session for later viewing.

In this case it is important to limit the HTML generation to those items which are actually returned. Though the source XML documents may be large, eXist will not load them into memory, but just keep references to them. Storing those references into a session does not consume much memory.

Example: Limiting output generation

(: select some nodes in the db; this path is only a placeholder :)
let $nodes := collection("/db/articles")//section
(: store the result into the session; only node references are kept :)
let $session := session:set-attribute("result", $nodes)
(: only return the first 10 nodes :)
for $node in subsequence($nodes, 1, 10)
return
    (: generate HTML for output; replace with your own, more complex markup :)
    <div>{$node/title/text()}</div>

Please note also that eXist uses lazy evaluation when constructing new XML fragments. For example:

<book>{$node/title}</book>

Assuming that $node references a node in the database, the query engine will not copy $node/title into the constructed <book> element. Instead, only a reference is inserted. The reference will not be expanded until the fragment is serialized or queried. So if you only need to wrap selected parts of an element into a new fragment, memory consumption will not be too high.

5. Known Issues in the 1.2.x Series

5.1. Queries on constructed XML fragments

The 1.2.x series has a limitation with respect to queries on constructed, in-memory XML fragments: before evaluating a path expression on an in-memory node, the query engine stores the node into a persistent document created in a temporary collection of the db. A very simple example is the query:

(<root><node/></root>)//node

Executing this expression should leave a log message similar to the following in your server logs (if you didn't disable debug messages):

06 Mar 2009 16:46:07,786 [main] DEBUG (XQueryContext.java
[storeTemporaryDoc]:2242) - Stored: 3: /db/system/temp/534d7dd16f9b97e6cb47054140e544a1.xml

The problem with those temporary fragments is that they need to be cleaned up from time to time. This has a negative influence on performance, and the cleanup process can even block the db for a while. Also, experience shows that the overall stability of the db suffers if too many temporary fragments have to be stored and cleaned up in a short time.

eXist 1.4 fixes those issues completely. It has a redesigned query engine, which is able to directly operate on in-memory nodes without transforming them into persistent nodes. No temporary documents need to be stored. As a consequence, queries on in-memory fragments will be much faster with 1.4.0 and have no effect on stability.

If you experience performance or stability issues with the 1.2.x series, we recommend checking your log files for messages similar to the one given above. If you see lots of them, there are two possibilities:

  1. upgrade to 1.4.
  2. find the XQuery expressions which generate the temporary fragments. In many cases, it is easily possible to rewrite them. At least try to reduce the amount of temp documents.
November 2009
Wolfgang M. Meier
wolfgang at exist-db.org