Server Configuration

1. Overview

This section deals with the configuration of the eXist server. The main configuration file for eXist is called conf.xml, which is loaded from different directories depending on the server set-up (see Server Deployment for more information).

Specifically, if you installed the standalone eXist distribution with the installer, the conf.xml file located in the root directory of the distribution (as specified by the system property exist.home) will be loaded by default. On the other hand, if eXist is installed as a web application (packaged in a .war file) in a servlet engine like Tomcat, conf.xml is read from the WEB-INF directory of the web application.

Why is the configuration file placed in two separate locations? The reason is that eXist normally has no access to files outside the context in which it is running when it is deployed as part of a web application. Therefore, when eXist is deployed in this way, the configuration is read from the WEB-INF directory.

2. eXist Configuration: Editing conf.xml

The configuration file conf.xml is divided into sections, each configured by its own element. These include:

<db-connection>

Configures the storage back-end.

<serializer>

Default settings for the serializer (external data representation).

<indexer>

Controls the indexing process.

<xupdate>

Configuration options related to XUpdate processing.

The following sections describe the attributes and child elements of the above elements.

2.1. <db-connection>

This element contains basic default storage settings for eXist, including memory and system limits. Only one <db-connection> should be specified. An example configuration for the native back-end is shown below:

Example: Basic <db-connection> Entry

<db-connection cacheSize="48M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="-1">
      <pool min="1" max="15" sync-period="240000" wait-before-shutdown="60000"/>
      <!--default-permissions collection="0775" resource="0775" /-->
      <recovery enabled="yes" sync-on-commit="no" group-commit="no" size="100M" 
            journal-dir="webapp/WEB-INF/data"/>
      <!-- security class="org.exist.security.LDAPSecurityManager" /-->
      <watchdog query-timeout="-1" output-size-limit="10000"/>
      <default-permissions collection="0775" resource="0775"/>
</db-connection>

<db-connection> Attributes

database

This attribute selects a database system type. Since relational database back-ends are no longer supported by the current release of eXist, only "native" and "native_cluster" are available.

files

This attribute specifies the directory where the native back-end will keep its database files, and so it is necessary that this directory exists. If a relative path is specified, it will be based on the root directory as defined in the exist.home system property. If this data directory does not have write permissions (see User Authentication and Access Control), eXist will internally switch to read-only mode such that any attempt to change the database will throw an exception.

cacheSize

This attribute sets the maximum amount of main memory used by all page buffers (i.e. assuming all page buffers are at full capacity). The database uses this parameter to calculate the maximum size of each internal cache. You can increase this value if your system allows for greater memory use.

While indexing documents, eXist will reserve the amount of memory specified in cacheSize - even if not all caches are filled - and will not use it for temporary data.

The cacheSize should not be more than half of the JVM heap size (set by the JVM -Xmx parameter). If the JVM heap is smaller than 512 megabytes, cacheSize should be even smaller, e.g. one third of the heap.
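
As a sketch of the sizing rule above, assume the JVM is started with -Xmx1024m (an assumed value, not a recommendation). A cacheSize of up to 512M would then stay within the half-of-heap guideline. The fragment below uses a more conservative 256M; all other attributes are copied from the basic example.

Example: cacheSize Sized Against an Assumed 1024 MB Heap

<!-- Assumes the JVM runs with -Xmx1024m; 256M is well below half of the heap -->
<db-connection cacheSize="256M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="-1">
      <!-- child elements as in the basic example -->
</db-connection>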

collectionCache

Determines the size of the collection cache, which is a separate caching space. Usually this setting does not need to be changed unless you really have more than a few thousand collections in the db. Increase it carefully, maybe up to 128M.

pageSize

This specifies the number of bytes used for internal data and B-tree pages. This should be equal to or a multiple of the page size used by the filesystem (usually a multiple of 4096).

nodesBuffer

Size of the temporary buffer used by eXist for caching index data while indexing a document. If set to -1, eXist will use the entire free memory to buffer index entries and will flush the cache once the memory is full.

If set to a value > 0, the buffer will be fixed to the given size. The specified number corresponds to the number of nodes the buffer can hold, in thousands. Usually, a good default could be nodesBuffer="1000".

The default setting, nodesBuffer="-1", can be problematic if you frequently need to store large documents in a multi-user environment. In this case, the index operation may consume most of the memory resources, which means that concurrent threads will be slowed down or may even come to a halt.
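
For a multi-user installation that regularly stores large documents, the buffer can be capped instead. The following sketch applies the nodesBuffer="1000" value suggested above (roughly one million nodes). The other attributes are taken from the basic example, and the right size for a given installation is an assumption to verify by testing.

Example: Fixed-Size nodesBuffer

<!-- nodesBuffer is given in thousands of nodes: 1000 = about 1,000,000 nodes -->
<db-connection cacheSize="48M" collectionCache="24M" database="native"
        files="webapp/WEB-INF/data" pageSize="4096" nodesBuffer="1000">
      <!-- child elements as in the basic example -->
</db-connection>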

<pool>

These settings control the internal database connection pool.

<pool> Attributes

min | max

These options specify the minimum and maximum size of the connection pool. This pool restricts the number of parallel (basic) operations that can be executed by the database. Settings should be somewhere between 1 and 20. (Please note that this has nothing to do with the HTTP and XML-RPC server settings - these servers have their own connection pools.)

sync-period

This option defines how often the database will flush its internal buffers to disk (in milliseconds). The sync-thread will interrupt normal database operation after the specified time and write all dirty pages to disk. It also writes a checkpoint to the transaction log. In case of a database crash, only transactions which started after the last checkpoint have to be redone or rolled back. The sync-period should thus not be set too long.

wait-before-shutdown

This option specifies the maximum amount of time (in milliseconds) that the database will allow for any running processes to complete upon database shutdown. After that, eXist will try to kill the remaining processes.

If wait-before-shutdown is set to a positive number, eXist will stop the db after the specified timeout, even if there were still running database operations. In this case, no checkpoint will be written to the transaction log. If there were any open transactions, eXist will trigger a recovery run after restart.

If wait-before-shutdown is set to -1, eXist will not shut down before all active database operations have returned. This is a safe setting, but it may require manual intervention to stop the JVM.
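
The following sketch shows a <pool> entry that favors a clean shutdown over a fast one. The values are illustrative assumptions: a sync every two minutes and no shutdown timeout.

Example: <pool> Entry Without a Shutdown Timeout

<!-- sync-period: flush dirty pages and write a checkpoint every 120000 ms (2 minutes) -->
<!-- wait-before-shutdown="-1": wait for all active operations; the JVM may need to be stopped manually -->
<pool min="1" max="20" sync-period="120000" wait-before-shutdown="-1"/>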

<recovery>

This element configures the journaling and recovery of the database. With recovery enabled, the database is able to recover from an unclean database shutdown due to, for example, power failures, OS reboots, and hanging processes. For this to work correctly, all database operations must be logged to a journal file. The location, size and other parameters for this file can be set using the <recovery> element.

<recovery> Attributes

enabled

If this attribute is set to yes, automatic recovery is enabled.

size

This attribute sets the maximum allowed size of the journal file. Once the journal reaches this limit, a checkpoint will be triggered and the journal will be cleaned. However, the database waits for running transactions to return before processing this checkpoint. In the event one of these transactions writes a lot of data to the journal file, the file will grow until the transaction has completed. Hence, the size limit is not enforced in all cases.

journal-dir

This attribute sets the directory where journal files are to be written. If no directory is specified, journal files are written to the data directory.

sync-on-commit

This attribute determines whether or not to protect the journal during operating system failures. That is, it determines whether the database forces a file-sync on the journal after every commit. If this attribute is set to "yes", the journal is protected against operating system failures. However, this will slow performance - especially on Windows systems. If set to "no", eXist will rely on the operating system to flush out the journal contents to disk. In the worst case scenario, in which there is a complete system failure, some committed transactions might not have yet been written to the journal, and so will be rolled back.

group-commit

If set to "yes", eXist will not sync the journal file immediately after every transaction commit. Instead, it will wait until the current file buffer (32 KB) is full. This can speed up eXist on systems where a file sync is an expensive operation (mainly Windows XP; not necessary on Linux).

However, group-commit="yes" increases the chance that an already committed operation is rolled back after a database crash.

force-restart

Try to restart the db even if crash recovery failed. This is dangerous because there might be corruptions inside the data files. The transaction log will be cleared, all locks removed and the db reindexed.

Set this option to "yes" if you need to make sure that the db is online, even after a fatal crash. Errors encountered during recovery are written to the log files. Scan the log files to see if any problems occurred.

consistency-check

If set to "yes", a consistency check will be run on the database if an error was detected during crash recovery. This option requires force-restart to be set to "yes", otherwise it has no effect.

The consistency check outputs a report to the directory {files}/sanity and if inconsistencies are found in the db, it writes an emergency backup to the same directory.
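
The attributes above can be combined as in the following sketch, which keeps the database online after a failed recovery and requests a consistency report. The values are illustrative, and force-restart carries the risks described above.

Example: <recovery> Entry With Forced Restart and Consistency Check

<!-- consistency-check only has an effect because force-restart is also "yes" -->
<recovery enabled="yes" sync-on-commit="no" group-commit="no" size="100M"
      journal-dir="webapp/WEB-INF/data"
      force-restart="yes" consistency-check="yes"/>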

<watchdog>

This is the global configuration for the query watchdog. The watchdog monitors all query processes, and can terminate any long-running queries if they exceed one of the predefined limits. These limits are as follows:

<watchdog> Attributes

query-timeout

This attribute sets the maximum amount of time (expressed in milliseconds) that a query can take before it is killed. The setting can be overridden in an XQuery by specifying the option exist:timeout:

declare option exist:timeout "time-in-ms";

Please check the documentation on XQuery options.

output-size-limit

This attribute limits the size of XML fragments constructed using XQuery, and thus sets the maximum amount of main memory a query is allowed to use. This limit is expressed as the maximum number of nodes allowed for an in-memory DOM tree. The purpose of this option is to avoid memory shortages on the server in cases where users are allowed to run queries that produce very large output fragments. The setting can be overridden in an XQuery by specifying the option exist:output-size-limit:

declare option exist:output-size-limit "size-hint";
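
As an illustration, the following sketch sets a global two-minute timeout while keeping the node limit from the basic example. The timeout value is an assumption chosen for the example.

Example: <watchdog> Entry With a Query Timeout

<!-- Terminate queries running longer than 120000 ms (2 minutes);
     limit in-memory fragments to 10000 nodes -->
<watchdog query-timeout="120000" output-size-limit="10000"/>

A single long-running query could then raise its own limit with the exist:timeout option described above.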

<default-permissions>

Specifies the default permissions for all resources and collections in eXist (see User Authentication and Access Control). When this element is not configured, the default mode (the "mod", as set by the Unix "chmod" command) is 0775 for both the resource and collection attributes. A different default value may be set for a database instance, and local overrides are also possible.
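
The following sketch shows a more restrictive default, assuming that the owner and group should keep full access while all other users get none. The mode 0770 is an illustrative value, not an eXist default.

Example: Restrictive <default-permissions> Entry

<!-- 0770: all permissions for owner and group, none for other users -->
<default-permissions collection="0770" resource="0770"/>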

<security>

The <security> element in the <db-connection> node is used to select the security manager class and control the database of users and groups.

<security> Attributes

class

This attribute is required. It specifies the name of a Java class that implements the org.exist.security.SecurityManager interface, as in the following example:

Example: <security> class Attribute (LDAP)

<security class="org.exist.security.LDAPSecurityManager" />

eXist is distributed with the following built-in security manager implementations:

org.exist.security.XMLSecurityManager

Stores the user information in the database. This is the default manager if the <security> element is not included in <db-connection> .

org.exist.security.LDAPSecurityManager

Retrieves the users and groups from the LDAP database. This requires additional configuration parameters, which are described in the LDAP Security Manager documentation.

password-encoding

Password encoding can be set to one of the following types:

  1. plain - Stores passwords as plain text, without hashing.
  2. md5 (default) - Applies the MD5 algorithm to hash passwords.
  3. simple-md5 - Applies a simplified MD5 algorithm to hash passwords.

password-realm

The realm to use for basic auth or http-digest password challenges.
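
Putting these attributes together, a <security> entry for the default manager might look like the following sketch. The realm name "exist" is a hypothetical value chosen for the example.

Example: <security> Entry With Password Settings

<!-- "exist" is a hypothetical realm name -->
<security class="org.exist.security.XMLSecurityManager"
      password-encoding="md5" password-realm="exist"/>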

2.2. <indexer>

This element sets parameters on how XML files are to be indexed by eXist. An example configuration is shown below:

Example: Specifying Indexer-Options in conf.xml

<indexer caseSensitive="no"
	suppress-whitespace="both" index-depth="1"
	tokenizer="org.exist.storage.analysis.SimpleTokenizer"
	validation="no">
	
	<modules>
        <module id="ngram-index" class="org.exist.indexing.ngram.NGramIndex"
            file="ngram.dbx" n="3"/>
        <!--
        <module id="spatial-index" class="org.exist.indexing.spatial.GMLHSQLIndex"
            connectionTimeout="10000" flushAfter="300" />            
        -->
		<!-- The full text index is always required and should
			 not be disabled. We still have some dependencies on
             this index in the database core. These will be removed
             once the redesign has been completed. -->
		<module id="ft-legacy-index" class="org.exist.fulltext.FTIndex"/>
    </modules>
        
    <stopwords file="stopword"/>
    
	<!-- Default index configuration -->
    <index>
        <fulltext default="all" attributes="false">
            <exclude path="/auth"/>
        </fulltext>
    </index>

    <entity-resolver>
	    <catalog file="samples/xcatalog.xml"/>
    </entity-resolver>
</indexer>

<indexer> Attributes

caseSensitive

Specifies whether string comparisons are to be case-sensitive. This option applies to XPath equality tests (i.e. the "=" operator), as well as functions such as contains(), starts-with() and ends-with(). This setting does not apply to operators or functions of the fulltext index (e.g. "&=", "|=", "near()") or the n-gram index, which are never case-sensitive.

Setting caseSensitive="no" violates the XQuery specs, which make these comparisons case-sensitive by default! The option should be regarded as a dirty workaround, which will be removed in the future. Please use the n-gram or full-text indexes for case-insensitive queries or - if that is impossible - specify a collation.

suppress-whitespace

Specifies how the <indexer> is to treat whitespace at the start or end of a character sequence. This option ONLY applies to newly stored files, and therefore changing it has no effect on previously stored documents. Possible values for this attribute are:

  1. leading - Suppresses leading whitespace.
  2. trailing - Suppresses trailing whitespace.
  3. both - Suppresses leading and trailing whitespace.
  4. none - Preserves all whitespace.

Note that suppressing whitespace at the start or end of character sequences does effectively change the document!

preserve-whitespace-mixed-content

Controls how ignorable whitespace is handled. If set to "no", ignorable whitespace (e.g. between the end tag of one element and the start tag of another) will not be stored in the persistent DOM. This leads to a smaller DOM and usually increases the readability of the XML. Ignorable whitespace is not considered part of the logical document model, so removing it doesn't change the document.

tokenizer

This attribute specifies the Java class used to tokenize a string into a sequence of single words or tokens, which are stored in the fulltext index. Currently only the SimpleTokenizer is available.

index-depth

This attribute specifies the depth of the DOM index, or the tree level up to which elements will be added to the index. For example, a value of "2" results in the document root node and all its child elements being indexed; a value of "1" only indexes the root node.

The DOM index maps unique node identifiers to the nodes' storage locations in the DOM file. Generating this index is time- and memory-consuming. It is furthermore primarily needed to access nodes by their unique node identifier - for example, when serializing XML data for query results or XUpdate - which are operations not normally considered time-critical. Moreover, most XPath expressions can do without this index since they use short-cuts to access the node directly.

Beginning with version 0.9, only top-level elements are added to the DOM index, whereas attributes and text nodes are always excluded. This results in much smaller index sizes and, consequently, a smaller dom.dbx file size. Usually, setting the index-depth to a value of "2" offers a reasonable compromise between index size and performance. However, if your documents are deeply structured, you might consider increasing this setting to a level of 3, 4 or 5. For example, if the longest path from the document root to an element node has more than ten node levels, an index-depth setting of 4 or 5 would probably help to increase overall query performance for some types of queries.
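
For a collection of deeply nested documents, the indexer from the example above could be adjusted as sketched here. The depth of 4 is an assumption based on the guidance above; measure query performance before settling on a value.

Example: Increased index-depth

<!-- index-depth="4" is an illustrative value for deeply structured documents -->
<indexer caseSensitive="no"
	suppress-whitespace="both" index-depth="4"
	tokenizer="org.exist.storage.analysis.SimpleTokenizer"
	validation="no">
	<!-- child elements as in the example above -->
</indexer>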

validation

This attribute defines the default setting for the validation of documents by the XML parser. If it is set to "no", documents will never be validated against an existing DTD or schema. A value of "auto" will leave document validation to the SAX parser (i.e. the Xerces parser).

modules

This section configures optional indexing modules. Beginning with version 1.2, eXist features a modularized indexing architecture, which allows new indexes to be plugged into the indexing pipeline. The <modules> section lists and configures the indexes that will be available to the database:

Example: Configuring Index Modules

<modules>
    <module id="ngram-index" class="org.exist.indexing.ngram.NGramIndex"
        file="ngram.dbx" n="3"/>
    <!--
    <module id="spatial-index" class="org.exist.indexing.spatial.GMLHSQLIndex"
        connectionTimeout="10000" flushAfter="300" />            
    -->
	<!-- The full text index is always required and should
		 not be disabled. We still have some dependencies on
         this index in the database core. These will be removed
         once the redesign has been completed. -->
	<module id="ft-legacy-index" class="org.exist.fulltext.FTIndex"/>
</modules>

The only common attributes for each <module> element are class and id. The other attributes as well as any nested elements are specific to the index implementation. Detailed information is available in the document on Configuring Database Indexes.

stopwords

The file attribute of this element points to a file containing a list of stopwords. Note that stopwords are NOT added to the fulltext index.

index

This configuration element specifies the default index settings. These settings are applied if neither the collection nor any of its ancestors provide a collection configuration. Configuring indexes via the default settings is not recommended. If you need a global collection configuration, store one for the root collection /db. For more information, read the Configuring Indexes documentation.

2.3. <scheduler>

This section is used to configure asynchronous jobs with eXist's internal scheduler. Three types of jobs are supported:

startup jobs

Startup jobs are executed once during database startup, but before the database becomes available. These jobs are synchronous. The database is blocked to outside requests and no other operations will run at the same time.

system jobs

System jobs require the database to be in a consistent state. The scheduler will run them in an exclusive environment. Once the job is triggered, the database will block all new requests and wait for running operations to complete. It then executes the job. All other database operations will be stopped until the job returns or throws an exception. Any exception will be caught and a warning written to the log.

user jobs

User jobs may be scheduled at any time and may be mutually exclusive or non-exclusive.

Below is an example which configures a BackupSystemTask:

Example: Configuring a BackupSystemTask

<scheduler>
    <job type="system" class="org.exist.storage.BackupSystemTask" cron-trigger="0 0 */6 * * ?">
        <parameter name="dir" value="backup"/>
        <parameter name="suffix" value=".zip"/>
        <parameter name="prefix" value="backup-"/>
        <parameter name="collection" value="/db"/>
        <parameter name="user" value="admin"/>
        <parameter name="password" value=""/>
        <parameter name="zip-files-max" value="28"/>
    </job>
</scheduler>

Each job is configured in a <job> element which accepts a number of standard attributes:

<job> Attributes

type

The type of the job to schedule. Must be either "startup", "system" or "user".

class

If the job is written in Java then this should be the name of the class that extends either

  • org.exist.scheduler.StartupJob
  • org.exist.storage.SystemTask
  • org.exist.scheduler.UserJavaJob

xquery

If the job is written in XQuery (not suitable for system jobs), this should be the path to an XQuery stored in the database, e.g. /db/myCollection/myJob.xql. XQuery jobs are initially launched under the guest account, although the running XQuery may switch permissions through calls to xmldb:login().

cron-trigger

Use this attribute to define a firing pattern for the job using cron-style syntax. For a periodic job, use the period attribute instead. Not applicable to startup jobs.

period

Can be used to define an explicit period for firing the job instead of a Cron style syntax. The period should be in milliseconds. Not applicable to startup jobs.

delay

Can be used with a period to delay the start of a job. If unspecified, jobs will start as soon as the database and scheduler are initialised.

repeat

Can be used with a period to define for how many periods a job should be executed. If unspecified, jobs will repeat for every period indefinitely.

Every job can take additional parameters, which are passed as name/value pairs (see example above).
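
As a sketch of a periodic user job written in XQuery, the fragment below uses the example path from the xquery attribute description together with assumed timing values: run every 60 seconds, for 10 periods, after an initial delay. The delay is assumed to use the same unit as period (milliseconds), and the parameter name and value are hypothetical.

Example: Configuring a Periodic XQuery User Job

<scheduler>
    <!-- period is in milliseconds; delay is assumed to be in milliseconds as well -->
    <job type="user" xquery="/db/myCollection/myJob.xql"
        period="60000" delay="30000" repeat="10">
        <!-- hypothetical parameter passed to the job as a name/value pair -->
        <parameter name="target" value="/db/myCollection"/>
    </job>
</scheduler>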

2.4. <serializer>

The serializer is responsible for serializing XML documents or document fragments back into XML. This configuration element defines default settings for various parameters, which can also be specified programmatically. All settings can be overridden by XQuery serialization options.

<serializer> Attributes

enable-xinclude

This attribute determines whether <xinclude> tags are to be expanded during serialization. Setting the value to "false" will leave <xinclude> tags unexpanded.

enable-xsl

This attribute (when set to "true") tells the serializer to pass its output to an XSL stylesheet when it encounters an XSL processing-instruction at the start of the document.

add-exist-id

This attribute tells the serializer to add debug information to each element expressed as additional attributes. This information includes the internal identifier of the node and source document. These are the accepted values:

  1. all - Adds debug information to every node in the output.
  2. element - Adds debug information to top-level elements only.
  3. none (default) - Disables debugging feature.

indent

The serializer defaults to pretty-print the resulting XML source code. Set this option to "no" to disable pretty-printing.

match-tagging-elements

The database can highlight matches in the text content of a node by tagging the matching text string with <exist:match>. This only works for XPath expressions using the fulltext index. Set the parameter to "yes" to enable this feature.
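
A <serializer> entry for a debugging session might look like the following sketch. It uses only values described above: debug attributes on top-level elements and no pretty-printing. Attributes that are left out are assumed to keep their defaults.

Example: <serializer> Entry for Debugging

<!-- add-exist-id="element": debug information on top-level elements only -->
<!-- indent="no": disable pretty-printing -->
<serializer add-exist-id="element" indent="no"/>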

2.5. <transformer>

This section determines which XSLT processor will be used by eXist. By default, eXist relies on Xalan, which is an XSLT 1.0 engine. Please refer to the corresponding howto to switch to an XSLT 2.0 processor such as Saxon.

2.6. <validation>

Defines the default validation settings that will be active when parsing XML and links to catalog files. Catalog files are used to locate DTDs, schemas and resolve external entities in general.

Please refer to the corresponding documentation on XML Validation.

2.7. <xupdate>

Inserting new nodes into a document can lead to fragmentation in the DOM storage file. eXist will thus trigger a defragmentation run if the fragmentation exceeds a certain limit. The frequency of such defragmentation runs can be configured in the <xupdate> section. The main parameter is called allowed-fragmentation:

Example: XUpdate-Options in conf.xml

<xupdate allowed-fragmentation="20" enable-consistency-checks="no" />

<xupdate> Attributes

allowed-fragmentation

This attribute defines the maximum number of page splits allowed within a document before a defragmentation run is triggered.

enable-consistency-checks

This attribute is for debugging purposes only. If the parameter is set to "yes", a consistency check will be run on modified documents after every XUpdate request. This checks whether the persistent DOM is complete and whether all pointers in the structural index point to valid storage addresses that contain valid nodes.

2.8. <xquery>

Example: XQuery Engine Configuration

<xquery enable-java-binding="no" enable-query-rewriting="no"  disable-deprecated-functions="no" raise-error-on-failed-retrieval="no" backwardCompatible="no">
    <builtin-modules>
        <!-- Default Modules -->
        <module class="org.exist.xquery.functions.util.UtilModule"
            uri="http://exist-db.org/xquery/util" />
        <!-- ... more modules ... -->
    </builtin-modules>
</xquery>

The <xquery> section is used to enable/disable certain core features of the XQuery engine. It also lists the XQuery modules that will be known to the query engine by default.

<xquery> Attributes

enable-java-binding=yes|no

Enables or disables the Java binding. Giving users full access to all Java classes should be considered a security risk, and the feature is thus disabled by default. If you enable it, you should think about configuring XACML to restrict Java access from XQuery.

disable-deprecated-functions=yes|no

Controls the availability of XQuery functions marked as deprecated. If set to "yes", these functions are disabled.

raise-error-on-failed-retrieval=yes|no

Set to "yes" if a call to doc(), xmldb:document(), collection() or xmldb:xcollection() should raise an error (FODC0002) when an XML resource cannot be retrieved.

Set to "no" if a call to doc(), xmldb:document(), collection() or xmldb:xcollection() should return an empty sequence when an XML resource cannot be retrieved.

enable-query-rewriting=yes|no

The query engine can often achieve considerable performance improvements by rewriting an XQuery expression into a more efficient form (see the documentation about indexing). However, these features are relatively new. If you have doubts about the correctness of a query result, you may temporarily set enable-query-rewriting to "no" and see if the result changes in any way. If it does, you have hit a bug which should be reported.

backwardCompatible=yes|no

Enables or disables XPath 1.0 backwards compatibility. The setting mainly affects automatic type conversions, which were less strict in XPath 1.0 than in XQuery/XPath 2.0.

builtin-modules

This section lists the XQuery modules which will be known to the query engine. The modules in this list can be imported into a query without specifying a location. For example, the following entry:

<module class="org.exist.xquery.modules.file.FileModule"
uri="http://exist-db.org/xquery/file" />

establishes a static mapping between the module URI for the file module and the Java class which implements it. When using that module, it is sufficient to provide the correct URI in the import. Specifying a location is not needed:

import module namespace file="http://exist-db.org/xquery/file";

Instead of providing a Java class, one can also specify a src URI which must point to the XQuery source code of the module, e.g.:

<module src="resource:org/exist/xquery/lib/json.xq" 
uri="http://www.json.org"/>

For the src attribute, eXist understands the same types of URIs as in an ordinary XQuery import statement.
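
Both forms can appear side by side inside <builtin-modules>. The sketch below simply combines the two entries shown above; the attribute values on <xquery> are taken from the earlier example.

Example: Mixing Java and XQuery Modules in <builtin-modules>

<xquery enable-java-binding="no" enable-query-rewriting="no" disable-deprecated-functions="no" raise-error-on-failed-retrieval="no" backwardCompatible="no">
    <builtin-modules>
        <!-- module implemented in Java -->
        <module class="org.exist.xquery.modules.file.FileModule"
            uri="http://exist-db.org/xquery/file" />
        <!-- module implemented in XQuery, loaded from its source -->
        <module src="resource:org/exist/xquery/lib/json.xq"
            uri="http://www.json.org"/>
    </builtin-modules>
</xquery>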

3. Cocoon Sitemap Configuration

Cocoon uses a sitemap XML file called sitemap.xmap to configure the pipelines it uses to process HTTP requests. eXist's integration with Cocoon is based entirely on the XML:DB database API, so any XML:DB-enabled database (e.g. Xindice) can be integrated with Cocoon in the same way.

Beginning with Cocoon version 2.0, pseudo-protocols are supported. Pseudo-protocols allow you to register handlers for special URLs via so-called "source factories". In essence, these protocols specify resources wherever a known protocol such as http:// or file:// is specified in the sitemap. Currently, the distribution defines a pseudo-protocol to access XML:DB-enabled databases.

In eXist, pseudo-protocols are configured in Cocoon's main configuration file WEB-INF/cocoon.xconf. To make use of these protocols, simply specify the correct database driver class, as in the following example:

Example: Defining the XML:DB Database Driver

<source-handler logger="core.source-handler">
        <!-- xmldb pseudo protocol -->
     <protocol 
            class="org.apache.cocoon.components.source.XMLDBSourceFactory" 
            name="xmldb">
        <driver class="org.exist.xmldb.DatabaseImpl" type="exist"/>
        <!-- Add here other XML:DB compliant databases drivers -->
      </protocol>
</source-handler>

Once the database driver has been registered with the handler, it is possible to use an XML:DB URI wherever Cocoon expects a URI in its site configuration file sitemap.xmap. For example, to access our collection of Shakespeare plays from the web-browser, and with a stylesheet applied to each document, we could use the following code fragment in the sitemap's processing pipeline:

Example: Using XML:DB URIs in the Sitemap

<!-- apply stylesheet shakes.xsl to all XML documents
in xmldb-collection /db/shakespeare/plays --> 
<map:match pattern="xmldb/db/shakespeare/plays/**.xml">
    <map:generate src="xmldb:exist:///db/shakespeare/plays/{1}.xml"/>
    <map:transform src="xmldb:exist:///db/shakespeare/plays/shakes.xsl"/>
    <map:serialize type="html"/>
</map:match>
The sitemap.xmap delivered with eXist also contains more complex examples.

September 2009
Wolfgang M. Meier
wolfgang at exist-db.org