XML Validation

1. Introduction

eXist supports implicit and explicit validation of XML documents. Implicit validation can be executed automatically when documents are inserted into the database; explicit validation can be performed through the provided XQuery extension functions.

2. Implicit validation

To enable implicit validation, the eXist-db configuration must be changed by editing the file conf.xml. The following items must be configured:

2.1. Validation mode

Example: Default configuration

    <validation mode="auto">
        <entity-resolver>
            <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
        </entity-resolver>
    </validation>

The mode attribute switches the validation capabilities of the (Xerces) XML parser on or off. The possible values are:

yes

Switch on validation. All XML documents will be validated. Note: if the grammar document(s) (XML schema, DTD) cannot be resolved, the XML document is rejected.

no

Switch off validation. No grammar validation is performed and all well-formed XML documents will be accepted.

auto (default)

Validation of an XML document will be performed based on the content of the document. When a document contains a reference to a grammar document (XML schema or DTD), the XML parser tries to resolve this grammar and the XML document will be validated against it, just as if mode="yes" were configured. Again, if the grammar cannot be resolved, the XML document will be rejected. When the XML document does not contain a reference to a grammar, it will be parsed as if mode="no" were configured.
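As an illustration, strict validation could be enabled by changing only the mode attribute of the default configuration. The sketch below copies the entity-resolver section unchanged from the default example and assumes that this catalog lists every grammar referenced by the documents to be stored.

```xml
<!-- conf.xml fragment (sketch): validate every document on insert -->
<validation mode="yes">
    <entity-resolver>
        <!-- Catalog entry copied from the default configuration -->
        <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
    </entity-resolver>
</validation>
```

With this setting, a document whose grammar cannot be resolved is rejected rather than stored unvalidated.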

2.2. Catalog Entity Resolver

All grammars (XML schema, DTD) that are used for implicit validation must be registered with eXist using OASIS catalog files. These catalog files can be stored on disk and/or in the database itself. In eXist the actual resolving is performed by the Apache XML Commons Resolver library.

It is possible to configure any number of catalog entries in the entity-resolver section of conf.xml. Relative uri attribute values are resolved relative to the location of the catalog document.

Example: Catalog stored in database

    <validation mode="auto">
        <entity-resolver>
            <catalog uri="xmldb:exist:///db/grammar/catalog.xml" />
            <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml" />
        </entity-resolver>
    </validation>

A catalog stored in the database can be addressed by a URL like 'xmldb:exist:///db/mycollection/catalog.xml' (note the 3 leading slashes which imply localhost) or the shorter equivalent '/db/mycollection/catalog.xml'.

In the preceding example, ${WEBAPP_HOME} is substituted by a file:// URL pointing to the 'webapp' directory of eXist (e.g. '$EXIST_HOME/webapp/') or, when eXist is deployed in a servlet container, to the equivalent directory of the deployed WAR file (e.g. '${CATALINA_HOME}/webapps/exist/').

Example: Default OASIS catalog file

    <?xml version="1.0" encoding="UTF-8"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
        <public publicId="-//PLAY//EN" uri="entities/play.dtd"/>
        <system systemId="play.dtd" uri="entities/play.dtd"/>
        <system systemId="mondial.dtd" uri="entities/mondial.dtd"/>

        <uri name="http://exist-db.org/samples/shakespeare" uri="entities/play.xsd"/>

        <uri name="http://www.w3.org/XML/1998/namespace" uri="entities/xml.xsd"/>
        <uri name="http://www.w3.org/2001/XMLSchema" uri="entities/XMLSchema.xsd"/>

        <uri name="urn:oasis:names:tc:entity:xmlns:xml:catalog" uri="entities/catalog.xsd" />
    </catalog>

2.3. Collection configuration

Within the database, the validation mode for each individual collection can be configured using collection.xconf documents, in the same way as these are used for configuring indexes. These documents need to be stored in '/db/system/config/db/....'.

Example: collection.xconf

    <?xml version='1.0'?>
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <validation mode="no"/>
    </collection>

This example collection.xconf file turns implicit validation off for the collection it applies to.
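The same document structure can presumably be used to make a single collection stricter than the global default. The following sketch assumes that the mode attribute in collection.xconf accepts the same values (yes, no, auto) as the one in conf.xml.

```xml
<?xml version='1.0'?>
<!-- collection.xconf (sketch): always validate documents stored in this collection -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <!-- Assumption: same value set as the mode attribute in conf.xml -->
    <validation mode="yes"/>
</collection>
```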

3. Explicit validation

Extension functions for validating XML in an XQuery script are provided. Starting with eXist-db release 1.4, three validation options are provided: JAXP, JAXV and Jing.

Each of these options is discussed in the following sections. Consult the XQuery Function Documentation for detailed function descriptions.

3.1. JAXP

The JAXP validation functions are based on the validation capabilities of the javax.xml.parsers API. The actual validation is performed by the Xerces2 library.

When the parser encounters a reference to a grammar (a DTD or an XSD) while parsing an XML document, it attempts to resolve that reference either by following the XSD xsi:schemaLocation and xsi:noNamespaceSchemaLocation hints or the DTD DOCTYPE SYSTEM information, or by delegating the retrieval of the grammar to an XML catalog resolver. The resolver identifies XSDs by their (target) namespace and DTDs by their public identifier (PublicId).
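To illustrate what such a grammar reference looks like, here is a sketch of an instance document that points to an XSD. The namespace http://example.org/ns/order, the element names and the file name order.xsd are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance document; namespace and schema file are placeholders -->
<order xmlns="http://example.org/ns/order"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.org/ns/order order.xsd">
    <!-- xsi:schemaLocation holds pairs of namespace URI and schema location hint -->
    <item>widget</item>
</order>
```

A document without a namespace would use xsi:noNamespaceSchemaLocation with a single location instead. A DTD-based document would carry a DOCTYPE declaration such as <!DOCTYPE PLAY PUBLIC "-//PLAY//EN" "play.dtd">, whose public identifier corresponds to the first entry of the default catalog file shown in section 2.2.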

Validation performance is increased through grammar caching; the cached compiled grammars are shared by the implicit validation feature.

The jaxp() and jaxp-report() functions accept the following parameters:

$instance

The XML instance document, referenced as document node (returned by fn:doc()), element node, xs:anyURI or as Java file object.

$cache-grammars

Set this to true() to enable grammar caching.

$catalogs

One or more OASIS catalog files referenced as xs:anyURI. Depending on the xs:anyURI a different resolver will be used:

  • When an empty sequence is set, the catalog files defined in conf.xml are used.
  • If the URI ends with ".xml" the specified catalog is used.
  • If the URI points to a collection (that is, it ends with "/"), the grammar files are searched for in the database using an XQuery. XSDs are found by their targetNamespace attributes; DTDs are found by their publicId entries in catalog files.
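A catalog passed through $catalogs might look like the following sketch. The namespace, the public identifier and the file names are hypothetical; the two entry types reflect how the resolver looks up XSDs (by namespace) and DTDs (by public identifier).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical catalog, stored for example as /db/grammar/catalog.xml -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <!-- XSD: looked up by its target namespace -->
    <uri name="http://example.org/ns/order" uri="order.xsd"/>
    <!-- DTD: looked up by its public identifier -->
    <public publicId="-//EXAMPLE//DTD Order//EN" uri="order.dtd"/>
</catalog>
```

Because the uri values are relative, they would be resolved against the location of the catalog document itself, so order.xsd and order.dtd are expected next to the catalog.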

3.2. JAXV

The JAXV validation functions are based on the javax.xml.validation API, which was introduced in Java 5 to provide a schema-language-independent interface to validation services. Although the specification officially allows additional schema languages, in practice only XML Schema can be used so far.

The jaxv() and jaxv-report() functions accept the following parameters:

$instance

The XML instance document, referenced as document node (returned by fn:doc()), element node, xs:anyURI or as Java file object.

$grammars

One or more grammar files, referenced as document nodes (returned by fn:doc()), element nodes, xs:anyURI or as Java file objects.
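To make the $grammars parameter concrete, here is a minimal XML Schema of the kind that could be passed to jaxv() or jaxv-report(). The target namespace and the element names a and c are illustrative assumptions, loosely modelled on the invalid-document report in section 4; they are not taken from an actual eXist sample.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical grammar: element a contains one decimal-valued element c -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/ns/sample"
           elementFormDefault="qualified">
    <xs:element name="a">
        <xs:complexType>
            <xs:sequence>
                <!-- A non-numeric value here is reported as invalid for 'decimal' -->
                <xs:element name="c" type="xs:decimal"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```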

3.3. Jing

The Jing validation functions are based on James Clark's Jing library. eXist uses the maintained version that is available via Google Code. The library relies on the com.thaiopensource.validate.ValidationDriver, which supports a wide range of grammar types.

The jing() and jing-report() functions accept the following parameters:

$instance

The XML instance document, referenced as document node (returned by fn:doc()), element node, xs:anyURI or as Java file object.

$grammar

The grammar file, referenced as document node (returned by fn:doc()), element node, as xs:anyURI, binary document (returned by util:binary-doc() for RNC files) or as Java file object.
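As one possible $grammar, the sketch below shows a small RELAX NG grammar in XML syntax, a schema language that Jing validates. The element names a and c are hypothetical and only mirror the element mentioned in the report examples of section 4.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical RELAX NG grammar: element a contains one decimal-valued element c -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
    <start>
        <element name="a">
            <element name="c">
                <!-- The decimal type comes from the XML Schema datatype library -->
                <data type="decimal"/>
            </element>
        </element>
    </start>
</grammar>
```

A grammar in the compact syntax (RNC) is not XML, which is why it would be passed as a binary document or as an xs:anyURI instead of a document node.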

4. Validation report

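The validation report contains the validation status, the namespace, the duration of the validation and, where applicable, the validation messages or the details of a raised exception, as the following examples show.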

Example: Valid document

    <?xml version='1.0'?>
    <report>
        <status>valid</status>
        <namespace>MyNameSpace</namespace>
        <duration unit="msec">106</duration>
    </report>

Example: Invalid document

    <?xml version='1.0'?>
    <report>
        <status>invalid</status>
        <namespace>MyNameSpace</namespace>
        <duration unit="msec">39</duration>
        <message level="Error" line="3" column="20">cvc-datatype-valid.1.2.1: 'aaaaaaaa' is not a valid value for 'decimal'.</message>
        <message level="Error" line="3" column="20">cvc-type.3.1.3: The value 'aaaaaaaa' of element 'c' is not valid.</message>
    </report>

Example: Exception

    <?xml version='1.0'?>
    <report>
        <status>invalid</status>
        <duration unit="msec">2</duration>
        <exception>
            <class>java.net.MalformedURLException</class>
            <message>unknown protocol: foo</message>
            <stacktrace>java.net.MalformedURLException: unknown protocol: foo 
            at java.net.URL.&lt;init&gt;(URL.java:574) 
            at java.net.URL.&lt;init&gt;(URL.java:464) 
            at java.net.URL.&lt;init&gt;(URL.java:413) 
            at org.exist.xquery.functions.validation.Shared.getStreamSource(Shared.java:140) 
            at org.exist.xquery.functions.validation.Shared.getInputSource(Shared.java:190) 
            at org.exist.xquery.functions.validation.Parse.eval(Parse.java:179) 
            at org.exist.xquery.BasicFunction.eval(BasicFunction.java:68) 
            at ......
            </stacktrace>
        </exception>
    </report>

5. Grammar management

The XML parser (Xerces) compiles all grammar files (DTD, XSD) upon first use. For efficiency, these compiled grammars are cached and made available for reuse, which significantly speeds up validation. Under certain circumstances (e.g. grammar development) it may be desirable to clear this cache manually. Two grammar management functions are provided for this purpose.

Example: Cached grammars Report

    <?xml version='1.0'?>
    <report>
    <grammar type="http://www.w3.org/2001/XMLSchema">
        <Namespace>http://www.w3.org/XML/1998/namespace</Namespace>
        <BaseSystemId>file:/Users/guest/existdb/trunk/webapp//WEB-INF/entities/XMLSchema.xsd</BaseSystemId>
        <LiteralSystemId>http://www.w3.org/2001/xml.xsd</LiteralSystemId>
        <ExpandedSystemId>http://www.w3.org/2001/xml.xsd</ExpandedSystemId>
    </grammar>
    <grammar type="http://www.w3.org/2001/XMLSchema">
        <Namespace>http://www.w3.org/2001/XMLSchema</Namespace>
        <BaseSystemId>file:/Users/guest/existdb/trunk/schema/collection.xconf.xsd</BaseSystemId>
    </grammar>
    </report>

Note: the element BaseSystemId typically does not provide useful information.

6. Interactive Client

The interactive shell mode of the Java client provides a simple validate command that accepts arguments similar to those of the explicit validation functions.

7. Special notes

8. References

September 2009