Lucene-based Full Text Index

1. Introduction

The 1.4 version of eXist features a new full text indexing module which replaces eXist's built-in full text index. The new module is faster, more configurable and more feature-rich than eXist's old index. It will also be the basis for an implementation of the W3C's full text extensions for XQuery.

The new full text module is based on Apache Lucene. It thus benefits from a stable, well-designed and widely used framework. The module is tightly integrated with eXist's modularized indexing architecture: the index behaves like a plugin which adds itself to the database's index pipelines. Once configured, the index will be notified of all relevant events, like adding/removing a document, removing a collection or updating single nodes. No manual reindex is required to keep the index up to date. The module also implements common interfaces which are shared with other indexes, e.g. for highlighting matches. It is thus easy to switch between the Lucene index and e.g. the ngram index without rewriting too much XQuery code.

2. Enabling the Lucene Module

The Lucene index is enabled by default in all newer releases of eXist. However, in case it is not enabled in your installation, here's how to get it up and running:

  1. Before building eXist, you need to enable the Lucene module by editing extensions/indexes/build.properties (also see the documentation on index modules):

    Example: build.properties

    # Lucene integration
    include.index.lucene = true
    
  2. Then (re-)build eXist using the provided build.sh or build.bat. The build process downloads the required Lucene jars automatically. If the build succeeds, you should find the jar exist-lucene-module.jar in the lib/extensions directory. Next, edit the main configuration file, conf.xml, and uncomment the two Lucene-related sections:

    Example: conf.xml

    <modules>
      <module id="lucene-index" class="org.exist.indexing.lucene.LuceneIndex" buffer="32"/>
      ...
    </modules>
    ...
    <builtin-modules>
      <module uri="http://exist-db.org/xquery/lucene" class="org.exist.xquery.modules.lucene.LuceneModule"/>
      ...
    </builtin-modules>
    

2.1. Global configuration options

The index has a single configuration parameter which can be specified on the <module> element within the <modules> section:

buffer

Defines the amount of memory (in megabytes) Lucene will use for buffering index entries before they are written to disk. See the Lucene javadocs.

3. Configuring the Index

As with other indexes, you create a Lucene index by configuring it in a collection.xconf document. If you have never done that before, read the corresponding documentation. An example collection.xconf is shown below:

Example: collection.xconf

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:atom="http://www.w3.org/2005/Atom"
        xmlns:html="http://www.w3.org/1999/xhtml"
        xmlns:wiki="http://exist-db.org/xquery/wiki">
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <!-- Lucene index is configured below -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text qname="TITLE" analyzer="ws"/>
            <text qname="p">
                <inline qname="em"/>
            </text>
            <text match="//foo/*"/>
            <!-- "inline" and "ignore" can be specified globally or per-index as
                 shown above -->
            <inline qname="b"/>
            <ignore qname="note"/>
        </lucene>
    </index>
</collection>

You can define a Lucene index either on a single element or attribute name (qname="...") or on a node path with wildcards (match="...", see below). It is important to choose the right context for an index, because it has to be the same context you use in your query. To better understand this, let's have a look at how index creation is handled by eXist and Lucene. The following configuration:

<text qname="SPEECH"/>

creates an index ONLY on SPEECH. What is passed to Lucene is the string value of SPEECH, which includes the text of all its descendant text nodes (except those filtered out by an optional <ignore>). For example, consider the fragment:

<SPEECH>
    <SPEAKER>Second Witch</SPEAKER>
    <LINE>Fillet of a fenny snake,</LINE>
    <LINE>In the cauldron boil and bake;</LINE>
</SPEECH>

If you have an index on SPEECH, Lucene will create a "document" with the text "Second Witch Fillet of a fenny snake, In the cauldron boil and bake;" and index it. eXist internally links this Lucene document to the SPEECH node, but Lucene has no knowledge of that (it doesn't know anything about XML nodes).
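
To see roughly what Lucene receives, the string can be approximated in plain XQuery. This is only a sketch: it joins the descendant text nodes with a space, mirroring the default treatment of element boundaries as token breaks, and it does not touch the index itself.

```xquery
(: Approximates the text eXist hands to Lucene for an index on SPEECH. :)
let $speech :=
    <SPEECH>
        <SPEAKER>Second Witch</SPEAKER>
        <LINE>Fillet of a fenny snake,</LINE>
        <LINE>In the cauldron boil and bake;</LINE>
    </SPEECH>
return
    (: skip whitespace-only text nodes, then join the rest with a space :)
    string-join($speech//text()[normalize-space(.) ne ''], ' ')
```

The result is the single string quoted above, with no trace of the element structure left in it.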

The query:

//SPEECH[ft:query(., 'cauldron')]

searches the index and finds the "document" containing the SPEECH text, which eXist can trace back to the SPEECH node in the XML document. However, it is required that you use the same context (SPEECH) for creating and querying the index. The query:

//SPEECH[ft:query(LINE, 'cauldron')]

will not return anything, even though LINE is a child of SPEECH and 'cauldron' was indexed. This particular 'cauldron' is linked to its ancestor SPEECH node, not its parent LINE.

However, you are free to give the user both options, i.e. use SPEECH and LINE as context at the same time. How? Simply define a second index on LINE:

<text qname="SPEECH"/>
<text qname="LINE"/>
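
With both definitions in place, either element can serve as the query context. The queries below are a sketch that assumes the two index definitions above and the SPEECH fragment shown earlier:

```xquery
(: Context SPEECH: served by the index on SPEECH :)
//SPEECH[ft:query(., 'cauldron')],

(: Context LINE: served by the index on LINE; returns the enclosing SPEECH :)
//SPEECH[ft:query(LINE, 'cauldron')],

(: Context LINE again, but returning only the matching LINE elements :)
//SPEECH/LINE[ft:query(., 'cauldron')]
```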

Let's use a different example to illustrate that. Assume you have a document with encoded place names:

Example: Paragraph with place name

<p>He loves <placeName>Paris</placeName>.</p>

For a general query you probably want to search through all paragraphs. However, you may also want to provide an advanced search option, which allows users to restrict their query to place names. To make this possible, simply define an index on placeName as well:

Example: collection.xconf fragment

<lucene>
    <text qname="p"/>
    <text qname="placeName"/>
</lucene>

Based on this setup, you'll be able to query for the word 'Paris' anywhere in a paragraph:

//p[ft:query(., 'paris')]

as well as 'Paris' occurring within a <placeName>:

//p[ft:query(placeName, 'paris')]

3.1. Using match="..."

In addition to defining an index on a given qname, you may also specify a "path" with wildcards. This feature is subject to change, so please be careful when using it.

Assume you want to define an index on all the possible elements below SPEECH. You can do this by creating one index for every element:

<text qname="LINE"/>
<text qname="SPEAKER"/>

As a shortcut, you can use a match attribute with a wildcard:

<text match="//SPEECH/*"/>

which will create a separate index on each child element of SPEECH it encounters. Please note that the argument to match is a simple path pattern, not an XPath expression. It only allows / and // to denote a child or descendant step, plus the * wildcard to match an arbitrary element.
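
As a sketch of what the wildcard definition makes possible, assume that <text match="//SPEECH/*"/> is the only index configured. Each child of SPEECH can then be used as a query context, while SPEECH itself is not indexed:

```xquery
(: SPEAKER and LINE are children of SPEECH, so each has its own index :)
//SPEECH[ft:query(SPEAKER, 'witch')],
//SPEECH[ft:query(LINE, 'cauldron')]

(: By contrast, //SPEECH[ft:query(., 'cauldron')] has no matching index here,
   because the pattern covers the children of SPEECH, not SPEECH itself. :)
```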

As explained above, you have to figure out which parts of your document are likely to be interesting as context for a full text query. The full text index works best if the context isn't too narrow. For example, if you have a document structure with section divs, headings and paragraphs, you would probably want to create an index on the divs and maybe on the headings, so the user can differentiate between the two. In some cases, you could decide to put the index on the paragraph level, but then you don't need the index on the section, since you can always get from the paragraph back to the section.

If you query a larger context, you can use the KWIC module to show the user only a certain chunk of text surrounding each match. Or you can ask eXist to highlight each match with an <exist:match> tag, which you can later use to locate the matches within the text.
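
A minimal sketch of both approaches follows. It assumes an index on SPEECH and uses the kwic:summarize and util:expand functions shipped with eXist (the util prefix is assumed to be predeclared, as it normally is in eXist). The width of 40 characters is an arbitrary choice for illustration.

```xquery
import module namespace kwic="http://exist-db.org/xquery/kwic";

for $hit in //SPEECH[ft:query(., 'cauldron')]
(: short text snippets surrounding each match :)
let $summary := kwic:summarize($hit, <config width="40"/>)
(: in-memory copy of the hit with matches wrapped in <exist:match> :)
let $expanded := util:expand($hit)
order by ft:score($hit) descending
return
    <result>{ $summary, $expanded }</result>
```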

3.2. Whitespace Treatment and Ignored Content

Inlined elements

By default, eXist's indexer assumes that element boundaries break a word or token. For example, if you have an element:

Example: Not a Mixed Content Element

<size><width>12</width><height>8</height></size>

You want "12" and "8" to be indexed as separate tokens, even though there's no whitespace between the elements. By default, eXist will indeed pass the content of the two elements to Lucene as separate strings and Lucene will thus see two tokens instead of just "128".

However, you usually don't want this behaviour for mixed content nodes. For example:

Example: Mixed Content Node

<p>This is <b>un</b>clear.</p>

In this case, you want "unclear" to be indexed as one word. This can be done by telling eXist which nodes are "inline" nodes. The example configuration above defines:

<inline qname="b"/>

The <inline> option can be specified globally, which means it will be applied to all <b> elements, or per-index:

<text qname="p">
    <inline qname="em"/>
</text>
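
The effect can be checked with a query. As a sketch, assume an index on p and the mixed content paragraph shown above:

```xquery
(: With <inline qname="b"/> configured, "un" and "clear" are joined into
   one token, so a search for the whole word is expected to match :)
//p[ft:query(., 'unclear')]

(: Without the inline rule, the element boundary splits the word into the
   tokens "un" and "clear", and the query above would be expected to miss. :)
```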

Ignored elements

Also, it is sometimes necessary to skip the content of an inlined element, which can appear in the middle of a text sequence you want to index. Notes are a good example:

Example: Paragraph With Inline Note

<p>This is a paragraph
<note>containing an inline note</note>.</p>

Use an <ignore> element in the collection configuration to have eXist ignore the note:

<ignore qname="note"/>

Basically, <ignore> simply allows you to hide a chunk of text before Lucene sees it.

Like the <inline> tag, <ignore> may appear globally or within a single index definition.

An <ignore> rule only applies to descendants of an indexed element. You can still create another index on the ignored element itself. For example, you can have index definitions for both <p> and <note>:

Example: collection.xconf fragment

<lucene>
    <text qname="p"/>
    <text qname="note"/>
    <ignore qname="note"/>
</lucene>

If <note> appears within <p>, it will not be added to the index on <p>, but only to the index on <note>. This means that the query

//p[ft:query(., "note")]

may not return a hit if "note" occurs within a <note>, while

//p[ft:query(note, "note")] 

may still find a match.

3.3. Boost

A boost value can be assigned to an index to give it a higher score. The score for each match will be multiplied by the boost factor (default is: 1.0). For example, you may want to rank matches in titles higher than other matches. Here's how we configure the documentation search indexes in eXist:

Example: collection.xconf using boost

<lucene>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <text qname="section">
        <ignore qname="title"/>
        <ignore qname="programlisting"/>
        <ignore qname="screen"/>
        <ignore qname="synopsis"/>
    </text>
    <text qname="title" boost="2.0"/>
</lucene>

The title index gets a boost of 2.0 to make sure that title matches get a higher score. Since the <title> element occurs within <section>, we add an ignore rule to the index definition on the section and create a separate index on title. Without this, a title would be matched twice.

Because the title is now indexed separately, we also need to query it explicitly. For example, to search the section and the title at the same time, the documentation search interface issues the following query:

for $sect in /book//section[ft:query(., "ngram")] | /book//section[ft:query(title, "ngram")]
order by ft:score($sect) descending return $sect

3.4. Analyzers

One of the strengths of Lucene is that it allows the developer to determine nearly every aspect of the text analysis. This is mostly done through analyzer classes, which combine a tokenizer with a chain of filters to post-process the tokenized text. eXist's Lucene module already allows different analyzers to be used for different indexes.

Example:

<lucene>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
    <text match="//SPEECH//*"/>
    <text qname="TITLE" analyzer="ws"/>
</lucene>

In the example above, we define that Lucene's StandardAnalyzer should be used by default (the <analyzer> element without id attribute). We provide an additional analyzer and assign it the id ws, by which the analyzer can be referenced in the actual index definitions.

The whitespace analyzer is the most basic one. As the name says, it tokenizes the text at white space characters, but treats all other characters - including punctuation - as part of the token. The tokens are not converted to lower case and there's no stopword filter applied.
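
The tokenization rule can be approximated in plain XQuery with fn:tokenize. This is only an illustration of how the text is split, not a call into Lucene:

```xquery
(: Split at whitespace only; case and punctuation are left untouched :)
tokenize(normalize-space('In the cauldron boil and bake;'), ' ')
(: yields: "In", "the", "cauldron", "boil", "and", "bake;" :)
```

Under this analyzer, a search for bake would therefore not be expected to match the token "bake;", and a search for in would not match "In".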

We will certainly add more features in the future, e.g. a possibility to construct a new analyzer from a set of filters. For the time being, you can always provide your own analyzer or use one of those supplied by Lucene or compatible software.

4. Querying the Index

Querying lucene from XQuery is straightforward. For example:

Example: A Simple Query

for $m in //SPEECH[ft:query(., "boil bubble")]
order by ft:score($m) descending
return $m

The ft:query function takes a query string in Lucene's default query syntax. It returns the set of nodes which are relevant with respect to the query. Lucene assigns a relevance score or rank to each match. This score is preserved by eXist and can be accessed through the ft:score function, which returns a decimal value. The higher the score, the more relevant the text. You can use Lucene's features to "boost" a certain term in the query, i.e. give it a higher or lower influence on the final rank.
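
A few query strings in Lucene's default syntax are sketched below. They assume an index on SPEECH, and the search terms are only examples:

```xquery
(: "boil" is required; "bubble" is optional but contributes to the score :)
//SPEECH[ft:query(., '+boil bubble')],

(: phrase query: the two words must be adjacent and in this order :)
//SPEECH[ft:query(., '"fenny snake"')],

(: wildcard: any term starting with "miser" :)
//SPEECH[ft:query(., 'miser*')],

(: term boost: "cauldron" gets a boost factor of 2 relative to "boil" :)
//SPEECH[ft:query(., 'cauldron^2 boil')]
```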

Please note that the score is computed relative to the root context of the index. If you created an index on SPEECH, all scores will be computed on the basis of the text in the SPEECH nodes, even though your actual query may only return LINE children of SPEECH.

The Lucene module is fully supported by eXist's query-rewriting optimizer, which means that the query engine can rewrite the XQuery expression to make best use of the available indexes. All the rules and hints given in the tuning guide apply fully to the Lucene index.

To present search results in a Keywords in Context format, you may want to have a look at eXist's KWIC module.

4.1. Describing Queries in XML

Lucene's default query syntax does not provide access to all available features. However, eXist's ft:query function also accepts a description of the query in XML as an alternative to passing a query string. The XML description closely mirrors Lucene's query API. It is transformed into an internal tree of query objects, which is directly passed to Lucene for execution. This has some advantages. For example, you can specify if the order of terms should be relevant for a phrase query:

Example: Using an XML Definition of the Query

let $query :=
    <query>
        <near ordered="no">miserable nation</near>
    </query>
return
    //SPEECH[ft:query(., $query)]

The following elements may occur within a query description:

<term>

Defines a single term to be searched in the index. If the root query element contains a sequence of term elements, they will be combined as in a boolean "or" query. For example:

let $query :=
    <query>
        <term>nation</term><term>miserable</term>
    </query>
return
//SPEECH[ft:query(., $query)]

finds all SPEECH elements containing either "nation" or "miserable" or both.

<wildcard>

A string with a '*' wildcard in it, which will be matched against the terms of a document. Can be used instead of a <term> element. For example:

let $query :=
    <query>
        <term>nation</term><wildcard>miser*</wildcard>
    </query>
return
//SPEECH[ft:query(., $query)]

<regex>

A regular expression which will be matched against the terms of a document. Can be used instead of a <term> element. For example:

let $query :=
    <query>
        <term>nation</term><regex>miser.*</regex>
    </query>
return
//SPEECH[ft:query(., $query)]

<bool>

Constructs a boolean query from its children. Each child element may have an occurrence indicator (the occur attribute), which can be either must, should or not:

must

this part of the query must be matched

should

this part of the query should be matched, but doesn't need to

not

this part of the query must not be matched

let $query :=
    <query>
        <bool><term occur="must">boil</term><term occur="should">bubble</term></bool>
    </query>
return //SPEECH[ft:query(LINE, $query)]
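
The not indicator is used in the same way. The following sketch, which again assumes an index on LINE, selects speeches having a line that contains "boil" but not "bubble":

```xquery
let $query :=
    <query>
        <bool>
            <term occur="must">boil</term>
            <term occur="not">bubble</term>
        </bool>
    </query>
return //SPEECH[ft:query(LINE, $query)]
```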

<phrase>

Searches for a group of terms occurring in the correct order. The element may either contain explicit <term> elements or text content. Text will be automatically tokenized into a sequence of terms. For example:

let $query :=
    <query>
        <phrase>cauldron boil</phrase>
    </query>
return //SPEECH[ft:query(., $query)]

has the same effect as:

let $query :=
    <query>
        <phrase><term>cauldron</term><term>boil</term></phrase>
    </query>
return //SPEECH[ft:query(., $query)]

The slop attribute can be used for a proximity search: Lucene will try to find terms which are within the specified distance:

let $query :=
    <query>
        <phrase slop="10"><term>frog</term><term>dog</term></phrase>
    </query>
return //SPEECH[ft:query(., $query)]

<near>

<near> is a powerful alternative to <phrase> and one of the features not available through the standard Lucene query parser.

If the element has text content only, it will be tokenized into terms and the expression behaves like <phrase>. Otherwise it may contain any combination of <term>, <first> and nested <near> elements. This makes it possible to search for two sequences of terms which are within a specific distance. For example:

let $query :=
    <query>
        <near slop="20"><term>snake</term><near>tongue dog</near></near>
    </query>
return //SPEECH[ft:query(., $query)]

The <first> element matches a span against the start of the text in the context node. It takes an optional attribute end to specify the maximum distance from the start of the text. For example:

let $query :=
    <query>
        <near slop="50"><first end="2"><near>second witch</near></first><near>tongue dog</near></near>
    </query>
return //SPEECH[ft:query(., $query)]

As shown above, the content of <first> can again be text, a <term> or a <near>.

Unlike <phrase>, <near> can be told to ignore the order of its components. Use the parameter ordered="yes|no" to change near's behaviour. For example:

let $query :=
    <query>
        <near slop="100" ordered="no"><term>snake</term><term>bake</term></near>
    </query>
return //SPEECH[ft:query(., $query)]

All elements in a query may have an optional boost parameter (a float value). The score of the nodes matching the corresponding query part will be multiplied by the boost.
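
As a sketch, the query below boosts one of two optional terms, so that speeches matching "cauldron" should tend to rank above those matching only "snake". It assumes an index on SPEECH. The boost value of 2.0 and the search terms are arbitrary:

```xquery
let $query :=
    <query>
        <bool>
            <term occur="should" boost="2.0">cauldron</term>
            <term occur="should">snake</term>
        </bool>
    </query>
for $speech in //SPEECH[ft:query(., $query)]
order by ft:score($speech) descending
return $speech
```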

September 2009
The eXist Project