This afternoon, Maarten Marx and I will be going to the Open Government Data Camp. Right now I’m preparing by installing software I think I might need there. I’ll try to get a demo version of the faceted search implemented in eXist ready.
eXist Faceted Search
October 8th, 2010 § 2
In late 2007 eXist was enriched with a new indexing architecture. On this page I will describe how I intent to use this flexible architecture to add Faceted Search to eXist.
The Goal
In general, the goal of faceted search is providing a way to drill down in search results. Given the results for a (full-text) query, categories (such as `color’, when searching for a car) are provided together with possible values for this category (`red’, `blue’, …) and for each of these values a count is displayed, indicating how many of the current search result have this value for this category.
Values can be nominal (such as in the above example), or a discretisation of a continues value. Dates, for example, can be binned in a fixed set of predefined ranges.
These categories (facets), their values and especially their counts provide a quick and intuitive way to drill down in search results.
The Challenges
We have chosen to use eXist as our database; it gives us many features that fit our dataset. For one, we can be flexible about what we call `documents’; the level at which we can perform search. eXist allows for specifying indexes on all elements. So, we cut loose from the original .xml documents and allow for any arbitrary element to act as a document with respect to search.
At index time, we will need to create a bitvector for each <facet, value> pair. Each bit in this vector represents a pseudo document. A bit turned 1 means the document has the this value for this facet (it talks about a red car, for <color, red>), otherwise it is turned 0.
For faceted search we will need to have a fixed order for each set of these pseudo documents. That way, the bitvectors are always created in the same order, meaning that the nth bit in all vectors refers to the same pseudo document. Such bitvectors can then be combined using a bit wise AND operation. Since these <facet, value> bitvectors can be pre-calculated at index time (when the collection changes), this methods allows for fast calculations of the before mentioned counts. Only the bitvector for the current set of pseudo documents will need to be created on the fly.
The challenge becomes
- Keeping these pre-computed bitvectors up-to-date when the data collection changes.
- Efficient mapping of node-sequences (current set of pseudo documents) to a bitvector.
- Fast calculation of the cardinality of a bitvector resulting from a bit wise AND operation. (see http://stackoverflow.com/questions/1754560/can-someone-explain-to-me-what-this-getcardinality-method-is-doing)
- Defining a simple but flexible configuration
- Defining a simple but expressive xquery interface
Proposed Implementation
Let us assume we have a collection of documents like the following:
red-2007-cars.xml <root> <meta> <color>red</color> <date>2007</color> ... </meta> <cars> <car> <name>Audi</name> <description>...</description> ... </car> <car>...</car> </cars> </root>
Configuration
For existing configuration options see: http://exist.sourceforge.net/indexing.html#idxconf
This is aimed at integrating the facets into the same setting.
collection.xconf <collection xmlns="http://exist-db.org/collection-config/1.0"> <index> ... <facet qname="car" name="color" select="ancestor::root//meta/color"> <value name="red" /> <value name="blue" /> <value name="orange" /> </facet> <facet qname="car" name="age" select="ancestor::root//meta/date"> <range name="oldtimer" max="1980" /> <range name="older" min="1980" max="2008" /> <range name="new" min="2008" /> </facet> ... </index> </colletion>
Note: this syntax might not be ideal. It would probably be better to have a full xpath 2.0 expression per value that evaluates to true.
XQuery Syntax
See http://exist.sourceforge.net/lucene.html#N10325, for the existing Lucene Full-text xquery syntax:
for $hit in //car[ft:query(., "sedan")] order by ft:score($hit) descending return $hit
We define a facet syntax analogously:
for $hit in //car[facet:query("color", "red")]
return $hit
Note that a ordering by a score would not make sense here. Facet search is binary: something is a hit or not.
This syntax can trivially be combined with for example full-text search:
for $hit in //car[facet:query("color", "red")][ft:query(., "sedan")]
order by ft:score($hit) descending
return $hit
That is all nice, but the real strength of facets is displaying the counts for <facet, value> pairs. We propose yet another syntax to obtain these:
let $query := //car[facet:query("color", "red")][ft:query(., "sedan")]
for $facet in facet:counts($query)
return $facet
So we need to define an xml format that is returned by the facet:counts(.) function. We propose the following structure:
<facet name="color"> <value name="red">14</value> <value name="blue">26</value> <value name="orange">0</value> <other>3</other> </facet> <facet name="age"> <value name="oldtimer">6</value> <value name="older">13</value> <value name="new">24</value> <other /> </facet>
Note: the <other> is added to account for unforeseen facet values.
Under the hood
Extension Module
See also http://exist.sourceforge.net/extensions.html.
Source files are added in:
$EXIST_HOME/extensions/indexes/facet/src/org/exist/xquery/modules/facet/
- FacetModule.java
- Optimize.java
- Query.java
New Index
See also information on the index architecture.
Source files are added in:
$EXIST_HOME/extensions/indexes/facet/src/org/exist/indexing/facet/
- FacetIndex.java
- FacetIndexWorker.java
- FacetIndexConfig.java
Changes to to conf.xml
Add an element for facet indexes in <indexes> (next to the lucene index and the ngram index). Also, add the facet module (for xquert extension) under <modules>.
Understanding the New Indexing Features
October 8th, 2010 § 2
[mirror of: http://atomic.exist-db.org/wiki/blogs/eXist/NewIndexing]
The upcoming next release of eXist will introduce quite a few changes with respect to index types and index creation. While your old index configuration should still work with the new version, knowing the new features and possibilities can sometimes result in a dramatic performance boost.
To better understand the changes, we have to look at two different areas of development, which both have direct effects on indexing features:
- The switch to a modularized indexing architecture
- The new query-rewriting optimizer
Modularized Index Architecture
So far, all indexes in eXist were part of the database core, i.e. their creation and configuration was hardcoded, their indexing methods were directly called from the main indexer. Certainly this “design” was too limiting. Most XML projects tend to develop slightly different indexing needs and it is hard to define one type of index that fits them all.
We thus decided to replace the old hard-wired indexes by a completely new architecture, which is based on pluggable indexing pipelines and makes nearly no assumptions about the data to be indexed. In the new design, the database core just passes a stream of index events to a pipeline of external index plugins. The index plugin scans the stream for interesting events and processes them according to its own logic (for example, the spatial plugin listens for GML geometries in the source XML to enable spatial queries). For the DB core, the index is now a black box, which handles its own creation, configuration, destruction etc..
The plugin architecture makes it even possible to use other database systems to index contents stored in eXist. The new spatial index is a good example as it currently uses an SQL database to index spatial data.
Consequently, all the new index modules were moved out of the eXist core packages and into an extra directory extensions/indexes. New indexes can be added any time (copy the directory structure from an existing index and start implementing).
Our work is not finished yet as some of the old indexes have only be moved in part. However, we implemented two new index plugins: an n-gram index and the spatial index already mentioned above. Both plugins served as a proof-of-concept and matured while we were working on the new architecture. We also heard of users who have already written their own customized index implementation.
Query-Rewriting Optimizer
The second major development which affects the indexing system is a new query-optimizer. It analyzes the query at compile time and searches for optimizable subexpressions within the query tree. If it finds an optimizable expressions, the optimizer will modify the query and wrap some special instructions around the optimizable code block.
Note: right now, you have to enable the optimizer in conf.xml:
Understanding the optimization techniques requires some in-depth knowledge of eXist’s indexing scheme and query processing approach. This should be covered by a separate article and I will not go into details here. The basic idea is to limit the number of nodes that need to be processed as early as possible. For example, take the following query:
Let’s say we have 100,000 inproceedings and 600,000 author nodes in the database, but only 34 of them were actually written by a “Joe Doe”. The index lookups for inproceedings and author can be rather expensive (or at least memory consuming), while finding “Joe Doe” in the index is really cheap (a few milliseconds).
Now, the main trick used by the query optimizer is to evaluate the comparison author = "Joe Doe" ahead of the rest of the query. It then computes the ancestor nodes which happen to be inproceedings elements and puts them into relation with the outer query context.
Obviously, the efficiency of this technique depends on how selective the predicate expression is. Also, the range of expressions which can be optimized automatically is currently limited. For example, the optimizer can not yet deal with conditions in “where” clauses. However, it can speed up some types of XPath predicate expressions by factor 10 or more.
Coming back to indexing: the query-rewriting optimizer changes the evaluation sequence of the query expression. The expression author = "Joe Doe" is processed out of sequence, which means that the context of this expression (the inproceedings nodes) is unknown or not exactly known. This is not a real problem since the result of the expression wl be re-contextualized later. But it has some consequences for the indexing:
To make an optimization decision, the optimizer needs to know if an index is available for the subexpression and if it can be used out-of-sequence. If the index depends on a certain context, it can’t be used in query rewriting (it could still be used for other optimizations though).
Unfortunately, all the old eXist indexes were designed to be context-dependant. For example, you could define an index only on those author nodes being children of inproceedings elements, but not on those being children of book:
This kind of context-dependant index definitions helps to keep the index small, but it makes it hard for the optimizer to properly rewrite the expression tree without missing some nodes. We thus had to introduce an alternative configuration method which is not context-dependant. To keep things simple, we decided to define the index on the QName of the node alone and to ignore the context altogether:
As you see, the only change is that we are now using an attribute qname instead of the path attribute. An index defined in this way can be used with the optimization techniques described above. This is done transparently and you don’t need to provide anything else apart from the index.
You are still free to prefer the path attribute instead of qname (at least for the old indexes, not for the new n-gram index). An index configured like that will be used where possible, but not by the query-rewriting optimizer. The expected performance gain will thus be much smaller.
The main disadvantage of index definitions by qname is that the index might grow much larger (as so often, there’s a trade-off between performance and storage space). You cannot index only some author nodes and exclude others in the same collection. On the other hand, the performance win can be dramatic enough to justify an increase in index size. And since eXist allows you to configure different indexes for different database collections, you can always move documents you don’t want to be indexed into another collection.
Putting It All Together
Below is an example collection.xconf document, which uses most configuration options. To find out how to create and where to store this document, please refer to the documentation.
Let’s walk through this step by step:
We first need to tell the indexer not to create any full text indexes by default (default="none"), not even for attributes (attributes="false"). We then create a standard full text index on all author elements, identified by their QName. The next line declares an index on title, but with the added option content="mixed". This parameter causes the indexer to watch out for mixed-content nodes. For example, if your source XML contains markup like:
You may want to treat “uneven” as a single word so you can query for p |= "uneven". In this case, you need to pass content="mixed" to the indexer. On the other hand, if you have
you probably want to be able to query for “March” even though there’s no space between the year and month elements. In this case the standard settings are ok as they will add a virtual break between the elements.
Finally, the index on booktitle is specified in a context-dependant way using the path attribute (which doesn’t make that much sense here since only books can have a booktitle).
Range indexes support the XQuery comparison operators (=, less than, greater than …) as well as the standard string functions fn:starts-with, fn:ends-with, fn:contains or fn:matches. Range index are type-specific: an index defined on xs:int will not be used for string comparisons and vice versa. Also, string indexes will always be case sensitive and can only be used with the default collation. In other words: it is not possible to define a string index for a predefined collation (this is a limitation we plan to address soon).
Finally, let’s have a look at n-gram indexes:
The n-gram index splits the text it receives into sequences of n-characters (where n = 3 by default). For example, the text “love me” will be split into the tri-grams: “lov”, “ove”, “ve_”, “e_m”, “_me” (I replaced the space char by a _). This type of index is very efficient for exact substring matches, for example:
However, the main benefit of an n-gram index is that it works well with non-european languages like Chinese. The full text index is a bad match for these languages (and terribly slow) as they can not be easily split into whitespace separated tokens (or “words”).
Choosing the Right Index
If you are using a version of eXist which already supports the new query optimization techniques (applies to all versions since June 2007), the first rule should be:
Prefer simple index definitions on qname
The optimizer is still work in progress and you may not always see faster query times by following this rule. However, there’s rarely a good reason for using a context-dependant index definition, at least in my experience. So to be on the safe side and to benefit from current and future improvements, use qname="author" instead of path="//author".
Use range indexes on strongly typed data or short strings
Range indexes work with the standard XQuery operators and string functions. Querying for something like
will always be slow without an index. As long as no index is defined it, eXist has to scan over every year element in the db, casting its string value to an integer.
For queries on string content, range indexes work well for exact comparisons (author = 'Joe Doe') or regular expressions (matches(author, "^Joe.*")), though you may also consider using a full text index in the latter case. However, please note that range indexes on strings are case-sensitive or rathorrect formulation, sensitive to the default collation.
Consider an n-gram index for exact substring queries on longer text sequences
While range indexes tend to become slow for substring queries (like contains(title, "XSLT 2.0")), an n-gram index is nearly as fast as a full text index, but it also indexes whitespace and punctuation. ngram:contains(title, "XSLT 2.0") will only match titles containing the exact phrase “XSLT 2.0″. Please note also that n-gram indexes are case insensitive.
Choose a full text index for tokenizable text where whitespace/punctuation is mostly irrelevant
I don’t think I need to explain much here. The full text index is fast and should be used whenever you need to query for any sequence of separate words or tokens in a longer text. It can sometimes even be faster to post-process the returned node set and filter out wrong matches than using a much slower regular expression. The available full text functions and extension operators are described in the documentation.
The full text index is currently undergoing a major redesign, but this should only increase its general usability. For example, it is currently quite difficult to replace the general-purpose tokenizer. This will certainly become much easier in the future.