eXistDB Prototype Database Server

Filed under: Uncategorized. Tags: . | Leave a Comment 

I’ve got a prototype eXistDB server, built as a WebObjects application, running on an iMac on my desk. Works pretty well, and it does some great XQuery stuff. I’ve entered the CAREO metadata, current as of a week ago, so it’s got 3733 IMS LOM records to play with.

Check it out.

It has a simple search (just enter a term and hit the button), as well as a generic XQuery entry panel. Feel free to experiment with any XQuery statements you want (do a simple search and look at the source HTML of the result page for a starting point). Go ahead and try some funky boolean searches like “earth image satellite”. There are still some refinements left to make, like limits on search results: it’s currently possible to return the entire database as the result of a query, which is not recommended. I also have to play around with handling multiple schemas (IMS LOM, Dublin Core, MPEG-7, IMS CP, METS, …) for both querying and retrieval.

Basically, eXist is pretty darned good, and Wolfgang (the developer) is quite responsive. Not sure how it compares to XStreamDB performance-wise, but it does beat it on the cost…

eXistDB and WebObjects

I’ve spent the morning building a prototype WebObjects app to act as an XML metadata server. I’ve embedded eXistDB into the application, and it created the necessary database files and indices for me.

Then I wrote a short method to import XML documents from a path (with the added bonus of importing a whole directory if that’s what it’s given). The embedded database now holds 3600+ records.
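
A minimal sketch of that kind of import, for illustration only. This is not the prototype's code: `XmlImporter`, `collectXmlFiles`, `importPath` and the `store` callback are hypothetical names, and the callback stands in for whatever call actually hands a document to the embedded database. It also only looks at files directly inside the directory, not subdirectories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class XmlImporter {

    /** Returns the file itself, or every *.xml file directly inside a directory (non-recursive). */
    static List<Path> collectXmlFiles(Path path) throws IOException {
        if (Files.isRegularFile(path)) {
            return List.of(path);
        }
        // Files.list throws if the path is neither a file nor a directory.
        try (Stream<Path> entries = Files.list(path)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Hands each collected file to the (hypothetical) store callback; returns how many were passed on. */
    static int importPath(Path path, Consumer<Path> store) throws IOException {
        List<Path> files = collectXmlFiles(path);
        files.forEach(store);
        return files.size();
    }
}
```

Calling `XmlImporter.importPath(somePath, doc -> ...)` with either a single file or a directory covers both cases with one entry point, which is the "bonus" part.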

And boy, is it fast. Queries are almost instantaneous (~100ms typical), but document retrievals are a wee bit slower, increasing linearly with the number of hits. I haven’t added any limits, so you can do a query for something lame like “*a*” and get the whole database back in one page.
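
Since retrieval time grows with the number of hits, the obvious fix is to fetch only one page of them at a time. Here's a rough sketch of such a limit. `ResultLimiter` and `page` are made-up names, and the list of hits is just a stand-in for whatever the query actually returns.

```java
import java.util.List;

final class ResultLimiter {

    /** Returns at most pageSize hits starting at offset; an offset past the end yields an empty page. */
    static <T> List<T> page(List<T> hits, int offset, int pageSize) {
        if (offset < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and pageSize > 0");
        }
        int from = Math.min(offset, hits.size());
        // Widen to long so a huge pageSize cannot overflow the end index.
        int to = (int) Math.min((long) from + pageSize, hits.size());
        return hits.subList(from, to);
    }
}
```

With something like this in front of the retrieval step, a query for “*a*” would still match most of the database, but only the records on the requested page would be pulled out.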

The embedded eXist database doesn’t use the XML-RPC API like the standalone database does, so there isn’t any marshalling/unmarshalling overhead. Just native Java calls.

Considering that document retrieval isn’t optimized (it’s basically a debug “dump the entire LOM as the item to display”), performance is already quite acceptable.

Here are the stats from a simple search for “biology”:
Query: //text() &= 'biology'
Hits: 377
Query Time: 124 ms
Retrieval Time: 6059 ms

That retrieval time includes pulling the ENTIRE LOM for each and every one of the 377 results, which works out to roughly 16 ms per record.

UPDATE: Just ran some more tests, and cracked open the debug log file. Here’s what I found:

09 Dec 2003 18:42:00,401 – loading 3647 documents from 2collections took 3ms.
09 Dec 2003 18:42:00,411 – found image: 2800 in 4ms.
09 Dec 2003 18:42:00,414 – found nasa: 9 in 0ms.
09 Dec 2003 18:42:00,417 – found space: 13 in 1ms.
query: //text() &= 'image nasa space'
hits: 3
query time: 86
retrieve time: 22

Finding 2800 records containing the string “image” took 4ms. Holy freaking cow.

From the various test queries I’ve run, the vast majority of the time is spent retrieving the documents out of the database. The query itself runs extremely fast, but yanking the entire LOM out takes some time. I’m going to look at ways to pull only selected XPath values rather than the full record; that may be faster…
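
To picture the “pull only a few values” idea, here's a sketch using plain JAXP XPath against a single record. Big caveats: it runs against an in-memory string rather than inside eXist, `LomSummary` and `extract` are made-up names, the sample record has no namespaces (real IMS LOM records usually do), and it says nothing about whether this approach is actually faster in the database.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class LomSummary {

    /** Evaluates one XPath expression against a LOM record and returns its trimmed string value. */
    static String extract(String lomXml, String xpathExpression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, InputSource) parses the source and returns the result as a string.
        return xpath.evaluate(xpathExpression, new InputSource(new StringReader(lomXml))).trim();
    }
}
```

For a result list, the title at `/lom/general/title/langstring` and the format at `/lom/technical/format` would probably be enough to display each hit, with the full record fetched only when someone clicks through.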

eXist XPath Extensions

One of the really cool things about eXist is the XPath extensions for fulltext searching. They mimic (using XPath) the stuff that is done in XStreamDB via XQuery.

I can do stuff like:

document(*)//text() &= "*image*"

and eXist will return any XML document (from its entire set of collections) that contains the string “image” somewhere in it. It could be in /lom/general/title/langstring (“Images Of Bangalore”), or /lom/technical/format (“image/jpeg”), or wherever. It doesn’t care. And it’s very fast.

What’s more, I can do stuff like:

document(*)/*[ //format &= "*image*" and //text() &= "*earth*"]

which says “find me XML documents that have ‘image’ somewhere in a ‘format’ element (could be, say, /lom/technical/format) and contain the string ‘earth’ somewhere” (like, say, a /lom/general/title/langstring of “Earth At Night” or “Earthquakes”).

I can also do something like:

document(*)//text() &= "*image* *kyoto*"

Which will give me different results than

document(*)//text() &= "*image* *kyoto* *relig*"

because the second query restricts the results to records that also contain “relig”: religion, religious, whatever. In this case, a Buddhist temple in Kyoto is returned, as opposed to the Kyoto Accord presentations at the University of Calgary, which are returned by the first query.
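
These wildcard queries all follow one pattern, so a simple search box could build them from whatever the user types. A sketch, assuming only the query syntax shown in the examples here; `FulltextQuery` and `allTerms` are hypothetical names, not part of eXist.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class FulltextQuery {

    /** Builds document(*)//text() &= "*t1* *t2* ..." from free-form user input. */
    static String allTerms(String userInput) {
        String keywords = Arrays.stream(userInput.trim().split("\\s+"))
                .map(term -> term.replaceAll("[\"'*]", ""))  // drop quotes and stray wildcards
                .filter(term -> !term.isEmpty())
                .map(term -> "*" + term + "*")               // match the term anywhere in a word
                .collect(Collectors.joining(" "));
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("no search terms given");
        }
        return "document(*)//text() &= \"" + keywords + "\"";
    }
}
```

`FulltextQuery.allTerms("image kyoto relig")` produces the three-term query from the example. Stripping quotes from the input keeps a stray `"` from breaking out of the string literal.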

Queries based on the fulltext extensions are amazingly fast. (The &= operator means “boolean and”; you can also use the |= operator for “boolean or”.) I’m getting results from rather complicated test queries on the entire 3600+ CAREO record set in a fraction of a second. Nice.