Using Solr for searching research data in early Arabic poems

Posted by s.aragon on 26 May 2017 - 11:00am

By Swithun Crowe, Research Computing, University Library, University of St Andrews

This article is part of our series: A day in the software life, in which researchers from all disciplines discuss the tools that make their or someone else’s research possible.

Most of the data I work with is in XML format: Text Encoding Initiative (TEI) documents, either handwritten or edited using XForms; XML exported from other programs such as Zotero; or XML taken from third-party web services, such as the Library of Congress authority files. To search these files, I use Apache's Solr document search engine, usually communicating with it via PHP's Solr extension. XSLT transforms the source XML documents into a form which Solr can ingest.

The examples in this article come from the Arab Cultural Semantics in Transition (ARSEM) project at the University of St Andrews, which links a corpus of early Arabic poems with a dictionary containing words and meanings used in the poems. With Solr, I've been able to create search interfaces for the dictionary (Arabic or transliterated stems and lemmata, and meanings in various languages) and poems (biographical information, poem features, and stems/lemmata/translations within lines or within poems).

For someone coming from relational databases, several features of Solr are hard to get one's head around. A Solr document is a flat list of field name and value pairs. For example, the text of every line in a poem can be stored in a single poem document, but it then isn't possible to restrict a search to particular lines. To allow searching within specific lines of poems, I had to create a separate document for each line.
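The one-document-per-line approach can be sketched as follows. This is a minimal illustration in Python, not the project's XSLT, and the field names (id, poem_id, poet, line_number, line_text) are hypothetical.

```python
from typing import Any


def line_documents(poem: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one poem record into one flat, Solr-style document per line."""
    docs = []
    for number, text in enumerate(poem["lines"], start=1):
        docs.append({
            # Unique key built from the poem id and the line number.
            "id": f"{poem['id']}_{number}",
            # Poem-level values are repeated in every line document.
            "poem_id": poem["id"],
            "poet": poem["poet"],
            "line_number": number,
            "line_text": text,
        })
    return docs


# Hypothetical input record, for illustration only.
poem = {"id": "poem12", "poet": "Example Poet", "lines": ["first line", "second line"]}
for doc in line_documents(poem):
    print(doc)
```

Each resulting dictionary is a flat list of name and value pairs, so a search can be restricted to line_text and line_number without any nested structure.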

Solr also doesn't provide the same ability to join tables together that a relational database does, so one can forget about aggressive normalisation. For instance, information about a whole poem has to be duplicated in each line document. Since version 4, Solr has offered a limited form of join: one can obtain one set of documents by searching another set and specifying how the former is joined to the latter. This means that one doesn't have to fully denormalise one's data, and a change to one record doesn't require updating every Solr document which refers to that record. Results can also be grouped based on the values in various fields. For instance, poem lines can be searched and the matches grouped by poem, so that the results show one line per poem.
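As a rough sketch of what joining and grouping look like at the level of request parameters, the Python below builds two query strings. The parameter names (group, group.field, group.limit) and the {!join} query parser belong to Solr; the field names are hypothetical, and the join example assumes that poem documents and line documents live in the same index.

```python
from urllib.parse import urlencode


def grouped_line_search(term: str) -> dict[str, str]:
    """Search line documents, but return at most one line per poem."""
    return {
        # Note: a real application should escape special characters in term.
        "q": f"line_text:{term}",
        "group": "true",           # turn on result grouping
        "group.field": "poem_id",  # one group per poem
        "group.limit": "1",        # keep one line in each group
    }


def poems_with_matching_lines(term: str) -> dict[str, str]:
    """Return poem documents that have at least one matching line document."""
    return {
        # Match line documents, then follow poem_id (on lines) to id (on poems).
        "q": f"{{!join from=poem_id to=id}}line_text:{term}",
    }


print(urlencode(grouped_line_search("qalb")))
print(urlencode(poems_with_matching_lines("qalb")))
```

With the join form, poem-level details only need to be kept on the poem document, which is what removes the need for full denormalisation.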

Solr provides faceting of search results. For example, users can search the full text of poems and then further refine their results by choosing to filter on various facets, such as the floruit period of the poet, or where the poet lived. Facet data—for each faceted field a list of values and counts for each value—appears in Solr's search result document, so can be transformed into form controls or links by the XSLT which generates the search results page.
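A minimal sketch of the faceting round trip, in Python and assuming Solr's JSON response format rather than the XML that the project transforms with XSLT. The facet, facet.field, facet.mincount and fq parameters are Solr's own; the field names and values are hypothetical.

```python
def facet_params(query: str, facet_fields: list[str],
                 filters: dict[str, str]) -> list[tuple[str, str]]:
    """Build request parameters for a faceted search with optional filters."""
    params = [("q", query), ("facet", "true"), ("facet.mincount", "1")]
    # One facet.field parameter per field that should be counted.
    params += [("facet.field", name) for name in facet_fields]
    # Each facet value the user has picked becomes a filter query.
    params += [("fq", f'{name}:"{value}"') for name, value in filters.items()]
    return params


def facet_counts(response: dict, field: str) -> list[tuple[str, int]]:
    """Pair up Solr's default flat [value, count, value, count, ...] list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment, for illustration only.
response = {"facet_counts": {"facet_fields": {"poet_period": ["period A", 12, "period B", 3]}}}
for value, count in facet_counts(response, "poet_period"):
    print(f"{value} ({count})")
```

Each (value, count) pair is what would be rendered as a link or form control on the results page.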

Another nice feature which can help in putting together a page of search results is Solr's ability to highlight where it has matched one's search terms. Solr comes with its own small Java web application server. I have it listening only on localhost, so it doesn't need an extra layer of security. Query strings are built in PHP from form data, and the object-oriented Solr extension makes it easy to configure things like facets, grouping and highlighting.
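To illustrate highlighting, here is a small Python sketch that requests highlighted snippets and attaches them to the matching documents. The hl and hl.fl parameters and the separate highlighting section of the response are Solr's; the JSON response format, the unique key name id and the field name line_text are assumptions.

```python
def highlight_params(query: str, field: str) -> dict[str, str]:
    """Build request parameters that ask Solr to highlight matches in one field."""
    return {"q": query, "hl": "true", "hl.fl": field}


def attach_snippets(response: dict, field: str) -> list[dict]:
    """Copy each document and add its highlighted snippets, if there are any."""
    # Solr returns snippets in a separate section keyed by the unique key.
    highlighting = response.get("highlighting", {})
    results = []
    for doc in response["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        results.append({**doc, "snippets": snippets})
    return results


# Hypothetical response fragment, for illustration only.
response = {
    "response": {"docs": [{"id": "poem12_1"}, {"id": "poem12_2"}]},
    "highlighting": {"poem12_1": {"line_text": ["first <em>line</em>"]}},
}
for result in attach_snippets(response, "line_text"):
    print(result)
```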

The Solr schema document is where one defines the type of each field. Fields can be stored but not indexed, so that they appear in search results without being searchable; for example, pre-generated HTML can be stored, avoiding the need to store the underlying data and regenerate the HTML every time. Conversely, fields can be indexed but not stored, if one wants to search them but doesn't need them to be present in the search results.
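The stored and indexed settings can be sketched as data. The article's setup uses a schema document; the Python below instead expresses the same two choices as JSON commands in the style of Solr's Schema API (add-field), which is an assumption about tooling, not a description of the project. The field names are hypothetical, and the type names string and text_general are assumed to be defined in the schema.

```python
import json


def add_field_command(name: str, field_type: str, *,
                      indexed: bool, stored: bool) -> dict:
    """Build an add-field command describing one schema field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,  # can the field be searched?
            "stored": stored,    # is the value returned in results?
        }
    }


# Pre-generated HTML: returned with results, never searched.
display_only = add_field_command("line_html", "string", indexed=False, stored=True)
# Search text: searched, but not returned with results.
search_only = add_field_command("line_search", "text_general", indexed=True, stored=False)

print(json.dumps(display_only, indent=2))
print(json.dumps(search_only, indent=2))
```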

One can define one's own field types. For instance, I created a custom filter (in Java) to remove vocalisations (vowels) from Arabic text. The source documents were all vocalised, but people often want to search without having to type in the vowels. Or text which may contain diacritics can be ASCII-folded to remove these. For each field type, one can specify the tokenisers and filters to be used, for both the indexing stage and the searching stage. Stemming and stop words are available for many languages.
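The effect of such filters can be imitated in a few lines of Python. This is only an illustration of the idea, not the custom Java filter described above: it assumes that the marks to remove are the Arabic characters U+064B to U+0652 (the tanwin, short vowels, shadda and sukun), and the folding function is a simplified stand-in for Solr's ASCII folding that only drops combining marks.

```python
import unicodedata

# Arabic vocalisation marks, U+064B (fathatan) through U+0652 (sukun).
ARABIC_MARKS = {chr(code_point) for code_point in range(0x064B, 0x0653)}


def strip_vocalisation(text: str) -> str:
    """Remove Arabic vocalisation marks, leaving the consonantal text."""
    return "".join(ch for ch in text if ch not in ARABIC_MARKS)


def fold_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks (e.g. a with macron becomes a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# kaf + fatha, ta + fatha, ba + fatha (the vocalised word "kataba").
vocalised = "\u0643\u064e\u062a\u064e\u0628\u064e"
print(strip_vocalisation(vocalised))

# Transliteration with a dot below and a macron.
print(fold_diacritics("\u1e25am\u0101sa"))
```

Applying the same function both when indexing and when searching means that vocalised and unvocalised spellings of a word end up as the same token.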

The schema can specify that certain fields should be duplicated. A common use of this is to copy many fields into a general 'text' field which can then be used in a simple search, while advanced searches can be done on specific fields. Or fields can be named dynamically (with wildcards), so, for example, all fields ending in ‘_ar’ can be treated as containing Arabic text.
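Copy fields and dynamic fields can be sketched in the same spirit. The Python below builds JSON commands in the style of Solr's Schema API (add-copy-field and add-dynamic-field), which is an assumption about tooling, since the article describes a schema document. The field names and the text_ar type name are hypothetical, and the pattern matcher is a simplification that returns the first matching pattern, whereas Solr applies its own precedence rules.

```python
from fnmatch import fnmatchcase
from typing import Optional


def add_copy_field_command(source: str, dest: str) -> dict:
    """Copy the content of one field into another, e.g. a general 'text' field."""
    return {"add-copy-field": {"source": source, "dest": dest}}


def add_dynamic_field_command(pattern: str, field_type: str) -> dict:
    """Declare that every field whose name matches the pattern shares one type."""
    return {"add-dynamic-field": {"name": pattern, "type": field_type,
                                  "indexed": True, "stored": True}}


def matching_pattern(field_name: str, patterns: list[str]) -> Optional[str]:
    """Return the first wildcard pattern that matches the field name, if any."""
    for pattern in patterns:
        if fnmatchcase(field_name, pattern):
            return pattern
    return None


print(add_copy_field_command("line_text", "text"))
print(add_dynamic_field_command("*_ar", "text_ar"))
print(matching_pattern("lemma_ar", ["*_ar", "*_en"]))
```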

There are other pure XML databases—I moved to Solr from an early version of Oracle’s DB XML—but I don’t think they are as fast or flexible as Solr. Similarly, for document-centric data, Solr is a much better fit than a relational database. With it I can work with a pure XML pipeline (XForms, XSLT etc.) and my original data is stored in plain text documents (think digital preservation!). Without Solr, I would have to store my original data in binary form or spend much more time converting XML documents to relational tabular data.