Calliope Sounds: Lucene boosting

Lucene has a powerful query language. The same language is used by Solr and Elasticsearch. I have found that users often don't utilize it well as they mistakenly apply their SQL experiences to it. With SQL your queries return exact results. Nothing in the result set is irrelevant. SQL query performance rewards tight queries consisting of few terms, few indexes, and few joins. With Lucene yours queries return ranked results. Much in the result set is actually peripheral. In fact, unless you limit the result set Lucene will return all the documents. A Lucene query is about getting a good ranking of results rather than exact results. To this end your query and indexes need to work together to achieve this. In general this means you need to make good use of broadening and narrowing terms, and boosting matches.

For example, if your query simply looked for the term "mouse" and you indexed your documents verbatim you should not expect to find any "mice". (Recall that Lucene sees your words as numbers and so "mouse" might be #23 and "mice" might be #6078.) It is therefore better to search for

mouse mice

When searching for "mouse mice" your results will be ordered so that either term gives equal weight to the document's rank. This is unlikely the correct course. The query was for "mouse" and you broadened it to include "mice". Documents matching "mouse" should be ranked higher than documents matching "mice". In Lucene's query language you do this by boosting the weight of terms. Documents matching "mouse" should be boosted orders of magnitude higher than "mice", ie

mouse^1000 mice

You will often see small boost values in other peoples examples. My experience has been that small boosts do not adequately differentiate documents. Big boosts do.

You likely noticed that my query was for "mouse mice" and not "mouse or mice". With Lucene, as with SQL, a boolean "or" kills performance. By not using "or" in your Lucene query you are allowing it to rank higher documents that contain both terms over documents that contain only one of them. Since the higher documents do contain the wanted term, "mouse", I don't see a need to repress ranking them equally to documents containing only "mouse" (ie, no use of "mice"). The performance cost is usually not worth it, especially when your queriers will be more complex than this simple example.