Calliope Sounds: Lucene, shadow query classes, and the visitor pattern

A good Lucene result is achieved from an index and a query working together.

Defining the schema for the index is mostly an upfront design task: What are the fields? Which fields are stored and which are indexed? How are terms in fields parsed? Are found terms supplemented? Are multiple fields combined? Etc.

Once you have made these decisions, revisiting them can be prohibitive without planning. That is, reindexing turns out to be too expensive in time or resources or, worse, the source documents are no longer available. Most uses of Lucene err on the side of a cautious schema, where a lot more is stored and indexed than is needed now, in the hope of a successful schema refactoring in the future. My experience has been that you don't know enough now about your data, how to index it, how to query it, and how users want to access it to have a viable cautious schema. It is better to store all your sources so you can reindex when you know better how the indexes and queries can cooperate. This is just a cost of using a new technology (and, as always, it can be mitigated).

The upshot in the near term is that the index schema is static. You have to apply flexibility with the query.

As mentioned in a previous posting, there is a tendency to think of Lucene queries like SQL queries. That is, there is a single, correct rendition. Discard that thinking. There are no correct results; there are only better results. To achieve better results you need to watch what your users search for and work out how the queries need to adjust. For example, perhaps you discover that there is a shift in vocabulary happening. What was once "OS X" is now "macOS". When a user queries for OS X you need to also include macOS in the query.

The Lucene API contains a number of query subtypes. These are combined to construct an expression that characterizes the user's search intent. This nested data structure should be considered a starting point. It will be augmented and reformed to reflect your current understanding of how best to use the indexes. In the example above, you want to include macOS when OS X is used.

The Lucene API query subclasses are too rigid for direct augmentation and reforming. In the past the API was downright unbending, and so I developed a set of shadow query classes that are amenable to the Visitor pattern. For example, this shows two visitors: the first adds the macOS variant and the second converts the query to a Solr expression:

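Query q = ... new TermQuery(10.0f, "f", "osx") ...
...
// terms that should be searched together
Map<String, List<String>> variants = new HashMap<>();
variants.put("a", Arrays.asList("osx", "macos"));
// first visitor: adds the variants with a low boost
VariantsQueryVistor vistor = new VariantsQueryVistor(0.0001f, variants);
...
q = vistor.visitQuery(q);
...
// second visitor: renders the query as a Solr expression
String s = new SolrLuceneQueryVistor().visitQuery(q).toString();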

and s is

( "osx" OR "macos" ^ 0.0001 ) ^ 10.0
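
To make the idea concrete, here is a minimal, self-contained sketch of what shadow classes and a pair of visitors might look like. It is not the code behind the example above. The names ShadowQuery, ShadowVisitor, ShadowTerm, ShadowOr, VariantExpander and QueryPrinter are hypothetical, the variants map is assumed to be keyed by the term text, and only two node types are modelled. The sketch uses a variant boost of 0.5 rather than 0.0001 because Java renders very small floats in scientific notation (0.0001f prints as 1.0E-4), which a real renderer would need to format explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shadow classes: plain immutable stand-ins for Lucene queries.
interface ShadowQuery {
    <R> R accept(ShadowVisitor<R> visitor);
}

// One visit method per node type; R is whatever the visitor produces.
interface ShadowVisitor<R> {
    R visitTerm(ShadowTerm term);
    R visitOr(ShadowOr or);
}

final class ShadowTerm implements ShadowQuery {
    final float boost;
    final String field;
    final String text;
    ShadowTerm(float boost, String field, String text) {
        this.boost = boost;
        this.field = field;
        this.text = text;
    }
    public <R> R accept(ShadowVisitor<R> visitor) { return visitor.visitTerm(this); }
}

final class ShadowOr implements ShadowQuery {
    final float boost;
    final List<ShadowQuery> clauses;
    ShadowOr(float boost, List<ShadowQuery> clauses) {
        this.boost = boost;
        this.clauses = clauses;
    }
    public <R> R accept(ShadowVisitor<R> visitor) { return visitor.visitOr(this); }
}

// Rewriting visitor: returns a new tree in which known terms gain their variants.
final class VariantExpander implements ShadowVisitor<ShadowQuery> {
    private final float variantBoost;
    private final Map<String, List<String>> variants; // assumption: keyed by term text
    VariantExpander(float variantBoost, Map<String, List<String>> variants) {
        this.variantBoost = variantBoost;
        this.variants = variants;
    }
    public ShadowQuery visitTerm(ShadowTerm term) {
        List<String> alternatives = variants.get(term.text);
        if (alternatives == null) return term; // no variants known, keep the term as-is
        List<ShadowQuery> clauses = new ArrayList<>();
        clauses.add(new ShadowTerm(1.0f, term.field, term.text));
        for (String alt : alternatives) {
            clauses.add(new ShadowTerm(variantBoost, term.field, alt));
        }
        return new ShadowOr(term.boost, clauses); // the original boost moves to the group
    }
    public ShadowQuery visitOr(ShadowOr or) {
        List<ShadowQuery> rewritten = new ArrayList<>();
        for (ShadowQuery clause : or.clauses) rewritten.add(clause.accept(this));
        return new ShadowOr(or.boost, rewritten);
    }
}

// Rendering visitor: produces a query string (term text only, field omitted).
final class QueryPrinter implements ShadowVisitor<String> {
    private static String boosted(String s, float boost) {
        return boost == 1.0f ? s : s + " ^ " + boost;
    }
    public String visitTerm(ShadowTerm term) {
        return boosted("\"" + term.text + "\"", term.boost);
    }
    public String visitOr(ShadowOr or) {
        List<String> parts = new ArrayList<>();
        for (ShadowQuery clause : or.clauses) parts.add(clause.accept(this));
        return boosted("( " + String.join(" OR ", parts) + " )", or.boost);
    }
}

class ShadowQueryDemo {
    public static void main(String[] args) {
        Map<String, List<String>> variants = new HashMap<>();
        variants.put("osx", Arrays.asList("macos"));
        ShadowQuery q = new ShadowTerm(10.0f, "f", "osx");
        q = q.accept(new VariantExpander(0.5f, variants));
        System.out.println(q.accept(new QueryPrinter())); // ( "osx" OR "macos" ^ 0.5 ) ^ 10.0
    }
}
```

The point of the design is that the query tree knows nothing about how it will be rewritten or rendered. Each new adjustment you discover from watching your users becomes another visitor, and adding a new node type means adding one method to the visitor interface, which the compiler then forces every existing visitor to handle.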