Julius Caesar started his famous text on the Gallic Wars with the sentence “Gallia est omnis divisa in partes tres” (“All Gaul is divided into three parts”).
Like Gaul, query understanding is, as a whole, divided into three parts:
- Holistic understanding of the query to ascertain its topic / category or establish the searcher’s high-level intent.
- Reductionist understanding to segment the query into components and determine what those components mean.
- Resolution to transform the results of holistic and reductionist understanding into the query that is executed against the search engine.
Let’s walk through this breakdown, starting with Caesar’s “Gallia est omnis divisa in partes tres” as an example of a search query. We’ll assume that the search application provides access to a broad set of educational materials (textbooks, literature, etc.).
Holistic understanding is the first step in query understanding. The goal of holistic understanding is to broadly — but not deeply — classify the query.
Here are some typical classes of holistic query understanding:
- Language identification. Our example query is in Latin (Classical Latin, to be precise). For search applications that support multiple languages, language identification is critical to enabling further query processing.
- Query categorization, which generally maps the query into a category taxonomy. Our example query could be mapped to History or History of the Roman Empire, depending on the granularity of the categorization.
- Establishing the searcher’s high-level intent, which presumes a categorization of such intents. For example, if we know that searchers often look up quotations to find the original source, we might recognize the example query as an example of this intent. Other intents might include finding a biography of a person, a textbook about a subject, etc.
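As a minimal sketch of the simplest of these classes, language identification can be approximated by comparing a query’s tokens against per-language function-word profiles. The languages and word lists below are invented for illustration; a production system would use a trained model over far richer features.

```python
# Tiny, illustrative function-word profiles (not a real language model).
FUNCTION_WORDS = {
    "latin": {"est", "et", "in", "non", "cum", "omnis"},
    "english": {"the", "is", "of", "and", "in", "to"},
    "french": {"le", "la", "est", "et", "en", "de"},
}

def identify_language(query: str) -> str:
    """Pick the language whose function words overlap most with the query."""
    tokens = set(query.lower().split())
    return max(FUNCTION_WORDS, key=lambda lang: len(tokens & FUNCTION_WORDS[lang]))

print(identify_language("gallia est omnis divisa in partes tres"))  # latin
```

This toy approach breaks down quickly on short queries with no function words, which is exactly why trained classifiers dominate in practice.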
Implementing holistic query understanding requires building classifiers, either through rules (e.g., regular expressions) or by training a machine learned classification model. A machine learning approach is generally more accurate, flexible, and scalable.
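The rule-based option can be sketched in a few lines. The intent classes and patterns below are hypothetical, chosen to match the quotation-lookup and biography examples above:

```python
import re

# Hypothetical rule-based intent classifier: each intent is defined by
# regular-expression patterns matched against the raw query string.
INTENT_RULES = {
    "find_quotation_source": [
        re.compile(r'^".+"$'),                      # query wrapped in quotes
        re.compile(r"\bwho (said|wrote)\b", re.IGNORECASE),
    ],
    "find_biography": [
        re.compile(r"\bbiography of\b", re.IGNORECASE),
        re.compile(r"\blife of\b", re.IGNORECASE),
    ],
}

def classify_intent(query: str) -> str:
    """Return the first intent whose rules match, else a default class."""
    for intent, patterns in INTENT_RULES.items():
        if any(p.search(query) for p in patterns):
            return intent
    return "unknown"

print(classify_intent('"Gallia est omnis divisa in partes tres"'))
# find_quotation_source
```

Rules like these are easy to start with but brittle, which is why the machine learning approach tends to win as query traffic grows.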
To train a classifier, each holistic query understanding component (for language identification, query categorization, etc.) requires a collection of labeled training data that maps queries to their associated classes. The labels can come from explicit human judgements or from historical search behavior (e.g., mapping a query to a category based on clicks). Since search queries are strings — and often short strings — it’s a good idea to use a character n-gram embedding like fastText to represent each query as a feature vector.
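To make the idea concrete, here is a stdlib-only sketch of a query categorizer built on character n-gram features. It hand-rolls the featurization and uses a nearest-centroid classifier rather than fastText itself, and the labeled training data is invented for illustration:

```python
import math
from collections import Counter

def char_ngrams(query: str, n: int = 3) -> Counter:
    """Character n-gram counts, with boundary padding so that even very
    short queries produce features (similar in spirit to fastText's
    subword units)."""
    padded = f"<{query}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented labeled training data mapping queries to categories.
TRAINING = [
    ("gallia est omnis divisa", "history"),
    ("julius caesar gallic wars", "history"),
    ("aeneid virgil epic poem", "literature"),
    ("roman empire poetry", "literature"),
]

# Nearest-centroid "training": sum the n-gram counts per category.
centroids = {}
for query, label in TRAINING:
    centroids.setdefault(label, Counter()).update(char_ngrams(query))

def categorize(query: str) -> str:
    """Assign the category whose centroid is most similar to the query."""
    grams = char_ngrams(query)
    return max(centroids, key=lambda label: cosine(grams, centroids[label]))

print(categorize("caesar gallic campaign"))  # history
```

A real system would swap the centroid step for a trained model, but the featurization step — representing a short string by its character n-grams — carries over directly.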
To summarize, holistic understanding looks at the query as a whole. It aims to be broad rather than deep, laying the foundation for later query processing.
The second step in query understanding is a reductionist understanding that breaks down the query into parts and tries to understand those parts.
Reductionist query understanding performs two related tasks: query segmentation and entity recognition. Query segmentation divides the search query into a sequence of semantic units, each of which consists of one or more tokens. Entity recognition classifies each segment into an entity type.
Our example query doesn’t yield an interesting segmentation, so let’s use this one instead (assuming the same search application): roman empire poetry.
This query should be segmented into two segments: roman empire and poetry. Then the first segment can be recognized as a subject area familiar from the previous example, while the second segment can be recognized as a genre. The segmented, recognized query is Subject: “roman empire”, Genre: “poetry”.
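The two tasks can be sketched together with a greedy longest-match segmenter over a dictionary of known entities. The dictionary below is a stand-in for a real entity lexicon:

```python
# Hypothetical entity dictionary mapping known phrases to entity types.
ENTITY_DICT = {
    "roman empire": "Subject",
    "poetry": "Genre",
}

def segment(query: str, max_len: int = 3):
    """Greedy longest-match segmentation against the entity dictionary.
    Returns a list of (entity_type_or_None, segment_text) pairs; tokens
    that match no dictionary phrase become single-token segments."""
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ENTITY_DICT or n == 1:
                segments.append((ENTITY_DICT.get(phrase), phrase))
                i += n
                break
    return segments

print(segment("roman empire poetry"))
# [('Subject', 'roman empire'), ('Genre', 'poetry')]
```

Dictionary matching like this is a common baseline; the machine-learned models discussed below handle ambiguity and unseen phrases that a lookup table cannot.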
Like classification, segmentation and entity recognition generally depend on machine learned models, which in turn depend on labeled training data.
It’s possible to train a single model for this task, but a model that covers all queries is likely to be very complex — since entity types vary significantly across categories. Instead, we can take advantage of holistic query understanding coming before reductionist query understanding and build a collection of models for segmentation and entity recognition. Holistic understanding makes it possible to select the right model for reductionist understanding — one that corresponds to the right language, category, etc.
As with classification, labels for segmentation and entity recognition can come from explicit human judgements or from historical search behavior. But inferring segments and entity types from clicks is a bit trickier than for whole-query classification. In order to directly infer segmentation and entity recognition labels from a query-document pair, each query token has to uniquely match one structured document field, and each multi-word segment has to correspond to a phrase in the matching field. It’s possible to relax this requirement, but doing so generally leads to a more complex approach.
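The uniqueness requirement can be sketched as a label-inference check over a clicked document. The document structure and field names here are hypothetical:

```python
def infer_labels(query: str, doc: dict, fields=("subject", "genre")):
    """Infer (field, segment) labels from a query-document pair.
    A field contributes a label only if its value appears as a contiguous
    phrase in the query, each query token is claimed by at most one field,
    and every query token ends up claimed; otherwise the pair is discarded."""
    tokens = query.lower().split()
    claimed = set()
    labels = []
    for field in fields:
        value_tokens = doc.get(field, "").lower().split()
        if not value_tokens:
            continue
        for i in range(len(tokens) - len(value_tokens) + 1):
            if tokens[i:i + len(value_tokens)] == value_tokens:
                span = set(range(i, i + len(value_tokens)))
                if span & claimed:   # token already claimed by another field
                    return None      # ambiguous pair: discard it
                claimed |= span
                labels.append((field, " ".join(value_tokens)))
                break
    return labels if len(claimed) == len(tokens) else None

doc = {"subject": "roman empire", "genre": "poetry"}
print(infer_labels("roman empire poetry", doc))
# [('subject', 'roman empire'), ('genre', 'poetry')]
```

Pairs that fail the check simply yield no training labels, which keeps the inferred data clean at the cost of coverage — the trade-off the relaxation mentioned above tries to address.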
Traditionally, people used hidden Markov models (HMMs) and conditional random fields (CRFs) for segmentation and entity recognition. A more modern approach uses deep learning — specifically, a bidirectional LSTM-CRF model.
To summarize, reductionist understanding breaks down the query into parts and tries to understand those parts. It often relies on holistic understanding to select an appropriate machine learning model, and then applies that model to perform segmentation and entity recognition.
Together, holistic and reductionist query understanding should yield a precise understanding of the searcher’s intent. The last step in query understanding is resolution. Resolution uses the results of the previous two steps to assemble a query for the back-end search engine.
Resolution has two parts. The first maps the recognized entities to query elements. The second assembles these elements into a query.
The first part ideally maps each recognized entity to an entity in a structured data knowledge base, which is typically a taxonomy, an ontology, or a faceted classification. These representations and their variants are sometimes called knowledge graphs. A modern search engine indexes documents by their structured data entities, assigning each structured data entity a unique identifier. In most cases, the combination of an entity type and a string should be enough to uniquely match an entity. In other cases, the matching may require building a classifier — but that’s beyond the scope of this post.
Returning to our roman empire poetry example, each of the two segments, Subject: “roman empire” and Genre: “poetry”, should map to an entity in the structured data knowledge base.
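In the common case, that mapping is just a lookup keyed by entity type and normalized string. The knowledge base contents and identifier scheme below are invented for illustration:

```python
# A toy knowledge base keyed by (entity type, normalized surface string),
# mapping to hypothetical unique entity identifiers.
KNOWLEDGE_BASE = {
    ("Subject", "roman empire"): "subj:0042",
    ("Subject", "gallic wars"): "subj:0043",
    ("Genre", "poetry"): "genre:0007",
}

def resolve(entity_type: str, text: str):
    """Map a recognized (type, string) pair to an entity id, or None
    if the pair matches nothing in the knowledge base."""
    return KNOWLEDGE_BASE.get((entity_type, text.lower()))

print(resolve("Subject", "Roman Empire"))  # subj:0042
print(resolve("Genre", "poetry"))          # genre:0007
```

Segments that resolve to None fall through as plain keywords in the assembly step that follows.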
The second step assembles the entities — as well as any segments that couldn’t be recognized as entities — into a query that is executed against the search engine. This query may be a simple conjunction — that is, an AND of all of the entities and unmatched keywords. In our example, that would be an AND of the structured data entities corresponding to Subject: Roman Empire and Genre: Poetry.
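A minimal sketch of that assembly step, using a hypothetical dictionary representation of the boolean query rather than any particular search engine’s query language:

```python
def assemble_query(resolved_entities, unmatched_keywords):
    """Assemble a simple conjunctive query: an AND over resolved entity
    ids plus any keywords that couldn't be resolved to entities."""
    clauses = [{"entity": eid} for eid in resolved_entities]
    clauses += [{"keyword": kw} for kw in unmatched_keywords]
    return {"AND": clauses}

query = assemble_query(["subj:0042", "genre:0007"], [])
print(query)
# {'AND': [{'entity': 'subj:0042'}, {'entity': 'genre:0007'}]}
```

In practice this structure would be translated into the back-end engine’s native query syntax, but the conjunction-of-entities shape is the same.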
For many queries, that’s all that’s necessary. But query assembly may be more complicated. It might expand or relax some of the entities to increase recall — which is especially useful for long queries that might otherwise return no or few results. Query assembly can also make decisions based on high-level intent, such as picking a ranking model or targeting a particular segment of the document collection. It can even determine other aspects of the search experience, such as which facets to present to the searcher.
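One simple relaxation strategy can be sketched as follows, assuming a hypothetical dictionary representation of a conjunctive query: first drop unresolved keyword clauses, and only fall back to any-of matching when there is nothing left to drop.

```python
def relax_query(query):
    """Relax a conjunctive query that returns few or no results:
    drop unmatched keyword clauses first; if every clause is an entity,
    fall back from AND to OR as a last resort."""
    clauses = query.get("AND", [])
    entity_clauses = [c for c in clauses if "entity" in c]
    if entity_clauses and len(entity_clauses) < len(clauses):
        return {"AND": entity_clauses}   # drop the unmatched keywords
    return {"OR": clauses}               # last resort: any-of matching

strict = {"AND": [{"entity": "subj:0042"}, {"keyword": "caesar"}]}
print(relax_query(strict))
# {'AND': [{'entity': 'subj:0042'}]}
```

Real systems choose among many such strategies (synonym expansion, dropping the least important clause, vector-based retrieval), often guided by the high-level intent established earlier.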
In summary, resolution isn’t so much about understanding the query as about translating that understanding into a strategy for retrieving, ranking, and presenting results.
Rome ne fut pas faite toute en un jour (“Rome was not built in a day”)
Like Rome, query understanding can’t be built in one day. Implementing holistic understanding, reductionist understanding, and resolution is a lot of work, and as a search team you can always find room to improve all of these. But if you’re not already looking at query understanding in this framework — or if you’re not looking at query understanding at all — I urge you to consider it. It won’t reduce the challenges, but it will help you tackle them in stages.