Broad and Ambiguous Search Queries

Recognizing When Search Results Need Diversification

Can a search engine automatically determine when a search query is broad or ambiguous? No approach is perfect, but here are some useful signals:

  • Number of results. Specific queries tend to have small result sets. Conversely, broad and ambiguous queries tend to have large result sets. But a large result set may simply reflect an aggressive matching strategy. A more nuanced approach is to count the results with high relevance scores. If this number is high, then the query is probably broad or ambiguous.
  • Variance of results. A stronger signal than the size of the result set is its variance. This variance can be computed from pairwise result similarity (e.g., cosine distance using a word embedding model), or from a histogram that summarizes the result set (e.g., the entropy of the category distribution). A high variance indicates a broad or ambiguous query.
  • Distinctiveness of results. Another signal is the distinctiveness of the results relative to those of the overall document collection, typically measured using Kullback–Leibler divergence. For a deeper dive into this and related approaches, I recommend Claudia Hauff’s dissertation on “Predicting the Effectiveness of Queries and Retrieval Systems”.
  • Query analysis. Short search queries tend to be broad, and they are also more likely to be ambiguous. Processing the query with a part-of-speech or entity recognition tagger can yield a more precise analysis. Hauff discusses these kinds of strategies in her section on pre-retrieval predictors. A more modern approach would take advantage of word embeddings, e.g., comparing the query with a collection of queries of known specificity.
  • Historical searcher behavior. For frequent queries, the search engine can learn from historical searcher behavior. Specific queries tend to have high click-through rates, and the clicks tend to land on top-ranked results. In contrast, broad and ambiguous queries have lower click-through rates and fewer clicks on top-ranked results. Broad and ambiguous queries also have higher rates of pagination, query refinement, and query reformulation. Finally, it’s possible to use labeled queries to train a machine learning model that recognizes broad and ambiguous queries, though any approach based on historical searcher behavior is vulnerable to presentation bias.
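
To make the first three signals concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function names and the relevance threshold are hypothetical, each result is assumed to carry a relevance score, a category label, and an embedding vector, and the divergence is computed over category distributions rather than the term-level language models a production system might use.

```python
import math
from collections import Counter
from itertools import combinations


def count_high_relevance(scores, threshold):
    """Number of results whose relevance score is at least `threshold`.

    The threshold is a hypothetical tuning parameter."""
    return sum(1 for s in scores if s >= threshold)


def category_entropy(categories):
    """Shannon entropy (in bits) of the category histogram of a result set."""
    counts = Counter(categories)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def cosine_distance(u, v):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over all pairs of result embeddings."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def kl_divergence(result_categories, collection_categories):
    """KL divergence D(results || collection), in bits, over categories.

    Assumes every category in the results also occurs in the collection,
    which holds when the results are drawn from that collection."""
    p = Counter(result_categories)
    q = Counter(collection_categories)
    p_total = sum(p.values())
    q_total = sum(q.values())
    return sum(
        (n / p_total) * math.log2((n / p_total) / (q[cat] / q_total))
        for cat, n in p.items()
    )
```

A result set spread evenly over many categories yields a high `category_entropy`, and one whose embeddings point in very different directions yields a high `mean_pairwise_distance`. A divergence near zero means the results are distributed much like the collection as a whole. In practice, the cutoffs for each signal would need to be calibrated on queries of known specificity.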

Broad Queries vs. Ambiguous Queries

All of these signals are ways to identify broad and ambiguous search queries. But these two classes of queries have important differences.

  • Modality of distribution. The results for a broad query center around a single mode that represents the “average” result. In contrast, an ambiguous query returns a mixture of results with two or more modes. There are various statistical tests to measure the modality of a distribution, such as Hartigan’s dip test.
  • Top-level vs. lower-level category variance. A broad query generally has results within a single top-level category, e.g., shirts are all in clothing. The results vary within the children of that top-level category. In contrast, results for an ambiguous query split among multiple top-level categories, e.g., mixers are split among kitchen appliances, audio equipment, etc.
  • Entity recognition. Entity recognition for an unambiguous query typically yields a single sequence of tags with a high confidence score. In contrast, the lack of a single dominant tag sequence indicates an ambiguous query.
  • Historical searcher behavior. If the query is frequent, then it’s possible to apply previously cited statistical tests for the modality of distribution to the results that searchers have historically engaged with.
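
The category-variance signal above lends itself to a short sketch. The following Python is a rough heuristic under assumptions of my own: each result is represented by a hypothetical category path such as `("clothing", "t-shirts")`, and the `dominance` and `min_children` thresholds are illustrative values rather than recommended settings.

```python
from collections import Counter


def top_level_shares(category_paths):
    """Fraction of results that fall under each top-level category."""
    counts = Counter(path[0] for path in category_paths)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def classify_query(category_paths, dominance=0.8, min_children=2):
    """Label a query "ambiguous", "broad", or "specific" from its results.

    `dominance` and `min_children` are hypothetical thresholds:
    - no top-level category holding at least `dominance` of the results
      is treated as a sign of ambiguity;
    - a dominant top-level category whose results spread over at least
      `min_children` child categories is treated as a sign of breadth.
    """
    if not category_paths:
        raise ValueError("need at least one result")
    shares = top_level_shares(category_paths)
    top_cat, top_share = max(shares.items(), key=lambda kv: kv[1])
    if top_share < dominance:
        return "ambiguous"
    children = {
        path[1]
        for path in category_paths
        if path[0] == top_cat and len(path) > 1
    }
    return "broad" if len(children) >= min_children else "specific"


shirts = [
    ("clothing", "t-shirts"),
    ("clothing", "dress shirts"),
    ("clothing", "polo shirts"),
]
mixers = [
    ("kitchen appliances", "stand mixers"),
    ("audio equipment", "dj mixers"),
    ("kitchen appliances", "hand mixers"),
    ("audio equipment", "audio mixers"),
]
print(classify_query(shirts))  # broad
print(classify_query(mixers))  # ambiguous
```

The same function could be applied to the results that searchers have historically engaged with instead of the retrieved results, which would correspond to the historical-behavior signal for frequent queries.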

Search User Interface Implications

All of the discussion so far has been about recognizing broad and ambiguous queries. But what should a search engine do differently if it does recognize such a query?

Summary

Many search queries only require the traditional approach of ranking a set of matching results. But some queries require a more complex approach because they are either broad or ambiguous. It’s important for a search engine to detect such queries, as well as to distinguish broad queries from ambiguous ones. Fortunately, search engines can use a variety of signals for both tasks, which allows them to help the searcher disambiguate or refine the query as appropriate.
