Measuring Search Effectiveness
Delivering effective search starts with defining how to measure effectiveness.
The information retrieval community has been evaluating search for decades. Established measures include precision (fraction of returned results that are relevant) and recall (fraction of relevant results retrieved), as well as graded and position-biased variations, such as discounted cumulative gain (DCG).
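To make these definitions concrete, here is a minimal sketch in Python, assuming binary relevance judgments for precision and recall and graded judgments for DCG; the function names and cutoffs are illustrative rather than drawn from any particular library.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved results that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant results that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain: graded relevance discounted by position."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

# Example: five retrieved results, four relevant results in the collection.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d7", "d9"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 0.5
print(dcg_at_k([3, 0, 2, 0, 1], 5))            # graded gains by position
```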
These evaluation measures should be part of every search practitioner’s toolkit, but they are not sufficient. Search is not an end in itself: it is a tool that enables an information-seeking process. Search effectiveness measures also need to consider how well search supports that process.
This post explores a few ways that search effectiveness evaluation can look beyond traditional information retrieval measures to consider sessions, users, and overall impact.
Queries vs. Sessions
One challenge with traditional evaluation measures is that they focus on individual search queries. Users interact with search engines through sessions that often comprise multiple search queries. As a result, evaluating search sessions tends to be more meaningful than evaluating search queries.
Unfortunately, the academic work on session evaluation is less mature than the work on query evaluation. There have been efforts to generalize measures like DCG to sessions, but modeling sessions is tricky. The different queries in the session may not express identical intent. Moreover, queries within a session are not independent of one another, since searchers adapt their behavior from query to query. Current research suggests evaluating sessions based on the last and the worst-performing queries in the session.
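As a minimal sketch of that idea, assume each query in a session has already been scored with a query-level measure such as DCG; the session score then combines the last and worst query scores. The equal weighting here is an illustrative choice, not an established standard.

```python
def session_score(query_scores):
    """Score a session from its per-query scores (e.g., DCG values),
    emphasizing the last and the worst-performing queries."""
    last = query_scores[-1]
    worst = min(query_scores)
    return 0.5 * last + 0.5 * worst  # illustrative weighting

# A three-query session that ends well but had a poor middle query.
print(session_score([0.6, 0.1, 0.9]))  # 0.5
```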
Another aspect of session evaluation is that modern search engines do more than return a ranked list of results. A modern search experience includes contextually appropriate navigation options, such as category and facet refinements. A search engine should recognize where searchers are in their search journey and adapt the search experience accordingly, guiding searchers towards queries that better express their intent.
In general, session evaluation is an opportunity to consider the return that the search engine provides for the user’s investment of effort. Clicks and conversions (e.g., purchases) tend to be good measures of return, while queries, keystrokes, and time spent on the search results page tend to be good measures of investment. By modeling the return and investment in terms of these or similar factors, it is possible to estimate session ROI for the user.
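Here is a rough sketch of what such an estimate might look like, assuming a simple session log with counts of clicks, conversions, queries, keystrokes, and time on the results page; the weights are illustrative placeholders that would need to be fit to real data.

```python
from dataclasses import dataclass

@dataclass
class Session:
    clicks: int
    conversions: int
    queries: int
    keystrokes: int
    seconds_on_serp: float

def session_roi(s: Session) -> float:
    """Estimate return on investment for a single search session.
    Return is driven by clicks and conversions; investment by effort."""
    gain = 1.0 * s.clicks + 10.0 * s.conversions                       # illustrative weights
    effort = 1.0 * s.queries + 0.05 * s.keystrokes + 0.02 * s.seconds_on_serp
    return gain / max(effort, 1e-9)

# A session with two queries, one click, and one purchase.
print(session_roi(Session(clicks=1, conversions=1, queries=2,
                          keystrokes=30, seconds_on_serp=120)))
```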
Sessions vs. Users
Evaluating sessions provides a more holistic perspective than only evaluating queries, but sometimes a search journey spans multiple sessions. A shopper might spend days or weeks before deciding on an expensive purchase. Or a researcher might return to a site several times to access content, often using the search engine to re-find previously discovered results.
A popular approach for evaluating multiple-session journeys is to attribute some search engagement to previous sessions and adjust session ROI measures accordingly. For example, a purchase on an ecommerce website can be attributed to a search in a previous session that discovered the product. While no attribution model is perfect, extending attribution across sessions mitigates the tendency for within-session models to favor low-consideration decisions over high-consideration decisions. Of course, this approach requires the ability to track users across sessions.
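As a rough sketch, assuming users can be tracked across sessions and that each session log records the items surfaced by search, one simple model credits a purchase to the most recent earlier session that surfaced the purchased item (a last-touch rule); this is just one of many possible attribution models.

```python
def attribute_purchase(purchased_item, sessions):
    """Attribute a purchase to the most recent earlier session whose
    search results surfaced the purchased item (last-touch across sessions).
    `sessions` is a chronological list of per-session sets of surfaced items."""
    for i in range(len(sessions) - 1, -1, -1):
        if purchased_item in sessions[i]:
            return i
    return None  # no search session gets credit

# The purchase is credited to the search in session 0, two sessions earlier.
sessions = [{"camera", "tripod"}, {"memory card"}, set()]
print(attribute_purchase("camera", sessions))  # 0
```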
A more subtle consideration is that user behavior evolves across sessions as users learn from interacting with a search engine. In particular, searchers tend to return to a search engine more often when they have positive experiences, which means that session evaluation exhibits survivorship bias. Similarly, searchers learn from experience to do more of certain kinds of searches and less of others, so their behavior no longer represents an unbiased expression of their intent.
Detecting and correcting for this survivorship bias is difficult, since the search engine doesn’t know what it doesn’t know. It’s possible, however, to focus on new users who have not yet had a chance to learn from interacting with the search engine. The behavior of these users, and the changes in that behavior, provides strong signals of what users expect from the search engine, and of which of those expectations the search engine is succeeding in addressing.
Components vs. Overall Impact
A search engine consists of many interconnected components. These components operate at different levels of the search stack, determining how content is indexed, retrieved, ranked, organized, and presented. Some of the most important components contribute to query understanding.
It’s important to evaluate each component independently. For example, if indexing includes a document classifier that assigns a category to each document, then this classifier should be evaluated for its precision and recall. There should be similar evaluation for other components, such as those performing query classification, query expansion, and spelling correction, as well as the various signals used for retrieval and ranking. Evaluating components independently is essential to iteratively improving them.
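For instance, the hypothetical document classifier could be evaluated against a labeled sample using off-the-shelf metrics; this sketch uses scikit-learn’s precision_score and recall_score and assumes the gold labels and predictions are already in hand.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical gold labels and classifier predictions for a labeled sample.
y_true = ["electronics", "apparel", "electronics", "home", "apparel"]
y_pred = ["electronics", "apparel", "home", "home", "electronics"]

print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
```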
At the same time, it’s important to measure how changes to a component affect the overall search experience. While an improvement in query classification may be impressive when it is evaluated in isolation, it may have minimal overall impact because it is redundant to other parts of the search stack, such as ranking. The total is often less than the sum of its parts.
It’s useful to perform a sensitivity analysis to determine how modifying a component affects the overall experience. It is also helpful to segment this analysis based on different kinds of queries or users. For example, changes that increase recall tend to have more of an impact on queries that return few results, while changes that increase precision tend to have more of an impact on queries that return many results. Other changes may disproportionately impact particular user segments, e.g., experienced users vs. new users.
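One way to operationalize such a sensitivity analysis, sketched below, is to rerun the overall evaluation with the component toggled on and off and break out the score deltas by query segment; the evaluation function, the segmentation, and the stub data are all illustrative placeholders.

```python
from collections import defaultdict
from statistics import mean

def segmented_sensitivity(queries, evaluate, segment_of):
    """Compare overall scores with a component on vs. off, per query segment.
    `evaluate(query, component_on)` returns an overall score for the query;
    `segment_of(query)` returns a segment label (e.g., 'few results')."""
    deltas = defaultdict(list)
    for q in queries:
        delta = evaluate(q, component_on=True) - evaluate(q, component_on=False)
        deltas[segment_of(q)].append(delta)
    return {segment: mean(values) for segment, values in deltas.items()}

# Illustrative usage with stub functions standing in for a real evaluation.
queries = ["red dress", "4k tv", "obscure part number"]
segment_of = lambda q: "few results" if "obscure" in q else "many results"
evaluate = lambda q, component_on: 0.7 + (0.1 if component_on and "obscure" in q else 0.0)
print(segmented_sensitivity(queries, evaluate, segment_of))
```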
It’s still important to evaluate individual components — otherwise, it’s very hard to measure incremental improvements to them. But these improvements should ultimately contribute positively to the overall search experience.
Summary: Make Evaluation Effective
George Box famously said, “All models are wrong, but some are useful.” This aphorism surely applies to evaluating search effectiveness. No evaluation methodology is perfect, but evaluation is nonetheless useful and necessary.
Search evaluation starts with traditional information retrieval measures, but it shouldn’t stop there. It should consider not only queries, but sessions and users. It should encompass both individual components and overall impact.
Most importantly, evaluation should be at the heart of the development process. Lord Kelvin said that you can only improve what you measure. But measurements are only meaningful if you use them to drive improvement.
So go forth, evaluate, and make search more effective!