A/B Testing for Search is Different
A/B testing is a critical tool for improving search products. While it has limitations and shouldn’t be the only analytical method you use, it’s the single most valuable and versatile tool for determining whether a change to your search product has a positive impact.
Deciding to A/B test search is just the first step.
There’s no single way to perform A/B testing for search. You have to decide whether the test will compare click-through rate (CTR), mean reciprocal rank (MRR) of clicks, conversion rate, revenue, or some other search success metric. You also have to determine how long to run each test, considering not only the need to establish statistical significance but also the possibility of a novelty effect. A/B testing search isn’t just a switch that you flip on — it’s a science.
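To make one of those metrics concrete, here is a minimal sketch of computing MRR of clicks from a search log. The log format (one record per query, holding the 1-based rank of the first clicked result, or None if nothing was clicked) is a hypothetical simplification for illustration:

```python
# Minimal sketch: mean reciprocal rank (MRR) of clicks.
# Assumes a hypothetical log format: one entry per query, with the
# 1-based rank of the first clicked result, or None if no click.

def mean_reciprocal_rank(first_click_ranks):
    """MRR of clicks; queries with no click contribute 0."""
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_click_ranks if rank is not None)
    return total / len(first_click_ranks)

# Example: clicks at ranks 1 and 3, plus one query with no click:
# (1 + 1/3 + 0) / 3 ≈ 0.44
print(mean_reciprocal_rank([1, 3, None]))
```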
How you do A/B testing will affect which changes you decide to test.
If you use A/B testing to determine which search changes to launch, then your A/B testing approach will have significant implications for what kinds of changes you decide to develop, test, and ultimately launch.
In particular, the target you set for an A/B test determines how long the test has to run before you can evaluate its success with statistical significance. More aggressive targets require less time to test: for example, it’s much easier (days rather than months) to test whether a change doubles conversion than whether it raises conversion by 1%.
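A back-of-the-envelope calculation shows just how dramatic this difference is. Here is a sketch using the standard normal-approximation formula for comparing two proportions; the 5% baseline conversion rate, significance level, and power are illustrative assumptions:

```python
# Sketch: approximate sample size per arm for a two-proportion A/B test,
# using the standard normal approximation. The 5% baseline conversion
# rate is an illustrative assumption.

Z_ALPHA = 1.96   # two-sided significance level, alpha = 0.05
Z_BETA = 0.8416  # statistical power = 0.80

def sample_size_per_arm(p_control, p_treatment):
    """Approximate queries needed per arm to detect p_control -> p_treatment."""
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return (Z_ALPHA + Z_BETA) ** 2 * variance / effect ** 2

# Doubling conversion (5% -> 10%): a few hundred queries per arm.
print(round(sample_size_per_arm(0.05, 0.10)))    # ~430
# A 1% relative lift (5% -> 5.05%): ~3 million queries per arm.
print(round(sample_size_per_arm(0.05, 0.0505)))  # ~3,000,000
```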
However, it’s difficult to develop changes that meet aggressive targets: after all, if it were easy to double your conversion rate, you would have done so already! As a result, the highest ROI often comes from testing lots of small, incremental improvements that each target a small fraction of search queries. You have to kiss a lot of frogs to find one prince, so the best strategy is to kiss as many frogs as you can, as quickly as possible.
But we don’t want to run a test for months just to determine whether it’s successful. How do we use A/B testing to keep ourselves honest while still achieving the benefits of rapid, targeted, incremental improvement?
Solution: scope search A/B tests by the search queries they affect.
A/B tests for search randomly assign users to treatment groups and compare the performance of the groups. But A/B tests for search have an important nuance: not all search queries are affected by the test. As we just discussed, some of the highest-ROI work on improving search succeeds by targeting only a small fraction of search queries.
To make this nuance concrete, let’s consider an A/B test that targets 10% of search queries with the goal of achieving a 5% conversion lift for those queries. That would translate into an overall 0.5% conversion lift for the site. A 0.5% conversion lift may not sound like a lot, but for a major retailer that translates into millions of dollars a year.
As we discussed earlier, the size of the improvement target determines how long the test has to run before you can evaluate its success with statistical significance. In our example, establishing whether a test achieves a 5% conversion lift on 10% of queries takes far less time than establishing whether the test achieves a 0.5% conversion lift on 100% of queries — days as opposed to months. You can explore these numbers yourself using a nifty online A/B testing calculator.
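To put rough numbers on this example, we can reuse the sample_size_per_arm sketch from above, again assuming for illustration a 5% baseline conversion rate on the affected queries:

```python
# Reuses sample_size_per_arm from the earlier sketch.
Z_ALPHA, Z_BETA = 1.96, 0.8416  # alpha = 0.05 (two-sided), power = 0.80

def sample_size_per_arm(p_control, p_treatment):
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (Z_ALPHA + Z_BETA) ** 2 * variance / (p_treatment - p_control) ** 2

# Scoped test: detect a 5% relative lift (5% -> 5.25%) on the 10% of
# queries the test targets.
scoped = sample_size_per_arm(0.05, 0.0525)     # ~120,000 in-scope queries per arm

# Unscoped test: the same change diluted to a 0.5% overall lift
# (5% -> 5.025%) across 100% of queries.
unscoped = sample_size_per_arm(0.05, 0.05025)  # ~12,000,000 queries per arm

print(round(scoped), round(unscoped))
```

Even after accounting for the fact that only 10% of traffic counts toward the scoped test (so the ~120,000 in-scope queries per arm require about 1.2 million total queries per arm), the scoped test still needs roughly a tenth of the traffic, and thus roughly a tenth of the calendar time, of the site-wide test.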
Be careful: determining the scope of an A/B test can be subtle.
Given the above, it’s clear that we want to scope the analysis of an A/B test as narrowly as possible. But we have to be careful not to be so narrow as to invalidate the test.
In particular, you should scope a search A/B test in terms of search sessions rather than individual search queries. Otherwise, you’ll often find that improvements on the queries affected by your test come at the expense of performance on other queries in those same search sessions. You don’t want to rob Peter to pay Paul.
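Here is a minimal sketch of what session-scoped analysis might look like. The event format (session id, treatment arm, whether the test affected the query, and whether a conversion followed) is a hypothetical simplification, and it assumes each session is assigned to a single arm:

```python
# Sketch: scoping A/B analysis to search sessions rather than queries.
# A session is in scope if the test affected ANY query in it, so wins on
# targeted queries can't hide losses on other queries in the same session.
from collections import defaultdict

def session_conversion_by_arm(events):
    """events: iterable of (session_id, arm, query_affected, converted)."""
    sessions = {}  # session_id -> [arm, in_scope, converted]
    for session_id, arm, affected, converted in events:
        record = sessions.setdefault(session_id, [arm, False, False])
        record[1] = record[1] or affected
        record[2] = record[2] or converted

    counts = defaultdict(lambda: [0, 0])  # arm -> [in-scope sessions, conversions]
    for arm, in_scope, converted in sessions.values():
        if in_scope:  # restrict the analysis to affected sessions
            counts[arm][0] += 1
            counts[arm][1] += int(converted)
    return {arm: conv / n for arm, (n, conv) in counts.items()}

events = [
    ("s1", "treatment", True, False), ("s1", "treatment", False, True),
    ("s2", "control", True, True),
    ("s3", "treatment", False, False),  # untouched session: excluded
]
print(session_conversion_by_arm(events))  # {'treatment': 1.0, 'control': 1.0}
```

Note that session s1 counts as a treatment conversion even though the conversion came from a query the test didn’t touch: the session was exposed to the test, so its entire outcome belongs to the analysis.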
In general, you need to consider the ways that your test may affect queries outside the test’s intended scope. A/B tests can affect behavior within a search session and can even have long-term effects on searcher behavior. But don’t let theoretical objections paralyze you. Keep the scope of your A/B tests as narrow as possible, so you can deliver rapid, targeted improvements.
Summary
A/B testing is the single most valuable tool for using data to improve search. But how you do A/B testing affects the kinds of changes you decide to test and ultimately launch. More aggressive targets require less time to test, but it’s difficult to develop changes that meet aggressive targets. The solution is to test lots of incremental improvements that target particular kinds of search queries, and then to scope A/B tests as narrowly as possible. But be careful about scope, since it’s possible for a test to affect queries outside its intended scope — especially other queries within the same search sessions.
Keep calm, and carry on A/B testing search!