Let Bigrams be Bigrams

Daniel Tunkelang
May 31, 2016


I’ve spent most of my career advocating for the use of query understanding to improve search relevance. It’s not that I don’t value search ranking, but I prefer to work as far upstream as possible. After all, ranking is pretty tough when you don’t understand what the searcher is looking for.

Query Segmentation

A critical part of query understanding is query segmentation — that is, splitting a multi-word query into a sequence of words and phrases that each represent atomic concepts. It’s important to treat consecutive words as a phrase when the searcher intends them as such.

For example, someone looking for a New York apartment wouldn’t be happy to see a new York apartment in the search results. In an age where people are debating whether to fear or welcome the coming singularity, we at least expect a search engine to interpret “new york” as New York! Conversely, a search for “york apartment” should return results in York, not New York.

Query segmentation also helps determine when word order matters. There’s no significant difference in intent between a search for “black wallet” and a search for “wallet black”. But there’s a big difference between Central Park and the Park Central. Breaking up multi-word concepts destroys their meaning.
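To make the idea concrete, here is a minimal sketch of a segmenter. It assumes a small hand-written phrase list (`KNOWN_BIGRAMS`, hypothetical) and a greedy left-to-right match; a production system would learn its phrases from data and handle ambiguity more carefully.

```python
# Hypothetical phrase list, for illustration only. A real system would
# mine its phrases from query logs or the corpus.
KNOWN_BIGRAMS = {("new", "york"), ("central", "park"), ("park", "central")}


def segment_query(query, known_bigrams=KNOWN_BIGRAMS):
    """Greedily split a query into known bigrams and single words."""
    words = query.lower().split()
    segments = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in known_bigrams:
            # Keep the two words together as one atomic concept.
            segments.append(" ".join(pair))
            i += 2
        else:
            segments.append(words[i])
            i += 1
    return segments
```

With this list, "New York apartment" becomes the phrase "new york" plus the word "apartment", while "york apartment" and "black wallet" stay as independent words. Because the lookup is order-sensitive, "central park" and "park central" are kept as two distinct phrases.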

Indexing Bigrams

Not surprisingly, most search engines make some attempt to account for phrases. The scoring function used for ranking typically favors results that contain all (or at least some) of the query terms as a contiguous phrase.

To do so efficiently, the search engine indexes common phrases as n-grams, that is, sequences of n consecutive words. Since n-grams can significantly increase the size of the index, a common strategy is to index only two-word phrases, also called bigrams. The additional storage for common bigrams is a small price to pay for dramatically reducing the cost of retrieving bigram matches.
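As a rough illustration, the sketch below builds a toy inverted index in which selected bigrams get their own posting lists alongside single words. The function name `build_index`, the sample documents, and the in-memory dictionary are all assumptions made for the example, not a description of any particular engine.

```python
from collections import defaultdict


def build_index(docs, indexed_bigrams):
    """Map each word, and each selected bigram, to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            index[word].add(doc_id)
        # Only adjacent word pairs from the chosen list get their own entry.
        for pair in zip(words, words[1:]):
            if pair in indexed_bigrams:
                index[" ".join(pair)].add(doc_id)
    return index


# Hypothetical documents, for illustration only.
docs = {
    1: "sunny new york apartment",
    2: "new apartment in york",
}
index = build_index(docs, {("new", "york")})
```

Here a lookup of `index["new york"]` returns only document 1 in a single step, whereas intersecting the postings for "new" and "york" returns both documents and would still need a position check to confirm adjacency.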

How do you choose which bigrams to index? A typical approach is to look for bigrams with at least moderate frequency and high pointwise mutual information. If you have a substantial query log to mine, that’s better than having to rely solely on the corpus. For more information, search on Google for collocation detection.
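For a sense of how this selection might look, here is a minimal sketch that scores bigrams from a list of queries using pointwise mutual information, log2(p(x, y) / (p(x) p(y))). The function name `score_bigrams`, the `min_count` threshold, and the simple whitespace tokenization are assumptions made for the example; real collocation detection usually adds smoothing and more careful normalization.

```python
import math
from collections import Counter


def score_bigrams(queries, min_count=2):
    """Return PMI scores for bigrams seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for query in queries:
        words = query.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # the frequency filter: skip rare pairs
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

The frequency threshold matters because PMI on its own tends to reward rare pairs: two words that each occur once, together, get a very high score from a single observation.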

Query Understanding

So search engines index bigrams and use them to improve ranking. But why wait for ranking? A search result that doesn’t respect a bigram shouldn’t just have a low rank; it shouldn’t be in the results at all. Search engines should try hard not to return results so irrelevant that they undermine searchers’ confidence and trust. We don’t expect search engines to be perfect, but that’s no excuse for them to make unforced errors.

Once it’s done the hard work of recognizing a bigram in a query, a search engine should retrieve only results that match the bigram. Given that bigram identification isn’t perfect, the engine should tell the user how it interpreted the query and provide an escape hatch to address the occasional error. If searchers actually use the escape hatch, that’s useful feedback to improve bigram identification.
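A minimal sketch of this behavior follows, assuming the query has already been segmented into words and phrases. The names `search` and `respect_phrases`, the linear scan over documents, and the wording of the message are all hypothetical, chosen only to illustrate the filter and the escape hatch.

```python
def contains_phrase(text, phrase):
    """True if the words of phrase appear consecutively, in order, in text."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def search(docs, segments, respect_phrases=True):
    """Return (matching doc ids, note explaining the interpretation)."""
    results = []
    for doc_id, text in docs.items():
        if respect_phrases:
            # Every segment must match, and phrases must match as phrases.
            matched = all(contains_phrase(text, seg) for seg in segments)
        else:
            # Escape hatch: treat every word independently.
            doc_words = set(text.lower().split())
            matched = all(w in doc_words for seg in segments for w in seg.split())
        if matched:
            results.append(doc_id)

    phrases = [seg for seg in segments if " " in seg]
    note = ""
    if respect_phrases and phrases:
        quoted = ", ".join('"' + p + '"' for p in phrases)
        note = "Treating " + quoted + " as a phrase. Search for the words separately instead?"
    return results, note


# Hypothetical documents, for illustration only.
docs = {
    1: "sunny new york apartment",
    2: "new apartment in york",
}
strict_results, strict_note = search(docs, ["new york", "apartment"])
loose_results, loose_note = search(docs, ["new york", "apartment"], respect_phrases=False)
```

In the strict mode only document 1 is retrieved, and the note tells the searcher how the query was read. Logging how often searchers switch to the loose mode would give the feedback signal described above.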

So, if you’re developing a search engine, let bigrams be bigrams. Your users will thank you.
