Categories are Fundamental for Search

Nov 18, 2024

As a search consultant, I have learned to be flexible about structured data. However, I do insist on content being classified into categories.

What are categories? Here are some dictionary definitions:

a class or division of people or things regarded as having particular shared characteristics
one of a possibly exhaustive set of classes among which all things might be distributed
a grouping or classification used to organize items, concepts, or entities that share common characteristics or properties

Substitutability

These are all reasonable definitions. However, I prefer to frame categories in terms of similarity or substitutability. In this framing, the goal is to bring together documents (or products) that are substitutable for one another.

For example, in an e-commerce setting, content similarity can be defined as the probability that a buyer wanting to buy product X would buy product Y if X were unavailable, or as the fraction of utility that a buyer wanting to buy X would obtain from buying Y instead.
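To make the first definition concrete, here is a minimal sketch that estimates substitutability from observed behavior. It assumes a hypothetical log of (wanted, bought) pairs recorded when the wanted product was unavailable, with bought set to None when the buyer left without purchasing. The log format and the function name are illustrative assumptions, not part of any particular system.

```python
from collections import Counter


def substitution_probability(events, x, y):
    """Estimate P(buyer buys y | buyer wanted x and x was unavailable).

    events: iterable of (wanted, bought) pairs from a hypothetical log,
    where bought is None if the buyer purchased nothing.
    """
    # Collect what buyers ended up purchasing when they wanted x.
    outcomes = [bought for wanted, bought in events if wanted == x]
    if not outcomes:
        return 0.0  # no evidence for x; treat as not substitutable
    return Counter(outcomes)[y] / len(outcomes)


if __name__ == "__main__":
    log = [("A", "B"), ("A", "B"), ("A", None), ("A", "C"), ("B", "A")]
    print(substitution_probability(log, "A", "B"))
```

Note that a measure defined this way is not necessarily symmetric: buyers may accept Y in place of X more readily than X in place of Y.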

Objectives

This notion of similarity or substitutability applies at the level of a pair of documents. Categories build on document similarity to produce clusters. The quality of the clustering depends on three objectives:

Coverage. Every document should be associated with a category.
Coherence. All documents associated with a given category should be substitutable for one another.
Distinctiveness. Documents associated with different categories should not be substitutable for one another.

Coverage is straightforward. Coherence and distinctiveness build on the document similarity metric. It also helps if the definition of each category is clear to searchers, though this objective is difficult to quantify.
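One way to put numbers on the three objectives is sketched below. It assumes a pairwise similarity function that returns values between 0 and 1, and it uses simple averages: coherence as the mean similarity of same-category pairs, and distinctiveness as one minus the mean similarity of cross-category pairs. These particular formulas are my illustrative choices. Other aggregations (for example, worst-case pairs) are equally defensible.

```python
from itertools import combinations


def clustering_quality(all_docs, doc_category, similarity):
    """Return (coverage, coherence, distinctiveness) for a categorization.

    all_docs: list of document ids
    doc_category: dict mapping document id -> category (may be incomplete)
    similarity: callable (a, b) -> float in [0, 1] (assumed, hypothetical)
    """
    assigned = [d for d in all_docs if d in doc_category]
    coverage = len(assigned) / len(all_docs) if all_docs else 0.0

    within, across = [], []
    for a, b in combinations(assigned, 2):
        same = doc_category[a] == doc_category[b]
        (within if same else across).append(similarity(a, b))

    # Mean similarity inside categories; high is good.
    coherence = sum(within) / len(within) if within else 0.0
    # One minus mean similarity across categories; high is good.
    distinctiveness = (1.0 - sum(across) / len(across)) if across else 1.0
    return coverage, coherence, distinctiveness
```

The pairwise loop is quadratic in the number of documents, so for a real catalog you would sample pairs rather than enumerate them.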

Content and Query Classification

If documents are not already associated with categories, then it is important to obtain or establish a set of categories and perform content classification, whether by applying rule-based heuristics or by training a machine-learning model.

Once the content is classified, it is easy to use query logs to train a query classifier that maps queries to categories.
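As a rough illustration of the idea, the sketch below derives a query-to-category mapping from a hypothetical click log of (query, clicked document) pairs, labeling each query with the category its clicked documents most often belong to. This is a lookup table rather than a trained model. In practice, these labels would more likely serve as training data for a classifier that generalizes to unseen queries. The log format and names are assumptions.

```python
from collections import Counter, defaultdict


def label_queries_from_clicks(click_log, doc_category):
    """Map each logged query to its most frequently clicked category.

    click_log: iterable of (query, doc_id) pairs (hypothetical format)
    doc_category: dict mapping doc_id -> category
    """
    counts = defaultdict(Counter)
    for query, doc_id in click_log:
        category = doc_category.get(doc_id)
        if category is not None:  # skip clicks on unclassified documents
            counts[query.strip().lower()][category] += 1
    # Keep the majority category for each normalized query.
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}
```

For example, if most clicks for "iphone case" land on documents in a "Cases" category, that query is labeled "Cases" even if a few clicks go to phones.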

Benefits

Why are categories so important for a search application? In short, they are key considerations for the 3 Rs of search: relevance, recall, and ranking.

Relevance. Relevance is the prime directive of search: the guiding principle for a search engine is to return results that satisfy the searcher’s information need. Given a robust categorization, matching the result category against the query category can be seen as establishing the most significant bit of the score used to compute relevance.
Recall. Recall measures the fraction of relevant results that are retrieved. If relevance is about search returning “nothing but the truth”, then recall is about returning “the whole truth”. Given a robust categorization, a natural strategy to improve recall is to retrieve candidate results from the same category as the query.
Ranking. Relevance is necessary but not sufficient to optimize the search experience. Ranking orders relevant results based on their desirability, using factors like popularity, recency, or price. Ranking can also be personalized, reflecting user-specific preferences. Using category matching helps ensure that ranking separates relevance from desirability and other query-independent factors.
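The "most significant bit" idea can be sketched as a two-part sort key: category match first, desirability second. The field names and the single desirability score below are hypothetical simplifications. A real system would combine many more signals.

```python
def rank_results(results, query_category):
    """Order results so that category matches always come first.

    results: list of dicts with hypothetical keys "category" and
    "desirability" (e.g. a popularity or recency score).
    """
    return sorted(
        results,
        # Tuple comparison makes the category match dominate: no amount
        # of desirability lifts a mismatched result above a match.
        key=lambda r: (r["category"] == query_category, r["desirability"]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = [
        {"id": 1, "category": "Phones", "desirability": 0.9},
        {"id": 2, "category": "Cases", "desirability": 0.2},
        {"id": 3, "category": "Cases", "desirability": 0.7},
    ]
    for r in rank_results(candidates, "Cases"):
        print(r["id"], r["category"], r["desirability"])
```

The same category signal can serve recall: restricting or expanding candidate retrieval to documents in the query's category is a filter applied before this ranking step.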

Summary

In short, categories are fundamental for search applications. Prioritizing investments in content understanding, query understanding, retrieval and ranking can be challenging. However, robust categorization is a critical foundation upon which everything else depends.


Written by Daniel Tunkelang
