Where Do Categories Come From?
In my previous post, I argued that categories are fundamental for search applications. I characterized a robust set of categories as offering coverage, coherence, and distinctiveness.
However, that post did not address the question of how to obtain categories in the first place. I do not mean how to perform content classification, but how to establish the categories themselves.
There are two strategies to obtain categories: top-down and bottom-up. Then there is the iterative process of debugging the categories.
Top-Down
A top-down strategy aims to identify a primary dimension of documents or products that most determines substitutability and then populates values for that dimension. In other words, it starts with the organizing principle of the category as a dimension and then works out the categories as values.
For example, most e-commerce sites use product type for categorization, rather than brand, color, or other dimensions that are far less predictive of substitutability. In general, a substitute for a product will be of the same product type. In contrast, most libraries use topic for categorization, since a substitute for a book will most likely focus on the same topic.
Once the category dimension is established, the remaining work is to populate a list of values for it. These can come from dictionaries, subject-matter experts, or other search applications.
In short, a top-down strategy starts with identifying a primary dimension and then proceeds by populating its values.
Bottom-Up
In contrast, a bottom-up strategy starts by identifying the entities that best summarize documents and then organizes those entities into categories, with the goal of obtaining a collection of categories that together form a primary dimension. This process, essentially the reverse of the top-down strategy, is similar to what user experience designers call card sorting.
Identifying the entities requires some form of entity extraction, whether from content or query logs. While it is possible to extract entities from content, queries tend to be a more robust source, since they represent searcher demand. The trickier part is selecting a subset of entities that form a consistent dimension. While it may be possible to do so automatically, the process is likely to involve an element of human input.
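As a rough illustration of mining query logs, here is a minimal Python sketch. It makes a strong simplifying assumption: each whole query, once normalized, is treated as a candidate entity and ranked by frequency. Real entity extraction would be more involved. The function name candidate_entities, the min_count threshold, and the toy query log are all hypothetical.

```python
from collections import Counter


def candidate_entities(queries, min_count=2):
    """Return (normalized query, count) pairs seen at least min_count times.

    Assumption: a whole query stands in for an entity. Normalization here is
    only lowercasing and collapsing whitespace.
    """
    counts = Counter(" ".join(q.lower().split()) for q in queries if q.strip())
    # most_common() orders by descending count
    return [(q, n) for q, n in counts.most_common() if n >= min_count]


# Hypothetical query log for illustration only
query_log = [
    "Running Shoes",
    "running  shoes",
    "sun hat",
    "Sun Hat",
    "running shoes",
    "gift ideas",
]

for entity, count in candidate_entities(query_log):
    print(entity, count)
```

A frequency-ranked list like this is only a starting point. Deciding which candidates belong to the same dimension (here, product types such as shoes and hats, as opposed to intents like gift ideas) is the step that likely still needs human judgment.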
Debugging Categories
Regardless of the process to obtain an initial set of categories, debugging categories is an iterative process guided by the objectives of coverage, coherence, and distinctiveness.
- Coverage. If a meaningful fraction of documents are not assigned a category, then there is probably a need to either add a category or extend an existing one.
- Coherence. Documents assigned to a category should be similar to one another. One way to evaluate coherence is to use the bag-of-documents model to obtain the category’s specificity: compute the mean of the vectors for the documents assigned to the category, and then compute the mean of the cosines between each document vector and that mean. This is essentially a cosine-based analog of variance, except that higher values indicate a tighter category. If a category has low coherence, it should probably be split.
- Distinctiveness. Documents associated with different categories should not be substitutable for one another. The bag-of-documents model makes it possible to compute category similarity as the cosine between two category vectors, where each category vector is the mean of the vectors for the documents assigned to it. If two categories are very similar to each other, they should probably be merged.
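The three checks above can be sketched in a few lines of Python. This is a minimal illustration that assumes each document already has an embedding vector (represented here as a plain list of floats) and that unassigned documents are marked with None. The function names and toy vectors are hypothetical, and any thresholds for splitting or merging are left to the reader.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coverage(assignments):
    """Fraction of documents that have a category (None means unassigned)."""
    if not assignments:
        return 0.0
    return sum(1 for c in assignments if c is not None) / len(assignments)


def coherence(vectors):
    """Mean cosine between each document vector and the category mean."""
    centroid = mean_vector(vectors)
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)


def category_similarity(vectors_a, vectors_b):
    """Cosine between the mean vectors of two categories."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))


# Toy two-dimensional document vectors, for illustration only
shoes = [[1.0, 0.0], [0.8, 0.6]]
hats = [[0.0, 1.0], [0.6, 0.8]]
assignments = ["shoes", "shoes", "hats", "hats", None]

print("coverage:", coverage(assignments))
print("coherence (shoes):", coherence(shoes))
print("similarity (shoes, hats):", category_similarity(shoes, hats))
```

Low coverage suggests adding or extending a category, low coherence suggests a split, and high similarity between two categories suggests a merge. What counts as low or high depends on the embedding model and the collection, so it is best calibrated against categories already known to be good.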
It is also helpful for the definition of each category to be clear to searchers and represented in a short, simple name. Unfortunately, it is difficult to quantify this objective, let alone optimize for it algorithmically. However, categories derived from query logs are likely to be clear to searchers.
Summary
In summary, categories originate from either a top-down or bottom-up strategy. A top-down approach begins with an organizing principle to define categories, while a bottom-up approach identifies entities and organizes them into a coherent dimension. Regardless of the strategy, creating robust categories requires iterative debugging to ensure coverage, coherence, and distinctiveness. The goal is effective categorization that serves as a foundation for a successful search application.