Content Annotation

Daniel Tunkelang
Content Understanding
4 min read · Mar 17, 2022

Content understanding requires both holistic and reductionist approaches, just like query understanding. The previous post focused on content classification as a holistic approach. This post explores content annotation as a reductionist approach.

While content classification assigns labels to the content as a whole, content annotation focuses on specific words or phrases within the content. These are also called “spans”, because each represents a span of words or tokens.

Entity Recognition

The most common form of content annotation is entity recognition. There’s no universal definition of an entity, but thankfully there doesn’t have to be. We can think of entities simply as members of a controlled vocabulary. A controlled vocabulary can be associated with a particular entity type (e.g., company names, locations), or it can be a collection of untyped entities (e.g., technical terms).

As with content classification, content annotation in general — and entity recognition in particular — can be based on either rules or machine learning.

Rules-Based Approaches

Rules-based approaches generally involve matching strings (e.g., from a table or dictionary) or regular expressions.

Matching strings from a table, which is the simplest approach for entity recognition, can be surprisingly effective. For example, matching against a table of United States cities of the form “San Francisco, CA” works quite well. It won’t catch less common patterns like “San Francisco, California”, but those wouldn’t be hard to include. Matching city names without states (e.g., “San Francisco”) is a bit trickier, since many city names are ambiguous as strings (e.g., “Phoenix” could refer to the mythical bird), and there are over 40 United States cities named “Springfield”. Still, as with most things in search, you have to make trade-offs among precision, recall, and complexity.
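
To make this concrete, here is a minimal sketch of table-based matching in Python. The table, the pattern-building step, and the annotate_cities helper are all illustrative assumptions; a production system would load a much larger table and be more careful about tokenization and overlapping matches.

import re

# Hypothetical sample table of US cities; a real system would load
# thousands of entries from a data file.
CITY_TABLE = [
    "San Francisco, CA",
    "New York, NY",
    "Springfield, IL",
]

# Compile one alternation, longest strings first, so longer matches win.
CITY_PATTERN = re.compile(
    "|".join(re.escape(city) for city in sorted(CITY_TABLE, key=len, reverse=True))
)

def annotate_cities(text):
    # Return (start, end, matched_string) spans for each city found.
    return [(m.start(), m.end(), m.group()) for m in CITY_PATTERN.finditer(text)]

print(annotate_cities("I flew from San Francisco, CA to New York, NY."))
# [(12, 29, 'San Francisco, CA'), (33, 45, 'New York, NY')]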

But matching strings from a table won’t work for all content annotation tasks. For example, a table-based approach to recognizing measured quantities, such as “128 GB” or “3 ft”, would only work for the more common quantity-measure pairs and is unlikely to include unusual ones like “3.14 ft”. A more general solution has to recognize a pattern that includes the quantity and the unit of measure. It’s possible to implement entity recognition for these cases using regular expressions, but you’ll almost certainly struggle with false positives, such as the “2 in” in the string “2 in 1” being interpreted as 2 inches, or the “6s” in “iPhone 6s” being interpreted as 6 seconds.
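
Here is a rough sketch of what a pattern-based recognizer for measured quantities might look like. The unit list is a small assumption for illustration, and the last example reproduces exactly the false-positive problem described above.

import re

# A sketch of a quantity-plus-unit recognizer. The unit list here is
# an assumption; a real recognizer would cover many more units.
QUANTITY_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(GB|MB|ft|in|lbs?)\b")

for text in ["128 GB of storage", "a 3.14 ft pole", "2 in 1 laptop"]:
    print(text, "->", QUANTITY_PATTERN.findall(text))
# The last line prints [('2', 'in')]: exactly the kind of false
# positive described above.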

Regular expressions are useful for content annotation tasks beyond entity recognition. But it’s easy to get lost in their complexity.

For example, if you are trying to detect US phone numbers within your content, you might try a simple expression like this:

[0-9]{3}-[0-9]{3}-[0-9]{4}

That would catch a number like “800-555-1212”, but not “(800) 555-1212” or other variations, let alone phone numbers with country codes.
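
You can check that behavior directly in Python (the phone numbers here are just illustrative):

import re

SIMPLE_PHONE = re.compile(r"\b[0-9]{3}-[0-9]{3}-[0-9]{4}\b")

print(SIMPLE_PHONE.findall("Call 800-555-1212 now"))    # ['800-555-1212']
print(SIMPLE_PHONE.findall("Call (800) 555-1212 now"))  # []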

Tuning a regular expression to catch every conceivable variant of a US phone number is a thankless task, but you might end up with something like this:

(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?

Again, you have to make trade-offs between precision, recall, and complexity.

Nonetheless, compared to using rules for classification, using rules for content annotation has a big advantage: annotation focuses on tokens or spans of tokens, rather than on the document as a whole. That generally makes it easier to craft reasonably accurate rules based on strings or regular expressions. Moreover, an error in an annotation probably won’t do as much damage as an error in the classification of a whole document.

Machine Learning Approaches

As with classification, you can use machine learning for content annotation. But machine-learned annotation is a bit trickier in practice than machine-learned classification — since it has to make decisions about every token.

Let’s consider a relatively simple case of machine-learned annotation: part-of-speech (POS) tagging. If it’s been a while since you studied grammar in school, here’s a quick refresher: every word in a sentence plays a grammatical role known as a part of speech, and the most common ones are nouns, verbs, and adjectives. Given a text document, a part-of-speech tagger annotates each token with a part of speech. For example, in the sentence “Cats eat raw fish.”, the tokens would be tagged (cats: noun, eat: verb, raw: adjective, fish: noun).

Libraries like spaCy and NLTK provide free, open-source part-of-speech taggers for a wide variety of languages. They work well on clean, correctly capitalized, correctly punctuated, grammatical text that can be split up into sentences (which is generally the first step in the process). If your text isn’t so clean, a part-of-speech tagger is likely to struggle with it.
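
For example, here is a minimal spaCy sketch. It assumes you have installed spaCy and downloaded its small English model (python -m spacy download en_core_web_sm):

import spacy

# Assumes the small English model has been downloaded first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for token in nlp("Cats eat raw fish."):
    print(token.text, token.pos_)
# Typically prints: Cats NOUN / eat VERB / raw ADJ / fish NOUN / . PUNCT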

Other machine-learned content annotation models may be more or less robust. But, in general, applying machine learning to content annotation is harder than applying it to content classification, because there are far more decisions (one per token). Content annotation also tends to be harder than query annotation, again because of the amount of text. Annotation tends to work best for short fields, such as document or product names.

That said, there is a long history of using machine learning for entity recognition. Traditionally, people used hidden Markov models (HMM) and conditional random fields (CRF); but a more modern approach would use deep learning — specifically, a Bidirectional LSTM-CRF model. And the general excitement around sequence-to-sequence (seq2seq) models bodes well for the improvement of these methods.
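
If you would rather apply a pretrained entity recognizer than train your own, libraries like spaCy make that a one-liner. Note that this is just one convenient option, not the BiLSTM-CRF approach itself, and the exact entities returned depend on the model:

import spacy

# The same pretrained pipeline includes a statistical NER component.
nlp = spacy.load("en_core_web_sm")

for ent in nlp("Apple is opening a new office in San Francisco.").ents:
    print(ent.text, ent.label_)
# e.g., Apple ORG / San Francisco GPE (exact output depends on the model)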

Of course, if you’re going to use a machine learning approach, you’ll need to invest in generating labeled training data, unless you can get away with using a pre-trained model that sufficiently aligns with your content and target annotations. Annotation tools like Prodigy can help streamline the labeling work. But, unless you can find some way to obtain implicit judgments from user behavior, you’ll still have to pay for labor.

Summary

While classification is the most fundamental form of holistic content understanding, annotation complements it with a reductionist approach. As with classification, you can use simple rules-based approaches or more robust machine-learning approaches; but simple approaches are often good enough for many annotation problems. Still, this is a hot area for machine learning, so be sure to keep track of innovations in sequence learning methods.

Previous: Content Classification
Next: Content Similarity
