This paper, which I co-authored with Joyce Wang and Vladimir Zelevinsky when we were all at Endeca, originally appeared in the proceedings of the 2008 Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2008). It’s a bit dated, but I’ve found myself referring to it frequently enough that I felt it deserved a bit more visibility.
Folksonomies improve search and navigation of documents by allowing users to collaboratively tag documents. Unfortunately, the number of tags can be overwhelming to users who are seeking information, even when the tags are restricted to those that occur in the search results. In this paper, we describe a novel approach for highlighting tags of interest for users, based on the premise that tags can be useful because they either summarize or refine the current set of results. We also present a treemap interface that visually communicates both kinds of tags to users. Finally, we present the results of a user study designed to test the validity of our approach.
Folksonomies [1] are an increasingly popular way to enrich content and thus provide people with more effective ways to find information. In a folksonomy, a broad collection of people collaboratively tag documents. Folksonomies are also known as user-generated taxonomies.
One of the challenges in using tags to navigate a folksonomy is that the large number of tags quickly becomes overwhelming. In order to narrow the space of tags, we would like to highlight specific tags in order to help users both understand the data and find the tags that slice the data in interesting ways.
Measuring the Utility of Tags
We measure the utility of tags along two dimensions: how well a tag summarizes the information in a set of documents, and how well a tag refines that set into a useful subset. We consider two factors in assessing a tag along either dimension: its frequency with respect to the given set, and the distinctiveness of the subset of documents assigned that tag.
In a perfectly tagged collection, a tag would represent a perfect summary of a given set of documents if it were assigned to all of the documents in that set. Although folksonomies are not perfectly tagged, we hypothesize that a tag’s effectiveness at summarizing a given set of documents is positively correlated to its frequency within the set.
It is harder to relate frequency to the utility of a tag as a refinement. What is clear is that the frequency should be neither too low, in which case the tag represents an insufficient fraction of the results, nor too high, in which case selecting the tag does not significantly narrow the given set.
Given a collection of tagged documents, we compute the distinctiveness of a given set of documents relative to a baseline set by comparing the distribution of tags in the given set to that of the baseline. Specifically, we take a normalized Kullback-Leibler divergence (also known as relative entropy or information gain). This normalization, which we accomplish by taking random subsets of the given set, is necessary to avoid confounding distinctiveness with set size, since smaller sets tend to have higher Kullback-Leibler divergence. This distinctiveness measure is inspired by Cronen-Townsend and Croft’s “query clarity” measure [2].
As a shorthand, we refer to the distinctiveness of a tag in a given set of documents as the distinctiveness of the subset of the given set that is assigned that tag, relative to the given set.
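The paper does not spell out the normalization formula, so the sketch below is one plausible reading: divide the observed Kullback-Leibler divergence by the mean divergence of equal-sized random subsets of the baseline. The function names, the smoothing constant, and the ratio-based normalization are all assumptions.

```python
import math
import random

def tag_distribution(docs):
    """Distribution of tags over a set of documents (each doc is a set of tags)."""
    counts = {}
    for doc in docs:
        for tag in doc:
            counts[tag] = counts.get(tag, 0) + 1
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q); tags missing from q are smoothed to a small epsilon."""
    return sum(pv * math.log(pv / q.get(tag, epsilon))
               for tag, pv in p.items() if pv > 0)

def distinctiveness(subset, baseline, trials=50, rng=None):
    """Normalize KL(subset || baseline) by the mean divergence of equal-sized
    random subsets of the baseline, correcting for the size bias of KL."""
    rng = rng or random.Random(0)
    q = tag_distribution(baseline)
    observed = kl_divergence(tag_distribution(subset), q)
    expected = sum(
        kl_divergence(tag_distribution(rng.sample(baseline, len(subset))), q)
        for _ in range(trials)
    ) / trials
    return observed / expected if expected > 0 else 0.0
```

In the shorthand above, the distinctiveness of a tag would then be `distinctiveness(docs_with_tag, result_set)`.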
We now hypothesize that a tag with low distinctiveness will be useful for summarizing a given set. In particular, we conjecture that good summarization tags will have lower distinctiveness than good refinement tags.
In order to simultaneously communicate the frequency and distinctiveness of tags, we implemented a tree map visualization. The tree map, a space-filling visualization technique developed by Ben Shneiderman, allows the visualization of two simultaneous attributes of a set of objects through the visual dimensions of cell size and color [3].
In our tree maps, the size of a cell corresponds to the frequency of the tag associated with that cell, while color corresponds to the position of the tag on the distinctiveness spectrum (darker being more distinctive and lighter being less distinctive).
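As a rough sketch of this visual encoding (the attribute names and the linear grayscale mapping are our own illustration, not the paper's implementation):

```python
def cell_attributes(tags, total_area=1.0):
    """Map each tag to a treemap cell: area proportional to the tag's
    frequency, grayscale level from its distinctiveness
    (0 = black = most distinctive, 255 = white = least distinctive)."""
    freq_sum = sum(t['frequency'] for t in tags)
    d_values = [t['distinctiveness'] for t in tags]
    d_min, d_max = min(d_values), max(d_values)
    span = (d_max - d_min) or 1.0  # avoid division by zero when all equal
    cells = []
    for t in tags:
        cells.append({
            'tag': t['tag'],
            'area': total_area * t['frequency'] / freq_sum,
            'gray': int(255 * (d_max - t['distinctiveness']) / span),
        })
    return cells
```

A treemap layout algorithm (e.g. Shneiderman's slice-and-dice) would then place rectangles of these areas; only the encoding of the two attributes is sketched here.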
Restating our earlier hypotheses in terms of the tree map, we expect that good summarization tags will correspond to large light-colored cells, while good refinement tags will correspond to medium-sized darker-colored cells.
We conducted a user study to empirically validate our hypotheses about frequency and distinctiveness determining the utility of tags for summarization and refinement. Specifically, the test was designed to explore whether subjective user judgments confirm those hypotheses. The user study also tested the effect of presenting users with the tree map visualization described above.
For our study, we used a subset of the ACM Digital Library that includes only author-tagged documents. This collection comprises over a quarter million articles from ACM journals, conference proceedings, and newsletters [4].
In order to tag the corpus, we distilled a controlled vocabulary from the author tags assigned to the documents, keeping those with sufficient corpus frequency (assigned to at least 10 documents) and positive Residual IDF (RIDF) scores, in accordance with a technique inspired by Church and Gale [5]. We then assigned tags to documents that contained the text of those tags (allowing for stemming) with sufficiently high TF-IDF scores. We note that this test set simulates a folksonomy by bootstrapping on a collective vocabulary, a technique we have applied in related work [6].
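A sketch of the vocabulary distillation step, assuming the standard Residual IDF formulation (observed IDF minus the IDF expected under a Poisson model); the function names and the shape of `term_stats` are illustrative:

```python
import math

def residual_idf(df, cf, n_docs):
    """Residual IDF: observed IDF minus the IDF expected if the term's cf
    total occurrences were scattered over documents by a Poisson process.
    Positive values mean the term is "burstier" than chance, i.e. topical."""
    observed_idf = -math.log2(df / n_docs)
    poisson_df_rate = 1 - math.exp(-cf / n_docs)  # P(a doc gets >= 1 occurrence)
    expected_idf = -math.log2(poisson_df_rate)
    return observed_idf - expected_idf

def distill_vocabulary(term_stats, n_docs, min_df=10):
    """Keep candidate tags with sufficient document frequency and positive
    RIDF, per the selection criteria described above. term_stats maps each
    term to a (document frequency, collection frequency) pair."""
    return [term for term, (df, cf) in term_stats.items()
            if df >= min_df and residual_idf(df, cf, n_docs) > 0]
```

A term whose 100 occurrences cluster in 20 documents scores positive RIDF, while one whose 100 occurrences spread over 96 documents (close to the Poisson expectation) does not.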
For each of 20 sets of ACM articles corresponding to search queries, we presented the user with two tasks: selecting the tags that best described the entire set, and selecting the tags that best described some of the articles (i.e., served as good refinements).
In the first task, we asked users to identify these two kinds of tags based on article titles and their author-selected keywords. In the second task, we asked users the same question, but instead showed them the search term that generated the set of articles and the tree map visualization described above.
To avoid ordering biases, we shuffled the displayed documents, and presented the list of possible tags in alphabetical order. Since we could not display all of the available tags without overwhelming users, we showed those tags that occurred in at least 3.5% of the documents in the set. In the first task, we further limited the number of tags to 20 if needed (the 20 most frequent) in order to avoid presenting the user with too much information. There was no such limitation on the number of tags in the second task, where we presented the user with the tree map visualization.
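The tag-display policy described above can be sketched as follows (the function name and data shapes are illustrative):

```python
def displayed_tags(tag_counts, n_docs, min_fraction=0.035, cap=20):
    """Choose the tags to display for a result set of n_docs documents:
    keep tags occurring in at least min_fraction of the documents, trim to
    the `cap` most frequent if needed, then sort alphabetically for display."""
    frequent = [(tag, count) for tag, count in tag_counts.items()
                if count / n_docs >= min_fraction]
    if cap is not None and len(frequent) > cap:
        frequent = sorted(frequent, key=lambda tc: -tc[1])[:cap]
    return sorted(tag for tag, _ in frequent)
```

The second task, with the tree map visualization, imposed no cap, i.e. `displayed_tags(..., cap=None)`.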
We also gave the user the option of displaying more documents from the given set (effectively paging through the shuffled ordering), as well as the option of viewing the abstract of a specific document, rather than just its title (Figure 2).
We note that there were no “right answers” for the test queries, since users were making their own judgments regarding how well tags summarized or refined the sets of documents. Rather, we were using their subjective judgments as ground truth.
We now formalize the hypotheses our user study aimed to validate regarding relationships between tag frequency, tag distinctiveness, utility for summarization, and utility for refinement:
- Good summarization tags have high frequency.
- Good summarization tags have low distinctiveness.
- Good summarization tags have lower distinctiveness than good refinement tags.
- Users’ accuracy and efficiency in identifying the tags with the highest utility for summarization and refinement will increase when they are presented with a tree map visualization of frequency and distinctiveness.
We had 36 participants in the user study, all with at least a bachelor’s degree in computer science or a comparable background; 24 of them completed the roughly one-hour study.
For each set of articles, each user response consists of an unordered set of tags that the user found most suitable to 1) describe the entire set (“summarize”), and 2) describe some of the articles in the set (“refine”). Aggregating these responses gave us the number of times a particular tag was chosen for the set. Each of these tags has a frequency and a distinctiveness score associated with it.
To analyze the results of our user study, we took the averages of the frequency and distinctiveness scores in the user responses for the first task. We used as our baseline the average frequency and distinctiveness scores for all tags displayed to the user in a given set. Table 1 shows example scores for three of the 20 test queries.
One-tailed t-tests show statistically significant results at the 0.05 level for the following hypotheses:
- Frequency of user-selected summarization tags > baseline frequency.
- Distinctiveness of user-selected summarization tags < baseline distinctiveness.
- Distinctiveness of user-selected summarization tags < refinement distinctiveness.
These tests support our first three hypotheses; that is, good summarization tags have high frequency and low distinctiveness, and in particular lower distinctiveness than good refinement tags.
Unfortunately, we were not able to establish useful criteria to distinguish between good refinement tags and the baseline based on frequency and distinctiveness, other than their not being good summarization tags. We did find that refinement frequency was higher than baseline frequency (statistically significant at the 0.05 level), but all we can infer from this result is the obvious fact that good refinement tags should not be too infrequent.
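The significance tests can be sketched as follows; the paper does not say which t-test variant was used, so Welch's unequal-variance statistic is an assumption here, and computing the p-value from the t distribution is left to a statistics library:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for a one-tailed test that
    mean(sample_a) > mean(sample_b); compare the statistic against the t
    distribution's critical value at the 0.05 level (e.g. with scipy.stats)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

For example, comparing per-query average frequencies of user-selected summarization tags against the baseline averages would supply `sample_a` and `sample_b`.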
Finally, we were not able to draw quantitative conclusions from our second task to validate our fourth hypothesis. As we realized from post-study discussions with our participants, it was impossible to present the visualization without those participants trying to reverse engineer what it meant.
Our user study validated our basic hypotheses regarding relationships between tag frequency, tag distinctiveness, utility for summarization, and utility for refinement. We hope to follow up this experiment with a larger-scale study that uses ground truth data (e.g., from trained assessors) to establish summarization and refinement utility.
[1] Vanderwal, T. (2005). Off the Top: Folksonomy Entries. http://www.vanderwal.net/random/category.php?cat=153
[2] Cronen-Townsend, S. and Croft, W.B. (2002). Quantifying query ambiguity. In Proceedings of the Second International Conference on Human Language Technology Research (March 2002), 104–109.
[3] Shneiderman, B. (1992). Tree visualization with tree-maps: a 2-d space-filling approach. ACM Transactions on Graphics, 11(1) (January 1992), 92–99.
[4] ACM Portal: http://portal.acm.org/
[5] Church, K. and Gale, W. (1995). Inverse Document Frequency (IDF): A Measure of Deviation from Poisson. In Proceedings of the Third Workshop on Very Large Corpora, 121–130.
[6] Zelevinsky, V., Wang, J., and Tunkelang, D. (2008). Supporting Exploratory Search for the ACM Digital Library. Second Workshop on Human-Computer Interaction and Information Retrieval (HCIR ’08).