"Too Many Jelly Beans?"

Tag Hierarchies

Background

Recently I have been interested in creating hierarchical taxonomies from flat tag data. Tagging systems like del.icio.us, Flickr, and CiteULike tend to have (relatively) flat tags. This means that while one can easily browse by a tag, like photography, one cannot as easily see which tags are broader or narrower than that tag. As a result, it is also difficult to get a broad overview of what tags exist in these systems, aside from frequency-based displays like tag clouds.

Some commentators have suggested that ontology is overrated, even irrelevant, and that there is no hierarchy in ideas, only links:

'Just Links' image courtesy of Clay Shirky.

This may be overstating the point a little. Although many different hierarchies can often be created for any given set of data, hierarchies are indisputably useful for a major type of information retrieval task: browsing. When we do not know exactly what we are looking for, it is much easier to broaden and narrow our area of interest than to perform some sort of random walk from idea to idea. The top few categories of a traditional hierarchy give us a much better idea of the contents of a media collection than thousands of individual tags, even if these tags are ranked by their frequency in the collection.

Tagging systems are excellent at the task that they were designed for: allowing a large, disparate group of users to collaboratively label massive, dynamic information systems like the web, media collections of millions of images, and so on. We are working to make these systems better by automatically producing, from the raw flat tags that users generate, hierarchical taxonomies that describe the data.

I've found some interesting features of tagging datasets from del.icio.us and CiteULike which have in turn suggested reasonably good ways to create hierarchies. An example hierarchy generated using some of these methods from del.icio.us is here: mgfgsm-hierarchy.

Update (2008-02-14)

The paper listed below seems to have spawned a number of other interesting publications. See, for example, Google Scholar. However, I am increasingly unsure of the right way for users to navigate tagging systems or for such systems to organize tags. For example, I think facets might be a really good fit, but then the question becomes: how do we determine which groups of tags to call a facet? In any case, I think that the right way to organize tagging systems is likely to be fertile ground for research for some time to come.

Some interesting research in this area includes:

"Getting our head in the clouds: toward evaluation studies of tagclouds" by A. W. Rivadeneira, Daniel M. Gruen, Michael J. Muller, David R. Millen (DOI) (Scholar)
"Visualizing tags over time" by Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins (DOI) (Scholar)
"Ontologies are us: A unified model of social networks and semantics" by Peter Mika (Scholar)
"Mining the Structure of Tag Spaces for User Modeling" by E. Schwarzkopf, D. Heckmann, D. Dengler, A. Kroner (Scholar)

Papers

Title:

Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems

Authors:

Paul Heymann and Hector Garcia-Molina

Type:

Preliminary Technical Report

Keywords:

Tags, Hierarchies, Tag Hierarchy, Ontology

Accessible:

(info) (ps) (pdf)

Description:

This paper describes a simple algorithm for constructing hierarchies in social tagging systems that usually works reasonably well. The main contribution is a notion of generality in social tagging systems based on centrality in a similarity graph.
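
To make the idea of "generality as centrality in a similarity graph" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: tag similarity is assumed to be cosine similarity over the items each tag was applied to, degree centrality stands in for whatever centrality measure the paper uses, and the greedy attachment step, the two thresholds, and the names build_hierarchy, edge_threshold, attach_threshold, and ROOT are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

ROOT = "<root>"  # hypothetical label for the top of the hierarchy


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_hierarchy(annotations, edge_threshold=0.3, attach_threshold=0.3):
    """Return a {tag: parent} mapping from (item, tag) pairs.

    Assumption: both thresholds are illustrative defaults, not tuned values.
    """
    # Represent each tag as a vector of how often it was applied to each item.
    vectors = defaultdict(lambda: defaultdict(int))
    for item, tag in annotations:
        vectors[tag][item] += 1
    tags = sorted(vectors)

    # Similarity graph: connect two tags when their similarity is high enough.
    sim = {}
    degree = {tag: 0 for tag in tags}
    for a, b in combinations(tags, 2):
        s = cosine(vectors[a], vectors[b])
        sim[a, b] = sim[b, a] = s
        if s >= edge_threshold:
            degree[a] += 1
            degree[b] += 1

    # Treat more central tags as more general and place them first.
    # Degree centrality is a stand-in; ties are broken by name.
    order = sorted(tags, key=lambda tag: (-degree[tag], tag))

    parent = {}
    placed = []
    for tag in order:
        # Attach to the most similar tag already in the tree, if similar enough.
        best = max(placed, key=lambda p: sim[p, tag], default=None)
        if best is not None and sim[best, tag] >= attach_threshold:
            parent[tag] = best
        else:
            parent[tag] = ROOT
        placed.append(tag)
    return parent


# Toy data: (item, tag) pairs, invented purely for illustration.
annotations = [
    ("p1", "photography"), ("p2", "photography"),
    ("p3", "photography"), ("p4", "photography"),
    ("p1", "nikon"), ("p2", "nikon"),
    ("p3", "canon"), ("p4", "canon"),
    ("d1", "python"), ("d2", "python"),
    ("d1", "django"), ("d2", "flask"),
]
hierarchy = build_hierarchy(annotations)
for tag, tag_parent in sorted(hierarchy.items()):
    print(f"{tag} -> {tag_parent}")
```

On this toy data, photography and python are the most connected tags, so they are placed first and end up directly under the root, while the narrower tags attach beneath whichever of the two they co-occur with. Real tagging data is far noisier, and the choice of similarity measure, centrality measure, and thresholds would all need care.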