Can Social Bookmarking
Search is probably the most important application on the web. Arguably, both Google and Yahoo! are currently sustained by selling sponsored search ads next to search results.
The history of web search has been a series of big jumps in search quality followed by relative lulls. For example, the introduction of link structure to determine authority (e.g., PageRank and HITS) around 1998 was one such big improvement.
This leads to the big question: what will be the next big thing that substantially improves search quality?
One of the big contenders for this "next big thing" is social search: the idea of adding user annotations or other metadata to the process. Of course, arguably "search" has been "social" ever since people began incorporating anchortext into search engines, or ever since users started contributing the first web pages.
However, usually what is meant by social search is something like incorporating user-generated tags or ratings into an index or ranking function.
As a result, in the paper below, we ask the question: "Can Social Bookmarking Improve Web Search?". By extension, we also ask, if so, in what ways, and if not, why not?
This page attempts to give an informal introduction and description for the results that we recently presented at the Web Search and Data Mining conference.
In the paper, we present eleven main results about the social bookmarking system del.icio.us. These results form the basis of any search functionality that might be built on top of social bookmarking.
The data gathering for our results is fairly involved, and is discussed in the paper. However, the summary is that our results are based on:
We divide the results into positive (+) and negative (-) for the impact social bookmarking might have on web search. We further subdivide the results into those that are about the URLs being bookmarked (URLs), and the tags that people apply (tags).
Pages posted to del.icio.us are often recently modified. The histogram below (click for larger version) shows how recently a page just posted to del.icio.us was modified.
For example, if a page (say, Slashdot) was just posted by a user (say, "Bob") and when it was posted, that page was changed one day ago, it would slightly increase the height of the bar above one day. Time is in log-scale.
What this graph shows is that the bulk of posts to del.icio.us are modified either about a day ago, or about ten days ago, at the time when they are posted.
Of course, information like this is useless without context. So the question is, what kinds of pages do users usually see, and how recently are those pages generally modified?
We compared to two things:
For ODP, we randomly sampled URLs from the data dump. We then searched for those URLs with the Yahoo! Search API, and recorded the ModificationDate attribute (see here and here for ModificationDate attribute information). Because HTTP/1.1's Last-Modified header turns out to be infrequently used and often incorrect, it's more sensible to use a search engine's information on recency. We determine the recency for del.icio.us URLs in the same way.
For Yahoo! Search results, we randomly sampled queries from the AOL query log dataset. We then submitted the queries to the Yahoo! Search API. Lastly, we looked at the ModificationDate attribute for either the top 1, top 10, or top 100 results returned for each query.
The end result of these two processes is a general picture of recency of web directories and search results that we can compare to del.icio.us (click for larger version):
Looking at these histograms side-by-side, three things are apparent.
The broad conclusion is: del.icio.us users post interesting pages that are actively updated or have been recently created. What this means is that search engines could probably change their crawl scheduling (see here) to produce better, more up-to-date results.
At any given time, a search engine can only crawl and index a portion of the web as a whole. The challenge is to determine which parts are worth crawling (as they are created) and to crawl those parts as quickly as possible. Approximately 25% of URLs posted to del.icio.us by users are new, unindexed pages which will later be indexed.
In order to get this result, we searched for URLs using the Yahoo! Search API as they were posted. In 37.5% of cases, URLs were missing when we searched for them. Over the next six months, we continuously searched for a sample of the missing URLs.
We assumed that if a URL was not returned by a search, it was currently unindexed. If at a later point the URL was returned by a search, it had been indexed, and was worth indexing. On the other hand, if a URL was never returned by a search, we don't know anything. It could be a useless URL, or it could be a URL which still was not found by crawlers six months after posting.
Our results were:
|URLs Found Initially||57.5%|
|URLs Indexed Within Four Weeks||12.75%|
|URLs Indexed After Four Weeks (Within 6 Months)||12.75%|
The broad conclusion is: del.icio.us can serve as a (small) data source for new web pages and to help crawl ordering.
One big question is how the URLs that people choose to post about relate to the sorts of URLs that are returned as search results.
For example, if a search engine has tagging or bookmark information for all of the top 10 results for a query, it might be able to change the ordering of those results based on the information. However, if most tagged or bookmarked URLs are less popular search results, then social bookmarking is less useful for search.
In order to answer this question, we:
What we found was that overlap between del.icio.us and search results was much higher than you might expect by (uniformly distributed) chance.
Based on the size of del.icio.us, one might expect one thousandth of the
search results to also have been posted to del.icio.us.
However, we found much greater frequency of URLs posted to del.icio.us in
|Top 10 Search Results in del.icio.us:||19%|
|Top 100 Search Results in del.icio.us:||9%|
The broad conclusion is: del.icio.us URLs are disproportionately common in search results compared to their coverage. This makes sense, especially in light of recent news that Yahoo! is integrating del.icio.us into their search results user interface.
A reasonable question is whether del.icio.us is similarly concentrated. If a significant number of users left del.icio.us, would it disappear? Would the content change substantially?
For a one month period, we looked at the proportion of posts contributed by the top n users (click for larger version):
For example, the top 100,000 users contribute a little over 80% of the content. The graph shows that there are a number of users who are highly active, core users. However, in order to cover over 50% of the content posted, one needs to include over 30,000 users (a far cry from Digg's 100).
The broad conclusion is: del.icio.us is not highly reliant on a relatively small group of users (e.g., <30,000 users).
Users of social bookmarking sites are probably acting largely out of self interest rather than in concert. This means that their actions are largely uncoordinated.
The figure below (click for larger version) shows the distribution of the number of times a URL just posted is previously in del.icio.us:
For example, about 20% of posts refer to URLs which are posted more than 50 times.
The interesting result here is that 30-40% of posts (once we account for filtering, discussed in the paper) are the first post of a URL to del.icio.us. This suggests both that del.icio.us has a large amount of information for popular URLs, and a broad collection of less popular URLs.
In order to avoid counting many URLs from the same site, we also looked at new domains. However, our numbers there are approximate. We found that approximately one in eight posts referred to URLs whose domains did not appear to have been in del.icio.us previously.
Our broad conclusion was that del.icio.us has relatively little redundancy in page information for perhaps 50% of URLs and high redundancy for perhaps 20%.
A search engine needs to do two things to return a list of results:
We looked at the overlap between the queries in the AOL query log dataset and tags annotating URLs in del.icio.us.
We found that there doesn't seem to be much correlation between the popularity of a tag and the popularity of a query term. However, there is relatively high overlap between popular query terms and popular tags.
For example, a term that occurs tens of thousands of times in queries but only occurs a few times as a tag would be represented by a dot in the lower right corner.
It might be possible that if you eliminated rare queries and rare tags that there might be a little bit of correlation. However, in general, the most popular queries seem to be heavily navigational, in contrast to the (arguably) categorical nature of tags.
For example, "google" is one of the most popular queries, while "design" and "web" are very popular tags.
The broad conclusion is: del.icio.us may be able to help with queries where tags overlap with query terms, and there is a reasonably high overlap (if not correlation) between tags and query terms.
One early complaint that many people have about tags is that users can just add whatever they want. If users can add whatever they want, they'll probably add a bunch of junk. By junk, we might mean spam, or we might mean how the user feels about the page, or just something that is irrelevant.
Not being able to determine whether a tag is junk in an automated way, we did a user study. We gave each user a sampling of postings, with some postings overlapping with other users to determine whether the users agreed with each other. There were ten users and between 100 and 150 posts per user.
We asked users whether each tag was subjective, irrelevant, and a few other questions. For example, "funny" is subjective. The tag "c++" might be irrelevant for a page about sports. We also asked if the users found any spam, and we did not find any (possibly due to filtering).
|% of Tags Users Found Irrelevant:||7%|
|% of Tags Users Found Subjective:||<5%|
Our broad conclusion was that tags are on the whole accurate. This suggests that tags are good data, even if they may be redundant given other information that search engines might have (see Results 10 and 11).
How fast are users adding bookmarks to del.icio.us? This is a relevant question both because it determines how big a phenomenon del.icio.us is, and how fast it is growing.
In around June of 2007, the answer was about 120,000 public posts per day. This number was fairly constant for the year.
We gathered our own data to determine this, though we also compared to some data gathered by Philipp Keller.
The result for our month long dataset from May to June is shown here. (It's too big to fit inline in this text, so click the "here" link and scroll along.)
The solid blue lines show our data, while the red-dashed lines show Philipp Keller's data for the same time period.
The posts per day go up during mid-day and go down at night. They also drop on weekends. The two large spikes are outages when del.icio.us went down.
While the posts per day appears quite variable in this graph, it actually ends up being quite stable when looking at months of data.
Of course, without any contrast, the 120,000 posts per day is not particularly meaningful. A reasonable contrast is blogs, which are also user-created and also have new and recent URLs and other information.
Sifry's State of the Live Web for April 2007 suggests that there are over 1.5 million blog posts per day. This suggests that social bookmarking may still be about an order of magnitude smaller in posts/day than blogging.
Our broad conclusion is that the number of posts per day is relatively small, perhaps one tenth of the blogosophere.
The web is a huge place. No one agrees on how huge, and search engines have stopped counting.
The question depends a lot on how one counts dynamic content. For example, is Digg one page? Hundreds of pages? Millions of pages?
A reasonable ballpark estimate of the number of "pages" that search engines are currently indexing is perhaps tens of billions of pages.
We do a bunch of math to approximate the current size of del.icio.us, based on historical assumptions and based on Philipp Keller's data. In the end, we conclude that there were roughly 115 million public posts, coinciding with about 30-50 million unique URLs in del.icio.us as of June 2007.
Tens of millions of unique URLs is certainly a lot! However, it would seem to be a relatively small portion of a web that has grown to tens of billions of pages and is still exploding.
In the context of such a young phenomenon as social bookmarking, perhaps a better question than "How big is it?" is "How fast is it growing?". However, this question is more uncertain.
We analyzed Philipp Keller's data, which tells two different stories.
From August 2005 to August 2006, the story is rapid, linear growth (click for larger version):
In the space of a year, the number of posts per day triples, from 30,000 posts per day to 90,000 posts per day. Much of this expansion seems to be due to Yahoo!, which purchased del.icio.us on December 9th (shown on the graph).
From November 2006 to July 2007, the story is relative stability (click for larger version):
Every day, around 120,000 posts are being posted. When we combine the two time periods, we end up with two totally different trends (click for larger version):
This suggests that the growth of social bookmarking probably depends much more on marketing and other external factors than on organic growth. In other words, we are not really comfortable predicting where social bookmarking will be months from now, or years from now.
Our broad conclusion is that the number of total URLs is relatively small; it is a small portion (perhaps one thousandth) of the web as a whole. Furthermore, the future growth of social bookmarking is difficult to predict given the information we have now.
Since tags became prevalent several years ago, people have wondered about what they represent. In particular, are tags only useful for sites like Flickr and YouTube where there is no other text available? Or are tags useful even when substantial additional text is available, which is often the case on the web?
With our crawl data, we were finally able to compare the tags to the text of the pages they annotate.
Perhaps the most striking result is that in about one sixth of cases, a tag annotating a page will occur in the title of that page.
For example, the del.icio.us page for CNN shows that the two most popular tags for CNN are "news" (11473 times) and "cnn" (2499 times) two tags which are present in the title ("CNN.com - Breaking News, U.S., World, Weather, Entertainment & Video News").
Next most interesting is that tags occur in the page text of about half the pages they annotate. For example, a page tagged "cows" will have the word "cows" somewhere in it.
Finally, if we look at the page and all of its surrounding pages (pages that link to that page and pages that are linked from that page), a tag will be in all of that text 80% of the time.
|Tag in Title||16%|
|Tag in Page Text||50%|
|Tag in Page Text + Surrounding Text||80%|
What this suggests is that in a substantial number of cases, tags are not really significantly different information than the page text already being indexed by search engines.
Furthermore, search engines already heavily emphasize titles when returning results.
However, tags may have the function of emphasizing some aspect of the text which might not have been emphasized by the page itself. So in that respect they may be useful.
For example, a page about "oil prices increasing" might have a small note about "inflation" at the bottom, but if a user tags that page with "inflation", it might imply that "inflation" is a more important concept for the page than is stated in the text.
Our broad conclusion is that a substantial proportion of tags are obvious in context, and many tagged pages would be discovered by a search engine. In terms of lessons for tagging systems in general, this may mean that while tagging for media sharing sites like Flickr and YouTube works well, tagging may be less informative for systems which already have full text.
On the web, sites tend to be organized into domains. This in turn means that domains tend to be topical.
Furthermore, when a site outgrows a single domain, it tends to create a number of subdomains. For example, Slashdot has a number of subdomains pertaining to different topics, like games and development.
This means that domains already contain some of the categorical or topical information that tags are supposed to be adding.
For instance, in one of our datasets, more than 10% of the URLs tagged "java"
are at one of five java-related domains:
|Host||Host % of Tag||Tag % of Host|
In the paper, we go into some depth predicting tags based on domains in one of our datasets.
We also asked our users in our user study whether tags applied only to the URL involved or whether they applied to the entire domain. Our users said that about one in five tags applied not just to the URL in question, but to the domain as a whole.
Both of these statements, both in terms of prediction and in terms of our user study suggest that a well financed search engine could pay "librarians" to tag entire domains rather than single URLs. Our broad conclusion was that paying such librarians would be more efficient, and might obviate the need for about 20% of the tags on URLs bookmarked.
However, there are some aspects of social bookmarking that mean that it could be really useful for web search, like the recency of the pages posted (Results 1 and 2) and the overlap of tags and bookmarks with queries and search results (Results 3 and 6).
|Title:||Can Social Bookmarking Improve Web Search?|
|Authors:||Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina|
|Keywords:||Social Bookmarking, Collaborative Tagging, Web Search|
(conference paper pdf)
(dbpubs info) (doi) (tech report pdf)
|Description:||This paper was presented at the First ACM International Conference on Web Search and Data Mining (WSDM'08). The main contribution is eleven experiments evaluating different aspects of social bookmarking and their impact on web search.|