"The web, man, it's like you can just dump stuff on it."  
(overheard spam comment at WSDM'08)  


Clay Shirky defines social systems as "stuff that gets spammed." Eventually, any sufficiently interesting method of communication or repository of user-contributed data will end up being the target of spammers.

My advisor has been working on fighting spam for a long time, including the TrustRank and Spam Mass methods developed with Zoltán Gyöngyi. I also have had an interest in spam (particularly, e-mail spam) for some period of time, and wrote a little in my undergraduate honor's thesis about my own pet method to fight it (with Prof. Alexander Hartemink).

With this background, it only made sense to see if we could combat spam within tagging systems, and within social websites in general.

Fighting Spam on Social Websites

E-mail spam has existed for at least 15 years now, and arguably over 25. This seems to be due to the combination of an overly naive protocol (SMTP) and the profitability of exceedingly low response rates.

The one benefit of this long history is that a plethora of responses to e-mail spam have been proposed. In fact, it's gotten to the point where new proposals are even met with form responses, like this (mirrored here).

There are really a lot of things people have proposed. For example:

Wikipedia has a fairly complete list here.

While there has been a great deal of work figuring out how to fight e-mail spam, there has been relatively little thinking about which solutions are general to all social systems, and which solutions are not.

My main contribution was to try to make a taxonomy of spam fighting solutions, in particular, as they related to social web sites. In the magazine paper below (from IEEE Internet Computing) I lay out the following taxonomy:

  • Detection-based methods: In the first phase, likely spam is identified. Either users can manually identify spam, or pattern-based classification can be used. In the second phase, the system interfaces take into account the likely spam. They may either delete spam and then compute their results, or they may display particular results as likely spam.
  • Prevention-based methods: Prevention methods try to make contribution of spam content more difficult. This can either be by limiting user contribution, or changing interfaces to make automated access more difficult.
  • Demotion-based methods: Demotion methods try to provide better orderings of results for interfaces that produce lists of results.
The magazine article goes in to much more detail.

In the case of e-mail spam, it has only really been possible to implement detection-based methods, because of the nature of SMTP. In the case of web search, it has only really been common to implement demotion-based methods.

The really exciting thing about social websites is that the system owner now can use a wide variety of spam fighting tools that have been developed previously. For example, while it has been really difficult to get users to use proof-of-work for e-mail systems, the owner of a social website could instantly implement such a feature (for example, using WP Hashcash).

There is a lot of future work to do here. While it is now possible to take a more general approach to the spam problem, the big question is what sorts of things users will tolerate. Will a user still go to a website with HashCash? With a CAPTCHA? With limits on posting? Only time will tell.

Modeling Spam in Tagging Systems

More directly related to tagging is the question of how to fight spam in tagging systems. In other related work, Georgia Koutrika and Frans Effendi in particular (as well as a few others in our lab) have been looking at this problem.

However, one challenge is that the top tagging sites tend to be pretty proactive about fighting tag spam. Furthermore, what tag spam there is tends to be pretty obvious for one reason or another. Unlike the web or e-mail where many iterations between spammers and anti-spammers have taken place, tagging is relatively new.

As a result, it seems to make more sense at this point to simulate tag spam and look at the result of various tag anti-spam measures.

The model in the papers below has a few issues with it (for example, users contribute a uniform number of tags rather than a power law distributed number of tags). However, it leads to interesting results. For example, it leads to a rejection of limit-based techniques because limits impede both good users and bad, and the more good data we have the better we can do at detecting and demoting spam.


Title:Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges
Authors:Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina
Type:Magazine Article
Keywords:Social Web Sites, Spam, Tagging
Accessible: (info) (pdf)
Description: This is a magazine article in IEEE Internet Computing. The main contribution is a taxonomy of spam fighting approaches, and a discussion of their applicability to social websites. Tagging is discussed as a potential use case, and we also give a summary of the previous tag spam work from below.
Title:Combating Spam in Tagging Systems
Authors:Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann, Hector Garcia-Molina
Type:Workshop Paper
Keywords:Tagging, Spam
Accessible: (info)
Description: This paper presents a model for spam in tagging systems. The model is realistic in some respects and less realistic in others, but it nonetheless leads to some interesting insights. For example, limiting tag contributions by all users is unlikely to help in confronting spam from malicious users.
Title:Combating Spam in Tagging Systems: An Evaluation
Authors:Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann, Hector Garcia-Molina
Type:Technical Report
Keywords:Tagging, Spam
Accessible: (info) (pdf)
Description: This is a technical report version of the workshop paper above.
Title:Combating Spam in Tagging Systems
Authors:Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann, Hector Garcia-Molina
Type:Technical Report
Keywords:Tagging, Spam
Accessible: (info) (pdf)
Description: This is a technical report version of a journal paper to appear in ACM Transactions on the Web that expands on the workshop paper, above.