Abstract
E-mail address harvesting is a growing problem for ordinary users and system administrators alike. The purpose of this research was to find ways of obfuscating e-mail addresses so that they are not harvested and later spammed, and to look for a correlation between the page rank of the website on which an address was posted and the amount of spam it received. To begin the research, we distributed addresses to websites and recorded the page rank of each site. We found that there seems to be a correlation between the page rank of a website and the amount of spam received; that is, more spam was sent to addresses that were harvested from a well-traveled page. We also found that certain masking techniques worked better in deterring spam than others.
Introduction
Unsolicited commercial e-mail, known colloquially as “spam”, is a growing problem for system administrators. Spam is used by criminals to sell illicit products or services, deliver malware, or to perform a denial-of-service attack on a system’s mail server. While most research on solving the spam problem has focused on filtering and other server-side techniques, a complementary approach to reducing spam is to target address harvesting behavior.
An understanding of harvesting behavior can enable new techniques for preventing harvesting. Making it more difficult for a spammer to obtain a list of target addresses can have a potentially drastic effect on his or her ability to profit from sending unwanted messages.
The conventional wisdom says that spammers obtain e-mail addresses using several different techniques, but that harvesting addresses from public web sites using “harvesting bots” or “harvesters” is the most common technique. Harvesting scripts comb through web sites, forums, and news groups looking for addresses to target. To be effective, they must separate address-like text from ordinary content. This is not a computationally hard task because e-mail addresses on the Internet are usually displayed in plain text and contain easily identified symbols such as the at sign.
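To illustrate why separating address-like text from ordinary content is computationally easy, the following is a minimal sketch of the matching step only. The pattern and the function name harvest_addresses are our own assumptions for illustration; they are not taken from any real harvester, and real address syntax is broader than this pattern.

```python
import re

# Simplified, assumed pattern for address-like text: a local part, an at sign,
# and a dotted domain. Real address syntax allows more than this.
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def harvest_addresses(page_text):
    """Return the unique address-like strings in page_text, in order of first appearance."""
    found = []
    for match in ADDRESS_PATTERN.findall(page_text):
        if match not in found:
            found.append(match)
    return found
```

A pattern of this kind finds an address written in plain text, but it does not match the same address once the at sign has been replaced by a word, which is the intuition behind camouflaging an address.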
A system administrator can reduce the likelihood that an address on his site will be harvested by obfuscating the address instead of posting it in plain text. However, very little research has been done on the effectiveness of such camouflaging techniques.
There has been one major, ongoing study that tracks harvesting behavior on web sites. Project Honey Pot is a “distributed network of decoy web pages website administrators can include on their sites in order to gather information about robots, crawlers, and spiders” [1]. They use a proprietary tagging system and crowd-sourcing to track harvesting bots. The project has identified over 90,000 bots, almost 80 million spam servers, and has received over 1 billion spam messages.
The data collected by the honey pot project has several limitations. First, addresses are not posted on existing web pages, but to special decoy pages that are added to a web site. Second, the results do not include data on the popularity of the harvested web sites.
In this paper, we present research on the effectiveness of five different techniques for camouflaging an address and correlate our results with the popularity of the web site on which the address was posted. We compare addresses posted on newly created decoy pages on free-hosting sites with addresses posted on established, pre-existing web pages.
Methodology
As a measure of the popularity of a web site, we estimate its Google page rank. Page rank gives a rough approximation of how well trafficked a website is and how easy it is to find in a search engine. Since pages with a high page rank typically receive more traffic than pages with a low page rank, it is likely that addresses posted on pages with a high page rank will be more visible to harvesting bots and thus receive more spam than those on a website with a low page rank.
To track harvesting behavior, we created a program that generates “poisoned” e-mail addresses. A poisoned e-mail address is not associated with an actual user account but maps to an artificial account used to store spam messages for analysis. We embedded these addresses on well over a hundred different web sites. Many of these were decoy pages posted to free-hosting web sites with low page rank, but we also posted addresses to existing pages on well-trafficked sites with high page rank.
All of the e-mail addresses were carefully designed to look like those used by an ordinary user. Each address consists of three random letters, four digits assigned in order of creation, an “at sign”, and the domain “pamplinhealth.blogsite.org”. This mimics an address format previously used by Longwood University but replaced by a new system a few years ago. By incrementing the four-digit numeric portion of the address each time we created a new account, we ensure that addresses are unique.
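The address format described above can be sketched as follows. This is an illustration under stated assumptions, not the program used in the study: the function names, the use of lowercase letters, and the seedable random generator are ours.

```python
import random
import string

DOMAIN = "pamplinhealth.blogsite.org"  # domain named in the text


def make_address(sequence_number, rng=random):
    """Build one address: three random letters, a four-digit counter, and the domain."""
    if not 0 <= sequence_number <= 9999:
        raise ValueError("sequence number must fit in four digits")
    # Assumption: lowercase letters, as in the addresses quoted in the results.
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
    return f"{letters}{sequence_number:04d}@{DOMAIN}"


def make_addresses(start, count, rng=random):
    """Create count addresses with consecutive counters, so no two are equal."""
    return [make_address(n, rng) for n in range(start, start + count)]
```

Because the numeric portion differs for every account, uniqueness does not depend on the random letters.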
On each web site, we posted five addresses, each camouflaged in a different way. The first address was posted in plain text. The second address was camouflaged by replacing the characters “@” and “.” with the words “AT” and “DOT”. The third address was written backwards on the web site. The fourth address was obfuscated by placing “#” signs at random locations within the address. The fifth address was posted as a PNG image.
These variations are illustrated in Figure 1. We also created accounts for 100 control addresses that were not advertised on any website.
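The three text-based camouflage techniques described above can be sketched as simple string transformations. The function names, the spaces around “AT” and “DOT”, and the number of “#” signs are assumptions made for illustration; the text does not specify them. The image technique is omitted because it requires rendering the address as a picture.

```python
import random


def mask_at_dot(address):
    """Replace '@' and '.' with the words AT and DOT (surrounding spaces are assumed)."""
    return address.replace("@", " AT ").replace(".", " DOT ")


def mask_reversed(address):
    """Write the address backwards."""
    return address[::-1]


def mask_hashes(address, count=3, rng=random):
    """Insert count '#' signs at random positions (the count is an assumption)."""
    chars = list(address)
    for _ in range(count):
        # randint is inclusive at both ends, so a sign may also land at the very end.
        chars.insert(rng.randint(0, len(chars)), "#")
    return "".join(chars)
```

For a hypothetical address abc1234@example.org, mask_at_dot gives “abc1234 AT example DOT org” and mask_reversed gives “gro.elpmaxe@4321cba”. A human reader can undo each transformation easily, whereas a harvester that looks only for plain address-like text would have to anticipate it.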
Each address we created is aliased to one of six accounts on a mail server. These accounts correspond directly to the obfuscation techniques used to disguise the address. For example, an unmasked e-mail address is aliased to the “web1” account, an address using “AT”/“DOT” obfuscation is aliased to the “web2” account, and so forth. The control addresses were mapped to an account named “unadvertized”.
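The aliasing scheme can be pictured as a lookup from camouflage technique to collecting account. In the sketch below, only “web1”, “web2”, and “unadvertized” are named in the text; the names “web3” through “web5”, the technique labels, and the “local-part: account” line format (the style used by sendmail-type alias tables) are assumptions, since the mail server software is not specified.

```python
# Mapping from camouflage technique to collecting account.
TECHNIQUE_ACCOUNTS = {
    "plain": "web1",            # named in the text
    "at_dot": "web2",           # named in the text
    "reversed": "web3",         # assumed continuation of the naming scheme
    "hashes": "web4",           # assumed
    "image": "web5",            # assumed
    "control": "unadvertized",  # named in the text
}


def alias_line(address, technique):
    """Return one alias-table line that routes an address to its collecting account."""
    local_part = address.split("@", 1)[0]
    return f"{local_part}: {TECHNIQUE_ACCOUNTS[technique]}"
```

Routing by technique means that the account in which a message arrives already identifies how the harvested address was camouflaged.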
To prevent the server from becoming an open relay, we enabled incoming e-mails, but disabled sending of e-mails. However, we deliberately avoided configuring any other security measures (such as filtering) that might interfere with the collection of spam messages.
We tested each account by sending test e-mails to an alias mapping to that account. The test e-mails were then deleted and are not included in our results.
Results
We distributed addresses to just over one hundred free-hosting sites. We also distributed addresses to well-trafficked sites. Our preliminary data is shown in Figure 2.
As of March 2011, we have received a total of fifty-four messages to five different addresses. The addresses tds2055 and tsu1335 – addresses posted on low page rank free hosting web sites steadywebs.com and 55fast.com – each received one unsolicited commercial e-mail.
Addresses vqj2235, kko2175 and tjf2230 were posted on well-trafficked web sites with high page rank and together received 52 of the 54 messages. Address vqj2235 was posted on “whylongwood.com” and received 1 message. Address kko2175 was posted on “narnia.homeunix.com” and received 4 messages. Address tjf2230 was posted on “longwood.edu” and received 47 messages.
Since all five of these addresses were posted in plain text, and no camouflaged address received any spam, the evidence suggests that obfuscation of addresses is very effective at preventing harvesting. Furthermore, the sites with higher page rank received much more spam in total than the sites with lower page rank.
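The comparison between the two groups of sites follows directly from the per-address counts reported above, and can be checked with a short tally. The “low” and “high” labels follow the descriptions in the text rather than numeric page rank values, and the function name is ours.

```python
# Messages received per address, as reported in the text.
MESSAGES = {
    "tds2055": ("low", 1),
    "tsu1335": ("low", 1),
    "vqj2235": ("high", 1),
    "kko2175": ("high", 4),
    "tjf2230": ("high", 47),
}


def totals_by_group(messages):
    """Sum the message counts for each page rank group."""
    totals = {}
    for group, count in messages.values():
        totals[group] = totals.get(group, 0) + count
    return totals
```

Calling totals_by_group(MESSAGES) gives 2 messages for the low page rank sites and 52 for the high page rank sites, which together account for all 54 messages. Most of the difference comes from the single address tjf2230.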
One of the interesting characteristics of our data is the time delay between the posting of addresses and receipt of spam messages. Most of the addresses we have posted have been online for over a year and have received no spam. The e-mail address tjf2230 was posted in November of 2010 and received five spam messages within two weeks. This is strong evidence that well-trafficked sites are much more likely to be harvested than less-visited sites, but it deserves more analysis.
Most of the spam we received is Nigerian fraud spam. We have also received several phishing e-mails.
Conclusion
Our research suggests a strong link between the page rank of the page on which an address is posted and the likelihood that the address will be harvested. It also demonstrates that obfuscation of e-mail addresses is a successful way to prevent address harvesting.
In future work, we would like to further expand the number of addresses which are posted on highly ranked pages. This is actually fairly challenging, because system administrators are reluctant to post addresses on existing pages. We would also like to explore other relationships in the data, such as whether there is a connection between the IP address of the spammer and the IP address of the harvester.
Bibliography
[1] Project Honey Pot. Matthew Prince, Eric Langheinrich, Lee Holloway, Rachelle Milbank, et al. http://projecthoneypot.org