By Brian Krebs
washingtonpost.com Staff Writer Tuesday, November 26, 2002; 1:39 PM
CipherTrust wants garbage -- e-mail garbage to be exact. It wants every e-mail from Nigeria promising millions and every one of those
e-mail solicitations for "free" pornography, Viagra and adult services.
The folks at CipherTrust aren't e-mail masochists. They're out to build the Dewey Decimal System of "spam."
The Alpharetta, Ga.-based security company will use the detritus of offensive e-mail marketing to open an online library -- www.spamarchive.org -- that programmers and researchers can use in the never-ending fight against
spam.
The company hopes to amass at least 10 million spam samples within a year, said Paul Judge, the company's director of research and
development. The project is already well on its way there, he said, thanks to dozens of anti-spam activists who have donated junk e-mails from their in-boxes -- from a few hundred to a few hundred thousand.
Unsolicited bulk e-mail is at an all-time high, according to firms that track it. Spam now accounts for roughly 40 percent of all e-mail,
up from less than 10 percent early last year, according to anti-spam service provider Brightmail.
A public spam library could be a huge boon to the anti-spam community, not just to commercial software vendors; most spam-fighting tools are
developed by independent programmers who give away their wares.
"This should help eliminate one of the big bottlenecks for people who want to make anti-spam tools," said Paul Graham, a computer programmer who has developed open-source mail filtering programs. "You can write all the code you want, but it won't do a whole lot of good unless you have a large amount of spam to test your algorithms on."
Graham is leading the "statistical filtering" approach to trapping spam, a method whose accuracy depends on the amount of spam used in
the testing process.
Traditional anti-spam programs target junk e-mails by searching for specific words or catchphrases commonly found in junk e-mail, such as "teen," "click here" and "Dear Friend." The text-based approach usually stops at least 80 percent of spam.
Yet attempts to increase spam software's level of accuracy by adding more words to the watch list often result in "false positives," where innocent e-mail is treated like spam and sent to the virtual trash can, Graham said.
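To make the trade-off concrete, here is a minimal sketch of a watch-list filter of the kind described above. The three phrases come from the examples in this article; the function name and the sample messages are hypothetical, and real products use far longer lists and more careful matching.

```python
# Hypothetical watch list, taken from the example phrases quoted above.
WATCH_LIST = ("teen", "click here", "dear friend")


def looks_like_spam(message, watch_list=WATCH_LIST):
    """Flag a message if any watch-list phrase appears anywhere in it."""
    text = message.lower()
    return any(phrase in text for phrase in watch_list)


# A typical junk message is caught...
print(looks_like_spam("Dear Friend, click here to claim your prize"))
# ...but so is an innocent note that happens to contain "teen".
print(looks_like_spam("Our teen soccer league meets Friday"))
```

The second message is the kind of false positive Graham describes: every phrase added to the list catches more spam, but also more legitimate mail.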
"For most users, missing legitimate e-mail is far worse than receiving spam," he said.
With statistical filtering, a mathematical formula determines the prevalence of common spam terms within a collection of junk e-mail,
and examines how frequently those same terms appear in a body of legitimate messages.
"So, if you notice that a particular word shows up in 20 percent of spam and .0001 percent of good e-mail, it means when you find that
term in a newly-arrived e-mail there's a good bet it's spam," Graham said.
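Graham's example can be worked through in a few lines. The sketch below is a simplified illustration, not Graham's actual formula: it assumes each word is scored by the ratio of its spam rate to the sum of its spam and legitimate rates, and the function names and tiny sample collections are hypothetical.

```python
def doc_rate(word, messages):
    """Fraction of messages that contain the word at least once."""
    if not messages:
        return 0.0
    hits = sum(1 for m in messages if word in m.lower().split())
    return hits / len(messages)


def rate_ratio(spam_rate, ham_rate):
    """Score a word from its two rates; 0.5 means no evidence either way."""
    if spam_rate + ham_rate == 0:
        return 0.5
    return spam_rate / (spam_rate + ham_rate)


def spam_probability(word, spam, ham):
    """Score a word against a spam collection and a legitimate collection."""
    return rate_ratio(doc_rate(word, spam), doc_rate(word, ham))


# Hypothetical miniature collections, for illustration only.
spam = ["buy viagra now", "free viagra offer", "hello friend"]
ham = ["lunch at noon", "meeting notes attached"]

print(spam_probability("viagra", spam, ham))  # word seen only in spam
print(spam_probability("lunch", spam, ham))   # word seen only in good mail

# The figures from the quote: 20 percent of spam, .0001 percent of good mail.
print(rate_ratio(0.20, 0.000001))
```

With only a handful of messages the rates are crude, which is why a method like this depends on having a large body of spam to count against.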
Eric S. Raymond, a programmer and unofficial spokesman for the open-source software movement, called the spam archive a good idea, but questioned the need for such a huge database, given the limited vocabulary of the average
junk e-mail.
"You can build an effective (statistical) spam filter with a few thousand spam samples because the language spammers use is very stereotyped," Raymond said.
Still, a huge spam archive would yield interesting and useful patterns, such as the Web sites, P.O. boxes and 1-800 numbers spammers use to ply their trade -- all of which can be used to locate spammers in the real world, Graham said.
Many companies have solicited spam from the public, but most don't share their collections.
The Federal Trade Commission has compiled about 23 million junk e-mail
messages, which it uses for fraud investigations and consumer
education campaigns. The commission has refused requests to open the
record to the public, saying that it is difficult to remove private
data from the messages, such as the e-mail addresses of people who
sent their spam e-mails to the FTC.
The commission cannot take action against most people who send unsolicited junk e-mail, because sending it is illegal only if the sender defrauds consumers or solicits illegal activity. Twenty-six states have laws that curb junk
e-mail by outlawing bogus return addresses and requiring marketers to identify advertisements with labels such as "ADV:" in a message's
subject line.
But even the strongest laws or the best junk e-mail filters won't stop
the most ardent spammers, said Raymond.
"It's like a whack-a-mole game: you shut down (spam e-mail) servers in
one place, and the same spammers pop up again in another place running
a shoestring operation out of their basement," Raymond said. "But in a
weird way, that sort of highlights one of the Internet's strengths,
that it's very hard to lock someone out of communication or suppress
speech."
Lest spammers try to harvest new victims from the database,
CipherTrust will "scrub" all messages forwarded to the archive to
remove any e-mail addresses, Judge said. Each spam specimen will be
tagged with a reference number and assigned to a category based on the
message's content.
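The scrub-and-tag step might look something like the sketch below. CipherTrust has not described its actual process, so everything here is an assumption: the address pattern is a rough approximation, and the category keywords, placeholder text and function names are hypothetical.

```python
import itertools
import re

# Rough pattern for e-mail addresses; an approximation, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical keyword-to-category map, for illustration only.
CATEGORIES = {"viagra": "pharmaceutical", "million": "financial"}

# Hypothetical source of sequential reference numbers.
_next_ref = itertools.count(1)


def scrub(text):
    """Replace every e-mail address in the text with a placeholder."""
    return EMAIL_RE.sub("[address removed]", text)


def categorize(text):
    """Pick the first category whose keyword appears in the message."""
    lowered = text.lower()
    for keyword, category in CATEGORIES.items():
        if keyword in lowered:
            return category
    return "uncategorized"


def archive_entry(raw):
    """Build an archive record: reference number, category, scrubbed body."""
    return {"ref": next(_next_ref), "category": categorize(raw), "body": scrub(raw)}


entry = archive_entry("Claim your MILLION dollars, write joe@example.net today")
print(entry["category"], "|", entry["body"])
```

A pattern this simple would miss addresses that spammers obfuscate, which hints at why the FTC calls removing private data difficult.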
CipherTrust said spamarchive.org will be a free service, but the
company said that having a huge archive of spam will help its
technicians improve "IronMail," CipherTrust's proprietary anti-spam
product.