I'm collecting spam messages

[ stinger @ 05.07.2004. 16:51 ] @

Take a look here:
http://spamassassin.org/publiccorpus/

These are messages that contain the most common elements of spam, packaged together with regular messages that meet certain criteria, which is ideal for testing anti-spam filters. Still, I wouldn't advise you to "reinvent the wheel", because there really is a large number of AS (anti-spam) systems out there. You'd be better off joining one of the developer teams and helping them make the existing software even better, unless of course you have some revolutionary idea for anti-spam algorithms that you want to implement in your own AS system.

-- readme snip --
Welcome to the SpamAssassin public mail corpus. This is a selection of mail
messages, suitable for use in testing spam filtering systems. Pertinent
points:

- All headers are reproduced in full. Some address obfuscation has taken
place, and hostnames in some cases have been replaced with
"spamassassin.taint.org" (which has a valid MX record). In most cases
though, the headers appear as they were received.

- All of these messages were posted to public fora, were sent to me in the
knowledge that they may be made public, were sent by me, or originated as
newsletters from public news web sites.

- relying on data from public networked blacklists like DNSBLs, Razor, DCC
or Pyzor for identification of these messages is not recommended, as a
previous downloader of this corpus might have reported them!

- Copyright for the text in the messages remains with the original senders.

OK, now onto the corpus description. It's split into three parts, as follows:

- spam: 500 spam messages, all received from non-spam-trap sources.

- easy_ham: 2500 non-spam messages. These are typically quite easy to
differentiate from spam, since they frequently do not contain any spammish
signatures (like HTML etc).

- hard_ham: 250 non-spam messages which are closer in many respects to
typical spam: use of HTML, unusual HTML markup, coloured text,
"spammish-sounding" phrases etc.

- easy_ham_2: 1400 non-spam messages. A more recent addition to the set.

- spam_2: 1397 spam messages. Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio.

The corpora are prefixed with the date they were assembled. They are
compressed using "bzip2". The messages are named by a message number and
their MD5 checksum.

The "obsolete" dir contains old versions of the corpus, for reference,
in case you need to correlate test results using these older versions
against the source messages. The messages in those corpora are generally
included in the fresher corpora.

This corpus lives at http://spamassassin.org/publiccorpus/ . Mail
jm - public - corpus AT jmason dot org if you have questions.

Note: if you write a paper or similar using this corpus, and it's available
for download, we'd love to hear about it! Mail spamassassin-devel AT lists
dot sourceforge dot net. cheers!
-- readme snap --
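
If you do end up testing a filter against this corpus, a rough sketch of the bookkeeping could look like the script below. It rechecks the counts quoted in the readme, splits a message file name into its number and checksum parts, and tallies a filter's verdicts per directory. Several things here are my assumptions and not something the readme states: that the archives are unpacked into directories named after the parts listed above (spam, easy_ham and so on), that the checksum in the file name is the MD5 of the raw file bytes, and that your filter is exposed as a `classify` callable (a hypothetical hook). All the function names are made up for illustration.

```python
import hashlib
from pathlib import Path

# Message counts exactly as listed in the readme above.
CORPUS_COUNTS = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}


def is_spam_dir(name):
    """Assumption: directories whose name starts with 'spam' hold spam."""
    return name.startswith("spam")


def spam_ratio(counts):
    """Fraction of all messages that sit in spam directories."""
    total = sum(counts.values())
    spam = sum(n for name, n in counts.items() if is_spam_dir(name))
    return spam / total


def split_name(filename):
    """Split '<number>.<md5hex>' into (number, md5hex)."""
    number, _, digest = filename.partition(".")
    return number, digest


def md5_matches(path):
    """Assumption: the name's checksum is the MD5 of the raw file bytes."""
    _, digest = split_name(path.name)
    return hashlib.md5(path.read_bytes()).hexdigest() == digest.lower()


def evaluate(root, classify):
    """Run classify(raw_bytes) -> bool (True means spam) over every message.

    Returns counts of true/false positives and negatives, with the
    directory name taken as ground truth.
    """
    tally = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        truth = is_spam_dir(folder.name)
        for msg in sorted(folder.iterdir()):
            if not msg.is_file():
                continue
            predicted = bool(classify(msg.read_bytes()))
            if truth:
                key = "tp" if predicted else "fn"
            else:
                key = "fp" if predicted else "tn"
            tally[key] += 1
    return tally
```

With the readme's numbers, `spam_ratio(CORPUS_COUNTS)` comes out at 1897 / 6047, which agrees with the "about 31%" figure. Keep an eye on the `fp` count from `evaluate` on hard_ham in particular, since those messages are the ones the readme describes as closest to typical spam.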