Real-time DNS blocklist accuracy figures

Spam: DNS blocklists are the oldest means of spam-blocking, and are still exceedingly useful; nowadays, many of these are fully automated systems, using proxy-detection algorithms and sensing patterns in mailer behaviour indicative of spam.

A few months back on the ASRG list, there was a discussion of DNSBL accuracy; I posted some SpamAssassin figures, based on our ‘mass-check’ tests, but noted that they were computed using current DNSBL contents against a corpus of saved mail, so due to the time delta, were not 100% representative.

These figures are a lot better. Since August, I’ve been collecting real-time DNSBL hit data on my mail, as it is delivered at my SpamAssassin installation. In other words, it’s live accuracy data — it’s using just what the DNSBLs had listed at scan time.


Note, however, that it’s still incomplete:

  • some DNSBLs were not measured; the set covered is just the default DNSBL list in SpamAssassin 2.60, excluding RCVD_IN_NJABL_DIALUP (which I had to remove because I couldn’t parse accurate data out of it).
  • it’s only 1 person’s hand-classified mail.
  • SpamAssassin tests more than just the ‘delivering’ SMTP relay; it’ll also look backwards through the headers, at earlier relays, to catch spam sent via mailing lists. This is different from what’s used with most traditional DNSBL-supporting systems.

But the results should still be quite useful.
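
To make the third caveat concrete, here is a minimal sketch of how the relays named in a message's Received headers can each be turned into a DNSBL query name (the relay's IPv4 octets reversed, followed by the list's zone). This is not SpamAssassin's actual code: the helper names, the sample message, and the zone `dnsbl.example.org` are all hypothetical, and it only handles bracketed IPv4 addresses.

```python
import ipaddress
import re
from email import message_from_string

# Matches a bracketed IPv4 address such as "[192.0.2.10]" in a Received header.
BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def relay_ips(raw_message):
    """Return IPv4 addresses found in Received headers, most recent relay first."""
    msg = message_from_string(raw_message)
    ips = []
    for header in msg.get_all("Received", []):
        for candidate in BRACKETED_IPV4.findall(header):
            try:
                ips.append(str(ipaddress.IPv4Address(candidate)))
            except ipaddress.AddressValueError:
                continue  # not a valid address, e.g. [999.1.1.1]
    return ips


def dnsbl_query_name(ip, zone):
    """Build the lookup name: reversed octets followed by the list's zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


# Hypothetical message: delivered by a mailing-list host, which received it
# from the original sender one hop earlier.
SAMPLE = (
    "Received: from list.example.net (list.example.net [192.0.2.10])\n"
    "\tby mx.example.org with ESMTP; Thu, 21 Aug 2003 17:11:30 -0700\n"
    "Received: from sender.example.com (unknown [198.51.100.7])\n"
    "\tby list.example.net with SMTP; Thu, 21 Aug 2003 17:11:02 -0700\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

names = [dnsbl_query_name(ip, "dnsbl.example.org") for ip in relay_ips(SAMPLE)]
for name in names:
    print(name)
```

A traditional setup would only look up the first name (the delivering relay); checking the earlier relay as well is what lets a listed sender be caught behind a mailing list.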

The time period covered:

  • from: Thu, 21 Aug 2003 17:11:30 -0700 (PDT)
  • to: Sat, 25 Oct 2003 23:11:52 -0700 (PDT)

Recap of the fields:

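  • OVERALL% = percentage of all messages that hit the test (the first table row gives raw message counts instead)
  • SPAM% = percentage of spam messages that hit the test
  • HAM% = percentage of ham (non-spam) messages that hit the test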
  • S/O = Spam/Overall = Bayesian probability of spam
  • RANK = artificial ranking figure, ignore this!
  • SCORE = default SpamAssassin 2.60 score
  • NAME = name of test. Figuring out the exact DNSBL should be pretty obvious ;)
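
As a quick check on the S/O column: the published values are consistent with S/O = SPAM% / (SPAM% + HAM%). That formula is inferred from the figures in this post rather than taken from SpamAssassin's source, so treat the sketch below as an illustration; the function name is made up, and the three rows are copied from the results table.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O: share of a test's hit rate that comes from spam.

    0.0 means the test only hit ham, 1.0 means it only hit spam.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("test hit no messages")
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the results table in this post.
rows = {
    "RCVD_IN_BL_SPAMCOP_NET": (59.0567, 0.6601),
    "RCVD_IN_DYNABLOCK": (24.8369, 1.4613),
    "RCVD_IN_SORBS_SPAM": (4.1144, 2.1667),
}

s_o = {name: round(spam_over_overall(s, h), 3) for name, (s, h) in rows.items()}
for name, value in s_o.items():
    print(f"{value:.3f}  {name}")
```

Because the inputs are percentages of the spam and ham corpora separately, the result is independent of how much ham versus spam the corpus happens to contain.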

OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
   21839      1993     19846  0.091  0.00   0.00  (all messages)
 100.000    9.1259   90.8741  0.091  0.00   0.00  (all messages as %)
   5.989   59.0567    0.6601  0.989  1.00   2.25  RCVD_IN_BL_SPAMCOP_NET
   3.869   37.7822    0.4636  0.988  0.96   1.10  RCVD_IN_DSBL
   0.751    8.2288    0.0000  1.000  0.95   4.30  RCVD_IN_OPM_HTTP
   1.964   20.2709    0.1260  0.994  0.95   1.10  RCVD_IN_NJABL_PROXY
   0.659    7.1751    0.0050  0.999  0.95   0.64  RCVD_IN_NJABL_SPAM
   0.614    0.0000    0.6752  0.000  0.94  -0.10  RCVD_IN_BSP_OTHER
   0.050    0.5519    0.0000  1.000  0.94   4.30  RCVD_IN_OPM_SOCKS
   0.027    0.3011    0.0000  1.000  0.94   4.30  RCVD_IN_OPM_WINGATE
   0.119    0.0000    0.1310  0.000  0.94  -4.30  RCVD_IN_BSP_TRUSTED
   0.939    9.7341    0.0554  0.994  0.94   4.30  RCVD_IN_OPM
   1.081   10.9383    0.0907  0.992  0.93   1.52  RCVD_IN_SORBS_SOCKS
   1.062   10.7376    0.0907  0.992  0.93   1.27  RCVD_IN_SBL
   0.229    2.4084    0.0101  0.996  0.93   1.10  RCVD_IN_SORBS_MISC
   0.618    6.3221    0.0453  0.993  0.93   1.10  RCVD_IN_SORBS_HTTP
   0.595    5.9709    0.0554  0.991  0.92   4.30  RCVD_IN_OPM_HTTP_POST
   0.078    0.7526    0.0101  0.987  0.90   2.60  RCVD_IN_SORBS_ZOMBIE
   0.815    7.5263    0.1411  0.982  0.89   1.39  DNS_FROM_RFCI_DSN
   3.594   24.8369    1.4613  0.944  0.81   2.55  RCVD_IN_DYNABLOCK
   1.685   11.4400    0.7054  0.942  0.78   0.10  RCVD_IN_RFCI
   0.380    2.4586    0.1713  0.935  0.75   1.31  RCVD_IN_NJABL_RELAY
   6.182   33.9689    3.3911  0.909  0.73   0.10  RCVD_IN_NJABL
  10.422   44.4054    7.0090  0.864  0.63   0.10  RCVD_IN_SORBS
   0.037    0.1505    0.0252  0.857  0.54   2.80  RCVD_IN_SORBS_WEB
   2.344    4.1144    2.1667  0.655  0.17   0.00  RCVD_IN_SORBS_SPAM
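
For anyone wanting to produce similar figures from their own mail: SpamAssassin records the rules each message hit in its X-Spam-Status header, so hand-sorted ham and spam can be tallied afterwards. The sketch below is not the mass-check tool itself. It assumes the header carries a `tests=` field with a comma-separated rule list, the function names are made up, and the four sample messages are invented toy data, not part of the corpus measured here.

```python
import re
from collections import Counter

# Capture everything after "tests=" up to the next "key=" field or end of header.
TESTS_FIELD = re.compile(r"tests=(.*?)(?:\s+\w+=|$)", re.S)


def rules_hit(status_header):
    """Return the set of rule names listed in an X-Spam-Status header value."""
    match = TESTS_FIELD.search(status_header)
    if not match:
        return set()
    names = re.split(r"[,\s]+", match.group(1).strip())
    return {name for name in names if name and name != "none"}


def tally(messages):
    """messages: iterable of (is_spam, X-Spam-Status value) pairs.

    Returns {rule: (overall_pct, spam_pct, ham_pct)}.
    """
    totals = Counter()
    hits = {True: Counter(), False: Counter()}
    for is_spam, status in messages:
        totals[is_spam] += 1
        hits[is_spam].update(rules_hit(status))

    def pct(count, total):
        return 100.0 * count / total if total else 0.0

    report = {}
    for rule in set(hits[True]) | set(hits[False]):
        spam_hits, ham_hits = hits[True][rule], hits[False][rule]
        report[rule] = (
            pct(spam_hits + ham_hits, totals[True] + totals[False]),
            pct(spam_hits, totals[True]),
            pct(ham_hits, totals[False]),
        )
    return report


# Invented toy data: two hand-classified spam and two ham messages.
toy = [
    (True, "Yes, hits=7.1 required=5.0 tests=RCVD_IN_DSBL,\n\tRCVD_IN_SORBS autolearn=no version=2.60"),
    (True, "Yes, hits=5.2 required=5.0 tests=RCVD_IN_DSBL autolearn=no version=2.60"),
    (False, "No, hits=0.1 required=5.0 tests=RCVD_IN_SORBS autolearn=no version=2.60"),
    (False, "No, hits=0.0 required=5.0 tests=none autolearn=ham version=2.60"),
]

report = tally(toy)
for rule, (overall, spam, ham) in sorted(report.items()):
    print(f"{overall:8.3f} {spam:9.4f} {ham:9.4f}  {rule}")
```

The important precondition is that listed mail is still delivered and saved; a site that rejects on a DNSBL hit never sees the messages it would need to count.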


3 Comments

  1. Posted November 5, 2006 at 13:36

    Hi Justin,

    How do you extract these statistics? From SA logfiles or from your mailbox? Do you have a script for this you would like to share with the world? If so, I’m very interested…

    I work for a small ISP in the Netherlands, serving 10,000+ mailboxes. We are looking for ways to maximize SA’s efficiency… Some stats on our situation concerning RBL hits would help a lot!

    Regards, Bas Janssen, Amsterdam, the Netherlands

  2. Posted November 6, 2006 at 17:05

    Bas — those are results from SA’s “mass-check” tool.

    As SA receives mails, it records the rules hit in the X-Spam-Status header; we “mass-checkers” then move the mails into “ham” or “spam” folders. Months later, when we run “mass-check” on those folders, it knows it can reuse the lookup results to get an idea of the accuracy of those network rules.

    See the SA wiki, esp http://wiki.apache.org/spamassassin/MassCheck , for more details.

    Note that one key potential issue for you guys would be that you have to capture the mails even if they hit an RBL. SA does this, but most large-scale sites using DNS blocklists cannot afford to do so…

  3. Posted November 6, 2006 at 17:06

    Oh, also: these are pretty old. There are newer mass-check results on the SA wiki; search for “DNSBL accuracy”…