Test my auto-generated ruleset
(I posted this to the SA users and dev lists, too.)
I’ve been working on a new way to auto-generate body rules recently (see previous posts). The results are checked into SVN trunk daily in the “rulesrc/sandbox/jm/20_sought.cf” file.
We haven’t had much time to figure out how to produce auto-generated 3.2.x rule updates for our entire ruleset at updates.SpamAssassin.org, so instead of dealing with that, I’ve taken a shortcut around it ;) I’m now making just the “20_sought.cf” ruleset available as a standalone, unofficial sa-update ruleset at sought.rules.yerp.org.
Before using it, you’ll need the GPG key:
wget http://yerp.org/rules/GPG.KEY
sudo sa-update --import GPG.KEY
then use this to update:
sudo sa-update \
--gpgkey 6C6191E3 --channel sought.rules.yerp.org \
[...other channels...] \
--channel updates.spamassassin.org
(similar to how you’d use Daryl’s sa-update version of the SARE rulesets.)
Feel free to run sa-update as frequently as you like.
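If you run this from cron, a small wrapper can tell you whether anything actually changed. The following is only a sketch: the function names are made up for illustration, and it assumes the exit statuses documented in the sa-update man page (0 when updates were installed, 1 when there was nothing new, anything else treated here as a problem).

```shell
#!/bin/sh
# Hypothetical cron wrapper; sa_update_status and run_sought_update are
# illustrative names, not part of SpamAssassin.

# Map an sa-update exit status to a short label.
# 0 = updates downloaded and installed, 1 = no fresh updates,
# everything else is treated as an error in this sketch.
sa_update_status() {
    case "$1" in
        0) echo "updated" ;;
        1) echo "no-updates" ;;
        *) echo "error" ;;
    esac
}

# Run the update for the sought channel plus the core channel,
# then print the label for whatever status sa-update returned.
run_sought_update() {
    sa-update --gpgkey 6C6191E3 --channel sought.rules.yerp.org \
              --channel updates.spamassassin.org
    sa_update_status "$?"
}
```

A cron script could call run_sought_update and restart spamd only when it prints "updated"; how you restart spamd depends on your system, so that part is left out.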
Please consider it alpha; I may take it down in a few months depending on how it goes, or if we can get it working as part of the core updates. In the meantime though, I’m curious to hear how you get on with it. (In particular, copies of false positives would be very welcome.)
Update: it’s been very successful, so I’d now consider it in production.
Tags: rules, rulesets, sa-update, spamassassin

James said,
August 21, 2007 @ 10:59 am
Hi Justin,
Thanks for all the work you do on SpamAssassin!
I’ve been very interested in reading this approach to generating rules.
I’ve been testing these rules on a fairly small company server (not using sa-update: I’ve been downloading them from SVN and checking them myself before putting them into production). I’ve been getting SA scores on OEM spam into the 150 range, with no noticed FPs.
One thing I have noticed: JM_SEEK_UTAGQU is nearly identical to a local rule I added about a year back. (My rule only fires if the text is in the subject, as it always is in my experience). It’s a very good spam sign (and very unlikely to FP), so between your rule and mine, an e-mail can get 5.5 points just by having that one spam sign. In this case, it isn’t a problem.
It’s just that daily automatically-updated rules are likely to clash with other rules, so an e-mail may get scored twice for the same spam sign. (Revision 567659 has two copies of JM_SEEK_7HOHFW — since the name is identical, SA knows what to do, of course).
Given the potential complexity of Perl regexps, I’m not sure what can be done about this.
Justin said,
August 21, 2007 @ 1:19 pm
Yes, I noticed the duplicate rule — I need to look into that. Thanks for the feedback!
Slava said,
February 25, 2008 @ 5:23 am
Hello Justin,
Will this rule (sought.cf) work with SA v.3.1.9? Thank you, Slava.
Justin said,
February 25, 2008 @ 11:01 am
hi Slava –
it should. It would be more efficient on 3.2.x, since you can use “sa-compile” to compile the ruleset there, but should work fine on 3.1.x as well.
Bubuk said,
February 27, 2008 @ 2:48 am
Hi, Excuse my nubiness in all these. I have my root cron like so:
41 * * * * /usr/local/bin/freshclam --quiet
00 * * * * /usr/bin/sa-update --gpgkey D1C035168C1EBC08464946DA258CDB3ABDE9DC10 --channel saupdates.openprotect.com --channel updates.spamassassin.org --allowplugins
and I would like to add sought.cf into my rulesets.
1) So I login as root. And I ran
wget http://yerp.org/rules/GPG.KEY
sa-update --import GPG.KEY
2) Now how do I incorporate these lines (from your instruction above)
sa-update \
--gpgkey 6C6191E3 --channel sought.rules.yerp.org \
[...other channels...] \
--channel updates.spamassassin.org
into my root cron? Would this do the job:
41 * * * * /usr/local/bin/freshclam --quiet
00 * * * * /usr/bin/sa-update --gpgkey D1C035168C1EBC08464946DA258CDB3ABDE9DC10 --channel saupdates.openprotect.com --gpgkey 6C6191E3 --channel sought.rules.yerp.org --channel updates.spamassassin.org --allowplugins
Please help. TIA.
Sujit Choudhury said,
February 29, 2008 @ 12:26 pm
Hi Justin, I am running SA 3.1.7 with SARE rules as well as the original SA rulesets. I also use sagrey.cf, botnet.cf and imageinfo.cf. Will adding your ruleset produce more false positives?
Justin said,
February 29, 2008 @ 12:30 pm
Sujit: I would not expect it to do so.
John said,
May 20, 2008 @ 5:29 am
Just seeing if I understand correctly: 20_sought.cf will list all current automatically generated rules in the file when updated, and uses your own spam and ham lists to generate them, not ones from the end user… amirite?
John said,
May 20, 2008 @ 6:46 am
well still kinda vague, i mean the spam and ham lists to generate are yours (justin) not the local administrator
Justin said,
May 20, 2008 @ 12:29 pm
hi John — yep, that’s the case.
Robert LeBlanc said,
May 29, 2008 @ 4:48 am
Justin, is there any chance you could add some ‘describe’ lines for the meta rules in 20_sought.cf? JM_SOUGHT_[1-3] doesn’t add much to the readability of the spam report, but even a generic description would be useful, particularly for users of web GUIs (e.g. Maia Mailguard) that try to present a detailed report to the user.
John Hardin said,
June 6, 2008 @ 2:04 pm
Justin:
Can you include a generation timestamp in a comment at the top of the file?
Thanks!
Justin said,
June 9, 2008 @ 4:19 pm
hi John –
unfortunately I’d prefer not to, as it makes a lot of noise in the SVN changes mails… I’m trying to minimize the noise in those.
John Hardin said,
June 15, 2008 @ 5:24 pm
Justin, do you know of anyone using your tools to produce rulesets from corpora focused on a specific type of spam (e.g. 419 and lottery fraud)?
For example, I’m using the SARE fraud ruleset but certain recent variants (e.g. the ATM Card ones) aren’t hitting at all. I suggested to them that using your tools against a fraud-specific corpus might be a good idea to keep the rules current.
I poked around in your latest ruleset and didn’t see much that looked fraud-related. Do you have much fraud spam in your spamtrap corpus?
Justin said,
June 16, 2008 @ 10:10 am
John — no, I don’t. I agree though it’d be a natural.
‘I poked around in your latest ruleset and didn’t see much that looked fraud-related. Do you have much fraud spam in your spamtrap corpus?’
Unfortunately, relative rates of different types of spam differ between different accounts — you can see this on our ruleQA site, http://ruleqa.spamassassin.org/ . if you click on a rule’s “display hits over time” graph, you can see that different mass-check contributors have entirely different hitrates on spam for various rules! This means that what makes up 10% of one contributor’s corpus might make up only 0.3% of another’s. it’s very odd. I think it’s a side effect of where spammers find addresses.
I’ve been trying to find a way to compensate for that, for a long time. it’s tricky. :(
I would love to see multiple “sought” rulesets, run against different corpora. I’ve responded to your mail on the SA users list….
Patrick said,
July 3, 2008 @ 4:23 pm
sa-compile did not work, so I checked JM_SOUGHT_1, JM_SOUGHT_2 and JM_SOUGHT_3 one by one. The issue was caused by JM_SOUGHT_3, the other meta-tests work fine. I verified on two platforms with different releases of gcc and spamassassin.
Justin said,
July 3, 2008 @ 6:03 pm
@Patrick: please check the SA FAQ — this problem is frequently caused by using the wrong version of re2c. If that’s not it I suggest mailing the SA users mailing list…
coreyva said,
July 9, 2008 @ 1:05 am
I use sa-update with the SARE rules and now your sought.cf. I figured my config may be helpful to some, so I’ll post it here.
crontab
11 2 * * * /usr/bin/sa-update --gpgkeyfile /etc/spamassassin/sare-sa-update-gpgkeys.txt --channelfile /etc/spamassassin/sare-sa-update-channels.txt
sare-sa-update-gpgkeys.txt contains the gpg keys needed
856AA88A
6C6191E3
sare-sa-update-channels.txt contains the channels to update.
updates.spamassassin.org
sought.rules.yerp.org
72_sare_redirect_post3.0.0.cf.sare.sa-update.dostech.net
70_sare_evilnum0.cf.sare.sa-update.dostech.net
70_sare_bayes_poison_nxm.cf.sare.sa-update.dostech.net
70_sare_html0.cf.sare.sa-update.dostech.net
70_sare_html_eng.cf.sare.sa-update.dostech.net
70_sare_header0.cf.sare.sa-update.dostech.net
70_sare_header_eng.cf.sare.sa-update.dostech.net
70_sare_specific.cf.sare.sa-update.dostech.net
70_sare_adult.cf.sare.sa-update.dostech.net
72_sare_bml_post25x.cf.sare.sa-update.dostech.net
99_sare_fraud_post25x.cf.sare.sa-update.dostech.net
70_sare_spoof.cf.sare.sa-update.dostech.net
70_sare_random.cf.sare.sa-update.dostech.net
70_sare_oem.cf.sare.sa-update.dostech.net
70_sare_genlsubj0.cf.sare.sa-update.dostech.net
70_sare_genlsubj_eng.cf.sare.sa-update.dostech.net
70_sare_unsub.cf.sare.sa-update.dostech.net
70_sare_uri0.cf.sare.sa-update.dostech.net
70_sare_obfu0.cf.sare.sa-update.dostech.net
70_sare_stocks.cf.sare.sa-update.dostech.net
70_sare_whitelist.cf.sare.sa-update.dostech.net
70_sare_evilnum1.cf.sare.sa-update.dostech.net
70_sare_evilnum2.cf.sare.sa-update.dostech.net
70_sare_whitelist_spf.cf.sare.sa-update.dostech.net
70_sare_whitelist_rcvd.cf.sare.sa-update.dostech.net
70_sare_header1.cf.sare.sa-update.dostech.net
70_sare_header2.cf.sare.sa-update.dostech.net
70_sare_header_eng.cf.sare.sa-update.dostech.net
70_sare_uri_eng.cf.sare.sa-update.dostech.net
70_sare_uri1.cf.sare.sa-update.dostech.net
70_sare_uri2.cf.sare.sa-update.dostech.net
70_sare_obfu4.cf.sare.sa-update.dostech.net
70_sare_obfu3.cf.sare.sa-update.dostech.net
70_sare_obfu2.cf.sare.sa-update.dostech.net
70_sare_obfu1.cf.sare.sa-update.dostech.net
70_sare_genlsubj2.cf.sare.sa-update.dostech.net
70_sare_genlsubj1.cf.sare.sa-update.dostech.net
70_sare_genlsubj_eng.cf.sare.sa-update.dostech.net
70_sare_html2.cf.sare.sa-update.dostech.net
70_sare_html_eng.cf.sare.sa-update.dostech.net
chickenpox.cf.sare.sa-update.dostech.net
This way you can add channels and keys easier. Hope that’s helpful to some.
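The channel list above names a few channels more than once (70_sare_header_eng, 70_sare_genlsubj_eng and 70_sare_html_eng, for example). A channel file like this can be tidied with a short filter. This is only a sketch: dedupe_channels is a made-up helper name, and it assumes the cleaned file should hold one channel name per line.

```shell
#!/bin/sh
# Hypothetical helper: print each channel name from the file given as $1
# once, one per line, keeping the order of first appearance.
dedupe_channels() {
    # Turn spaces and tabs into newlines so a space-separated list is
    # handled too, then let awk drop blank lines and repeated names.
    tr -s ' \t' '\n\n' < "$1" | awk 'NF && !seen[$0]++'
}
```

For example, `dedupe_channels sare-sa-update-channels.txt > channels.new` writes the cleaned list to a new file, which you can check before replacing the original.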
Patrick said,
July 11, 2008 @ 3:15 pm
SA FAQ doesn’t mention re2c, but when searching the wiki I found SaCompileRefSymbolError which suggests to update to version 0.12.0 or later. Therefore I’ve been using 0.12.2 for ages, but your rules didn’t work on July 3, 2008 anyway. However, I tested today’s rules which work fine. Since nobody else reported problems, I just don’t bother.
Jens Schleusener said,
July 13, 2008 @ 4:50 pm
The mechanism behind 20_sought.cf seems to be a successful idea, since it helps me detect some otherwise unmarked spam.
Until July 8th the file 20_sought.cf was updated roughly 2-4 times a day, but the last update now seems to have been made on that day at 4 pm UTC. Is there a reason for that, or is it a problem with my update script?
Justin said,
July 14, 2008 @ 8:18 pm
@Jens — that should be fixed now.
Jens Schleusener said,
July 14, 2008 @ 9:21 pm
Many thanks, it works again! Extract of my script’s log:
=== Tue Jul 8 08:05:07 2008: 69459 Jul 8 08:05 20_sought.cf
=== Tue Jul 8 18:05:08 2008: 80330 Jul 8 18:05 20_sought.cf
=== Mon Jul 14 19:05:09 2008: 80666 Jul 14 19:05 20_sought.cf
=== Mon Jul 14 22:05:09 2008: 82708 Jul 14 22:05 20_sought.cf
Jens Schleusener said,
August 6, 2008 @ 4:04 pm
The last update of “20_sought.cf” now seems to have been made on August 3rd at 1 am UTC (roughly three days ago). Again a small problem or just a delay?
Justin said,
August 6, 2008 @ 8:22 pm
Jens, I’m in the middle of a server move. it should be back again shortly.
Jens Schleusener said,
August 6, 2008 @ 8:58 pm
Ok, no hurry. Good things are worth waiting for …
Rich Wales said,
November 16, 2008 @ 2:03 am
These rules appear to be broken, as of version 320713383 (12 Nov. 2008). Starting with that version, my sa-compile has been failing in the middle of one of the re2c’s.
I tried paring down the affected scanner###.re file and running re2c by hand, and it looks like the offender is __SEEK_CCNRJL, which was introduced in version 320713383. There might be other problems after this point, but hopefully this can serve as a starting point for someone to be able to find and fix the problem(s).
In the meantime, I’m going to try reinstalling the previous version (320713360), and disabling sa-update until I hear that things are working properly again.
Sebastian said,
November 17, 2008 @ 1:16 pm
I can confirm this problem, it’s not working for me either. I removed the compiled rules rather than disabling sa-update.
Rich Wales said,
November 19, 2008 @ 12:42 am
The problem I reported the other day went away after I upgraded my SpamAssassin from version 3.2.3 to 3.2.5.
I am once again successfully using Justin’s rules (currently version 320718672).
Justin said,
November 19, 2008 @ 11:36 am
Sorry — I should have mentioned that there are indeed some fixes to sa-compile in 3.2.4 and 3.2.5; if you have problems, an upgrade of that is strongly recommended.
Mark said,
April 17, 2009 @ 7:50 pm
These rules put a heavy load on my busy servers. The fraud ruleset is over 300k, which Spamassassin says is too large to run. Can this be fixed somehow? If not, just consider this a warning to others who are seeing performance problems.
Marcelo said,
May 7, 2009 @ 9:54 am
Hello Justin,
For many reasons I don’t use sa-update, but I would still like to use your rules. Is it possible for me to fetch them with wget once a week? I am trying to work out the URL to wget :))
Can you give me it ?
Thanks and Best Regards, Marcelo
Justin Mason said,
May 7, 2009 @ 10:34 pm
Marcelo, you’ll have to do a few steps to determine the URL, to emulate sa-update.
First, “dig mirrors.sought.rules.yerp.org. TXT” to get the mirrors file’s URL — http://yerp.org/rules/MIRRORED.BY. curl that, and you’ll see that the base URL for updates is http://yerp.org/rules/stage/ . next, “dig 0.2.3.sought.rules.yerp.org txt” to get the current version: ‘320772726’. put that together with the base URL, add .tar.gz , and that’s the tarball: http://yerp.org/rules/stage/320772726.tar.gz .
it’d be easier to use sa-update. ;)
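For anyone who does want to script it, the steps above can be sketched roughly as below. The function names are made up for illustration. The label “0.2.3” is read here as the SpamAssassin version 3.2.0 written backwards, which is an assumption based on the lookup shown above. Unlike sa-update, this sketch does no GPG verification of what it downloads.

```shell
#!/bin/sh
# Hypothetical helpers emulating the lookups described above.
# None of these function names come from SpamAssassin itself.

# Reverse a dotted version number, e.g. 3.2.0 -> 0.2.3
reverse_version() {
    echo "$1" | awk -F. '{
        out = $NF
        for (i = NF - 1; i >= 1; i--) out = out "." $i
        print out
    }'
}

# Fetch a TXT record and strip the quotes dig prints around it.
# Needs network access and the dig tool.
txt_record() {
    dig +short "$1" TXT | tr -d '"'
}

# Join the base URL ($1) and the version ($2) into the tarball URL.
tarball_url() {
    printf '%s/%s.tar.gz\n' "${1%/}" "$2"
}

# Usage (needs network access; not run automatically):
#   channel=sought.rules.yerp.org
#   curl -s "$(txt_record "mirrors.$channel")"   # shows the base URL
#   version=$(txt_record "$(reverse_version 3.2.0).$channel")
#   wget "$(tarball_url http://yerp.org/rules/stage/ "$version")"
```

The base URL is passed in by hand here rather than parsed out of the mirrors file, since that file may list more than one mirror.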
Nick Urbanik said,
May 11, 2009 @ 4:52 am
Dear Justin,
Thanks for this. Are there plans to incorporate your rules into SA 3.2.+? I would like to be aware so that if/when that happens, I can stop incorporating these into my SA setup to avoid double-scoring particular email attributes.
Justin said,
May 11, 2009 @ 9:48 am
hi Nick — currently, no. it’ll stay as an optional add-on ruleset, separately sa-update’d.
Stan said,
June 8, 2009 @ 11:24 am
Dear Justin
I would like to create an unofficial channel like yours.
I create my tar.gz package:
cd /home/rules/updates
ls -lrt *.cf
tar -cf 2.tar *.cf
gzip -9v 2.tar
sha1sum 2.tar.gz
gpg -bas 2.tar.gz
I export my public key:
gpg --export -a key > /home/rules/PUB.key
I create a MIRRORED.BY file in /home/rules/ containing the URL http://myserver.domain.com/rules/updates/
After that, how do I configure DNS so that I can use, for example, --channel updates.myserver.domain.com with sa-update?
regards
SpamZombie said,
August 5, 2009 @ 5:26 pm
Heya, Justin. Are these rules still being updated every 4 hours? The timestamps on the files sometimes seem to indicate they’re not.
Also, I did read through the comments (and some of the SA wiki on the SoughtRules, which says they’re updated every 4 hours) but didn’t see if you said how often it was okay to check for updates? I’m currently connecting 4 times in 24 hours, but would like to get them more often if possible and don’t want to overload the server…
Thanks!
TonyMaro said,
January 27, 2010 @ 4:24 pm
It appears that this list is no longer available…
Justin said,
January 27, 2010 @ 5:23 pm
what gives you that idea?
SpamZombie said,
January 27, 2010 @ 6:37 pm
We were getting 404’s for several hours trying to download the rules yesterday via sa-update. Maybe that’s what Tony means? I thought maybe you’d discontinued ‘em, but didn’t see a post here so figured it was a glitch.
Seems to be working fine today.
Paul Fisher said,
March 21, 2010 @ 1:33 am
Now that SpamAssassin 3.3.1 is out, is “sought” going to be updated to support the sa-update in 3.3.1? DNS for yerp.org needs to be updated.
dbg: dns: query failed: 1.3.3.sought.rules.yerp.org => NXDOMAIN
channel: no updates available, skipping channel
Thanks.
Justin said,
March 21, 2010 @ 11:21 pm
Paul, thanks for noticing that. fixed.
Steve Rawlinson said,
April 22, 2010 @ 8:50 pm
Sorry to bother you with what is probably a dumb question, but following your instructions for using sa-update to get these rules, everything works fine until:
sa-update --gpgkey 6C6191E3 --channel sought.rules.yerp.org
http: request failed: 500 Can’t connect to yerp.org:8080 (connect: timeout): 500 Can’t connect to yerp.org:8080 (connect: timeout)
Am I doing something stupid?
Justin Mason said,
April 22, 2010 @ 11:09 pm
hmm. it sounds like port 8080 isn’t going to work at your site; I recently moved the hosting to a server running on that port, but firewall issues may mean I need to reconsider that.
Steve Rawlinson said,
April 23, 2010 @ 12:47 pm
You’re right it was a firewall at my end. Fixed now, many thanks.
D. Stussy said,
August 16, 2010 @ 9:03 am
“Feel free to run sa-update as frequently as you like.”
Although you do say this, you (and every other SA channel provider) should indicate a suggested update frequency. Otherwise, one may end up with some a$$ who decides a cron job to check every 5 minutes is adequate (but clearly inappropriate for a service that updates once per hour or per day). Therefore, the frequency at which YOU update the rules you offer should be disclosed (and be the minimum interval that everyone uses to check for updates).
marcin said,
January 14, 2011 @ 2:46 pm
I can’t use these rules. I’m getting:
GET http://rules.yerp.org.s3.amazonaws.com/rules/stage/3301058947.tar.gz request failed, retrying: 500 Can’t connect to rules.yerp.org.s3.amazonaws.com:80 (Bad hostname ‘rules.yerp.org.s3.amazonaws.com’): 500 Can’t connect to rules.yerp.org.s3.amazonaws.com:80 (Bad hostname ‘rules.yerp.org.s3.amazonaws.com’)
Is this channel still active?
Justin Mason said,
January 14, 2011 @ 4:08 pm
hey Marcin — that should work ok:
Marcin said,
January 15, 2011 @ 4:32 pm
Thanks Justin, now everything works. I’ve downloaded the rules successfully. Regards.