Bayesian filtering

Discussions on webmail and the Professional version.
abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Mon May 08, 2006 9:55 pm

A couple of questions. First, why aren't there pre-fab spam dictionaries, made by other people, that everyone can use generically? In any case, I'm saving up spam emails in Outlook and should soon have 1000. Once I do, I suppose I can use some tool to make a dictionary from them?

Second, is the "ham" dictionary made up of emails sent TO a user or emails sent FROM a user? For example, I have thousands of emails in my Sent folder that would offer a wide variety for a dictionary. Could these Outlook emails be used?

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Wed May 10, 2006 7:38 am

Hello?

ECHO
e c h o
echo
eko

winter
Posts: 26
Joined: Sun Apr 03, 2005 5:46 pm

Post by winter » Tue May 16, 2006 2:59 pm

I would also be interested in a copy of an effective dictionary that I could copy straight to my server, so I don't have to spend the time and work building my own. Is it possible to use someone else's dictionary? Are there any available? How about a small PayPal donation in return for your effort in building and sharing one?

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Tue May 23, 2006 12:36 am

Apparently the reason you can't have a pre-fab dictionary is that every mail server receives different spam. So, a dictionary for my mail server might be different than the dictionary for yours.

I've saved up 1300 spam messages to create a dictionary from, but I'm still not sure how to take my saved MS Outlook messages and make a dictionary from them.

MailEnable support has mentioned the "IMAP" service, but I have no idea what that is or how to make it do what I want in regard to creating a dictionary.

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Sat May 27, 2006 9:51 am

Just to update anyone who's interested -

IMAP is pretty simple. It's already enabled by default in ME, so all you have to do is set the account up in Outlook. To do this, just create a new account and select IMAP instead of POP. Your IMAP server is the same as your POP server.

Once the IMAP account is set up, create two folders in MS Outlook called "spam" and "ham". Drag and drop all your spam messages into the spam folder and your ham messages (valid messages sent FROM your email address) into the ham folder. This causes the messages to be uploaded to the mail server.

Once this is done, just follow these instructions to make the dictionary out of your MAI files:

http://www.mailenable.com/kb/Content/Ar ... D=me020346

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

success!

Post by abbynormal1 » Mon May 29, 2006 9:01 am

Finally finished setting up mailenable spam filtering. Now, when I get spam (down to 2-3 per day from 50-100 per day) I just drop it in the IMAP spam folder, and the Bayesian auto-learning kicks in so I won't get that particular spam any more. I have my users forward their spam to that mailbox.

I also have set up *@mydomains.com to auto-learn ham messages.

The last step I had to figure out was setting up a global filter: enable the option to filter based on spam percentage, then add an action so that users are notified when a message is filtered and can make sure good mail isn't being blocked.

I may turn off notification after I'm pretty sure ham isn't being caught in the spam filter.

Thanks for all the support getting this to work MailEnable!

chris.velo
Posts: 16
Joined: Tue Sep 13, 2005 3:48 pm
Location: USA

last step how? - the global filter

Post by chris.velo » Thu Jun 08, 2006 6:57 pm

So how did you do the last step?

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Thu Jun 08, 2006 7:37 pm

Well, first, I disabled "auto-training" because it doesn't work right. Instead I scheduled a batch file to run every night that does the following:

rem Copy new spam samples into the Spam dictionary folder, then clear the spam inbox
xcopy C:\Progra~2\MailEn~1\Postoffices\mydomain.com\MAILROOT\spammailbox\Inbox\*.mai C:\Progra~2\MailEn~1\Dictionaries\Spam\
del C:\Progra~2\MailEn~1\Postoffices\mydomain.com\MAILROOT\spammailbox\Inbox\*.mai
rem Move collected ham samples into the NoSpam dictionary folder
move C:\Progra~2\MailEn~1\Postoffices\mydomain.com\MAILROOT\ham\inbox\*.mai C:\Progra~2\MailEn~1\Dictionaries\NoSpam\
rem Stop the memtas service, merge the new samples into the dictionary (-m), then restart the service
net stop memtas
mespamcmd -m "C:\Progra~2\MailEn~1\Dictionaries\MailEn~1.TAB" "C:\Program Files (x86)\Mail Enable\Dictionaries\Spam" "C:\Program Files (x86)\Mail Enable\Dictionaries\NoSpam"
net start memtas

Basically, the preceding batch file copies all the messages from the spam IMAP inbox (where I drop all spam mail that still gets through) into the spam dictionary folder. It then deletes all the messages in the spam IMAP inbox so that they aren't used again in the future.

I have a filter set up that takes all email coming from *@mydomains.com and forwards a copy to "ham@mydomain.com". That way, I have lots of ham mail to sample from. The next part of the batch file takes all the messages in the ham inbox and moves them to the "NoSpam" dictionary folder.

Finally, you can see where I stop memtas, run the mespamcmd -m (-m is merge), and then start memtas again. This is scheduled to run nightly, so the dictionaries are updated every night.
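For anyone who would rather script this outside of a batch file, the collection step described above can be sketched in Python. This is only a minimal illustration, not what was actually run here: the function name collect_samples is hypothetical, and it covers only moving the .mai files. Stopping memtas and running mespamcmd -m would still be done as in the batch file.

```python
import shutil
from pathlib import Path


def collect_samples(inbox: Path, dictionary_dir: Path) -> int:
    """Move every .mai file from an inbox folder into a dictionary folder.

    Returns the number of messages moved. Moving (rather than copying)
    means a message cannot be fed to the dictionary a second time.
    """
    dictionary_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for message in sorted(inbox.glob("*.mai")):
        shutil.move(str(message), str(dictionary_dir / message.name))
        moved += 1
    return moved
```

Calling it once for the spam inbox and once for the ham inbox would mirror the xcopy/del and move lines of the batch file.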


As for how to create the global filter, go to:
MailEnable Management Console > Messaging Manager > Filters > New > New Filter > Name Filter > Set criteria "Where the message has a certain Spam Probability" to enabled (I set it to 95%). If you want, you can add an action so that the recipient receives a notice that a spam message was filtered (in case the message was actually ham, in which case the user would want to know it was filtered).

This is also where you create the filter to collect ham messages - create a filter called "Ham Collection" and enable criteria "Where the From header line contains specific words". You can then define *@yourdomain.com as the words it looks for and set an action to forward message to specified recipient. The specified recipient of course is ham@yourdomain.com. This is where your batch file will pull ham messages from for the dictionary.
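To make the idea behind the "Ham Collection" criteria concrete, here is a rough Python sketch of a wildcard match against the From header. How MailEnable evaluates the criteria internally is not documented in this thread, so this is an assumption for illustration only; the function name is_local_sender is hypothetical.

```python
from email.utils import parseaddr
from fnmatch import fnmatch


def is_local_sender(from_header: str, pattern: str = "*@yourdomain.com") -> bool:
    """Return True when the address in a From header matches the wildcard pattern."""
    # parseaddr splits 'Alice <alice@yourdomain.com>' into a name and an address.
    _, address = parseaddr(from_header)
    # Lowercase both sides so the comparison ignores case.
    return fnmatch(address.lower(), pattern.lower())
```

Messages for which this returns True are the ones that would be forwarded to the ham mailbox.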

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Effectiveness

Post by abbynormal1 » Mon Jun 12, 2006 5:40 am

After having Bayesian up and running as it is supposed to for about 4 days, as of today it is filtering 55.2% of spam. This percentage will (hopefully) increase as the dictionary grows over time.

It is currently 824 KB with 17707 spam tokens and 13318 ham tokens.

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Bayesian filter now 7 days old

Post by abbynormal1 » Fri Jun 16, 2006 12:43 am

Now, after 1 full week of Bayesian running, we're still at about 58.2% of our spam being filtered with 41.8% still getting through. I am hoping for it to improve to 90%-99%.
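The percentage itself is just the filtered count divided by all spam received. As a small sketch in Python (the function name and the counts in the usage line are hypothetical, chosen only to show the arithmetic):

```python
def filter_rate(filtered: int, missed: int) -> float:
    """Percentage of spam that was caught, given caught and missed counts."""
    total = filtered + missed
    if total == 0:
        raise ValueError("no spam counted yet")
    return 100.0 * filtered / total


# Hypothetical counts: 582 caught and 418 missed out of 1000 spam messages.
rate = filter_rate(582, 418)
```

With those counts the rate works out to 58.2%, leaving 41.8% getting through.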

Our dictionary now has 30,226 spam tokens to 27,401 ham tokens. The size of the dictionary is 953KB.

lunix
Posts: 60
Joined: Wed Feb 09, 2005 4:26 pm

Post by lunix » Tue Jun 20, 2006 8:55 pm

So man - you are the guy.
Why don't you share your dictionary with us?

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

share dictionary

Post by abbynormal1 » Wed Jun 21, 2006 12:12 am

Well, when I began this process, I too wondered why there weren't large dictionaries for everyone to use. Now I understand why: spam and ham are different for every mail server. If you were to use my dictionaries, you would likely have ham being filtered as spam and vice versa. So, the only way for Bayesian to work really well for you is to set up the auto-learning yourself.

That's why I have updated this thread so often, so people won't have as hard a time as I did. I found the MailEnable documentation to be very insufficient. However, MailEnable support was great in helping me get this working!
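To see why a dictionary doesn't transfer well, here is a rough Python sketch of how a Bayesian filter might score a single token. The formula is the common textbook one, not necessarily what MailEnable uses, and the function name and all the counts are hypothetical.

```python
def token_spam_probability(spam_count: int, ham_count: int,
                           spam_total: int, ham_total: int) -> float:
    """Estimate P(spam | token) from how often the token appears in each corpus.

    spam_total and ham_total are the corpus sizes and must be positive.
    """
    spam_rate = spam_count / spam_total
    ham_rate = ham_count / ham_total
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts for the token "mortgage" on two different servers.
general_office = token_spam_probability(spam_count=90, ham_count=1,
                                        spam_total=1000, ham_total=1000)
mortgage_broker = token_spam_probability(spam_count=90, ham_count=400,
                                         spam_total=1000, ham_total=1000)
```

On the first server the token looks almost certainly like spam. On the second, where legitimate mail uses the word all the time, the same token points toward ham, so a dictionary trained on one server would misjudge mail on the other.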

abbynormal1
Posts: 33
Joined: Mon May 08, 2006 7:54 am

Post by abbynormal1 » Wed Jun 28, 2006 6:16 am

So, I am now at 64% of spam being filtered. It's slowly getting better, but not quite as fast as I had hoped. I'm currently adding about 175 spam emails per day to the dictionary.

Dictionary Size: 1.146 megs
Tokens: Spam- 76970 Ham- 104859

keithc
Posts: 23
Joined: Tue Jun 20, 2006 3:05 pm

Post by keithc » Wed Jun 28, 2006 1:44 pm

How in the world do you have so much ham? I've been collecting for a week and my balance looks like this:

spam : 700, ham : 190

This is on a small alternate server. My primary is a monster with several thousand active mailboxes. The alternate is easy to collect samples from, as all the accounts are IT types, while my primary is mostly very non-technical users. I know I'm going to have problems getting them to help train.

So my question really is: how are you keeping your ham balance up with the spam one?

LumTech
Posts: 6
Joined: Wed Jun 28, 2006 4:16 am

Post by LumTech » Wed Jun 28, 2006 1:48 pm

abbynormal1,

Just wanted to say thanks. Been following this topic and finally got to set up some of it last night. Seems to be working decently well. The only alteration I made was for the ham address. Not wanting a valid ham@mydomain.com address that a spammer could actually send e-mail to, I used a domain I have set up called "home". I'm not sure if it's a default or if Plesk created it (I just know I didn't). Anyway, nobody outside the box running ME can see that domain, since it isn't in DNS anywhere. I set up the ham@home address and forward to that, and it works great so far.

How are you getting your statistics of blocked messages to know what percent you are at?

David
