Simple email analysis
July 27, 2008 – 4:45 pm
I was asked a simple question this week about the composition of email: basically, what percentage of email is spam? I have seen stats on this several times before, but I decided to look into it myself.
I decided to run some simple tests and produce some stats on a portion of the infrastructure that I use to run the emailcloud platform. During the filtration process the emailcloud platform will scan an email twice using two different server clusters. The first cluster is rather crude and will remove the vast majority of spam emails. I analysed this cluster first. The graph of the first cluster is here:
Firstly, the yellow line is the number of SMTP connections per five-minute interval, which peaked at around 2500 at about midday. The black line is the number of emails that were allowed to pass through this mail cluster. The various lines in the middle show which test stopped the rejected emails. This shows that the first cluster filtered out 95% of all email!
I was surprised by that 95% figure, so I went on to analyse the second cluster. This cluster performs a very resource-intensive scanning process which mathematically analyses the content and composition of each email. The graph is as follows:
In this cluster I only analysed the total email traffic and the “clean” email. You will notice that it peaks at 280 connections per five-minute interval at around midday. I was really surprised that this cluster blocked 47% of the email that it processed. This basic analysis (on traffic that passed through these particular clusters on a Saturday) shows that around 97.5% of email is spam. Rather high, don’t you think?
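For anyone who wants to check the arithmetic, here is a rough sketch of how the two block rates combine. It assumes the second cluster only sees the mail that the first cluster let through, and that everything blocked really is spam; the function and variable names are just mine for illustration. With the rounded 95% and 47% figures it comes out at roughly 97.4%, which is in line with the figure above.

```python
def combined_block_rate(stage_rates):
    """Overall fraction of mail blocked by a chain of filters.

    Assumes each stage only processes what the previous stages passed.
    """
    passed = 1.0
    for rate in stage_rates:
        passed *= (1.0 - rate)  # fraction still alive after this stage
    return 1.0 - passed


# Rounded figures quoted in the post (assumed block rates per cluster).
first_cluster = 0.95
second_cluster = 0.47

overall = combined_block_rate([first_cluster, second_cluster])
print(f"Overall blocked: {overall:.2%}")
```

Put another way, only about 2.65% of the mail that hit the first cluster made it through both stages as clean.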