Spam mail (also known as junk mail) is a type of electronic spam in which unsolicited messages are sent by email. Most spam is sent for commercial purposes, but it may also carry malicious content, for example a message that imitates a popular website as part of a phishing attack. Malicious content may include malware, scripts, or executable file attachments. When a user recognizes a spam mail, they can easily add its source to a blacklist, but some emails are crafted professionally and most of the time cannot easily be recognized as spam by ordinary users. For this reason, email service providers use spam filters developed with machine learning techniques. One of the best-known algorithms for spam detection is the Naive Bayes algorithm, which is based on a statistical approach. In this section, we will explain how the Naive Bayes algorithm works.
The spam filtering problem can be solved using supervised learning approaches, and Naive Bayes is one of the best-known supervised algorithms. As we explained before, every machine learning algorithm has two phases: training and testing. Because the problem is supervised, the Naive Bayes algorithm uses a dataset of labeled samples.
Basically, the Naive Bayes algorithm uses the frequency of words in the email text. The training dataset holds, for every sample, the words, the count of each word, and the class label. A basic example of such a dataset is given below. Every row represents a single email.
[id1, ham, word1, word1_Count, word2, word2_Count, ...]
[id2, spam, word1, word1_Count, word2, word2_Count, ...]
[id3, ham, word1, word1_Count, word2, word2_Count, ...]
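As a rough illustration of how one such row could be produced from raw email text, here is a minimal Python sketch. The function name `email_to_row`, the lowercase tokenization rule, and the alphabetical word order are assumptions made for this example and are not prescribed by the dataset format above.

```python
import re
from collections import Counter

def email_to_row(mail_id, label, text):
    """Build [id, label, word1, word1_count, word2, word2_count, ...] for one email."""
    # Assumed tokenization: lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    row = [mail_id, label]
    # Words are emitted in alphabetical order so the row is deterministic.
    for word, count in sorted(counts.items()):
        row.extend([word, count])
    return row

row = email_to_row("id2", "spam", "Win money now! Win a free prize now")
print(row)
```

Real filters usually add further preprocessing (stop-word removal, stemming, header parsing), which is left out here to keep the sketch short.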
The Naive Bayes algorithm is based on Bayes' theorem, and its training phase consists of the following two steps.
First, it calculates the probabilities of the ham and spam classes.
Next, it calculates, for each word, the probability of that word appearing in ham and in spam.
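The two training steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `train`, the input shape (a list of label and word-count pairs), and the add-one (Laplace) smoothing controlled by `alpha` are assumptions introduced here so that unseen words do not receive a zero probability.

```python
from collections import Counter, defaultdict

def train(samples, alpha=1.0):
    """samples: list of (label, {word: count}) pairs. Returns (priors, likelihoods)."""
    class_docs = Counter()              # number of emails per class
    word_counts = defaultdict(Counter)  # total count of each word per class
    vocab = set()
    for label, counts in samples:
        class_docs[label] += 1
        word_counts[label].update(counts)
        vocab.update(counts)

    # Step 1: class probabilities P(ham) and P(spam).
    total_docs = sum(class_docs.values())
    priors = {c: n / total_docs for c, n in class_docs.items()}

    # Step 2: per-word probabilities P(word | class), with add-alpha smoothing (assumption).
    likelihoods = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        likelihoods[c] = {w: (word_counts[c][w] + alpha) / denom for w in vocab}
    return priors, likelihoods

samples = [
    ("ham", {"meeting": 2, "report": 1}),
    ("spam", {"free": 3, "prize": 1}),
    ("ham", {"report": 1, "free": 1}),
]
priors, likelihoods = train(samples)
print(priors)
print(likelihoods["spam"]["free"])
```

The three samples are made-up toy data; with a real corpus the same two tables would simply be much larger.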
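After the training phase, in the test phase we calculate the probability of ham and of spam for every sample using the words in that sample. Under the naive assumption that the words are independent of each other given the class, pSpam is proportional to P(spam) × P(word1 | spam) × P(word2 | spam) × ..., and pHam is calculated in the same way from the ham probabilities.
Finally, pHam and pSpam are compared, and the test sample is assigned to the class with the higher value.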
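A minimal Python sketch of this test phase is shown below. The probability tables are made-up illustrative values, and the function name `classify` and the small `unseen` fallback probability are assumptions for this example. The scores are summed as logarithms rather than multiplied directly, which gives the same ranking while avoiding numeric underflow on long emails.

```python
import math

def classify(words, priors, likelihoods, unseen=1e-6):
    """Return (best_label, log_scores) for a list of words from one email."""
    scores = {}
    for label, prior in priors.items():
        # log P(class) + sum of log P(word | class)
        score = math.log(prior)
        for w in words:
            # Words missing from the table get a small fallback probability (assumption).
            score += math.log(likelihoods[label].get(w, unseen))
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative values only; in practice these come from the training phase.
priors = {"ham": 0.6, "spam": 0.4}
likelihoods = {
    "ham": {"meeting": 0.30, "report": 0.30, "free": 0.05, "prize": 0.01},
    "spam": {"meeting": 0.02, "report": 0.03, "free": 0.40, "prize": 0.30},
}

label, scores = classify(["free", "prize"], priors, likelihoods)
print(label, scores)
```

For the words "free" and "prize", the spam score (0.4 × 0.40 × 0.30) is larger than the ham score (0.6 × 0.05 × 0.01), so the sample is assigned to spam.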
Traditional filters, such as the Naive Bayes algorithm explained above, rely on textual information. In response, attackers started to send spam as images instead of text. The images contain text, but text-based machine learning algorithms cannot process it in that format. Researchers responded by using image processing techniques to extract the words from images before running the machine learning algorithms. Of course, attackers responded to this too: they placed the text in images at different angles to make it harder to recognize. As you can imagine, researchers solved this problem as well. In the next step, attackers used images with angled words whose letters were drawn in different colors. The war continues in this way. New techniques emerge day by day to bypass filter mechanisms, and in response new techniques are developed day by day to prevent spam.
SpamAssassin is an open source mail filter for identifying spam. It is an intelligent email filter that uses a diverse range of tests to identify undesirable email messages, more commonly known as spam. These tests are applied to email headers and contents in order to classify emails using advanced statistical methods. In addition, SpamAssassin has a modular architecture that allows other technologies to be quickly deployed against spam, and it is designed for easy integration into virtually any email system.
Filtering mechanisms are strengthened with new techniques over time, but spammers also manage to bypass them by generating more sophisticated spam. For this reason, the tool is updated on a regular basis, and anyone can download and use it freely. SpamAssassin uses the combined score from multiple types of checks to determine whether a given message is spam or not. Its primary features are:
- Header tests
- Body phrase tests (SpamAssassinRules)
- Bayesian filtering (BayesFaq)
- Automatic address whitelist/blacklist (AutoWhitelist)
- Automatic sender reputation system (TxRep)
- Manual address whitelist/blacklist (ManualWhitelist)
- Collaborative spam identification databases (DCC, Pyzor, Razor2) (UsingNetworkTests)
- DNS Blocklists, also known as “RBLs” or “Realtime Blackhole Lists” (DnsBlocklists)
- Character sets and locales
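The idea of combining the scores of several checks can be sketched in Python as below. This is only an illustration of the principle: the rule names and score values are hypothetical and are not real SpamAssassin rules, and the threshold of 5.0 is assumed here to match SpamAssassin's default `required_score` setting.

```python
# Hypothetical rule names and scores, for illustration only.
RULE_SCORES = {
    "HEADER_CHECK": 1.2,
    "BODY_PHRASE": 2.5,
    "BAYES_HIGH": 3.0,
    "DNSBL_LISTED": 1.5,
    "WHITELISTED_SENDER": -4.0,  # negative scores push a message toward ham
}
THRESHOLD = 5.0  # assumed default spam threshold

def combined_score(hits, rule_scores=RULE_SCORES):
    """Sum the scores of every rule that matched the message."""
    return sum(rule_scores[name] for name in hits)

def is_spam(hits, threshold=THRESHOLD):
    """A message is flagged as spam when its combined score reaches the threshold."""
    return combined_score(hits) >= threshold

print(is_spam(["BODY_PHRASE", "BAYES_HIGH"]))
print(is_spam(["BODY_PHRASE", "BAYES_HIGH", "WHITELISTED_SENDER"]))
```

In this sketch no single rule reaches the threshold by itself, so one false match is not enough to flag a message.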
Although any one of these tests may misidentify ham or spam on its own, the combined score makes mistakes much less likely. Since version 3.0.0, SpamAssassin has used a perceptron model to perform the same task faster. The perceptron is a neural network technique. In the new algorithm, the training phase is performed with the stochastic gradient descent method. It uses a single perceptron with a logsig (logistic sigmoid) activation function and maps the weights to the SpamAssassin score space.
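To make the idea concrete, here is a minimal Python sketch of a single perceptron with a logistic sigmoid activation trained by stochastic gradient descent. It is not SpamAssassin's actual code: the toy data (three hypothetical rules, where 1 means the rule matched), the log-loss objective, the learning rate, and the epoch count are all assumptions, and the final mapping of weights to the SpamAssassin score space is not shown.

```python
import math
import random

def logsig(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return logsig(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_sgd(data, n_features, lr=0.5, epochs=200, seed=0):
    """Train one perceptron, updating the weights after every single sample."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    bias = 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in random order each epoch
        for x, y in data:
            err = predict(x, weights, bias) - y  # gradient of the log loss (assumption)
            for i, xi in enumerate(x):
                weights[i] -= lr * err * xi
            bias -= lr * err
    return weights, bias

# Toy data: each x lists which of three hypothetical rules matched; y = 1 means spam.
data = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 0),
    ([0, 0, 0], 0),
]
weights, bias = train_sgd(data, n_features=3)
print(weights, bias)
```

After training, rules that only appear in spam samples end up with positive weights and the rule that only appears in a ham sample ends up with a negative weight, which mirrors how per-test scores can be learned from labeled mail.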