Guanglu Sun, Hongyue Sun, Yingcai Ma, and Yuewu Shen
(1. Research Institute of Information Technology, Tsinghua University, Beijing 100084, China;
2. ZTE Corporation, Shenzhen 518057, China;
3. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)
Abstract The naive Bayes (NB) model has been successfully used to tackle spam and is very accurate. However, there is still room for improvement. We use a train-on-or-near-error (TONE) method in online NB to enhance the performance of NB and reduce the number of training emails. We conducted an experiment to determine the performance of the improved algorithm by plotting (1-ROCA)% curves. The results show that the proposed method improves the performance of the original NB.
Keywords spam filtering; online naive Bayes; train-on-or-near-error
Email is an efficient communication technology and one of the most widely used internet applications. However, spam is a drain on network resources and is often detrimental to user experience. Some people use spam for malicious purposes, so spam filtering is a hotspot in current research.
The body of the email contains essential information and is arguably the most important part of the email. Content-based filtering is a reliable method for combating spam. Machine learning provides more accurate prediction and is an attractive solution for content-based filtering. However, there is no consensus about which learning algorithms are best [1].
Machine learning techniques are usually based on generative models, such as naive Bayes (NB), or discriminative models, such as support vector machines (SVMs). For most large-scale tasks, discriminative models perform better than generative models [2]. This is especially true when there is sufficient training data. In the TREC spam track, Bogofilter, a fast Bayesian spam filter, is used as the baseline [3]. Many researchers have achieved state-of-the-art spam filtering using SVMs; however, SVMs typically require training time that is quadratic in the number of training examples [4]. SVMs are not suitable for online filtering because they are not updated in real time. With the Bayesian method, filtering is less accurate, but only linear training time is required, and robustness is less likely to be affected by bad data [5], [6]. The Bayesian filtering system is easy to deploy because it is simple and lightweight [7].
In this paper, we propose an improved online NB classifier. Online NB is often used in spam filtering and, unlike SVM, can update itself in real time according to spam behavior. In the original NB model, training data is passively accepted. Updating the training data is expensive for most classifiers, and this practice has been strongly discouraged by industry [8]. Train-on-or-near-error (TONE) is a sample-selection method that can be used to discard useless examples [9]. Only part of the examples are trained with TONE. When TONE is applied to the online NB model and tested on several large spam data sets, the online model performs better than the original NB model. In particular, the number of examples needed to train an effective classifier decreases.
In section 2, we review the framework of an online learning model for spam filtering. In section 3, we describe online NB models based on TONE. In section 4, we show that the improved algorithm is much more efficient than the original NB algorithm. Section 5 concludes the paper.
Many models used in traditional machine-learning applications operate in pool-based (offline) mode [10]. The model is trained on a large data set, and the examples are reclassified without retraining. An offline learning model tends to be optimized for all the training data at once, but an online model has a learning process that can adapt to a changing environment. Online learning algorithms update the learner with newly received examples; that is, they can use an old hypothesis (if one exists) as the starting point for retraining and adapting to changes in the data.
Spam filtering is typically done online (Fig. 1). Emails are viewed as a stream, not as a pool, and enter the system one by one. The filter makes a spam or ham prediction for each email. Next, the user reads the message and perhaps gives feedback to the learning-based filter. The filter uses the label to update the feature library and retrain the learner. Ideally, this improves future predictive performance. Large-scale and online classification problems can be solved with a classifier that allows online training and classification [11]. In a changing environment, an online NB learner, which proceeds incrementally, is typically used for spam filtering [12]. An online NB learner has only linear training time and can be easily deployed in an online setting with incremental updates.

▲Figure 1. The online spam filtering scenario.
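The scenario in Fig. 1 can be sketched as a simple loop (a minimal illustration; `predict` and `update` are assumed method names for any incrementally trainable filter, not an interface given in the paper):

```python
def run_online_filter(filter_model, message_stream):
    """Process an email stream one message at a time.

    filter_model is assumed to expose predict(msg) -> "spam"/"ham"
    and update(msg, true_label); both names are illustrative.
    Returns a list of booleans marking correct predictions.
    """
    results = []
    for msg, true_label in message_stream:
        # classify first, before the label is known
        prediction = filter_model.predict(msg)
        results.append(prediction == true_label)
        # user feedback arrives after classification; retrain incrementally
        filter_model.update(msg, true_label)
    return results
```

The key property of this loop is that every message is classified before its label is seen, which is exactly the evaluation protocol used by the TREC-style online filtering experiments.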
Naive Bayes is popular in industry, probably because of its simplicity and the ease with which it can be implemented. Its linear computational complexity and high accuracy are comparable to those of more elaborate learning algorithms.
Here, we give notation for the NB model. In an example data set {(X(1), y(1)), ..., (X(m), y(m)), ...}, X(m) denotes a vector containing the features of the m-th example. The corresponding label is y(m). The spam likelihood P(y = spam|X) is calculated using the Bayesian formula:

P(y = spam|X) = P(X|y = spam)P(y = spam) / P(X)    (1)
Similarly, the ham likelihood is calculated using

P(y = ham|X) = P(X|y = ham)P(y = ham) / P(X)    (2)
To model P(y|X), we assume that each feature x_i is conditionally independent given y. This assumption is called the NB assumption. The resulting algorithm is called the NB classifier and is given by

P(y|X) = P(y) ∏_i P(x_i|y) / P(X)    (3)
In spam filtering, there is no need to estimate P(X). The quotient of (1) and (2) is given by

P = P(y = spam|X) / P(y = ham|X) = [P(spam) ∏_i P(x_i|y = spam)] / [P(ham) ∏_i P(x_i|y = ham)]    (4)
We can use (4) to classify the email as spam or ham. In (4), P(spam) is the a priori probability of spam, and P(x_i|y = spam) is expressed as a frequency in the spam category. The a priori probability of spam is given by

P(spam) = N_spam / (N_spam + N_ham)    (5)
and P(x_i|y = spam) is given by

P(x_i|y = spam) = N_{x_i,spam} / N_spam    (6)

where N_spam is the number of spam emails, N_ham is the number of ham emails, and N_{x_i,spam} is the number of spam emails that contain feature x_i. The situation for ham is similar to that for spam.
The proposed NB algorithm works fairly well, but a simple tweak makes it work much better, especially for text classification. If a feature only occurs in ham, then P(x_i|y = spam) may be zero. To avoid this, we can use Laplace smoothing, given by

P(x_i|y = spam) = (N_{x_i,spam} + ε) / (N_spam + 2ε)    (7)

where ε is a small smoothing constant.
To avoid underflow in practical calculation, we use a logarithm. Therefore, (4) is transformed into

P′ = log P(spam) − log P(ham) + Σ_i [log P(x_i|y = spam) − log P(x_i|y = ham)]    (8)
We can now classify the email by P′. If P′ is greater than 0, the mail is predicted to be spam; otherwise, it is ham.
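The smoothed log-odds computation can be sketched as follows (a minimal illustration: `spam_counts` and `ham_counts` are assumed dictionaries of per-class feature counts, and feature occurrence is simplified to presence in an email):

```python
import math

def nb_log_odds(tokens, spam_counts, ham_counts, n_spam, n_ham, eps=1e-5):
    """Return the log-odds P' of spam for one tokenized email.

    spam_counts/ham_counts map a feature to the number of spam/ham
    emails containing it; n_spam/n_ham are the class totals.
    """
    # prior term: log P(spam) - log P(ham) reduces to log(n_spam/n_ham)
    p_prime = math.log(n_spam) - math.log(n_ham)
    for t in tokens:
        # smoothing keeps unseen features from yielding zero probability
        p_spam = (spam_counts.get(t, 0) + eps) / (n_spam + 2 * eps)
        p_ham = (ham_counts.get(t, 0) + eps) / (n_ham + 2 * eps)
        p_prime += math.log(p_spam) - math.log(p_ham)
    return p_prime
```

A positive return value predicts spam and a negative one ham, matching the sign test on P′ described above.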
To apply the TONE algorithm, we use the logistic function to convert P′ into a score between 0 and 1. The scale parameter s in (9) ensures that P′/s is not too big:

score = 1 / (1 + e^(−P′/s))    (9)
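This mapping can be written directly (a minimal sketch; the default scale of 2500 matches the parameter reported in the experiments, and the function name is illustrative):

```python
import math

def to_score(p_prime, scale=2500.0):
    """Map the log-odds P' into (0, 1) with a scaled logistic function."""
    return 1.0 / (1.0 + math.exp(-p_prime / scale))
```

With this mapping, P′ = 0 gives a score of exactly 0.5, so 0.5 is the natural decision boundary.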
To meet the spam filter's requirements, the online filter should update itself at the appropriate time. Spam filtering needs to be highly scalable because it involves large amounts of high-dimensional data. Content-based spam detection often requires training the learner. In the original NB, there is no need to update the learner; however, in the improved online NB, TONE can be applied to the training process (called thick-threshold training) [13]. TONE is developed from train-on-error (TOE). There are two scenarios in which the learner training mode is activated with this approach: 1) when a sample has been misclassified by the filter, and 2) when a correctly classified sample falls within a predefined boundary. We improve the predictive ability of NB by introducing an online NB method based on TONE. The improved algorithm, called NB-TONE, is cheap and does not result in performance loss.

With TONE, the examples that have the least classification confidence are chosen. The parameter c is a thick threshold for training. Regardless of how the email is classified, if the score does not exceed the thick threshold, the email is not well classified, and the learner has to be trained and updated. On the other hand, TONE can also make a classifier more robust so that overfitting is averted. If an example is far from the hyperplane, the classifier predicts it with higher confidence, and such examples do not need training. TONE is a sample-selection method that reduces the number of training examples and cuts down training time.

Algorithm 1. NB-TONE
1: for each mail <X(i), y(i)>, i = 1, 2, ...
2:   a new message arrives
3:   calculate P′ using (8)
4:   map P′ to a score using (9)
5:   if score > 0.5 then
6:     X(i) is spam
7:   else
8:     X(i) is ham
9:   end if
10:  if |score − 0.5| < c or X(i) is misclassified then
11:    train the model with <X(i), y(i)>
12:  end if
13: end for
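Algorithm 1 can be rendered as a short, self-contained Python sketch. The count-based model and ε smoothing follow the formulas above, but this is a simplified illustration rather than the authors' exact implementation; `tokenize` is an assumed helper that yields a message's features:

```python
import math

def nb_tone_filter(stream, tokenize, c=0.15, eps=1e-5, scale=2500.0):
    """Online NB with train-on-or-near-error (sketch of Algorithm 1).

    stream yields (message, true_label) pairs, labels "spam"/"ham".
    Returns the list of predictions made before each label was seen.
    """
    spam_counts, ham_counts = {}, {}
    n_spam = n_ham = 0
    predictions = []
    for msg, label in stream:
        tokens = list(tokenize(msg))
        # log-odds P' per eq. (8); eps also guards the empty-model case
        p_prime = math.log(n_spam + eps) - math.log(n_ham + eps)
        for t in tokens:
            p_s = (spam_counts.get(t, 0) + eps) / (n_spam + 2 * eps)
            p_h = (ham_counts.get(t, 0) + eps) / (n_ham + 2 * eps)
            p_prime += math.log(p_s) - math.log(p_h)
        score = 1.0 / (1.0 + math.exp(-p_prime / scale))  # eq. (9)
        pred = "spam" if score > 0.5 else "ham"
        predictions.append(pred)
        # TONE: train only on errors or low-confidence examples
        if pred != label or abs(score - 0.5) < c:
            counts = spam_counts if label == "spam" else ham_counts
            for t in tokens:
                counts[t] = counts.get(t, 0) + 1
            if label == "spam":
                n_spam += 1
            else:
                n_ham += 1
    return predictions
```

Setting c = 0.5 makes the near-error condition always true, so every example is trained on and the loop degenerates into pure online NB, as noted in the experiments below.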
In section 3,online NB spam filtering based on TONE was proposed.Here,we test the algorithm on large benchmark sets of email data.
Two benchmark data sets were used, both from TREC spam filtering competitions: trec05p, which contains 92,189 messages in English [3], and trec06p, which contains 37,822 messages in English [14]. For a fair comparison, each data set was ordered, and we compared our method with the original model. (1-ROCA)% was used as the standard performance measure.
Feature extraction is important for machine learning. An appropriate feature extraction method greatly improves the accuracy of the learner. In [15], character-level n-grams are shown to be valid and robust for a variety of spam detection methods [16]. Here, an email is represented as a vector that has a unique dimension for each possible substring of n characters. With the 4-gram feature extraction method, only the first 3000 features of each email were extracted, and the same features were removed from each email.
In our experiments on NB-TONE, the sampling threshold c ranged from 0.01 to 0.50. We used the parameter ε = 10^-5, and the scale parameter was 2500.
We examined the difference between pure NB and NB-TONE. For NB-TONE, c = 0.15 and c = 0.25. Note that if c = 0.5, the algorithm degenerates into pure NB. Train% represents the overall percentage of training data. The results in Table 1 show that NB-TONE comprehensively outperforms pure NB. Moreover, NB-TONE can cut down the number of training examples and reduce computational cost.
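The character-level 4-gram extraction described above can be sketched as follows (a simplified illustration: overlapping 4-character substrings, keeping only the first 3000 distinct features of each email; the exact deduplication used in the paper may differ):

```python
def char_ngrams(text, n=4, max_features=3000):
    """Extract the first max_features distinct character n-grams of text."""
    features = []
    seen = set()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram not in seen:       # keep each n-gram once per email
            seen.add(gram)
            features.append(gram)
            if len(features) == max_features:
                break
    return features
```

Each returned n-gram corresponds to one dimension of the email's feature vector.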

▼Table 1. NB-TONE beats pure NB on trec05p and trec06p.
Fig. 2 shows the effect of c on (1-ROCA)% performance. The results indicate that (1-ROCA)% performance varies with respect to c. From c = 0 to 0.15, the number of training examples increases and performance improves. However, as c approaches 0.5, performance worsens.

▲Figure 2. NB-TONE results on the data sets, reported as (1-ROCA)% by threshold c.
We improved traditional online naive Bayes by using TONE. In the online process, the classifier updates itself at the appropriate time. This method improves classification of low-confidence examples and reduces the number of labels needed for high performance. Furthermore, the approach is well suited to this domain because spam filtering is inherently an online task. Our experiments show that NB-TONE is reliable.