
Spam Filtering: Online Naive Bayes Based on TONE

2013-06-06 04:19:34
ZTE Communications, 2013, No. 2

Guanglu Sun, Hongyue Sun, Yingcai Ma, and Yuewu Shen

(1. Research Institute of Information Technology, Tsinghua University, Beijing 100084, China;

2. ZTE Corporation, Shenzhen 518057, China;

3. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)

Abstract The naive Bayes (NB) model has been successfully used to tackle spam, and is very accurate. However, there is still room for improvement. We use a train on or near error (TONE) method in online NB to enhance the performance of NB and reduce the number of training emails. We conducted an experiment to determine the performance of the improved algorithm by plotting (1-ROCA)% curves. The results show that the proposed method improves the performance of the original NB.

Keywords spam filtering; online naive Bayes; train on or near error

1 Introduction

Email is an efficient communication technology and one of the most widely used internet applications. However, spam is a drain on network resources and is often detrimental to user experience. Some people use spam for malicious purposes, so spam filtering is a hotspot in current research.

The body of the email contains essential information and is arguably the most important part of the email. Content-based filtering is a reliable method for combating spam. Machine learning provides more accurate prediction and is an attractive solution for content-based filtering. However, there is no consensus about which learning algorithms are best [1].

Machine learning techniques are usually based on generative models, such as naive Bayes (NB), or discriminative models, such as support vector machines (SVMs). For most large-scale tasks, discriminative models perform better than generative models [2]. This is especially true when there is sufficient training data. In the TREC spam track, Bogofilter, a fast Bayesian spam filter, is used as the baseline [3]. Many researchers have achieved state-of-the-art spam filtering using SVMs; however, SVMs typically require training time that is quadratic in the number of training examples [4]. SVMs are not suitable for online filtering because they are not updated in real time. With the Bayesian method, filtering is less accurate, but only linear training time is required, and robustness is less likely to be affected by bad data [5], [6]. The Bayesian filtering system is easy to deploy because it is simple and lightweight [7].

In this paper, we propose an improved online NB classifier. Online NB is often used in spam filtering, and unlike SVM, it can update itself in real time according to spam behavior. In the original NB model, training data is passively accepted. Updating the training data is expensive for most classifiers, and this practice has been strongly discouraged by industry [8]. Train on or near error (TONE) is a sample-selection method that can be used to discard useless examples [9]. Only part of the examples are trained on when using TONE. When TONE is applied to the online NB model and tested on several large spam data sets, the online model performs better than the original NB model. In particular, the number of examples needed to train an effective classifier decreases.

In section 2, we review the framework of an online learning model for spam filtering. In section 3, we describe online NB models based on TONE. In section 4, we show that the improved algorithm is much more efficient than the original NB algorithm. Section 5 concludes the paper.

2 Online Learning Model for Spam Filtering

Many models used in traditional machine-learning applications operate in pool-based (offline) mode [10]. The model is trained on a large data set, and the examples are reclassified without retraining. An offline learning model tends to be optimized over all the training data at once, whereas an online learning process can adapt to a changing environment. Online learning algorithms update the learner with newly received examples; that is, they can use an old hypothesis (if one exists) as the starting point for retraining and adapting to changes in the data.

Spam filtering is typically done online (Fig. 1). Emails are viewed as a stream, not as a pool, entering the system one by one. The filter makes a spam or ham prediction for each email. Next, the user reads the message and perhaps gives feedback to the learning-based filter. The filter uses the label to update the feature library and retrain the learner. Ideally, this improves future predictive performance. Large-scale and online classification problems can be solved with a classifier that allows online training and classification [11]. In a changing environment, an online NB learner is typically used for spam filtering, which proceeds incrementally [12]. An online NB learner only has linear training time and can be easily deployed in an online setting with incremental updates.

▲Figure 1. The online spam filtering scenario.

3 Online NB Model Based on TONE

3.1 Bayesian Spam Filtering

Naive Bayes is popular in industry probably because of its simplicity and the ease with which it can be implemented. Its linear computational complexity and high accuracy are comparable to those of more elaborate learning algorithms.

Here, we give notation for the NB model. In an example dataset {(X^(1), y^(1)), ..., (X^(m), y^(m)), ...}, X^(m) denotes a vector containing the features of the m-th example, and y^(m) is the corresponding label. The spam likelihood P(y = spam | X) is calculated using the Bayesian formula:

$$P(y=\mathrm{spam}\mid X)=\frac{P(X\mid y=\mathrm{spam})\,P(\mathrm{spam})}{P(X)} \qquad (1)$$

Similarly, the ham likelihood is calculated using

$$P(y=\mathrm{ham}\mid X)=\frac{P(X\mid y=\mathrm{ham})\,P(\mathrm{ham})}{P(X)} \qquad (2)$$

To model P(y | X), we assume that each feature x_i is conditionally independent given y. This assumption is called the NB assumption. The resulting algorithm is called the NB classifier and is given by

$$P(y\mid X)=\frac{P(y)\prod_i P(x_i\mid y)}{P(X)} \qquad (3)$$

In spam filtering, there is no need to estimate P(X). The quotient of (1) and (2) is given by

$$\frac{P(y=\mathrm{spam}\mid X)}{P(y=\mathrm{ham}\mid X)}=\frac{P(\mathrm{spam})\prod_i P(x_i\mid y=\mathrm{spam})}{P(\mathrm{ham})\prod_i P(x_i\mid y=\mathrm{ham})} \qquad (4)$$

We can use (4) to classify the email as spam or ham. In (4), P(spam) is the a priori probability of spam, and P(x_i | y = spam) is expressed as a frequency in the spam category. The a priori probability of spam is given by

$$P(\mathrm{spam})=\frac{N_{\mathrm{spam}}}{N_{\mathrm{spam}}+N_{\mathrm{ham}}} \qquad (5)$$

and P(x_i | y = spam) is given by

$$P(x_i\mid y=\mathrm{spam})=\frac{N(x_i,\mathrm{spam})}{\sum_j N(x_j,\mathrm{spam})} \qquad (6)$$

where N(x_i, spam) is the number of occurrences of feature x_i in spam messages.

Here, N_spam is the number of spam messages and N_ham is the number of ham messages. The situation for ham is similar to that for spam.

3.2 The Model

The NB algorithm described above works fairly well, but there is a simple tweak that makes it work much better, especially for text classification. If a feature only occurs in ham, then P(x_i | y = spam) may be zero. To avoid this, we can use Laplace smoothing, given by

$$P(x_i\mid y=\mathrm{spam})=\frac{N(x_i,\mathrm{spam})+\varepsilon}{\sum_j N(x_j,\mathrm{spam})+\varepsilon\,|V|} \qquad (7)$$

where ε is a small smoothing constant and |V| is the number of distinct features.
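As an illustration, the prior (5) and a smoothed likelihood (7) can be estimated directly from token counts. The following is a minimal sketch, not the authors' code: the function and variable names are illustrative, and ε defaults to the value reported in the experiments (10^-5).

```python
from collections import Counter

def estimate_params(emails, labels, epsilon=1e-5):
    """Estimate P(spam) and a smoothed P(x_i | y) from token counts.

    `emails` is a list of token lists; `labels` holds "spam"/"ham".
    The small constant epsilon plays the role of the smoothing term
    in (7); illustrative names throughout.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = {"spam": 0, "ham": 0}
    for tokens, y in zip(emails, labels):
        counts[y].update(tokens)
        n_docs[y] += 1

    # Eq. (5): a priori probability of spam
    prior_spam = n_docs["spam"] / (n_docs["spam"] + n_docs["ham"])
    vocab = set(counts["spam"]) | set(counts["ham"])

    def likelihood(token, y):
        # Eq. (7): smoothed frequency of the token in class y
        total = sum(counts[y].values())
        return (counts[y][token] + epsilon) / (total + epsilon * len(vocab))

    return prior_spam, likelihood
```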

To avoid underflow in the practical calculation, we use logarithms. Therefore, (4) is transformed into

$$P'=\log\frac{P(\mathrm{spam})}{P(\mathrm{ham})}+\sum_i\log\frac{P(x_i\mid y=\mathrm{spam})}{P(x_i\mid y=\mathrm{ham})} \qquad (8)$$

We can now classify the email by P'. If P' is greater than 0, the mail is predicted to be spam; otherwise, it is ham.
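The log-odds decision in (8) can be sketched as below. Here `likelihood(token, y)` is assumed to be any callable returning a smoothed conditional probability; the names are illustrative, not part of the original paper.

```python
import math

def log_score(tokens, prior_spam, likelihood):
    """Compute P' per (8): the class prior log-ratio plus the sum of
    per-feature log-likelihood ratios. Positive P' => predict spam."""
    p = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for t in tokens:
        p += math.log(likelihood(t, "spam")) - math.log(likelihood(t, "ham"))
    return p
```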

To apply the TONE algorithm, we use the logistic function to convert P' into a score between 0 and 1. Equation (9) maps P' to a score between 0 and approximately 1, and the scale parameter s ensures that P' is not too big:

$$\mathrm{score}=\frac{1}{1+e^{-P'/s}} \qquad (9)$$

To meet the spam filter's requirements, the online filter should update itself at the appropriate time. Spam filtering needs to be highly scalable because it involves large amounts of high-dimensional data. Content-based spam detection often requires training the learner. In the original NB, there is no need to update the learner; however, in the improved online NB, TONE can be applied to the training process (called thick-threshold training) [13]. TONE is developed from train on error (TOE). There are two scenarios in which learner training is activated: 1) when a sample has been misclassified by the filter, and 2) when a correctly classified sample falls within a predefined boundary. We improve the predictive ability of NB by introducing an online NB method based on TONE. The improved algorithm, called NB-TONE, is cheap and does not result in performance loss.

With TONE, examples that have the least classification confidence are chosen. The parameter c is a thick threshold for training. Regardless of how the email is classified, if the score does not exceed the thick threshold, the email is not well classified, and the learner has to be trained and updated. On the other hand, TONE can also make a classifier more robust so that overfitting is averted. If the example is far from the hyperplane, the classifier predicts the example with higher confidence, and such examples do not need training. TONE is a sample-selection method that reduces the number of training examples and cuts down training time.

Algorithm 1. NB-TONE
1: for each mail <X(i), y(i)>, i = 1, 2, ...
2:   a new message arrives
3:   calculate P' using (8)
4:   map P' to a score using (9)
5:   if score > 0.5 then
6:     X(i) is spam
7:   else
8:     X(i) is ham
9:   end if
10:  if |score - 0.5| < c or X(i) is misclassified then
11:    train the model with <X(i), y(i)>
12:  end if
13: end for
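Algorithm 1 can be sketched end to end in Python. This is a hypothetical implementation, not the authors' code: the class name and token-based feature handling are illustrative, while c, ε, and the scale follow the values reported in the paper (c around 0.15, ε = 10^-5, scale = 2500).

```python
import math
from collections import Counter

class NBTone:
    """Minimal online naive Bayes with train-on-or-near-error (TONE)."""

    def __init__(self, c=0.15, epsilon=1e-5, scale=2500.0):
        self.c, self.epsilon, self.scale = c, epsilon, scale
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}
        self.n_docs = {"spam": 1, "ham": 1}  # start at 1 to avoid log(0)

    def _loglike(self, token, y):
        # Smoothed log P(token | y), in the spirit of (7)
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        return math.log((self.counts[y][token] + self.epsilon)
                        / (self.totals[y] + self.epsilon * vocab))

    def score(self, tokens):
        # P' per (8), then the logistic mapping per (9)
        p = math.log(self.n_docs["spam"]) - math.log(self.n_docs["ham"])
        for t in tokens:
            p += self._loglike(t, "spam") - self._loglike(t, "ham")
        return 1.0 / (1.0 + math.exp(-p / self.scale))

    def train(self, tokens, y):
        self.counts[y].update(tokens)
        self.totals[y] += len(tokens)
        self.n_docs[y] += 1

    def process(self, tokens, y):
        """One pass of Algorithm 1: predict, then train on or near error."""
        s = self.score(tokens)
        predicted = "spam" if s > 0.5 else "ham"
        if abs(s - 0.5) < self.c or predicted != y:
            self.train(tokens, y)
        return predicted
```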

4 Evaluations and Results

In section 3, online NB spam filtering based on TONE was proposed. Here, we test the algorithm on large benchmark sets of email data.

4.1 Data Sets

Two benchmark data sets were used, both from TREC spam filtering competitions: trec05p, which contains 92,189 messages in English [3], and trec06p, which contains 37,822 messages in English [14]. For a fair comparison, each data set was processed in order, and we compared our method with the original model. (1-ROCA)% was used as the standard performance measure.

4.2 Feature Space

Feature extraction is important for machine learning. An appropriate feature extraction method greatly improves the accuracy of the learner. In [15], character-level n-grams are shown to be valid and robust for a variety of spam detection methods [16]. Here, an email is represented as a vector that has a unique dimension for each possible substring of n characters. With the 4-gram feature extraction method, only the first 3000 features of each email were extracted, and duplicate features were removed from each email.
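A sketch of character-level 4-gram extraction capped at the first 3000 features. The deduplication of repeated grams is an assumption here, since the paper only says that duplicate features are removed; the function name and defaults are illustrative.

```python
def char_ngrams(text, n=4, max_features=3000):
    """Return the first `max_features` distinct character n-grams of
    `text`, in order of first appearance (sketch only)."""
    seen = []
    seen_set = set()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram not in seen_set:
            seen_set.add(gram)
            seen.append(gram)
        if len(seen) >= max_features:
            break
    return seen
```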

4.3 Classification Performance

In our experiments on NB-TONE, the sampling threshold c ranged from 0.01 to 0.50. We used the parameter ε = 10^-5, and the scale parameter was 2500.

We examined the difference between pure NB and NB-TONE. For NB-TONE, we tested c = 0.15 and c = 0.25. Note that if c = 0.5, the algorithm degenerates into pure NB. Train% represents the overall percentage of training data. The results in Table 1 show that NB-TONE comprehensively outperforms pure NB. Moreover, NB-TONE can cut down the number of training examples and reduce computational cost.

▼Table 1. NB-TONE beats pure NB on trec05p and trec06p.

4.4 Parameter Sensitivity

Fig. 2 shows the effect of c on (1-ROCA)% performance. The results indicate that (1-ROCA)% performance varies with respect to c. From c = 0 to 0.15, the number of training examples increases and performance improves. However, as c approaches 0.5, performance worsens.

5 Conclusion

▲Figure 2. NB-TONE results on each data set, reported as (1-ROCA)% by threshold c.

We improved traditional online naive Bayes by using TONE. In the online process, the classifier updates itself at the appropriate time. This method improves classification of low-confidence examples and reduces the number of labels needed for high performance. Furthermore, the approach is well suited to this domain because spam filtering is inherently an online task. Our experiment shows that our NB-TONE is reliable.
