張宗堂 (ZHANG Zongtang), 陳喆 (CHEN Zhe), 戴衛國 (DAI Weiguo)
Abstract: To address the problem that traditional ensemble algorithms are unsuitable for imbalanced data classification, an Over-Sampling AdaBoost algorithm based on Margin theory (MOSBoost) was proposed. First, the margins of the original samples were obtained by pretraining. Then, the minority-class samples were heuristically duplicated according to their margin ranking, forming a new balanced sample set. Finally, the balanced sample set was fed into AdaBoost to train the final ensemble classifier. In experiments on UCI data sets, the F-measure and G-mean criteria were used to evaluate four algorithms: MOSBoost, AdaBoost, Random Over-Sampling AdaBoost (ROSBoost), and Random Down-Sampling AdaBoost (RDSBoost). The experimental results show that MOSBoost outperforms the other three algorithms; compared with AdaBoost, MOSBoost improves the F-measure and G-mean by 8.4% and 6.2%, respectively.
Keywords: imbalanced data; margin theory; over-sampling method; ensemble classifier; machine learning
CLC number: TP181
Document code: A
0 Introduction
In recent years, imbalanced data classification has become a hot topic in machine learning. It arises widely in real-world production and daily life, for example in e-mail filtering [1], image classification [2], software defect prediction [3], medical diagnosis [4], and gene data analysis [5]. In a binary classification problem, the majority class of an imbalanced data set contains far more samples than the minority class. Traditional classification methods aim at overall classification accuracy and ignore the class imbalance, so the accuracy on the minority class drops; yet the minority samples are often the more valuable ones, which makes misclassifying them costly.
Approaches to imbalanced data fall roughly into two levels, the algorithm level and the data level. Algorithm-level approaches construct new algorithms or modify existing ones so that they are biased toward the minority class; data-level approaches use resampling to obtain a balanced sample set, which is then classified with existing classifiers. Resampling methods, comprising under-sampling and over-sampling, are simple in form and do not affect classifier design, and have therefore been widely studied. By strategy they can be further divided into random sampling and heuristic sampling: random sampling ignores the information in the data and simply deletes or adds samples at random, whereas heuristic sampling exploits the internal characteristics of the data. Typical heuristic under-sampling methods such as Tomek links [6], one-sided selection [7], and the neighborhood cleaning rule [8] overcome the tendency of random under-sampling to discard useful information and improve performance to some extent. Among heuristic over-sampling methods, SMOTE (Synthetic Minority Over-sampling TEchnique) [9] and its variants [10-12] are representative; the basic assumption of SMOTE is that a convex combination of neighboring data points of the same class also belongs to that class. Heuristic resampling methods generally screen samples under some criterion and therefore depend strongly on the data set; however, imbalanced data sets often exhibit within-class imbalance, small disjuncts, and high noise, which makes the criterion hard to satisfy and degrades performance. On the surface this is a matter of fit between the data set and the criterion; fundamentally, these methods lack a theoretical basis and generalize poorly.
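To make the over-sampling idea concrete, the following is a minimal SMOTE-style sketch using only NumPy. The function name smote_sample, the neighbor count k, and the brute-force distance computation are our illustrative choices, not code from [9]:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples in the spirit of SMOTE:
    each new point is a random convex combination of a minority sample
    and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class (brute force)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]       # k nearest neighbors per sample
    synth = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                      # pick a minority sample
        m = nn[i, rng.integers(min(k, n - 1))]   # and one of its neighbors
        lam = rng.random()                       # interpolation coefficient
        synth[j] = X_min[i] + lam * (X_min[m] - X_min[i])
    return synth
```

Interpolating only between same-class neighbors is exactly the convexity assumption stated above; when minority samples lie near the class boundary or in small disjuncts, that assumption breaks down, which is the weakness the paragraph points out.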
AdaBoost is a classic ensemble classification algorithm with wide application in machine learning [13-15]. Because AdaBoost minimizes the overall classification error and ignores the imbalance between classes, it is not suited to imbalanced data classification. Margin theory is an important theoretical foundation of AdaBoost and has successfully explained phenomena such as AdaBoost's resistance to overfitting [16-17]. Starting from margin theory, this paper defines minority-class and majority-class margins, screens the minority samples by the sign of their margin, and heuristically duplicates the minority samples with positive margin to form a new balanced sample set; training AdaBoost on this set yields the MOSBoost algorithm, which improves classification performance on imbalanced data.
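The sketch below implements one possible reading of this pipeline on top of scikit-learn's AdaBoostClassifier. The margin definition (true label times the normalized weighted vote), the duplication rule (cycle through positive-margin minority samples in ascending margin order until the classes balance), and all helper names are our assumptions for illustration; they are not taken verbatim from the paper's algorithm:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def mosboost_fit(X, y, seed=0):
    """Sketch of the MOSBoost idea: pretrain to obtain per-sample margins,
    heuristically duplicate positive-margin minority samples to balance
    the classes, then train the final AdaBoost ensemble on the result.

    Assumed conventions (ours): X, y are NumPy arrays; labels are {-1, +1}
    with +1 the minority class."""
    # 1) pretraining: per-sample voting margins
    pre = AdaBoostClassifier(n_estimators=50, random_state=seed).fit(X, y)
    f = pre.decision_function(X)           # signed weighted vote for class +1
    margin = y * f / np.abs(f).max()       # normalized margin in [-1, 1]

    # 2) heuristic duplication of positive-margin minority samples
    mino = np.flatnonzero(y == 1)
    majo = np.flatnonzero(y == -1)
    cand = mino[margin[mino] > 0]          # reliably classified minority samples
    cand = cand[np.argsort(margin[cand])]  # ascending margin order
    need = max(len(majo) - len(mino), 0)   # copies required for balance
    extra = np.resize(cand, need)          # cycle through candidates

    Xb = np.vstack([X, X[extra]])
    yb = np.concatenate([y, y[extra]])

    # 3) final ensemble trained on the balanced sample set
    return AdaBoostClassifier(n_estimators=50, random_state=seed).fit(Xb, yb)
```

Filtering by margin sign uses information from the whole ensemble rather than a local neighborhood criterion, which is what distinguishes this margin-guided duplication from the neighborhood-based heuristics discussed in the introduction.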
1 Related work
1.1 AdaBoost algorithm
The AdaBoost algorithm takes a training set {(x1, y1), (x2, y2), …, (xN, yN)} as input, where xi is a sample and yi its class label; for binary classification, yi ∈ {-1, 1}. It then runs a given base learning algorithm repeatedly over rounds t = 1, 2, …, T. Let Dt(i) denote the weight of the i-th training sample in round t. The task of the base learning algorithm is to obtain a base classifier ht that minimizes the classification error under the weight distribution Dt. Once ht is trained, AdaBoost chooses a parameter αt ∈ R that measures the performance of ht, and then updates the weight distribution Dt. The final ensemble classifier F is the weighted output of the T base classifiers, as shown in Algorithm 1.
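These steps can be condensed into a short sketch (a minimal binary AdaBoost with decision stumps; labels are assumed to be in {-1, +1}, and the early stop when the weighted error reaches 0.5 is a common convention rather than part of the text above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Minimal binary AdaBoost following the steps above."""
    N = len(y)
    D = np.full(N, 1.0 / N)             # initial weight distribution D_1
    stumps, alphas = [], []
    for t in range(T):
        # base classifier h_t trained under the current distribution D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()        # weighted error of h_t under D_t
        if eps >= 0.5:                  # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # performance of h_t
        D *= np.exp(-alpha * y * pred)  # up-weight mistakes, down-weight hits
        D /= D.sum()                    # renormalize (the Z_t factor)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """F(x) = sign(sum_t alpha_t * h_t(x)): the weighted vote of the bases."""
    votes = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(votes)
```

The weight update is what makes AdaBoost minimize the overall error: every round it concentrates mass on currently misclassified samples regardless of their class, which is precisely why, on imbalanced data, nothing prevents the ensemble from sacrificing the small minority class.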
References
[1] DAI H L. Class imbalance learning via a fuzzy total margin based support vector machine[J]. Applied Soft Computing, 2015, 31(C): 172-184.
[2] TAN J F, ZHU Y, CHEN T X, et al. Imbalanced image classification approach based on convolutional neural network and cost-sensitivity[J]. Journal of Computer Applications, 2018, 38(7): 1862-1865, 1871.
[3] WANG S, YAO X. Using class imbalance learning for software defect prediction[J]. IEEE Transactions on Reliability, 2013, 62(2): 434-443.
[4] OZCIFT A, GULTEN A. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms[J]. Computer Methods and Programs in Biomedicine, 2011, 104(3): 443-451.
[5] YU H, NI J, ZHAO J. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data[J]. Neurocomputing, 2013, 101: 309-318.
[6] TOMEK I. Two modifications of CNN[J]. IEEE Transactions on Systems, Man and Cybernetics, 1976, SMC-6(11): 769-772.
[7] KUBAT M, MATWIN S. Addressing the curse of imbalanced training sets: one-sided selection[C]// Proceedings of the 14th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1997: 179-186.
[8] LAURIKKALA J. Improving identification of difficult small classes by balancing class distribution[C]// Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe. Berlin: Springer, 2001: 63-66.
[9] CHAWLA N, BOWYER K, HALL L, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.
[10] RIVERA W A. Noise reduction a priori synthetic oversampling for class imbalanced data sets[J]. Information Sciences, 2017, 408(C): 146-161.
[11] MA L, FAN S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J]. BMC Bioinformatics, 2017, 18(1): 169.
[12] BOROWSKA K, STEPANIUK J. Imbalanced data classification: a novel resampling approach combining versatile improved SMOTE and rough sets[C]// CISIM 2016: IFIP International Conference on Computer Information Systems and Industrial Management. Berlin: Springer, 2016: 31-42.
[13] BAIG M M, AWAIS M M, EL-ALFY E S M. AdaBoost-based artificial neural network learning[J]. Neurocomputing, 2017, 248(C): 120-126.
[14] MINZ A, MAHOBIYA C. MR image classification using Adaboost for brain tumor type[C]// Proceedings of the 2017 IEEE 7th International Advance Computing Conference. Washington, DC: IEEE Computer Society, 2017:701-705.
[15] WANG J, FEI K, CHENG Y. Prediction of rainfall based on improved Adaboost-BP model[J]. Journal of Computer Applications, 2017, 37(9): 2689-2693.
[16] SCHAPIRE R E, FREUND Y, BARTLETT P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods[J]. Annals of Statistics, 1998, 26(5): 1651-1686.
[17] GAO W, ZHOU Z H. On the doubt about margin explanation of boosting[J]. Artificial Intelligence, 2013,203:1-18.
[18] BACHE K, LICHMAN M. UCI repository of machine learning databases[DB/OL].[2018-06-20].http://www.ics.uci.edu/~mlearn/MLRepository.html.
[19] van HULSE J, KHOSHGOFTAAR T M, NAPOLITANO A. Experimental perspectives on learning from imbalanced data[C]// Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007: 935-942.
[20] LIU N, WEI L W, AUNG Z. Handling class imbalance in customer behavior prediction[C]// Proceedings of the 2014 International Conference on Collaboration Technologies and Systems. Piscataway, NJ: IEEE, 2014: 100-103.