
Credit card fraud classification based on the GAN-AdaBoost-DT imbalanced classification algorithm

MO Zan, GAI Yanrong, FAN Guanlong

Journal of Computer Applications, 2019, Issue 2 (posted 2019-08-01 01:57:38)

Abstract: Traditional single classifiers perform poorly on imbalanced data. To address this, a new classification algorithm for binary-class imbalanced data sets, Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT), was proposed based on Generative Adversarial Nets (GAN) and ensemble learning. Firstly, GAN training was used to obtain a generative model, which generated minority class samples to reduce the imbalance ratio. Then, the generated minority class samples were brought into the Adaptive Boosting (AdaBoost) framework and their weights were changed, improving the AdaBoost model and the classification performance of AdaBoost with Decision Tree (DT) as the base classifier. The Area Under the receiver operating characteristic Curve (AUC) was used as the evaluation metric. Experimental results on a credit card fraud data set show that, compared with synthetic minority over-sampling ensemble learning, the proposed algorithm increased accuracy by 4.5% and AUC by 6.5%; compared with modified synthetic minority over-sampling ensemble learning, it increased accuracy by 4.9% and AUC by 5.9%; compared with random under-sampling ensemble learning, it increased accuracy by 4.5% and AUC by 5.4%. Experimental results on other data sets from UCI and KEEL show that the algorithm can improve overall accuracy on imbalanced binary classification problems and optimize classifier performance.

Key words: Generative Adversarial Nets (GAN); ensemble learning; imbalanced classification; binary-class classification; Adaptive Boosting (AdaBoost); Decision Tree (DT); credit card fraud

CLC number: TP391

Document code: A

0 Introduction

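Imbalanced data refers to data sets in which one or several classes have far more samples than the others. Classes with many samples are usually called majority classes, and those with few samples minority classes [1]. In imbalanced data sets, recognizing the minority class matters most. In fault diagnosis [2], for example, machine faults are the minority class, and diagnosing a fault as normal delays the project and causes unnecessary losses. Because of the complex characteristics of imbalanced data sets, traditional classification algorithms learn fewer classification rules for the minority class than for the majority class, and those rules perform worse [3]; this is the imbalanced classification problem. It has become one of the challenges in data mining [4] and is common in bank credit rating [5], anomaly detection [6], face recognition [7], medical diagnosis [8], e-mail classification [9], and other fields.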

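The credit card fraud detection problem studied in this paper is also an imbalanced classification problem. In credit card fraud detection, a bank predicts whether a customer's payment record is a fraudulent transaction from feature variables related to the customer's credit status. Fraudulent transactions are the minority class, but the financial loss caused by misclassifying a single fraudulent transaction cannot be recovered by correctly classifying hundreds or thousands of normal transactions. To avoid losses from credit risk, identifying fraudulent transaction records is particularly important.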

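Current methods for handling the imbalance problem fall into two categories. One common approach works at the data level, using under-sampling or over-sampling to redistribute the classes, for example the Synthetic Minority Over-sampling Technique (SMOTE) proposed in [10] and the Adaptive Synthetic Sampling Approach (ADASYN) proposed in [11]. Under-sampling can improve a model's performance on minority-class samples, but it discards information from the majority class, so the model cannot make full use of the available information. Traditional over-sampling can generate minority-class samples, but because they are generated from the existing minority data, they only reflect the information those samples already contain, lack diversity, and can cause overfitting to some degree.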
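The two data-level strategies can be illustrated with a minimal sketch. It is not the implementation from [10] or [11]: the function names, the toy Gaussian data, and the parameter `k` are assumptions made for illustration, and the over-sampler only mimics the interpolation idea behind SMOTE.

```python
import random

def random_under_sample(majority, n_keep, rng):
    """Keep a random subset of the majority class; the rest is discarded."""
    return rng.sample(majority, n_keep)

def smote_like_over_sample(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

rng = random.Random(0)
# Hypothetical toy data: 100 majority points, 5 minority points.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

kept = random_under_sample(majority, n_keep=5, rng=rng)       # 95 points lost
new_points = smote_like_over_sample(minority, n_new=95, k=3, rng=rng)
balanced_minority = minority + new_points
```

Every synthetic point is a convex combination of two existing minority samples, so it cannot leave the region those samples already cover. This is the lack of diversity described above.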

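The other category works at the algorithm level and includes ensemble learning and cost-sensitive learning. Ensemble learning combines multiple classifiers to avoid the bias a single classifier shows when predicting on imbalanced data [12]. Examples are SMOTEBoost [13], which introduces SMOTE into each iteration of Adaptive Boosting (AdaBoost), and RUSBoost [14], which introduces Random Under-Sampling (RUS) into each iteration of AdaBoost. Cost-sensitive learning assigns a higher cost to misclassifying the minority class during the algorithm's iterations [15] and is usually combined with ensemble learning. Cost-sensitive methods only modify the algorithm, add no computational overhead, are efficient, and can effectively improve classification of imbalanced data; however, because the cost-sensitive loss is introduced subjectively, the design of the loss function affects how the iterations behave, and the methods generally have weak applicability [16].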
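As a rough illustration of the boosting framework that SMOTEBoost and RUSBoost build on, the sketch below implements plain discrete AdaBoost with one-feature threshold rules (decision stumps) as base classifiers. It contains no resampling or cost-sensitive step. All function names and the ten-point toy data set are assumptions made for this example.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[j] <= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] <= thr else -sign

def adaboost_fit(X, y, rounds):
    """Discrete AdaBoost for labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # vote of this stump
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold separates.
X = [(float(i),) for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost_fit(X, y, rounds=1)
ensemble = adaboost_fit(X, y, rounds=3)
single_acc = sum(adaboost_predict(single, x) == t for x, t in zip(X, y)) / len(X)
predictions = [adaboost_predict(ensemble, x) for x in X]
```

On this toy set a single stump classifies 7 of the 10 points correctly, while three boosted stumps fit all ten. The reweighting step is where variants such as SMOTEBoost and RUSBoost insert their per-iteration resampling.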

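Therefore, this paper generates minority-class samples at the data level to balance the data and thereby improve traditional classification algorithms. Generative Adversarial Nets (GAN) [17], proposed in 2014, is a generative model. Compared with traditional generative models, it can generate synthetic data that approximates the real data without being built directly on real samples, which increases data diversity and avoids overfitting.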

Because a single method can hardly meet the requirements of different imbalanced data sets and generally has weak applicability, while a combined prediction model can exploit the strengths of each individual model and improve overall prediction, this paper proposes the Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT) algorithm for imbalanced binary classification. The algorithm first uses GAN to generate minority-class samples so that the data become balanced, and then applies the AdaBoost ensemble framework with Decision Tree (DT) as the base classifier, using the ensemble idea to improve DT's classification ability on imbalanced data sets. The Area Under the receiver operating characteristic Curve (AUC) is used as the evaluation metric.
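To show why AUC is a common choice of metric under imbalance, the sketch below computes it by its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `auc` and the toy labels and scores are assumptions for illustration, not data from the paper.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy imbalanced labels: 8 negatives (majority), 2 positives (minority).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that scores every sample identically (always "majority").
constant_scores = [0.0] * 10
accuracy_constant = sum(1 for y in y_true if y == 0) / len(y_true)
auc_constant = auc(y_true, constant_scores)

# A classifier whose scores rank most positives above the negatives.
ranked_scores = [0.1, 0.2, 0.3, 0.1, 0.2, 0.4, 0.6, 0.3, 0.5, 0.9]
auc_ranked = auc(y_true, ranked_scores)
```

The constant classifier reaches 80% accuracy on this toy set yet has an AUC of only 0.5. The ranking classifier scores 0.9375, because 15 of the 16 positive-negative pairs are ordered correctly.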

1 Related work

1.1 GAN algorithm

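GAN is a generative model proposed in 2014 on the basis of zero-sum game theory. It consists of a neural-network-based generative model (G) and a discriminative model (D): G generates data from a noise space z, and D judges whether a given sample is real or produced by G. The process amounts to a two-player game. G is trained so that the generated data approach the distribution of the real data, while D is trained to distinguish real data from generated data. The two are optimized alternately, so the performance of both D and G keeps improving until the two networks reach a dynamic equilibrium in which D assigns a probability close to 0.5 that data generated by G are real; at that point the generated data approximate the real data. The computation flow is shown in Fig. 1.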
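For reference, the two-player game described above is usually written as the minimax objective from [17]. The symbols p_data (real-data distribution), p_z (noise prior) and p_g (distribution of generated samples) follow that paper and are assumed here; this article's own notation may differ.

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

% For a fixed generator G, the optimal discriminator is
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{g}(x)}

% so when p_g = p_data, D^{*}(x) = 1/2 for every x.
```

The last line is the equilibrium mentioned in the text: once the generated distribution matches the real one, the best possible discriminator outputs 0.5 everywhere.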

4 Conclusion

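To address the poor performance of traditional classification algorithms on imbalanced classification problems, this paper proposed the GAN-AdaBoost-DT algorithm for imbalanced binary classification. The algorithm improves AdaBoost with generative adversarial nets: in each AdaBoost iteration, GAN generates minority-class data to lower the imbalance ratio, thereby improving the classification performance of AdaBoost-DT. Experimental results on the credit card fraud data set show that the method improves the recognition rate on imbalanced data and the overall classifier performance. Experimental results on five UCI and KEEL data sets show that the method achieves a higher recognition rate and better classification performance than the other algorithms.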

References:

[1] SEARLE S R. Linear Models for Unbalanced Data [M]. New York: John Wiley & Sons, 1987: 145-153.

[2] YANG Z, TANG W H, SHINTEMIROV A, et al. Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2009, 39(6): 597-610.

[3] SUN Y, KAMEL M S, WONG A K C, et al. Cost-sensitive boosting for classification of imbalanced data [J]. Pattern Recognition, 2007, 40(12): 3358-3378.

[4] YANG Q, WU X. 10 challenging problems in data mining research [J]. International Journal of Information Technology & Decision Making, 2006, 5(4): 597-604.

[5] BROWN I, MUES C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets [J]. Expert Systems with Applications, 2012, 39(3): 3446-3453.

[6] TAVALLAEE M, STAKHANOVA N, GHORBANI A A. Toward credible evaluation of anomaly-based intrusion-detection methods [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2010, 40(5): 516-524.

[7] LIU Y-H, CHEN Y-T. Total margin based adaptive fuzzy support vector machines for multiview face recognition [C]// Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics. Washington, DC: IEEE Computer Society, 2005, 2: 1704-1711.

[8] MAZUROWSKI M A, HABAS P A, ZURADA J M, et al. Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance [J]. Neural Networks, 2008, 21(2/3): 427-436.

[9] BERMEJO P, GAMEZ J A, PUERTA J M. Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets [J]. Expert Systems with Applications, 2011, 38(3): 2072-2080.

[10] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: Synthetic Minority Over-sampling Technique [J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.

[11] HE H, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning [C]// Proceeding of the 2008 International Joint Conference on Neural Networks. Piscataway, NJ: IEEE, 2008: 1322-1328.

[12] FREUND Y, SCHAPIRE R E. Experiments with a new boosting algorithm [C]// Proceedings of the Thirteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1996: 148-156.

[13] CHAWLA N V, LAZAREVIC A, HALL L O, et al. SMOTEBoost: improving prediction of the minority class in boosting [C]// Proceedings of the 2003 European Conference on Knowledge Discovery in Databases, LNCS 2838. Berlin: Springer, 2003: 107-119.

[14] SEIFFERT C, KHOSHGOFTAAR T M, van HULSE J, et al. RUSBoost: a hybrid approach to alleviating class imbalance [J]. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010, 40(1): 185-197.

[15] FAN W, STOLFO S J, ZHANG J, et al. AdaCost: misclassification cost-sensitive boosting [C]// Proceedings of the 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1999: 97-105.

[16] CATENI S, COLLA V, VANNUCCI M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems [J]. Neurocomputing, 2014, 135: 32-41.

[17] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014, 2: 2672-2680.

[18] GOODFELLOW I. NIPS 2016 tutorial: generative adversarial networks [EB/OL]. (2016-12-31) [2017-09-24]. https://arxiv.org/pdf/1701.00160.pdf.

[19] LI J, MONROE W, SHI T, et al. Adversarial learning for neural dialogue generation [EB/OL]. (2017-07-13) [2018-05-02]. https://arxiv.org/pdf/1701.06547v1.pdf.

[20] YU L, ZHANG W, WANG J, et al. SeqGAN: sequence generative adversarial nets with policy gradient [EB/OL]. (2017-08-25) [2018-05-02]. https://arxiv.org/pdf/1609.05473.pdf.

[21] HU W W, TAN Y. Generating adversarial malware examples for black-box attacks based on GAN [EB/OL]. (2017-02-20) [2018-05-02]. https://arxiv.org/pdf/1702.05983v1.pdf.

[22] CHIDAMBARAM M, QI Y. Style transfer generative adversarial networks: learning to play chess differently [EB/OL]. (2017-05-07) [2018-07-02]. https://arxiv.org/pdf/1702.06762v1.pdf.

[23] FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting [J]. Journal of Computer & System Sciences, 1997, 55(1): 119-139.

[24] HUNT E, KRIVANEK J. The effects of pentylenetetrazole and methylphenoxypropane on discrimination learning [J]. Psychopharmacology, 1966, 9(1): 1-16.

[25] BOSE I, FARQUAD M A H. Preprocessing unbalanced data using support vector machine [J]. Decision Support Systems, 2012, 53(1): 226-233.

[26] ZHANG S, ZHANG H X. Modified KNN algorithm for multi-label learning [J]. Application Research of Computers, 2011, 28(12): 4445-4450. (in Chinese)

[27] LI Y J, GUO H X, LI Y N, et al. A boosting based on ensemble learning algorithm in imbalanced data classification [J]. Systems Engineering - Theory & Practice, 2016, 36(1): 189-199. (in Chinese)
