馬捍超 沈杰鑫 徐路強 于海



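Abstract: Traditional linearly separable support vector machine (SVM) grouping algorithms suffer from low accuracy when grouping massive sample sets. To address this, a linearly separable SVM grouping algorithm for massive samples is proposed. Given a massive sample set, features are first extracted from the samples, and a score-based method is used for feature selection. The samples are then clustered with a linearly separable SVM, and a linearly separable SVM grouping framework is built and used to group the samples. To verify the grouping accuracy of the proposed algorithm, it was compared experimentally with the traditional linearly separable SVM grouping algorithm. The results show that the proposed algorithm achieves higher grouping accuracy, indicating that it is better suited to grouping massive sample sets.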
Keywords: massive samples; linearly separable; support vector machine; grouping algorithm
CLC number: TP13
Document code: A
Article ID: 1007-757X(2020)11-0082-04
Abstract: The traditional linearly separable support vector machine grouping algorithm has low accuracy when grouping massive samples. Given a massive sample set, features are extracted from it, and feature selection is performed with a score-based method. The samples are then clustered with a linearly separable support vector machine. To verify the grouping accuracy of the linearly separable support vector machine grouping algorithm for massive samples, the algorithm is compared with the traditional linearly separable support vector machine grouping algorithm. The results show that the proposed algorithm is more accurate, which indicates that it is better suited to massive sample sets.
Key words: massive samples; linear separability; support vector machine; grouping algorithm
0 Introduction
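The rapid development of information technology and computer technology has made the resources flowing across the Internet and the content distributed over it increasingly diverse and massive in scale. Rapid advances in data storage and data collection technology have likewise enabled all kinds of organizations to accumulate and acquire large volumes of data. These data exist in several states, including in use, in transit, and at rest. They include normal, usable data from diverse applications, such as industry data from government, education, medicine, markets, and finance, as well as system content such as system logs, security policies, traffic, instant messaging, microblogs, newsgroups, e-mail, and news. They also include content that affects major national interests, endangers social stability, misleads the public, leaks sensitive information, or impairs the availability of resources and information, such as online economic crime, botnet and virus attack traffic, malware, personal private information, fraudulent advertising, spam, pornographic websites, phishing websites, and false information. …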