CLC number: TP18   Document code: A
Article ID: 1001-3695(2022)01-018-0102-04
doi:10.19734/j.issn.1001-3695.2021.06.0294
Improved ID3 decision tree algorithm induced by neighborhood equivalence relation
Xie Xin(a,b), Zhang Xianyong(a,b), Yang Jilin(b,c)
(a. School of Mathematical Sciences, b. Institute of Intelligent Information & Quantum Information, c. College of Computer Science, Sichuan Normal University, Chengdu 610066, China)
Abstract: The classical ID3 decision tree algorithm is suited to discrete data classification, but it requires data discretization when processing continuous data, and this process easily causes information loss. This paper proposes a neighborhood equivalence relation to induce the neighborhood ID3 (NID3) decision tree algorithm. The NID3 algorithm improves the ID3 decision tree algorithm: it can directly implement prediction on continuous data and obtains better classification results. In a neighborhood decision system, this paper mines a neighborhood equivalence relation; then, based on neighborhood equivalence granulation, it constructs neighborhood information measures; finally, it designs the NID3 decision tree algorithm based on the neighborhood information gain. As verified by both case analyses and data experiments, the NID3 algorithm is effective for continuous data classification and outperforms the ID3 algorithm in classification machine learning.
Key words: decision tree; ID3 algorithm; neighborhood rough set; neighborhood equivalence relation; neighborhood information gain; machine learning
Rough sets are an important approach to processing uncertain information and are widely applied in data mining, pattern recognition, decision analysis, and related fields. Classical rough sets rely on equivalence relations and equivalence partitions, which suit discrete data; analyzing continuous data requires prior discretization, which easily loses information. Neighborhood rough sets extend the classical model by introducing a distance metric and a radius threshold, granulating information through neighborhood coverings; they can process continuous and even hybrid data directly and have been studied and applied in depth [1~4]. At the same time, neighborhood coverings carry granular information redundancy, so improved structural characterizations benefit information extraction and intelligent learning and are worth investigating.
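The neighborhood granulation just described (a distance metric with a radius threshold, producing a possibly overlapping neighborhood covering rather than an equivalence partition) can be sketched as follows; the Euclidean metric, function name, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def delta_neighborhood(X, i, delta):
    """Return indices of samples within Euclidean distance delta of sample i.

    X     : (n_samples, n_features) array of continuous attribute values
    i     : index of the center sample
    delta : radius threshold controlling the fineness of granulation
    """
    dists = np.linalg.norm(X - X[i], axis=1)  # distance from sample i to every sample
    return np.flatnonzero(dists <= delta)

# Neighborhood covering: one granule per sample; granules may overlap,
# unlike the disjoint blocks of an equivalence partition.
X = np.array([[0.10], [0.12], [0.50], [0.52], [0.90]])
cover = [delta_neighborhood(X, i, delta=0.05) for i in range(len(X))]
```

Note that shrinking the threshold toward 0 recovers ever finer granules, which is how the radius controls the trade-off between discrimination and generalization.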
For classification prediction, decision tree algorithms offer fast classification and simple, interpretable rules, and the associated induction algorithms and knowledge discovery have become basic topics of machine learning [5~10]. Decision tree construction mainly relies on a measure function to obtain the node-splitting criterion, where the measure function quantifies the information uncertainty of features. For example, the classical ID3 [11], C4.5 [12], and CART [13] algorithms use information gain, information gain ratio, and the Gini index, respectively. Many improved decision tree algorithms have been proposed on the basis of ID3: reference [14] proposed an optimized ID3 construction algorithm based on Spearman's rank correlation coefficient; reference [15] improved ID3 with an equilibrium coefficient combining attribute similarity and cost-performance values; reference [16] introduced the rough set attribute consistency degree into ID3. Thus improving ID3 remains a basic line of development; meanwhile, ID3 and its optimized variants still mainly target discrete data mining, and direct processing of continuous data is still lacking.
In summary, this paper works in the setting of neighborhood rough sets and the associated continuous data analysis, constructs a new granulation relation and new information measures, and thereby proposes the neighborhood ID3 (NID3) decision tree algorithm, improving on ID3. In a neighborhood decision system, neighborhood granulation first generates a neighborhood equivalence relation; a system of neighborhood information measures is then built by analogy with the classical information measures; the NID3 algorithm follows naturally from the neighborhood information gain; finally, an illustrative example and data experiments validate the approach.
1 ID3 decision tree algorithm
This section reviews the classical ID3 decision tree algorithm based on information gain, in the setting of decision tables.
A quadruple DT=(U, AT=C∪D, {Va|a∈AT}, {Ia|a∈AT}) is called a decision table, abbreviated DT=(U, C∪D). Here U is a nonempty finite universe of objects, AT is a nonempty finite attribute set, Va is the value domain of attribute a∈AT, and the information function Ia: U→Va assigns to each object x a unique value Ia(x) on attribute a. Let B⊆C be a subset of condition attributes; its equivalence relation IND(B) induces the equivalence partition U/IND(B)={Bi|i=1,…,n}, and let the decision classification be U/D={D1,…,Dm}.
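As a minimal illustration of the notions just defined and of the classical ID3 splitting rule reviewed in this section, the following sketch partitions a discrete decision table by an attribute and selects the split with maximal information gain; the table contents and attribute names are hypothetical.

```python
from collections import defaultdict
from math import log2

def partition(U, attr):
    """Equivalence classes U/IND({attr}): objects grouped by their value on attr."""
    blocks = defaultdict(list)
    for x in U:
        blocks[x[attr]].append(x)
    return list(blocks.values())

def entropy(U, decision):
    """Shannon entropy of the decision classification U/D, in bits."""
    counts = defaultdict(int)
    for x in U:
        counts[x[decision]] += 1
    n = len(U)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(U, attr, decision):
    """Entropy reduction achieved by splitting U along attr (ID3 criterion)."""
    n = len(U)
    cond = sum(len(B) / n * entropy(B, decision) for B in partition(U, attr))
    return entropy(U, decision) - cond

def best_split(U, attrs, decision):
    """ID3 node-splitting rule: choose the attribute with maximal information gain."""
    return max(attrs, key=lambda a: info_gain(U, a, decision))

# Hypothetical discrete decision table: each object is a dict of attribute values,
# with 'play' as the decision attribute D and the rest as condition attributes C.
U = [
    {'outlook': 'sunny', 'windy': 'no',  'play': 'no'},
    {'outlook': 'sunny', 'windy': 'yes', 'play': 'no'},
    {'outlook': 'rain',  'windy': 'no',  'play': 'yes'},
    {'outlook': 'rain',  'windy': 'yes', 'play': 'yes'},
]
```

Here 'outlook' determines the decision perfectly while 'windy' carries no information, so ID3 selects 'outlook' as the split attribute at the root.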
4 Comparative data experiments on decision tree algorithms
This section reports data experiments verifying the effectiveness of the NID3 algorithm. Table 5 lists 10 UCI data sets [17], which mainly contain continuous data. To demonstrate the improvement over the traditional ID3 algorithm, the data sets are discretized by equal-width partitioning [18] with 5 intervals. In addition, the C4.5 algorithm is used for comparison; for discretizing continuous attributes, C4.5 applies information gain based on a probabilistic assumption about the data distribution [19]. Decision tree quality is evaluated by two basic indexes, accuracy and number of leaves, with ten-fold cross-validation for learning and verification; five neighborhood thresholds, 0.1, 0.3, 0.5, 0.7, and 0.9, are adopted. Table 6 reports the numerical results for the NID3 variants and the discretization-based ID3 algorithm, while Figs. 3 and 4 visualize the comparisons of accuracy and leaf number, respectively.
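The equal-width discretization applied to the ID3 baseline (5 intervals per continuous attribute, as stated above) can be sketched as follows; NumPy and the function name are assumptions for illustration, not the cited implementation [18].

```python
import numpy as np

def equal_width_discretize(values, n_bins=5):
    """Map continuous values to interval codes 0..n_bins-1 using
    equal-width bins over [min, max]."""
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # digitize against the interior edges only; clip keeps the maximum
    # value inside the last bin instead of spilling past it
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

values = np.array([0.0, 0.1, 0.35, 0.55, 0.75, 1.0])
codes = equal_width_discretize(values, n_bins=5)
```

Applying this per attribute turns each continuous column into a discrete one that ID3 can split on directly, at the cost of the information loss the paper seeks to avoid.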
According to Table 6 and Figs. 3 and 4, the NID3 algorithm is effective: it can be applied directly to classification prediction on continuous data, and it generally achieves better classification performance than the discretization-based ID3 and C4.5 algorithms, especially on the primary index of classification accuracy. In fact, from the statistical viewpoint of arithmetic means, NID3 achieves higher accuracy than ID3 at every threshold 0.1, 0.3, 0.5, 0.7, 0.9, and higher accuracy than C4.5 at thresholds 0.1, 0.3, 0.5, 0.9. Across the thresholds 0.1, 0.3, 0.5, 0.7, 0.9, NID3 attains the best accuracy on 10 data sets in total, while ID3 and C4.5 attain the best accuracy on only 4 data sets in total, with some ties among algorithms. As for the number of leaves, ID3 and C4.5 are generally best, but the gap to NID3 is not large: the values stay roughly within one order of magnitude or within an acceptable range.
5 Conclusion
Exploiting the advantage of neighborhood granulation for processing continuous data, this paper proposed the neighborhood equivalence relation on neighborhood decision systems, and from it derived neighborhood information measures and the neighborhood decision tree algorithm NID3, which resolves the discretization limitation of the traditional ID3 algorithm and improves classification performance. A case analysis exhibited the underlying structural mechanism and theoretical results, and data experiments demonstrated the effectiveness and improvement of the new NID3 algorithm. Uncertainty characterization and decision tree classification learning based on neighborhood equivalence relations deserve further research and application; our team is working on this and has obtained some preliminary results.
References:
[1] Yao Sheng, Xu Feng, Zhao Peng, et al. Research on rule extraction method of neighborhood rough set model [J]. Journal of Chinese Computer Systems, 2018, 39(6): 1323-1327. (in Chinese)
[2]Zhang Xianyong, Gou Hongyuan, Lyu Zhiying, et al. Double-quantitative distance measurement and classification learning based on the tri-level granular structure of neighborhood system[J].Knowledge-Based Systems,2021,217:106799.
[3] Ye Mingquan, Gao Lingyun, Wu Changrong, et al. Tumor classification information gene selection based on symmetric uncertainty and neighborhood rough set [J]. Journal of Data Acquisition & Processing, 2018, 149(3): 426-435. (in Chinese)
[4] Chu Xiaoli, Sun Bingzhen, Li Xue. Neighborhood rough set-based three-way clustering considering attribute correlations: an approach to classification of potential gout groups [J]. Information Sciences, 2020, 535: 28-41.
[5] Zhang Min, Peng Hongwei, Yan Xiaoling. Improved fuzzy decision tree algorithm based on neural network [J]. Computer Engineering and Applications, 2021, 57(21): 174-179. (in Chinese)
[6] Yao Yuesong, Zhang Xianyong, Chen Shuai, et al. Decision tree induction algorithm based on attribute purity [J]. Computer Engineering and Design, 2021, 42(1): 142-149. (in Chinese)
[7] Toth E G, Gibbs D, Moczygemba J, et al. Decision tree modeling in R software to aid clinical decision making [J]. Health and Technology, 2021, 11: 535-545.
[8]Li Zhenhua, Zhang Yujie, Ahmed A, et al. Fault diagnosis of transformer windings based on decision tree and fully connected neural network[J].Energies,2021,14(6):1531.
[9] Piao Minghao, Jin Chenghao, Lee J, et al. Decision tree ensemble-based wafer map failure pattern recognition based on radon transform-based features [J]. IEEE Trans on Semiconductor Manufacturing, 2018, 31(2): 250-257.
[10] Karimi R, Nanopoulos A, Schmidt-Thieme L. A supervised active learning framework for recommender systems based on decision trees [J]. User Modeling and User-Adapted Interaction, 2015, 25(1): 39-64.
[11]Quinlan J R. Induction of decision trees[J].Machine Learning,1986,1(1):81-106.
[12] Wang Wenxia. Improved C4.5 decision tree classification algorithm in data mining [J]. Journal of Jilin University: Science Edition, 2017, 55(5): 1274-1277. (in Chinese)
[13] Liu Yafen. Improved algorithm and application of CART decision tree based on GA [D]. Guangzhou: Guangzhou University, 2020. (in Chinese)
[14] Wu Sibo, Chen Zhigang, Huang Rui. ID3 optimization algorithm based on correlation coefficient [J]. Computer Engineering and Science, 2016, 38(11): 2342-2347. (in Chinese)
[15] Dong Yuehua, Liu Li. Decision tree optimization algorithm based on equilibrium coefficient [J]. Computer Applications and Software, 2016, 33(7): 266-272. (in Chinese)
[16] Wang Zijing, Liu Yu. Decision tree ID3 new attribute selection method [J]. Modern Electronic Technology, 2018, 41(23): 9-12. (in Chinese)
[17] Dua D, Graff C. UCI machine learning repository [DB/OL]. (2017-05-10). http://archive.ics.uci.edu/ml.
[18] Benoît G. Data discretization for novel relationship discovery in information retrieval [J]. Journal of the American Society for Information Science and Technology, 2002, 53(9): 736-746.
[19] Xu Xingfeng. Research on optimization of C4.5 algorithm combined with K-means [D]. Nanchang: Jiangxi Normal University, 2020. (in Chinese)