杜媛 張世偉



摘 要:針對(duì)自然排序樹(CAN-tree)算法構(gòu)建的樹結(jié)構(gòu)節(jié)點(diǎn)個(gè)數(shù)過多、壓縮性不高等問題,提出一種基于重構(gòu)的改進(jìn)CAN-tree算法。首先,使用自然排序法直接構(gòu)建樹結(jié)構(gòu),將頻繁項(xiàng)集挖掘算法實(shí)現(xiàn)中數(shù)據(jù)庫(kù)掃描次數(shù)減少至1;然后,對(duì)構(gòu)建的樹結(jié)構(gòu)以支持度降序方式結(jié)合剪枝操作實(shí)現(xiàn)樹結(jié)構(gòu)的重構(gòu),得到高壓縮性的樹結(jié)構(gòu);最后,對(duì)重構(gòu)的樹結(jié)構(gòu)進(jìn)行頻繁項(xiàng)集挖掘。實(shí)驗(yàn)結(jié)果表明,基于重構(gòu)的改進(jìn)CAN-tree算法所構(gòu)建的樹結(jié)構(gòu)節(jié)點(diǎn)個(gè)數(shù)減少至原來的20%以下,執(zhí)行效率提高了4至6倍,在頻繁項(xiàng)集挖掘中有效地壓縮了樹結(jié)構(gòu),縮短了算法的執(zhí)行時(shí)間。
關(guān)鍵詞:頻繁項(xiàng)集;頻繁項(xiàng)集頭表;重構(gòu);剪枝;最小支持度
中圖分類號(hào): TP301.6
文獻(xiàn)標(biāo)志碼:A
Abstract: In order to solve the problems such as too many nodes and low compressibility in the tree structure constructed by CANonical-order tree (CAN-tree) algorithm, an improved CAN-tree algorithm based on restructure was proposed. Firstly, a tree structure was constructed directly with canonical-order, which scans the database only once in the frequent itemset mining algorithm. Then, in order to get a tree structure with high compressibility, a pruning operation was used with support in desending order to restructure the tree. Finally, frequent itemsets were mined out for the reconstructed tree structure. The experimental results show that compared with original CAN-tree algorithm, the number of nodes constructed by the improved CAN-tree algorithm is reduced to less than 20%,and the execution efficiency is improved by 4 to 6 times. The proposed algorithm shortens the execution time of the frequent itemset mining algorithm and effectively compresses the tree structure in it.
Key words: frequent itemset; frequent itemset header table; restructure; pruning; minimum support
0 引言
在大數(shù)據(jù)時(shí)代,數(shù)據(jù)成為一種資源,數(shù)據(jù)挖掘技術(shù)能夠發(fā)現(xiàn)隱藏在數(shù)據(jù)中的信息,而關(guān)聯(lián)規(guī)則挖掘就是其重要組成部分之一。在關(guān)聯(lián)規(guī)則挖掘應(yīng)用中,頻繁項(xiàng)集挖掘尤為重要。頻繁項(xiàng)集挖掘算法可分成2類:1)Apriori算法[1],需多次掃描庫(kù),影響執(zhí)行效率,同時(shí)還將產(chǎn)生大量候選項(xiàng)集,需大量存儲(chǔ)空間;2)頻繁模式增長(zhǎng)(Frequent Pattern growth, FP-growth)算法[2],僅需掃描2次數(shù)據(jù)庫(kù),使用頻繁模式樹(Frequent Pattern tree,F(xiàn)P-tree)存儲(chǔ)壓縮后的原始數(shù)據(jù),相比Apriori,F(xiàn)P-growth減少了數(shù)據(jù)庫(kù)掃描次數(shù),不僅提高了算法的執(zhí)行效率,也減少了內(nèi)存的占有量。
對(duì)于FP_growth算法,不少學(xué)者提出了改進(jìn)方向:1)修改FP-tree節(jié)點(diǎn)結(jié)構(gòu)[3]、添加Hash表輔助存儲(chǔ)結(jié)構(gòu)[4-5]、使用構(gòu)造鏈表(Building list, B-list)[6]修改頻繁項(xiàng)集表達(dá)形式等以提高節(jié)點(diǎn)查詢速度;2)直接對(duì)最大頻繁項(xiàng)進(jìn)行挖掘,使用剪枝技術(shù)減少FP-tree的節(jié)點(diǎn)數(shù)[7-9];3)使用分布式計(jì)算框架,設(shè)計(jì)并行運(yùn)行的FP-growth算法,如并行FP-growth算法(Parallel FP-growth,PFP)[10]、利用MapReduce模型加上動(dòng)態(tài)閾值法改進(jìn)并行FP-growth算法[11]以縮短算法執(zhí)行時(shí)間。但是這些改進(jìn)都需要掃描數(shù)據(jù)庫(kù)2次:第1次掃描創(chuàng)建降序排列的頻繁項(xiàng)集頭表,第2次掃描構(gòu)建FP-tree,且構(gòu)建過程依賴第1次掃描結(jié)果。而自然排序樹(Canonical-order tree,CAN-tree)算法[12]的提出正好解決了該問題。CAN-tree在構(gòu)建時(shí)使用自然排序法(例如字母排序法),例如事務(wù){(diào)a,c,b},在構(gòu)建時(shí)直接以{a,b,c}順序插入節(jié)點(diǎn),而FP-tree在構(gòu)建時(shí)需要根據(jù){a}、{c}、的支持度計(jì)數(shù)進(jìn)行降序排列,所以CAN-tree僅需掃描數(shù)據(jù)庫(kù)1次即可構(gòu)建完整樹,提高了算法的執(zhí)行效率。但是CAN-tree在構(gòu)建時(shí)保留了所有數(shù)據(jù),這導(dǎo)致樹的結(jié)構(gòu)非常龐大,直接降低了后期挖掘速度。對(duì)此有學(xué)者對(duì)CAN-tree節(jié)點(diǎn)結(jié)構(gòu)進(jìn)行了改進(jìn),如將CAN-tree節(jié)點(diǎn)中指向子節(jié)點(diǎn)的指針改成指向父節(jié)點(diǎn)[13],在此基礎(chǔ)上,基于CAN-tree快速構(gòu)建算法(Fast Construction algorithm based on CAN-tree, FCCAN)[14]又增加了哈希存儲(chǔ)表和葉子頭表,以減少查詢和剪枝時(shí)間,提高條件模式基生成速度,但是這些改進(jìn)都未從本質(zhì)上縮小CAN-tree的結(jié)構(gòu)。對(duì)比FP-tree可以發(fā)現(xiàn),CAN-tree的結(jié)構(gòu)明顯比較稀疏,因?yàn)镕P-tree使用支持度降序排列插入節(jié)點(diǎn),支持度計(jì)數(shù)越大的節(jié)點(diǎn)越靠近Root節(jié)點(diǎn)(因?yàn)镽oot是所有路徑的共享節(jié)點(diǎn)),所以支持度計(jì)數(shù)越大的節(jié)點(diǎn)在樹結(jié)構(gòu)中的共享程度越大,其共享前綴路徑也越多,F(xiàn)P-tree也越緊湊。在頻繁項(xiàng)集挖掘過程中,算法需要對(duì)整棵樹結(jié)構(gòu)進(jìn)行遞歸處理,而龐大的樹結(jié)構(gòu)會(huì)明顯增加遞歸挖掘的層次和執(zhí)行的時(shí)間,所以緊湊的樹結(jié)構(gòu)在后期生成條件模式基進(jìn)行頻繁項(xiàng)集挖掘時(shí)效率更高。
由于CAN-tree構(gòu)建的完整樹結(jié)構(gòu)龐大稀疏、后期遞歸進(jìn)行頻繁項(xiàng)集挖掘時(shí)遞歸層次過多、挖掘效率低等問題,本文著力于縮小CAN-tree構(gòu)建的樹結(jié)構(gòu),提出了一種基于重構(gòu)的改進(jìn)CAN-tree算法: 1)使用自然排序法實(shí)現(xiàn)掃描數(shù)據(jù)庫(kù)1次,構(gòu)造CAN-tree和頻繁項(xiàng)集頭表;2)對(duì)頻繁項(xiàng)集頭表進(jìn)行降序排列,并根據(jù)支持度計(jì)數(shù)降序模式結(jié)合最小支持度剪枝法對(duì)CAN-tree進(jìn)行重構(gòu)。改進(jìn)算法保留了CAN-tree僅需掃描數(shù)據(jù)庫(kù)1次的優(yōu)勢(shì),降低了算法在數(shù)據(jù)庫(kù)讀取上的時(shí)間損耗,同時(shí)對(duì)CAN-tree構(gòu)建的樹結(jié)構(gòu)進(jìn)行重構(gòu),得到壓縮性更高的樹,以減少后期頻繁項(xiàng)集挖掘過程中遞歸掃描層數(shù),進(jìn)一步減少算法執(zhí)行的時(shí)間損耗,提高執(zhí)行效率。
1 頻繁項(xiàng)集
設(shè)I={i1,i2,…,in}是項(xiàng)目的集合,稱為項(xiàng)集。如果項(xiàng)集長(zhǎng)度為k,則稱為k-項(xiàng)集,項(xiàng)集I可稱為n-項(xiàng)集。如果I是頻繁的,則可稱為頻繁n-項(xiàng)集。設(shè)數(shù)據(jù)庫(kù)D,D中每條數(shù)據(jù)是一個(gè)項(xiàng)集,稱為TID,項(xiàng)集中包含一個(gè)或者多個(gè)I中的項(xiàng)。假設(shè)A是I中的一個(gè)子項(xiàng)集,A的支持度計(jì)數(shù)count(A)是D中包含A的個(gè)數(shù),支持度support(A)是count(A)在D中事務(wù)總數(shù)所占的比例,最小支持度可記為min_support。
5 結(jié)語
為減少CAN-tree構(gòu)建的樹結(jié)構(gòu)節(jié)點(diǎn)個(gè)數(shù),提高算法在頻繁項(xiàng)集挖掘過程中的效率,本文提出了一種基于重構(gòu)的改進(jìn)CAN-tree算法,該算法有以下特點(diǎn):1)在構(gòu)建樹結(jié)構(gòu)時(shí)首先采用自然排序法,僅需掃描數(shù)據(jù)庫(kù)一次;2)利用重構(gòu)機(jī)制,采用支持度降序方式構(gòu)建壓縮性更好的樹結(jié)構(gòu),同時(shí)加入剪枝操作以減少樹結(jié)構(gòu)中的節(jié)點(diǎn)個(gè)數(shù),提高挖掘效率。實(shí)驗(yàn)結(jié)果表明,基于重構(gòu)的改進(jìn)CAN-tree算法在頻繁項(xiàng)集挖掘中具有更好的執(zhí)行效率。但是改進(jìn)CAN-tree算法在構(gòu)建樹結(jié)構(gòu)階段中保留了數(shù)據(jù)庫(kù)所有信息,并且比原算法多了重構(gòu)階段,所以改進(jìn)CAN-tree算法在獲得高壓縮性樹結(jié)構(gòu)前對(duì)系統(tǒng)內(nèi)存的開銷較大,對(duì)于這方面的缺陷,下一步將考慮與分布式計(jì)算框架相結(jié)合,同時(shí)在內(nèi)存消耗和算法執(zhí)行時(shí)間上進(jìn)行完善。
參考文獻(xiàn):
[1] AGRAWL R, IMIELINSKI T, SWAMI A. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1993: 207-216.
[2] HAN J, PEI J, YIN Y. Mining frequent patterns without candidate generation [C]// Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York: ACM 2000:1-12.
[3] DING Z, WEI Q, DING X. An improved FP-growth algorithm based on Compound single linked list [C] // ICIC 09Proceedings of the 2009 Second International Conference on Information and Computing Science. Washington, DC: IEEE Computer Society, 2009, 1: 351-353.
[4] HAO J, XU H. An improved algorithm for frequency itemsets mining [C] // Proceedings of the 2017 Fifth International Conference on Advanced Cloud and Big Data. Washington, DC: IEEE Computer Society, 2017: 314-317.
[5] 李也白,唐輝,張淳,等. 基于改進(jìn)的FP-tree的頻繁模式挖掘算法[J]. 計(jì)算機(jī)應(yīng)用,2011,31(1):101-103.(LI Y B, TANG H, ZHANG C, el al. Frequent pattern mining algorithm based on improve FP-tree [J]. Journal of Computer Applications, 2011, 31(1): 101-103.)
[6] 李校林,杜托,劉彪.基于B-list的快速頻繁模式挖掘算法[J].計(jì)算機(jī)應(yīng)用,2017,37(8):2357-2361,2367. (LI X L,DU T,LIU B. Fast algorithm for mining frequent patterns based on B-list [J]. Journal of Computer Applications, 2017, 37(8): 2357-2361, 2367.)
[7] 馬麗生,姚光順,楊傳健.基于改進(jìn)的FP-tree最大頻繁項(xiàng)目集挖掘算法[J].計(jì)算機(jī)應(yīng)用,2012,32(2):326-329.(MA L S, YAO G S, YANG C J. Mining algorithm for maximal frequent itemsets based on improved FP-tree [J]. Journal of Computer Applications, 2012, 32(2): 326-329.)
[8] 寧慧,王素紅,催立剛,等. 基于改進(jìn)的FP-tree最大頻繁模式挖掘算法[J]. 應(yīng)用科技,2016,43(2):37-43.(NING H, WANG S H, CUI L G, et al. An algorithm for mining maximal frequent patterns based on improved FP-tree[J]. Applied Science and Technology,2016,43(2):37-43.)
[9] 趙陽,吳廖丹. 一種自底向上的最大頻繁項(xiàng)集挖掘方法[J]. 計(jì)算機(jī)技術(shù)與發(fā)展,2017,27(8):57-60.(ZHAO Y,WU L D. A bottom-up method for mining maximum frequent itemsets [J]. Computer Technology and Development, 2017, 27(8): 57-60.)
[10] LI H, WANG Y, ZHAN D, el al. PFP: parallel FP-growth for query recommendation [C]// Proceedings of the 2008 ACM Conference on Recommender Systems. New York: ACM, 2008: 107-114.
[11] WEI X, MA Y, ZHANG F, el al. Incremental FP-Growth mining strategy for dynamic threshold value and database based on MapReduce [C] // Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Wori in Design. Piscataway, NJ: IEEE, 2014: 271-276.
[12] LEUNG C K-S, KHAN Q I, HOQUE T. CANTree: a tree structure for efficient incremental mining of frequent patterns [C] // Proceedings of the Fifth IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2005: 274-281.
[13] 鄒力鹍,張其善. 基于CAN-樹的高效關(guān)聯(lián)規(guī)則增量挖掘算法[J]. 計(jì)算機(jī)工程,2008,34(3):29-31.(ZOU L K, ZHANG Q S. Efficient incremental association rules mining algorithm based on CAN-tree [J]. Computer Engineering, 2008, 34(3): 29-31.)
[14] 陳剛,閆英戰(zhàn),劉秉權(quán). 一種基于CAN-tree快速構(gòu)建算法[J]. 微電子學(xué)與計(jì)算機(jī),2014,31(1):76-82.(CHEN G, YAN Y Z, LIU B Q. A fast construction algorithm based on CAN-tree[J]. Microelectronics & Computer, 2014, 32(1): 76-82.)
[15] TAN P M, STEINBACH M, KUMAR V. 數(shù)據(jù)挖掘?qū)д摚ㄍ暾妫M]. 范明,等譯. 北京:人民郵電出版社,2017: 225-228.(TAN P M, STEINBACH M, KUMAR V. Writing Introduction to Data Mining [M]. FAN M, translated. Beijing:Post & Telecom Press, 2017: 225-228.)