金媛媛 李丹 楊明
CLC number: TN911.1-34; TP309          Document code: A          Article ID: 1004-373X(2019)19-0112-03
Abstract: To address the lack of effective use of evaluation data in the selection of academic-competition contestants, a mining algorithm based on entropy-weighted clustering is proposed to cluster the subject data sets and achieve a scientific and rational talent-selection mechanism. The data are collected through manual statistics and normalized in preprocessing, and sparse scores are used for data feature selection so that non-essential clustering features are filtered out. The entropy-weighted clustering algorithm is then used to mine the competition-member allocation scheme with the optimal solution. The results of an example analysis show that the entropy-weighted clustering algorithm runs more efficiently than the standard Apriori algorithm, which verifies the rationality and effectiveness of the proposed method.
Keywords: cluster analysis; talent assessment; entropy weighting; data mining; normalization preprocessing; data feature selection
As an emerging computer science technology, data mining has gradually been applied across all sectors of society; it can find valuable or correlated information in massive data and typically covers three aspects: the knowledge discovery process, data mining classification, and data mining applications. Cluster analysis is currently one of the most widely used data mining methods and can be viewed as the process of partitioning a set of data objects. Reference [1] proposed a clustering method suitable for mining trajectory patterns and routes. Reference [2] proposed a process-mining framework for correlating, predicting, and clustering dynamic behavior based on event logs. Reference [3] proposed an ad hoc network optimization scheme based on an entropy-weighted clustering algorithm.
With the continuous improvement of teaching quality in China, academic competitions of all kinds have become a platform for universities to demonstrate their teaching strength, and they strongly promote students' professional competence and interest in learning. Because preparation time is short and selecting contestants for a given subject is difficult, how to assign the students best suited to each subject has practical research significance. However, research in this area is still scarce and limited to Apriori association-rule mining, e.g., reference [4]. This paper therefore proposes a mining algorithm based on entropy-weighted clustering that clusters the subject data sets and realizes a scientific and rational talent-selection mechanism, thereby solving the multi-objective problem of allocating competition members. Experimental results show that, like the Apriori algorithm, the entropy-weighted clustering algorithm can be applied effectively to contestant selection, but with a shorter solution time.
1.1  Normalization of statistical data
The input sample data involved in contestant selection are first collected through manual statistics and then normalized in preprocessing [5-6], as follows:
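The normalization formula itself is elided from this excerpt. As an illustrative sketch only, assuming a column-wise min-max scheme (the function name and the scheme are assumptions, not necessarily the authors' exact formula), the preprocessing step could look like:

```python
import numpy as np

def min_max_normalize(X):
    """Illustrative min-max preprocessing (assumed scheme, not the paper's
    exact formula): map each column of the sample matrix to [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # constant columns map to 0; avoid /0
    return (X - col_min) / col_range
```

Here each row would be one student and each column one evaluation indicator (e.g., academic performance, interest index, potential index), so that indicators on different scales become comparable before clustering.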


It can be seen that as the number of students increases, the running time of both algorithms grows; at the same data sizes, however, the entropy-weighted clustering mining algorithm runs in less time than the standard Apriori association-rule mining algorithm, i.e., it mines more efficiently.
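The entropy-weighted clustering procedure used in this comparison is not reproduced in this excerpt. A minimal sketch of one common entropy-weighted k-means variant is given below; the function name, the farthest-point initialization, and the `gamma` regularization parameter are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def entropy_weighted_kmeans(X, k, gamma=1.0, n_iter=30):
    """Sketch of entropy-weighted k-means: each cluster keeps one weight
    per feature, updated from the within-cluster dispersion so that
    uninformative features are down-weighted (entropy regularization)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # deterministic farthest-point initialization (illustrative choice)
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    weights = np.full((k, d), 1.0 / d)   # per-cluster feature weights, sum to 1
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: weighted squared distance to each center
        dist = np.stack([((X - centers[c]) ** 2 * weights[c]).sum(axis=1)
                         for c in range(k)], axis=1)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            centers[c] = members.mean(axis=0)
            # entropy-regularized update: w_j ∝ exp(-dispersion_j / gamma)
            disp = ((members - centers[c]) ** 2).sum(axis=0)
            w = np.exp(-(disp - disp.min()) / gamma)   # shift for stability
            weights[c] = w / w.sum()
    return labels, centers, weights
```

Compared with Apriori, which must repeatedly scan the data set to count candidate itemsets, a scheme of this shape touches the data only once per iteration, which is consistent with the shorter running times reported above.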
This paper proposes a mining algorithm based on entropy-weighted clustering that clusters the subject data sets, realizes a scientific and rational talent-selection mechanism, and solves the multi-objective problem of allocating competition members. A sparse-score representation is adopted to reduce the data dimensionality, and the clustering matrix is computed from three evaluation indicators: academic performance, interest index, and potential index. An example verifies the effectiveness and efficiency of the proposed algorithm. However, as the support threshold is gradually increased during mining, the algorithm's running efficiency degrades considerably; this will be the focus of follow-up research.
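The sparse-score feature-selection step summarized above is not detailed in this excerpt. As an illustrative stand-in, a Laplacian-score-style filter (a closely related unsupervised feature-ranking criterion, not the paper's sparse-score formula; all names below are assumptions) could rank indicators as follows:

```python
import numpy as np

def laplacian_scores(X, k_neighbors=5, t=1.0):
    """Illustrative unsupervised feature ranking (Laplacian-score style,
    used here as a stand-in for the paper's sparse score): a lower score
    means the feature better preserves the local neighborhood structure."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # heat-kernel similarity on a k-nearest-neighbor graph
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(sq, axis=1)[:, 1:k_neighbors + 1]   # skip self
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k_neighbors), idx.ravel()] = True
    W = np.where(mask | mask.T, np.exp(-sq / t), 0.0)
    D = W.sum(axis=1)                 # node degrees
    L = np.diag(D) - W                # graph Laplacian
    scores = np.empty(d)
    for r in range(d):
        f = X[:, r]
        f = f - (f @ D) / D.sum()     # degree-weighted centering
        scores[r] = (f @ L @ f) / max((f ** 2 * D).sum(), 1e-12)
    return scores
```

Features with the highest scores carry little local structure and would be the "non-essential clustering features" filtered out before the entropy-weighted clustering step.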
References
[1] HUNG C C, PENG W C, LEE W C. Clustering and aggregating clues of trajectories for mining trajectory patterns and routes [J]. The VLDB journal, 2015, 24(2): 169-192.
[2] LEONI M D, AALST W M P V D, DEES M. A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs [J]. Information systems, 2016, 56(3): 235-257.
[3] FATHIAN M, JAFARIAN-MOGHADDAM A R. New clustering algorithms for vehicular ad-hoc network in a highway communication environment [J]. Wireless networks, 2015, 21(8): 2765-2780.
[4] LI Yulan. Improved Apriori algorithm and its application in the selection of contestants for the Olympiad in Informatics [D]. Quanzhou: Huaqiao University, 2015.
[5] CASTRO P M. Normalized multiparametric disaggregation: an efficient relaxation for mixed-integer bilinear problems [J]. Journal of global optimization, 2016, 64(4): 765-784.
[6] GLEASON S, RUF C S, CLARIZIA M P, et al. Calibration and unwrapping of the normalized scattering cross section for the cyclone global navigation satellite system [J]. IEEE transactions on geoscience and remote sensing, 2016, 54(5): 2495-2509.
[7] BORNMANN L, HAUNSCHILD R. Normalization of mendeley reader impact on the reader- and paper-side: a comparison of the mean discipline normalized reader score (MDNRS) with the mean normalized reader score (MNRS) and bare reader counts [J]. Journal of informetrics, 2016, 10(3): 776-788.
[8] ZHANG C, ZHOU S. Renormalized and entropy solutions for nonlinear parabolic equations with variable exponents and L1 data [J]. Journal of differential equations, 2017, 248: 1376-1400.
[9] BORNMANN L, THOR A, MARX W, et al. The application of bibliometrics to research evaluation in the humanities and social sciences: an exploratory study using normalized Google Scholar data for the publications of a research institute [J]. Journal of the association for information science and technology, 2016, 67(11): 2778-2789.
[10] WEI Linjing, NING Lulu, GUO Bin, et al. Sparse-score feature selection clustering algorithm based on entropy weighting in big data [J]. Application research of computers, 2018, 35(8): 2293-2294.
[11] YANG M S, NATALIANI Y. A feature-reduction fuzzy clustering algorithm based on feature-weighted entropy [J]. IEEE transactions on fuzzy systems, 2018, 26(2): 817-835.
[12] KAWAMURA T, SEKINE M, MATSUMURA K. Detecting hypernym/hyponym in science and technology thesaurus using entropy-based clustering of word vectors [J]. International journal of semantic computing, 2017, 11(4): 17-24.
[13] LI Min, LI Caixia, WEI Linjing. Quadtree-decomposition single-frame image defogging based on entropy weighting [J]. Computer engineering and design, 2017, 38(6): 1575-1579.
[14] HAFEZALKOTOB A, HAFEZALKOTOB A. Extended MULTIMOORA method based on Shannon entropy weight for materials selection [J]. Journal of industrial engineering international, 2016, 12(1): 1-13.