Yu Na


Abstract: This paper considers a feature selection method for a class of Fused Lasso problems. Unlike previous methods, it uses variational inequalities to provide necessary and sufficient conditions for the dual problem, on which the feature selection method is built. By deriving the dual of the optimization problem, a necessary condition is obtained in variational inequality form. A dual feasible region containing the dual optimal solution is constructed, upper bounds on the dual constraints are estimated over this region, and a screening rule is established that identifies adjacent features with the same coefficient, so that those features can be removed.
Key Words: Feature Selection; Variational Inequality; Screening Rules; Dual Problem
CLC Number: O177.5          Document Code: A          Article ID: 1672-3791(2020)12(b)-0032-03
Traditional linear regression is widely used as a basic data analysis technique, but it still faces great difficulties on high-dimensional data, where mining out the useful information becomes especially important; this has motivated new variable selection methods. In 1996, Tibshirani [1] proposed the Lasso model, based on an ℓ1 regularization (penalty), formulated as follows.
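A standard statement of the model, with response vector $y\in\mathbb{R}^n$, design matrix $X\in\mathbb{R}^{n\times p}$, coefficient vector $\beta$, and tuning parameter $\lambda\ge 0$ (this notation is assumed here), is

$$\min_{\beta\in\mathbb{R}^p}\ \frac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_1 .$$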
Here, p is the number of model coefficients. Sparse learning is an effective technique for analyzing high-dimensional data and is widely applied across many fields; the coefficient vectors of such models contain only a small number of nonzero entries. By penalizing the absolute values of the model coefficients, the coefficients are shrunk: features with very small weights have their coefficients compressed exactly to zero, and the corresponding features are then removed.
Many researchers have since improved the Lasso model. In 2005, targeting high-dimensional data in which adjacent features are strongly correlated, Tibshirani and Saunders [2] proposed the Fused Lasso estimator, formulated as follows.
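In the same notation, a standard Lagrangian-form statement of the model from [2], with tuning parameters $\lambda_1,\lambda_2\ge 0$, is

$$\min_{\beta\in\mathbb{R}^p}\ \frac{1}{2}\|y-X\beta\|_2^2+\lambda_1\sum_{j=1}^{p}|\beta_j|+\lambda_2\sum_{j=2}^{p}|\beta_j-\beta_{j-1}| ,$$

where the second penalty shrinks differences of adjacent coefficients toward zero.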
This model not only compresses small coefficients to zero but can also compress some coefficient differences to zero; it thus achieves sparsity in the coefficient differences while making adjacent coefficients smoother. A number of screening methods for this model have appeared accordingly [3-6].
1  Establishing the Screening Rule
This paper mainly studies the following optimization problem.
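As a minimal sketch of how such a screening rule can be derived, assume the fused-penalty signal approximator, i.e. $X=I$ with only the fusion penalty and the first-difference matrix $D\in\mathbb{R}^{(p-1)\times p}$, $(D\beta)_i=\beta_{i+1}-\beta_i$ (this special case is an assumption, not necessarily the exact model studied here):

$$\min_{\beta\in\mathbb{R}^p}\ \frac{1}{2}\|y-\beta\|_2^2+\lambda\|D\beta\|_1 .$$

Standard Lagrangian duality gives the dual problem

$$\min_{\|u\|_\infty\le\lambda}\ g(u):=\frac{1}{2}\|y-D^{\mathsf T}u\|_2^2 ,$$

with the primal-dual link $\beta^*=y-D^{\mathsf T}u^*$. Because $g$ is convex and the feasible set $F=\{u:\|u\|_\infty\le\lambda\}$ is closed and convex, $u^*$ is a dual optimum if and only if the variational inequality

$$\langle\nabla g(u^*),\,u-u^*\rangle=\langle D(D^{\mathsf T}u^*-y),\,u-u^*\rangle\ge 0\qquad\text{for all }u\in F$$

holds. The screening rule then follows from the optimality conditions: $u_i^*\in\lambda\,\partial|(D\beta^*)_i|$, so whenever $|u_i^*|<\lambda$ one must have $(D\beta^*)_i=0$, i.e. $\beta_{i+1}^*=\beta_i^*$, and the two adjacent features can be fused and treated as one.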
2.2  Constructing the Feasible Set
When a parameter value λ0 is given, the optimal solutions of the original problem and of the dual problem at λ0 can be taken as known. From the screening rule constructed in this paper, it is not hard to see that improving computational efficiency and reducing computational difficulty requires screening through the dual optimal solution. The difficulty, however, is that the dual optimal solution at a new parameter value λ cannot be obtained by a simple computation. This paper therefore considers using the variational inequality of Theorem 1 to construct a tight dual feasible set.
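For illustration only (this is not the paper's exact construction), suppose the variational inequality yields a ball B(c, r) in the dual space that is guaranteed to contain the dual optimum u*(λ). Then the supremum of |u_i| over the ball is |c_i| + r, so every index with |c_i| + r < λ is certified to satisfy |u*_i(λ)| < λ, and the corresponding adjacent coefficients must coincide. A short Python sketch under these assumptions, taking the ball (c, r) as given:

```python
import numpy as np

def screen_fused_coefficients(center, radius, lam):
    """Return indices i certified to satisfy |u*_i| < lam, i.e. indices
    where the screening rule guarantees beta_{i+1} = beta_i.

    Assumes the dual optimum u*(lam) lies in the ball B(center, radius);
    how that ball is obtained (here, from the variational inequality)
    is not reproduced in this sketch.
    """
    upper_bound = np.abs(center) + radius      # sup of |u_i| over the ball
    return np.flatnonzero(upper_bound < lam)   # certified zero differences

# Hypothetical usage with p - 1 = 4 coefficient differences.
c = np.array([0.2, -0.9, 0.05, 0.7])           # assumed ball center
fused = screen_fused_coefficients(c, radius=0.1, lam=1.0)
print(fused)  # [0 2 3]: those differences are certified zero, so the
              # corresponding adjacent coefficient pairs can be fused
```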
3  Conclusion
In the era of big data, when the number of collected feature dimensions and the size of the sample dataset are very large, data mining becomes especially important, and how to solve such problems becomes equally important and challenging. Among this mass of data, however, not all features are representative, so some inactive (non-representative) features need to be removed. This is feature selection, and its main purpose is to improve the computational efficiency of the model.
The feature selection method proposed in this paper proceeds as follows.
(1) By estimating, over the set of dual optimal solutions, the upper bounds associated with adjacent features, identify the adjacent features that share the same coefficient and remove them.
(2) The key to the screening is the estimation of the dual optimal solution. This paper therefore uses the variational inequality screening method to construct a tighter dual feasible set, so that the dual optimal solution can be estimated accurately. The resulting screening method can quickly and accurately identify the adjacent features in the solution that share the same coefficient.
References
[1] TIBSHIRANI R. Regression Shrinkage and Selection via the Lasso[J]. Journal of the Royal Statistical Society: Series B (Methodological), 1996, 58(1): 267-288.
[2] TIBSHIRANI R, SAUNDERS M, ROSSET S, et al. Sparsity and Smoothness via the Fused Lasso[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005, 67(1): 91-108.
[3] WANG Jie, FAN Wei, YE Jieping. Fused Lasso Screening Rules via the Monotonicity of Subdifferentials[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1806-1820.
[4] ZHANG Huan. Statistical Analysis and Optimization Algorithms for Fused-LASSO Penalized Least Absolute Deviation Regression[D]. Beijing Jiaotong University, 2016.
[5] LIU Jun, ZHAO Zheng, WANG Jie, et al. Safe Screening with Variational Inequalities and Its Application to Lasso[C]// Proceedings of the 31st International Conference on Machine Learning (ICML), 2014: 289-297.
[6] SOKOLOVSKA N, CHEVALEYRE Y, CLEMENT K. The Fused Lasso Penalty for Learning Interpretable Medical Scoring Systems[C]// 2017 International Joint Conference on Neural Networks (IJCNN), 2017.
[7] REN Shaogang, HUANG Shuai, YE Jieping, et al. Safe Feature Screening for Generalized LASSO[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018: 2992-3006.
[8] LI Yi. Data Mining Technology and Its Applications[J]. Science & Technology Information, 2017, 15(24): 21-22.
[9] ALI A, TIBSHIRANI R J. The Generalized Lasso Problem and Uniqueness[J]. Electronic Journal of Statistics, 2019, 13(2): 2307-2347.