吳疆 董婷 蔣平



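Abstract:
The semi-supervised learning method Laplace Support Vector Machine (LapSVM) is applied to predict protein structural classes. First, seven amino acid physicochemical property parameters are used as a substitution model to convert protein sequences into numeric sequences; the auto covariance transformation (Auto-cross Covariance, AC) is used to describe the interactions between amino acid residues a certain distance apart and to transform the numeric sequences into vectors of uniform length, which form the feature space of the samples. Then 20, 50, 80, 110, 140 and 170 samples are randomly selected from the dataset as unlabelled samples to construct the training sets, and the one-against-all decomposition strategy and the leave-one-out method are used to evaluate the predictive ability of the LapSVM model. The classifier predicts the structural class of the protein samples with an accuracy of 94.12%, which is clearly competitive with the 90.69% of the standard Support Vector Machine (SVM) algorithm. The experimental results verify that the distribution information of unlabelled samples, used as weak rules, can effectively improve the predictive performance of the classifier. They also offer a novel idea: applying a semi-supervised method to a fully supervised learning problem gives a smaller optimization scale and better predictive ability.
Key words:
semi-supervised learning; protein structural class; Laplace support vector machine; auto covariance transformation
CLC number: TP 391
Document code: A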
Protein Structural Classes Prediction by Using Laplace Support
Vector Machine and Based on Semi-supervised Method
WU Jiang1, DONG Ting1, JIANG Ping1,2
(1. Department of Information Engineering, Yulin University, Yulin, Shaanxi 719000, China;
2. School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China)
Abstract:
This study predicts protein structural classes using the Laplace support vector machine (LapSVM), a novel semi-supervised learning method. First, seven amino acid physicochemical properties taken from the literature were used to transform the protein sequences into numeric sequences, and the auto covariance (AC) transformation was used to convert these numeric sequences into feature vectors of uniform length, which are suitable for training models. AC captures neighboring effects and the interactions between residues a certain distance apart in a protein sequence. Second, 20, 50, 80, 110, 140 and 170 samples were randomly selected as unlabelled samples to construct the training datasets, and the “one-against-all” strategy and the leave-one-out method were employed to estimate the performance. A prediction accuracy of 94.12% was obtained, which compares favorably with the 90.69% achieved by the standard support vector machine (SVM). The experimental results show that the distribution information of unlabelled samples, used as weak rules, can effectively improve prediction performance. They also suggest a novel idea: applying a semi-supervised method to a fully supervised learning problem, which leads to a smaller optimization scale and higher prediction accuracy.
Key words:
semi-supervised learning; protein structural class; Laplace support vector machine; auto covariance
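The abstract describes the auto covariance (AC) transformation as the step that turns variable-length protein sequences into feature vectors of uniform length. The sketch below illustrates that idea under stated assumptions: it uses a commonly cited form of the AC descriptor, AC(lag) = 1/(n - lag) · Σ (x_i - mean)(x_{i+lag} - mean), which the abstract itself does not spell out, and the function names `auto_covariance` and `ac_features`, the maximum lag, and the two toy property scales are hypothetical placeholders rather than the seven physicochemical properties used in the paper.

```python
def auto_covariance(values, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one numeric property sequence.

    Assumed definition: the mean-centered product of values that are `lag`
    positions apart, averaged over the n - lag available pairs.
    """
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def ac_features(sequence, property_tables, max_lag):
    """Concatenate the AC descriptors of every property into one vector."""
    features = []
    for table in property_tables:
        # Replace each residue by its numeric property value.
        values = [table[residue] for residue in sequence]
        features.extend(auto_covariance(values, max_lag))
    return features


# Hypothetical toy scales for a 4-letter alphabet; NOT real physicochemical data.
TOY_SCALE_A = {"A": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}
TOY_SCALE_B = {"A": 0.5, "C": -0.5, "D": 0.5, "E": -0.5}

short = ac_features("ACDE", [TOY_SCALE_A, TOY_SCALE_B], max_lag=2)
long = ac_features("ACDEEDCAAC", [TOY_SCALE_A, TOY_SCALE_B], max_lag=2)
print(len(short), len(long))  # both vectors have 2 properties x 2 lags = 4 entries
```

Whatever the sequence length, the resulting vector has (number of properties) × (maximum lag) entries, which is what allows sequences of different lengths to share one feature space.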