李林睿 常舒予 喬一鳴



Abstract: LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, also known as the Guo Shoujing Telescope) provides a large volume of astronomical spectral data, and the classification of celestial objects is a widely studied problem in astronomy. Because the number of objects is large and the data are high-dimensional, applying machine learning methods to spectra has become a hot topic in recent years. To address the classification problem, we propose HSODM (High-dimensional Spectral with Outlier Data Mining), an improved method for identifying outliers in high-dimensional data. It is unsupervised and uses random distance to identify the very small number of unknown celestial objects or outliers within large volumes of high-dimensional spectral data, which facilitates subsequent classification, outlier mining, and related processing. The project applies data preprocessing, dimensionality reduction by principal component analysis, construction and training of a long short-term memory (LSTM) neural network model, parameter tuning, and result prediction and analysis; the model is then evaluated and presented using evaluation metrics and data visualization. The improved method and the optimized neural network proposed in this study can shorten training time and improve prediction accuracy. Experiments show that the improved method raises the area under the ROC (receiver operating characteristic) curve, the area under the P-R curve, the F1 score, and the G-mean score.
Key words: representation learning; high-dimensional spectra; outlier detection; data mining; classification
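With advances in science and technology, modern observational equipment lets astronomers look deeper into the universe, and it has also brought explosive growth in astronomical data [1]. The Guo Shoujing Telescope (LAMOST), the telescope with the highest spectral acquisition rate in the world, can collect more than ten thousand spectra per observing night. These spectra provide abundant material for astronomers and astrophysicists working on galaxy redshift surveys, cosmological models, the large-scale structure of the universe, galaxy formation and evolution, and spectroscopic observations combined with other kinds of radiation [2], and they help advance the field as a whole. Each spectrum in the LAMOST data set gives a series of radiation intensity values over the wavelength range 3690-9100 Å. Spectral classification consists of selecting and extracting, from spectral data with thousands of dimensions, the features that are most effective for classification in order to build a feature space (for example, the flux values at specific wavelengths or in specific bands), and then applying an algorithm to distinguish the various types of celestial objects.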
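As an illustration of using flux values in specific bands as features, the following minimal sketch computes the mean flux in a few wavelength bands of a single spectrum. It is not the feature extraction used in this work: the function name `band_features`, the band boundaries in `BANDS`, and the synthetic spectrum are all assumptions chosen for the example.

```python
from bisect import bisect_left
from statistics import fmean

# Hypothetical band edges in angstroms (illustrative only, not from the paper).
BANDS = [(3690.0, 5000.0), (5000.0, 7000.0), (7000.0, 9100.0)]


def band_features(wavelengths, fluxes, bands=BANDS):
    """Return the mean flux inside each half-open band [lo, hi).

    `wavelengths` must be sorted in ascending order and aligned with `fluxes`.
    A band that contains no samples yields float('nan').
    """
    if len(wavelengths) != len(fluxes):
        raise ValueError("wavelengths and fluxes must have the same length")
    features = []
    for lo, hi in bands:
        start = bisect_left(wavelengths, lo)  # first index with wavelength >= lo
        stop = bisect_left(wavelengths, hi)   # first index with wavelength >= hi
        window = fluxes[start:stop]
        features.append(fmean(window) if window else float("nan"))
    return features


# Synthetic spectrum: 3690 to 9100 angstroms in 10-angstrom steps, flat flux.
wavelengths = [3690.0 + 10.0 * i for i in range(542)]
fluxes = [1.0] * len(wavelengths)
features = band_features(wavelengths, fluxes)
```

Each spectrum is reduced here from hundreds of samples to three numbers. A real pipeline would choose bands or lines with physical meaning, and would typically normalize the flux first.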