
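Abstract: Data cleaning is the starting point for research in data analysis, data mining, and related fields. This paper surveys research on data cleaning. It first explains the relationship between data cleaning and data quality, then gives an overview of data cleaning and analyzes its steps and methods, and finally briefly introduces recent research on data cleaning in China and abroad, together with an outlook on research into cleaning Chinese-language data.
Keywords: dirty data; data cleaning; data quality; similar duplicate data; cleaning steps
CLC number: TP391 Document code: A
Article ID: 1009-3044(2020)20-0044-04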
A Review of the Development of Data Cleaning
LIAO Shu-yan
(Central China Normal University, Wuhan 430079, China)
Abstract: Data cleaning is the starting point of data analysis, data mining, and related research. This paper reviews research on data cleaning. First, the relationship between data cleaning and data quality is explained. Then an overview of data cleaning is given, and its steps and methods are analyzed. Finally, recent research on data cleaning in China and abroad is briefly introduced, and prospects for research on Chinese data cleaning are discussed.
Key words: dirty data; data cleaning; data quality; similar duplicate data; cleaning steps
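1 Introduction
Data is a signature product of the information age. It has gradually become independent of software products and has even come to drive the development of some of them. In an era of rapid Internet growth, people can obtain massive amounts of data from many sources. Having obtained data, they usually want to process it in various ways and extract valuable information from it. For that information to meet their needs, the data must be reliable and must accurately reflect the actual situation. In practice, however, first-hand data is usually "dirty data". "Dirty data" mainly refers to inconsistent or inaccurate data, outdated data, and erroneous data introduced by human error [1]. If dirty data is analyzed directly without the necessary cleaning, the conclusions or patterns drawn from it will inevitably be inaccurate. This highlights the importance of data cleaning: it improves the credibility and accuracy of data, which is why research on data cleaning is essential.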
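2 The Relationship Between Data Cleaning and Data Quality
The main object processed in data cleaning is dirty data. The inconsistency and inaccuracy inherent in dirty data directly affect the explicit and implicit value of the data, that is, they directly affect data quality.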