Xu Jun, Wang Qing-hua, Zhao Yun-long
Abstract: To address the access bottleneck caused by storing large numbers of small banknote serial-number images in HDFS, this paper improves the existing HDFS system and proposes a new distributed mechanism, FCHDFS, which merges files based on their correlation (File Correlation). Because all files in HDFS are hosted by a single master server, the NameNode, and every file stored in HDFS keeps its metadata in the NameNode's main memory, HDFS performance inevitably degrades as the number of small files grows. Storing and managing a large number of small files places a heavy burden on the NameNode, and the number of files HDFS can hold is constrained by the NameNode's memory size. To improve the efficiency of storing and accessing small serial-number files on HDFS, this paper proposes an efficient small-file handling mechanism based on file correlation: related files are grouped by customer and by time and combined into one large file, which reduces the file count, while a newly built index mechanism allows individual files to be retrieved from the corresponding combined file. Experimental results show that FCHDFS greatly reduces the amount of metadata in the master node's memory and improves the efficiency of storing and accessing large numbers of small files.
Keywords: Hadoop; small files; HDFS; file merging
CLC number: TP18    Document code: A    Article ID: 1009-3044(2014)17-3980-06
Research on Distributed Storage of Banknote Serial-Number Small Files Based on Improved HDFS
XU Jun1,2, WANG Qing-hua1, ZHAO Yun-long1
(1. ATM Research Institute, GRGBanking, Guangzhou Radio Group, Guangzhou 510663, China; 2. College of Computer, South China Normal University, Guangzhou 510631, China)
Abstract: Aiming at the access bottleneck caused by storing small banknote serial-number images in HDFS, this paper improves the existing HDFS system with a new distributed mechanism, FCHDFS, which is fully based on file correlation (File Correlation) and merges correlated files. Because all files in HDFS are hosted by a single master server, the NameNode, and every file stored in HDFS keeps its metadata in the NameNode's main memory, HDFS performance inevitably degrades as the number of small files grows. Storing and managing masses of small files is a heavy burden on the NameNode, and the number of files HDFS can store is constrained by the NameNode's memory size. In addition, HDFS does not consider the correlation between files. To improve the efficiency of small-file storage and access in HDFS, this paper proposes an efficient mechanism for handling small files based on file correlation. In this method, related files are grouped by customer and by time and combined into one large file, thereby reducing the number of files, and a new index mechanism allows a single file to be accessed from the corresponding combined file. Experimental results show that FCHDFS greatly reduces the amount of metadata in the master node's memory and also improves the efficiency of storing and accessing large numbers of small files.
Key words: Hadoop; small files; HDFS; file merging
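To make the merging-and-indexing idea summarized in the abstract concrete before the main text, the following minimal Java sketch (hypothetical class and method names; it is not the paper's FCHDFS implementation) concatenates a group of correlated small files, such as all serial-number images of one customer within one time window, into a single container file and records each file's offset and length in an index, so that an individual image can still be read back without scanning the whole container. In FCHDFS the container would be written to HDFS; a local file is used here only to keep the sketch self-contained.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: merge correlated small files into one container
// file and keep an offset/length index for direct access to each of them.
public class FileMergeSketch {

    // Index entry: where a small file starts inside the container and its size.
    public record Entry(long offset, long length) {}

    // Concatenate the small files and record each one's position in the index.
    public static Map<String, Entry> merge(List<Path> smallFiles, Path container) throws IOException {
        Map<String, Entry> index = new LinkedHashMap<>();
        long offset = 0;
        try (OutputStream out = Files.newOutputStream(container)) {
            for (Path p : smallFiles) {
                byte[] data = Files.readAllBytes(p);
                out.write(data);
                index.put(p.getFileName().toString(), new Entry(offset, data.length));
                offset += data.length;
            }
        }
        return index;
    }

    // Read one original small file back out of the container via the index.
    public static byte[] readOne(Path container, Map<String, Entry> index, String name) throws IOException {
        Entry e = index.get(name);
        ByteBuffer buf = ByteBuffer.allocate((int) e.length());
        try (SeekableByteChannel ch = Files.newByteChannel(container)) {
            ch.position(e.offset());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the requested byte range is filled
            }
        }
        return buf.array();
    }
}

With this layout, only the container file consumes NameNode metadata, regardless of how many images it holds, which is the effect the abstract attributes to FCHDFS.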
According to the current requirements of the head offices of major banks, banknote serial-number information must be managed centrally at the head office, while the serial-number images are stored on branch node machines; the system must retain at least three months of bank-wide data, amounting to billions or even tens of billions of records. Requirements analysis shows that both the rate at which records are generated and the total number of records exhibit typical big-data characteristics [1,2], approaching or exceeding the processing capacity of traditional database technology, and as the system keeps running and the business expands, the number of records will continue to grow. The architecture of the analysis system should therefore be built on a mature big-data architecture in order to meet these business requirements. ...
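The record counts above also suggest why storing each image as a separate HDFS file is impractical. The figures in the following back-of-envelope calculation are assumptions for illustration, using the commonly cited rule of thumb that each file object and each block object consumes on the order of 150 bytes of NameNode heap:

// Rough, illustrative estimate of NameNode heap needed for metadata alone if
// every serial-number image were stored as its own HDFS file. The 150-byte
// figures are a widely quoted rule of thumb, not a measurement.
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10_000_000_000L;     // assumed ~10 billion images over three months
        long bytesPerFile = 150L;         // approximate metadata cost per file object
        long bytesPerBlock = 150L;        // each small file still occupies one block
        long totalBytes = files * (bytesPerFile + bytesPerBlock);
        System.out.printf("Estimated NameNode heap for metadata alone: about %.1f TB%n",
                totalBytes / 1e12);       // roughly 3 TB for 10 billion small files
    }
}

Even at one billion images, the same estimate still comes to roughly 300 GB of NameNode heap, far beyond what a single master node can reasonably hold, which is the bottleneck that merging small files is intended to remove.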