Design and Implementation of a Web Crawler

2014-07-28 18:40:08 Dong Rizhuang, Guo Shuchao

Computer Knowledge and Technology, 2014, Issue 17


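Abstract: With the rapid development of society, the volume of information on the Internet has grown sharply, and people depend on search engines more and more. The web crawler is one of the key technologies of a search engine, and it is also an effective tool for quickly obtaining the resources available on the network. To understand web crawlers more deeply and to apply them skillfully and appropriately in a range of applications and systems, we analyze the framework, basic workflow, and crawling strategies of web crawlers, and then implement one using Java, the HTML parsing tool jsoup, and a MySQL database. The crawler performs a simple crawl of book data from Jingdong, which can be used to analyze user preferences and judge purchasing tendencies so as to provide users with personalized services.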

Keywords: search engine; web crawler; crawling strategy; Java; jsoup; MySQL

CLC number: TP391  Document code: A  Article ID: 1009-3044(2014)17-3986-03

Design and Implementation of Web Crawler

DONG Ri-zhuang¹, GUO Shu-chao²

(1. School of Computer Engineering, Qingdao Technological University, Qingdao 266033, China; 2. Shandong Entry-Exit Inspection and Quarantine Bureau, Qingdao 266000, China)

Abstract: With the rapid development of society, the volume of information on the Internet has increased sharply, and people rely on search engines more and more. The web crawler is one of the key technologies of search engines, and also an effective tool for quick access to the resources available on the network. In order to understand web crawlers better and apply them skillfully and appropriately in various applications and systems, this paper analyzes the framework, basic workflow, and crawling strategies of web crawlers, then uses the Java programming language, the HTML parsing tool jsoup, and a MySQL database to implement a web crawler that performs a simple crawl of Jingdong book data, in order to analyze user preferences and purchasing tendencies and thereby provide users with personalized service.

Key words: search engine; web crawler; crawling strategy; Java; jsoup; MySQL

1 Overview

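With social development and the progress of the times, the information society has advanced faster than most people imagined, and the Internet has at the same time reached an unprecedented scale. According to the search engine giant Google, in 2012 its web crawler Googlebot crawled roughly 20 billion web pages every day [1] and tracked about 30 billion unique URLs. In addition, Google handled nearly 100 billion search requests per month. This shows how vast the volume of information on the Internet is and how widely search engines are used. At the same time, this mass of information requires search engines to respond ever faster.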

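The web crawler [2, 3, 4], an important component of a search engine, likewise needs to develop faster to cope with the rapidly growing Internet. A web crawler, often also called a web spider [5], is a program or script that automatically roams the Internet and automatically downloads web pages. Because of its versatility, a web crawler can be used in many settings. For example, microblogs contain a great deal of information about the connections between users; ……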
