Design and Implementation of a Topic-Specific Web Crawler for E-Commerce Websites

(Graduation thesis; pages: 28; word count: 14,338; includes the proposal report and task statement)
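Abstract: A web crawler is a program that automatically downloads web pages and is an important component of a search engine. A traditional crawler starts from the URLs of one or more seed pages, extracts the URLs found on those pages, and, as it fetches pages, keeps extracting new URLs from the current page and adding them to a queue until the URL queue is empty.
The topic-specific crawler designed in this thesis targets e-commerce websites only, so that users can find as much of the product information they care about as possible. Its workflow is considerably more complex: it must apply page analysis to filter out links unrelated to e-commerce product information, keep the useful links, and place them in the queue of URLs waiting to be fetched. It then selects the next page URL from the queue according to a given search strategy and repeats the process until the URL queue is empty. In addition, every page the crawler fetches is stored by the system.
Building on an analysis of how web crawlers work, and making use of multithreading, this thesis designs such a crawler program.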
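
The workflow described above (seed URLs, link extraction, filtering, a queue that drains until empty, and page storage) can be sketched roughly as follows. This is a minimal single-threaded illustration, not the thesis's implementation: the names `crawl`, `is_relevant`, `LinkExtractor`, and `FAKE_WEB`, the host allow-list used as the relevance filter, the FIFO (breadth-first) selection strategy, and the `max_pages` cap are all assumptions made for the example. Pages are served from an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(url, allowed_hosts):
    """Hypothetical filter: keep only http(s) links on allow-listed shop hosts."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Queue-driven crawl; returns {url: html} for every stored page."""
    queue = deque(seeds)
    seen = set(seeds)
    stored = {}
    while queue and len(stored) < max_pages:
        url = queue.popleft()      # FIFO order: an assumed breadth-first strategy
        html = fetch(url)
        if html is None:           # fetch failed, skip this URL
            continue
        stored[url] = html         # "store" the page (in memory in this sketch)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and is_relevant(link, allowed_hosts):
                seen.add(link)
                queue.append(link)
    return stored


# Tiny in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://shop.example/": '<a href="/item/1">Item</a> <a href="http://ads.example/x">Ad</a>',
    "http://shop.example/item/1": '<a href="/">Home</a> <a href="/item/2#reviews">Next</a>',
    "http://shop.example/item/2": "<p>No links here</p>",
}

pages = crawl(["http://shop.example/"], FAKE_WEB.get, {"shop.example"})
print(sorted(pages))
```

In this sketch the link to `ads.example` is discarded by the filter, while the three `shop.example` pages are fetched and stored. Replacing `deque.popleft` with a priority queue would give a different search strategy, and passing a real HTTP fetch function instead of `FAKE_WEB.get` would turn it into a live crawler.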

Keywords: search engine, web crawler, e-commerce

Design and Implementation of a Topic-Specific Web Crawler for E-Commerce Websites

Abstract: A web crawler is a program that automatically downloads web pages from the World Wide Web on behalf of a search engine, of which it is an important component. A traditional web crawler starts from the URLs of one or more seed pages and extracts new URLs from them; as it continues downloading HTML pages, it keeps finding new URLs and decides which of them to add to a queue, and it runs until the URL queue is empty.
The web crawler designed in this thesis collects information from e-commerce websites only, so that users can find as much of the information they care about as possible.
A crawler that targets e-commerce websites has a very complicated workflow: it must analyze each page, filter out links that are unrelated to e-commerce, and place the remaining useful links into the URL queue. Then, following a given search strategy, it chooses the next URL from the queue, downloads that page, and repeats the process until the URL queue is empty. In addition, all downloaded pages are stored on the local drive.
Based on an analysis of how web crawlers work, and using multithreading, this thesis designs such a web crawler program.

Key Words: Search engine, Web Crawler, E-commerce

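Contents
Chinese Abstract
English Abstract
Contents
1 Introduction
1.1 Background and Significance
1.2 State of Research in China and Abroad
1.3 Applications of Crawlers in E-Commerce
1.4 Work Covered in This Thesis
2 Web Crawlers
2.1 Overview of Search Engines
2.1.1 General-Purpose Search Engines
2.1.2 Specialized Search Engines
2.1.3 Search Engine Performance Metrics
2.2 Overview of Web Crawlers
2.2.1 Introduction to Web Crawlers
2.2.2 How Web Crawlers Work
3 Design of the Specialized Web Crawler
3.1 Crawler Design Principles
3.2 Use of Threading
3.2.1 Creating Threads
3.2.2 Inter-Thread Communication
3.3 Structural Analysis of the Crawler
3.3.1 How to Parse HTML
3.3.2 Structure of the Spider Program
3.3.3 Building the Spider Program
3.3.4 URL Filtering Strategy
3.4 Analysis of Run Results
Conclusion
Acknowledgements
References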

1 Introduction
1.1 Background and Significance
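The amount of information on the Internet is growing exponentially, doubling every four months. The web expanded from a few thousand pages in 1993 to at least 8 billion pages and 56 billion hyperlinks by 2003, and the numbers of pages and links today far exceed those figures. Google, however, can still search only about 8 billion pages, which is small relative to the vast amount of information on the web.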
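Search engines are one of the main tools for finding information on the Internet. A search engine automatically gathers and discovers information on the Internet, filters, downloads, and classifies it, builds a searchable database, and offers a retrieval service to users. The navigation service that search engines provide has become a very important Internet service, and search engine sites have been dubbed "web portals".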
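With the rapid growth of the network, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. General-purpose search engines such as AltaVista, Yahoo!, and Google also have certain limitations [1]: (1) users from different fields and backgrounds often have different retrieval goals and needs, yet the results returned by a general-purpose search engine contain many pages the user does not care about; (2) general-purpose search engines aim for the widest possible web coverage, so the conflict between limited search engine server resources and unlimited web data will deepen further; (3) as data on the Web grows richer in form and network technology keeps developing, images, databases, audio/video multimedia, and other kinds of data appear in large quantities, and general-purpose search engines are often unable to discover and retrieve such information-dense, structured data well; (4) most general-purpose search engines offer keyword-based retrieval and struggle to support queries expressed in terms of semantic information.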
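To improve retrieval accuracy and increase the amount of information that can be searched, it is necessary to build a series of specialized search engines. Following a given strategy, such an engine is interested only in a specified class of sites and therefore provides retrieval for only that class of information. The web crawler, as an important component of such a specialized search engine, accordingly differs in many ways from earlier general-purpose crawlers in its working strategy.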
