Research and design of a distributed high-performance web crawler based on a cloud platform

Bibliographic Details
Main Authors: Enming SHI, Xiaojun XIAO, Yu LU
Format: Article
Language:zho
Published: Beijing Xintong Media Co., Ltd 2017-08-01
Series:Dianxin kexue
Online Access:http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2017234/
Description
Summary:With the arrival of the big data era, data has become the most valuable resource. Web crawler technology, as an important means of external data collection, has become a standard tool for data analysis. A high-performance, convenient cloud-based crawler architecture is introduced. The overall structure of the crawler, its distributed design, and the design of its sub-modules are described in detail. Each module of the crawler is encapsulated in Docker, and Kubernetes is used for resource scheduling and management of the cluster. For performance optimization, an MD5 deduplication tree algorithm, DNS optimization, and asynchronous I/O are adopted. Experimental results show that the crawler has a clear performance advantage over the unoptimized scheme.
ISSN:1000-0801
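
Illustrative note: the summary names an MD5 tree algorithm among the crawler's optimizations but gives no details. The sketch below is one plausible reading, in which each URL is hashed with MD5 and the hex digest is stored in a trie so that already-seen URLs can be rejected. The class name `Md5DedupTree`, its methods, and the example URLs are hypothetical and are not taken from the article; the authors' actual data structure may differ.

```python
import hashlib


class Md5DedupTree:
    """Trie keyed on the hex digits of each URL's MD5 digest.

    Hypothetical illustration only; not the article's implementation.
    """

    def __init__(self):
        self._root = {}   # nested dicts: one level per hex digit
        self._count = 0   # number of distinct URLs stored

    def add(self, url: str) -> bool:
        """Insert url; return True if it was new, False if already seen."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        node = self._root
        is_new = False
        for ch in digest:
            child = node.get(ch)
            if child is None:
                # A missing branch means this digest was never inserted.
                child = {}
                node[ch] = child
                is_new = True
            node = child
        if is_new:
            self._count += 1
        return is_new

    def __len__(self) -> int:
        return self._count


# Usage: keep only URLs that have not been seen before.
tree = Md5DedupTree()
candidates = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate of the first entry
]
to_fetch = [u for u in candidates if tree.add(u)]
```

Because every MD5 hex digest has the same length (32 characters), a full path in the trie exists only if that exact digest was inserted earlier, so no separate end-of-key marker is needed in this sketch.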