Information extraction from massive Web pages based on node property and text content

Bibliographic Details
Main Authors: Hai-yan WANG, Pan CAO
Format: Article
Language: Chinese (zho)
Published: Editorial Department of Journal on Communications 2016-10-01
Series: Tongxin xuebao
Online Access: http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/
Description
Summary: To address the problem of extracting valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content was proposed. Each Web page was converted into a document object model (DOM) tree, and a pruning and fusion algorithm was introduced to simplify the tree. For each node in the DOM tree, both a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to extract information from massive numbers of Web pages in parallel. Simulation and experimental results demonstrate that the proposed method achieves better performance and higher scalability than other methods.
ISSN: 1000-436X
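
The summary above mentions converting a page into a DOM tree, pruning it, and scoring nodes by a density property. The record does not give the paper's actual definitions, so the following is only a minimal sketch under stated assumptions: density is taken here as text characters per tag in a node's subtree, the pruned tag list is invented for illustration, and the input is assumed to be well-formed HTML. The names `Node`, `TreeBuilder`, `build_tree`, and `densest_node` are hypothetical, and the vision property and the fusion step are not modeled.

```python
from html.parser import HTMLParser

# Tags dropped before scoring. This list is an assumption, not the paper's rule set.
PRUNED_TAGS = {"script", "style", "nav", "footer", "noscript"}
# Void elements have no closing tag, so they are never pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A simplified DOM node holding its tag, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def text_length(self):
        # Characters of text in this node and all of its descendants.
        return len(self.text) + sum(c.text_length() for c in self.children)

    def tag_count(self):
        # Number of tags in the subtree, counting this node itself.
        return 1 + sum(c.tag_count() for c in self.children)

    def density(self):
        # Assumed density property: text characters per tag in the subtree.
        return self.text_length() / self.tag_count()


class TreeBuilder(HTMLParser):
    """Builds a pruned tree; assumes every non-void tag is properly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in PRUNED_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text += data.strip()


def build_tree(html):
    """Parse HTML into a pruned tree and return its synthetic root node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def densest_node(html):
    """Return the node with the highest assumed density, or None if empty."""
    best, stack = None, list(build_tree(html).children)
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if best is None or node.density() > best.density():
            best = node
    return best


page = (
    "<html><body><nav><a href='#'>Home</a></nav>"
    "<div><p>A long paragraph of article text that carries the real content.</p></div>"
    "<script>var x = 1;</script></body></html>"
)
best = densest_node(page)
print(best.tag, round(best.density(), 1))
```

Because `densest_node` works on one page at a time and keeps no shared state, a function of this shape could plausibly serve as the per-page map step in a MapReduce job; how the paper actually partitions the work is not stated in this record.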