Volume 10 Number 6 (Nov. 2015)
JCP 2015 Vol.10(6): 381-387 ISSN: 1796-203X
doi: 10.17706/jcp.10.6.381-387

A Study of Web Information Extraction Technology Based on Beautiful Soup

Chunmei Zheng¹, Guomei He¹, Zuojie Peng²
¹ School of Information Engineering, China University of Geosciences, Beijing, 100083, China
² Tencent, Inc., Beijing, 100083, China


Abstract—Building on a comparative analysis of common web information retrieval technologies, this article discusses the principles and applications of Beautiful Soup, a vertical information search technology based on the DOM tree structure. Using a working system as an example and focusing on its architecture and core technology, the article shows how to use Beautiful Soup to perform deep information retrieval on partially structured webpage data, obtain targeted information, reorganize it, and send it to users via text message. Test results show that the web crawler achieved over 95% accuracy, which meets the needs of commercial applications.

Index Terms—DOM, information collection, vertical information retrieval, Beautiful Soup, customization.
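
The extraction pipeline summarized in the abstract (parse a page into a DOM tree, pull targeted fields from partially structured markup, reorganize them into a short message) might look roughly like the following Python sketch. It is not the authors' system: the HTML snippet, the class names `listing`, `title`, and `price`, and the helper functions `extract_listings` and `to_message` are all hypothetical and exist only for illustration. It assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, partially structured page: the second listing has no price.
HTML = """
<html><body>
<div class="listing">
  <h2 class="title">Item A</h2>
  <span class="price">12.50</span>
</div>
<div class="listing">
  <h2 class="title">Item B</h2>
</div>
<div class="listing">
  <h2 class="title">Item C</h2>
  <span class="price">7.00</span>
</div>
</body></html>
"""


def extract_listings(html):
    """Walk the DOM tree and collect one record per listing node."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for node in soup.find_all("div", class_="listing"):
        title = node.find("h2", class_="title")
        price = node.find("span", class_="price")
        # Missing fields are recorded as None instead of raising an error.
        records.append({
            "title": title.get_text(strip=True) if title else None,
            "price": float(price.get_text(strip=True)) if price else None,
        })
    return records


def to_message(records):
    """Reorganize complete records into a short plain-text message body."""
    lines = [
        f"{r['title']}: {r['price']:.2f}"
        for r in records
        if r["title"] is not None and r["price"] is not None
    ]
    return "\n".join(lines)


records = extract_listings(HTML)
message = to_message(records)
print(message)
```

Checking each field for `None` is one simple way to tolerate markup that is only partially structured; how the resulting message is delivered (for example, through an SMS gateway) is left out because this page does not describe it.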


Cite: Chunmei Zheng, Guomei He, Zuojie Peng, "A Study of Web Information Extraction Technology Based on Beautiful Soup," Journal of Computers, vol. 10, no. 6, pp. 381-387, 2015.
