Journal of AI and Data Mining
Pouramini, A., Khaje Hassani, S., Nasiri, S. (2018). Data Extraction using Content-Based Handles. Journal of AI and Data Mining, 6(2), 399-407. doi: 10.22044/jadm.2017.990

Data Extraction using Content-Based Handles

Article 15, Volume 6, Issue 2, Summer and Autumn 2018, Pages 399-407
Document Type: Original Manuscript
DOI: 10.22044/jadm.2017.990
Authors
A. Pouramini; S. Khaje Hassani; Sh. Nasiri
Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
Abstract
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.
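The idea of handles can be illustrated with a minimal Python sketch. This is not the HWrap algorithm, which the abstract does not spell out; it is a simplified illustration under stated assumptions. The handle patterns (`Title:`, `Price:`), the sample page, and the function names `extract_records` and `to_xml` are all hypothetical, and the page is assumed to be well-formed XHTML so that the standard-library parser can read it. The sketch locates an element whose text matches one handle, climbs toward the root until the subtree contains every handle (bottom-up), then reads the field values out of that subtree (top-down) and maps them onto XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical handles: field name -> regex whose group 1 captures the value.
HANDLES = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"Price:\s*\$?([\d.]+)"),
}

# Hypothetical page: two records with different HTML structure.
PAGE = """<html><body>
<div><p><b>Title: Dune</b></p><span>Price: $9.99</span></div>
<ul><li>Title: Emma</li><li>Price: $4.50</li></ul>
</body></html>"""


def extract_records(page_source, handles=HANDLES):
    root = ET.fromstring(page_source)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def fields_in(node):
        """Top-down pass: match every handle against the text under node."""
        found = {}
        for text in node.itertext():
            for name, pattern in handles.items():
                match = pattern.search(text)
                if match:
                    found.setdefault(name, []).append(match.group(1).strip())
        return found

    anchor = next(iter(handles.values()))  # the first handle starts the search
    records, seen = [], set()
    for node in root.iter():
        if not (node.text and anchor.search(node.text)):
            continue
        # Bottom-up pass: climb until the subtree contains every handle.
        region = node
        while region in parent and len(fields_in(region)) < len(handles):
            region = parent[region]
        found = fields_in(region)
        # Accept only regions holding exactly one value per handle.
        if id(region) in seen or any(len(found.get(n, [])) != 1 for n in handles):
            continue
        seen.add(id(region))
        records.append({name: values[0] for name, values in found.items()})
    return records


def to_xml(records):
    """Map the extracted records onto a simple hierarchical XML structure."""
    out = ET.Element("records")
    for rec in records:
        item = ET.SubElement(out, "record")
        for name, value in rec.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_records(PAGE)))
```

Because the patterns refer only to visible text, the same two handles recover both records even though one sits in a `div` and the other in a `ul`, which is the kind of independence from HTML structure the abstract describes. A real wrapper would also need to handle malformed HTML, text in element tails, and missing fields, none of which this sketch attempts.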
Keywords
Web Data Record Extraction; Web Wrapper Generation; Web Information Extraction
Main Subjects
Document and Text Processing
JAD is licensed under a Creative Commons Attribution 4.0 International License.