2002 Reports
DOM-based Content Extraction of HTML Documents
Web pages often contain clutter around the body of the article as well as distracting features that take away from the true information that the user is pursuing. This can range from pop-up ads to flashy banners to unnecessary images and links scattered around the screen. Extraction of 'useful and relevant' content from web pages, has many applications ranging from lightweight environments, like cell phone and PDA browsing, to speech rendering for the visually impaired, to text summarization Most approaches to removing the clutter or making the content more readable involves either changing the size of the font or simply removing certain HTML-denoted components like images, thus taking away from the webpage's inherent look and feel. Unlike Content Reformatting, which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses Content Extraction. We have developed a framework that employs an easily extensible set of techniques that incorporate advantages of previous work on content extraction while limiting the disadvantages. Our key insight is to work with the Document Object Model tree (after parsing and correcting the HTML), rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that anyone can use to extract content from HTML web pages for their own purposes.
Subjects
Files
-
cucs-024-02.pdf application/pdf 801 KB Download File
More About This Work
- Academic Units
- Computer Science
- Publisher
- Department of Computer Science, Columbia University
- Series
- Columbia University Computer Science Technical Reports, CUCS-024-02
- Published Here
- April 21, 2011