DOM-based Content Extraction of HTML Documents

Gupta, Suhit; Neistadt, David; Kaiser, Gail E.; Grimm, Peter

Web pages often contain clutter around the body of an article, as well as distracting features that take away from the information the user is actually pursuing. This clutter can range from pop-up ads to flashy banners to unnecessary images and links scattered around the screen. Extraction of "useful and relevant" content from web pages has many applications, ranging from lightweight environments, like cell phone and PDA browsing, to speech rendering for the visually impaired, to text summarization. Most approaches to removing the clutter or making the content more readable involve either changing the font size or simply removing certain HTML-denoted components like images, thus taking away from the webpage's inherent look and feel. Unlike Content Reformatting, which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses Content Extraction. We have developed a framework that employs an easily extensible set of techniques that incorporate the advantages of previous work on content extraction while limiting the disadvantages. Our key insight is to work with the Document Object Model tree (after parsing and correcting the HTML), rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that anyone can use to extract content from HTML web pages for their own purposes.
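To make the idea of filtering a DOM tree rather than raw markup concrete, the following is a minimal sketch in Python. It is not the authors' proxy or their filter set. The names `Node`, `TreeBuilder`, `prune`, and `extract_content`, the list of removed tags, and the link-density threshold of 0.5 are all assumptions chosen for illustration. The sketch builds a small tree with the standard library's `html.parser`, tolerates unclosed tags, drops a few element types outright, drops container elements whose text is mostly link text, and returns the remaining text.

```python
from html.parser import HTMLParser

# Illustrative assumptions, not taken from the report:
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}   # elements with no end tag
REMOVE_TAGS = {"script", "style", "img", "iframe"}         # always dropped
CONTAINER_TAGS = {"div", "td", "ul", "table"}              # checked for link density


class Node:
    """A minimal DOM-like element; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree, tolerating missing or mismatched end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_len(node):
    """Total length of all text under node."""
    return sum(len(c) if isinstance(c, str) else text_len(c) for c in node.children)


def link_text_len(node):
    """Total length of text that sits inside <a> elements under node."""
    total = 0
    for c in node.children:
        if isinstance(c, Node):
            total += text_len(c) if c.tag == "a" else link_text_len(c)
    return total


def prune(node, max_link_ratio=0.5):
    """Remove unwanted elements and link-heavy containers, in place."""
    kept = []
    for c in node.children:
        if isinstance(c, Node):
            if c.tag in REMOVE_TAGS:
                continue
            total = text_len(c)
            if c.tag in CONTAINER_TAGS and total and link_text_len(c) / total > max_link_ratio:
                continue
            prune(c, max_link_ratio)
        kept.append(c)
    node.children = kept


def extract_text(node):
    """Concatenate the text that survives pruning, in document order."""
    parts = [c if isinstance(c, str) else extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return extract_text(builder.root)


if __name__ == "__main__":
    page = (
        '<html><body><div><a href="/a">Home</a> <a href="/b">Sports</a></div>'
        '<script>var x=1;</script>'
        '<p>The article body has <a href="/c">one link</a> inside.</p>'
        '<img src="ad.gif"></body></html>'
    )
    print(extract_content(page))
```

Because each filter operates on tree nodes, a rule such as "drop this container if most of its text is link text" can be expressed directly. Against raw markup the same rule would require matching opening and closing tags by hand. New filters can be added as further checks inside `prune` without touching the parser.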



More About This Work

Academic Units
Computer Science
Department of Computer Science, Columbia University
Columbia University Computer Science Technical Reports, CUCS-024-02
Published Here
April 21, 2011