Web Page Retrieval and Parsing

Robert D. Cameron
Feb 11, 2004

Web Page Retrieval

Web page retrieval is a common requirement for robots, agents and other tools that treat the web as a data source.

Web as File System

Web pages may be viewed as files that can be processed in much the same way as local files.

PHP filesystem functions can open local files as well as http and https URLs.

The client URL (cURL) package is a command-line utility for accessing files according to a wide variety of URL schemes, including http, https, telnet, ldap, dict, gopher and file. It may be available as a run-time library (libcurl) with PHP.
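The file-like view described above is not specific to PHP. As a minimal sketch, assuming Python's standard urllib.request module rather than the PHP functions or cURL, the helper below reads a resource through one handle whether the URL scheme is file, http or https. The name read_resource and the 10-second time-out are illustrative choices, not part of any package mentioned here.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_resource(url, encoding="utf-8"):
    """Read a whole resource through a file-like handle, whatever the URL scheme."""
    # urlopen returns an object with read(), much like an open local file.
    with urlopen(url, timeout=10) as handle:
        return handle.read().decode(encoding, errors="replace")


if __name__ == "__main__":
    # Demonstrate with a local file so the example runs without network access.
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page.html"
        page.write_text("<html><body>hello</body></html>", encoding="utf-8")
        print(read_resource(page.as_uri()))
```

Passing an http or https URL instead of the file URL should exercise the same code path, subject to network availability.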

Recursive Retrieval

GNU Wget is a Unix program that can recursively download web pages. Its features include the following.

Time-out, retry, resumption.
FTP and HTTP.
Relative link processing.
User-agent emulation.
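To illustrate what recursive retrieval with relative link processing involves, here is a minimal sketch in Python. It is not how Wget is implemented. The crawl function, the HREF pattern and the fetch callback are all hypothetical names; fetch is assumed to return a page's HTML as a string, or None when the page cannot be retrieved. Relative links are resolved against the page they appear on, fragments are dropped, and only links on the starting host are followed, breadth-first, up to max_depth.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Simplified pattern: quoted href values only.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(start_url, fetch, max_depth=2):
    """Return {url: html} for same-host pages within max_depth links of start_url."""
    host = urlparse(start_url).netloc
    pages = {}
    queued = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # retrieval failed; nothing to store or follow
        pages[url] = html
        if depth == max_depth:
            continue  # keep the page but do not follow its links
        for href in HREF.findall(html):
            # Resolve the link relative to the current page, then drop any #fragment.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == host and absolute not in queued:
                queued.add(absolute)
                queue.append((absolute, depth + 1))
    return pages


if __name__ == "__main__":
    # A tiny in-memory "site" stands in for the network.
    site = {
        "http://example.org/": '<a href="docs/a.html">A</a> <a href="http://other.example/x">X</a>',
        "http://example.org/docs/a.html": '<a href="../index.html#top">home</a> <a href="b.html">B</a>',
        "http://example.org/docs/b.html": "no links here",
    }
    for url in sorted(crawl("http://example.org/", site.get)):
        print(url)
```

Taking fetch as a parameter keeps the traversal separate from retrieval, so time-outs, retries or a custom user-agent could be handled inside whatever function is supplied.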

HTML Parsing: Scraping

Extracting data from web pages is difficult because many pages are not valid HTML.

The term web scraping is often used to refer to the process of extracting data from HTML pages.

Web scraping is often performed with regular expressions, whose string-processing capabilities are well suited to pulling patterns out of markup.

For example, the following program illustrates extraction of hyperlinks from a web page.

The scrapelinks.php script.
A sample page for scraping.
The scrape results.
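The scrapelinks.php script itself is not reproduced here. As a rough sketch of the same idea in Python, the regular expression below pulls the URL and anchor text out of each hyperlink. The names LINK and scrape_links are hypothetical, and the pattern is deliberately simple: it accepts double-quoted, single-quoted and unquoted href values but makes no attempt to handle every form of malformed markup.

```python
import re

# One alternative per href quoting style, followed by the anchor content.
LINK = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)


def scrape_links(html):
    """Return (url, anchor text) pairs for each <a href=...>...</a> found in html."""
    links = []
    for double_quoted, single_quoted, bare, content in LINK.findall(html):
        url = double_quoted or single_quoted or bare
        label = re.sub(r"<[^>]+>", "", content)  # drop tags nested inside the anchor
        label = " ".join(label.split())          # collapse runs of whitespace
        links.append((url, label))
    return links


if __name__ == "__main__":
    sample = (
        '<p><A HREF="a.html">First</A> and '
        "<a class=x href='b.html'>Second <b>bold</b></a> "
        "<a href=c.html>Third</a></p>"
    )
    for url, label in scrape_links(sample):
        print(url, "->", label)
```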

HTML Preprocessing with HTML Tidy

The HTML Tidy utility may be used to clean up ill-formed HTML before page scraping or parsing. An online service is available for demonstration.

Tidy can produce XHTML output, which can then be reliably parsed using various XML parsers.
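Once a page is well-formed XHTML, link extraction no longer needs pattern matching. The sketch below assumes the XHTML has already been produced (the sample string is hand-written, not actual Tidy output) and uses Python's standard xml.etree.ElementTree parser. The function name xhtml_links is hypothetical. One caveat: this parser does not read the XHTML DTD, so named entities such as &nbsp; would be rejected unless converted to numeric references first.

```python
import xml.etree.ElementTree as ET

# Elements in XHTML documents live in this namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def xhtml_links(xhtml):
    """Return (href, text) pairs for every <a> element carrying an href attribute."""
    root = ET.fromstring(xhtml)
    links = []
    for anchor in root.iter(f"{{{XHTML_NS}}}a"):
        href = anchor.get("href")
        if href is not None:
            # itertext() gathers text from the anchor and any nested elements.
            links.append((href, "".join(anchor.itertext()).strip()))
    return links


if __name__ == "__main__":
    sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><p><a href="a.html">First</a> <a href="b.html">Second <b>bold</b></a></p></body>
</html>"""
    for href, text in xhtml_links(sample):
        print(href, "->", text)
```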

The REX Shallow Parser can process any document, well-formed or not, returning a list of items in which each item is either valid or invalid XML.
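The idea of shallow parsing can be sketched with a much smaller grammar than REX uses. The Python code below is not the REX expressions. It is a simplified illustration that recognises only element tags and text, so comments, declarations, processing instructions and CDATA sections would be reported as invalid. The names ITEM, TAG, BAD_AMP and shallow_parse are hypothetical.

```python
import re

# Every character of the input falls into exactly one item:
# something tag-like, a run of text, or a stray "<".
ITEM = re.compile(r"<[^<>]*>|[^<]+|<")

NAME = r"[A-Za-z_][\w:.-]*"
ATTR = rf"""\s+{NAME}\s*=\s*(?:"[^"<]*"|'[^'<]*')"""
# A start or empty-element tag with quoted attributes, or an end tag.
TAG = re.compile(rf"<{NAME}(?:{ATTR})*\s*/?>|</{NAME}\s*>")
# An ampersand that does not begin an entity or character reference.
BAD_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)")


def shallow_parse(text):
    """Split text into items and flag each as valid or invalid (simplified rules)."""
    items = []
    for item in ITEM.findall(text):
        if item.startswith("<"):
            ok = TAG.fullmatch(item) is not None
        else:
            ok = BAD_AMP.search(item) is None
        items.append((item, ok))
    return items


if __name__ == "__main__":
    sample = '<p class="x">a &amp; b<br/><a href=c.html>c & d</a> 1 < 2'
    for item, ok in shallow_parse(sample):
        print("valid  " if ok else "INVALID", repr(item))
```

Because no input is ever rejected outright, concatenating the items reproduces the original document, which is what makes this style of parser usable on ill-formed pages.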