|
If you know basic HTML, you can extract information from web pages using screen-scraper. Extractor patterns are snippets of HTML containing special tokens that mark where the data to be extracted is found. For example, the following extractor pattern:
<head><title>~@PAGE_TITLE@~</title></head>
indicates that the title of the HTML page is to be extracted into the PAGE_TITLE variable. Parsing HTML in this fashion is significantly simpler than other common methods that involve hand-written regular expressions and text scanning.
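To make the idea concrete, here is a minimal Python sketch of how a token-based pattern like the one above could be matched against a page. It is an illustration only, not screen-scraper's actual implementation: the function names `pattern_to_regex` and `extract` are hypothetical, and the sketch assumes that the literal HTML around each token must match the page exactly.

```python
import re

# Matches tokens of the form ~@NAME@~ (the token syntax shown above).
TOKEN = re.compile(r"~@([A-Za-z_][A-Za-z0-9_]*)@~")


def pattern_to_regex(pattern):
    """Turn an extractor-style pattern into a compiled regular expression.

    Hypothetical helper: literal HTML is escaped and each token becomes
    a named, non-greedy capture group.
    """
    parts = []
    pos = 0
    for match in TOKEN.finditer(pattern):
        # Literal text before the token must match the page verbatim.
        parts.append(re.escape(pattern[pos:match.start()]))
        # The token itself captures as little text as possible.
        parts.append("(?P<%s>.*?)" % match.group(1))
        pos = match.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract(pattern, html):
    """Return a dict of token name -> extracted text, or {} if no match."""
    found = pattern_to_regex(pattern).search(html)
    return found.groupdict() if found else {}


page = "<html><head><title>Example Domain</title></head><body></body></html>"
result = extract("<head><title>~@PAGE_TITLE@~</title></head>", page)
print(result["PAGE_TITLE"])
```

The person writing the pattern only deals with HTML and token names; the regular expression is generated behind the scenes, which is the simplification the paragraph above describes.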
Once scraped, the data can be written to a spreadsheet, saved to a database, made available to an external application, or processed in just about any other way that a scripting language allows.
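As one example of the spreadsheet case, the following Python sketch writes extracted values to CSV, which most spreadsheet programs can open. It assumes each scraped record is a dictionary keyed by token name; the `write_records` helper and the sample record are hypothetical and are not part of screen-scraper.

```python
import csv
import io


def write_records(records, fileobj, fieldnames):
    """Write scraped records (a list of dicts) as CSV rows with a header."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Hypothetical scraped data: one record per page.
records = [{"PAGE_TITLE": "Example Domain"}]

# An in-memory buffer keeps the sketch self-contained; a real script
# would typically pass open("titles.csv", "w", newline="") instead.
buffer = io.StringIO()
write_records(records, buffer, fieldnames=["PAGE_TITLE"])
print(buffer.getvalue())
```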
|