Using Scrapeable Files

Overview

A scrapeable file is a URL-accessible file that you want retrieved as part of a scraping session. These files are the core of screen-scraping, as they determine what content is available to extract data from.

Scrapeable files are created by clicking the "Add Scrapeable File" button from the "General" tab for a scraping session. You can delete a scrapeable file by right-clicking (or option-clicking in Mac OS X) it in the tree on the left side of the screen and selecting "Delete".

In addition to working with files on remote servers, screen-scraper can also handle files on local file systems. For example, the following is a valid path to designate in the URL field: C:\wwwroot\myweb\my_file.htm.

Properties tab



The "Properties" tab defines basic settings needed to request a file.

  • Delete: Deletes the scrapeable file.
  • Copy: Copies the scrapeable file. (enterprise edition only)
  • Name: Identifies the scrapeable file.

  • URL: The URL of the file to be scraped. This is likely something like http://www.mysite.com/, but can also contain embedded session variables, like this: http://www.mysite.com/cgi-bin/test.cgi?param1=~#TEST#~. In the latter case the text ~#TEST#~ would get replaced with the value of the corresponding session variable.
  • Sequence: Indicates the order in which this file should be requested.
  • This scrapeable file will be invoked manually from a script: Indicates that this scrapeable file will be invoked within a script, so it should not be scraped in sequence. If this box is checked the "Sequence" text box becomes grayed out.
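The effect of embedding a session variable in the "URL" field can be illustrated with a short Python sketch. This is not screen-scraper's own code; it only mimics the substitution described above. The function name expand_tokens, the characters allowed in a variable name, and the choice to leave unknown tokens untouched are all assumptions made for the example.

```python
import re

# Matches tokens of the form ~#NAME#~ (the allowed name characters are an assumption).
TOKEN = re.compile(r"~#([A-Za-z0-9_]+)#~")


def expand_tokens(template, session_vars):
    """Replace each ~#NAME#~ token with the matching session variable."""
    def lookup(match):
        name = match.group(1)
        # Assumption: a token with no matching variable is left unchanged.
        return str(session_vars.get(name, match.group(0)))
    return TOKEN.sub(lookup, template)


url = expand_tokens(
    "http://www.mysite.com/cgi-bin/test.cgi?param1=~#TEST#~",
    {"TEST": "hello"},
)
print(url)  # http://www.mysite.com/cgi-bin/test.cgi?param1=hello
```

The sketch inserts the value as-is; whether and how values are URL-encoded at run time is not covered here.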

Parameters tab



The "Parameters" tab indicates GET and POST parameters that should be sent when the file is requested. Note that GET parameters can also be embedded in the "URL" field under the "Properties" tab. Parameters are added using the "Add Parameter" button. They can be deleted by selecting them and either hitting the "Delete" key on the keyboard, or by right-clicking (option-clicking in Mac OS X) and selecting "Delete".

In the Enterprise Edition of screen-scraper you can also designate files to be uploaded. This is done by designating "FILE" as the parameter type. The "Key" column contains the name of the parameter (as found in the corresponding HTML form), and the "Value" column contains the local path to the file you'd like to upload (e.g., C:\myfiles\this_file.txt).

Embedded session variables can be used in the "Key" and "Value" fields for parameters. For example, if you have a "username" POST parameter you might embed a USERNAME session variable in the "Value" field with the token ~#USERNAME#~. This would cause the value of the "USERNAME" session variable to be substituted in at run time.
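As a rough illustration of what happens to a parameter list at request time, the following Python sketch expands tokens in both keys and values and then encodes the GET and POST parameters separately. It is a sketch under stated assumptions, not screen-scraper's implementation: the helper names, the "page" parameter, and the sample values are hypothetical, and FILE parameters are left out.

```python
import re
from urllib.parse import urlencode

TOKEN = re.compile(r"~#([A-Za-z0-9_]+)#~")


def expand_tokens(text, session_vars):
    # Assumption: unknown tokens are left unchanged.
    return TOKEN.sub(lambda m: str(session_vars.get(m.group(1), m.group(0))), text)


def build_parameters(rows, session_vars):
    """rows is a list of (key, value, type) tuples; type is 'GET' or 'POST'."""
    get_pairs, post_pairs = [], []
    for key, value, kind in rows:
        pair = (expand_tokens(key, session_vars), expand_tokens(value, session_vars))
        (get_pairs if kind == "GET" else post_pairs).append(pair)
    # urlencode produces application/x-www-form-urlencoded text.
    return urlencode(get_pairs), urlencode(post_pairs)


rows = [
    ("page", "1", "GET"),                  # hypothetical GET parameter
    ("username", "~#USERNAME#~", "POST"),  # token from the text above
]
query, body = build_parameters(rows, {"USERNAME": "j smith"})
print(query)  # page=1
print(body)   # username=j+smith
```

The query string would be appended to the URL, while the body would be sent as the content of the POST request.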

Extractor Patterns tab



This tab holds the various extractor patterns that will be applied to the HTML of this scrapeable file. See the using extractor patterns page for more information.

Scripts tab



This tab designates scripts to run either before or after the file is requested. This can be useful for tasks like setting session variables and requesting multiple pages of search results. The script to run is selected in the "Script Name" column. The "Sequence" column determines the order in which the scripts are invoked, and the "When to Run" column indicates the event that triggers each script. A script runs only if the checkbox in its "Enabled?" column is checked.
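How the four columns interact can be modeled in a few lines of Python. This is only a conceptual sketch: the ScriptEntry class, the "before"/"after" labels, and the script names are placeholders invented for the example, and they do not correspond to screen-scraper's internal types or to the exact wording of the "When to Run" options.

```python
from dataclasses import dataclass


@dataclass
class ScriptEntry:
    name: str          # "Script Name" column
    sequence: int      # "Sequence" column
    when_to_run: str   # "When to Run" column (placeholder labels)
    enabled: bool = True  # "Enabled?" column


def scripts_for(entries, event):
    """Return the names of enabled scripts for one event, in sequence order."""
    chosen = [e for e in entries if e.enabled and e.when_to_run == event]
    return [e.name for e in sorted(chosen, key=lambda e: e.sequence)]


entries = [
    ScriptEntry("Save results", 2, "after"),
    ScriptEntry("Set search term", 1, "before"),
    ScriptEntry("Request next page", 1, "after"),
    ScriptEntry("Debug dump", 3, "after", enabled=False),
]
print(scripts_for(entries, "before"))  # ['Set search term']
print(scripts_for(entries, "after"))   # ['Request next page', 'Save results']
```

The disabled "Debug dump" entry never appears, regardless of its sequence number.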

Last Request tab



This tab displays the raw HTTP request from the last time this file was retrieved. It is useful when debugging, because it shows the POST and GET parameters that were sent to the server.

Last Response tab



This tab displays the raw HTTP and HTML from the last time this file was requested. The most common use for this tab is in generating and testing extractor patterns. You can generate an extractor pattern by highlighting a block of text or HTML, right-clicking (option-clicking on Mac OS X) and selecting "Generate extractor pattern from selected text".

The "Render HTML"/"View Source" button allows you to toggle between a rendered version of the page and the raw HTML source. In certain cases the HTML may contain embedded JavaScript and complex DHTML that screen-scraper has difficulty rendering. You can also use the "Display Response in Browser" button to display the web page in your default web browser.

Note that the contents shown under the "Last Response" tab might differ from the original HTML of the page. screen-scraper has the ability to "tidy" the HTML, which can facilitate data extraction. See using extractor patterns for more details on tidying HTML.

When viewed as text, the HTML for the last response can be searched using the "Find..." button.

Advanced tab (professional and enterprise editions only)



This tab contains a few advanced settings.

  • Username and Password: These two text fields are used with sites that require Basic, Digest, or NTLM authentication. You can generally recognize when a web site requires this type of authentication because, after requesting the page in a browser, a small box pops up asking for a username and password.
  • Tidy HTML after scraping?: When this box is checked screen-scraper will "tidy" the HTML after requesting the file. This cleans up the HTML, which facilitates extracting data from it. Note that a performance hit is incurred, however, when tidying is done. In cases where performance is critical this box should be un-checked.
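For background on the simplest of these schemes, the sketch below shows how an HTTP Basic authentication header is formed from a username and password, as defined by the HTTP Basic scheme. It illustrates the protocol only and says nothing about how screen-scraper builds the header internally; the function name is an assumption, and Digest and NTLM involve a challenge-response exchange that is not shown.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded (not encrypted).
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("Aladdin", "open sesame"))
# Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because Base64 is trivially reversible, Basic credentials are only protected when the request is sent over HTTPS.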

