Overview

screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Depending on your needs, scripts can be helpful for such things as interacting with databases and dynamically determining which files get scraped, and when.

As in many programming environments, scripts in screen-scraper are invoked in response to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or data has been extracted from a page.

Depending on your preferences, there are a number of languages that scripts can be written in. screen-scraper supports JavaScript, Interpreted Java, and Python on any platform, and JScript, VBScript, and Perl when running on Windows. Try the links at the bottom of this screen for information specific to each of the scripting languages.

If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.

Managing scripts

Scripts are added by clicking the "New Script" button (looks like a pencil and paper) or by selecting "File->New Script" from the menu bar. Delete a script either by selecting it and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete".

Each script is given a unique name so that you can easily indicate when it should be invoked (e.g. before a scraping session begins or after each application of an extractor pattern). You can also select the language the script is written in. Scripts can be exported to an XML file so that they can be backed up or transferred to other instances of screen-scraper. See the Importing and exporting objects page for more information on this. Clicking on the "Show Script Instances" button will display any locations where this script is invoked in the format scraping session: scrapeable file: extractor pattern.

Finally, you're given a text box in which to write your script. The text editing features for authoring scripts in screen-scraper are currently fairly limited, so you may want to consider using an external editor, then copying and pasting text into screen-scraper.

Using scripts

There are three ways to invoke a script: right-click it in the main tree and select "Run Script", select it from the tree on the left and click the "Run" button, or associate it with some event. For example, if you click on a scraping session in the tree, then on the "Scripts" tab, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns. After associating a script with an object in this way it can be disassociated by selecting it in the table and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete". You can also selectively enable and disable scripts using the "Enabled?" checkbox in the rightmost column.

Working with external Java libraries

Existing Java code can be referred to from within scripts. If it doesn't exist already, create an "ext" directory inside the "lib" folder of your screen-scraper installation directory. Simply copy any jar files you'd like to refer to from within scripts into this directory. Note that you'll still need to use the "import" statement within your scripts to refer to specific classes.

Built-in objects

screen-scraper offers a few objects that you can work with in a script. Bear in mind that not all of these variables will be available in all scripts. See the "Variable scope" section (following this one) for more details.

  • session. This variable allows for interaction with the currently running session. It has the following methods:
    • breakpoint() . Displays the "Breakpoint" frame. See the "Debugging scripts" section below for more details.
      example:   session.breakpoint();
    • downloadFile( String url, String fileName ) . Downloads the file found at the url and saves it to the local file system at the path designated by fileName.
      example:   session.downloadFile( "http://www.foo.com/imgs/puppy_image.gif", "C:\\images\\puppy.gif" );
    • getVariable( String identifier ). Retrieves the value of a saved session variable designated by identifier.
      example:   cityCode = session.getVariable( "CITY_CODE" );
    • log( String message ). Causes message to be written to the "Log" panel for the currently running scraping session.
      example:   session.log( "Inserting extracted data into the database." );
    • pause( long time ). Causes the scraping session to pause for the given number of milliseconds.
      example:   session.pause( 5000 );
    • setVariable( String identifier, Object value ). Designates that value should be saved for the duration of the session, and can be accessed using the getVariable method using identifier. Note that the dataSet and dataRecord objects can be stored in session variables, and later accessed using a RemoteScrapingSession (see the links at the bottom of the Running screen-scraper as a server page for more details on this).
      example:   session.setVariable( "CITY_CODE", dataSet.get( 0, "CITY_CODE" ) );
    • scrapeFile( String scrapeableFileIdentifier ). Causes the scrapeable file identified by scrapeableFileIdentifier to be scraped.
      example:   session.scrapeFile( "Login" );
    • sendMail( String subject, String body, String recipients, String attachments, String headers ) . Sends an email using the given subject and body to the given comma-separated recipients. The attachments (optional) parameter should be a comma-separated list of paths designating files that should be sent as attachments with the email. The headers (optional) parameter allows you to designate arbitrary SMTP headers to be used when sending the email. Note that in order for this to work properly a valid mail server must have been previously designated in the settings dialog box.
      example:   session.sendMail( "Test message", "This is the body of the email", "my_friend@mydomain.com", null, null );
    • setCookie( String domain, String key, String value ) . Causes a cookie to be manually set on the current session state. Note that this method should be rarely used, given that screen-scraper automatically manages cookies. It might be necessary in cases where a site sets cookies via JavaScript.
      example:   session.setCookie( "mydomain.com", "cookie_key", "cookie_value" );
    • stopScraping(). Indicates that the current session should be stopped as soon as possible.
      example:   session.stopScraping();
  • scrapeableFile. This variable allows for interaction with the current scrapeable file. It has the following methods:
    • addHTTPParameter( HTTPParameter parameter ) . Dynamically adds an HTTPParameter to the current scrapeable file. The HTTPParameter constructor is as follows: HTTPParameter( String key, String value, int sequence, int type ). Valid types for the constructor are TYPE_GET and TYPE_POST.
      example:   scrapeableFile.addHTTPParameter( new com.screenscraper.common.HTTPParameter( "key", "value", 1, com.screenscraper.common.HTTPParameter.TYPE_POST ) );
    • extractData( String text, String name ). Manually invokes an extractor pattern, returning the extracted data in a DataSet object. The text parameter should be a string containing the HTML you'd like to extract information from. The name parameter should be the name of an extractor pattern of the form [scraping session]:[scrapeable file]:extractor pattern where the scraping session and scrapeable file portions of the name are optional. For example, if you passed in "My Scraping Session:My Scrapeable File:My Extractor Pattern" screen-scraper would find the extractor pattern named "My Extractor Pattern" inside the scrapeable file "My Scrapeable File", which it would look for inside the scraping session called "My Scraping Session". You could also pass in "My Scrapeable File:My Extractor Pattern", which would cause screen-scraper to look in the current running scraping session for the scrapeable file "My Scrapeable File" where it would look for the extractor pattern "My Extractor Pattern". If the extractor pattern you want to use is associated with the current scrapeable file you can simply pass in its name (e.g., "My Extractor Pattern").
      example:   DataSet productData = scrapeableFile.extractData( productDescriptionText, "PRODUCT" );
    • extractOneValue( String text, String name ). This method is similar to extractData except that it assumes only a single string will be returned. When the method is invoked the first column in the first row of the resulting DataSet object will be returned. The text parameter should be a string containing the HTML you'd like to extract information from. The name parameter should be the name of an extractor pattern associated with the current scrapeable file.
      example:   productName = scrapeableFile.extractOneValue( productDescriptionText, "PRODUCT_NAME" );
    • getContentAsString(). Gets the content that was retrieved when the scrapeable file was requested.
      example:   session.log( scrapeableFile.getContentAsString() );
    • getCurrentPOSTData(). Returns the POST data for the scrapeable file. Note that if this method is invoked after the scrapeable file is requested it will contain the POST data with all of the session variable tokens resolved.
      example:   currentPOSTData = scrapeableFile.getCurrentPOSTData();
    • getCurrentURL(). Returns the URL of the scrapeable file. Note that if this method is invoked after the scrapeable file is requested it will contain the URL with all of the session variable tokens resolved.
      example:   currentURL = scrapeableFile.getCurrentURL();
    • noExtractorPatternsMatched(). Will return true if no extractor patterns associated with the scrapeable file found a match. This can be a useful error-handling mechanism.
      example:   if( scrapeableFile.noExtractorPatternsMatched() ) { session.log( "Warning! No extractor patterns matched." ); }
    • removeAllHTTPParameters(). Removes all of the HTTP parameters from the current scrapeable file. This can be useful in cases where scrapeable files are requested multiple times and parameters are added dynamically using the addHTTPParameter method.
      example:   scrapeableFile.removeAllHTTPParameters();
    • wasErrorOnRequest(). Indicates whether or not an error occurred when screen-scraper tried to retrieve the scrapeable file.
      example:   if( scrapeableFile.wasErrorOnRequest() ) { session.log( "Connection error occurred." ); }
  • dataSet. A data set is analogous to a result or record set that would be returned from a database query. A data set contains any number of data records, which are analogous to rows in a database. The dataSet object holds all data records extracted by an extractor pattern after it has been applied as many times as possible to the HTML retrieved by a scrapeable file. Its methods are:
    • deleteDataRecord( int dataRecordNumber ). Removes a data record from the set designated by dataRecordNumber.
      example:   dataSet.deleteDataRecord( 2 );
    • getAllDataRecords(). Returns an ArrayList of DataRecord objects (each of which simply extends Hashtable).
      example:   allData = dataSet.getAllDataRecords();
    • getDataRecord( int dataRecordNumber ). Returns the data record (a Hashtable object) indicated by the given dataRecordNumber. The first data record is referenced by dataRecordNumber 0.
      example:   firstDataRecord = dataSet.getDataRecord( 0 );
    • getNumDataRecords(). Returns the number of data records held by this object.
      example:   totalDataRecords = dataSet.getNumDataRecords();
    • get( int dataRecordNumber, String identifier ). Gets a single piece of data held by the data record designated by dataRecordNumber corresponding to identifier.
      example:   firstCityCode = dataSet.get( 0, "CITY_CODE" );
    • writeToFile( String fileName ). Writes the data contained by the set to a tab-delimited file. If the file already exists this method will append data to it.
      example:   dataSet.writeToFile( "C:\\site_data\\extracted_data.txt" );
  • dataRecord. This gives access to the most recently extracted data record. It will most likely only be used in scripts that are invoked after each application of an extractor pattern. This object simply extends Hashtable, and documentation on its methods can be found in the Java API documentation for java.util.Hashtable. Note that this object is populated using the names of tokens from extractor patterns. So, for example, if your extractor pattern uses a token named "CITY_CODE" the data extracted by that extractor pattern would be retrieved like so:
    example:   cityCode = dataRecord.get( "CITY_CODE" );
  • com.screenscraper.scraper.RunnableScrapingSession. This is a class that can be instantiated within a script in order to run a scraping session. The "Maximum number of concurrent running scraping sessions" in the "Settings" dialog box will control how many scraping sessions can be run simultaneously.
    • RunnableScrapingSession( String name ). The constructor. Takes the name of an existing scraping session.
      example:   myScrapingSession = new com.screenscraper.scraper.RunnableScrapingSession( "My Session" );
    • getName(). Returns the name of the scraping session.
      example:   sessionName = myScrapingSession.getName();
    • getTimeout(). Gets the timeout, in minutes, of the session.
      example:   session.log( "Session timeout: " + myScrapingSession.getTimeout() );
    • scrape(). Starts the session scraping. This is equivalent to clicking the "Start Scraper" button on the scraping session "General" panel. Please note that when this method is called it will return immediately. That is, the line just following the one executing this method will be run without waiting for the scraping session to finish scraping. Internally, screen-scraper spawns a separate thread to handle the scraping session so that the script can continue executing (and so that multiple scraping sessions can be run simultaneously).
      example:   myScrapingSession.scrape();
    • setTimeout( int timeout ). Sets the timeout, in minutes, of the session. That is, after the given number of minutes have passed the session will automatically terminate. This can be useful in cases where an infinite loop might occur (e.g., the same pages get scraped over and over).
      example:   myScrapingSession.setTimeout( 60 );
    • setVariable( String identifier, Object value ). Designates that value should be saved for the duration of the session, and can be accessed through the session object (described above) using the getVariable method.
      example:   myScrapingSession.setVariable( "LOGIN_USERNAME", "my_username" );
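
The note on scrape() above says the call returns immediately while the session runs on a separate thread. The following is a minimal sketch of that general pattern in plain Java; it does not use any screen-scraper classes. BackgroundJobSketch, start, isFinished, and awaitFinished are hypothetical names chosen for illustration, and nothing here implies that RunnableScrapingSession offers equivalent methods.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for "something that runs on its own thread".
// This is NOT screen-scraper's RunnableScrapingSession.
class BackgroundJobSketch {
    private final Runnable work;
    private final CountDownLatch done = new CountDownLatch(1);

    BackgroundJobSketch(Runnable work) {
        this.work = work;
    }

    // Starts the work on a new thread and returns at once,
    // without waiting for the work to complete.
    void start() {
        Thread worker = new Thread(() -> {
            try {
                work.run();
            } finally {
                done.countDown(); // signal completion even if the work throws
            }
        });
        worker.start();
    }

    // True only once the background work has completed.
    boolean isFinished() {
        return done.getCount() == 0;
    }

    // Blocks the calling thread until the background work completes.
    void awaitFinished() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // The work waits on a gate so the demo is deterministic.
        CountDownLatch gate = new CountDownLatch(1);
        BackgroundJobSketch job = new BackgroundJobSketch(() -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        job.start();
        System.out.println("finished right after start(): " + job.isFinished()); // false
        gate.countDown(); // let the work complete
        job.awaitFinished();
        System.out.println("finished after waiting: " + job.isFinished()); // true
    }
}
```

The practical point for scripts is the first printed line: code that runs directly after a non-blocking call cannot assume the background work has produced its results yet.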

Variable scope

Depending on when a script gets run different variables may be in scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies what variables will be in scope depending on when a given script is run. Note that none of the variables will be in scope when a script is invoked directly (by clicking on the "Run Script" button), though it is common in these scripts to create RunnableScrapingSession objects.

Variables in scope (X = in scope):

When script is run               session   scrapeableFile   dataSet   dataRecord
Before scraping session begins   X
After scraping session ends      X
Before file is scraped           X         X
After file is scraped            X         X
Before pattern is applied        X         X
After pattern is applied         X         X                X
After each pattern application   X         X                X         X
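
For quick reference while writing scripts, the scope table above can be encoded as a small lookup. This is only an illustrative sketch in plain Java: ScriptScopeSketch, Trigger, Variable, and inScope are hypothetical names and are not part of screen-scraper. The entries simply mirror the rows of the table.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper that mirrors the variable scope table; not a screen-scraper class.
class ScriptScopeSketch {
    enum Trigger {
        BEFORE_SESSION_BEGINS, AFTER_SESSION_ENDS,
        BEFORE_FILE_SCRAPED, AFTER_FILE_SCRAPED,
        BEFORE_PATTERN_APPLIED, AFTER_PATTERN_APPLIED,
        AFTER_EACH_PATTERN_APPLICATION
    }

    enum Variable { SESSION, SCRAPEABLE_FILE, DATA_SET, DATA_RECORD }

    private static final Map<Trigger, Set<Variable>> IN_SCOPE = new EnumMap<>(Trigger.class);

    static {
        // One entry per row of the table.
        IN_SCOPE.put(Trigger.BEFORE_SESSION_BEGINS, EnumSet.of(Variable.SESSION));
        IN_SCOPE.put(Trigger.AFTER_SESSION_ENDS, EnumSet.of(Variable.SESSION));
        IN_SCOPE.put(Trigger.BEFORE_FILE_SCRAPED,
                EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE));
        IN_SCOPE.put(Trigger.AFTER_FILE_SCRAPED,
                EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE));
        IN_SCOPE.put(Trigger.BEFORE_PATTERN_APPLIED,
                EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE));
        IN_SCOPE.put(Trigger.AFTER_PATTERN_APPLIED,
                EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE, Variable.DATA_SET));
        IN_SCOPE.put(Trigger.AFTER_EACH_PATTERN_APPLICATION,
                EnumSet.allOf(Variable.class));
    }

    // Returns true if the variable is available to a script run at the given point.
    static boolean inScope(Trigger when, Variable variable) {
        return IN_SCOPE.get(when).contains(variable);
    }

    public static void main(String[] args) {
        // dataSet is available after a pattern is applied; dataRecord is not.
        System.out.println(inScope(Trigger.AFTER_PATTERN_APPLIED, Variable.DATA_SET));    // true
        System.out.println(inScope(Trigger.AFTER_PATTERN_APPLIED, Variable.DATA_RECORD)); // false
    }
}
```

Note that dataRecord is only in scope for scripts run after each pattern application, which is why a script set to run "After pattern is applied" should read from dataSet instead.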

Debugging scripts

One of the best ways to fix errors is to simply watch the scraping session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

When screen-scraper is running as a server it will automatically generate individual log files in the "log" directory for each running scraping session (this can be disabled in the settings window). An "error.log" file will also be generated in that same directory when internal screen-scraper errors occur.

The "Breakpoint" window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script. While the "Breakpoint" window is displayed, script execution will halt. There are two buttons along the top of the window. The "play" button will simply continue execution of your script. Clicking the "stop" button will cause screen-scraper to halt execution as soon as it can. The "Breakpoint" window also exposes any session variables, data sets, and data records that are in scope. These values can be altered in the "Breakpoint" window as well.

