Using Scripts

Overview

screen-scraper has a built-in scripting engine to facilitate dynamically scraping sites and working with data once it's been extracted. Depending on your needs, scripts can be helpful for tasks such as interacting with databases and dynamically determining which files get scraped, and when.

As in event-driven programming languages, scripts in screen-scraper are tied to events. Just as you might designate a block of code to be run when a button is clicked in Visual Basic, in screen-scraper you might run a script after an HTML file has been downloaded or after data has been extracted from a page.

Scripts can be written in any of several languages, depending on your preferences. screen-scraper supports JavaScript, Interpreted Java, and Python on any platform, and JScript, VBScript, and Perl when running on Windows. Try the links at the bottom of this screen for information specific to each of the scripting languages.

If you haven't done so already, we'd highly recommend taking some time to go through our tutorials in order to get more familiar with how scripts are used.

Managing scripts

Scripts are added by clicking the "New Script" button (looks like a pencil and paper) or by selecting "File->New Script" from the menu bar. Delete a script either by selecting it and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete".

Each script is given a unique name so that you can easily indicate when it should be invoked (e.g. before a scraping session begins or after each application of an extractor pattern). You can also select the language the script is written in. Scripts can be exported to an XML file so that they can be backed up or transferred to other instances of screen-scraper. See the Importing and exporting objects page for more information on this. Clicking the "Show Script Instances" button displays every location where the script is invoked, in the format "scraping session: scrapeable file: extractor pattern".

Finally, you're given a text box in which to write your script. The text editing features for authoring scripts in screen-scraper are currently fairly limited, so you may want to consider using an external editor, then copying and pasting the text into screen-scraper.

Using scripts

You designate a script to be executed by associating it with some event. For example, if you click on a scraping session in the tree, then on the "Scripts" tab, you'll notice that you can designate scripts to be invoked either before a scraping session begins or after it completes. Other events that can be used to invoke scripts relate to scrapeable files and extractor patterns. Once a script has been associated with an object in this way, it can be disassociated by selecting it in the table and pressing the "Delete" key or by right-clicking it (or control-clicking on Mac OS X) and selecting "Delete". You can also selectively enable and disable scripts using the "Enabled?" checkbox in the rightmost column.

Working with external Java libraries

Existing Java code can be referred to from within scripts. Simply copy any jar files you'd like to reference from within scripts into the "lib\ext" folder found in screen-scraper's directory. Note that you'll still need to use the "import" statement within your scripts to refer to specific classes, like this:

import com.foo.bar.*;

Please note one very important issue here: your class files must be compiled to run under a 1.4.2 JRE. If you compile them to run under 1.5+, your scripts will fail and screen-scraper won't provide a clear message as to why. Please be certain that you're compiling them for 1.4.2.
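Because a class compiled for the wrong JRE fails without a clear message, it can help to check a class file's version before copying the jar into "lib\ext". The following is a minimal sketch of one way to do that, not part of screen-scraper itself: it reads the standard class-file header, where a major version of 48 corresponds to Java 1.4 and 49 or higher means Java 5 or later. The class name ClassVersionCheck and its methods are hypothetical, and the tool is meant to be compiled and run outside screen-scraper with a current JDK.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of screen-scraper): reports the class-file
// version of compiled classes so you can confirm they target Java 1.4.
public class ClassVersionCheck {
    // Class-file major version 48 corresponds to Java 1.4.
    public static final int JAVA_1_4_MAJOR = 48;

    /** Reads the class-file header and returns its major version number. */
    public static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();          // every class file starts with 0xCAFEBABE
        if (magic != 0xCAFEBABE) {
            throw new IOException("Not a Java class file");
        }
        data.readUnsignedShort();            // minor version, not needed here
        return data.readUnsignedShort();     // major version
    }

    /** Returns true if the class can be loaded by a 1.4 JRE. */
    public static boolean runsOn14(InputStream in) throws IOException {
        return majorVersion(in) <= JAVA_1_4_MAJOR;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path to a .class file extracted from your jar.
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                int major = majorVersion(in);
                String verdict = major <= JAVA_1_4_MAJOR
                        ? "OK for a 1.4.2 JRE"
                        : "too new for a 1.4.2 JRE";
                System.out.println(path + ": major version " + major + " (" + verdict + ")");
            }
        }
    }
}
```

To use it, extract a class file from the jar you intend to reference and pass its path as an argument, for example "java ClassVersionCheck Foo.class" (Foo.class being a placeholder name).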

Built-in objects

screen-scraper offers a few objects that you can work with in a script. Bear in mind that not all of these variables will be available in all scripts. See the Variable scope section (following this one) for more details. You can view details on all of the objects and their methods in our API Documentation.

Variable scope

Depending on when a script is run, different variables may be in scope. When associating a script with an object, such as a scraping session or scrapeable file, you're asked to specify when the script is to be run. The table that follows specifies which variables will be in scope depending on when a given script is run. Note that none of the variables will be in scope when a script is invoked directly, though such scripts commonly create RunnableScrapingSession objects.

When Script is Run              session   scrapeableFile   dataSet    dataRecord
                                in scope  in scope         in scope   in scope
Before scraping session begins  X
After scraping session ends     X
Before file is scraped          X         X
After file is scraped           X         X
Before pattern is applied       X         X
After pattern is applied        X         X                X
After each pattern application  X         X                X          X
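
The table above can also be read as a lookup from the point at which a script runs to the set of variables it may use. The sketch below encodes that lookup in plain Java purely as an illustration; the ScriptScope class and its enum names are hypothetical and are not part of the screen-scraper API.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the scope table; not a screen-scraper class.
public class ScriptScope {
    public enum Variable { SESSION, SCRAPEABLE_FILE, DATA_SET, DATA_RECORD }

    public enum Trigger {
        BEFORE_SCRAPING_SESSION_BEGINS,
        AFTER_SCRAPING_SESSION_ENDS,
        BEFORE_FILE_IS_SCRAPED,
        AFTER_FILE_IS_SCRAPED,
        BEFORE_PATTERN_IS_APPLIED,
        AFTER_PATTERN_IS_APPLIED,
        AFTER_EACH_PATTERN_APPLICATION
    }

    private static final Map<Trigger, Set<Variable>> IN_SCOPE =
            new EnumMap<>(Trigger.class);

    static {
        Set<Variable> sessionOnly = EnumSet.of(Variable.SESSION);
        Set<Variable> withFile = EnumSet.of(Variable.SESSION, Variable.SCRAPEABLE_FILE);
        Set<Variable> withDataSet = EnumSet.of(
                Variable.SESSION, Variable.SCRAPEABLE_FILE, Variable.DATA_SET);
        Set<Variable> all = EnumSet.allOf(Variable.class);

        // One entry per row of the table.
        IN_SCOPE.put(Trigger.BEFORE_SCRAPING_SESSION_BEGINS, sessionOnly);
        IN_SCOPE.put(Trigger.AFTER_SCRAPING_SESSION_ENDS, sessionOnly);
        IN_SCOPE.put(Trigger.BEFORE_FILE_IS_SCRAPED, withFile);
        IN_SCOPE.put(Trigger.AFTER_FILE_IS_SCRAPED, withFile);
        IN_SCOPE.put(Trigger.BEFORE_PATTERN_IS_APPLIED, withFile);
        IN_SCOPE.put(Trigger.AFTER_PATTERN_IS_APPLIED, withDataSet);
        IN_SCOPE.put(Trigger.AFTER_EACH_PATTERN_APPLICATION, all);
    }

    /** Returns the variables that are in scope for a script run at the given point. */
    public static Set<Variable> inScope(Trigger trigger) {
        return Collections.unmodifiableSet(IN_SCOPE.get(trigger));
    }

    public static void main(String[] args) {
        // Print the whole table, one trigger per line.
        for (Trigger trigger : Trigger.values()) {
            System.out.println(trigger + " -> " + inScope(trigger));
        }
    }
}
```

Reading the lookup this way makes the pattern in the table explicit: each later stage adds one variable, and dataRecord is available only to scripts run after each pattern application.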

Debugging scripts

One of the best ways to track down errors is simply to watch the scraping session log (under the "Log" tab) and the "error.log" file (located in the "log" directory where screen-scraper was installed) for script errors. When a problem arises in executing a script, screen-scraper will output a series of error-related statements to the logs. A good debugging approach is often to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.

When screen-scraper is running as a server it will automatically generate individual log files in the "log" directory for each running scraping session (this can be disabled in the settings window). An "error.log" file will also be generated in that same directory when internal screen-scraper errors occur.

The "Breakpoint" window can also be invaluable in debugging scripts. You can invoke it by inserting the line session.breakpoint() into your script. While the "Breakpoint" window is displayed, script execution is paused. There are two buttons along the top of the window. The "play" button simply continues execution of your script. Clicking the "stop" button causes screen-scraper to halt execution as soon as it can. The "Breakpoint" window also exposes any session variables, data sets, and data records that are in scope, and these values can be altered in the window as well.

