screen-scraper FAQ

This page is meant to cover a lot of the common technical problems and questions that people encounter when using screen-scraper. If what we have on this page and in the documentation doesn't address an issue you've encountered please feel free to drop us a line or try our forum. You may also use our glossary of screen-scraping terms for help understanding the screen scraping paradigm.

General Non-Technical
General Technical
Troubleshooting
Tips & Suggestions

General Non-Technical

For the Professional and Enterprise editions of screen-scraper what upgrades am I entitled to?

If you license the professional or enterprise edition of screen-scraper today you are entitled to major and minor upgrades for that particular edition forever at no cost. It's possible that we'll change this policy down the road, but that's how it stands today.



How do I upgrade from version 3.x to version 4?

In order to upgrade from version 3.x to version 4, you'll need to install version 4 fresh using one of our installers, accessible from this page. This is necessary because we're now using an upgraded version of the Java Runtime Environment, which can't be upgraded using the normal "Check for updates..." method of upgrading within screen-scraper.

Those who licensed screen-scraper Professional Edition prior to the release of version 4.0 are entitled to a free upgrade. Please note that version 4.0 of the Professional Edition lacks some features that were available in version 3.0 (and subsequent alphas). If you licensed screen-scraper Professional Edition before version 4.0 was released, just send us a support request indicating such. As part of this upgrade you'll be entitled to a license for screen-scraper Enterprise Edition, but will not be entitled to the phone and email support that current licensees of the Enterprise Edition get.

Steps for upgrading to version 4.0 from any version 3.x or below:

1. Make a backup of your existing installation and export your scraping sessions and scripts by following the instructions here.

2. For Windows users, uninstall the old version using Add/Remove Programs in the Control Panel.

3. Download and install the latest version from this page.

4. Import your work into your new installation following instructions on the lower half of this page.



What are the differences between the three editions of screen-scraper?

See our comparison matrix for this.



I'd like to have an offline copy of your documentation. Could you provide me with one?

We have a full offline version of our site in a zip file (which may be somewhat dated) here. Simply download it, decompress it, and open the index.html file.



I've just purchased a screen-scraper license. How do I register it in my local copy of screen-scraper?

When you purchased your license you would have entered in an email address. This address is used to unlock your local copy of screen-scraper. To do that, simply select "Enter registration information" from the "Options" menu, then enter that email address.



How does screen-scraper licensing work?

There are three editions of screen-scraper: Basic, Professional, and Enterprise. The Basic Edition is completely free. It costs nothing and never will. The Professional and Enterprise Editions carry a licensing cost, which can currently be found here.

Our screen-scraper Professional and Enterprise Editions are licensed on a per machine basis. A single instance of screen-scraper can scrape as many sites as the underlying hardware allows (i.e., we don't charge per site scraped).

For the Enterprise Edition, we do allow a single license to be used for two machines under certain circumstances. If you have one machine you use for development, and another used for production/deployment, it is acceptable to use the same license for both machines.

For a copy of the screen-scraper Professional and Enterprise Editions license please see our online copy here.

We offer discounts to students and academic institutions. Please contact us directly if this is of interest.

We also offer volume discounts, as follows:

5-10 licenses: 10% discount
11-15 licenses: 20% discount
16+ licenses: 30% discount


General Technical

The web site I'd like to scrape uses cookies, can screen-scraper handle this?

Absolutely. screen-scraper handles cookies (and BASIC authentication tokens) transparently behind the scenes. When setting up screen-scraper to scrape information from your site you rarely need to take any thought for cookies. In certain cases, sites will set cookies in JavaScript. In such cases, you can set them within a screen-scraper script via the session.setCookie method.



Can screen-scraper work with sites that use HTTPS?

screen-scraper supports HTTPS on all supported platforms except certain early versions of Mac OS X. If you're using the screen-scraper proxy server to access a site that uses HTTPS, follow the directions in the "Viewing encrypted transactions" section of this documentation page: Using the Proxy Server. When setting up scrapeable files to access pages that use HTTPS, you don't need to treat them any differently than those that use HTTP.



I'd like to scrape information from a web site that requires me to log in first. Can screen-scraper handle this?

Yes. This is a common situation, and generally just requires that you create a scrapeable file to handle logging in. This scrapeable file should be run first in the scraping session, allowing the web site to set cookies, which screen-scraper will then track for you.

For example, if you wanted to scrape a list of all auctions you're watching from the eBay web site, you would create a scrapeable file that would first log you in (issue a POST request with your username and password), then you would create subsequent scrapeable files that would scrape the information you're interested in.
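
To make the log-in step more concrete, here is a rough sketch in Python (outside of screen-scraper) of the kind of POST request a log-in scrapeable file would issue. The URL and the field names ("username" and "password") are hypothetical; the real ones come from the site's log-in form, which you can capture with the proxy server. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, username, password):
    """Build (but do not send) a form-encoded POST like a log-in form submits."""
    # The field names are placeholders; use the names the site's form really uses.
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_login_request("https://www.example.com/login", "me", "s3cret!")
print(request.get_method())   # POST
print(request.data.decode())  # username=me&password=s3cret%21
```

In screen-scraper itself you would enter the same parameters on the log-in scrapeable file rather than writing code like this.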

There is also a special type of authentication known as "BASIC" or "WWW-Authenticate". You'll know a web site is using this when, upon attempting to access a particular URL, you are presented with a small dialog box requesting a username and password. When setting up screen-scraper to scrape a page using this type of authentication you simply need to enter in the username and password in the "Properties" tab under "BASIC Authentication Parameters" for the scrapeable file you set up to scrape the page. Note that you generally only need to enter the username and password once for a given site on a single scrapeable file, as screen-scraper will retain the username and password for you.
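
For the curious, this is roughly what BASIC authentication looks like on the wire. The sketch below is plain Python, not screen-scraper code, and only shows how the "Authorization" request header is derived from a username and password. screen-scraper does the equivalent for you once the parameters are entered.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP BASIC authentication."""
    # The credentials are joined with a colon and base64-encoded (not encrypted).
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("Aladdin", "open sesame"))
# Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because the credentials are only encoded, not encrypted, sites using BASIC authentication are usually served over HTTPS.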

We give an example of configuring screen-scraper to log in to a site in our third tutorial.



Does screen-scraper follow redirects?

screen-scraper will automatically follow certain redirects, so it just depends on what type the web site is making use of. There are three types of redirects that are typically used on the web:

1. 3xx HTTP responses. These are probably the most common, and are the ones screen-scraper will automatically follow. For example, instead of responding with a 200: OK HTTP response, the server will respond with 302: Moved Temporarily, then supply the URL the browser is to redirect to in a "Location" HTTP header. In these cases you shouldn't need to do anything at all; screen-scraper will simply follow them as a browser would.

2. META refresh tags. These are special HTML tags that are often embedded in a web page which contain the URL the browser is to redirect to. screen-scraper will not automatically follow these, so you'd need to create a separate scrapeable file to send screen-scraper to them. This might also involve extracting certain parameters from the URL before going to the redirected page.

3. JavaScript redirects. Occasionally sites will utilize client-side JavaScript to send the browser to a new location. As it pertains to screen-scraper, the technique for dealing with these is basically the same as that described in #2.
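
As an illustration of case #2, the following Python sketch pulls the target URL out of a META refresh tag and resolves it against the page's own URL. Within screen-scraper you would do the equivalent with an extractor pattern and a second scrapeable file; the regular expression and the example URLs here are our own assumptions and only cover the common attribute order (http-equiv before content).

```python
import re
from urllib.parse import urljoin

# Matches e.g. <meta http-equiv="refresh" content="0; url=/next.php?id=7">
META_REFRESH = re.compile(
    r"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*"""
    r"""content\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*([^"'>]+)""",
    re.IGNORECASE,
)


def find_meta_refresh(html, base_url):
    """Return the absolute URL a META refresh tag points to, or None."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # The target may be relative, so resolve it against the current page.
    return urljoin(base_url, match.group(1).strip())


page = '<html><head><META HTTP-EQUIV="Refresh" CONTENT="0; URL=/next.php?id=7"></head></html>'
print(find_meta_refresh(page, "http://www.example.com/start/index.php"))
# http://www.example.com/next.php?id=7
```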



How much memory and what type of CPU is recommended for screen-scraper?

Unfortunately, the short answer to this question is, "it depends." If you're doing only very simple things with screen-scraper (e.g., scraping a few files once in a while) it could run comfortably in 64MB of RAM with a 500MHz processor. On the other end of the spectrum, if you're running multiple lengthy scraping sessions in parallel the memory and CPU requirements could climb quite a bit. Allocating the right amount of memory to screen-scraper invariably involves some experimentation. For example, you might run your scraping sessions in as realistic a scenario as possible, then use tools such as the Windows Task Manager or top to monitor CPU and memory usage. Remember that you can adjust the amount of memory screen-scraper is allocated by opening the "Settings" dialog box (click on the wrench icon), then altering the value labeled "Maximum memory allocation in megabytes".

It might also be helpful to look over the question below on optimizing scraping sessions.



Can screen-scraper be scheduled to scrape sites on a periodic basis?

If you're using the Enterprise Edition of screen-scraper, this can be done via the web interface.

For the Basic and Professional editions, the best way to go about this is to use an external scheduler, such as the Windows Task Scheduler or the Unix cron daemon. You'll typically set up one of these schedulers to either invoke screen-scraper from the command line or to invoke a separate application, which in turn invokes screen-scraper while it's running as a server.



Can screen-scraper extract information from non-HTML objects such as Java applets, ActiveX controls or Adobe Flash movies?

The short answer to this one is, "Sometimes." Most widgets (applets, etc.) that communicate with their server via HTTP can be scraped by screen-scraper. Oftentimes, however, they'll use a proprietary protocol. Most of the time Adobe Flash movies use HTTP when they need to communicate with a server, but Java applets and ActiveX controls don't always. The easiest way to find out is to use screen-scraper's proxy server when interacting with a page containing one of these elements. Take a close look at the HTTP requests and responses passing between the web browser and the server. If you see text in there (often XML or URL-encoded lists of parameters) then the chances are good that screen-scraper can extract the information being passed between the client and server. Note, however, that there may be text that the widget is displaying that doesn't get passed between the client and server. Unfortunately, in such cases, screen-scraper is unable to extract that information. The only utilities we're aware of that may allow for scraping that type of information would be IBM's Rational Robot and OpenSpan.



Can screen-scraper extract information from PDF files?

Sort of, yes. See this blog posting.



My web site is hosted on a shared server (virtual hosting). Can I use screen-scraper with it?

In order to install screen-scraper on a machine, you'll likely need administrative or root access. Generally this is not the case with virtual hosting, so you likely will not be able to run screen-scraper on your server.

Oftentimes this won't preclude you from using it, however. A common scenario is to scrape data on a local machine, write the data to a CSV file, then upload it to a server to be imported. If you have a database running on the server, you may also still be able to run screen-scraper from a local machine, then insert the scraped data into your database using the technique we describe in our fifth tutorial.
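
The "write the data to a CSV file" step can be sketched as follows. This is plain Python rather than a screen-scraper script, and the record layout ("title" and "price") is a made-up example; the point is simply that a proper CSV writer takes care of quoting values that contain commas.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical scraped records.
scraped = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget, large", "price": "19.50"},
]
print(records_to_csv(scraped, ["title", "price"]))
```

The resulting text can be saved to a file and uploaded to the server by whatever means your host provides (FTP, for example).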



I'd like to scrape data from a mainframe/tn3270 application. Can screen-scraper handle this?

No. screen-scraper is designed only to scrape data from web sites. If you're looking for a solution that can extract data from older mainframe-type applications, we'd recommend looking at Jagacy.



What character sets does screen-scraper support?

screen-scraper supports any character sets supported by the 1.5 Java Virtual Machine. A complete list can be found here: http://java.sun.com/j2se/1.5/docs/guide/intl/encoding.doc.html.


Troubleshooting

I get an error like this when I run a script: "undefined variable or class name: dataSet". What's wrong?

Certain scripts are meant to be invoked only in the context of a running scraping session. That is, a script might be invoked to be run after data is extracted by an extractor pattern (by selecting "After pattern is applied" when associating the script with the extractor pattern). In other words, only certain objects (e.g. dataSet and session) are in scope depending on when the script is run. For more details on this see the "Variable scope" section of this documentation page: Using Scripts.



At times screen-scraper seems to stop working and my computer starts beeping. What's wrong?

This is most likely happening because screen-scraper is running out of memory. For example, if you're scraping large amounts of information from a web site with multiple concurrently running scraping sessions, screen-scraper may need to keep a lot of information in memory while it does so. Here are possible ways to remedy this problem:

  • Lower the number of concurrently running scraping sessions. This setting can be changed by selecting "Settings" from the "Options" menu and selecting the "General" tab.
  • Increase the memory allocation to screen-scraper. Again, this can be changed under the "General" tab of the "Settings" dialog box.

If those suggestions don't seem to help, don't hesitate to email us.



Why does screen-scraper crash when running some scripts written in VBScript or JScript?

Certain versions of the Microsoft Windows Script environment don't contain all of the objects you might want to refer to in scripts (such as the FileSystemObject), which might cause screen-scraper to crash. If this occurs we would recommend installing the latest version of the Microsoft Windows Script environment, which can be downloaded here.

This can also happen if you're running multiple scraping sessions in parallel that all use scripts written in VBScript or JScript. Unfortunately, this is an issue outside of our control. The Microsoft scripting engine has a limitation such that if multiple instances of external scripts are run within it simultaneously, unpredictable results can occur. If you need to run multiple scraping sessions in parallel we recommend that you script in Interpreted Java, JavaScript, or Python.



In trying to invoke screen-scraper using the COM driver I get the following error: ActiveX component can't create object. How do I fix this?

The most likely cause of this problem is that you don't have the latest Microsoft Virtual Machine installed on that computer. This is especially a problem with Windows Server 2003, as it does not ship with the Microsoft Virtual Machine. Please note that this is Microsoft's Virtual Machine, and not Sun's Java Virtual Machine. Microsoft's Virtual Machine can be downloaded from these locations:

http://java-virtual-machine.net/download.html
http://vm.jheroen.nl/
http://www.mvps.org/marksxp/WindowsXP/java.php



When I launch screen-scraper I get this error message: "Can't launch executable. The class path definition is missing from the lax file." How do I fix this?

We use a third-party package called InstallAnywhere to handle installing and launching the screen-scraper GUI, and it obviously isn't perfect. We've had reports that on occasion one of the files InstallAnywhere uses to launch screen-scraper can become corrupted. Fortunately, it's an easy fix. Simply download a fresh LAX file here, overwriting the corresponding file in your screen-scraper folder. Note that if you installed screen-scraper to a non-standard location (e.g., not "C:\Program Files\screen-scraper professional edition\") you may need to edit the file manually such that it points to your screen-scraper installation.



When using the screen-scraper proxy server my web browser hangs at the site I'm trying to access, or I don't see anything show up in my proxy session. How do I fix this?

Unfortunately, screen-scraper's proxy server isn't perfect, and, on occasion, you'll encounter sites that it has difficulty with. Frequently the issue can be resolved by using a different web browser, such as Firefox or Opera.

Depending on your operating system, instead of designating "localhost" in your web browser, you may need to enter "127.0.0.1" or the IP address of your computer.

If you normally connect to the Internet through a proxy server (outside of screen-scraper), you'll need to configure screen-scraper to use that proxy server. This can be done in the "Settings" window (click on the wrench icon), under the "External Proxy Server" section.

If changing your browser doesn't help, it's still possible that you can proxy the site enough that you can create scrapeable files from the requests. It simply needs to be done in a more piecemeal fashion. If you need to resort to this, try the following for each page you need to scrape:

  1. Without using the proxy server, in your web browser go to the page containing the link or form that points to the page you want to scrape.
  2. Start up screen-scraper's proxy server and configure your browser to use the proxy.
  3. Click the link or submit the form that links to the page you want to scrape.
  4. screen-scraper's proxy may hang, but if you click on the HTTP transaction under the "Progress" tab in screen-scraper you may see at least the request portion of the transaction. If this is the case then you can stop the proxy and create a scrapeable file from the HTTP transaction that was recorded.

Note also that you typically only need to proxy forms that use POST requests. Scrapeable files corresponding to normal links and forms that use the GET method can be created by simply copying the URL from your web browser.

3rd Party Options

Alternatively, if the screen-scraper proxy freezes entirely and does not record any of the transaction you can access the HTTP header information within your browser by utilizing one of the following.

For additional instructions please see our page on Using Scrapeable Files.



When I try to update screen-scraper to a "pre-release" version it's telling me there are no updates. How do I upgrade to pre-release or alpha versions?

By default screen-scraper will only allow updates to stable versions (e.g., 2.6 as opposed to 2.6.0.5a). In order to upgrade to unstable versions you need to open the "Settings" dialog box (click on the wrench icon in the button bar), then check the box labeled "Allow upgrading to unstable versions".

After closing the Settings window, go again under Options and choose "Check for updates."

If you're interested in upgrading or downgrading to a specific version (including alpha releases) please see the following instructions.



When I scrape sites that contain double-byte character sets (e.g., Mandarin), screen-scraper shows question marks and boxes instead of the characters. How do I fix this?

While we're still refining screen-scraper's ability to handle international character sets, based on our testing, it should handle most situations just fine. When scraping sites with international character sets, though, there are a few extra steps you'll need to take:

  • In the "Settings" dialog box (click on the wrench icon to open it), ensure that the primary character set you're working with is indicated under "Default character set". You can find a comprehensive list of character sets supported by screen-scraper here.
  • Also in the "Settings" dialog box, under "Default font", ensure that you have a font selected that will display the characters in the set you're using (e.g., "Arial Unicode MS").
  • For each of your scrapeable files, under the "Advanced" tab, un-check the box labeled "Tidy HTML after scraping?" If you tidy the HTML most international characters will get replaced with HTML entities.
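
To illustrate the last point, here is what international characters look like once they have been replaced with HTML entities (numeric character references, in this sketch). This is plain Python and is not screen-scraper's tidying code; the exact entities the tidy step emits may differ.

```python
import html

original = "中文"

# Replace every non-ASCII character with a numeric character reference.
as_entities = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_entities)  # &#20013;&#25991;

# Decoding the entities recovers the original characters.
print(html.unescape(as_entities) == original)  # True
```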

If you're having trouble with a particular site, please feel free to contact us so that we can look into it for you.



The site I'm trying to scrape is telling me that cookies need to be enabled. How do I fix this?

There are a few possible causes for this issue:

  1. The URL or POST parameters are incorrect. You'll want to ensure that the URL and POST parameters you're sending to the remote site are exactly what the site is expecting. For example, if you inadvertently leave off a session ID from the URL or a POST parameter, the remote server could report that cookies are disabled.
  2. screen-scraper isn't handling cookies as the remote server is expecting. If this is the case, you'll need to modify the cookie settings under the "Advanced" tab for your scraping session (the "Cookie policy" drop-down list and "Use HTTP strict mode" checkbox). The most common fix is to set the "Cookie policy" to "Compatibility" and to check the "Use HTTP strict mode" checkbox.
  3. The server is setting cookies via JavaScript. screen-scraper doesn't execute any of the JavaScript in an HTML page, so if the site is setting cookies via JavaScript, you'll need to set them manually via a script within screen-scraper. This is probably the trickiest one to debug, and may require examining the HTML and .js files for the site to determine where and how cookies are being set. Once you can determine how that's being done, you can use screen-scraper's session.setCookie method to set the cookie manually.
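
When hunting for cookies set via JavaScript (cause #3), a quick search of the page source and .js files for assignments to document.cookie is a reasonable first step. The Python sketch below does that for the simplest case only, where the cookie is assigned as a literal string; cookies built up by string concatenation would need a closer look by hand. The sample JavaScript is made up for illustration.

```python
import re

# Matches simple assignments such as: document.cookie = "name=value; path=/";
COOKIE_ASSIGNMENT = re.compile(r"""document\.cookie\s*=\s*(["'])(.*?)\1""")


def find_js_cookies(source):
    """Return (name, value) pairs from literal document.cookie assignments."""
    cookies = []
    for match in COOKIE_ASSIGNMENT.finditer(source):
        # The first "name=value" segment is the cookie; the rest are attributes.
        first_segment = match.group(2).split(";", 1)[0]
        name, _, value = first_segment.partition("=")
        cookies.append((name.strip(), value.strip()))
    return cookies


sample_js = 'function init() { document.cookie = "cookies_ok=1; path=/"; }'
print(find_js_cookies(sample_js))  # [('cookies_ok', '1')]
```

The names and values found this way are what you would then set from a screen-scraper script using session.setCookie.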



When trying to install under Linux it fails while installing the Java Runtime Environment. How do I fix this?

We use InstallAnywhere for our installer, and it seems to have trouble with more recent versions of the Linux kernel. You may experience errors indicating an "error while loading shared libraries". A user of screen-scraper reported that this will resolve the issue:

cp setup_ss_pro.bin setup_ss_pro.bin.bak    # keep a backup copy to work from
sed "s/export LD_ASSUME_KERNEL/#xport LD_ASSUME_KERNEL/" setup_ss_pro.bin.bak > /tmp/setup_ss_pro.bin

That is not a typo for export above (#xport). You must use the same number of characters or else the installer thinks that the file is corrupt.

Now just make sure that /tmp/setup_ss_pro.bin is executable and then run.

You may also need to perform the same trick with the "screen-scraper" binary file used to launch screen-scraper.
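
If sed isn't handy, the same same-length substitution can be sketched in Python. This is our own illustration of the user-reported trick, not something shipped with screen-scraper. Note that it replaces every occurrence in the file, whereas the sed command above replaces only the first occurrence on each line.

```python
OLD = b"export LD_ASSUME_KERNEL"
NEW = b"#xport LD_ASSUME_KERNEL"


def patch_installer(data):
    """Comment out the LD_ASSUME_KERNEL export without changing the file size."""
    # The replacement must be exactly as long as the original, otherwise the
    # installer's self-check treats the file as corrupt.
    assert len(OLD) == len(NEW)
    return data.replace(OLD, NEW)


# Stand-in for the contents of setup_ss_pro.bin (a shell header plus binary data).
sample = b"#!/bin/sh\nLD_ASSUME_KERNEL=2.2.5\nexport LD_ASSUME_KERNEL\n\x00\x01payload"
patched = patch_installer(sample)
print(len(patched) == len(sample))  # True
```

To patch a real file, read it in binary mode ("rb"), pass the bytes through patch_installer, and write the result out in binary mode ("wb").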

If none of that helps, you might try installing via our Linux tarball, which can be downloaded here.



screen-scraper is telling me that my database is corrupted. How do I remedy this?

On rare occasions the main screen-scraper database can become corrupted. This might happen if your computer crashes while screen-scraper is running, for example. Fortunately, as of version 2.8 screen-scraper will automatically back up your database periodically. Even if your database has become corrupted, it's likely you haven't lost much work.

In the directory where screen-scraper is installed (e.g., "C:\Program Files\screen-scraper professional edition\"), you'll find the following directory path: "resource\db\backup\". This "backup" folder should contain a series of folders with dates and times, each of which will contain a backup of your database. You'll use these to restore your database, by following the steps below:

  1. Ensure that screen-scraper is not running. This would include the workbench, server, and any command line instances you might have running.
  2. Kill any java.exe, javaw.exe, java, or javaw processes running on your machine that might correspond to screen-scraper. It's possible that the screen-scraper database process could still be alive, which would lock the database files. You can kill processes on Windows using the Windows Task Manager.
  3. Delete your existing database files. These are located in "resource\db\", and all begin with "ss".
  4. Copy the database files from the most recent backup folder into the "resource\db" folder. For example, you might copy all of the files beginning with "ss" from "resource\db\backup\September 8, 2006 09.23.31 AM\" into "resource\db\".
  5. Try launching screen-scraper. If everything is normal, you're done. If you get the same "Database Corrupted" message, go back to step 1.
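
Steps 3 and 4 above can be sketched in Python as follows. Two assumptions are baked in: that backup folder names always follow the pattern of the example above ("September 8, 2006 09.23.31 AM"), and that the database consists only of the files beginning with "ss". Make sure screen-scraper and its database process are stopped (steps 1 and 2) before running anything like this.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumed from the example folder name "September 8, 2006 09.23.31 AM".
BACKUP_NAME_FORMAT = "%B %d, %Y %I.%M.%S %p"


def newest_backup(backup_dir):
    """Return the backup folder with the latest timestamp in its name, or None."""
    dated = []
    for folder in Path(backup_dir).iterdir():
        if not folder.is_dir():
            continue
        try:
            stamp = datetime.strptime(folder.name, BACKUP_NAME_FORMAT)
        except ValueError:
            continue  # skip folders that don't follow the naming pattern
        dated.append((stamp, folder))
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


def restore_database(db_dir):
    """Replace the ss* files in db_dir with those from the newest backup."""
    db_dir = Path(db_dir)
    backup = newest_backup(db_dir / "backup")
    if backup is None:
        raise FileNotFoundError("no dated backup folders found")
    for old_file in db_dir.glob("ss*"):      # step 3: delete existing files
        if old_file.is_file():
            old_file.unlink()
    for source in backup.glob("ss*"):        # step 4: copy the backup in
        if source.is_file():
            shutil.copy2(source, db_dir / source.name)
    return backup
```

You would call restore_database with the path to the "resource\db" folder of your installation, then try launching screen-scraper as in step 5.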



screen-scraper is telling me that it can't bind to certain ports. How do I fix this?

There are two possible causes for this error.

Cause: Port Blocked. In order for screen-scraper to function properly it will need to open a series of local ports on your computer. There are occasions when these ports may be blocked by other software running on your machine, such as firewalls. If screen-scraper is telling you it can't bind to specific ports, you'll either need to free those particular ports up on your machine, or select different ports for screen-scraper to use. To free up the ports you may need to configure a firewall so that it allows for the ports to be bound. You may also need to quit another application that's using the same port (which could even be another instance of screen-scraper running on the same machine). If you'd like to configure screen-scraper to use different ports, see this FAQ.

Cause: Crash. You might also get this error message if the screen-scraper workbench or server crashed, but the database process remains alive. If after the port number in the message it shows "(for the database)", this may be the cause. To remedy this, you'll need to kill the database process manually, then start screen-scraper again. The process to kill will be called "java" on Linux and Mac OS X, and "java.exe" on Windows. If you're running Linux, you likely already know how to kill a process. To kill a process in Windows open the "Windows Task Manager" (hit Ctrl-Shift-Escape), click on the "Processes" tab, then kill any "java.exe" processes you know you don't need.
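
If you're not sure whether a port is actually free, a small test outside of screen-scraper can help narrow things down. The Python sketch below simply tries to bind a TCP socket to a port on the local machine. The two port numbers in the example are the documented defaults for the screen-scraper server and the SOAP server; substitute whichever ports appear in your error message.

```python
import socket


def can_bind(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions problem.
            return False
    return True


for port in (8777, 8779):
    print(port, "free" if can_bind(port) else "in use or blocked")
```

A port that reports as in use while screen-scraper is closed points to another application (or a leftover screen-scraper process) holding it.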



When I try to connect to the screen-scraper server from my application it refuses connections. How do I fix this?

First check to ensure that the screen-scraper server is running. Details on doing that can be found here.

This may also be occurring because the IP address of the machine that is connecting to screen-scraper isn't listed in screen-scraper's list of allowed hosts. You can correct this in one of two ways:

  • If the machine running screen-scraper can launch the workbench (e.g., it's running Windows or Linux with Xwindows), you can adjust the security settings by opening the "Settings" window (click the wrench icon), clicking on the "Servers" icon, then entering the IP address (or a portion of the IP address) of the machine you want to allow to connect to screen-scraper in the box labeled "Hosts to allow to connect".
  • If the computer running screen-scraper can't launch the workbench (e.g., it's running Linux without Xwindows installed), you can adjust the security settings by altering the "resource/conf/screen-scraper.properties" file. Add the IP address (or a portion of the IP address) of the machine you want to allow to connect to the "IPAddressesToAllow" property (it's comma-delimited).

After making either of the changes mentioned above, you'll need to restart screen-scraper.
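
For the properties-file route, the edit amounts to appending a value to a comma-delimited list. Here is a minimal Python sketch of that edit, assuming the file uses plain "key=value" lines; "SomeOtherSetting" is a made-up placeholder, and only the "IPAddressesToAllow" name comes from screen-scraper itself.

```python
def allow_host(properties_text, ip, key="IPAddressesToAllow"):
    """Return the properties text with ip added to the comma-delimited key."""
    lines = properties_text.splitlines()
    for index, line in enumerate(lines):
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            hosts = [host.strip() for host in value.split(",") if host.strip()]
            if ip not in hosts:
                hosts.append(ip)
            lines[index] = f"{key}={','.join(hosts)}"
            break
    else:
        # The property wasn't present at all, so add it.
        lines.append(f"{key}={ip}")
    return "\n".join(lines) + "\n"


before = "SomeOtherSetting=1\nIPAddressesToAllow=127.0.0.1\n"
print(allow_host(before, "192.168.1."))
```

As always when editing configuration by hand or by script, keep a copy of the original file first.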

If that still doesn't help, check to ensure that you're trying to connect to screen-scraper using the port on which screen-scraper is listening. The default for the screen-scraper server is 8777, and the default for the SOAP server is 8779. These port numbers can both be altered via the "Settings" dialog box in the workbench (click the wrench icon), under the "Servers" section.
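
A quick way to confirm that something is listening on the expected port is to attempt a plain TCP connection from the machine your application runs on. The following Python sketch does only that; it doesn't speak screen-scraper's protocol, so a successful connection tells you the port is reachable, not that the allowed-hosts setting is correct.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# 8777 is the default port for the screen-scraper server.
print(is_listening("127.0.0.1", 8777))
```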



When I try to install screen-scraper I get a message that reads, "Please select another location to extract the installer to." What am I doing wrong?

This is an issue related to the installer software we use (InstallAnywhere). To remedy the problem, try the following:

  1. Ensure that you have enough hard drive space to contain the decompressed files. You should have at least three times the size of the installer you're using.
  2. Temporarily disable any anti-virus or firewall software.
  3. Ensure that the user account under which you're installing has write access to any temporary folders on the computer. The simplest approach would be to use an account that has administrative access.
  4. Re-download the installer. In some cases the installer file can become corrupted, which can cause this error. Adobe has a good FAQ on dealing with this: http://www.adobe.com/cfusion/knowledgebase/index.cfm?id=f8582407.



When I export my scraping session I get a zero-byte file. What am I doing wrong?

This is most likely because the character set you're currently using is set to something screen-scraper's file exporter can't deal with. We're working on a fix for this, but in the meantime try changing the "Default character set" in the "Settings" dialog box to "UTF-8".



When in server mode one of screen-scraper's processes consumes a lot of virtual memory. How do I stop this from happening?

In your "resource\conf\wrapper.conf" file remove this line:

wrapper.java.additional.1=-Xss5M

That's a legacy parameter that is no longer needed by screen-scraper, but may still exist in your instance.



I receive an error when trying to run screen-scraper from the terminal in Linux. What is an alternate way to run screen-scraper?

As an alternative to executing the screen-scraper binary in Linux you may need to execute a shell script containing the following code. This shell script works only in launching the screen-scraper workbench. To work with screen-scraper in server mode use start_server.sh and stop_server.sh. Execute this shell script from the same location where screen-scraper was installed.

jre/bin/java -Xmx128m -jar screen-scraper.jar


Tips & Suggestions

How do I transfer my scraping sessions and other objects from one machine to another?

Oftentimes you'll do your work on a development machine, then need to transfer objects up to a production machine. This generally includes scraping sessions and scripts. To do so you have two options:

1. Export your scraping session(s) and script(s) from one machine, then import them into the other. Instructions on doing this can be found here.

2. The second (and easier) possibility would be to simply copy your database from one machine to the other. The database consists of everything inside of the "resource\db" directory of your screen-scraper installation.



I'd like to insert the data screen-scraper extracts into a database. How do I do that?

There are several ways this can be done:

  1. Write the data out to a delimited file, then have a separate program read the data from the file in and insert it into the database. For example, this could be done with the file generated in our third tutorial. You might write a PHP script or Visual Basic application that reads in the file and inserts it into a database. This technique is easy to implement, and also allows you to alter or clean up the data in your own code before inserting it into your database.
  2. Write the extracted data to a file as SQL statements. Most databases have some kind of import feature that allows you to specify a file containing SQL statements. The database reads in the file and executes each of the SQL statements. One of the primary advantages to this approach is that it's simple to implement. It also doesn't require writing any further code to get the data into the database.
  3. Insert the data directly into a database via screen-scraper scripts. This would be done either via JDBC (for scripts written in Interpreted Java, Python, or JavaScript) or ADO (for scripts written in VBScript or JScript). The advantage to this approach is that it's relatively simple to implement and debug, and doesn't require going through the intermediate step of writing the data to a file.
  4. Pass the extracted data to compiled code and have it insert it into the database. For example, you might create Java classes that can insert the data into a database. You would then jar up these classes, place them into screen-scraper's "lib\ext" folder, and screen-scraper would add them to its classpath. Once that's done you can then import your classes into screen-scraper scripts and make use of the objects by passing them the extracted data. This could also be done with COM DLL's registered on your system. Of all the approaches suggested here this is probably the fastest and most robust, but can be a bit trickier to debug.
  5. Invoke screen-scraper from an external application, retrieve the extracted data from screen-scraper, then have the external application handle the database interaction. For example, you might create a PHP script that invokes screen-scraper, tells it to extract product information (as in our third tutorial), requests the extracted product information from screen-scraper, then inserts it into a database (that is, all of the SQL statements and such would be in the PHP code). If you're using the Enterprise Edition, the best way to do this is to handle the data in real-time (i.e., as it is being scraped). Documentation on doing that can be found under the "Handling Scraped Data in Real Time" section on this page. In the Professional Edition the data will need to be stored up in a session variable as a data set, then requested at the end from the calling application. This technique can work great for smaller data sets, but it should be avoided if the data sets will be large. screen-scraper will need to store the data in memory (saved in session variables) as it's being extracted so that it can later pass it along to the external application. If a large amount of data is extracted and stored in memory it could cause screen-scraper to run out of memory.
  6. Create a scrapeable file that POST's extracted data to a local web-enabled script (e.g., one written in PHP or ASP.NET) that accepts the data and inserts it into a database. We provide an example of this in our Fifth Tutorial.
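The first approach above can be sketched as a small loader program. This is only a minimal sketch, written in Python rather than PHP or Visual Basic. It assumes a tab-delimited file with two columns; the "products" table, its column names, and the function name are hypothetical, and SQLite stands in for whatever database you actually use.

```python
import csv
import sqlite3


def load_delimited_file(path, connection, delimiter="\t"):
    """Insert each row of a delimited file into a hypothetical `products` table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT)"
    )
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [
            (row[0].strip(), row[1].strip())
            for row in csv.reader(handle, delimiter=delimiter)
            if len(row) >= 2  # skip blank or malformed lines
        ]
    # Parameterized statements keep quotes in scraped text from breaking the SQL.
    connection.executemany("INSERT INTO products VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)
```

Because the loader is separate from the scrape, this is also the natural place to clean up values (trimming, reformatting prices, and so on) before they reach the database.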

As a side note, it is by design that screen-scraper doesn't insert information automatically into a database for you. The approach we've taken to the design of screen-scraper is to ensure that it does one thing very well: extract information from web sites. Generally related to that process, however, are subsequent steps that involve manipulating and cleaning up the information, as well as storing it in some persistent mechanism (such as a database or text file). All of those things can be done by screen-scraper, but we've designed screen-scraper primarily to handle data extraction.

back to top


screen-scraper is stripping out white space when I don't want it to. How do I stop it from doing this?

When screen-scraper applies extractor patterns to HTML it first strips out "unnecessary" white space. This makes the extraction process significantly faster; however, on rare occasions the white space may not be quite so "unnecessary". Circumventing this requires a bit of a workaround that involves replacing white space characters (such as hard returns) with temporary markers, applying the extractor pattern, then replacing the temporary markers with the white space characters. This is best illustrated by an example scraping session, which you can download here.
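The marker technique itself is easy to show outside of screen-scraper. The following is a rough Python sketch of the idea, not the contents of the downloadable scraping session: the marker string, the function names, and the regular expression standing in for an extractor pattern are all assumptions made for illustration.

```python
import re

NEWLINE_MARKER = "[[NL]]"  # hypothetical marker; pick text that never occurs in the page


def protect_newlines(html):
    """Swap hard returns for a marker so whitespace stripping can't remove them."""
    return html.replace("\r\n", "\n").replace("\n", NEWLINE_MARKER)


def restore_newlines(text):
    """Put the hard returns back once the data has been extracted."""
    return text.replace(NEWLINE_MARKER, "\n")


def extract_preformatted(html):
    protected = protect_newlines(html)
    # Stand-in for the whitespace stripping done before a pattern is applied.
    collapsed = re.sub(r"\s+", " ", protected)
    # Stand-in for an extractor pattern with a single token.
    match = re.search(r"<pre>(.*?)</pre>", collapsed)
    return restore_newlines(match.group(1)) if match else None
```

Calling extract_preformatted on a page fragment whose <pre> block contains a hard return gives back the text with the hard return intact.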

Note that this should be considered a temporary solution. We'll address this issue more elegantly in an upcoming version of screen-scraper.

back to top


How can I optimize screen-scraper's performance?

Here are some tips:

  • Allocate more memory to screen-scraper. This can be done under the "Settings" dialog box (click the wrench icon) via the "Maximum memory allocation" setting.
  • Run long scrapes either from the command line or in server mode. The workbench is really just designed for creating scraping sessions and such; if you try to run long scrapes from it you could encounter memory problems.
  • Only save values in session variables when you have to. This is especially true for data sets extracted by extractor patterns. Each time you save a value in a session variable screen-scraper keeps it in memory for the life of the scraping session unless you explicitly null it out. For an extractor pattern, under the "Advanced" tab, when you click the "Automatically save the data set generated by this extractor pattern in a session variable" checkbox you're telling screen-scraper to retain that entire data set in memory. This is fine for relatively small data sets, but should be avoided for large ones. The performance hit for doing this can be mitigated by also checking the "Cache the data set" checkbox (also found under the "Advanced" tab), but when the value for the variable is requested screen-scraper will still need to read it into memory temporarily.
  • Write data out as it gets extracted. This is a corollary to the previous point. Rather than saving data sets in memory you should instead write scripts that will either write the data out to a file or insert it into a database as it gets extracted. A common way of doing this is to write compiled Java code that takes a DataRecord containing extracted data, and handles inserting it into a database. See "I'd like to insert the data screen-scraper extracts into a database. How do I do that?" for more on this.
  • Don't tidy HTML. This can make working with extractor patterns a bit trickier, but can save a fair amount on CPU usage. You can tell screen-scraper not to tidy HTML by unchecking the "Tidy HTML after scraping?" box found under the "Advanced" tab for a scrapeable file.
  • Reuse objects. This is a general principle of programming, and should be followed when using screen-scraper. For example, if you're connecting to a database within screen-scraper scripts, rather than disconnecting and reconnecting each time you need to issue a SQL statement, you should instead keep a connection object in a session variable so that it can be reused.
  • Use compiled code where possible. This will generally mean writing Java code, compiling it into a jar file, then placing it into screen-scraper's "lib\ext" folder. The jar will then be automatically added to screen-scraper's classpath such that you can refer to it in your scripts (e.g., you can include "import" statements in your scripts in order to use your classes).
  • Reduce the number of scraping sessions you run in parallel. screen-scraper has the ability to run multiple scraping sessions simultaneously. This is often necessary and desirable, but it can also have an impact on memory usage and the performance of each scraping session. You can set the number of scraping sessions you'd like to allow screen-scraper to run simultaneously by opening the "Settings" dialog box (click on the wrench icon), then adjusting the value labeled "Maximum number of concurrent running scraping sessions".
  • Avoid requesting files that are unnecessary. Oftentimes in order to get to the page containing the data you'd like to extract screen-scraper will need to first request a few other pages (e.g., one that handles logging in to the site). It's often worth it to experiment a bit by disabling certain files that you would normally request in your web browser (e.g., frames in a frameset) to see if they're actually required in order to be able to request the page containing the data you want.
  • Fix extractor patterns that are timing out. Extractor patterns that time out can leave threads running which, over time, can consume a fair amount of memory. To see if your extractor patterns are timing out look for a message like this in your log: "Warning! The operation timed out while applying the extractor pattern, so it is being skipped." To fix these extractor patterns you'll typically want to remove any ~@IGNORE@~ tags, replacing them instead with tokens that use specific regular expressions. You should also try to add regular expressions to other tokens so as to make the match more precise. You can also often avoid timeouts by using sub-extractor patterns instead of full extractor patterns. This allows the extraction to be done in a more piecemeal fashion, which is more efficient.
  • Disable logging. This can be done in the "Settings" window (click on the wrench icon) under the "Servers" section, by un-checking the box labeled "Generate log files". You should, of course, only do this, though, once you're satisfied that your scraping sessions are all working as you'd like them to.
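Two of the tips above, writing data out as it gets extracted and reusing objects, tend to go together. Here is a minimal sketch of the pattern in Python, assuming a hypothetical "records" table and using SQLite as a stand-in database. In screen-scraper itself the equivalent object would be created once, kept in a session variable, and called from a script each time a record is extracted.

```python
import sqlite3


class RecordWriter:
    """Holds one open connection and writes each record as soon as it arrives."""

    def __init__(self, database_path):
        # Opened once and reused for every record, rather than reconnecting each time.
        self.connection = sqlite3.connect(database_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS records (name TEXT, value TEXT)"
        )

    def write(self, record):
        # Nothing is accumulated in memory; the record goes straight to the database.
        self.connection.execute(
            "INSERT INTO records VALUES (?, ?)", (record["name"], record["value"])
        )
        self.connection.commit()

    def close(self):
        self.connection.close()
```

Memory use stays flat regardless of how many records the scrape produces, since each one is handed off and forgotten.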

back to top


I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I transfer my screen-scraper license to my server?

If you're using the Enterprise Edition of screen-scraper, this can be done via the web interface.

In the Professional or Enterprise Editions of screen-scraper, create a text file named register.txt in screen-scraper's folder that contains a single line with the email address under which you registered screen-scraper. Then either start up the screen-scraper server or invoke screen-scraper from the command line. screen-scraper will read in that file, validate the license, then write the result of the validation to a file called register_result.txt. Once the license has been validated, the register_result.txt file can be deleted.
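If you provision servers with scripts, creating the file can be automated. The following is a small Python sketch; the function names are hypothetical, the install directory is whatever path you installed to, and nothing is assumed about what register_result.txt contains beyond it being text.

```python
from pathlib import Path


def write_registration_file(install_dir, email):
    """Create register.txt containing a single line with the registered email."""
    path = Path(install_dir) / "register.txt"
    path.write_text(email.strip() + "\n", encoding="utf-8")
    return path


def read_registration_result(install_dir):
    """Return the text of register_result.txt, or None if it doesn't exist yet."""
    result = Path(install_dir) / "register_result.txt"
    if not result.exists():
        return None
    return result.read_text(encoding="utf-8").strip()
```

You would call write_registration_file before starting the server, then check read_registration_result after screen-scraper has had a chance to run.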

back to top


I'm running screen-scraper in a GUI-less environment (e.g., not running XWindows). How do I update it to the latest version?

When screen-scraper normally updates itself it downloads a zip file from our server, decompresses it, copies the files it contains on top of the existing files, then updates its version number. You'll instead need to do this manually. To do so follow these steps:

  1. Download the update file. The URL for the update you'll need can be generated for you using our updater form.
  2. Decompress it.
  3. Copy the contents on top of your existing screen-scraper files.
  4. Edit the "Version" property of your "resource/conf/screen-scraper.properties" file so that it reflects the new version.

The next time you launch screen-scraper you'll have the updated version.
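Steps 2 through 4 can be scripted once the update file has been downloaded. This is a hedged sketch in Python: it assumes the zip file's internal layout mirrors the installation directory, that the property appears on a line of the form "Version=...", and the function name is our own invention for the example. Run it only while screen-scraper is closed.

```python
import re
import zipfile
from pathlib import Path


def apply_update(update_zip, install_dir, new_version):
    """Unpack an update over an installation and rewrite the Version property."""
    install_dir = Path(install_dir)

    # Steps 2 and 3: decompress the archive on top of the existing files.
    with zipfile.ZipFile(update_zip) as archive:
        archive.extractall(install_dir)

    # Step 4: edit the "Version" property so it reflects the new version.
    properties = install_dir / "resource" / "conf" / "screen-scraper.properties"
    # latin-1 round-trips every byte, so other lines are left unchanged.
    text = properties.read_text(encoding="latin-1")
    updated, count = re.subn(
        r"(?m)^Version\s*=.*$", lambda match: f"Version={new_version}", text
    )
    if count == 0:
        raise ValueError("No Version property found in " + str(properties))
    properties.write_text(updated, encoding="latin-1")
```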

We're in the process of creating a browser-based interface for screen-scraper that will allow you to update screen-scraper without having to go through this manual process.

back to top


I'm running screen-scraper on a machine that isn't connected to the Internet. How do I transfer my screen-scraper license to it?

Follow these steps:

  1. On a machine that does have Internet access, install and register screen-scraper.
  2. Install screen-scraper on the machine that isn't connected to the Internet, if necessary.
  3. With both instances of screen-scraper closed (i.e., not running the workbench, server, or in command line mode), copy all files beginning with "ss" from the "resource\db" folder of the licensed instance of screen-scraper on top of the corresponding files of the unlicensed instance of screen-scraper, overwriting them.

Note that in doing this you'll be copying the entire screen-scraper database from one machine to another, so along with the licensing information it will also copy any scraping sessions, proxy sessions, and scripts. This will also mean overwriting any of those objects found on the unlicensed instance. Before copying the database over, care should be taken to export any objects from the unlicensed instance that you'd like to retain.
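The copy in step 3 can be sketched as follows, assuming both installation directories are reachable from the machine running the script (for example, via a USB drive or a mounted share). The function name is hypothetical. As noted above, this overwrites the target's database files, and both instances must be closed first.

```python
import shutil
from pathlib import Path


def copy_license_database(licensed_dir, unlicensed_dir):
    """Copy every file beginning with "ss" from one resource/db folder to another."""
    source = Path(licensed_dir) / "resource" / "db"
    target = Path(unlicensed_dir) / "resource" / "db"
    copied = []
    for path in sorted(source.glob("ss*")):
        if path.is_file():
            # Overwrites the corresponding file in the unlicensed instance.
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

The returned list of file names makes it easy to confirm that the files you expected were actually copied.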

back to top


Can I run multiple instances of screen-scraper on the same machine?

Yes. To do so, though, you'll need to have separate copies of screen-scraper installed (copying an already installed instance works, too). You'll also need to add and modify a few settings to screen-scraper's "resource\conf\screen-scraper.properties" file so that the port bindings for the various instances don't conflict with each other. You'll also want to be sure that screen-scraper is completely closed before copying any files or editing any properties files. Here are the properties you'll need to add and change in the screen-scraper.properties file, along with sample values:

#Change to match the new install directory
InstallDirectory=C\:\\Program Files\\screen-scraper professional edition 2\\

#Change default values
ServerPort=8780
ProxyPort=8779

#Add these ports
DatabasePort=9852
SOAPPort=8458
WebServerShutdownPort=8555

The ServerPort and ProxyPort settings should already be in your properties file. The rest will need to be added. Just be sure that you select different numbers for each of the ports across the various instances you have installed. Also note that you'll want to alter the properties file only when screen-scraper is not running.
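With more than two or three instances it's easy to reuse a port number by accident. The following is a small Python sketch of a checker, offered as an illustration rather than a tool we ship. It looks only at the five port properties listed above, expects the text of each instance's screen-scraper.properties file, and the function names are hypothetical.

```python
PORT_KEYS = (
    "ServerPort",
    "ProxyPort",
    "DatabasePort",
    "SOAPPort",
    "WebServerShutdownPort",
)


def read_ports(properties_text):
    """Return the port properties found in the text of a properties file."""
    ports = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and anything that isn't key=value
        key, value = line.split("=", 1)
        if key.strip() in PORT_KEYS:
            ports[key.strip()] = int(value.strip())
    return ports


def find_port_conflicts(instances):
    """`instances` maps an instance name to the text of its properties file."""
    seen = {}
    conflicts = []
    for name, text in instances.items():
        for key, port in read_ports(text).items():
            if port in seen:
                conflicts.append((port, seen[port], (name, key)))
            else:
                seen[port] = (name, key)
    return conflicts
```

An empty list means no two of the checked properties share a port number across the instances you passed in.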

If you're on a Unix-based system, such as Linux or Mac OS X, you'll want to modify the following three files: start_server.sh, stop_server.sh, resource/conf/wrapper.conf. Each of these files contains the path to where screen-scraper is installed. Edit them so that they reflect the correct path.

Regarding licensing issues for this type of setup, you're free to use the same license on each of the instances running on the same machine. However, you would need separate licenses for instances running across multiple machines (one per machine).

back to top


How do I make a backup of the work I've done in screen-scraper?

As with any work you do on your computer, it's good to back it up once in a while. The preferred method for doing this in screen-scraper is to export your scraping sessions and scripts as XML files (note that you only need to back up the scripts that aren't referenced in scraping sessions--any scripts called from within scraping sessions will be automatically exported along with the scraping session). Once the files have been exported you might also consider storing them in a versioning system such as CVS or Subversion.

screen-scraper will automatically back up your database periodically to ensure that you don't lose any work. You can also manually invoke this backup process by selecting "Backup Database" from the "File" menu. The database backups are stored in the "resource\db\backup" folder. The directories within that folder contain previous versions of your database. If your database has somehow become corrupted, you may be able to simply revert back to a previous version. Help on that can be found here.

back to top


Will screen-scraper notify me if the site I'm scraping changes?

Once you've set up screen-scraper to extract data from a web site there's a good chance the web site will change at some point. Oftentimes cosmetic changes such as the addition of a font tag or changing text from bold to italic won't affect anything, but if the site makes more dramatic changes, such as altering their navigation system, then your scraping session will break. This generally causes screen-scraper to either fail to extract records from the site entirely, or scrape significantly fewer records than it had previously. It also usually means that you'll need to update your scraping session to account for the changes in the web site.

There are two approaches we generally take to addressing this issue. The first (and best) approach is to track the number of records screen-scraper extracts each time the scraping session is run. Let's suppose you're extracting records from a site that, on average, will yield about 100 records. If you run the scrape one day and it suddenly only extracts 10 records then something has likely changed with the site, so you'll probably need to adjust your scraping session to account for it. The second approach is to have a special extractor pattern or two that checks for a specific piece of text that you know should be present every time you scrape. This approach is most useful in cases where a site doesn't yield a consistent number of records. If your special extractor pattern doesn't match the text it's looking for then something has likely changed on the site.

Along with all of this you'll likely want some kind of notification system so that you can be made aware when the site changes. To do this you might consider something like screen-scraper's sendMail function. Even better would be to set up an external application that monitors the number of records scraped each time, then logs an error in a database or log file if something comes up.
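The record-count check in the first approach could live in such an external application. Here is a minimal Python sketch; the function name and the 50% threshold are arbitrary assumptions that you would tune to how much your site's record counts normally vary.

```python
def looks_broken(history, latest_count, minimum_ratio=0.5):
    """Flag a run whose record count falls well below the historical average.

    `history` is a list of record counts from earlier runs of the same
    scraping session.
    """
    if not history:
        return False  # nothing to compare against yet
    average = sum(history) / len(history)
    return latest_count < average * minimum_ratio
```

With a history averaging about 100 records, a run that yields 10 would be flagged, while a run that yields 98 would not. What happens when a run is flagged (an email, a row in an error table, a line in a log file) is up to the calling application.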

back to top


How do I extract data from two tables that are basically identical in structure?

This isn't a scenario you'll run into too often, but it's common enough that we decided to include it in the FAQ. At times you may run into a page containing various tables of data. All of the tables are essentially identical in structure, but when you extract the data you want to be able to tell which rows of data came from which tables. For example, consider this page. If you view the HTML from the page you'll notice that the structure of the two tables is basically the same. If you use a normal extractor pattern that matches a row of data, though, you're going to get all four rows of data, and won't be able to tell which row came from which table. That is, your first inclination might be to use an extractor pattern like this:

<tr> <td class="datacell">~@CELL_DATA1@~ <td class="datacell">~@CELL_DATA2@~ <td class="datacell">~@CELL_DATA3@~ <td class="datacell">~@CELL_DATA4@~ </tr>

It matches the data just fine, but you don't know which table each row came from.

In situations like these there are two possible approaches. The first is to use regular expressions that match the data in such a way that you are able to differentiate between the table rows. For example, download this scraping session and import it into screen-scraper. If you run it, you'll notice that it extracts the data from each table separately. It does this by using regular expressions that differentiate the data in the first table (whose cells all end with the letter "a") from the data in the second table (whose cells all end with the letter "x"). You can see this by opening the "Table 1 row" or "Table 2 row" extractor patterns, and editing the properties on any of the tokens (e.g., ~@CELL_DATA1@~). If you look under the "Regular Expression" tab, you'll see the expression that makes the match.
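The idea behind that first approach can be seen in isolation with ordinary regular expressions. This Python sketch is an illustration only: the HTML is an invented two-column sample, not the example page, and the patterns are not the ones in the downloadable scraping session.

```python
import re

# Invented sample: cells in the first table end in "a", cells in the second end in "x".
HTML = """
<table>
<tr><td class="datacell">alpha</td><td class="datacell">beta</td></tr>
</table>
<table>
<tr><td class="datacell">box</td><td class="datacell">fax</td></tr>
</table>
"""


def row_pattern(last_letter):
    """Build a row pattern whose cells must all end with the given letter."""
    cell = rf'<td class="datacell">([^<]*{last_letter})</td>'
    return re.compile(rf"<tr>\s*{cell}\s*{cell}\s*</tr>")


TABLE_1_ROW = row_pattern("a")
TABLE_2_ROW = row_pattern("x")

table_1_rows = TABLE_1_ROW.findall(HTML)  # rows from the first table only
table_2_rows = TABLE_2_ROW.findall(HTML)  # rows from the second table only
```

Each pattern plays the role of one extractor pattern, and the expression inside each capturing group plays the role of the regular expression set on a token.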

Unfortunately, it's not always the case that regular expressions will allow you to distinguish between table rows. The alternative is to handle the data extraction in scripts. Note that this approach requires the professional edition of screen-scraper, and makes use of the scrapeableFile.extractData method. Download this scraping session and import it into screen-scraper. Again, if you run it, you'll notice that it extracts the data from the two tables separately. The scripts here provide the key to extracting the data. Take a look at the "Similar tables--extract table 1 data" script. It gets invoked after the "Table 1 data" extractor pattern matches.

If you've encountered a similar situation to the one presented here it's possible you can use these examples to tackle the task. Take a careful look through the extractor patterns and scripts to see how they're set up. If you have questions on them or run into any trouble, don't hesitate to post to our support forum.

back to top


I'm trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?

This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:

Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.

Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:

  1. Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
  2. Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
  3. Have a person type into the text box the characters displayed in the image.
  4. Accept the text entered by the user, then drop it into a screen-scraper session variable.
  5. Use the value in the session variable to populate the HTML form element.

This obviously isn't ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can't be read by a machine. As such, human intervention is required.

back to top


How do I send dynamic POST parameters in screen-scraper?

If you've gone through our first few tutorials, you know that session variables can be embedded in URL's by using a token like this: ~#FOO#~ (see this page for a detailed example of this). Well, the very same technique can be used with POST variables. When you create a scrapeable file that uses POST parameters, they'll be displayed under the "Parameters" tab for that scrapeable file. In any of those POST parameters you can use the same type of token mentioned before. For example, if you're logging in to a web site (as described here), instead of hard-coding the username and password, you might instead substitute the tokens ~#USERNAME#~ and ~#PASSWORD#~ in the "Value" column, for the respective parameters. Prior to invoking that scrapeable file, you could then set two session variables corresponding to the username and password, which values would then be substituted for the ~#USERNAME#~ and ~#PASSWORD#~ tokens.
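To make the substitution concrete, here is a Python sketch that mimics what happens to the "Value" column. It models the behavior described above and is not screen-scraper's own code: the function name is hypothetical, and leaving unknown tokens untouched is an assumption made for the example.

```python
import re

# Matches tokens such as ~#USERNAME#~ and captures the variable name.
TOKEN = re.compile(r"~#([A-Za-z0-9_]+)#~")


def fill_tokens(template, session_variables):
    """Replace each ~#NAME#~ token with the matching session variable's value."""

    def replace(match):
        name = match.group(1)
        # Assumption: unknown tokens are left as-is so a missing variable is easy to spot.
        return str(session_variables.get(name, match.group(0)))

    return TOKEN.sub(replace, template)


# POST parameters as they might appear under the "Parameters" tab.
parameters = {"user": "~#USERNAME#~", "pass": "~#PASSWORD#~"}
session_variables = {"USERNAME": "jsmith", "PASSWORD": "secret"}
filled = {
    key: fill_tokens(value, session_variables) for key, value in parameters.items()
}
```

After this runs, filled holds the values that would be sent for the "user" and "pass" parameters.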

back to top


Any recommendations on how to handle projects that involve large numbers of scraping sessions?

In cases where you're dealing with large numbers of scraping sessions, it becomes too cumbersome to retain them all in the workbench. Even if you organize them neatly into folders, there will likely still be too many to viably work with. Rather than keep all scraping sessions in the workbench at once, we generally find it useful to export and save them all to a central directory, which, ideally, is under version control using something like Subversion or CVS. When you need to work with a particular scraping session, you simply import it from the repository. Every once in a while, you export the scraping session back to the central directory. Ideally the directory also gets backed up once in a while so that you don't lose any work.

When working with a project where there are a large number of scraping sessions, you'll also often have a series of "general" scripts that get used by most, if not all, of your scraping sessions. For example, you might have one script that gets invoked by every scraping session, which is in charge of opening a database connection or initializing a file to which extracted data will be written. We typically handle these "general" scripts by storing them in a separate folder, alongside where all of the scraping sessions are stored. This directory should get versioned and backed up as well. The difference with the "general" scripts is that it's typically a good idea to keep them all in the workbench in their own folder. Usually there aren't very many of them, and they get used often enough that you'll want to just retain them in the screen-scraper workbench.

back to top


Why does screen-scraper run slowly on Microsoft® Windows Server 2003?

Though we have not done extensive testing on Microsoft® Windows Server 2003 we have had reports of unusually slow performance. We attribute this to the implementation of Sun's Java™ code with extra security restrictions. One possible solution would be to install screen-scraper to run under Windows 2000 compatibility mode. Instructions on how to set the compatibility mode during installation can be found here: http://support.microsoft.com/kb/324265.

back to top


Is HTML Tidy permanently turned on for basic edition 4.x?

Unfortunately, yes. This is a bug that slipped past our testing prior to the release of version 4.0. Because we do not offer alpha release ("unstable") upgrades in basic edition, we are unable to resolve this issue until the next public release, version 5.0. We do not have set schedules for our public releases and cannot say when the next release will be.

back to top


I'm unable to start screen-scraper in server mode. How can I troubleshoot this?

If you're having trouble starting screen-scraper in server mode or running scraping sessions in server mode, run the following command in a batch file or shell script as an alternate way to start the server.

jre/bin/java -Xmx128M -jar screen-scraper.jar --start-server --interactive

Running the server in this way does two things to help you troubleshoot your scraping sessions and the program itself.

  1. Bypasses the wrapper.exe program, running the screen-scraper.jar file directly.
  2. Writes out to the console/command window any messages sent to standard out.

You also have two commands you can use within the console/command window:

  1. status - Indicates the number of sessions running as "clients currently connected".
  2. quit - Stops the server.

back to top

© 2002-2008 copyright e-kiwi, LLC