Anonymization

Overview

Under certain circumstances you may want to anonymize your scraping so that the target site is unable to trace back your IP address. For example, this might be desirable if you're scraping a competitor's site, or if the web site is blocking too many requests from a given IP address.

There are a few different ways to go about this using screen-scraper. In this section we'll go over each one in detail.

Automatic Anonymization

By far the simplest and most effective way to anonymize scraping in screen-scraper is to use the built-in automatic anonymization feature. Once you've done the initial setup, this can be as simple as checking a box. The anonymization service built into screen-scraper is a paid service, and you'll need to sign up for it before making use of it. To do so, please send us a support request. See below for the cost of the anonymization service.

The screen-scraper automatic anonymization service works by sending each HTTP request made in a scraping session through a separate high-speed HTTP proxy server. The end effect of this is that the site you're scraping will see any request you make as coming from one of several different IP addresses, rather than your actual IP address. These HTTP proxy servers are actually virtual machines that get spawned and terminated as you need them. You'll use screen-scraper to either manually or automatically spawn and terminate the proxy servers.
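
To illustrate the effect described above, here is a minimal stand-alone Python sketch in which consecutive requests are paired with different proxies. It is not screen-scraper code: the round-robin policy, the assign_proxies helper, and the proxy addresses are all assumptions made for illustration, and the service may choose proxies differently.

```python
from itertools import cycle

# Hypothetical proxy addresses (documentation-range IPs, not real servers).
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]


def assign_proxies(urls, proxies):
    """Pair each request URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


urls = [f"http://example.com/page/{n}" for n in range(1, 6)]
plan = assign_proxies(urls, PROXIES)
for url, proxy in plan:
    print(f"{url} -> via {proxy}")
```

From the target site's point of view, the five requests in this sketch appear to come from three different addresses rather than one.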

Once you've signed up for the anonymization service you'll be given a password that will be associated with your registered email address. Your password will be entered into the "Anonymous Proxy" section of the "Settings" window.

The anonymous proxy servers will be set up in such a way that they only allow connections from your IP address. This way no one else can use any of the proxies without your authorization. In the "Settings" window you'll want to enter a comma-delimited list of allowed IP addresses for the computers that will be utilizing the service. If you'll be running your anonymized scraping sessions on the same machine (or local network) you're currently on, you can click the "Get the IP address for this computer" button to determine your current IP address. We find that between five and 10 proxy servers are adequate for most situations.
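
Before pasting a list of allowed addresses into the "Settings" window, it can help to check that every entry is a well-formed IP address. The following is a small stand-alone Python sketch of such a check. The parse_allowed_ips helper and the sample addresses are hypothetical, and screen-scraper itself may validate the field differently.

```python
import ipaddress


def parse_allowed_ips(raw):
    """Split a comma-delimited string and validate each entry as an IP address."""
    addresses = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas or trailing separators
        # ip_address raises ValueError for anything that is not a valid IPv4/IPv6 address
        addresses.append(str(ipaddress.ip_address(entry)))
    return addresses


allowed = parse_allowed_ips("198.51.100.7, 198.51.100.8,203.0.113.25")
print(allowed)
```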

As the proxy servers get spawned and terminated, it's a good idea to establish the maximum number of running proxy servers you'd like to allow. This is done via the "Max running servers" setting. Just below this box you can click the "Refresh" button to see how many HTTP proxies are currently running. Because you pay for proxy servers by the hour, use the "Terminate all running proxy servers." button to shut them down if your scraping session isn't set up to terminate them automatically when it finishes.

Aside from these global settings, there are a few settings that apply to each scraping session you'd like to anonymize. You can edit these settings under the "Anonymization" tab of your scraping session. The settings should be self-explanatory.

Once you've configured all of the necessary settings, try running your scraping session to test it out. You'll see messages in the log that indicate what proxy servers are being used, how many have been spawned, etc.

As your anonymous scraping session runs, you'll notice that screen-scraper will automatically regulate the pool of proxy servers. For example, if screen-scraper gets a timed-out connection or a 403 (Forbidden) response, it will terminate the current proxy server and automatically spawn a new one in its place. This way you will likely always have a complete set of proxy servers, regardless of how frequently the target web site might be blocking your requests. You can also manually report a proxy server as blocked by calling session.currentProxyServerIsBad() in a script.
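
The replace-on-failure behavior can be pictured with a toy model. The Python sketch below is not screen-scraper's implementation: the RegulatedPool class, its report method, and the proxy names are hypothetical stand-ins that only show how a pool can keep a constant size while swapping out servers that time out or return a 403.

```python
import itertools


class RegulatedPool:
    """Toy model of a proxy pool that replaces blocked servers (not screen-scraper's code)."""

    def __init__(self, size):
        self._ids = itertools.count(1)
        self.spawned = 0
        self.servers = [self._spawn() for _ in range(size)]
        self.current = self.servers[0]

    def _spawn(self):
        # Stand-in for starting a new proxy virtual machine.
        self.spawned += 1
        return f"proxy-{next(self._ids)}"

    def report(self, status, timed_out=False):
        """Replace the current proxy after a timeout or a 403 response."""
        if timed_out or status == 403:
            index = self.servers.index(self.current)
            self.servers[index] = self._spawn()  # terminate old, spawn new in its place
            self.current = self.servers[index]
            return True
        return False


pool = RegulatedPool(size=3)
pool.report(200)                      # healthy response, nothing changes
pool.report(403)                      # blocked: proxy-1 is replaced by proxy-4
pool.report(None, timed_out=True)     # timeout: proxy-4 is replaced by proxy-5
print(pool.servers, pool.spawned)
```

In this model the pool always holds three servers, while the spawn counter rises with every replacement. Since each running proxy is billed by the hour, frequent blocking does not by itself raise the number of servers running at once.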

There is a $150 setup fee for the anonymization service. Beyond that, the charge for each running proxy server is 25 cents per proxy per hour. Once again, to enroll in the service please send us a support request.
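
As a rough worked example of these prices, the Python sketch below estimates the charge for a run. The fee and hourly rate come from the paragraph above; the estimate_cost helper and the workload of 10 proxies for 8 hours are hypothetical, and the sketch assumes whole hours because the rounding of partial hours is not stated here.

```python
SETUP_FEE = 150.00          # one-time setup fee, in dollars (from the text)
RATE_PER_PROXY_HOUR = 0.25  # dollars per running proxy per hour (from the text)


def estimate_cost(proxies, hours, include_setup=True):
    """Estimate the total charge for running `proxies` servers for `hours` hours."""
    running = proxies * hours * RATE_PER_PROXY_HOUR
    return running + (SETUP_FEE if include_setup else 0.0)


# Hypothetical workload: 10 proxies running for an 8-hour scrape.
first_run = estimate_cost(10, 8)
later_run = estimate_cost(10, 8, include_setup=False)
print(f"first run: ${first_run:.2f}, later runs: ${later_run:.2f}")
# prints "first run: $170.00, later runs: $20.00"
```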

Users of the automatic anonymization service must first agree by email to be bound by Amazon's Amazon Web Services Customer Agreement (please take special note of section "5.4. Amazon Elastic Compute Cloud (Amazon EC2)" of the agreement, which specifically outlines permitted activities). When using the automatic anonymization method, while the remote web site may not be able to determine your IP address, your activity will still be logged. If you attempt to use the proxy service for any illegal activities, the chances are very good that you will be prosecuted.

Anonymization Via Manual Proxy Pools

If the automatic anonymization method isn't right for you, the next best alternative might be to manually handle working with screen-scraper's built-in proxy pool. The basic approach involves running a script at the beginning of your scraping session that sets up the pool, then calling session.currentProxyServerIsBad() as you find that proxy servers are getting blocked. In order to use a proxy pool, you'll also need to get a list of anonymous proxy servers. Generally you can turn these up by Googling around a bit.

The best way to demonstrate the use of proxy pools is by an example. So here it is:

That's about all there is to it. Aside from occasionally calling session.currentProxyServerIsBad(), you may also want to call session.setUseProxyFromPool to turn anonymization on and off within the scraping session.
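
To make the moving parts concrete, here is a stand-alone Python sketch of the general flow: build a pool from a list of proxies, drop a proxy once it gets blocked, and switch pool use off. It models the idea only and is not a screen-scraper script. The ManualProxyPool class, its method names, the "host:port" input format, and the host names are all assumptions made for illustration.

```python
class ManualProxyPool:
    """Stand-alone model of a manually managed proxy pool (hypothetical, not screen-scraper's API)."""

    def __init__(self, lines):
        # Assumed input format: one "host:port" entry per line; blank lines are skipped.
        self.proxies = [line.strip() for line in lines if line.strip()]
        self.use_proxy_from_pool = True
        self._index = 0

    def current(self):
        """Return the proxy to use now, or None when anonymization is off or the pool is empty."""
        if not self.use_proxy_from_pool or not self.proxies:
            return None
        return self.proxies[self._index % len(self.proxies)]

    def current_proxy_is_bad(self):
        """Drop the current proxy so the next request uses a different one."""
        if self.proxies:
            self.proxies.pop(self._index % len(self.proxies))


pool = ManualProxyPool(["one.example.com:8888", "two.example.com:3128", ""])
first = pool.current()
pool.current_proxy_is_bad()          # the target site blocked the first proxy
second = pool.current()
pool.use_proxy_from_pool = False     # turn anonymization off for some requests
direct = pool.current()
print(first, second, direct)
```

In a real scraping session the equivalent steps would go through the session object, with session.currentProxyServerIsBad() and session.setUseProxyFromPool playing the roles of the last two operations in this model.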



© 2002-2008 copyright e-kiwi, LLC