I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.
As a simple example, suppose you wanted to scrape all of the U.S. Home Depot locations from their website. You’d probably do this using their store finder feature, which allows you to search locations by zip code. One approach would be to simply query the site using every zip code in the U.S. one-by-one, then extract the results. This would work fine, except that there are over 40,000 zip codes, so the scrape could take a while. As an alternate approach, you might divide up the zip codes into five groups of about 8,000 zip codes each, then have separate scrapers running against each of the five lists. Each of the scrapers could run independently from one another, and, if set up correctly, wouldn’t overlap in the locations they’re covering.
Depending on how large and urgent the job is, there are other considerations related to scaling up. As always, we want the data, but we also need to minimize the impact on the target websites. It does us no good to scale up massively, only to debilitate the sites we’re working with.
Dividing up the work
One of the first considerations is, how can the task be divided up between the bots? In the example above we can easily split up zip codes. In other situations we might be able to divide letters of the alphabet (if we’re searching for a name, for example), cities, states, or perhaps categories (e.g., on an ecommerce site where we’re scraping products). We just need some way of breaking the job up into discrete pieces so that the scrapers overlap as little as possible.
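As a minimal sketch of the simplest kind of split, the Python below deals a list of work items into a fixed number of groups so that no item lands in two groups. The function name and the placeholder zip codes are made up for illustration; a real run would load the full list of zip codes from a file or database.

```python
def split_into_groups(items, group_count):
    """Deal items round-robin into group_count lists; each item lands in exactly one."""
    return [items[i::group_count] for i in range(group_count)]


# Placeholder zip codes for illustration only, not a real data set.
zip_codes = [f"{n:05d}" for n in range(10001, 10021)]

# One group per scraper; each scraper works through its own list.
groups = split_into_groups(zip_codes, 5)
```

Each group could then be written to its own file, or handed directly to a separate scraper.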
In our company we use what we call the “iterator”, which is essentially a web service attached to a database. A given web crawler can query the iterator for something like a zip code, and the iterator will supply the next one in the queue. The iterator is thread-safe, so it guarantees that it doles out each zip code once and only once. If you don’t want to get that fancy, you could easily use something like separate files, and let each bot work off of a different file.
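The sketch below shows the core idea of handing out each item once and only once, using an in-memory Python class guarded by a lock. It is only an illustration: the class and method names are hypothetical, and a real iterator like the one described above would sit behind a web service and a database rather than live inside a single process.

```python
import threading


class WorkIterator:
    """In-memory stand-in for an iterator service (hypothetical, for illustration)."""

    def __init__(self, items):
        self._items = iter(items)
        self._lock = threading.Lock()

    def next_item(self):
        """Return the next unclaimed item, or None once everything has been handed out."""
        # The lock ensures two threads can never receive the same item.
        with self._lock:
            return next(self._items, None)


def drain(work_iterator, claimed):
    """Example worker: keep claiming items until the iterator runs dry."""
    while True:
        item = work_iterator.next_item()
        if item is None:
            break
        claimed.append(item)
```

Several threads can call `drain` against the same `WorkIterator`, and together they will claim every item exactly once.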
One of the most important factors in scaling efficiently is to minimize the number of requests made to a given server. In a typical scenario you’ll need to query a website using a series of parameters, as in the zip code example given above. While it’s possible to run through every zip code, most sites allow you to specify a radius for your search. By taking advantage of this, the number of zip codes that need to be queried can be reduced significantly. For example, by using a radius of 10 miles, the number of zip codes goes from over 40,000 to under 10,000. As much as possible, you should find ways to reduce the number of queries you’ll need to send to the target website.
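One way to thin out a list of zip codes for a radius search is a simple greedy pass: keep a zip code only if no zip code already kept lies within the search radius. The Python below is a rough sketch of that idea, not necessarily how the reduction quoted above was done. It assumes you have a latitude and longitude for the center of each zip code (the `zip_centers` mapping is hypothetical), and it compares every candidate against every kept zip code, which is slow but easy to follow.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def pick_query_zips(zip_centers, radius_miles):
    """Greedily keep zip codes so that every zip center is within radius of a kept one."""
    picked = []
    for zip_code, center in zip_centers.items():
        covered = any(
            miles_between(center, zip_centers[p]) <= radius_miles for p in picked
        )
        if not covered:
            picked.append(zip_code)
    return picked
```

Because this only checks the centers of zip codes, a location near the edge of a skipped zip code could fall just outside the search circle. Passing a value somewhat smaller than the radius you actually search with leaves a safety margin.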
When searching via letters of the alphabet it’s also common for sites to cap the number of results they’ll give at once. For example, searching for individuals whose last name begins with “W” may give you 1,000 results. That doesn’t mean that there are only 1,000 actual results, but the site won’t provide any more than 1,000 at a time. When this happens you’ll need to sub-divide by appending letters until you get a number of results that doesn’t exceed the maximum. You’ll want to set up your algorithms so that they intelligently tack letters on to your queries only as needed.
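The sub-dividing described above can be sketched as a short recursive function in Python. Everything here is an assumption made for illustration: `count_results` stands for whatever code you write to run a search and report how many results the site returned, and `max_results` is the site’s cap. The sketch treats a count equal to the cap as “probably truncated”, and it stops appending letters at `max_depth` so it can’t recurse forever.

```python
import string


def collect_prefixes(count_results, prefix, max_results, max_depth=4):
    """Return the search prefixes to scrape, extending a prefix only where the cap is hit."""
    count = count_results(prefix)
    if count == 0:
        # Nothing matches this prefix, so there is nothing to scrape or sub-divide.
        return []
    if count < max_results or len(prefix) >= max_depth:
        # Under the cap (or as deep as we are willing to go): query this prefix as-is.
        return [prefix]
    # At the cap: results were probably truncated, so append each letter and retry.
    prefixes = []
    for letter in string.ascii_lowercase:
        prefixes.extend(
            collect_prefixes(count_results, prefix + letter, max_results, max_depth)
        )
    return prefixes
```

Calling `collect_prefixes(count_results, "w", 1000)` would return “w” alone if the site reports fewer than 1,000 matches, and otherwise a list of longer prefixes such as “wa”, “we”, and so on, each deep enough to come in under the cap.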
If possible you should also monitor the target server to ensure that you’re not impacting its performance. We have monitoring software that will keep an eye on response times from the server, and scale back threads if it looks like we’re having a noticeable impact. For example, if we’re running 10 bots against a given website, and we detect that response times are starting to increase too much, our software will automatically reduce the number of bots until it finds a number that doesn’t seem to be impacting the server.
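A minimal sketch of that kind of feedback loop in Python might look like the following. It is not our monitoring software, just an illustration: the function name, the 1.5x slowdown threshold, and the choice to add a bot back when response times look healthy are all assumptions.

```python
from statistics import mean


def adjust_bot_count(current_bots, recent_times, baseline_time,
                     slowdown_factor=1.5, min_bots=1, max_bots=10):
    """Suggest a bot count from recent response times (in seconds).

    baseline_time is the response time measured before scraping started.
    """
    if not recent_times:
        # No measurements yet, so leave things alone.
        return current_bots
    if mean(recent_times) > baseline_time * slowdown_factor:
        # The server looks slower than normal: remove a bot.
        return max(min_bots, current_bots - 1)
    # Response times look normal: cautiously add a bot back, up to the limit.
    return min(max_bots, current_bots + 1)
```

A controller would call this every so often with the latest batch of response times and start or stop bots to match the number it returns.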
Depending on how sophisticated you’re able to get with this, you may just want to be conservative and err on the side of using fewer bots. Let’s not give screen-scraping more of a bad name than it already has for some people.
Related to the previous section, depending on how the target website is designed, your web crawlers may get blocked if you’re using too many at once, or if they’re running too quickly.
One obvious solution to getting blocked is to simply scale back the number of scrapers you’re running. You might also insert pauses between requests so as to not hit the website too hard.
If you absolutely need to acquire the data quickly, and are still getting blocked, you’ll need to use other measures to avoid detection. There are lots of articles out there on this topic, so I’ll just cover a few of the main techniques.
The first is to rotate your IP address via proxies. Requests will look to the target website as if they’re coming from different sources. We’ve evaluated quite a few services that do this, and the best we’ve found is Luminati. The basic idea is that you configure your scraper to route all requests through the proxy service, and the proxy service routes your requests through different proxies on their side. Depending on the service you select, you may have millions of distinct IP addresses available to you.
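As a rough sketch of the configuration step, the Python below builds an opener from the standard library’s `urllib.request` that sends every request through a single proxy endpoint. The proxy address and credentials are placeholders, not a real service; your provider’s documentation will give the actual host, port, and login format.

```python
import urllib.request

# Placeholder endpoint for illustration; substitute the details from your proxy provider.
PROXY_URL = "http://username:password@proxy.example.com:8000"


def build_proxy_opener(proxy_url=PROXY_URL):
    """Return an opener that routes both HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Pages are then fetched with `build_proxy_opener().open(url)` instead of `urllib.request.urlopen(url)`, and the rotation between IP addresses happens on the proxy service’s side.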
Another technique that can help is to run your bots during off-peak hours. If the websites you’re targeting are likely to get most of their traffic during daytime hours, you might run your bots in the middle of the night. You could either schedule them to run at specific times, or put them in some kind of “sleep” state during certain hours of the day.
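The “sleep” approach can be as simple as checking the clock before each batch of requests. Here is a small Python sketch; the 10 p.m. to 6 a.m. window is an arbitrary example, and the time you pass in should be in the target website’s time zone rather than your own.

```python
from datetime import datetime, time as clock_time


def is_off_peak(now, start=clock_time(22, 0), end=clock_time(6, 0)):
    """Return True when now falls inside the off-peak window.

    The window may wrap past midnight, as the default 22:00 to 06:00 window does.
    """
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end
```

A bot could call `is_off_peak(datetime.now())` at the top of its loop and pause for a few minutes whenever it returns `False`.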
As much as possible, your bot should appear to the website like a normal user. This may mean tweaking the user-agent HTTP header, and also adding random pauses between requests. Again, we want the information, but we don’t want to hit them too hard, so inserting pauses is often a good idea.
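Both tweaks are only a few lines in most languages. The Python sketch below uses the standard library to set a browser-like user-agent header and to wait a random amount of time between requests. The user-agent string and the 2 to 6 second range are example values only, so pick ones that suit the site you’re working with.

```python
import random
import time
import urllib.request

# Example browser-style value for illustration; use whatever suits your project.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def random_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random interval and return how long the pause was."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A scraper would call `random_pause()` after each `urllib.request.urlopen(build_request(url))` so that requests don’t arrive at a fixed, machine-like rhythm.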
More threads, more hardware
Because the scraping occurs simultaneously, multiple threads will be needed. Our screen-scraper software is designed to be able to handle an arbitrary number of threads, so long as the underlying hardware can support them. As the number of scrapers multiplies, though, it may be necessary to add more hardware. We try to optimize things as much as we can (e.g., by only requesting the files we need), but it’s still possible that more computers (physical or virtual) will need to be added in order to accommodate the load. Our software is designed to be distributed across an arbitrary number of machines, so scaling up is relatively painless. Depending on how much you need to scale your solution, you may need to consider hardware needs up front, and even dynamically as the load changes.
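If you’re writing your own scrapers, a thread pool is a simple way to run several of them at once on a single machine. The Python sketch below uses the standard library’s `ThreadPoolExecutor`; it isn’t how our own software is built, and `scrape_one` is a stand-in for whatever function fetches and parses the results for one work item, such as a zip code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_scrapers(scrape_one, work_items, thread_count=5):
    """Run scrape_one over work_items on thread_count threads.

    Results come back in the same order as work_items.
    """
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(scrape_one, work_items))
```

Raising `thread_count` speeds things up only until the machine, the network, or the target site becomes the bottleneck, which is the point at which adding hardware starts to matter.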
Keeping track of it all
While scraping at a large scale it can be tricky to keep track of which scrapers are handling which sites, how many threads are dedicated to each, etc. We’ve built a “controller” application internally that handles all of this for us. The controller will monitor scrapes as they progress, and work closely with the iterators to ensure that everything is running smoothly. When errors occur it can generate alerts, and it provides a simple dashboard interface that developers can use to track everything. It also has the ability to work in a cloud-based environment, dynamically spawning and terminating virtual machines as the scraping load increases and decreases.
Using cloud services
When scraping on a big scale the ideal environment is the cloud, such as Amazon Web Services or Google Cloud. Aside from being able to spawn and terminate instances on demand, you also get high bandwidth and some natural anonymization via multiple IP addresses. Cloud services have also been commoditized to the point that running in the cloud is affordable even for small businesses. We’ve designed our software to integrate with multiple cloud services (as well as physical machines), so that it can scale without anyone having to manage hardware manually.
Scraping at scale can be relatively painless, but you need to have a plan in place before you start. As much as possible, you should minimize the impact on target servers, and only request from them the information you actually need. As you scale up, cloud-based services can be indispensable to doing it quickly and efficiently.