On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases it’s simpler to access an API than to crawl, and in theory there should be no need to scrape a site that already makes its content available that way. That said, there are a number of reasons why it may still make sense to scrape a site that also provides an API.
Oftentimes when an API is provided, limits are imposed on what data is made available. For example, when acquiring data from an ecommerce website the API may provide access to basic attributes such as price and description, but not images or available sizes.
With screen-scraping, all of the data you want is very often available on a single page. Rather than query the API for some data points and scrape others, it usually makes more sense to request the product details page once and grab all of the data points from it.
In virtually all cases, limits are imposed on the number of queries permitted in a 24-hour period. If you have a product catalog of 10,000 items you’d like to query, but you’re only permitted to access the API 500 times a day, you’d need to stretch the data acquisition out over 20 days.
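The arithmetic is simple but worth making explicit. A minimal back-of-envelope sketch, assuming one API call per catalog item (the catalog size and daily limit here are just the figures from the example above):

```python
import math

CATALOG_SIZE = 10_000  # products you want to query
DAILY_LIMIT = 500      # API calls permitted per 24-hour period

# Assuming one API call per product, a full pass over the catalog takes:
days_needed = math.ceil(CATALOG_SIZE / DAILY_LIMIT)
print(days_needed)  # 20 days before you see fresh data for every item
```

And that is for a single pass: if you need the data refreshed regularly, a daily cap like this can make the API route a non-starter.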
There’s no theoretical limit to how quickly a website can be scraped, but care should still be taken to not hit a site too hard. Not only can it degrade the performance of the site, but you may also need to deal with countermeasures such as IP address blocking and CAPTCHAs. See my post on Large-Scale Web Scraping for more on this.
It seems odd that this one would even come up, but we’ve encountered cases where data made available via an API is out-of-date. That is, the data you see on the website when browsing is more accurate and current than the data you can access through an API. This one is puzzling because the site owners are almost inviting scraping by not keeping their data current.
In cases where an API only provides access to old data, screen-scraping is almost a no-brainer. Unless you’re purely interested in historical information, it makes a lot more sense to get the data from the front end of the website, where you know it will be current.
In addition to imposing rate limits, at times API performance will be deliberately degraded. It may be that site owners want to invest resources in other areas, or perhaps they don’t necessarily want to encourage use of their API.
When scraping you can effectively acquire information as quickly as the site will allow. Again, you want to be careful that you don’t hit the site too hard, but you can often acquire data much faster by crawling, especially if the API limits the number of concurrent requests.
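A minimal sketch of what polite concurrent crawling can look like, assuming hypothetical example.com product URLs; the `fetch` function is stubbed out with a delay standing in for network latency, so the sketch runs offline (in practice you’d swap in a real HTTP GET):

```python
import time
from concurrent.futures import ThreadPoolExecutor

POLITE_DELAY = 0.2  # seconds each request takes; also throttles each worker
MAX_WORKERS = 4     # modest concurrency so we don't hammer the site

def fetch(url):
    """Stub for an HTTP GET; replace the body with a real request in practice."""
    time.sleep(POLITE_DELAY)  # stands in for network round-trip time
    return f"<html>page for {url}</html>"

# Hypothetical product URLs -- a real crawl would pull these from a sitemap
# or category listing pages.
urls = [f"https://example.com/products/{i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pages = list(pool.map(fetch, urls))  # results come back in input order

print(len(pages))  # 8 pages, in roughly a quarter of the serial time
```

With four workers the eight fetches finish in roughly a quarter of the serial time; an API that caps concurrent requests at one or two removes exactly this lever.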
When viewing data about a product on an ecommerce site you typically can get all of the details on a single page. In contrast, when accessing an API it’s often necessary to make multiple queries in order to get the data you’re after. For example, if you’re scraping data about furniture you may be able to get basic details about a couch with one query, but may need to make multiple queries to get information on patterns, finishes, and inventory. If you were to crawl the site for the same data it’s likely you could get everything with one request.
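To make the “one request gets everything” point concrete, here is a minimal sketch using only Python’s standard-library `html.parser`. The markup, class names, and product details are all hypothetical; a real site’s structure will differ, and a library like BeautifulSoup would usually do this job:

```python
from html.parser import HTMLParser

# Hypothetical markup for a couch's product page; class names will vary per site.
PAGE = """
<div class="product">
  <h1 class="name">Hartford Sofa</h1>
  <span class="price">$899.00</span>
  <span class="finish">Walnut</span>
  <span class="stock">12 in stock</span>
</div>
"""

class ProductParser(HTMLParser):
    """Collect the text content of each tag that carries a class attribute."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # class of the tag whose text we expect next

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("class")

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = ProductParser()
parser.feed(PAGE)
print(parser.fields)
# {'name': 'Hartford Sofa', 'price': '$899.00', 'finish': 'Walnut', 'stock': '12 in stock'}
```

One request, one parse, and the name, price, finish, and inventory are all in hand, where the equivalent API workflow might have taken three or four separate calls.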
Hard to believe, but oftentimes an API will change over time. Take a look at this StackOverflow discussion for an example of LinkedIn changing their API to restrict access. It’s true that websites will change, making it necessary to update scraping code, but so long as the information is still made available it can be acquired.
Go Get the Data
If the data you need is entirely accessible via an API you’re in luck; there’s a good chance you’ll want to take that route, keeping in mind the caveats above. If you know you need to scrape the data to get what you’re after, though, consider dropping us a line. We can provide a free estimate on what it would take to get the information you’re after.