Screen scraping is the use of a program to read data from someone else’s website. Since you don’t control that site, it can change without notice in ways that force changes to the scraping routine. Sometimes such changes lead to downtime for your scrape, development costs, or both. Some sites are very resistant to being scraped: they use tricks to block the scrape, we find a new way in, and it turns into an arms race. In some cases a site becomes prohibitively expensive to scrape long before it becomes impossible. For these reasons, if there is an alternative to scraping, I always recommend using it. Some common alternatives to scraping include:
API stands for “Application Programming Interface” and is basically a way for one computer to talk to another. Sites that supply an API will document what searches and data are available, and they will generally notify users in advance of a change in functionality or format. Most common APIs present data as JSON or XML, which leaves little room for missing a piece of data due to an unexpected format. APIs are designed to be lightweight and are therefore pretty fast.
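To show why structured JSON leaves so little room for missed data, here is a minimal sketch of reading an API response in Python. The response body and its field names (`results`, `id`, `title`, `price`) are made up for illustration; a real API will document its own format.

```python
import json

# Hypothetical API response body; the field names are invented for illustration.
SAMPLE_RESPONSE = """
{
  "results": [
    {"id": 101, "title": "Blue widget", "price": 4.99},
    {"id": 102, "title": "Red widget", "price": null}
  ],
  "total": 2
}
"""


def parse_listings(body):
    """Return (id, title, price) tuples from a JSON response body."""
    payload = json.loads(body)
    listings = []
    for item in payload["results"]:
        # .get() returns None when a key is absent instead of raising an error
        listings.append((item["id"], item["title"], item.get("price")))
    return listings


listings = parse_listings(SAMPLE_RESPONSE)
```

Every value is looked up by name rather than by its position on a page, so a missing price shows up as an explicit `None` instead of silently shifting the other fields.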
RSS is the abbreviation for Really Simple Syndication (really), and it’s used to inform you when new content is added to a site. It’s not always available, and it’s only useful when you are looking for newly added listings or items, but when you can use it, it’s awesome. It’s commonly seen on podcast sites, news outlets, and blogs, but it can also show up on retail sites, classified ads, etc. Most of the time an RSS feed carries a limited amount of information, but it can tell you what is new, give you some basic data, and offer a link to the page with the complete data. Sometimes RSS works best as a way to get just enough information to know whether you want to scrape a page.
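As a rough sketch of using a feed to decide what is worth scraping, the Python below reads a small RSS 2.0 document and keeps only the items whose links haven’t been seen before. The feed contents, the `example.com` URLs, and the `new_items` helper are assumptions for illustration, not any particular site’s feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real one would be downloaded from the site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Listings</title>
    <link>https://example.com/</link>
    <description>Newly added listings</description>
    <item>
      <title>First listing</title>
      <link>https://example.com/listings/1</link>
      <pubDate>Mon, 06 Jan 2025 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second listing</title>
      <link>https://example.com/listings/2</link>
      <pubDate>Tue, 07 Jan 2025 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return the feed items whose link is not already in seen_links."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link not in seen_links:
            items.append({
                "title": item.findtext("title"),
                "link": link,
                "published": item.findtext("pubDate"),
            })
    return items


# Pretend the first listing was picked up on an earlier run.
fresh = new_items(SAMPLE_FEED, seen_links={"https://example.com/listings/1"})
```

The links in `fresh` are the only pages left to visit, so the scraper touches the site far less often than it would by crawling every listing page.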
Many sites offer an HTML or XML sitemap to help search engines find pages on the site, and if they’re good for search engines they might be good for you. An HTML sitemap is just a webpage that has links to categories, sub-categories, articles, etc., and it’s very useful on sites that have a complicated menu. The XML ones are harder for a person to read, as they are meant for search engine bots, but they’re easy for a scraper to read.
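Here is a minimal sketch of how a scraper might read an XML sitemap in Python. The sample document and its `example.com` URLs are invented; the namespace is the one defined by the sitemaps.org protocol, and the `sitemap_urls` helper is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url>, and <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; a real one is often served at /sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-06</lastmod>
  </url>
  <url>
    <loc>https://example.com/category/widgets</loc>
  </url>
</urlset>"""


def sitemap_urls(sitemap_xml):
    """Return every page URL listed in an XML sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = sitemap_urls(SAMPLE_SITEMAP)
```

Large sites often split their sitemap into an index file that points at several smaller sitemaps; this sketch only handles a single `urlset` document.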
On some sites there is a link to a sitemap, usually at the bottom of the page, but I find a Google search the easiest way to find one. If I wanted to find the sitemap for Walgreens.com, for example, the search would look like:
sitemap site:walgreens.com
That searches for the word “sitemap,” and the “site:” operator restricts the search to that one domain. Google’s advanced search operators are amazing.
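If you check many domains, the same search (the word “sitemap” plus a “site:” restriction) can be turned into a link with a few lines of Python. This is only a sketch: `sitemap_search_url` is a hypothetical helper name, and it builds a URL to open in a browser rather than fetching any results.

```python
from urllib.parse import urlencode


def sitemap_search_url(domain):
    """Build a Google search URL that looks for a sitemap on one domain."""
    query = f"sitemap site:{domain}"
    # urlencode escapes the space and the colon so the query is URL-safe
    return "https://www.google.com/search?" + urlencode({"q": query})


url = sitemap_search_url("walgreens.com")
```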
Often, the trickiest part of building a useful scraping routine is finding a way to get at the data you need. If a site has tools to help you find things, they’re a huge help in finding a fast, accurate, and pain-free way to get it.