The Mobile Problem
The proliferation of mobile devices has created a problem. Most web sites these days are designed to be viewed on desktop computers with high-resolution monitors and via web browsers that allow for sophisticated interactivity. Anyone who’s tried to view such sites on mobile devices with small screens can attest to a cramped feeling. Even the very best mobile web browsers leave you wanting more space. The advent of mobile apps has helped some in this respect. Many content providers simply create customized interfaces via apps to make their data usable. Apps are great, but there still exists a significant portion of information on the Web that isn’t easily accessible on mobile devices. This is where screen-scraping can often fill the gap.
Ideally, content providers such as travel and news web sites would offer either an app or a mobile-friendly version of their site. There are a variety of reasons why this may not happen, though, so third parties may use screen-scraping to provide alternate interfaces.
The approach you’d take to screen-scrape for mobile devices doesn’t differ too much from any other kind of screen-scraping. I’ll present a couple of scenarios that will likely be similar to many sites you’d want to scrape.
Scraping Real Estate Data
There are a lot of sites out there that list information related to real estate. This includes commercial sites like Realtor.com and Zillow, but there are also a staggering number of government and county web sites that contain invaluable real estate data. If you’re a realtor or home appraiser, it might be helpful to have information related to a specific property while you’re out and about. To meet this need, a software development group might build an app that provides detailed real estate information on a mobile device. Let’s use the web site of Arizona’s Maricopa County as an example. The site allows you to search for properties via a number of methods, including address and street name. If you’re a software developer, your app might take a street address as an input parameter, then search for a property at that location. If you perform such a search on the Maricopa site you might end up with a property like this one. That page contains all kinds of information about the property, but maybe you’re only interested in a handful of data points:
The parcel number, property description, and most recent valuation information may be the most important parts. You also wouldn’t want to attempt to display too much of this data on a mobile device because of the limited screen real estate. The nice thing about screen-scraping is that you can be very precise in what you extract.
It’s likely that this information won’t change too frequently. As such, it may make sense to simply extract all records from the web site, deposit desired data points into a database, then scrape again periodically to ensure that the information is current. Even though it could be a relatively large data set, it may be better to grab it all at once rather than hitting the site in piecemeal fashion as the data is needed. This would likely mean less of a load on the target web site, and also better performance as you wouldn’t be relying on the web site to return the information to you in real time. In such a case the best approach would be to get the information into a database, then, when the data is requested from the mobile device, grab it directly out of your database rather than relying on the Maricopa site. The flow would end up looking something like this:
In other words, the scraping is not done in real time. You extract the information in a batch process, then deposit it into a database. Once it’s there, the mobile device can make a request containing a property address to your web server, which then retrieves the corresponding record from your database, then passes it down to the mobile device. Using either an app or a mobile-friendly web page, you could then display the information on the device in a much more usable format.
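The batch flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout, the field names (`parcel_number`, `description`, `valuation`), and the address normalization are all hypothetical, and the scraping step itself is left out. The records are assumed to have already been extracted from the county site.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per property, keyed by normalized address.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS properties ("
        "address TEXT PRIMARY KEY, "
        "parcel_number TEXT, "
        "description TEXT, "
        "valuation INTEGER)"
    )


def normalize_address(address):
    # Upper-case and collapse whitespace so lookups are less brittle.
    return " ".join(address.upper().split())


def store_batch(conn, records):
    # Called by the periodic batch job once records have been scraped.
    # INSERT OR REPLACE refreshes rows that were stored by an earlier run.
    rows = [
        (
            normalize_address(r["address"]),
            r["parcel_number"],
            r["description"],
            r["valuation"],
        )
        for r in records
    ]
    conn.executemany("INSERT OR REPLACE INTO properties VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def lookup(conn, address):
    # Called by the web server when a mobile device requests a property.
    # Only the local database is read; the county site is never contacted.
    row = conn.execute(
        "SELECT parcel_number, description, valuation "
        "FROM properties WHERE address = ?",
        (normalize_address(address),),
    ).fetchone()
    if row is None:
        return None
    return {"parcel_number": row[0], "description": row[1], "valuation": row[2]}
```

The point of the split is that `store_batch` runs on a schedule while `lookup` runs per request, so the response time seen on the mobile device depends only on your own database.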
Scraping Travel Air Fares
Let’s suppose you’re interested in extracting air fares from an airline such as Southwest Airlines. In contrast to the previous example, air fare information is very volatile and, as such, couldn’t be scraped in a batch to be accessed later from a database. That is, the information would need to be scraped in real time, as the user performs a search. If you perform such a search on the Southwest Airlines site you’ll get a page that looks something like this:
It would be a relatively simple matter to program a screen-scraping application to iterate over each row of search results, extracting information such as the departure times and the prices. Because this data would need to be scraped in real time, the architecture would look a bit different:
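To make the row-by-row extraction concrete, here is a minimal sketch using only Python’s standard library. The markup in `SAMPLE` is invented for illustration and is not the airline’s actual page; the class names `depart` and `price` are assumptions, and a real results page would need its own selectors.

```python
from html.parser import HTMLParser


class FareRowParser(HTMLParser):
    """Collects the text of 'depart' and 'price' cells for each table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the <tr> currently being read
        self._field = None  # name of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            css_class = dict(attrs).get("class")
            if css_class in ("depart", "price"):
                self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that had both cells; header rows are skipped.
            if "depart" in self._row and "price" in self._row:
                self.rows.append(self._row)
            self._row = None


def parse_fares(html):
    parser = FareRowParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "depart": row["depart"],
            "price": float(row["price"].lstrip("$").replace(",", "")),
        }
        for row in parser.rows
    ]


# Hypothetical search-results markup, used only to exercise the parser.
SAMPLE = """
<table>
<tr><th>Departs</th><th>Price</th></tr>
<tr><td class="depart">6:05 AM</td><td class="price">$129.00</td></tr>
<tr><td class="depart">9:40 AM</td><td class="price">$98.50</td></tr>
</table>
"""
```

Calling `parse_fares(SAMPLE)` yields one dictionary per fare row, which is already compact enough to send to a mobile client.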
In this case the mobile device sends its request to the web server, which in turn passes a request along to a screen-scraper application, which gets the data from the web site, then sends it back down the line. We’ve added a little twist to this example, though: depending on how much traffic the service gets, it may be prudent to add multiple screen-scraping applications to help balance the load. In the case of our own screen-scraping software, a given instance can handle multiple requests simultaneously, but the scraping load can be distributed even further across multiple screen-scraper instances, which may be running on different computers.
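One simple way the web server might spread requests across several scraper instances is round-robin selection, sketched below. This is an assumption made for illustration, not a description of how any particular screen-scraping product balances load. The instance names and the `send` callable are hypothetical; in practice `send` would make the network call to the chosen instance.

```python
import itertools
import threading


class RoundRobinDispatcher:
    """Hands each incoming request to the next scraper instance in turn."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one scraper instance is required")
        self._cycle = itertools.cycle(list(instances))
        # The web server may handle requests on several threads at once.
        self._lock = threading.Lock()

    def next_instance(self):
        with self._lock:
            return next(self._cycle)

    def dispatch(self, query, send):
        # `send` is a caller-supplied function (hypothetical here) that
        # forwards the query to the given instance and returns its result.
        return send(self.next_instance(), query)


if __name__ == "__main__":
    # Hypothetical instance names; no network traffic is generated.
    dispatcher = RoundRobinDispatcher(["scraper-1", "scraper-2", "scraper-3"])
    for _ in range(4):
        print(dispatcher.dispatch("PHX to SLC", lambda inst, q: (inst, q)))
```

Round-robin ignores how busy each instance is. If searches vary a lot in duration, choosing the instance with the fewest requests in flight may balance better, at the cost of tracking that state.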