Yesterday ReadWriteWeb published an article entitled “Overwhelmed Executives Still Crave Big Data, Says Survey”. The basic gist of it is that data is vital to making business decisions, and many managers feel that they don’t have enough of it. This got me thinking about how screen-scraping plays into all of this.
At a basic level, as a data extraction company, we deal in information. It really doesn’t make much difference what industry the information pertains to; if it’s out there on the Web, we can probably grab it. There’s a lot of talk these days about information overload, which is unquestionably a real phenomenon, but often the problem isn’t so much the quantity of the information as it is getting access to that information in a usable format. If the data you’re interested in consists of hundreds of thousands of records spread across dozens of web sites, it may not be nearly as useful as it would be if it could be searched and analyzed in a single repository. Much of the time this is what we do: we’re tasked with aggregating large numbers of data points, normalizing and cleaning them up, then consolidating them into a highly structured central repository. Once the data is in such a repository, its real value surfaces. It’s at this point that the information can be analyzed statistically, summarized, or browsed in a structured way. This leads to business intelligence, which in turn (hopefully) yields good business decisions.
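To make that normalize-and-consolidate step a little more concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration: the two “sites,” their field names, and the SQLite schema are hypothetical stand-ins for whatever the real sources and repository would look like.

```python
import sqlite3

# Hypothetical raw records as they might arrive from two different sites,
# each with its own field names and formats.
site_a_records = [
    {"Make": "Toyota", "Model": "Camry", "ModelYear": "2012", "Price": "$14,500"},
    {"Make": "Honda", "Model": "Civic", "ModelYear": "2013", "Price": "$13,900"},
]
site_b_records = [
    {"make": "toyota", "model": "camry", "yr": 2011, "asking_price": 12750.0},
]

def normalize(record, source):
    """Map a raw record from either source into one common schema."""
    if source == "site_a":
        return {
            "make": record["Make"].strip().lower(),
            "model": record["Model"].strip().lower(),
            "year": int(record["ModelYear"]),
            "price": float(record["Price"].replace("$", "").replace(",", "")),
        }
    else:  # site_b
        return {
            "make": record["make"].strip().lower(),
            "model": record["model"].strip().lower(),
            "year": int(record["yr"]),
            "price": float(record["asking_price"]),
        }

# The "central repository" for this sketch is an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (make TEXT, model TEXT, year INTEGER, price REAL)")

for rec in site_a_records:
    conn.execute("INSERT INTO listings VALUES (:make, :model, :year, :price)",
                 normalize(rec, "site_a"))
for rec in site_b_records:
    conn.execute("INSERT INTO listings VALUES (:make, :model, :year, :price)",
                 normalize(rec, "site_b"))

# Once everything lives in one structured table, simple analysis becomes easy.
for make, avg_price in conn.execute("SELECT make, AVG(price) FROM listings GROUP BY make"):
    print(make, round(avg_price, 2))
```

The point isn’t the particular tools; it’s that once disparate records have been mapped into one consistent structure, questions that were impractical to ask across dozens of web sites become a one-line query.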
On a related note, as mentioned in the article, timeliness of information can also be critical. Once again, screen-scraping can play an important role here. I can’t count the number of times a client has approached us for a project when they already have access to all (or most of) the information they want us to acquire. The trouble is that, much of the time, the data they already have is old, inaccurate, and/or incomplete. Web sites and other data providers will often provide an API to their information. This can be a great thing; however, much of the time the API is insufficient because it only exposes information that is old or incomplete. For example, if you want information about automobile sales, an API may give you the make, model, and year of a car that was sold, but not the asking price. In contrast, live web sites generally contain the most up-to-date, complete, and accurate representation of the information. As such, even when data may be available via an API (or, gasp, a mailed CD), it’s often better to go directly to the web site if you want the best data.
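As a rough illustration of “going directly to the web site,” here’s a small Python sketch that pulls an asking price off a listing page when the API doesn’t provide it. The URL, the CSS class name, and the markup pattern are all invented; a real scrape would be tailored to the target site’s actual HTML (and would respect its terms of use).

```python
import re
import urllib.request

# Hypothetical listing page; in practice this would be the live site whose
# API omits the field we care about.
LISTING_URL = "https://example.com/used-cars/listing/12345"

def fetch_asking_price(url):
    """Fetch a listing page and extract the asking price the API doesn't expose.

    The regex below matches an invented markup pattern
    (<span class="asking-price">$14,500</span>) purely for illustration.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    match = re.search(r'<span class="asking-price">\$([\d,]+)</span>', html)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

if __name__ == "__main__":
    print("Asking price:", fetch_asking_price(LISTING_URL))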