Combining Scraped Data from Multiple Sites

Posted in Thoughts on 01.30.19 by Todd Wilson

Often data sets become richer when they’re combined. A good example of this is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which rates the quality of movies.

Read more »

Enterprise-Scale Screen-Scraping

Posted in Thoughts on 11.23.10 by Todd Wilson

One of the main ways that I think screen-scraper differs from many other solutions is in its ability to handle large-scale scraping needs. Additionally, it was designed from the ground up to integrate with other systems, so it generally fits nicely into almost any existing setup.

If you’re doing a simple one-off data extraction project, screen-scraper could certainly handle it, but, truthfully, you may be better off with something a little more quick-and-easy. On the other hand, if you’re looking to pull data from multiple web sites, and need the extracted data to be made available to other systems, screen-scraper is an excellent option. There are many solutions out there that may get you up and running fairly quickly, but that would fall apart when faced with some of the jobs screen-scraper tackles.

Along these lines, we’ve added a new Enterprise-Ready page to our site that summarizes some of what screen-scraper can do. If you need big iron for your project, take a close look at what screen-scraper offers.

Data Cravings

Posted in Thoughts on 11.10.10 by Todd Wilson

Yesterday ReadWriteWeb published an article entitled “Overwhelmed Executives Still Crave Big Data, Says Survey”. The basic gist of it is that data is vital to making business decisions, and many managers feel that they don’t have enough of it. This got me thinking about how screen-scraping plays into all of this.

At a basic level, as a data extraction company, we deal in information. It really doesn’t make much difference what industry the information pertains to; if it’s out there on the Web, we can probably grab it. There’s a lot of talk these days about information overload, which is unquestionably a real phenomenon, but oftentimes it’s not so much the quantity of the information as it is getting access to that information in a usable format. If the data you’re interested in consists of hundreds of thousands of records spread across dozens of web sites, it may not be nearly as useful as it would be if it could be searched and analyzed in a single repository. Much of the time this is what we do. We’re tasked with aggregating large numbers of data points, normalizing and cleaning them up, then consolidating them into a highly structured central repository. Once the data is in such a repository, its real value surfaces. It’s at this point that the information can be analyzed statistically, summarized, or browsed in a structured way. This leads to business intelligence, which in turn (hopefully) yields good business decisions.

On a related note, as mentioned in the article, timeliness of information can also be critical. Once again, screen-scraping can play an important role here. I can’t count the number of times a client has approached us for a project when they already have access to all (or most of) the information they want us to acquire. The trouble is that much of the time the data they already have is old, inaccurate, and/or incomplete. Web sites and other data providers will often provide an API to their information. This can be a great thing; however, much of the time the API is insufficient because it provides access to information that is old or incomplete. For example, if you want information about automobile sales, an API may give you the make, model, and year of a car that was sold, but not the asking price. In contrast, live web sites generally contain the most up-to-date, complete, and accurate representation of the information. As such, even when data may be available via an API (or, gasp, a mailed CD), it’s often better to go directly to the web site if you want the best data.

Further thoughts on hindering screen-scraping

Posted in Thoughts, Tips on 08.17.09 by jason

We previously listed some means of trying to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting. Any site can be scraped, but some require such an investment of time and resources as to make scraping them prohibitively expensive. Some of the common methods for doing so are:

Turing tests

The most common implementation of the Turing Test is the old CAPTCHA, which tries to ensure that a human reads the text in an image and types it into a form.

We have found that a large number of sites implement very weak CAPTCHAs that take only a few minutes to get around. On the other hand, there are some very good implementations of Turing Tests that we would opt not to deal with given the choice, but sophisticated OCR can sometimes overcome them, and many bulletin board spammers have clever tricks for getting past them as well.

Data as images

Sometimes you know which parts of your data are valuable. In that case it becomes reasonable to replace that text with an image. As with the Turing Test, there is OCR software that can read it, and there’s no reason we can’t save the image and have someone read it later.
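
As a rough illustration of that OCR route, here is a minimal sketch assuming the open-source Tesseract engine and its pytesseract Python wrapper (plus Pillow) are installed; the file name is a hypothetical placeholder for an image saved during a scrape.

    # A minimal sketch: read text out of a scraped image with OCR.
    # Assumes Tesseract plus the pytesseract and Pillow packages are
    # installed; "price.png" is a hypothetical saved image.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("price.png"))
    print(text.strip())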

Oftentimes, however, presenting data as an image without a text alternative is in violation of the Americans with Disabilities Act (ADA), and the practice can be overcome with a couple of phone calls to a company’s legal department.

Code obfuscation

Using something like a JavaScript function to render data on the page even though it appears nowhere in the HTML source is a good trick. Other examples include scattering prolific, extraneous comments throughout the page, or ordering page elements in an unpredictable way (the example I think of used CSS to make the display appear the same no matter how the code was arranged).

CSS Sprites

Recently we’ve encountered some instances where a page has a single image containing numbers and letters, and uses CSS to display only the characters desired. This is in effect a combination of the previous two methods. First we have to get that master image and read what characters are there, then we need to read the site’s CSS and determine which character each tag points to.

While this is very clever, I suspect it too would run afoul of the ADA, though I’ve not tested that yet.

Limit search results

Most of the data we want to get at is behind some sort of form. Some forms are easy, and submitting them blank will yield all of the results. Some need an asterisk or percent sign in the form. The hardest ones are those that will return only so many results per query. Sometimes we just make a loop that submits each letter of the alphabet to the form, but if that’s too general, we have to make a loop that submits every combination of two or three letters; at three letters, that’s 17,576 page requests (see the sketch below).
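
For concreteness, here is a minimal sketch of that brute-force loop in Python; the search URL and field name in the comment are hypothetical placeholders, and no requests are actually sent here.

    # Generate every 3-letter search term: 26 ** 3 = 17,576 queries.
    import itertools
    import string

    def search_terms(length):
        """Yield every combination of `length` lowercase letters."""
        for combo in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(combo)

    terms = list(search_terms(3))
    print(len(terms))  # 17576
    # Each term would then be submitted to the site's search form, e.g.:
    # session.post("http://example.com/search", data={"q": term})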

IP Filtering

On occasion, a diligent webmaster will notice a large number of page requests coming from a particular IP address, and block requests from that address. There are a number of ways to route requests through alternate IP addresses, however, so this method isn’t generally very effective.
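
As a rough example of what routing through alternate addresses can look like, here is a minimal sketch using the Python requests library; the proxy URLs are hypothetical placeholders for proxies you would actually control.

    # Rotate outbound requests across a pool of HTTP proxies so that no
    # single IP address generates all of the traffic. Proxy hosts below
    # are placeholders.
    import itertools
    import requests

    proxy_pool = itertools.cycle([
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ])

    def fetch(url):
        proxy = next(proxy_pool)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)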

Site Tinkering

Scraping always keys off of certain things in the HTML. Some sites have the resources to constantly tweak their HTML so that any scrape quickly goes out of date, making it cost-ineffective to continually update the scrape to keep up with the ever-changing markup.

How to Measure Anything

Posted in Thoughts on 03.16.07 by Todd Wilson

A while back I was contacted by Douglas Hubbard regarding a book he was writing entitled How to Measure Anything. He was interested in finding out more about tools that could automate online data collection, and screen-scraping popped up on his list as one method to go about this. Last week Douglas contacted me indicating that he was essentially done with the work, and it was on its way to press. He sent me a recent draft copy, and asked if I might blog a bit about it. I happily consented, and, I have to admit, I’ve really enjoyed what I’ve read so far.

Before digging into my commentary, I thought I’d include a snippet from the book that deals specifically with screen-scraping:

There is quite a lot of information on the internet and it changes fast. If you use a standard search engine, you get a list of websites, but that’s it. Suppose, instead, you needed to measure the number of times your firm’s name comes up in certain news sites or measure the blog traffic about a new product. You might even need to use this information in concert with other specific data reported in structured formats on other sites, such as economic data from government agencies, etc.

Internet “Screen-scrapers” are a way to gather all this information on a regular basis without hiring a 24×7 staff of interns to do it all. You could use a tool like this to track used-market versions of your product on www.ebay.com, correlate your stores’ sales in different cities to the local weather by screen-scraping data from www.weather.com, or even just track the number of hits on your firm’s name on various search engines hour-by-hour. As a search on the internet will reveal, there are several examples on the web of “mashups” where data is pulled from multiple sources and presented in a way that provides new insight. A common angle with mashups now is to plot information about business, real estate, traffic, and so on against a map site like Mapquest or Google Earth. I’ve found a mashup of Google Earth and real-estate data on www.housingmaps.com that allows you to see recently sold home prices on a map. Another mashup on socaltech.com shows a map that plots locations of businesses that recently received venture capital. At first glance, someone might think these are just for looking to buy a house or find a job with a new company. But how about research for a construction business or forecasting business growth in a new industry? We are limited only by our resourcefulness.

You can imagine almost limitless combinations of analysis by creating mashups of sites like Myspace and/or YouTube to measure cultural trends or public opinion. Ebay gives us tons of free data about the behavior of sellers and buyers and what is being bought and sold, and there are already several powerful analytical tools to summarize all the data on Ebay. Comments and reviews of individual products on the sites of Sears, Walmart, Target, and Overstock.com are a source of free input from consumers if we are clever enough to exploit it. The mind reels.

If you step back from it, fundamentally screen-scraping simply deals with repurposing information. The information you’re after just happens to be in a format that makes it less usable, and screen-scraping allows you to put it into a format that is. As Douglas points out, the ability to do this leads to infinite possibilities.

He touches on a few basic reasons for doing screen-scraping:

  1. Watching information as it changes over time.
  2. Aggregating data into a single repository.
  3. Combining information from multiple sources in such a way that the whole is greater than the sum of the parts.

Chances are, any one of us could come up with all kinds of examples of each, and many of them would apply directly to the type of work we do. Every industry deals with information. It’s likely that some of the information you deal with on a day-to-day basis would be more useful to you if it could be repurposed in one of the three ways I mention. How would your business benefit if you could be notified whenever one of your products is mentioned? How much time could you save if you were able to take any existing set of data you deal with frequently and enrich it by aggregating related information onto it? For example, you might take real estate property listings and enhance them by adding information for each property that can be readily obtained from a county assessor’s web site. The end product could be quite useful, but it would be unreasonable to manually copy and paste the information from the web site. Screen-scraping allows this kind of thing to be done in an automated fashion.

How to Measure Anything isn’t available just yet, but I’d highly recommend keeping an eye out for it. If you work in an industry that deals with information and measurement (and I can’t think of one that doesn’t), you’d likely benefit from the principles Douglas teaches. Keep an eye on his How to Measure Anything web site for updates, or if you’d like to pre-order the book.

Using screen-scraper to automatically test embedded devices

Posted in Miscellaneous, Thoughts on 09.12.06 by Todd Wilson

A while back I flew out to Huntsville, AL to work with a government contractor company on automating the testing of embedded devices. To this day I’m not entirely sure what these little machines did, but they each had a web interface that needed testing (much like that of a wireless router, if you’ve worked with those before). This isn’t the most common usage for screen-scraper, but it turned out to be just what they needed.

I worked closely with Greg Chapman, one of their engineers, and he recently wrote an article on the experience entitled Testing aerospace UUTs leads to Web solution. Greg’s a smart guy, and has continued to use screen-scraper in ways that I wouldn’t have even considered.

It’s gratifying to see screen-scraper used in so many different ways, but it’s interesting that its versatility has at times almost been a curse for us. Our software can be used for all kinds of purposes, but we’re finding that, from a business standpoint, we’re often better off narrowing our focus to very specific applications. As one marketing expert we consulted with put it, “You guys have plastic.” Plastic is incredibly useful, but it gains value as you craft it into something with a specific purpose. I’m planning on blogging more about this idea later, but it’s interesting to consider the pros and cons of a general-purpose tool like screen-scraper.

Developing software by the 15% rule

Posted in Miscellaneous, Thoughts on 08.24.06 by Todd Wilson

Writing software on a consulting basis can often be a losing proposition for developers or clients or both. There are too many things that can go wrong, and that ultimately translates into loss of time and money. The “15% rule” we’ve come up with is intended to create a win-win situation for both parties (or at least make it fair for everyone). Clients generally get what they want, and development shops make a fair profit. It’s not a perfect solution, but so far it seems to be working for us.

This may come as a surprise to some, but we make very little money selling software licenses. The vast majority of our revenue comes through consulting services–writing code for hire. Having now done this for several years, we’ve learned some hard lessons. On a few projects the lessons were so hard we actually lost money.

A few months ago I put together somewhat of a manifesto-type document intended to address the difficulties we’ve faced in developing software for clients. I’m pleased to say that it’s made a noticeable difference so far for us. My hope is that this blog entry will be read by others who develop software on a consulting basis, so that they can learn these lessons the easy way rather than the way we learned them.

What follows in this article is a summary of one of the main principles we now follow in developing software–the 15% rule. If you’d like, you’re welcome to read the full “Our Approach to Software Development” document.

For the impatient, the 15% rule goes like this…

Before undertaking a development project we create a statement of work (which acts as a contract and a specification) that outlines what we’ll do, how many hours it will require, and how much it will cost the client. As part of the contract we commit to investing up to the amount of time outlined in the document plus 15%. That is, if the statement of work says that the project will take us 100 hours to complete, we’ll spend up to 115 hours on it (but no more). As to the whys and wherefores of how this works, read on.

Those who have developed software for hire know that the end product almost never ends up exactly as the client had pictured it. There are invariably tweaks that need to be made (which may or may not have been discussed up front) in order to get the thing to at least resemble what the client has in mind. And, yes, this can happen even if you spend hours upon hours fine-tuning the specification to reflect the client’s wishes. Additionally, technical issues can crop up that weren’t anticipated by the programming team. In theory, the better the programming team, the less likely this should be, but it doesn’t always end up that way (Microsoft’s Vista operating system is a sterling example). These two factors, among others, constitute the risk that is inherent in the project. Something isn’t going to go right, and that will almost always mean someone pays or loses more money than originally anticipated. The question is, who should be responsible for those extra dollars?

Up until relatively recently, we would shoulder almost all of the risk in our projects. If the app didn’t do what the client had in mind, or if unforeseen technical issues cropped up, it generally came out of our pockets. For the most part it wasn’t a huge problem, but it always seemed to have at least some effect (the extreme cases obviously being when we lost money on a project).

This seems kind of unfair, doesn’t it? The risk inherent to the project isn’t necessarily the fault of either party. It’s just there. We didn’t put it there, and neither did the client. As such, it shouldn’t be the case that one party shoulders it all. That’s where the 15% rule comes in.

The 15% rule allows both parties to share the risk. By following this rule, we’re acknowledging that something probably won’t go as either party intended, so we need a buffer to handle the stuff that spills over. By capping it at a specific amount, though, we’re also ensuring that the buffer isn’t so big that it devours the profits of the developers.

For the most part, the clients with whom we’ve used the 15% rule are just fine with it. It is a pretty reasonable arrangement, after all. We have had the occasional party that squirms and wiggles about it, but, in the end, they’ve gone along with it and I think everyone has benefited as a result.

Three common methods for data extraction

Posted in Miscellaneous, Thoughts on 03.21.06 by Todd Wilson

Building off of my earlier posting on data discovery vs. data extraction, in the data extraction phase of the web scraping process you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML.

Probably the most common technique traditionally used to do this is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you’re already familiar with regular expressions and your scraping project is relatively small, they can be a great solution.
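
To make that concrete, here is a minimal sketch of the regular-expression approach in Python; the HTML snippet and pattern are illustrative only and would need hardening for messier real-world markup.

    # Pull URLs and link titles out of raw HTML with a regular expression.
    import re

    html = '<a href="http://example.com/news/1">First headline</a>'
    link_pattern = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

    for url, title in link_pattern.findall(html):
        print(url, title)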

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing “ontologies”, or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they’re often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it’s probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what’s the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

  • If you’re already familiar with regular expressions and at least one programming language, this can be a quick solution.
  • Regular expressions allow for a fair amount of “fuzziness” in the matching such that minor changes to the content won’t break them.
  • You likely don’t need to learn any new languages or tools (again, assuming you’re already familiar with regular expressions and a programming language).
  • Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It’s also nice that the various regular expression implementations don’t vary too significantly in their syntax.

Disadvantages:

  • They can be complex for those who don’t have a lot of experience with them. Learning regular expressions isn’t like going from Perl to Java. It’s more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
  • They’re often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you’ll see what I mean.
  • If the content you’re trying to match changes (e.g., they change the web page by adding a new “font” tag) you’ll likely need to update your regular expressions to account for the change.
  • The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You’ll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there’s no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

  • You create it once and it can more or less extract the data from any page within the content domain you’re targeting.
  • The data model is generally built in. For example, if you’re extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
  • There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

  • It’s relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
  • These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you’re targeting.
  • You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you’ll only get into ontologies and artificial intelligence when you’re planning on extracting information from a very large number of sources. It also makes sense to do this when the data you’re trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

  • Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
  • Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
  • Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

  • The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
  • A potential cost. Most ready-to-go screen-scraping applications are commercial, so you’ll likely be paying in dollars as well as time for this solution.
  • A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you’re locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you’re using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don’t mind paying a bit, you can save yourself a significant amount of time by using one. If you’re doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you’re probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we’ve been involved with that has actually required a hybrid approach of two of the aforementioned methods. We’re currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term “number of bedrooms” can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we’ve done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it’s handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we’ve written that uses ontologies in order to extract out the individual pieces we’re after. Once the data has been extracted we then insert it into a database.

Data discovery vs. data extraction

Posted in Miscellaneous, Thoughts on 03.16.06 by Todd Wilson

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other end of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the “details” links within the search results pages to get to the data you’re actually after. In the former case, a simple Perl script would often work just fine. For anything much more complex than that, though, a tool like screen-scraper can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
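
To illustrate what that more involved flavor of data discovery can look like in code, here is a hedged sketch using the Python requests library; the URLs, form fields, and link pattern are hypothetical placeholders rather than anything from a real site.

    # A sketch of the data-discovery steps: log in, search, follow details links.
    import re
    import requests

    session = requests.Session()  # carries cookies from one request to the next

    # Step 1: log in, which sets the session cookies the site expects.
    session.post("http://example.com/login", data={"user": "me", "password": "secret"})

    # Step 2: submit the search form via POST.
    results = session.post("http://example.com/search", data={"query": "*", "page": 1})

    # Step 3: follow each "details" link found in the results page.
    for path in re.findall(r'href="(/details/\d+)"', results.text):
        detail_page = session.get("http://example.com" + path)
        # ...hand detail_page.text off to the data-extraction phase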

In the data extraction phase you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so screen-scraper hides most of those details behind the scenes, which simplifies the process. screen-scraper actually uses regular expressions to perform the data extraction, but you may not even be aware of that when you use it.

As an addendum, I should probably mention a third phase that is often ignored: what do you do with the data once you’ve extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site, you might even scrape the information and display it in the user’s web browser in real time. When shopping around for a screen-scraping tool, you should make sure that it gives you the flexibility you need to work with the data once it’s been extracted. One of the primary design goals of screen-scraper was to make it as flexible as possible in this regard. Our FAQ on saving information to a database gives several suggestions on how screen-scraper can be used for this.
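
As a small illustration of the simplest of those options, writing extracted records to a CSV file, here is a minimal sketch; the field names and records are made up for the example.

    # Write extracted records to a CSV file using Python's standard library.
    import csv

    records = [
        {"title": "First headline", "url": "http://example.com/news/1"},
        {"title": "Second headline", "url": "http://example.com/news/2"},
    ]

    with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(records)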

Data mining vs. screen-scraping

Posted in Thoughts on 02.16.06 by Todd Wilson

Data mining isn’t screen-scraping. I know that some people in the room may disagree with that statement, but they’re actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, where data mining allows you to analyze information. That’s a pretty big simplification, so I’ll elaborate a bit.

The term “screen-scraping” comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can “crawl” or “spider” through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the “practice of automatically searching large stores of data for patterns.” In other words, you already have the data, and you’re now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what’s already there.

The difficulty is that people who don’t know the term “screen-scraping” will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks. For example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose “scraping” is sort of like “ripping” 🙂 ). So it presents a bit of a problem–we don’t necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.