Three common methods for data extraction

Posted in Miscellaneous, Thoughts on 03/21/06 by Todd Wilson

Building on my earlier post about data discovery vs. data extraction: in the data extraction phase of the web scraping process you’ve already arrived at the page containing the data you’re interested in, and you now need to pull it out of the HTML.

Probably the most common technique traditionally used to do this is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you’re already familiar with regular expressions and your scraping project is relatively small, they can be a great solution.
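As a quick illustration, here’s a minimal Python sketch of my own (not code from any of the tools mentioned) showing what regex-based extraction of URLs and link titles might look like:

```python
import re

# A deliberately simple pattern: match <a href="...">title</a> pairs.
# Real-world HTML is messier, which is exactly why patterns like this
# tend to be fragile.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_links(html):
    """Return a list of (url, title) tuples found in the HTML."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(html)]

html = '''
<ul>
  <li><a href="/news/1">First headline</a></li>
  <li><a class="hot" href="/news/2">Second headline</a></li>
</ul>
'''
print(extract_links(html))
```

On tidy markup like this the pattern does the job in a few lines, which is the whole appeal for small projects; the trade-offs are the same ones discussed in the disadvantages list.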

Other techniques for getting the data out can get very sophisticated, as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing “ontologies”, or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they’re often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it’s probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what’s the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

  • If you’re already familiar with regular expressions and at least one programming language, this can be a quick solution.
  • Regular expressions allow for a fair amount of “fuzziness” in the matching such that minor changes to the content won’t break them.
  • You likely don’t need to learn any new languages or tools (again, assuming you’re already familiar with regular expressions and a programming language).
  • Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It’s also nice because the various regular expression implementations don’t vary too significantly in their syntax.

Disadvantages:

  • They can be complex for those who don’t have a lot of experience with them. Learning regular expressions isn’t like going from Perl to Java. It’s more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
  • They’re often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you’ll see what I mean.
  • If the content you’re trying to match changes (e.g., they change the web page by adding a new “font” tag) you’ll likely need to update your regular expressions to account for the change.
  • The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You’ll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there’s no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

  • You create it once and it can more or less extract the data from any page within the content domain you’re targeting.
  • The data model is generally built in. For example, if you’re extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
  • There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

  • It’s relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
  • These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you’re targeting.
  • You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you’ll only get into ontologies and artificial intelligence when you’re planning on extracting information from a very large number of sources. It also makes sense to do this when the data you’re trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.
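To make the built-in data model idea concrete, here is a toy Python sketch of my own. All the field names and synonym lists below are hypothetical, invented purely for illustration; a real ontology engine is far more sophisticated, but the core mapping principle is the same:

```python
# A tiny "ontology": a vocabulary mapping the many ways a field can be
# labeled on different sites onto one canonical field name.
CAR_ONTOLOGY = {
    "make": ["make", "manufacturer", "brand"],
    "model": ["model", "model name"],
    "price": ["price", "asking price", "cost"],
}

def extract_record(labeled_pairs, ontology):
    """Map raw (label, value) pairs onto the canonical data model."""
    record = {}
    for label, value in labeled_pairs:
        key = label.strip().lower()
        for field, synonyms in ontology.items():
            if key in synonyms:
                record[field] = value
    return record

# Two differently labeled pages both map onto the same structure,
# ready to be inserted into the correct database columns.
page_a = [("Manufacturer", "Honda"), ("Model", "Civic"), ("Asking Price", "$4,500")]
page_b = [("Brand", "Honda"), ("Model name", "Civic"), ("Cost", "$4,500")]
print(extract_record(page_a, CAR_ONTOLOGY))
print(extract_record(page_b, CAR_ONTOLOGY))
```

The point of the sketch is the maintenance story: when a site relabels a field, you extend the vocabulary rather than rewriting extraction code.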

Screen-scraping software

Advantages:

  • Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
  • Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, the time it takes to scrape a site is significantly lower than with other methods.
  • Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

  • The learning curve. Each screen-scraping application has its own way of going about things. This may mean learning a new scripting language in addition to familiarizing yourself with how the core application works.
  • A potential cost. Most ready-to-go screen-scraping applications are commercial, so you’ll likely be paying in dollars as well as time for this solution.
  • A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you’re locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you’re using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don’t mind paying a bit, you can save yourself a significant amount of time by using one. If you’re doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you’re probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we’ve been involved with that has actually required a hybrid approach of two of the aforementioned methods: extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term “number of bedrooms” can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we’ve done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it’s handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we’ve written that uses ontologies to extract the individual pieces we’re after. Once the data has been extracted we insert it into a database.
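To illustrate the normalization problem in the classifieds example, here is a hypothetical Python sketch. The patterns below are invented for this example (they are not the project’s actual rules), but they show how several surface forms of “number of bedrooms” can be funneled into one structured field:

```python
import re

# A few of the many ways "number of bedrooms" shows up in raw ad text.
# These patterns are illustrative only; a real system handles far more.
BEDROOM_PATTERNS = [
    re.compile(r'(\d+)\s*(?:bedrooms?|beds?|br\b|bdrm?s?)', re.IGNORECASE),
    re.compile(r'(?:bedrooms?|beds?)\s*[:=]\s*(\d+)', re.IGNORECASE),
]

def extract_bedrooms(ad_text):
    """Return the bedroom count from a raw classified-ad chunk, or None."""
    for pattern in BEDROOM_PATTERNS:
        match = pattern.search(ad_text)
        if match:
            return int(match.group(1))
    return None

ads = [
    "Charming bungalow, 3 bdrm, 2 bath, close to downtown.",
    "FOR RENT: spacious apt. Bedrooms: 2, pets OK.",
    "Lovely 4BR colonial on quiet street.",
]
print([extract_bedrooms(ad) for ad in ads])
```

In the hybrid setup described above, the raw ad chunks would come from the discovery side (screen-scraper), and normalization code along these lines would run on each chunk before the database insert.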

9 Comments »

  1. Geoffrey Daniel said,

    April 20, 2006 at 11:11 pm

    This site, and another one which I get to via a private sign-on, seem to use Java and I can’t even print the pages! I’m seeking a way to (a) capture the data so I can print it and, if possible, (b) capture it and store it in my database.

    Hope you know something about how this can be done!

    Thx – Geoffrey

  2. Todd Wilson said,

    April 21, 2006 at 10:56 am

    Hi Geoffrey,

    Scraping from Java applets is quite a bit different from scraping from HTML pages. I’m actually unaware of solutions that will do this out of the box. It’s tricky, but your best bet may be to use a proxy server to watch/capture the traffic traveling in between the applet and the remote server.

    Best wishes,

    Todd Wilson

  3. L505 said,

    July 9, 2006 at 10:17 am

    Programmers using PHP/Perl tend to use regular expressions – I first started scraping content this way during my first months of programming. But with more experience and a few more years, I quickly decided that a true HTML parser was superior to regular expression hacks.

    Most websites that have data in them usually have HTML tables holding the data. Some sites are DIV based, but most div-based sites still use HTML tables for displaying data-related info. At that point, I decided I needed to build a reusable HTML table parser, in addition to just an HTML parser alone. But my golly, I would never go back to using regular expressions for the main and initial parsing of the website.

    Personally my favorite parser design is to analyze first what you need out of the site, and then design some OnTag or OnContent event procedures, such as OnSearchDesc and OnSearchTitle if you were scraping Google. Or if using object-oriented programming, event methods rather than procedures. Then if you need to extract more detailed info such as how many bedrooms, maybe THEN use regular expressions and AI.

    But extract the major content with an HTML parser first. I say no regexes, please. Those are really, really messy and unmaintainable.

  4. Michael Zimmer said,

    July 28, 2006 at 4:09 am

    Thanks for this nice article. I work for a large e-commerce company and we need to screen-scrape large amounts of pricing and product data from our competitors’ websites. Last year I developed a C# program for the backend (database connection etc.), and for the web part we use a commercial software package called “iMacros” (see URL above). I call this app from my threaded C# program. The advantage of this approach is that iMacros can handle all kinds of tricky dialog boxes, JavaScript, and even Flash/Java applets. iMacros’ “relative extraction” feature is very precise and stable. Also, even if the website(s) change, my staff only needs to change the extraction macros and leave the C# program untouched. We have been using this setup since last December and it works very well for us (actually, much better than the homegrown Perl solution we used before). The only thing I miss in this setup is regex support, which iMacros does not provide 😉
    Mike

  5. varun said,

    October 10, 2006 at 7:47 am

    Awesome article!!

    Compiling product specs is a pain!!

    And I don’t have the time.

    Web scraping saves the day.

  6. cat data-extraction.techniques | more « yaw angle said,

    February 22, 2007 at 11:35 pm

    […] The screen-scrapeable blog, in particular this post […]

  7. stephen said,

    July 5, 2007 at 5:05 am

    Great article! Is there any code on the web that will get me started web scraping prices from a table on the web?

    Thanks

  8. Todd Wilson said,

    July 5, 2007 at 10:19 am

    Hi,

    Our screen-scraper app can handle that type of thing really well. You can even use our Basic Edition, which is completely free.

    Kind regards,

    Todd

  9. Bobby said,

    May 17, 2011 at 4:30 am

    Hi Todd,

    I am looking for a screen scraper tool to help scrape events from other hospitality websites. Could you tell me whether your tool can do this? What is your screen-scraper app’s name & the URL to download the Basic Edition?
