Scraping CAPTCHA forms (you know, those HTML forms with the wavy text)

Alert screen-scraper yipa posted an excellent question to our forum this morning:

One of the pages I want to scrape is behind a login with image verification (i.e., you need to enter some text generated in an image to log in). Is there a way to work around this? Maybe something like SS load the image, display/save it to a location, waits for my input after viewing the image, then moves on? Or are there other ways to handle this?

This can be a pretty tricky situation to deal with, but in almost all cases it should still be doable. I added it to our FAQ, and here’s the explanation for your enlightenment and learning:

I’m trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?

This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working around these:

Oftentimes sites will use a poorly implemented CAPTCHA such that it can be determined up front what the text will read. For example, the site may actually have only four or five images, and it simply cycles through them. By looking at the names of the images one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.
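As a rough sketch of that first approach, a lookup table keyed on the image file name could look something like the Java below. The class name, the image names, and the corresponding texts are all made up for illustration; you would fill in whatever the site in question actually cycles through.

```java
import java.util.Map;

class KnownCaptchas {
    // Hypothetical file names and texts; replace with the handful of
    // images the target site really uses.
    private static final Map<String, String> TEXT_BY_IMAGE = Map.of(
        "captcha1.gif", "x7Kp2",
        "captcha2.gif", "Qm49z",
        "captcha3.gif", "b8TTe"
    );

    // Pulls the bare file name out of an image URL, ignoring any query string.
    static String fileName(String imageUrl) {
        String path = imageUrl;
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Returns the known text for the image, or null if the image is not one we recognize.
    static String lookup(String imageUrl) {
        return TEXT_BY_IMAGE.get(fileName(imageUrl));
    }
}
```

A null result means the image isn't one of the known ones, in which case you'd need to fall back on asking a person.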

Assuming the CAPTCHA mechanism works as it should (i.e., that a human being would have to type in the text shown in the image), it gets a bit trickier to deal with. The best route would probably be to run a scraping session as you normally would, then, once you arrive at the page containing the CAPTCHA, follow these steps:

  1. Download the CAPTCHA image to the local hard drive (e.g., using the session.downloadFile method).
  2. Using a screen-scraper script, pop up a dialog box using Java code that displays the image, and contains a text box that will accept user input. Within a script you have full access to the Java API, so you could pop up something like a custom JDialog containing the image and text box.
  3. Have a person type into the text box the characters displayed in the image.
  4. Accept the text entered by the user, then drop it into a screen-scraper session variable.
  5. Use the value in the session variable to populate the HTML form element.
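For steps 2 through 4, here is a minimal sketch of what the dialog part might look like in plain Java. It uses a standard JOptionPane input dialog rather than a custom JDialog, simply because it's shorter. The class and method names are hypothetical, and the image path is assumed to be wherever you saved the file in step 1.

```java
import javax.swing.ImageIcon;
import javax.swing.JOptionPane;

class CaptchaPrompt {
    // Trims what the user typed; returns null if they cancelled or typed nothing.
    static String clean(String typed) {
        if (typed == null) {
            return null;
        }
        String trimmed = typed.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Shows the downloaded image next to a text box and waits for the user's answer.
    // imagePath is assumed to point at the file saved in step 1.
    static String ask(String imagePath) {
        ImageIcon icon = new ImageIcon(imagePath);
        Object answer = JOptionPane.showInputDialog(
            null,                                      // no parent window
            "Type the characters shown in the image:", // prompt text
            "CAPTCHA",                                 // dialog title
            JOptionPane.QUESTION_MESSAGE,
            icon,                                      // the CAPTCHA image
            null,                                      // null options give a free-text field
            "");                                       // initial text
        return clean((String) answer);
    }
}
```

From a screen-scraper script you would then call something like CaptchaPrompt.ask with the path of the downloaded image and, if the result isn't null, store it in a session variable (step 4) so the form parameter can pick it up (step 5). Note that this needs a machine with a display, so it won't work on a headless server.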

This obviously isn’t ideal, but, unfortunately, there may not be another way. The CAPTCHA images are designed such that they can’t be read by a machine. As such, human intervention is required.

4 thoughts on “Scraping CAPTCHA forms (you know, those HTML forms with the wavy text)”

  1. I wish I had something to give you, but it’s a rare enough circumstance that I don’t have any code I could drum up for you. You might just need to study up a bit on your Java 🙂

  2. Alex,

    Todd and one of our developers recently collaborated on a script that allows the user to input the CAPTCHA text when requested and have screen-scraper then complete what work it needs to do.

    You can download it here.

    -Scott

  3. Just added a sample scraping session that downloads a CAPTCHA image from Google’s recaptcha.com, passes the image to the decaptcher.com service, and receives the response as text.

    Check it out here.
