Combining Scraped Data from Multiple Sites

Posted in Thoughts on 01.30.19 by Todd Wilson

Often data sets become richer when they're combined. A good example of this is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which rates the quality of movies.
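At its simplest, that kind of combination amounts to keying one data set by a shared field (here, the movie title) and looking the other data set up against it. Here's a rough sketch of the idea with made-up titles and ratings, not data from the study:

import java.util.HashMap;
import java.util.Map;

public class CatalogJoin {
    public static void main(String[] args) {
        // Hypothetical quality ratings keyed by movie title
        Map<String, Integer> ratings = new HashMap<>();
        ratings.put("Movie A", 91);
        ratings.put("Movie B", 45);

        // Hypothetical titles scraped from a streaming catalog
        String[] catalog = {"Movie A", "Movie B", "Movie C"};

        // Join the two data sets on the shared title field
        for (String title : catalog) {
            Integer rating = ratings.get(title);
            System.out.println(title + " -> " + (rating == null ? "no rating" : rating + "%"));
        }
    }
}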

Read more »

8 Ways to Handle Scraped Data

Posted in How-To on by Todd Wilson

In general, the hard part of screen-scraping is acquiring the data you're interested in. This means building a bot, using some type of framework or application, to crawl a site and extract specific data points. It may also mean downloading files such as images or PDF documents. Once you have the data you want, you then need to do something with it. I'm going to review the more common techniques for handling extracted data.
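As a taste of what's to come, one of the most common approaches is simply writing each extracted record out to a delimited file. Here's a bare-bones sketch; the record fields and file name are just placeholders:

import java.io.FileWriter;
import java.io.IOException;

public class CsvWriterExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical records extracted by a scrape: title and price
        String[][] records = {
            {"Widget A", "19.99"},
            {"Widget B", "24.50"}
        };

        // Append each record as a line in a CSV file
        try (FileWriter out = new FileWriter("products.csv", true)) {
            for (String[] record : records) {
                // Quote the fields so embedded commas don't break the format
                out.write("\"" + record[0] + "\",\"" + record[1] + "\"\n");
            }
        }
    }
}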

Read more »

Large-Scale Web Scraping

Posted in How-To on 01.17.19 by Todd Wilson

I recently answered a question on Quora about parallel web scraping, and thought I'd flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We've done projects in the past that required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at large scale that may not arise in smaller jobs.
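To give a feel for the parallelism involved, here's a generic sketch (not screen-scraper's own machinery) that uses a fixed-size thread pool to run several scrapes at once while capping how many hit a site at the same time; the URLs and fetch logic are placeholders:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelScrapes {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder list of pages to scrape
        List<String> urls = Arrays.asList(
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3");

        // Cap concurrency so one site isn't hammered by every bot at once
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (String url : urls) {
            pool.submit(() -> {
                // A real bot would fetch and extract here; we just log
                System.out.println(Thread.currentThread().getName() + " scraping " + url);
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}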

Read more »

Complex Forms

Posted in Tips on 03.22.17 by jason

Some sites have pretty complex forms, whether in the sheer number of parameters or in markup that's incomprehensible to humans. In such cases we have a method to get all of the form elements for you.
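If you're curious what enumerating form elements looks like in general, here's a rough sketch using the open-source jsoup library; this is just an illustration, not the method built into screen-scraper, and the URL is a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FormElements {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at the page with the complex form
        Document doc = Jsoup.connect("https://example.com/search").get();

        // Walk every form control and print its name and current value
        for (Element field : doc.select("form input, form select, form textarea")) {
            System.out.println(field.attr("name") + " = " + field.attr("value"));
        }
    }
}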

Read more »

Version 7.0.1a released

Posted in Updates on 04.19.16 by jason

When you update to version 7.0.1a, the first thing you'll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here.

If you want to use this update, here are the instructions for updating.

Screen-scraper 7.0 Released

Posted in Updates on 03.02.16 by jason

This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.

Read more »

Dynamic Content

Posted in Tips on 10.28.15 by jason

One's first experience with a page full of dynamic content can be pretty confusing. Generally you can request the HTML, but it's missing the data you're after.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
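Once you've found that subsequent request in your browser's developer tools, you can often call it directly and skip the JavaScript entirely. Here's a bare-bones sketch with a placeholder endpoint; the real URL and any required headers will vary by site:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchJson {
    public static void main(String[] args) throws Exception {
        // Placeholder for the XHR endpoint the page's JavaScript calls
        URL url = new URL("https://example.com/api/listings?page=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Some endpoints only answer requests that look like the browser's
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");

        // Read the raw JSON response; parse it with your JSON library of choice
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}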

Read more »

HTTPS connection issues

Posted in Updates on 04.29.15 by jason

We've been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include:

  • ssl_error_rx_record_too_long
  • An input/output error occurred while connecting to https:// … The message was peer not authenticated.
  • javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

The issue came about when the Heartbleed vulnerability necessitated changes to HTTPS connections: some of the older protocol versions aren't secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up:

  • Update to use Java 8
  • Update of HTTPClient to 4.4

Both of these are pretty large changes, so they aren't in the stable release yet. In some cases, however, they're the only way to make a scrape work, so here are the instructions to get what you need.

Read more »
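For anyone wiring this up by hand outside of screen-scraper, the gist of the HttpClient 4.4 piece looks roughly like the following: build a client whose socket factory is limited to modern TLS versions. This is my own sketch against the stock HttpClient API, not the exact change made in the product:

import javax.net.ssl.SSLContext;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;

public class ModernTlsClient {
    public static void main(String[] args) throws Exception {
        // Default SSL context from the JVM (Java 8 ships current TLS support)
        SSLContext sslContext = SSLContexts.createDefault();

        // Restrict connections to TLSv1.2 so older, insecure protocols aren't used
        SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
            sslContext,
            new String[] {"TLSv1.2"},
            null,
            SSLConnectionSocketFactory.getDefaultHostnameVerifier());

        CloseableHttpClient client = HttpClients.custom()
            .setSSLSocketFactory(sslsf)
            .build();

        // Placeholder URL; any HTTPS site that requires modern TLS will do
        try (CloseableHttpResponse response = client.execute(
                new HttpGet("https://example.com/"))) {
            System.out.println(response.getStatusLine());
        }
    }
}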

Scraping data from various industries

Posted in Miscellaneous on 06.10.13 by Todd Wilson

We've just added several new scraping sessions that exemplify extracting data from sites in various industries. If you go to our home page and click on one of the buttons corresponding to an industry, you'll be taken to a page where you can download the scraping session. The e-commerce section also has a video to walk you through the process, and we'll be adding videos to the others shortly.

Apache Commons

Posted in Uncategorized on 05.28.13 by jason

We've recently included libraries for Apache Commons Lang. There are a lot of useful things in there, but I get the most use out of StringUtils and WordUtils.

For example, some sites you scrape might return their results in all caps. You could do this:

import org.apache.commons.lang.*;

// Normalize an all-caps name to title case
name = "GEORGE WASHINGTON CARVER";
name = StringUtils.lowerCase(name);  // "george washington carver"
name = WordUtils.capitalize(name);   // "George Washington Carver"
session.log("Name now shows as: " + name);

At the end, the name is formatted as “George Washington Carver”. Almost all of the methods are null-safe, and there are lots of little tools in there to try.

Previous Entries »