Version 7.0.1a released

When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here. If you want to use this update, here are the instructions to update.

Dynamic Content

One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
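To make this concrete, here is a small self-contained sketch. It starts a throwaway local server, so it needs no real site; the paths `/page` and `/data.json`, the page markup, and the JSON payload are all made up for illustration. Requesting the page returns HTML with an empty placeholder, while requesting the URL the script would call returns the actual data.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class DynamicContentDemo {
    // Hypothetical page: an empty placeholder plus a script that loads the data later.
    static final String HTML = "<html><body><div id=\"results\"></div>"
            + "<script>fetch('/data.json').then(r => r.json()).then(show);</script>"
            + "</body></html>";
    // Hypothetical payload returned by the subsequent request.
    static final String JSON = "{\"companies\":[\"Acme Widgets\"]}";

    // Registers a handler that always answers with the given body.
    static void serve(HttpServer server, String path, String contentType, String body) {
        server.createContext(path, exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", contentType);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
    }

    // Plain HTTP GET that returns the response body as a string.
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    // Returns { body of the page request, body of the data request }.
    static String[] fetchBoth() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        serve(server, "/page", "text/html", HTML);
        serve(server, "/data.json", "application/json", JSON);
        server.start();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            return new String[] { get(base + "/page"), get(base + "/data.json") };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        String[] bodies = fetchBoth();
        System.out.println("Page HTML mentions the company: " + bodies[0].contains("Acme Widgets"));
        System.out.println("Data response mentions the company: " + bodies[1].contains("Acme Widgets"));
    }
}
```

On a real site, the URL of the subsequent request is usually found by watching the browser's network traffic while the page loads, then requesting that URL directly.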


HTTPS connection issues

We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include:

  • ssl_error_rx_record_too_long
  • An input/output error occurred while connecting to https:// … The message was peer not authenticated.
  • javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some connection types aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up:

  • Update to use Java 8
  • Update of HTTPClient to 4.4

Both of these are pretty large changes, so they aren’t in the stable release yet. In some cases, however, they are the only option to make a scrape work, so here are the instructions to get what you need.
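Before updating, it can help to check what the Java runtime itself is able to negotiate. The sketch below is only a diagnostic aid, not part of screen-scraper, and the class name `TlsCheck` is made up. It uses the standard `javax.net.ssl` API to list the protocol versions the running JVM supports and the ones it enables by default.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

public class TlsCheck {
    // Protocol versions this JVM is able to speak at all.
    static List<String> supportedProtocols() throws NoSuchAlgorithmException {
        return Arrays.asList(SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
    }

    // Protocol versions this JVM offers without extra configuration.
    static List<String> enabledProtocols() throws NoSuchAlgorithmException {
        return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println("Java version:        " + System.getProperty("java.version"));
        System.out.println("Supported protocols: " + supportedProtocols());
        System.out.println("Enabled by default:  " + enabledProtocols());
    }
}
```

If the server requires a protocol version that is missing from the enabled list, that mismatch is a likely cause of a failed handshake, though the error messages above can have other causes too.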


Apache Commons

We’ve recently included libraries for Apache Commons Lang. There is a large number of useful things in there, but I find the most use for StringUtils and WordUtils. For example, some sites one might scrape have the results in all caps. You could:

    import org.apache.commons.lang.*;

    name = "GEORGE WASHINGTON CARVER";
    name = StringUtils.lowerCase(name);
    name = …
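The excerpt stops before its next step, so what follows is an assumption about where such cleanup usually goes, not the original post's code: once a name is lower-cased, the first letter of each word is typically capitalized again. Commons Lang offers `WordUtils.capitalizeFully(String)` for that. The sketch below does the same thing in plain Java so it runs without the library on the classpath; the class and method names are made up.

```java
import java.util.Locale;

public class NameCase {
    // Lower-cases the input, then upper-cases the first letter of each
    // whitespace-separated word.
    static String titleCase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isWhitespace(c)) {
                startOfWord = true;
                out.append(c);
            } else if (startOfWord) {
                out.append(Character.toUpperCase(c));
                startOfWord = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titleCase("GEORGE WASHINGTON CARVER")); // prints "George Washington Carver"
    }
}
```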

Resume points

Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.

I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
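A minimal sketch of that idea in plain Java follows. It is not the script from the full post: the file name `resume_point.properties`, the class `ResumePoint`, and the choice of a properties file keyed by LOCATION and COMPANY are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class ResumePoint {
    // Loads the saved state, or returns empty Properties when no file exists
    // (meaning the scrape should start from the beginning).
    static Properties load(Path file) throws IOException {
        Properties state = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                state.load(in);
            }
        }
        return state;
    }

    // Records the location and company currently being scraped.
    static void save(Path file, String location, String company) throws IOException {
        Properties state = new Properties();
        state.setProperty("LOCATION", location);
        state.setProperty("COMPANY", company);
        try (OutputStream out = Files.newOutputStream(file)) {
            state.store(out, "last scraping state");
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("resume_point.properties"); // hypothetical file name
        Properties state = load(file);
        String lastLocation = state.getProperty("LOCATION");
        if (lastLocation == null) {
            System.out.println("No saved state; starting from the first location");
        } else {
            System.out.println("Resuming at location " + lastLocation
                    + ", company " + state.getProperty("COMPANY"));
        }
    }
}
```

In this sketch, `save` would be called each time the scrape moves on to a new company, and the file would be deleted once the scrape finishes cleanly so the next run starts fresh.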
