Some sites have pretty complex forms, whether in the sheer number of parameters or in being incomprehensible to humans. In such cases we have a method to get all the form elements for you.
jason
Version 7.0.1a released
When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here. If you want to use this update, here are the instructions to update.
Screen-scraper 7.0 Released
This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.
Dynamic Content
One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.
What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
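When the URL of that follow-up request is visible in the page source, one option is to pull it out of the JavaScript and call the data endpoint directly. Here is a minimal sketch of that idea; the sample HTML, URL, and regular expression are made-up illustrations, not part of screen-scraper’s API:

```java
// Hypothetical sketch: find the URL a page's JavaScript fetches so you can
// request the data endpoint directly. The sample HTML below is made up.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDataUrl {
    // Look for a quoted URL passed to a fetch() or XMLHttpRequest open() call.
    static String extractFetchUrl(String html) {
        Matcher m = Pattern
            .compile("(?:fetch|open)\\(\\s*['\"](?:GET['\"],\\s*['\"])?([^'\"]+)['\"]")
            .matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>fetch('/api/listings?page=1')"
                    + ".then(r => r.json());</script>";
        System.out.println(extractFetchUrl(html));  // /api/listings?page=1
    }
}
```

Once you have that URL, you can scrape the JSON or XML response directly and skip rendering the page at all.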
HTTPS connection issues
We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include
- ssl_error_rx_record_too_long
- An input/output error occurred while connecting to https:// … The message was peer not authenticated.
- javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some older connection types aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up, and they are:
- Update to use Java 8
- Update of HTTPClient to 4.4
Both of these are pretty large changes, so they aren’t in the stable release yet; however, in some cases they are the only way to make a scrape work, so here are the instructions to get what you need.
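The Java 8 requirement is largely about TLS support: many sites now refuse older protocol versions, which is what produces handshake errors like the ones above. A quick way to see which protocols your JVM offers, using plain JSSE (nothing screen-scraper-specific):

```java
// List the SSL/TLS protocol versions the running JVM supports.
// Sites that require TLSv1.2 will fail to handshake if it is absent,
// which is one source of "peer not authenticated" style errors.
import javax.net.ssl.SSLContext;

public class TlsCheck {
    static String[] supportedProtocols() throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        return ctx.getSupportedSSLParameters().getProtocols();
    }

    public static void main(String[] args) throws Exception {
        for (String p : supportedProtocols()) {
            System.out.println(p);
        }
    }
}
```

On Java 8 the list includes TLSv1.2 out of the box; on older JVMs it may not, which is why the upgrade matters.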
Apache Commons
We’ve recently included libraries for Apache Commons Lang. There are a large number of useful things in there, but I find the most use for StringUtils and WordUtils. For example, some sites one might scrape might have the results in all caps. You could:

import org.apache.commons.lang.*;
name = "GEORGE WASHINGTON CARVER";
name = StringUtils.lowerCase(name);
name = … Read more
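The truncated snippet presumably goes on to re-capitalize the lowercased name, which is what commons-lang’s WordUtils.capitalizeFully does. As an illustration of the same transformation, and not the original code, here is a plain-Java sketch that needs no external jar:

```java
// Sketch: title-case an all-caps name without commons-lang.
// The method name and sample value are illustrative, not from the post.
public class NameCase {
    static String capitalizeFully(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char c : s.toLowerCase().toCharArray()) {
            out.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = Character.isWhitespace(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeFully("GEORGE WASHINGTON CARVER"));
        // George Washington Carver
    }
}
```

With commons-lang on the classpath, WordUtils.capitalizeFully("george washington carver") yields the same result in one call.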
Scaling & Optimizing screen-scraper
I get a lot of requests for help configuring and running screen-scraper to scrape at an optimal rate. As is often the case with optimization, it is as much art as science, since the many variables that can affect the speed of a scrape are impossible to catalog. While these steps will help … Read more
Resume points
Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.
I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
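A sketch of what such a state file might look like, using java.util.Properties; the file name, keys, and sample values here are made up for illustration, not the script from the post:

```java
// Hypothetical resume-point state file: save the last LOCATION/COMPANY
// scraped, and load it at scrape start to skip ahead after a crash.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class ResumeState {
    static final String STATE_FILE = "scrape-state.properties";

    // Load the last saved state, or an empty one on a fresh run.
    static Properties load() throws IOException {
        Properties state = new Properties();
        File f = new File(STATE_FILE);
        if (f.exists()) {
            try (FileInputStream in = new FileInputStream(f)) {
                state.load(in);
            }
        }
        return state;
    }

    // Save progress after each COMPANY so a crash loses little work.
    static void save(String location, String company) throws IOException {
        Properties state = new Properties();
        state.setProperty("lastLocation", location);
        state.setProperty("lastCompany", company);
        try (FileOutputStream out = new FileOutputStream(STATE_FILE)) {
            state.store(out, "scrape resume point");
        }
    }

    public static void main(String[] args) throws IOException {
        save("Springfield", "Acme Corp");
        System.out.println(load().getProperty("lastLocation"));
    }
}
```

At the start of the scrape, the script would read this file and skip LOCATIONS and COMPANIES up to the saved point instead of re-scraping them.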
Further thoughts on hindering screen-scraping
We previously listed some means of trying to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting. Any site can be scraped, but some require such an influx of time and resources as to make it prohibitively expensive. Some of the common methods to do so are: Turing tests The … Read more
Techniques for Scraping Large Datasets
Some of the sites we aspire to scrape contain vast amounts of data. In such cases, an attempt to scrape the data may run fine for a time, but eventually stop prematurely with the following message printed to the log: The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap … Read more
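One common way to avoid exhausting the heap, offered here as an assumption about the approach since the excerpt is cut off, is to write each record to disk as it is extracted rather than holding the whole result set in memory:

```java
// Hypothetical sketch: stream scraped records to a file one at a time
// instead of accumulating them in a list, keeping heap use flat no matter
// how large the dataset grows. The file name and rows are made up.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class StreamingWriter {
    public static void main(String[] args) throws IOException {
        try (PrintWriter w = new PrintWriter(
                new BufferedWriter(new FileWriter("results.csv")))) {
            // In a real scrape these rows would arrive one at a time
            // from extractor patterns as each page is processed.
            for (int i = 1; i <= 3; i++) {
                w.println("company-" + i + ",location-" + i);
            }
        }
    }
}
```

Because each row is flushed to disk and then discarded, the memory footprint stays constant regardless of how many records the site holds.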