Some sites have pretty complex forms, whether in the sheer number of parameters or in being incomprehensible to humans. In such cases we have a method to get all the form elements for you.
jason
Version 7.0.1a released
When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here. If you want to use this update, here are the instructions to update.
Screen-scraper 7.0 Released
This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.
Dynamic Content
One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.
What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
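When the URL of that follow-up request is visible in the page source, one option is to pull it out of the JavaScript and call the data endpoint directly. Here is a minimal sketch of that idea; the sample HTML, URL, and regular expression are made-up illustrations, not part of screen-scraper’s API:

```java
// Hypothetical sketch: find the URL a page's JavaScript fetches so you can
// request the data endpoint directly. The sample HTML below is made up.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDataUrl {
    // Look for a quoted URL passed to a fetch() or XMLHttpRequest open() call.
    static String extractFetchUrl(String html) {
        Matcher m = Pattern
            .compile("(?:fetch|open)\\(\\s*['\"](?:GET['\"],\\s*['\"])?([^'\"]+)['\"]")
            .matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>fetch('/api/listings?page=1')"
                    + ".then(r => r.json());</script>";
        System.out.println(extractFetchUrl(html));  // /api/listings?page=1
    }
}
```

Once you have that URL, you can scrape the JSON or XML response directly and skip rendering the page at all.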
HTTPS connection issues
We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include
- ssl_error_rx_record_too_long
- An input/output error occurred while connecting to https:// … The message was peer not authenticated.
- javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some older connection types aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up, and they are:
- Update to use Java 8
- Update of HTTPClient to 4.4
Both of these are pretty large changes, so they aren’t in the stable release yet; however, in some cases they are the only way to make a scrape work, so here are the instructions to get what you need.
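The Java 8 requirement is largely about TLS support: many sites now refuse older protocol versions, which is what produces handshake errors like the ones above. A quick way to see which protocols your JVM offers, using plain JSSE (nothing screen-scraper-specific):

```java
// List the SSL/TLS protocol versions the running JVM supports.
// Sites that require TLSv1.2 will fail to handshake if it is absent,
// which is one source of "peer not authenticated" style errors.
import javax.net.ssl.SSLContext;

public class TlsCheck {
    static String[] supportedProtocols() throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        return ctx.getSupportedSSLParameters().getProtocols();
    }

    public static void main(String[] args) throws Exception {
        for (String p : supportedProtocols()) {
            System.out.println(p);
        }
    }
}
```

On Java 8 the list includes TLSv1.2 out of the box; on older JVMs it may not, which is why the upgrade matters.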
Apache Commons
We’ve recently included libraries for Apache Commons Lang. There are a large number of useful things in there, but I find the most use for StringUtils and WordUtils. For example, some sites one might scrape might have the results in all caps. You could:

import org.apache.commons.lang.*;
name = "GEORGE WASHINGTON CARVER";
name = StringUtils.lowerCase(name);
name = … Read more
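The truncated snippet presumably goes on to re-capitalize the lowercased name, which is what commons-lang’s WordUtils.capitalizeFully does. As an illustration of the same transformation, and not the original code, here is a plain-Java sketch that needs no external jar:

```java
// Sketch: title-case an all-caps name without commons-lang.
// The method name and sample value are illustrative, not from the post.
public class NameCase {
    static String capitalizeFully(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char c : s.toLowerCase().toCharArray()) {
            out.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = Character.isWhitespace(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeFully("GEORGE WASHINGTON CARVER"));
        // George Washington Carver
    }
}
```

With commons-lang on the classpath, WordUtils.capitalizeFully("george washington carver") yields the same result in one call.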
Scaling & Optimizing screen-scraper
I get a lot of requests for help configuring and running screen-scraper to scrape at an optimal rate. As is often the case with optimization, it is as much art as science, since the many variables that can affect the speed of a scrape are impossible to catalog. While these steps will help … Read more
Resume points
Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.
I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
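A sketch of what such a state file might look like, using java.util.Properties; the file name, keys, and sample values here are made up for illustration, not the script from the post:

```java
// Hypothetical resume-point state file: save the last LOCATION/COMPANY
// scraped, and load it at scrape start to skip ahead after a crash.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class ResumeState {
    static final String STATE_FILE = "scrape-state.properties";

    // Load the last saved state, or an empty one on a fresh run.
    static Properties load() throws IOException {
        Properties state = new Properties();
        File f = new File(STATE_FILE);
        if (f.exists()) {
            try (FileInputStream in = new FileInputStream(f)) {
                state.load(in);
            }
        }
        return state;
    }

    // Save progress after each COMPANY so a crash loses little work.
    static void save(String location, String company) throws IOException {
        Properties state = new Properties();
        state.setProperty("lastLocation", location);
        state.setProperty("lastCompany", company);
        try (FileOutputStream out = new FileOutputStream(STATE_FILE)) {
            state.store(out, "scrape resume point");
        }
    }

    public static void main(String[] args) throws IOException {
        save("Springfield", "Acme Corp");
        System.out.println(load().getProperty("lastLocation"));
    }
}
```

At the start of the scrape, the script would read this file and skip LOCATIONS and COMPANIES up to the saved point instead of re-scraping them.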
Further thoughts on hindering screen-scraping
We previously listed some means of trying to stop screen-scraping, but since it is an ongoing topic for us, it bears revisiting. Any site can be scraped, but some require such an influx of time and resources as to make it prohibitively expensive. Some of the common methods to do so are: Turing tests The … Read more
Techniques for Scraping Large Datasets
Some of the sites we aspire to scrape contain vast amounts of data. In such cases, an attempt to scrape the data may run fine for a time, but eventually stop prematurely with the following message printed to the log: The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap … Read more
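One common way to avoid exhausting the heap, offered here as an assumption about the approach since the excerpt is cut off, is to write each record to disk as it is extracted rather than holding the whole result set in memory:

```java
// Hypothetical sketch: stream scraped records to a file one at a time
// instead of accumulating them in a list, keeping heap use flat no matter
// how large the dataset grows. The file name and rows are made up.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class StreamingWriter {
    public static void main(String[] args) throws IOException {
        try (PrintWriter w = new PrintWriter(
                new BufferedWriter(new FileWriter("results.csv")))) {
            // In a real scrape these rows would arrive one at a time
            // from extractor patterns as each page is processed.
            for (int i = 1; i <= 3; i++) {
                w.println("company-" + i + ",location-" + i);
            }
        }
    }
}
```

Because each row is flushed to disk and then discarded, the memory footprint stays constant regardless of how many records the site holds.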