Posted in Thoughts on 01.30.19 by Todd Wilson
Data sets often become richer when they’re combined. A good example is a small study done by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which rates the quality of movies.
Posted in How-To on by Todd Wilson
In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.
Posted in How-To on 01.17.19 by Todd Wilson
I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.
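One consideration that tends to come up when many bots target the same website is keeping the number of simultaneous requests per site in check. Here is a minimal sketch of that idea, not the approach from the Quora answer: a shared thread pool runs all the jobs, and a per-host semaphore caps concurrency against any one site. The class name, the `MAX_PER_HOST` value, and the `scrape` stand-in (which only sleeps) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelScrapeSketch {
    // Hypothetical cap on simultaneous jobs against any single host.
    static final int MAX_PER_HOST = 2;
    static final Map<String, Semaphore> hostLimits = new ConcurrentHashMap<>();
    static final Map<String, AtomicInteger> running = new ConcurrentHashMap<>();
    // Highest number of jobs observed running at once against one host.
    static final AtomicInteger peakPerHost = new AtomicInteger();

    // Stand-in for a real scrape: waits for a per-host permit, sleeps, returns a label.
    static String scrape(String host, int page) throws InterruptedException {
        Semaphore limit = hostLimits.computeIfAbsent(host, h -> new Semaphore(MAX_PER_HOST));
        AtomicInteger active = running.computeIfAbsent(host, h -> new AtomicInteger());
        limit.acquire();
        try {
            peakPerHost.accumulateAndGet(active.incrementAndGet(), Math::max);
            Thread.sleep(20); // a real bot would make its HTTP requests here
            return host + "/page" + page;
        } finally {
            active.decrementAndGet();
            limit.release();
        }
    }

    // Submits one job per (host, page) pair to a fixed pool and collects the results.
    static List<String> runAll(List<String> hosts, int pagesPerHost, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String host : hosts) {
                for (int p = 1; p <= pagesPerHost; p++) {
                    final int page = p;
                    futures.add(pool.submit(() -> scrape(host, page)));
                }
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> results = runAll(Arrays.asList("a.example", "b.example"), 5, 8);
        System.out.println(results.size() + " pages scraped, peak per host: " + peakPerHost.get());
    }
}
```

The pool size controls overall parallelism, while the semaphore keeps any one site from being hit by every thread at once.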
Posted in Tips on 03.22.17 by jason
Some sites have pretty complex forms, either because of the sheer number of parameters or because the parameters are incomprehensible to humans. In such cases we have a method to get all the form elements for you.
Posted in Updates on 04.19.16 by jason
When you update to version 7.0.1a, the first thing you’ll notice is the spruced-up GUI, but there is quite a bit going on under the hood too. You can see all the release notes here.
If you want to use this update, here are the instructions to update.
Posted in Updates on 03.02.16 by jason
This new stable version adds many new features, and gives you the ability to scrape sites that are using the latest SSL features.
Posted in Tips on 10.28.15 by jason
One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.
What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
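Once you have spotted that subsequent request (for example, in your browser's network tools), you can often request it directly instead of the page itself. The following is a rough sketch using Java 11's `java.net.http` client rather than screen-scraper's own API. The class name and the endpoint URL are made up for illustration, and the `X-Requested-With` header is only an assumption about what a given site might expect; many sites don't need it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DynamicContentSketch {
    // Builds a GET request for the data endpoint the page's JavaScript would call.
    static HttpRequest buildDataRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json")
                // Some sites check for this header on script-driven requests (assumption).
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw body (JSON, HTML, XML, or whatever the site sends).
    static String fetch(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; substitute the URL observed in the browser's network tools.
        HttpRequest request = buildDataRequest("https://example.com/api/items?page=1");
        System.out.println(request.method() + " " + request.uri());
        // String body = fetch(request); // left commented out so the sketch makes no network calls
    }
}
```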
Posted in Updates on 04.29.15 by jason
We’ve been seeing lots of issues with scrapes connecting to HTTPS sites. Some of the errors include:
- ssl_error_rx_record_too_long
- An input/output error occurred while connecting to https:// … The message was peer not authenticated.
- javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
The issue came about when the Heartbleed vulnerability necessitated changes to some HTTPS connections: some types aren’t secure anymore, and new versions have come out. Screen-scraper needed two changes to catch up:
- Update to use Java 8
- Update of HTTPClient to 4.4
Both of these are pretty large changes, so they aren’t in the stable release yet. In some cases, however, they are the only option to make a scrape work, so here are the instructions to get what you need.
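If you want to see what a given Java installation can negotiate before and after updating, a quick diagnostic like the one below may help. It is only a sketch using the standard `javax.net.ssl` API, not part of screen-scraper, and the class name is made up. Keep in mind that errors like "peer not authenticated" can also come from certificate trust problems, which this does not check.

```java
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

public class TlsSupportCheck {
    // Protocol versions this JVM is able to speak at all.
    static List<String> supportedProtocols() throws Exception {
        return Arrays.asList(SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
    }

    // Protocol versions this JVM offers by default when opening a connection.
    static List<String> enabledByDefault() throws Exception {
        return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Java version:       " + System.getProperty("java.version"));
        System.out.println("Supported:          " + supportedProtocols());
        System.out.println("Enabled by default: " + enabledByDefault());
    }
}
```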
Posted in Miscellaneous on 06.10.13 by Todd Wilson
We’ve just added several new scraping sessions that exemplify extracting data from sites in various industries. If you go to our home page and click on one of the buttons corresponding to an industry you’ll be taken to a page where you can download the scraping session. The e-commerce section also has a video to walk you through the process, and we’ll be adding videos to the others shortly.
Posted in Uncategorized on 05.28.13 by jason
We’ve recently included libraries for Apache Commons Lang. There are a large number of useful things in there, but I find the most use for StringUtils and WordUtils.
For example, a site you scrape might return its results in all caps. You could:
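import org.apache.commons.lang.*;

// Sample value as it might arrive from a site, in all caps
name = "GEORGE WASHINGTON CARVER";
// Lowercase everything, then capitalize the first letter of each word
name = StringUtils.lowerCase(name);
name = WordUtils.capitalize(name);
session.log("Name now shows as: " + name);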
At the end, the name is formatted as “George Washington Carver”. Almost all of the methods are null-safe, and there are a lot of little tools in there to try.