Todd Wilson, Author at screen-scrapeable

How to Extract Text from PDFs and Images

December 12, 2019 by Todd Wilson

Overview: A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job gets harder when the text is held in embedded images. We recently experimented with the Google Cloud Vision API to handle the task, and have had …

Version 7.0.14a of screen-scraper Released

October 28, 2019 by Todd Wilson

Just released a new alpha version of screen-scraper. Here are the changes: Bug fix to the data manager awaitCompletionOfPendingWrites method that could cause it to block permanently. Addition of new HTTP callback event fire times. Fixed a data manager issue when building schemas with some newer MySQL drivers. Added sutil.makeGETRequestRequired(String) that issues a request even if the …

Screen-Scraping vs. API

March 11, 2019 by Todd Wilson

On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases it’s going to be simpler to acquire data from a site by accessing an API as opposed to crawling. In theory, there should be no need to scrape data from a site if an API to the content is already made available. That said, there are a number of reasons why it still may make sense to scrape a site that also provides an API.

Combining Scraped Data from Multiple Sites

Updated March 11, 2019; published January 30, 2019, by Todd Wilson

Data sets often become richer when they’re combined. A good example is a small study by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. This was determined by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which ranks the quality of movies.
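As a rough illustration of that kind of combination, here is a minimal Python sketch that joins a scraped catalog to scraped ratings by title and counts the well-rated titles per service. The function names, the title-normalization rule, the threshold, and the sample rows are all assumptions made for this example; they are not the study's method or data.

```python
from collections import defaultdict


def normalize(title):
    """Lowercase a title and drop everything but letters and digits,
    so the same movie matches across sites that format titles differently."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def count_quality_titles(catalog, ratings, threshold):
    """Count, per service, the titles whose rating meets the threshold.

    catalog: iterable of (service, title) pairs from the streaming sites.
    ratings: iterable of (title, score) pairs from the ratings site.
    Titles with no matching rating are skipped.
    """
    score_by_title = {normalize(title): score for title, score in ratings}
    counts = defaultdict(int)
    for service, title in catalog:
        score = score_by_title.get(normalize(title))
        if score is not None and score >= threshold:
            counts[service] += 1
    return dict(counts)


# Placeholder rows for illustration only (hypothetical, not real data).
catalog = [
    ("Service A", "Example Movie One"),
    ("Service A", "Example Movie Two"),
    ("Service B", "example movie one!"),
]
ratings = [("Example Movie One", 92), ("Example Movie Two", 40)]

print(count_quality_titles(catalog, ratings, threshold=75))
```

Matching on a normalized title is the simplest possible join key. Real catalogs would likely also need the release year or another identifier to tell apart different movies that share a title.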

8 Ways to Handle Scraped Data

Updated February 1, 2019; published January 30, 2019, by Todd Wilson

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.

Large-Scale Web Scraping

January 17, 2019 by Todd Wilson

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.
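One of those considerations is keeping many parallel bots from overloading a single site. The sketch below is a minimal Python illustration, not how screen-scraper does it: a thread pool runs the scrapers in parallel, and a per-host semaphore caps how many of them may hit the same website at once. The names `HostLimiter`, `scrape_all`, and `fake_fetch`, the default limits, and the example URLs are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class HostLimiter:
    """Hands out one semaphore per host so that only a fixed number
    of workers can be talking to the same website at any moment."""

    def __init__(self, per_host):
        self._per_host = per_host
        self._lock = threading.Lock()
        self._semaphores = {}

    def for_url(self, url):
        host = urlparse(url).netloc
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.Semaphore(self._per_host)
            return self._semaphores[host]


def scrape_all(urls, fetch, max_workers=8, per_host=2):
    """Run fetch(url) for every URL in parallel and return {url: result}."""
    limiter = HostLimiter(per_host)

    def task(url):
        # Wait here if too many workers are already on this host.
        with limiter.for_url(url):
            return url, fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, urls))


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns the URL length."""
    return len(url)


urls = [f"https://site-a.example/page/{i}" for i in range(5)]
urls += [f"https://site-b.example/page/{i}" for i in range(5)]
results = scrape_all(urls, fake_fetch)
print(len(results))
```

A real deployment would swap `fake_fetch` for an actual request function and would probably add delays, retries, and proxy rotation on top of the concurrency cap.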

Scraping data from various industries

June 10, 2013 by Todd Wilson

We’ve just added several new scraping sessions that exemplify extracting data from sites in various industries. If you go to our home page and click on one of the buttons corresponding to an industry you’ll be taken to a page where you can download the scraping session. The e-commerce section also has a video to …

End-of-year sale!

November 29, 2012 by Todd Wilson

This is our biggest sale in quite a while. Until December 31, 2012 take 40% off Professional Edition licenses and 60% off Enterprise Edition licenses. Click here to take advantage.

Version 6.0.18a of screen-scraper Released

October 16, 2012 by Todd Wilson

A few minor updates in this one, along with a long-awaited global find feature!

Let Us Help You Learn screen-scraper

July 19, 2012 by Todd Wilson

We are pleased to announce our new coaching program. To help them get started, new users can receive up to two free hours of one-on-one coaching (click here for details). Existing users can receive help planning a project, solving that one tough issue, learning new techniques, and refining current scraping projects. Purchase hours of training …