8 Ways to Handle Scraped Data

Posted in How-To on 01.30.19 by Todd Wilson

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.

Read more »

Large-Scale Web Scraping

Posted in How-To on 01.17.19 by Todd Wilson

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog posting. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that have required hundreds of bots all running at once against hundreds of websites, with many of them targeting the same website. There are special considerations that come into play when extracting information at a large scale that you may not need to consider when doing smaller jobs.

Read more »