How to Extract Text from PDFs and Images

A surprising amount of valuable data is locked away in PDF files. There are various methods for extracting data from them, but the job is harder when the text sits inside embedded images. We recently experimented with Google Cloud Vision API to handle the task, and have had …

8 Ways to Handle Scraped Data

In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot, using some type of framework or application, to crawl a site and extract specific data points. It may also mean downloading files such as images or PDF documents. Once you have the data you want, you need to do something with it. I’m going to review the more common techniques for handling extracted data.

Large-Scale Web Scraping

I recently answered a question on Quora about parallel web scraping, and thought I’d flesh it out more in a blog post. Scraping sites on a large scale means running many bots/scrapers in parallel against one or more websites. We’ve done projects in the past that required hundreds of bots running at once against hundreds of websites, with many of them targeting the same website. Extracting information at a large scale brings special considerations that you may not need to think about on smaller jobs.
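
One of those considerations is keeping many bots from hammering the same website at once. Here is a minimal sketch of that idea in Python, assuming an asyncio-based crawler. The names `crawl`, `fake_fetch`, and `per_host_limit` are made up for illustration, and `fake_fetch` only simulates a request, so nothing here touches the network. It is not how our own bots are built, just one way to cap concurrency per host.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse


async def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def crawl(urls, per_host_limit=2, fetch=fake_fetch):
    # One semaphore per host caps how many bots hit the same site at once.
    limits = defaultdict(lambda: asyncio.Semaphore(per_host_limit))
    active = defaultdict(int)  # requests currently in flight, per host
    peak = defaultdict(int)    # highest in-flight count seen, per host

    async def worker(url):
        host = urlparse(url).netloc
        async with limits[host]:
            active[host] += 1
            peak[host] = max(peak[host], active[host])
            try:
                return await fetch(url)
            finally:
                active[host] -= 1

    # gather() runs all workers concurrently and returns results in input order.
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, dict(peak)


urls = [f"https://example.com/page/{i}" for i in range(6)]
urls += [f"https://example.org/item/{i}" for i in range(3)]
results, peak = asyncio.run(crawl(urls))
```

All nine URLs are scheduled at once, but no more than `per_host_limit` requests are ever in flight against a single host. In a real crawler you would swap `fake_fetch` for an actual HTTP client call and probably add delays and retries on top.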

