In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.
How to Extract Text from PDFs and Images
Overview A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job is made more difficult when they contain embedded images that hold the text. We recently experimented with Google Cloud Vision API to handle the task, and have had … Read moreHow to Extract Text from PDFs and Images