On occasion we’re asked to acquire data from a site that already offers an API. In almost all cases, accessing the API is simpler than crawling the site. In theory, there should be no need to scrape a site whose content is already available through an API. That said, there are a number of reasons why it may still make sense to scrape a site that also provides an API.
Overview: A surprising amount of valuable data is locked away in PDF files. There are a variety of methods for extracting data from them, but the job is made more difficult when they contain embedded images that hold the text. We recently experimented with Google Cloud Vision API to handle the task, and have had … Read more: How to Extract Text from PDFs and Images