Combining Scraped Data from Multiple Sites

Posted in Thoughts on 01/30/19 by Todd Wilson

Data sets often become richer when they’re combined. A good example is a small study by Streaming Observer on the quality of movies available from the big streaming services: Amazon, Netflix, Hulu, and HBO. The study concluded that, even though Amazon has by far the most movies, Netflix has more quality movies than the other three combined. It reached this conclusion by combining data about the movies available from each streaming service with data from Rotten Tomatoes, which rates the quality of movies.
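To make the idea of combining data sets concrete, here is a minimal sketch in Python of how scraped movie listings might be joined with scraped ratings. It is not how Streaming Observer did their study; the function names, the record layout, the title-matching rule, and the quality threshold of 75 are all assumptions made for illustration, and the records are placeholders rather than real data.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace so that
    near-identical spellings from different sites produce the same key."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def join_listings_with_ratings(listings, ratings):
    """Attach a score to each listing by matching on the normalized title.
    Listings with no matching rating get a score of None."""
    score_by_key = {normalize_title(r["title"]): r["score"] for r in ratings}
    return [
        {**item, "score": score_by_key.get(normalize_title(item["title"]))}
        for item in listings
    ]


def count_quality_titles(joined, threshold):
    """Count, per service, the listings whose score meets the threshold."""
    counts = defaultdict(int)
    for row in joined:
        if row["score"] is not None and row["score"] >= threshold:
            counts[row["service"]] += 1
    return dict(counts)


# Placeholder records, invented for illustration only (not scraped data).
listings = [
    {"service": "Service A", "title": "Example Movie: One"},
    {"service": "Service A", "title": "Example Movie Two"},
    {"service": "Service B", "title": "example movie one"},
    {"service": "Service B", "title": "Unrated Movie"},
]
ratings = [
    {"title": "Example Movie One", "score": 92},
    {"title": "Example Movie Two", "score": 40},
]

joined = join_listings_with_ratings(listings, ratings)
quality_counts = count_quality_titles(joined, threshold=75)
print(quality_counts)
```

Matching on a normalized title is the simplest possible join key. In practice, titles alone are ambiguous (remakes, for example), so a real project would likely also match on release year or another identifier.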

Can you guess how they acquired the data for their study? The streaming services don’t supply any sort of API to publish their movie listings. Rotten Tomatoes likewise doesn’t provide automated access to their system. The only way to acquire the data is through web scraping. I also know this because of a project that we happened to do for one of those big streaming services a while back. They wanted a competitive analysis comparing their offering with those of the other streaming services. Ironically, they had us scrape data not only from their competitors’ sites but also from their own. It’s sometimes easier to scrape data from your own website than to jump through all of the bureaucratic hoops required to get it internally.
