In general, the hard part of screen-scraping is acquiring the data you’re interested in. This means building a bot using some type of framework or application to crawl a site, and extracting specific data points. It may also mean downloading files such as images or PDF documents. Once you’ve gotten to the point that you have the data you want, you then need to do something with it. I’m going to review the more common techniques for handling extracted data.
1. Saving to a File
The most common way to handle scraped data is to simply save it to a file, such as a CSV. Much of the time data is structured in two dimensions, which makes saving it to a spreadsheet-like structure the most logical choice. Fortunately, almost any programming language allows you to write to the file system, and many have libraries designed to write to CSV files. You can often pass some type of map object, which can be automatically saved as a row in a CSV file. A simple example of this would be saving product data from an ecommerce site like Amazon. Your crawler may extract product information such as the title, description, and price, then write all of that data out row by row to a CSV file.
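As a rough illustration of the map-to-row idea, here is a minimal sketch using Python's standard csv module. The field names and the append_products helper are hypothetical, chosen only to mirror the product example above.

```python
import csv
from pathlib import Path

# Hypothetical column names, matching the product example in the text.
FIELDS = ["title", "description", "price"]


def append_products(path, products):
    """Append product dicts to a CSV file, writing the header only once."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module manage line endings itself.
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        # Each dict becomes one row; quoting of commas is handled for us.
        writer.writerows(products)
```

A crawler could call append_products("products.csv", [record]) each time it finishes extracting a product, so that a crash partway through a run loses very little data.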
If the data structure is more complex it may make sense to use something like JSON or XML. These file formats allow for data that is hierarchical to be written out to a file, while still preserving the structure. An example might be product data where you need to save the various available options (e.g., color or size). A node in your JSON file may correspond to a given product, but within that node you can have any number of sub-nodes to save information like the color, quantity, or size options corresponding to the product. Here again, most languages are going to provide a way to write to the file, with many providing libraries to handle JSON or XML within code.
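To make the nested case concrete, here is a small sketch using Python's standard json module. The product structure and the save_products and load_products helpers are made up for illustration.

```python
import json

# Hypothetical product with nested options; the keys are illustrative.
product = {
    "title": "T-Shirt",
    "price": "14.99",
    "options": [
        {"color": "red", "size": "M", "quantity": 12},
        {"color": "blue", "size": "L", "quantity": 3},
    ],
}


def save_products(path, products):
    """Write a list of nested product dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2, ensure_ascii=False)


def load_products(path):
    """Read the products back with their hierarchy intact."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The options list would be awkward to flatten into CSV columns, but it survives a round trip through JSON unchanged.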
2. Saving to a Database
For most projects our preference is to save to a relational database. Aside from allowing data of arbitrary structure to be saved, a database also has other advantages, such as handling multiple simultaneous writes and preserving data for subsequent runs.
For example, let’s suppose you’re extracting car data from a classified ad website such as Craigslist. You might have more than one bot running at a time, so writing out to a file (or even multiple files) could be problematic. You also might want to run your scraper every day, but only get new postings. When your crawler encounters a classified ad it could do a lookup in the database to see if the ad was already saved. If it was, the crawler could skip over it and move on to the next one. Most sites utilize some type of unique identifier that allows you to track which records you already have.
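The lookup-then-skip idea might be sketched as follows, using SQLite from Python's standard library as a stand-in for whatever database you prefer. The table layout, the column names, and the helper functions are all assumptions made for the example.

```python
import sqlite3


def open_ads_db(path=":memory:"):
    """Open the database and create a hypothetical ads table if needed."""
    conn = sqlite3.connect(path)
    # site_id holds the unique identifier the classified site assigns to an ad.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ads ("
        "site_id TEXT PRIMARY KEY, title TEXT, price TEXT)"
    )
    return conn


def already_saved(conn, site_id):
    """Return True if an ad with this identifier is already stored."""
    row = conn.execute(
        "SELECT 1 FROM ads WHERE site_id = ?", (site_id,)
    ).fetchone()
    return row is not None


def save_ad(conn, ad):
    """Store an ad; OR IGNORE makes a duplicate insert a harmless no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO ads (site_id, title, price) VALUES (?, ?, ?)",
        (ad["site_id"], ad["title"], ad["price"]),
    )
    conn.commit()
```

A crawler would call already_saved before fetching an ad's detail page and skip the request entirely when it returns True. The primary key acts as a safety net if two bots reach the same ad at once.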
Many languages provide libraries that allow data to be saved to a database using some sort of abstraction layer. That is, your code doesn’t need to know much about the database itself, but simply by following certain conventions your data can be automatically saved. We’ve developed our DataManager for this purpose. It integrates directly with our application, and is almost like magic in how simple it makes the process.
Once you’ve completed your crawls and your data is securely stored in your database, you then have a number of options as to what to do with it. You may want to display the data on a web page, which the database would allow. You might also need to export it for a client, which you could do by querying the database, then writing the data out as flat CSV or JSON files.
3. Pushing Data to an API
It’s common for our clients to want to handle the data we scrape themselves. That is, we just crawl the sites, hand the data over to them, then let them decide what to do with it. In cases where we don’t need to keep track of what was scraped historically we can simply push records to an external system in real time via an API. The API often takes the form of a REST interface, and we often send chunks of JSON to transport the data.
The primary advantage to this approach is that we don’t need to store any of the data ourselves; we just push it out as we get it. The advantage to the client is that they get the data as quickly as we acquire it, and they have full control over what happens with it.
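Pushing a record to a REST endpoint as JSON could look something like the sketch below, which uses only Python's standard library. The URL, the bearer-token scheme, and the function names are placeholders; a real client's API would dictate its own endpoint and authentication.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
API_URL = "https://api.example.com/records"


def build_request(record, url=API_URL, token=None):
    """Build a POST request carrying one scraped record as a JSON body."""
    body = json.dumps(record).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumed auth scheme; many APIs use something different.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def push_record(record, **kwargs):
    """Send the record and return the HTTP status code."""
    request = build_request(record, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the wire.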
There are a number of third-party systems that allow for this type of approach in an elegant way. We’ve used Amazon’s Simple Notification Service (SNS), for example, to push data to the cloud. It’s simple to implement, and provides a high-performance interface on scalable infrastructure.
4. Submitting Data to a Form
At times data extracted from one site becomes input for another. That is, you might screen-scrape data from a site, then submit it to another by filling in and posting a web form. As an example, a while back we did a project where data needed to be migrated from one support forum web application to another. There were a large number of posts a company had in an internal forum-like application, and they were transitioning to a different platform. They wanted the functionality of the new system, but didn’t want to lose all of their old posts. As such, we migrated each of the posts from the old system one by one, and submitted them via a web form to the new system. In the end we copied over tens of thousands of individual posts.
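Here is a hedged sketch of what a single form submission might look like with Python's standard library. It is not the code from the migration project. The form URL and field names are invented, and a real form will usually also require a login session and possibly hidden fields read from the page.

```python
import urllib.parse
import urllib.request

# Hypothetical form action URL; in practice, read it from the form's HTML.
FORM_URL = "https://forum.example.com/posts/new"


def build_form_request(post, url=FORM_URL):
    """Encode one scraped post the way a browser would encode form fields."""
    fields = {
        "subject": post["subject"],
        "body": post["body"],
        "author": post["author"],
    }
    data = urllib.parse.urlencode(fields).encode("utf-8")
    # When a request with data is sent, urllib defaults the Content-Type
    # header to application/x-www-form-urlencoded.
    return urllib.request.Request(url, data=data, method="POST")


def submit_post(post, **kwargs):
    """Submit the form and return the HTTP status code."""
    request = build_form_request(post, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

For a migration of tens of thousands of posts, you would likely wrap submit_post in a loop that records which posts succeeded, so that a failed run can resume where it left off.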
5. Importing Data into a Web Store
Another common application is to download product data (usually from a wholesaler site) then import it into some type of ecommerce platform like Shopify or Magento. It’s surprising how few wholesalers provide an API that allows merchants to copy product data into their ecommerce sites. Handling these types of situations usually involves screen-scraping a select number of products or categories from a site, then generating a flat file that can be imported into the ecommerce platform. Oftentimes images need to be brought over as well. In the end this saves merchants all kinds of time that they would otherwise spend manually copying over data such as product titles and descriptions.
6. Handling Downloadable Files
Oftentimes we’re not just dealing with text. We frequently need to save images, PDF files, Excel files, or other binary types. There are a few ways we’ve handled this in the past.
The simplest way to handle files is to download them to the file system. The only trick is that you need to come up with some type of good naming convention in order to correlate the files with the text-based data you’re storing elsewhere. For example, if you’re saving images from an ecommerce site, you might first save a product record to the database, obtain the auto-generated ID of the record, then name each image using that auto-generated ID (e.g., 1234_0.jpg, 1234_1.jpg, 1234_2.jpg).
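The naming convention described above can be sketched in a few lines of Python. The table, the example URLs, and the image_filenames helper are hypothetical, and SQLite stands in for your database of choice.

```python
import sqlite3
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_filenames(record_id, image_urls):
    """Name each image <record id>_<index><extension>, e.g. 1234_0.jpg."""
    names = []
    for index, url in enumerate(image_urls):
        # Take the extension from the URL path, ignoring any query string.
        ext = PurePosixPath(urlparse(url).path).suffix.lower() or ".bin"
        names.append(f"{record_id}_{index}{ext}")
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"
)
# Save the product first so the database assigns its ID.
cur = conn.execute("INSERT INTO products (title) VALUES (?)", ("Widget",))
conn.commit()

names = image_filenames(
    cur.lastrowid,
    [
        "https://shop.example.com/img/a.jpg",
        "https://shop.example.com/img/b.PNG?size=large",
    ],
)
```

Because the record ID is embedded in each filename, you can later find every file belonging to a product without keeping a separate lookup table.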
As an alternative to saving files to the file system, you might also save them to the database in BLOB fields. The advantage to this approach is that it keeps everything neatly organized in the database. It’s a bit more complex to implement, but can be worth it. The one hitch we’ve found with this approach is that exporting data containing large BLOBs can be cumbersome. Also, it’s a good idea to store BLOB values in a separate table from your data. Querying tables containing BLOB values can slow things down quite a bit.
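One way to keep BLOBs in their own table is sketched below with SQLite. The schema and function names are assumptions for the example, and other databases will have their own BLOB types and size limits.

```python
import sqlite3


def open_file_db(path=":memory:"):
    """Create a products table and a separate table for file contents."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT
        );
        CREATE TABLE IF NOT EXISTS product_files (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_id INTEGER NOT NULL REFERENCES products(id),
            filename TEXT,
            content BLOB
        );
        """
    )
    return conn


def save_file(conn, product_id, filename, content):
    """Store raw bytes in the BLOB table, linked to the product row."""
    conn.execute(
        "INSERT INTO product_files (product_id, filename, content) "
        "VALUES (?, ?, ?)",
        (product_id, filename, content),
    )
    conn.commit()


def load_file(conn, product_id, filename):
    """Return the stored bytes, or None if no such file exists."""
    row = conn.execute(
        "SELECT content FROM product_files "
        "WHERE product_id = ? AND filename = ?",
        (product_id, filename),
    ).fetchone()
    return row[0] if row else None
```

Queries against products never touch the file contents, which only get read when load_file asks for them.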
If you’re working in the cloud something like Amazon’s S3 service can also be a good option for handling downloadable files. This avoids filling up the file system, and provides a resilient way to store and track the files. The cloud provider you’ve selected likely has a good API that allows you to save files using your own naming convention. It’s kind of like having a hard drive with virtually unlimited capacity.
7. Uploading Saved Data
Once crawling is complete and data is saved, you may want to push it out in some type of bulk upload. Historically FTP has been a preferred option, though we’ve also worked with protocols like SCP, SFTP, or even services like Dropbox. FTP has been somewhat unreliable for us, and because it sends credentials and data unencrypted, we often encourage clients to consider something more secure, such as SFTP.
8. Displaying Data in Real Time
There are times when data is time-sensitive, and needs to be displayed to the user as it’s acquired. A while back we developed a meta-search engine for airline flights. That is, the user would enter information about their departure and arrival points, then our system would query several different travel sites simultaneously, and display the data as it was extracted. If you’ve used Kayak you know how this works. Because flight fares change so frequently we couldn’t save the data, then display it later; we had to scrape it, then push it out immediately to the user. The details of the implementation are a little more involved than I can cover here, but it involved storing the results temporarily, and making a series of AJAX calls to pull them out so that they could be rendered in the browser.
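The temporary-storage piece can be illustrated with a much simplified sketch. This is not the implementation from that project; it assumes a single-process server where scraper threads add results to an in-memory buffer and each AJAX poll asks for whatever arrived since its last call. The ResultBuffer class and its method names are invented.

```python
import threading


class ResultBuffer:
    """Thread-safe, in-memory holding area for results from several scrapers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = []

    def add(self, result):
        """Called by a scraper thread as soon as it extracts a result."""
        with self._lock:
            self._results.append(result)

    def since(self, cursor):
        """Return results added after position `cursor`, plus the new cursor.

        A polling endpoint would pass the cursor from the previous call,
        so each response contains only results the browser has not seen.
        """
        with self._lock:
            new = self._results[cursor:]
            return new, cursor + len(new)
```

The browser would start with a cursor of 0, render each batch it receives, and send the returned cursor back on the next poll until the server signals that all sites have been queried.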
In Conclusion
One of the most important pieces of advice I can give is to decouple the code you use to save the data from the scraping code itself. If you start out saving to flat files, then later decide to save to a database, this should be a relatively painless process. Also, take advantage of existing libraries that allow you to perform common tasks like these. The odds are good that whatever you plan on doing with your data someone has already written code to do the bulk of the work for you.
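To show what that decoupling might look like, here is a minimal sketch in Python. The RecordSink interface, the two sink classes, and the run_scraper function are hypothetical names; the point is only that the scraping loop never needs to know where records end up.

```python
import csv
import sqlite3
from typing import Iterable, Protocol


class RecordSink(Protocol):
    """Anything that can save one scraped record."""

    def save(self, record: dict) -> None: ...


class CsvSink:
    """Writes records as rows to an already-open text file."""

    def __init__(self, file, fields):
        self._writer = csv.DictWriter(file, fieldnames=fields)
        self._writer.writeheader()

    def save(self, record):
        self._writer.writerow(record)


class SqliteSink:
    """Writes records to a hypothetical products table."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT)")

    def save(self, record):
        self._conn.execute(
            "INSERT INTO products (title, price) VALUES (:title, :price)", record
        )
        self._conn.commit()


def run_scraper(records: Iterable[dict], sink: RecordSink) -> None:
    """The scraping side only ever calls sink.save()."""
    for record in records:
        sink.save(record)
```

Switching from flat files to a database then means passing SqliteSink(conn) instead of CsvSink(f, fields) to run_scraper, with no change to the scraping code.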