screen-scraper FAQ - What is screen scraping software?

A wide variety of software can be used for screen scraping, written in many languages and at many levels of complexity. For most scraping jobs, you’ll find that any of the options will work fine, but on rare occasions you’ll find that several options struggle with a site and only one is a good fit for your job.

Broadly, there are two major categories of screen scraping software. One is browser based, and the other works without a browser.

Browser based scraping software plugs into a browser on your computer, like Chrome or Firefox. Tools such as Selenium and Puppeteer are often used to hook into Firefox or Chrome, open it, and go to the pages you specify so you can save data from them. With some tools, the browser will open, take over your screen, and do its thing. Other flavors use a “headless browser,” which does all the work of a browser without showing it to you. A method like this offers a few pros and cons:

Pros
- Since the page is rendered, it’s easy to see where data is located.
- In some cases, rendering the page in full is the only way to satisfy a page’s JavaScript requirements.
Cons
- Sometimes pop-overs for surveys or ads can come up unexpectedly and disrupt the process.
- These software frameworks generally require more memory and CPU resources.

The alternative is called HTTP based, and this is where Screen-Scraper lives. This software requests the page but does not render the response the way you would see it on the screen. These tools vary in their ability to run JavaScript, and they generally do not make subsequent requests for images on the page, CSS docs, etc. unless specifically told to. HTTP based scraping software is often thought of as the more industrial class of scraping: it makes targeted, precise requests to get only the desired data without any frills, so it has less impact on the server.

Pros
- Generally faster than software that renders the page.
- Less impact on the target site, as it doesn’t request all related files.
- You don’t need to worry about pop-overs.
Cons
- Limited JavaScript support can lead to request errors or difficulties on JavaScript heavy sites.
- It is often necessary to write more programming code.
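
To make the HTTP based approach concrete, here is a minimal sketch using only the Python standard library. It is not how Screen-Scraper itself works; it just illustrates the idea of requesting one page and pulling out the desired data without rendering anything. The `span class="price"` markup, the `example-scraper/0.1` user agent, and the function names are all assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PriceParser(HTMLParser):
    """Collects the text inside every <span class="price"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in an HTML document."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


def fetch(url):
    """Make one targeted request; images, CSS and scripts are never downloaded."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In use, this would look something like `extract_prices(fetch(url))` for a page you are permitted to scrape. Note that the parser only sees the HTML the server sends back, so anything the site builds with JavaScript after the page loads will not be there, which is the limitation listed in the cons above.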

Most of the time you’ll find either category of scraper will work for you, and choosing one is merely the preference of the developer. Nearly every scraper is vulnerable to changes to the site that leave the bot unable to find the things it is expecting, which means the scraper needs updating. If a site is looking for bots accessing it, both kinds can be detected.
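
One common way to soften the impact of site changes, whichever category you choose, is to have the scraper check each record it extracts and fail loudly when expected data is missing, rather than quietly saving blanks. The following is only a sketch of that idea; the field names and the `LayoutChangedError` exception are hypothetical.

```python
class LayoutChangedError(Exception):
    """Raised when a scraped record lacks data the scraper expected to find."""


# Hypothetical field names; use whatever your scraper is supposed to extract.
REQUIRED_FIELDS = ("title", "price")


def validate_record(record, required=REQUIRED_FIELDS):
    """Return the record unchanged, or raise if any required field is empty or absent."""
    missing = [name for name in required if not record.get(name)]
    if missing:
        raise LayoutChangedError("missing fields: " + ", ".join(missing))
    return record
```

Calling `validate_record` on every scraped row means a redesign of the target site shows up as an error on the first affected page instead of as days of empty data.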

There is another category of scraping software that might work for you if you have a less complex set of requirements: scraping done by a browser add-on. A variety of these are available for Firefox and Chrome. For these, you need to navigate to the page that you want information from, and then start the scraper to extract the information from the page. One of the more popular options on Firefox is Web Scraper. You install it, go to a page you want to get info from, and use its tool to pick the HTML elements you want to get. Options like these are lighter weight, and usually inexpensive because the user is still doing a lot of the work. Nevertheless, this could be an ideal solution for some people or projects.

The type of screen-scraper software that is best for your project will depend on a few factors, but the main things to look at are how well the target site copes with JavaScript not being run, and the amount of data you’re getting. A few thousand requests is a paltry sum, but if you’re making hundreds of thousands of requests per day, you’ll need to take that into account.
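
A quick back-of-the-envelope calculation shows why volume matters. The sketch below converts a daily request count into a sustained rate and estimates how long a job takes. The 1.5 seconds per request and the worker counts are illustrative assumptions, not measurements of any particular tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(requests_per_day):
    """Average rate needed to spread a day's requests evenly over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


def hours_to_finish(total_requests, seconds_per_request, workers=1):
    """Wall-clock hours if each worker handles one request at a time."""
    return total_requests * seconds_per_request / workers / 3600


# A few thousand requests a day is well under one request per second.
print(round(requests_per_second(5_000), 3))    # 0.058
# Hundreds of thousands a day means several requests every second, all day.
print(round(requests_per_second(300_000), 2))  # 3.47
# At an assumed 1.5 s per request, one worker cannot finish within a day...
print(hours_to_finish(300_000, 1.5))             # 125.0
# ...while eight parallel workers can.
print(hours_to_finish(300_000, 1.5, workers=8))  # 15.625
```

Under these assumptions, the larger job only fits in a day if requests run in parallel, and that is where the lower memory and CPU cost per request of an HTTP based scraper starts to matter.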