Your first experience with a page full of dynamic content can be pretty confusing. You can usually request the HTML, but the data you are after is missing from it.
What you’re usually seeing is a page whose JavaScript makes a subsequent HTTP request and adds the returned data into the HTML. That subsequent HTTP response is often JSON, but it can be plain HTML, XML, or any number of other formats.
Since screen-scraper doesn’t run any JavaScript, what you need to do is make that request yourself and scrape the response. Here is an example:
- If you go to http://screen-scraper.com/infinite%20scroller/demo.html you can see my sample page. In this case it’s one of those pages that keeps tacking content onto the end forever, like Facebook or Pinterest.
- If you make a scrapeable file of http://screen-scraper.com/infinite%20scroller/demo.html you can get a successful response, but the content text isn’t there.
- Now you need to pull out the screen-scraper proxy and proxy the request. You will see that the one page is making three requests:
  - http://screen-scraper.com/infinite%20scroller/demo.html -> The landing page.
  - http://screen-scraper.com/infinite%20scroller/scroll.js -> A JavaScript file that makes another request for data. On this one I’m just doing a GET request for a static page. Most of the time you will see either GET requests with parameters or POST requests, which return different responses depending on what is sent. Sometimes they change up the base URL, etc. There’s no real standard.
  - http://screen-scraper.com/infinite%20scroller/data.json -> The request that gets the JSON content. Here you can see the format; the JavaScript parses it and writes it into the landing page for you.
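To make the idea of "making that request yourself" concrete outside of screen-scraper, here is a minimal Python sketch using only the standard library. It is an illustration, not how screen-scraper does it internally. The `page` parameter is hypothetical: the demo just does a plain GET for a static file, but many real sites expect something like it as a GET parameter or a POST field. The helper names (`build_get`, `build_post`, `fetch`) are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The data URL that showed up in the proxy session above.
DATA_URL = "http://screen-scraper.com/infinite%20scroller/data.json"


def build_get(url, params=None):
    """Build a GET request, appending query parameters if any are given."""
    if params:
        url = url + "?" + urlencode(params)
    return Request(url, method="GET")


def build_post(url, fields):
    """Build a POST request with form-encoded fields in the body."""
    body = urlencode(fields).encode("utf-8")
    return Request(url, data=body, method="POST")


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# The demo case: a plain GET with no parameters.
plain = build_get(DATA_URL)

# Hypothetical variants you might see on other sites.
paged_get = build_get(DATA_URL, {"page": 2})
paged_post = build_post(DATA_URL, {"page": 2})

# Calling fetch(plain) would actually go out to the network.
```

Whichever form the site uses, the point is the same: copy what the proxy shows the browser sending, and send that.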
Now you have the response, and in this case it’s JSON, which you can either run extractor patterns against or parse directly.
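If you go the parsing route, a rough sketch in Python might look like the following. The sample JSON and the `text` key are invented for illustration, since the real layout of data.json isn't reproduced here; `find_values` is a hypothetical helper that walks whatever structure it is given, so it doesn't depend on knowing the layout ahead of time.

```python
import json


def find_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending in case the key also appears deeper down.
            found.extend(find_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Made-up sample standing in for the real response body.
sample = '{"items": [{"text": "first block"}, {"text": "second block"}]}'

data = json.loads(sample)
texts = find_values(data, "text")
print(texts)  # ['first block', 'second block']
```

Once you have looked at the real response in the proxy, you would swap `text` for whatever key actually holds the content you want.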