Techniques for Scraping Large Datasets

Some of the sites we aspire to scrape contain vast amounts of data. In such cases, a scrape may run fine for a time, but eventually stop prematurely with the following message printed to the log:

The error message was: The application script threw an exception: java.lang.OutOfMemoryError: Java heap space BSF info: null at line: 0 column: columnNo

There can be a variety of causes, but most of the time the culprit is memory use during page iteration. Turning up the memory allocation for screen-scraper may take care of it, but it doesn’t address the root cause.

In a typical site structure, we input search parameters and are presented with a page of results and a link to view subsequent pages. If there are ten to twenty pages of results, it’s easiest to just scrape the “next page” link and run a script, after the pattern is applied, that scrapes the next page. The problem is that this approach is recursive. Once we’ve requested the search results and two subsequent “next pages,” the scrapeable files are still open in memory, like this:

  • Scrapeable file “Search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”
  • Scrapeable file “Next search results” and dataSet “Next page”

Every “Next search results” request opens a new scrapeable file while the previous one is still open. You can run the script on the scripts tab after the file is scraped so the dataSets don’t remain in scope, but the scrapeable files themselves stay in memory. The scrape may get further, but memory still fills up with scrapeable files, and that may not be enough to get all the data.
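
For reference, the recursive version is usually just a short script run after the “Next page” extractor pattern is applied, along the lines of the sketch below (the “HAS_NEXT_PAGE” session variable is my own illustrative name, assumed to be saved by that pattern; it is not something screen-scraper provides):

// Recursive approach: run after the "Next page" pattern is applied
if (session.getVariable("HAS_NEXT_PAGE") != null)
{
    // Clear the flag so a page with no "next" link ends the chain
    session.setVariable("HAS_NEXT_PAGE", null);

    // The current scrapeable file is still open when this call is made,
    // so every page deepens the chain and keeps one more file in memory
    session.scrapeFile("Next search results");
}

Each call starts before the previous one finishes, which is exactly why the scrapeable files pile up as in the list above.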

The solution is to use an iterative approach.

If the site we’re scraping shows the total number of pages, an iterative approach is easy. For my example, I’ll describe a site that has a link for pages 1 through 20, and a “>>” indicator to show there are pages beyond 20.

On the first page of search results, I have three extractor patterns to extract the following information:

  1. Each result listed
  2. All the page numbers shown (an example pattern is sketched just after this list), and
  3. The next batch of results
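
For illustration, the second pattern’s text might look something like the following (the HTML is hypothetical; the ~@PAGE@~ token captures each page number):

<a href="/search?page=~@PAGE@~">~@PAGE@~</a>

One way to keep the whole dataSet around for the paging script shown below is a one-line script run after the pattern is applied:

// Save the page-number dataSet so the paging script can iterate over it
session.setVariable("Pages", dataSet);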

When I get to the search results page, the first extractor pattern runs and drills into the details of each result as usual. The second extractor pattern grabs all the page links listed, so I get a dataSet named “Pages” containing links to pages 2 through 20, and I save that dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:

/*
    Script gets all page numbers from the "Pages" extractor pattern, and iterates through them
*/

// Get the dataSet saved from the "Pages" extractor pattern
pages = session.getVariable("Pages");

// Clear the session variable so it doesn't linger
session.setVariable("Pages", null);

// Loop through the pages
for (i=0; i<pages.getNumDataRecords(); i++)
{
    // Get the page number from the current dataRecord
    page = Integer.parseInt(pages.getDataRecord(i).get("PAGE"));

    // Since the page list appears twice, only request a number larger than the one just used
    if (page>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}

The “for” loop will have the first page of search results in memory, but when it calls the “Next search results” scrapeable file to go to page 2, that file only gets the results and doesn’t try to look for a next page. The loop closes out the second page before it starts the third, closes the third before starting the fourth, and so on.
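
For this to work, the “Next search results” scrapeable file simply embeds the current value of the PAGE session variable in its URL or in a parameter on its parameters tab, along the lines of this hypothetical URL (the domain and parameter names are illustrative only):

http://www.example.com/search?q=~#QUERY#~&page=~#PAGE#~

Because the loop sets PAGE just before each call to session.scrapeFile, every request goes out with the page number that iteration is responsible for.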

The last extractor on “Search results” looks for “>>”. I save that dataSet as a session variable named “Next batch pages,” and put this as the last script to run on the scripts tab:

import com.screenscraper.common.*;

/*
    Script that checks whether there is a next batch of pages
*/

if (session.getVariable("Next batch pages")!=null)
{
    // Get the dataSet saved from the ">>" extractor pattern, and clear it so it doesn't linger
    pageSet = session.getVariable("Next batch pages");
    session.setVariable("Next batch pages", null);

    // The first page of the next batch
    dataRecord = pageSet.getDataRecord(0);
    page = Integer.parseInt(dataRecord.get("PAGE"));

    if (page>session.getVariable("PAGE"))
    {
        session.setVariable("PAGE", page);
        session.log("+++Scraping page #" + page);
        session.scrapeFile("Next batch search results");
    }
    else
    {
        session.log("+++Already have page #" + page + " so not scraping");
    }
}

Now the “Next batch search results” scrapeable file must do all the things the first page of search results did: get each result, look for next-page links, and look for a next batch of results. Using the iterative approach to cycle through pages lets you request many more pages without keeping as many in memory, and without unnecessary pages in memory, the scrape will run far longer.

24 thoughts on “Techniques for Scraping Large Datasets”

  1. It’s really an interesting approach. But what happens when you have a large amount of data, but to access each page you need to pass a “key” generated on the preceding page? Things like PeopleSoft, where you have to pass a code from the search page to the details page for it to load, and if you try to go from one details page to another details page you just get a “key error.” Is there any way you can handle that?

  2. I have scraped some PeopleSoft sites and seen things like that. You also run into issues like that on some .NET pages, which like to fling viewstates around like flapjacks in a country diner.

    You’ll just need one more extractor on each page where the key appears; you can name your token as you please, but I would use something like “KEY” and set it as a session variable. Then on the next page request you just replace the value in that parameter with the ~#KEY#~ token. Since you’re scraping the value on every page, the session variable will always hold the most recently found value, and thus it should work.
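
    For example (the HTML and parameter name here are hypothetical; adjust them to whatever the site actually emits), the extractor pattern on each page might be:

    <input type="hidden" name="key" value="~@KEY@~">

    with “KEY” saved as a session variable, and the scrapeable file for the next page would then carry a parameter such as key=~#KEY#~ on its parameters tab.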

  3. Well, I have played around with this for a day now, still with no luck: errors in the script and failure to go to the next page in the set. Is there an .sss download with this integrated so it can be studied better?

  4. I invoke the first script “When I get to the search results page, the first extractor pattern runs and drills into the details of each result as usual. The second extractor pattern grabs all the page links listed, so I get a dataSet named “Pages” containing links to pages 2 through 20, and I save that dataSet as a session variable. On the scripts tab, I then run this script after the file is scraped:”

    …. da da da Script Runs with result: The error message was: class bsh.EvalError (line 9): session .getVariable ( Pages ) — Error in method invocation: Attempt to pass void argument (position 0) to method: getVariable

    Line Nine reads “pages = session.getVariable(”Pages”);”

    On my Search results page I extract “~@PAGE@~”, “~@Pages@~” and “~@Next@~”; all are stored as session variables. My gut is telling me that the session variable requested by “session.getVariable(”Pages”);” is empty.

  5. OK, figured out the above problem: replace the curly quotes with straight quotes (") if you cut and paste the above. BUT I have been presented with another… argh!! An error occurred while processing the script: The error message was: class bsh.ParseException (line 23): if– Encountered “if” at line 23, column 5.

    Line 23 reads: if (i>session.getVariable(“PAGE”))

    I do hope all this pain will help someone else…. If anyone can shed some light on the “New” problem, it would be appreciated..

  6. Stu,

    You seem to be missing a bit. Look at this:

    /*

    Script gets all page numbers from the Pages extractor pattern, and iterates through them

    */

    // Get variable
    pages = Integer.parseInt(session.getVariable("Pages"));

    // Clear session variable so it doesn't linger
    session.setVariable("Pages", null);

    // Loop through pages
    for (i=0; i>pages; i++)
    {
    // Since the page list appears twice, use only a number larger than that just used
    if (i>session.getVariable("PAGE"))
    {
    session.setVariable("PAGE", i);
    session.log("+++Scraping page #" + i);
    session.scrapeFile("Next search results");
    }
    else
    {
    session.log("+++Already have page #" + i + " so not scraping");
    }
    }

  7. I figured out there was something wrong with the “for (i=0;” portion because there was no closing bracket in your example, BUT being a sysadmin and not a programmer makes life a little difficult sometimes. Hopefully the posted information will assist someone else with the head scratching…. Thanks Jason.

  8. Hey guys. I’ve been trying to scrape the garage details from http://www.yell.com. I’ve scraped all the required data and have stored it in the db. The problem is, the scrape works absolutely fantastically for the first five (5) pages, but from page 6 on, it doesn’t scrape. It says:
    The pattern did not find any match.
    I’ve been working on this for almost a month now (I’m new to scraping), and this problem is the one thing I’ve been unable to solve.
    If any of you could shed some light or help me, that would be so sweet of you.
    Waiting eagerly…

  9. Hey Jason. Thanks for the reply. After your msg, I’ve been trying to dig deeper into the problem and finally came to the conclusion that the problem is Yell blocking my IP after exactly 5 pages are scraped. Here is one post that presents almost the same issue I’m facing: http://community.screen-scraper.com/node/2181

    Please help me with this; I’m using Win 7. Ever since I read that post, I’ve been looking for that SSTOR.jar but couldn’t find it. I’ve downloaded the TOR Browser bundle. Guide me from here on.

    Thank you very much

  10. I’m stuck on this. I followed what Jason posted on September 12, 2011, but with no luck. The site I am scraping has EVENTSTATEs, so I have a bit of data to keep in memory.

    Maybe I’m dense, but the “PAGE” reference should be the current page, right?

  11. When it gets to the “Loop through pages” part with “(i=0; i>pages; i++)”, nothing appears to be happening. I can’t get the if/else statement to produce either log message.

  12. Thank you so very much Jason for the SSTOR.jar file.

    I believe the scraping session (the IP change/check) is the next thing to get; I’ve also read in the post that I need Polipo set up on my system. As I mentioned earlier, I’ve downloaded the TOR Bundle and it doesn’t have that Polipo or Vidalia thing in it. I’ve searched all over the net, and the Tor Project now only offers this setup (without Polipo or Vidalia). What should I do now? Where to GO?

    Your help is highly appreciated.

    Hanki

  13. Hi again. I’ve downloaded Polipo. Now what is the next step? When are you going to provide the configuration steps and the script for checking blocked ports and changing them so the scrape can continue from there?

    Having my fingers crossed and hopes high.

  14. Hey Jason. Thanks for the reply. I’ve read this post; it says something about the scraping session that you’ve provided. Kindly do provide me with that session file so I may carry on with the scrape.

    Waiting …

  15. I’m sorry, but I couldn’t find it on the page. Would you please share it here, the way you shared the SSTOR.jar?

    Really appreciate your help and prompt reply.

  16. Hey … Sorry to bug you again, but I just want to remind you that I am still waiting to hear from you guys.

    Regards,
    Hanki
