Dynamic Content

Your first experience with a page full of dynamic content can be pretty confusing. You can usually request the HTML, but the data you’re after is missing from it.

What you’re usually seeing is a page whose JavaScript makes a subsequent HTTP request and inserts the returned data into the HTML. The response to that subsequent request is often JSON, but it can be plain HTML, XML, or any number of other formats.
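As a rough illustration of that pattern, here is a minimal Python sketch. The HTML shell, the JSON body, and the field names (`results`, `name`, `city`) are all made-up stand-ins; on a real site you would find the secondary request in your browser's network tools and fetch that URL yourself.

```python
import json

# Hypothetical stand-ins: the HTML as first requested (no data in it), and
# the JSON body that the page's JavaScript would fetch in a follow-up request.
HTML_SHELL = '<ul id="results"></ul>'
JSON_BODY = (
    '{"results": ['
    '{"name": "Acme", "city": "Provo"}, '
    '{"name": "Globex", "city": "Ogden"}]}'
)


def extract_records(json_text):
    """Pull (name, city) pairs out of the secondary response."""
    payload = json.loads(json_text)
    return [(item["name"], item["city"]) for item in payload["results"]]


# The data lives in the secondary response, not in the initial HTML.
records = extract_records(JSON_BODY)
for name, city in records:
    print(name, city)
```

The point of the sketch is that the scraper targets the secondary response directly rather than trying to find the data in the initial HTML.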


To Anonymize or to Not Anonymize

Lately we find an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running. Keep in mind, the only way to block a web crawler is for a website’s server to refuse connections …

Oh, the possibilities (screen-scraping online video)

Here we go for the second installment. The topic for today is online video. You may be familiar with certain sites that allow you to view your favorite TV episodes or watch a poor squirrel being launched into the woods off of some guy’s deck via a salad strainer and 20 feet of …

Oh, the possibilities (ScrapbookFinds.com)

This is the first installment in what will hopefully become a series. Here at screen-scraper we handle a variety of projects for a myriad of different clients. All of our work is centered around our core software, screen-scraper, but is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, MySQL, along …

To Recurse is Human, to Iterate, Divine

Well, that’s actually not always true. Take a quick look at this blog posting here. The fundamental issue described by that posting is one of recursion vs. iteration. When recursion is used (a page calls a page which calls a page…) objects tend to get stacked up, and subsequently fill up memory. When iteration is …
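To make the distinction concrete, here is a small Python sketch (not screen-scraper code). The `next_page` helper is hypothetical and simply pretends that each page links to the following one until the last page is reached.

```python
def next_page(page, last_page):
    """Hypothetical helper: the page that `page` links to, or None at the end."""
    return page + 1 if page < last_page else None


def scrape_recursive(page, last_page):
    """Each page 'calls' the next, so one stack frame per page stays alive."""
    following = next_page(page, last_page)
    if following is None:
        return 1
    return 1 + scrape_recursive(following, last_page)


def scrape_iterative(page, last_page):
    """A loop visits the same pages while holding only the current one."""
    visited = 0
    while page is not None:
        visited += 1
        page = next_page(page, last_page)
    return visited


print(scrape_recursive(1, 50), scrape_iterative(1, 50))
```

Both functions visit the same pages, but the recursive one keeps every earlier call on the stack until the chain ends. In CPython, a long enough chain of pages makes it raise `RecursionError`, while the loop keeps running.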

Resume points

Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Often there is nothing to do but restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed under each COMPANY.

I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
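The general idea might be sketched in Python like this (the full post has the actual script, which is not reproduced here). The file name `scrape_state.json`, the JSON layout, and the helper names are assumptions made for illustration only.

```python
import json
import os

STATE_FILE = "scrape_state.json"  # hypothetical file name


def load_state(path=STATE_FILE):
    """Return the last saved state, or None if no state file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(location, company, path=STATE_FILE):
    """Record the LOCATION and COMPANY just finished.

    Writes to a temp file first so a crash mid-write cannot corrupt the state.
    """
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"location": location, "company": company}, fh)
    os.replace(tmp, path)


def remaining_locations(all_locations, state):
    """Skip LOCATIONS completed before the saved one; start over if unknown."""
    if state is None or state["location"] not in all_locations:
        return list(all_locations)
    return all_locations[all_locations.index(state["location"]):]
```

At the start of a run you would call `load_state()` and pass the result to `remaining_locations()`; after each COMPANY is scraped, `save_state()` records the new position. Resuming at the saved LOCATION rather than the one after it means some COMPANIES may be scraped twice, which is the "pretty close to" trade-off.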


Exporting & importing scraping sessions in 4.5.42a

We try hard to maintain backward compatibility as much as possible, but unfortunately it can’t always be done. If you recently upgraded to 4.5.42a you may have noticed that scraping sessions that are exported from that version don’t import correctly into an alpha version prior to it. This was a result of the alterations to …