Tips Archives - screen-scrapeable

Complex Forms

January 3, 2019 / March 22, 2017 by jason

Some sites have pretty complex forms, whether in the sheer number of parameters or in being incomprehensible to humans. In such cases we have a method to get all the form elements for you.
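
To illustrate the general idea outside of screen-scraper, here is a minimal Python sketch that gathers every named field from the forms on a page using only the standard library. It is not screen-scraper's own method: the class name `FormFieldCollector`, the helper `collect_forms`, and the sample HTML are all hypothetical, and only `input`, `select`, and `textarea` tags are handled.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs for the fields of every form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form>: action, method, fields
        self._current = None   # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed fields are not submitted, so skip them
                self._current["fields"].append((name, attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def collect_forms(html):
    """Return a list of form descriptions found in the given HTML string."""
    parser = FormFieldCollector()
    parser.feed(html)
    return parser.forms


# Hypothetical page fragment with a hidden parameter a human would never type.
SAMPLE_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="text" name="query">
  <select name="region"><option value="1">North</option></select>
</form>
"""

forms = collect_forms(SAMPLE_HTML)
for form in forms:
    print(form["method"], form["action"], form["fields"])
```

The hidden fields are usually the ones worth capturing this way, since they carry values that are tedious or impossible to copy by hand.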

Dynamic Content

January 3, 2019 / October 28, 2015 by jason

One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.

What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
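
As a rough sketch of what that means in practice, the Python below requests the data endpoint directly instead of the HTML page, then pulls the records out of the JSON. The function names and the `{"results": [...]}` payload shape are assumptions made for illustration; the real endpoint and structure have to be found by watching the requests the page makes.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Request the data endpoint directly, as the page's JavaScript would."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(payload):
    """Pull records out of an assumed {"results": [...]} payload."""
    return [(item["name"], item["price"]) for item in payload.get("results", [])]


# A hypothetical response body, standing in for what fetch_json() would return.
SAMPLE_RESPONSE = (
    '{"results": [{"name": "Widget", "price": 9.5},'
    ' {"name": "Gadget", "price": 12}]}'
)

rows = extract_rows(json.loads(SAMPLE_RESPONSE))
for name, price in rows:
    print(name, price)
```

Once the subsequent request is identified, the original HTML page often does not need to be parsed at all.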

New Quick Guide video

June 16, 2012 / June 15, 2012 by Todd Wilson

We recently released a new Quick Guide video. In less than three minutes you can get an idea of what it’s like to use screen-scraper.

Scraping AMF Sites

November 17, 2011 / November 15, 2011 by Todd Wilson

Most of the time when extracting information from web sites you’ll deal with HTML, which is generally pretty straightforward to deal with. Occasionally, though, content will be delivered via something like a Java applet or Flash movie. Just recently I completed a project that dealt with extracting data from a Flash movie, where the data …

To Anonymize or to Not Anonymize

November 12, 2010 / November 11, 2010 by Todd Wilson

Lately we find an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running. Keep in mind, the only way to block a web crawler is for a website’s server to refuse connections …

Oh, the possibilities (screen-scraping online video)

October 25, 2010 by Todd Wilson

Here we go for the second installment. The topic for today is online video. You may be familiar with certain sites that allow you to view your favorite TV episodes or watch a poor squirrel being launched into the woods off of some guy’s deck via a salad strainer and 20 feet of …

Oh, the possibilities (ScrapbookFinds.com)

October 25, 2010 / October 21, 2010 by Todd Wilson

This is the first installment in what will hopefully become a series. Here at screen-scraper we handle a variety of projects for a myriad of different clients. All of our work is centered around our core software, screen-scraper, but is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, MySQL, along …

To Recurse is Human, to Iterate, Divine

December 16, 2010 / April 15, 2010 by Todd Wilson

Well, that’s actually not always true. Take a quick look at this blog posting here. The fundamental issue described by that posting is one of recursion vs. iteration. When recursion is used (a page calls a page which calls a page…) objects tend to get stacked up, and subsequently fill up memory. When iteration is …
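
The contrast can be sketched in a few lines of Python. This is an analogy rather than screen-scraper code: `next_page` is a hypothetical stand-in for finding a "next page" link, and the two scrape functions only record a string per page. In the recursive version every page's call stays open until the last page finishes; in the iterative version only one page is in flight at a time.

```python
def next_page(page, last_page):
    """Hypothetical stand-in for locating the 'next page' link."""
    return page + 1 if page < last_page else None


def scrape_recursively(page, last_page, results):
    """Each page calls the next one, so calls pile up until the last page."""
    results.append(f"data from page {page}")
    following = next_page(page, last_page)
    if following is not None:
        scrape_recursively(following, last_page, results)
    return results


def scrape_iteratively(first_page, last_page):
    """A loop handles one page at a time and keeps nothing stacked up."""
    results = []
    page = first_page
    while page is not None:
        results.append(f"data from page {page}")
        page = next_page(page, last_page)
    return results


# Fine for a short run of pages either way...
short_run = scrape_recursively(1, 3, [])

# ...but for thousands of pages the loop is the safer choice.
results = scrape_iteratively(1, 5000)
print(len(results))
```

In CPython the recursive version would hit the default recursion limit well before 5,000 pages, which is a blunt illustration of the same stacking-up problem.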

Resume points

April 12, 2010 / April 15, 2010 by jason

Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but to restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.

I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
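
The original script is not reproduced in this excerpt. As a rough illustration of the approach, here is a Python sketch that saves the last completed LOCATION and COMPANY to a file and, on startup, skips ahead to that point if the file exists. The file name, the JSON layout, and the `SITE` dictionary are all assumptions, and a real scrape would fetch pages where this sketch only records names.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or None if no earlier run left a file behind."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_state(path, location, company):
    """Record the last item finished; write to a temp file, then swap it in."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"location": location, "company": company}, handle)
    os.replace(tmp_path, path)


def run_scrape(site, state_path):
    """Scrape every company, skipping anything up to the saved resume point."""
    state = load_state(state_path)
    resuming = state is not None
    scraped = []
    for location, companies in site.items():
        for company in companies:
            if resuming:
                # Skip until we pass the last item the earlier run completed.
                if (location, company) == (state["location"], state["company"]):
                    resuming = False
                continue
            scraped.append((location, company))  # stand-in for real scraping
            save_state(state_path, location, company)
    if os.path.exists(state_path):
        os.remove(state_path)  # clean finish: the next run starts from the top
    return scraped


# Hypothetical site layout: LOCATIONS mapping to their COMPANIES.
SITE = {"Utah": ["Acme", "Globex"], "Idaho": ["Initech"]}

state_file = os.path.join(tempfile.mkdtemp(), "scrape_state.json")
save_state(state_file, "Utah", "Acme")  # pretend an earlier run died here
remaining = run_scrape(SITE, state_file)
print(remaining)
```

Note that if the saved item no longer appears on the site, this sketch skips everything, so a real script would want a fallback for that case.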

Exporting & importing scraping sessions in 4.5.42a

April 6, 2010 by Todd Wilson

We try hard to maintain backward compatibility as much as possible, but unfortunately it can’t always be done. If you recently upgraded to 4.5.42a you may have noticed that scraping sessions that are exported from that version don’t import correctly into an alpha version prior to it. This was a result of the alterations to …