Some sites have pretty complex forms, sometimes because of the sheer number of parameters and sometimes because they are incomprehensible to humans. In such cases we have a method to get all the form elements for you.
Tips
Dynamic Content
One’s first experience with a page full of dynamic content can be pretty confusing. Generally one can request the HTML, but it’s missing the data that is sought.
What you’re usually seeing is a page that contains JavaScript which is making a subsequent HTTP request, and getting the data to add into the HTML. That subsequent HTTP response is often JSON, but can be plain HTML, XML, or myriad other things.
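To make that concrete, here is a minimal Python sketch of going straight to the data request instead of the HTML page. It is not screen-scraper's own API. The endpoint URL you would pass to `fetch_json`, the `results` key, and the `name` and `price` fields are all hypothetical; in practice you would find the real request in your browser's network tools or a proxy log.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, referer=None):
    """Request the data endpoint directly, much as the page's JavaScript would."""
    # Some sites check these headers; whether a given site does is an assumption.
    headers = {"X-Requested-With": "XMLHttpRequest"}
    if referer:
        headers["Referer"] = referer
    with urlopen(Request(url, headers=headers), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(payload, list_key="results"):
    """Pull the wanted fields out of a decoded JSON payload."""
    return [
        {"name": item.get("name"), "price": item.get("price")}
        for item in payload.get(list_key, [])
    ]


# Hypothetical response body, standing in for what the endpoint returns.
sample = json.loads('{"results": [{"name": "Widget", "price": 9.5, "id": 1}]}')
rows = extract_rows(sample)
```

Once you know which request carries the data, the HTML page itself often doesn't need to be parsed at all.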
New Quick Guide video
We recently released a new Quick Guide video. In less than three minutes you can get an idea of what it’s like to use screen-scraper.
Scraping AMF Sites
Most of the time when extracting information from web sites you’ll deal with HTML, which is generally pretty straightforward to deal with. Occasionally, though, content will be delivered via something like a Java applet or Flash movie. Just recently I completed a project that dealt with extracting data from a Flash movie, where the data …
To Anonymize or to Not Anonymize
Lately we find an increasing need to anonymize our scraping sessions. So, as necessity is the mother of invention, we have created and adopted a handful of different approaches to keep our scrapes up and running. Keep in mind, the only way to block a web crawler is for a website’s server to refuse connections …
Oh, the possibilities (screen-scraping online video)
Here we go for the second installment. The topic for today is online video. You may be familiar with certain sites that allow you to view your favorite TV episodes or watch a poor squirrel being launched into the woods off of some guy’s deck via a salad strainer and 20 feet of …
Oh, the possibilities (ScrapbookFinds.com)
This is the first installment in what will hopefully become a series. Here at screen-scraper we handle a variety of projects for a myriad of different clients. All of our work is centered around our core software, screen-scraper, but is often complemented by third-party software such as PHP, Tomcat, Lucene, Google Web Toolkit, MySQL, along …
To Recurse is Human, to Iterate, Divine
Well, that’s actually not always true. Take a quick look at this blog posting here. The fundamental issue described by that posting is one of recursion vs. iteration. When recursion is used (a page calls a page which calls a page…) objects tend to get stacked up, and subsequently fill up memory. When iteration is …
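As a rough illustration of the difference, here is a small Python sketch (not screen-scraper code) that walks a paginated listing both ways. The `PAGES` table is a made-up stand-in for a site, and both function names are hypothetical.

```python
# Hypothetical stand-in for a paginated site: page number -> (items, has_next).
PAGES = {1: (["a", "b"], True), 2: (["c"], True), 3: (["d"], False)}


def scrape_recursive(page=1):
    """Each page calls the next one; every call waits on the stack until the last page returns."""
    items, has_next = PAGES[page]
    if has_next:
        return items + scrape_recursive(page + 1)
    return list(items)


def scrape_iterative(start=1):
    """A single loop moves from page to page, so nothing piles up between pages."""
    collected, page = [], start
    while True:
        items, has_next = PAGES[page]
        collected.extend(items)
        if not has_next:
            return collected
        page += 1
```

Both versions collect the same items. With three pages the difference is invisible, but with thousands of pages the recursive version keeps one pending call per page alive at once.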
Resume points
Sometimes a long scrape will be stopped mid-run by a system crash, power surge, or bad mojo. Many times there is nothing to do but to restart, but sometimes there is a way to pick up (pretty close to) where you left off. You need to include some extra logic, but it is often worthwhile.
Let’s say we’re looking at a site that lists hundreds of LOCATIONS, and inside each there is a listing of COMPANIES, and the data we’re after is listed in each COMPANY.
I’m going to make a script that runs at the beginning of the scrape to check for a file that contains the last scraping state.
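Here is one way the idea could look, sketched in Python rather than in a screen-scraper script. The JSON state file, the `location` and `company` index fields, and the `run`, `load_state`, and `save_state` helpers are all assumptions made for illustration, and `scrape_company` stands in for whatever actually extracts the data.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"location": 0, "company": 0}


def save_state(path, location, company):
    """Write to a temp file and rename it, so a crash can't leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"location": location, "company": company}, f)
    os.replace(tmp, path)


def run(locations, path, scrape_company):
    """Scrape every company in every location, starting from the saved position."""
    state = load_state(path)
    for li in range(state["location"], len(locations)):
        # Only the location we stopped in resumes part-way through.
        start = state["company"] if li == state["location"] else 0
        for ci in range(start, len(locations[li])):
            scrape_company(locations[li][ci])
            # Record the next company to scrape, not the one just finished.
            save_state(path, li, ci + 1)
    # A finished run removes the file so the next run starts from the top.
    if os.path.exists(path):
        os.remove(path)
```

Saving after every company keeps the resume point close to where the scrape died, at the cost of one small file write per record. If that is too chatty, saving once per LOCATION is a reasonable trade.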
Exporting & importing scraping sessions in 4.5.42a
We try hard to maintain backward compatibility as much as possible, but unfortunately it can’t always be done. If you recently upgraded to 4.5.42a you may have noticed that scraping sessions that are exported from that version don’t import correctly into an alpha version prior to it. This was a result of the alterations to …