Version 5.0.49a of screen-scraper Released

Posted in Updates on 03.24.11 by Todd Wilson

Just a couple of little changes:

  • Enhanced error message when screen-scraper is inhibited by a local firewall.
  • Fixed a link to sub-extractor pattern help.

Smells like we might have a major version release coming out soon…

Version 5.0.48a of screen-scraper Released

Posted in Updates on 03.22.11 by Todd Wilson

Fixes in this update:

  • When exporting an object it will now be selected in the tree.
  • Fixed a bug related to the proxy / scrapeable file comparer.
  • Updated the PHP driver so that it now detects when it can’t connect to the screen-scraper server.
  • The “Runnable” tab in the web interface will now show the most recently run instance of a particular scraping session.

Screen-Scraping for iPhone, Android, Blackberry, and Most Any Other Mobile Device

Posted in Miscellaneous on 03.14.11 by Todd Wilson

The Mobile Problem

The proliferation of mobile devices has created a problem.  Most web sites these days are designed to be viewed on desktop computers with high-resolution monitors and via web browsers that allow for sophisticated interactivity.  Anyone who’s tried to view such sites on mobile devices with small screens can attest to a cramped feeling.  Even the very best mobile web browsers leave you wanting more space.  The advent of mobile apps has helped some in this respect.  Many content providers simply create customized interfaces via apps to make their data usable.  Apps are great, but there still exists a significant portion of information on the Web that isn’t easily accessible on mobile devices.  This is where screen-scraping can often fill the gap.

Ideally content providers, like travel and news web sites, offer either an app or a mobile-friendly version of their web site.  There are a variety of reasons why this may not happen, though, so screen-scraping may be used by third parties to provide alternate interfaces.

The approach you’d take to screen-scrape for mobile devices doesn’t differ too much from any other kind of screen-scraping.  I’ll present a couple of scenarios that will likely be similar to many sites you’d want to scrape.

Scraping Real Estate Data

There are a lot of sites out there that list information related to real estate.  This includes commercial sites like Realtor.com and Zillow, but there are also a staggering number of government and county web sites that contain invaluable real estate data.  If you’re a realtor or home appraiser, it might be helpful to have information related to a specific property while you’re out and about.  To meet this need, a software development group might build an app that provides detailed real estate information on a mobile device.  Let’s use Arizona’s Maricopa County web site as an example.  The site allows you to search for properties via a number of methods, including address and street name.  If you’re a software developer, your app might take a street address as an input parameter, then search for a property at that location.  If you perform such a search on the Maricopa site you might end up with a property like this one.  That page contains all kinds of information about the property, but maybe you’re only interested in a handful of data points:

The parcel number, property description, and most recent valuation information may be the most important parts.  You also wouldn’t want to attempt to display too much of this data on a mobile device because of the limited screen real estate.  The nice thing about screen-scraping is that you can be very precise in what you extract.
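To make the idea of precise extraction a bit more concrete, here is a minimal sketch in Python.  The HTML snippet, the field labels, and the function name are all made up for illustration; they are not taken from the Maricopa site, and a real page would need patterns matched to its actual markup.  The point is only that you pull out the few fields you want and ignore the rest.

```python
import re

# Hypothetical markup standing in for a property detail page.
SAMPLE_HTML = """
<table>
  <tr><th>Parcel #</th><td>123-45-678</td></tr>
  <tr><th>Owner</th><td>Example Owner</td></tr>
  <tr><th>Property Description</th><td>LOT 12 EXAMPLE ESTATES</td></tr>
  <tr><th>Full Cash Value</th><td>$185,000</td></tr>
</table>
"""

# Only these labels are kept; everything else on the page is ignored.
WANTED = {
    "Parcel #": "parcel_number",
    "Property Description": "description",
    "Full Cash Value": "valuation",
}

# Matches one label/value pair per table row in the sample markup.
ROW = re.compile(r"<th>(?P<label>.*?)</th>\s*<td>(?P<value>.*?)</td>", re.S)


def extract_fields(html):
    """Return a dict holding only the wanted data points found in html."""
    result = {}
    for match in ROW.finditer(html):
        key = WANTED.get(match.group("label").strip())
        if key is not None:
            result[key] = match.group("value").strip()
    return result
```

Calling `extract_fields(SAMPLE_HTML)` returns the three wanted fields and skips the owner row, which keeps the payload sent to a small screen short.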

It’s likely that this information won’t change too frequently.  As such, it may make sense to simply extract all records from the web site, deposit desired data points into a database, then scrape again periodically to ensure that the information is current.  Even though it could be a relatively large data set, it may be better to grab it all at once rather than hitting the site in piecemeal fashion as the data is needed.  This would likely mean less of a load on the target web site, and also better performance as you wouldn’t be relying on the web site to return the information to you in real time.  In such a case the best approach would be to get the information into a database, then, when the data is requested from the mobile device, grab it directly out of your database rather than relying on the Maricopa site.  The flow would end up looking something like this:

In other words, the scraping is not done in real time.  You extract the information in a batch process, then deposit it into a database.  Once it’s there, the mobile device can make a request containing a property address to your web server, which then retrieves the corresponding record from your database, then passes it down to the mobile device.  Using either an app or a mobile-friendly web page, you could then display the information on the device in a much more usable format.
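The batch-then-lookup flow described above can be sketched roughly as follows.  This is only an illustration under some assumptions: it uses Python's built-in `sqlite3` module as the database, the table layout and function names are hypothetical, and the scraped records are passed in as plain dictionaries rather than produced by a real scraping run.

```python
import sqlite3


def create_schema(conn):
    """Create the (hypothetical) table that holds scraped property records."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS properties (
               address TEXT PRIMARY KEY,
               parcel_number TEXT,
               description TEXT,
               valuation INTEGER
           )"""
    )


def normalize(address):
    """Upper-case and collapse whitespace so lookups are forgiving."""
    return " ".join(address.upper().split())


def store_batch(conn, records):
    """Batch step: insert or refresh every record from a scraping run."""
    rows = [
        (normalize(r["address"]), r["parcel_number"], r["description"], r["valuation"])
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO properties "
            "(address, parcel_number, description, valuation) VALUES (?, ?, ?, ?)",
            rows,
        )


def lookup(conn, address):
    """Request step: answer a mobile request from the database, not the site."""
    row = conn.execute(
        "SELECT parcel_number, description, valuation "
        "FROM properties WHERE address = ?",
        (normalize(address),),
    ).fetchone()
    if row is None:
        return None
    return {"parcel_number": row[0], "description": row[1], "valuation": row[2]}
```

Because `store_batch` uses `INSERT OR REPLACE`, re-running the batch periodically simply refreshes existing rows, and `lookup` never touches the target web site.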

Scraping Travel Air Fares

Let’s suppose you’re interested in extracting air fares from a travel site like Southwest Airlines.  In contrast to the previous example, air fare information is very volatile, and, as such, couldn’t be scraped in a batch to be accessed later from a database.  That is, the information would need to be scraped in real time, as the user performs a search.  If you perform such a search on the Southwest Airlines site you’ll get a page that looks something like this:

It would be a relatively simple matter to program a screen-scraping application to iterate over each row of search results, extracting out information such as the departure times and the prices.  Because this data would need to be scraped in real time the architecture would look a bit different:

In this case the mobile device sends its request to the web server, which in turn passes a request along to a screen-scraper application, which gets the data from the web site, then sends it back down the line.  We’ve added a little twist to this example, though: depending on how much traffic the service gets, it may be prudent to add multiple screen-scraping applications to help balance load.  In the case of our own screen-scraping software, a given instance can handle multiple requests simultaneously, but the scraping load can be distributed even further across multiple screen-scraper instances, which may be running on different computers.
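One simple way the web server might spread requests across several scraping instances is round-robin dispatch, sketched below in Python.  Everything here is an assumption made for illustration: the class and function names are hypothetical, each "instance" is just a callable, and the fake scraper stands in for whatever call would actually reach a real screen-scraper server.

```python
import itertools
import threading


class RoundRobinDispatcher:
    """Hands each search to the next scraper instance in turn."""

    def __init__(self, scrapers):
        if not scrapers:
            raise ValueError("at least one scraper instance is required")
        self._scrapers = list(scrapers)
        self._cycle = itertools.cycle(range(len(self._scrapers)))
        self._lock = threading.Lock()  # the web server may call from many threads

    def search(self, origin, destination, date):
        last_error = None
        # Try each instance at most once before giving up.
        for _ in range(len(self._scrapers)):
            with self._lock:
                index = next(self._cycle)
            try:
                return self._scrapers[index](origin, destination, date)
            except ConnectionError as exc:
                last_error = exc  # instance unreachable; move on to the next
        raise RuntimeError("all scraper instances failed") from last_error


def make_fake_scraper(name):
    """Hypothetical stand-in for a call to one scraping instance."""
    def scrape(origin, destination, date):
        return {"instance": name, "route": f"{origin}-{destination}",
                "date": date, "fares": []}
    return scrape
```

With two instances registered, consecutive searches alternate between them, and an instance that raises `ConnectionError` is skipped in favor of the next one.  Round-robin is only one choice; a least-busy or queue-based scheme could serve equally well.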

Version 5.0.47a of screen-scraper Released

Posted in Updates on 03.02.11 by Todd Wilson

A number of fixes in this one:

  • Fixed a minor memory leak in the workbench.
  • Fixed a bug related to highlighting data records.
  • Fixed a bug where the scrapeable file view wasn’t updating correctly in some cases.
  • The “Generate scrapeable files in…” menu will now scroll when it contains many items.
  • The term “sutil” will now appear in blue in the script editor.

Version 5.0.46a of screen-scraper Released

Posted in Updates on 02.16.11 by Todd Wilson

Several small fixes in this version:

  • Fixed a bug related to setting the originator edition when exporting.
  • The cursor now returns to normal after attempting to highlight data records for a pattern that doesn’t match.
  • Fixed a bug where data records were not highlighting in the last response the very first time.
  • Fixed an issue where scrollbars weren’t appearing in the proxy/scrapeable file compare window.
  • Now displaying an error message when applying invalid extractor patterns.

Version 5.0.45a of screen-scraper Released

Posted in Updates on 02.15.11 by Todd Wilson

Just a couple of little changes in this one:

  • No longer truncating HTML in the “Last Response” tab.
  • Minor bug fix to the DataManager.

Version 5.0.44a of screen-scraper Released

Posted in Updates on 02.09.11 by Todd Wilson

A few more little fixes:

  • The position of the divider bar on the split pane for proxy sessions is now retained.
  • Numeric columns in tables are now rendered using the default font.
  • Fixed a minor bug related to editing extractor pattern tokens.

Version 5.0.43a of screen-scraper Released

Posted in Updates on 02.08.11 by Todd Wilson

A few fixes in this one:

  • Fixed a bug where the paste sub-extractor pattern was becoming enabled after a sub-extractor pattern had been deleted.
  • Fixed a bug where data record highlighting wouldn’t work correctly with very large HTML pages.
  • Fixed a bug where parameters sent in a multi-part request were causing invalid responses.

Version 5.0.42a of screen-scraper Released

Posted in Updates on 02.03.11 by Todd Wilson

Quite a few little fixes and enhancements in this one:

  • Fixed a bug related to the data set list view not displaying correctly.
  • Fixed an issue where the anonymous proxy pool would not automatically repopulate when proxies were terminated automatically.
  • Fixed an issue in Linux where the extractor pattern panel was a bit too large.
  • Fixed an issue in Linux where the scraping session log panel was a bit too large.
  • Altered how character sets are handled in terms of how specifically set character sets override more global settings.
  • Long parameter values can now be edited in a separate text box.
  • Fixed an issue with extractor pattern token tooltips.
  • Fixed an issue with sub-extractor panels not sequencing after deletion.
  • screen-scraper will now display an error message when an invalid regular expression is entered for an extractor pattern token.
  • Fixed an issue with resizing the proxy transaction compare window.

As always, see the Alpha Log for the full history of changes.

Version 5.0.41a of screen-scraper Released

Posted in Updates on 02.01.11 by Todd Wilson

This one just contains a minor bug fix related to the Java keystore functionality we added earlier.
