Version 5.5.40a of screen-scraper Released

Posted in Updates on 02.22.12 by Todd Wilson

Several minor fixes in this release:

  • Fixed an issue related to renaming scraping sessions.
  • Added a couple of check boxes to wrap text to the proxy panels.
  • Made a fix to ensure consistency in line wrapping the last response text box.
  • Now centering the search result in the proxy.
  • Fixed text related to the edition to be more consistent.
  • Fixed a bug related to stopping scraping when an infinite redirect is encountered.

Version 5.5.38a of screen-scraper Released

Posted in Updates on 02.07.12 by Todd Wilson

Several small fixes in this update:

  • Now determining whether or not to save on individual key strokes.
  • Fixed a bug related to displaying the start page and handling history.
  • Fixed a bug related to deleting multiple items.
  • Fixed a few minor memory leaks.
  • Now stripping internal anchors off of redirect URLs.

Version 5.5.37a of screen-scraper Released

Posted in Updates on 01.26.12 by Todd Wilson

Several small changes in this one:

  • Fixed a bug related to duplicate token editor windows.
  • Added buttons to wrap text and find within the request/response of a proxy transaction.
  • Now using %20 instead of + to represent a space character when encoding GET/POST parameters.
  • Now correctly displaying encoded GET/POST parameters in scrapeable file proxy comparer.
  • Added search term to the top of the proxy search results window.

Also, for any out there keeping track, we’re nearing another full version release.  At this point we’ve pretty much frozen the feature set for a 6.0 release, so we’re now going to be doing a lot of internal testing to ensure we catch any bugs and such before we do a full release.  We’d also be grateful to any who are willing to help us test alpha versions.  Please let us know of any issues you find.

Version 5.5.36a of screen-scraper Released

Posted in Updates on 01.24.12 by Todd Wilson

Several changes in this one:

  • Added the following methods: session.setStopScrapingOnScriptError, session.setStopScrapingOnMaxRequestAttemptsReached, session.setStopScrapingOnExtractorPatternTimeout, scrapeableFile.getMaxRequestAttemptsReached, scrapeableFile.getExtractorPatternTimedOut.
  • Fixed a bug related to prompting for save upon exit.
  • Deprecated proxy scripting.  Can be re-enabled via the AllowProxyScripting property.
  • Fixed a minor memory leak in the workbench.
  • Updated the .NET driver to work with COM-based applications.
  • Added initial support for memory profiling.

The memory profiling stuff is especially cool.  There are times when a developer can inadvertently cause screen-scraper to run short on memory.  We’ve added code to detect these times, then give (hopefully) a good detailed description of what’s chewing up resources so that the problem can be addressed.  It’s likely we’ll refine this one more over time, but even the initial implementation is pretty useful.

Version 5.5.33a of screen-scraper Released

Posted in Updates on 01.04.12 by Todd Wilson

The holiday enhancements have spilled over into 2012:

  • Added “Always at the end” option to force scripts to run at the end of a scraping session, even if it gets stopped prematurely.
  • The prompt to save dialog box only shows on exit when a change has actually been made.
  • Added a keyboard shortcut to the extractor pattern text box such that when text is highlighted and the Control/Command-T key combination is pressed an extractor pattern token will be generated.  This is the equivalent of using the corresponding menu item when the right-click pop-up menu is invoked.
  • Improved error reporting.
  • Added local script variables to the breakpoint frame.
  • When in workbench mode screen-scraper will now breakpoint on a script error.

Version 5.5.32a of screen-scraper Released

Posted in Updates on 12.27.11 by Todd Wilson

Things have cooled down for us a bit over the holidays, so we’ve been able to carve out time for a number of bug fixes and feature enhancements.  Here’s the list:

  • Fixed a threading issue related to the REST interface.
  • Added classes and methods related to decoding images.
  • Fixed a bug related to use of the “Breakpoint” button with RunnableScrapingSessions.
  • Added getStatusMessage, setStatusMessage, and appendStatusMessage to the session object, all of which are synonymous with their corresponding “error” methods (e.g., getStatusMessage = getErrorMessage).
  • In the web UI changed the column “Error Message” to “Status Message”.
  • Added the following methods to the scrapeableFile object: resequenceHTTPParameter( String key, int sequence ), removeHTTPParameter( String key ), addGETHTTPParameter( String key, String value, int sequence ), addGETHTTPParameter( String key, String value ), addPOSTHTTPParameter( String key, String value, int sequence ), addPOSTHTTPParameter( String key, String value )
  • Made a DataManager fix where child rows weren’t getting inserted for duplicate parent rows.
  • Changed default user agent for newly-created scraping sessions to Internet Explorer 8.
  • Now saving in a separate thread so that the GUI won’t get locked up for large objects.

Scraping AMF Sites

Posted in Tips on 11.15.11 by Todd Wilson

Most of the time when extracting information from web sites you’ll deal with HTML, which is generally pretty straightforward to deal with.  Occasionally, though, content will be delivered via something like a Java applet or Flash movie.  Just recently I completed a project that dealt with extracting data from a Flash movie, where the data was delivered from the server via Adobe’s Action Message Format (AMF).  I thought I’d share a bit about my experience here, which will hopefully be useful to others, as well as myself the next time I have to do this 🙂

The main tool you’ll deal with when scraping AMF-based data is Adobe’s Java AMF Client.  It handles most of the heavy lifting for you, though you’ll still need to do a fair amount of coding.  The other tool that is indispensable is Charles proxy, which has a built-in AMF parser.  Without it you’ll be flying blind.

The basic approach you’ll want to take is to proxy the site via Charles with your web browser, pick out the AMF requests that seem relevant, then replicate those in code.  In my case I also had to download PDF files (standard HTTP), so I actually had to run it all in screen-scraper, combining normal screen-scraper stuff with the Java AMF Client stuff.  There was also a login that had to be done outside of AMF.  Anyway, just be aware that you may have to combine both approaches in your own project.

I’m going to be providing some example code below in Interpreted Java (which is just BeanShell) as a screen-scraper script.  You’ll need to do a bit of modification if you want to run this as straight Java.

Digging into the details, here’s how my code looks that sets up the initial AMF stuff:

import flex.messaging.io.ArrayCollection;
import flex.messaging.messages.*;
import flex.messaging.io.amf.client.AMFConnection;
import flex.messaging.io.amf.client.exceptions.ClientStatusException;
import flex.messaging.io.amf.client.exceptions.ServerStatusException;
import flex.messaging.util.UUIDUtils;
import flex.messaging.io.amf.ASObject;

// Create the AMF connection.
AMFConnection amfConnection = new AMFConnection();

// Used for debugging...
//Proxy proxy = new Proxy( Proxy.Type.HTTP, new InetSocketAddress( "localhost", 8888 ) );
//amfConnection.setProxy( proxy );

// Connect to the remote url.
url = "http://www.myamfsite.com/messagebroker/amf";
try
{
    amfConnection.connect(url);
}
catch( ClientStatusException cse )
{
    session.logError( cse );
    return;
}

// Set a few headers we'll want throughout the session.
amfConnection.addHttpRequestHeader( "Content-type", "application/x-amf" );
amfConnection.addHttpRequestHeader( "Referer", "http://www.myamfsite.com/media/MyMovie.swf" );

Here we’re setting up an AMF connection to a server whose AMF end point is found at http://www.myamfsite.com/messagebroker/amf.  The commented-out proxy code allows us to send it all through Charles; that way we can compare the requests our code produces with those we record when browsing the web site via our web browser.  Kind of an apples-to-apples comparison that helps to root out bugs.  If your code doesn’t seem to have the desired effect, compare what’s happening via Charles with the requests from your browser.  Ideally they should match as closely as possible.  I also found that I had to add the two request headers that you’ll find at the end.  The referer may or may not be necessary, but it’s likely that the content-type header is, since the Flash server would normally be expecting requests from a Flash movie, which would probably include that header by default.

Once you’ve done the initialization you can start adding AMF requests to get the data you’re after.  Again, you’ll want to do this by recording the requests from your browser in Charles, then translate those into code.  Here’s a screen-shot of a recorded AMF request from Charles:

And here’s how I translated the request into code:

CommandMessage message1 = new CommandMessage( CommandMessage.CLIENT_PING_OPERATION );
Object[] params1 = new Object[]
{
message1
};
HashMap headers1 = new HashMap();
message1.setHeader( "DSId", "nil" );
message1.setMessageId( UUIDUtils.createUUID() );
Object result1 = amfConnection.call( "null", params1 );
session.log( "Result 1: " + result1 );

Based on the request recorded by Charles, it’s obvious that this should be a CommandMessage.  The PING part of it was a bit trickier.  This is the “operation” portion of the request, which you’ll notice is recorded by Charles only as “5”.  This is where I had to do a bit of sleuthing through the Java AMF Client source code (which is fortunately open source and freely downloadable).  If you’ve downloaded that source code you’ll find the CommandMessage class here in the bundle: modules/core/src/flex/messaging/messages/CommandMessage.java.  Notice also in the request how I set the header “DSId” to be “nil”, which is also evident in what Charles recorded.  Again, we’re trying to get our code to match as closely as possible what was recorded by our web browser.  I gave the request a unique ID, then asked the connection to make the call.
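
If you’d rather not read through the source by hand to match a numeric code like that “5” to a constant name, a small reflection helper can do the lookup for you.  What follows is only a sketch in plain Java: ConstantLookup and the stand-in SampleOps class are hypothetical names made up for illustration, and SampleOps holds placeholder values, not the real CommandMessage constants.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class ConstantLookup {

    // Hypothetical stand-in for a class full of int constants (placeholder values).
    public static class SampleOps {
        public static final int FIRST_OPERATION = 0;
        public static final int SECOND_OPERATION = 5;
        public static final String NOT_AN_INT = "ignored";
    }

    // Returns the names of all public static int fields in cls whose value equals value.
    public static List<String> namesOf(Class<?> cls, int value) {
        List<String> names = new ArrayList<String>();
        for (Field field : cls.getFields()) {
            boolean isStaticInt = Modifier.isStatic(field.getModifiers())
                    && field.getType() == int.class;
            if (!isStaticInt) {
                continue;
            }
            try {
                if (field.getInt(null) == value) {
                    names.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                // Skip any constant we are not allowed to read.
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Look up the placeholder constant whose value is 5.
        System.out.println(namesOf(SampleOps.class, 5));
    }
}
```

With the Java AMF Client jars on the classpath, passing CommandMessage.class and 5 to namesOf should turn up CLIENT_PING_OPERATION, assuming the constants in your copy match what’s described above.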

The next request I needed was a bit different, but not too difficult to recreate from what Charles recorded:

I’ve blurred out the username I used.  Here’s the corresponding code:

// Authenticate the current user.
RemotingMessage message2 = new RemotingMessage();
message2.setOperation( "getUserByUserName" );
Object[] params2 = new Object[]
{
message2
};
String[] body2 = new String[]
{
"myUserName"
};
message2.setBody( body2 );
message2.setDestination( "XYZ" );
message2.setMessageId( UUIDUtils.createUUID() );
Object result2 = amfConnection.call( "null", params2 );
session.log( "Result 2: " + result2 );

Again, you can hopefully see how the pieces in the code correlate to what Charles recorded.

From this point it was simply a matter of adding requests as needed, along with a fair amount of trial and error to ensure that I was matching as closely as possible the original AMF requests.  The only item that tripped me up for a while that’s probably worth mentioning was when Charles recorded the body portion of the request as containing simply an “Object”.  When I did the same in code the server didn’t like it, and it took me a bit before I realized what it actually wanted was an “ASObject”.  So the code I used to create the body looks like this:

Object[] body3 = new Object[]
{
new ASObject()
};

A few last items that might be helpful:

  • The Java AMF Client download contains quite a few dependency files.  You’ll have to figure out exactly which ones of those you truly need.  In my case, in using this within screen-scraper, I ended up only needing two of the jars from the bundle: flex-messaging-common.jar and flex-messaging-core.jar.
  • As it stands the Java AMF Client can’t handle HTTPS at all, let alone HTTPS sites that use an invalid certificate.  I ended up modifying the source for the AMFConnection class in order to add this functionality (in the bundle that class is found here: modules/core/src/flex/messaging/io/amf/client/AMFConnection.java).  You can download a zip file here that contains that modified source file as well as a compiled version of the flex-messaging-core.jar file, which contains that modified class.  If you end up modifying that class further in the bundle you can compile it with a simple “ant core” from the command line.  You need not compile the whole thing.
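
For anyone curious what accepting an invalid certificate generally involves in Java, here is a minimal sketch using only the standard javax.net.ssl classes.  It is not the modification made to AMFConnection (that lives in the zip file mentioned above); LenientSsl and its method names are hypothetical, and how you would wire this into the AMF client is left open.  Since it switches off certificate and hostname checks entirely, it should only be pointed at servers you already trust.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class LenientSsl {

    // Builds an SSLContext whose trust manager accepts any certificate chain.
    public static SSLContext createTrustAllContext() throws GeneralSecurityException {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }
        };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context;
    }

    // Applies the lenient settings to a single connection rather than JVM-wide.
    public static void relax(HttpsURLConnection connection) throws GeneralSecurityException {
        connection.setSSLSocketFactory(createTrustAllContext().getSocketFactory());
        connection.setHostnameVerifier(new HostnameVerifier() {
            public boolean verify(String hostname, SSLSession session) {
                return true; // Accept any hostname; unsafe outside of testing.
            }
        });
    }
}
```

Scoping the change to one connection, as relax does, keeps the rest of the JVM’s HTTPS traffic on the normal, strict validation path.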

Version 5.5.26a of screen-scraper Released

Posted in Updates on 11.08.11 by Todd Wilson

A few fixes in this release:

  • Fixed a bug that was causing the user-agent header to be duplicated.
  • Fixed a bug where a deleted recent script still shows in the script drop-down list.
  • Fixed a bug related to multi-exports.

Version 5.5.25a of screen-scraper Released

Posted in Updates on 10.25.11 by Todd Wilson

Just a few changes:

  • Deprecated caching and filtering data sets (can be re-enabled with EnableCachingAndFilteringDataSets property).
  • Now automatically swapping extractor pattern tokens for embedded variables in certain fields in the workbench (e.g., in the URL field ~@FOO@~ is changed to ~#FOO#~).
  • Added a “Find” button to the “Last Request” tab.

Version 5.5.23a of screen-scraper Released

Posted in Updates on 10.14.11 by Todd Wilson

Get ready, kiddies, this is a big one!  Found myself with some time on my hands, so I got some things done that have needed doing for a while.  Plus I added in a few little goodies that have been rolling around in my head.  Enjoy!

  • Now outputting a message as a warning when an extractor pattern times out.
  • Script pane no longer scrolls to the top when finding text fails.
  • The last error message will now always be retained in the Web UI.
  • Now notifying the user if a scrapeable file is generated from an HTTP transaction that contains a multi-part request, but no file parameters.
  • Changed icon to something friendlier on database backup pop-up.
  • Added session.setUserAgent.
  • Fixed an issue related to resolving relative URLs from extracted data.
  • Fixed an issue related to reordering columns in the workbench.
  • Fixed an issue related to truncated server responses.
  • Fixed the PHP driver to allow carriage returns and line feeds to be passed in the setVariable method.
  • Now initializing the last response view to the top of the page.
  • Now displaying recently accessed scripts first in the script instances drop-down list.
  • Enlarged the scraping session notes field a bit.
  • Added back and forward buttons to the workbench.
