Tidy Time

Lately we’ve been experimenting with different tidiers in the latest alpha versions of screen-scraper.  The tidier is the little utility that cleans up malformed HTML, making extraction easier.  For some time we’ve used a library called JTidy to handle this. It has worked quite well, but it does have a couple of problems.  First, at times it simply fails to tidy the HTML.  If you’ve been using screen-scraper for a while you’ve likely seen a message indicating this in the log.  This isn’t too big of a deal, but it can be a bit of a hassle.  Second, in very rare instances we’ve found that it will omit portions of an HTML page that are especially malformed.  This is definitely a problem and can make debugging difficult.

To address these issues we’ve been trying out two other tidiers: NekoHTML and Jericho.  We’ve already found issues with NekoHTML, so Jericho looks to be the favorite as of right now.  Both will still require some experimentation, though, so please use them at your own risk for now.  Once we’ve put them both through their paces we’ll likely settle on one as the recommended default.  And not to worry about any scrapeable files that are already using JTidy: they’ll stay just as they are.  At some point, though, you might notice a different tidier as the default for any new scrapeable files.
