Microsoft ASP.NET sites have consistently proven to be some of the most difficult to scrape. This is due to their unconventional nature and cryptic information passed between your browser and the server. You’ll know you’re at an ASP.NET site when your URLs end in .aspx, your links look like this:
And your POSTs look like this*:
/wEPDwUJNDczODExNjY1D2QWAgIFD2QWAgIXDzwrAA0CAA8WBB4LXyFEYXRhQm91bmRnHgtfIUl0ZW1Db3VudGZkDBQrAABkGAIFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQdidG5JbmZvBQtndkxpY2Vuc2luZw88KwAJAgYVAQ1saWNfc2VyaWFsX2lkCGZk/kjEfRuqcTBAeylGENOP9dFkERc=
If you’re at all familiar with conventional HTTP transactions, prepare to forgo what you’ve come to expect. Once again, Microsoft manages to defy many standard practices in ways that, in this case, have gone unnoticed by everyone but your tireless browser, which is tasked with making sense of it all. Now, though, it is your job to pick apart what’s going on and to reconstruct the mixed-up conversation your poor browser’s been having with the server. In this blog entry I’ll attempt to cover the more common (and not so common) characteristics of ASP.NET sites and offer techniques for how best you can play the role of your submissive browser to an unforgiving taskmaster.
If you’ve already been down this road before, please post your own stories of things you’ve encountered and how you went about slaying the dragon.
As you begin the process of scraping data from a website, we recommend that you start by using screen-scraper’s proxy to record the HTTP transactions while you navigate the site. You’ll then need to identify which of the proxy transactions should be made into scrapeable files, add extractor patterns to your scrapeable files so their matches can be used as session variables by other scrapeable files**, and tie the whole thing together with scripts that run recursively and in the proper sequence as they traverse the site scraping the data you need.
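screen-scraper’s extractor patterns and session variables handle this for you, but the underlying mechanic is worth seeing in the raw. Here is a minimal Python sketch of the idea: pull a dynamically generated hidden-field value out of the referring page and echo it back in the next request’s POST body. The HTML snippet and values are invented for illustration.

```python
import re
from urllib.parse import urlencode

# Sample of what a referring ASP.NET page's HTML might contain (illustrative).
html = '''
<input type="hidden" name="__VIEWSTATE" value="/wEPDwUJNDczODExNjY1ZGQ=" />
<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgKM54rGBgKM54rGBg==" />
'''

def extract_hidden_field(page, name):
    """Pull a hidden form field's value out of the page -- the moral
    equivalent of an extractor pattern saved to a session variable."""
    match = re.search(r'name="%s"\s+value="([^"]*)"' % re.escape(name), page)
    return match.group(1) if match else None

# "Session variables" extracted from the referring page...
session_vars = {
    "__VIEWSTATE": extract_hidden_field(html, "__VIEWSTATE"),
    "__EVENTVALIDATION": extract_hidden_field(html, "__EVENTVALIDATION"),
}

# ...get passed back to the server in the next request's POST body.
post_body = urlencode(session_vars)
```

The side-by-side comparison described in the second footnote below is how you verify this round trip: the value you extract and resend should match what your browser sent during the proxy session.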
Here are some general rules and recommendations:
- The first rule of screen-scraping: As closely as you can, imitate the requests to the server that your browser makes. Study the raw contents of a successful request from your proxy session while constructing your scrapeable files.
- Run pages in the correct order. ASP.NET sites are very picky about the order in which pages are requested. The server tracks this by checking the Referer header in each request. To ensure you pass the correct referer:
- Run your scrapeable files in the same order as when you navigated the site during your proxy session (repeated for emphasis).
- All of your scrapeable files should have the check box checked under the Properties tab where it says, “This scrapeable file will be invoked manually from a script” and should be called using the scrapeFile method. This way you’re in direct control of when scrapeable files are run.
- Sometimes you’ll need to include a scrapeable file just to maintain the correct page order by passing the right referer. When calling scrapeable files for this purpose, Basic edition users should use the scrapeFile method. Professional and Enterprise users can take a shortcut by implementing the setReferer method within a script, then calling that script in place of an actual scrapeable file.
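Outside of screen-scraper, keeping the Referer header in step with page order amounts to remembering the last URL you fetched and presenting it with the next request. A minimal Python sketch (the URLs are made up):

```python
# Track the previously requested URL so each request can present it
# as its Referer, imitating a browser clicking through pages in order.
last_url = None

def build_headers(url):
    """Assemble request headers for a fetch of `url`, carrying the
    previously fetched page as the Referer."""
    global last_url
    headers = {}
    if last_url is not None:
        headers["Referer"] = last_url
    last_url = url  # this request becomes the next request's referer
    return headers

h1 = build_headers("http://example.com/search.aspx")   # first page: no referer
h2 = build_headers("http://example.com/results.aspx")  # referer = search.aspx
```

This is all that passing the "right referer" boils down to; skipping a page in the sequence leaves the wrong URL in that header, which is exactly what a picky ASP.NET server notices.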
- Prior to calling a scrapeable file, you may need to manually reset certain values when your scraping session rolls back up on itself.***
- For example, say you’re iterating through a list of categories, each of which returns a list of products. For each category you also iterate through the list of products and request a details page for each product. When you complete the first category iteration, screen-scraper will recursively roll back up to the next category. It’s here that you might need to manually set the values for the next category page, since the values from the last details page would still be in memory.
- One helpful approach is to name the extractor patterns for recurring parameters like the VIEWSTATE with something that indicates which page it was extracted from. For example, the VIEWSTATE found on the details page may be named VIEWSTATE_DETAILS, while the VIEWSTATE from the search results would be called VIEWSTATE_SEARCH_RESULTS. Doing so will help you to use the correct session variables when passing the post parameters in the request.
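This naming convention can be mimicked with a dictionary of session variables keyed by source page. A sketch in Python (the HTML snippets and values are invented):

```python
import re

def extract_viewstate(page):
    """Grab the __VIEWSTATE value from a page's HTML."""
    m = re.search(r'name="__VIEWSTATE"\s+value="([^"]*)"', page)
    return m.group(1) if m else None

# Two different pages, each carrying its own distinct VIEWSTATE.
search_results_html = '<input type="hidden" name="__VIEWSTATE" value="AAA=" />'
details_html = '<input type="hidden" name="__VIEWSTATE" value="BBB=" />'

# Name each extracted value after the page it came from, so a later
# request can't accidentally pick up a stale VIEWSTATE left in memory
# by the wrong page.
session_vars = {
    "VIEWSTATE_SEARCH_RESULTS": extract_viewstate(search_results_html),
    "VIEWSTATE_DETAILS": extract_viewstate(details_html),
}
```

With the values kept distinct like this, the roll-back-up scenario in the previous bullet stops being a trap: the next category request reads VIEWSTATE_SEARCH_RESULTS, not the leftover details-page value.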
- POST parameters should NOT be ignored. Nearly all ASP.NET transactions rely on very specific POST data in order to respond as you’d expect.
- Include every POST parameter whether or not it has a value.
- Generally, parameters with cryptic string values must have those values extracted from the referring page and passed as session variables in the request.
- If you need to programmatically add or alter a POST parameter, use the addHTTPParameter method, which allows you to set both the key and the value, as well as control the sequence.
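The sequence matters because the server may expect parameters in the same order the form emits them. Outside screen-scraper, the equivalent of controlling key, value, and sequence is simply building the POST body from an ordered list of pairs. A Python sketch (parameter values are invented, echoing the examples above):

```python
from urllib.parse import urlencode

# Build the POST body as an ordered list of (key, value) pairs so we
# control the sequence explicitly, much as addHTTPParameter's sequence
# argument does in screen-scraper.
params = [
    ("__EVENTTARGET", "gvLicensing"),
    ("__EVENTARGUMENT", "Page$2"),
    ("__VIEWSTATE", "/wEPDwUJNDczODExNjY1ZGQ="),
    ("btnInfo", ""),  # include every parameter, even the empty-valued ones
]

body = urlencode(params)
```

Note that the empty btnInfo parameter still appears in the body; that is the "include every POST parameter whether or not it has a value" rule above, made concrete.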
- Oddities that can keep you up all night:
- Occasionally, two different POST parameters will exchange the same value. This has happened with EVENTTARGET & EVENTARGUMENT. When it does, the next bullet point may also apply.
- Watch for parameters that may be included and/or excluded between pages where you would expect them to always be the same.
- For example, sometimes parameters will show up on page one of a search results list but will not show up for page two. This can continue for additional results pages and may become even more complex. To handle a situation like this you may need to assign the wayward parameters manually using the addHTTPParameter method.
- It’s not just the values that can change. Watch for POST parameter names that may also dynamically change.
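When the names themselves change, you can’t hard-code either half of the pair; you have to extract the parameter name along with its value from the referring page. A Python sketch over an invented form snippet:

```python
import re

# A hidden field whose *name* carries a per-session token (invented example).
html = '<input type="hidden" name="ctl00$token_8f3a" value="xyz" />'

# Capture both the name and the value, so neither is hard-coded into
# the next request.
pairs = dict(re.findall(r'name="([^"]+)"\s+value="([^"]*)"', html))
```

Each name/value pair harvested this way can then be fed into the next request, whatever the server decided to call the parameter this time around.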
* If a page’s VIEWSTATE is too large, screen-scraper can hang when you click on the offending proxy transaction. Wait for a while and it should recover.
**As you’re converting proxy transactions into scrapeable files, a good approach is to replace the values of parameters that look dynamically generated with session variables containing values extracted from the referring page. Then test it, compare side by side the raw request from your proxy session with that of your test run, and repeat until you’ve successfully given the server what it wants so that it gives you back what you want.