If someone says “smart phone” or “GPS” to you, you probably have an instant understanding of what they mean, because those things are familiar and ubiquitous. But if you hear “screen scraping,” “bot,” “crawler,” or “spider,” you are probably less certain. All of these words describe similar, but slightly different, ways for programs to interact with the internet.
A spider, or web spider, is a program that is pointed at a site and given a set of policies, or instructions, describing what to seek there. The spider then follows links automatically, trying to traverse the entire site.
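To make that concrete, here is a minimal sketch of such a spider written with the Scrapy framework. The target domain, the depth limit, and the choice to collect page titles are all illustrative assumptions; a real spider’s policies would describe the specific data you’re after.

```python
# A minimal "follow every link and record something" spider, sketched with Scrapy.
# The domain (example.com) and the data being collected are hypothetical.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteSpider(CrawlSpider):
    name = "site_spider"
    allowed_domains = ["example.com"]      # stay on the one site we were pointed at
    start_urls = ["https://example.com/"]

    # The "policy": follow every internal link and hand each page to parse_page.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Record the URL and page title; a real policy would look for specific data.
        yield {"url": response.url, "title": response.css("title::text").get()}


if __name__ == "__main__":
    # Keep the crawl shallow so the sketch doesn't wander the whole site.
    process = CrawlerProcess(settings={"DEPTH_LIMIT": 2})
    process.crawl(SiteSpider)
    process.start()
```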
This is essentially the technology Google, Bing, and the other search engines use. No one needs to tell the spider how to navigate a particular website; it is unleashed on the site and uses its policies to flip through the pages and index their contents.
Spidering is desirable when you need some data but don’t have the capacity to analyse each website and build a dedicated scraper for it, for example when you have a huge number of sites to target. That is exactly what spidering is meant for, but a spider like this is not without challenges. Isolating the data you seek can be difficult because there is no standardized way to present it. How many ways have you seen people format a North American phone number? Some use dashes, some set the area code off in parentheses, and some like to separate the parts with dots. Then there are the ones with letters. It’s crazy, but it’s also a walk in the park next to the haunted forest of street addresses.
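As a rough illustration of how messy those identification policies get, here is one regular expression that tries to cover a few of the common North American phone-number formats mentioned above. The sample strings are invented, and notice that the letter-based numbers aren’t even attempted.

```python
# One pattern trying to cover dashes, dots, spaces, and a parenthesized area code.
# Letter-style numbers (e.g. 1-800-FLOWERS) would need yet another rule.
import re

PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})\b")

samples = [                      # made-up examples of the formats people use
    "555-867-5309",
    "(555) 867-5309",
    "555.867.5309",
    "Call 555 867 5309 today",
]

for text in samples:
    match = PHONE_RE.search(text)
    if match:
        # Normalize to one format so downstream code sees consistent data.
        print("-".join(match.groups()))
```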
Many websites require you to fill in a form to request data. Content hidden behind these forms is often called the “deep web” because most spiders and search engines never reach it. Again, there is no standard way to fill in a form, so there is no way to write a generic spider policy that deals with one.
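A short sketch shows why: submitting a form means knowing that particular site’s field names in advance. Everything here, the URL and the field names “last_name” and “city”, is hypothetical, which is precisely the point; a generic policy can’t guess them.

```python
# Submitting one specific (hypothetical) search form with the requests library.
import requests

response = requests.post(
    "https://example.com/search",                    # made-up form endpoint
    data={"last_name": "Smith", "city": "Toronto"},  # field names are site-specific
    timeout=10,
)
print(response.status_code)
print(response.text[:500])  # first part of the results page hidden behind the form
```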
People often see spidering as an easier alternative to screen scraping, which has to be tailored to a target site. Sometimes it is, but it’s not simple. The policies that identify your data need to be robust, and they will likely never catch 100% of the data you seek. Even Google is constantly updating and tweaking its policies for identifying information. If you need to query a large number of sites, and it’s not a big deal if you miss some of the data on the first run, then spidering might be the solution you seek.