Helping people with computers... one answer at a time.
Spiders that scan web sites are an important part of today's internet. Sometimes, though, spiders can cause problems for web site owners.
The Access database in my ASP-based web site is overloaded because spiders are crawling my site all day. Do you think it is a good idea to create another source specifically for the spiders and only send "real people" to my live database-based site? Or do you think this will cause problems? Do you have any suggestions?
I can imagine some readers looking at this and going "spiders? There are spiders on the internet?" Indeed there are. And they are something that many website owners need to be aware of and deal with appropriately.
You see, spiders are generally a good thing.
But displaying different content for them? Well, that's a bad thing. A really bad thing.
Spiders "walk the web". They're programs that automatically visit web sites, look at all the links on the web site, and then go visit all the pages and other websites that those links point to. Repeat that process indefinitely, and with almost every page on the web linking to some other page on the web, a spider that just follows links should be able to visit or access almost everything that's on the internet. At least in theory.
Spiders are a good thing because the big search engines like Google, Yahoo and others will use a spider (actually many spiders) to examine web pages for inclusion in their massive search indexes. Their spiders will also come back and visit web pages "every so often", so as to keep their index up to date with any changes you've made since the last time the spider visited.
There are two "problems" with spiders:
There are a lot of them. Probably thousands of spiders all attempting to visit every website, and often repeatedly. Every search engine, every custom search engine, a bunch of academic projects, and who knows what else may have its own spider attempting to visit your site. The load can add up.
Sometimes they misbehave. Since it's a computer program, a spider could ask for pages faster than your web server can deliver them, but a "well behaved" spider won't. Sometimes a reputable spider like Yahoo's or Google's will get confused, and sometimes spiders simply aren't well behaved. In those cases, a spider can bring a site to its knees.
The problem of load caused by spiders is exacerbated, of course, if your website is designed poorly, runs on a low-performing server, or has insufficient bandwidth.
In the original question you indicated that you're using Microsoft Access as your database. I dearly love Access for many things, but being the database behind a web site isn't one of them. It's not something Access was designed for, and would be one of the first places I'd look for performance related issues under moderate to heavy load. More appropriate technologies include Microsoft SQL Server, MySQL or others.
I do want to be clear that presenting one set of content to the spiders and another to "real" users is a very, very bad idea if you want to rank on the search engines. For one thing, you run the risk of providing the wrong content when people click on a search engine result, which is a bad experience for the users. Even worse, though, most search engines explicitly prohibit this behavior. If you present one set of content to real users and something different to the spiders, you run a very real risk of being banned from the search engine results entirely.
So once you've cleared up your site performance issues, what are your options?
The first is something called "robots.txt". This is a text file that you place in the root of your website that instructs the spiders as to what they may, and may not, do. Using robots.txt you can tell specific spiders what parts of your site they are allowed to scan. That means you can also tell a specific spider not to scan your site at all. If you don't care about search engine rankings, you can even tell all spiders not to scan your site.
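To make that concrete, here's a minimal sketch of what a robots.txt might look like. The spider name "ExampleBot" is made up for illustration; you'd substitute the user-agent string of the actual spider you want to restrict:

```text
# Allow all well-behaved spiders, but keep them out of one directory
User-agent: *
Disallow: /private/

# Tell one specific (hypothetical) spider to stay away entirely
User-agent: ExampleBot
Disallow: /
```

The file must be named exactly robots.txt and live at the root of your site (for example, http://www.example.com/robots.txt) for spiders to find it.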
The downside to robots.txt is that it relies on spiders being "well behaved". Spiders have to choose to follow the instructions that you place in robots.txt, and most do. Certainly legitimate spiders do. But what about the rest? What about the ones that ignore what you've said in robots.txt completely?
Your only real recourse, that I'm aware of, is to block them at the IP level. That means first identifying the offending spider by examining your server access logs. Then, determine the IP address or IP address range that the spider may access your site from. Lastly, use some technique on your web server to block that IP address or address range from accessing your site. The exact technique varies, but on Apache web servers, for example, it's often as easy as a simple entry in your .htaccess configuration file.
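As a sketch of that last step, here's what such an entry in an Apache .htaccess file might look like. The IP addresses below are reserved documentation addresses, not real spiders; you'd replace them with the addresses you found in your access logs:

```text
# .htaccess: block a misbehaving spider at the IP level
# (the addresses shown are placeholders for illustration)
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.0/24
```

With this in place, Apache returns a "403 Forbidden" to requests from those addresses instead of serving your pages, taking the load off your database entirely.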
Finally, I want to mention that spiders aren't the only way your site can get overloaded. Of course, you could just be very popular, and I hope that's a problem you'll have to face some day soon :-). However, spammers have also entered the picture. Spammers have started to use tools to automatically fill in any form they might find on your site, in the hopes that what they post will somehow get published. Spammers are also on the lookout for vulnerabilities in CGI scripts that they can then hijack to use as an email spam-sending relay. If you find your server is overloaded, be sure to check out exactly what is causing the problem so you can take the right action.
If you have a question, start by using the search box up at the top of the page - there's a very good chance that your question has already been answered on Ask Leo!
If you don't find your answer, head out to http://askleo.com/ask to ask your question.