07-27-2001, 06:30 PM   #4
mcsebraindumps
Registered User
 
Join Date: Feb 2001
Location: St Petersburg, FL
Posts: 75

Web ********, Offline Explorer, WebZIP, and some of the others on your list are actually spiders. In my opinion they are more destructive than anyone blocking ads on your site, because they spider your entire site, which inflates your page views and gets you banned from programs. I can't use search boxes on my site for this reason; I've been booted from two places. I would definitely put Teleport Pro on your list, as it's the most common spider.

I've gone a step further than you and had an entire collection of scripts written to handle this problem. I don't use htaccess because my list would be too long. Fortunately I have a dedicated server with root access, so I use ipchains, which completely bans an IP. One script parses the log file looking for those agents and bans them on sight. Another script bans any IP accessing more than 40 files (not including .gif and .jpg) in 60 seconds. I also have a hidden tag in my site pointing to /cgi-bin/mustdie.pl, which bans any IP that hits it. I think I catch about 99% of unwanted spiders. My scripts also let me list certain domains that should never be banned, such as search engines. Last but not least, my script unbans each IP after 24 hours.

After about a month of doing this I think I finally have the problem under control. Anyway, here's my list of agents to ban:

Teleport
Offline Explorer
DISCO Pump
WebZIP
HTTrack
MSIECrawler
FlashGet
libwww
Web********
WebCopier
ia_archiver
WebCapture
Downloader
GetRight
Fetch
NetAnts
SuperBot
Wget
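
For anyone who wants to try the log-scanning step, here is a rough Python sketch of the idea, not the actual script described above. It assumes the access log is in Apache's "combined" format, treats Teleport and Offline Explorer as separate entries, and leaves out the censored entry because its real name isn't shown. The names `BANNED_AGENTS`, `LINE_RE`, and `banned_ips` are made up for the example.

```python
import re

# Agent substrings taken from the list above. The censored entry is
# omitted because its real name cannot be recovered from the post.
BANNED_AGENTS = [
    "Teleport", "Offline Explorer", "DISCO Pump", "WebZIP", "HTTrack",
    "MSIECrawler", "FlashGet", "libwww", "WebCopier", "ia_archiver",
    "WebCapture", "Downloader", "GetRight", "Fetch", "NetAnts",
    "SuperBot", "Wget",
]

# One case-insensitive alternation; re.escape keeps names literal.
AGENT_RE = re.compile("|".join(re.escape(a) for a in BANNED_AGENTS),
                      re.IGNORECASE)

# Assumed layout (Apache "combined"):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \S+ \S+ "[^"]*" "([^"]*)"'
)

def banned_ips(log_lines):
    """Return the set of client IPs whose User-Agent matches the list."""
    hits = set()
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and AGENT_RE.search(m.group(2)):
            hits.add(m.group(1))
    return hits
```

You would feed it the lines of your access log and hand the resulting IPs to whatever does the actual banning (ipchains in my case). Keep in mind that short entries like Fetch and Downloader match as substrings, so they can catch more agents than you intend.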

If you have large files on your site I suggest allowing GetRight and NetAnts since they can also be used solely as download agents. The problem is, they can also be used to grab your entire site.
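
The rate limit, the trap URL, and the 24-hour unban can be sketched the same way. Again this is only an illustration in Python under my own assumptions, not the scripts themselves: the `BanTracker` class and its method names are invented, timestamps are plain seconds, and the whitelist is keyed by IP for simplicity even though the real scripts work with domains. Only the numbers (40 files, 60 seconds, 24 hours, skipping .gif and .jpg) come from the description above.

```python
from collections import defaultdict, deque

MAX_HITS = 40                   # files allowed per window
WINDOW = 60                     # window length in seconds
BAN_SECONDS = 24 * 3600         # bans are lifted after 24 hours
IGNORED_EXT = (".gif", ".jpg")  # image requests are not counted

class BanTracker:
    def __init__(self, whitelist=()):
        self.whitelist = set(whitelist)  # IPs that are never banned
        self.hits = defaultdict(deque)   # ip -> recent request times
        self.banned = {}                 # ip -> time the ban started

    def ban(self, ip, now):
        """Ban immediately, e.g. when the hidden trap URL is requested."""
        if ip in self.whitelist:
            return False
        self.banned[ip] = now
        self.hits.pop(ip, None)
        return True

    def record(self, ip, path, now):
        """Count one request; return True if it triggers a new ban."""
        if ip in self.whitelist or ip in self.banned:
            return False
        if path.lower().endswith(IGNORED_EXT):
            return False
        q = self.hits[ip]
        q.append(now)
        # Drop requests that have fallen out of the 60-second window.
        while q and q[0] <= now - WINDOW:
            q.popleft()
        if len(q) > MAX_HITS:
            return self.ban(ip, now)
        return False

    def expire(self, now):
        """Lift bans that are at least 24 hours old; return those IPs."""
        freed = [ip for ip, t in self.banned.items()
                 if now - t >= BAN_SECONDS]
        for ip in freed:
            del self.banned[ip]
        return freed
```

Call `record()` for every log entry, call `ban()` from the trap script, and run `expire()` from cron. A real version would also need to add and remove the firewall rules whenever an IP enters or leaves `banned`.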