Cutting the crap with robots.txt

From miscellus

You are probably here because your server is being hammered by irrelevant robots scanning your web pages. These pests can dramatically reduce your server's performance and drive up its load average, which introduces delays in serving pages to your customers, the people you actually want visiting your site. Often this results in lost traffic and lost AdSense revenue.

From a bit of hunting around I've found a pretty good set of rules that should, in theory, block these pests. I can't promise they will all obey the rules, but at least you will have the name of each user agent you need to block.
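Before adding rules, it helps to confirm which agents are actually hammering your server. A minimal sketch in Python, assuming a combined-format (Apache/Nginx style) access log in which the user agent is the last quoted field; the log path in the usage comment is only an example:

```python
import re
from collections import Counter

# Matches the last two quoted fields of a combined-format log line:
# the referer and, at the very end of the line, the user agent.
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

def top_agents(lines, n=10):
    """Count requests per user agent and return the n busiest."""
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group("ua")] += 1
    return counts.most_common(n)

# Typical usage (path is an example, adjust for your server):
# with open("/var/log/apache2/access.log") as f:
#     for ua, hits in top_agents(f):
#         print(hits, ua)
```

Any agent near the top of that count that brings you no real visitors is a candidate for the block list below.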

I hope you find it useful!

This short article assumes you understand how to use robots.txt; its purpose is to provide you with a broad set of rules to block these nuisances.

# Adbeat  ads
User-agent: adbeat_bot
Disallow: /

#AgentLinkSpammer
User-agent: AgentLinkSpammer
Disallow: /

# AhrefsBot  ads
User-agent: AhrefsBot 
Disallow: /

User-agent: AhrefsBot/4.0
Disallow: /

#aiHitBot  Ukraine or Russia
User-agent: aiHitBot
Disallow: /
User-agent: aiHitBot/1.0
Disallow: /
User-agent: aiHitBot/1.1
Disallow: /

#Acoon Germany
User-agent: Acoon
Disallow: /

#Arachmo Japan
User-agent: Arachmo
Disallow: /

#Baiduspider China and Japan
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider+
Disallow: /

User-agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
Disallow: /

User-agent: Baiduspider/2.0;+http://www.baidu.com/search/spider.html
Disallow: /

User-agent: Baiduspider/2.0
Disallow: /

User-agent: +Baiduspider
Disallow: /

User-agent: +Baiduspider/2.0
Disallow: /

User-agent: +Baiduspider/2.0;++http://www.baidu.com/search/spider.html
Disallow: /

User-agent: Mozilla/5.0(compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Disallow: /

#careerbot  Germany
User-agent: careerbot
Disallow: /

#COMODOSpider/Nutch-1.2 United Kingdom
User-agent: COMODOSpider/Nutch-1.2
Disallow: /

#EasouSpider - China
User-agent: EasouSpider 
Disallow: /

#Exabot/3.0 - France proxy scraper
User-agent: Exabot/3.0
Disallow: /

#Exalead proxy scraper  France 
User-agent: Exalead
Disallow: /

User-agent: ExaLead Crawler
Disallow: /

#Ezooms and dotbot
User-agent: ezooms
Disallow: /

User-agent: Ezooms/1.0
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot[at]gmail[dot]com)
Disallow: /

#findlinks/2.6 Germany  http://wortschatz.uni-leipzig.de/findlinks
User-agent: findlinks/2.6
Disallow: /

#Java/1.6.0_04
User-agent: Java/1.6.0_04
Disallow: /

#JikeSpider China
User-agent: JikeSpider
Disallow: /

#KaloogaBot Netherlands contextual advertising
User-agent: KaloogaBot
Disallow: /

#Mail.RU_Bot/2.0   Russia
User-agent: Mail.RU_Bot/2.0
Disallow: /
#Mail.RU   Russia
User-agent: Mail.RU
Disallow: /
#Mail.Ru   Russia
User-agent: Mail.Ru
Disallow: /
User-agent: Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots
Disallow: /

#MJ12bot United Kingdom
User-Agent: MJ12bot
Disallow: /

#MJ12bot/v1.4.3  United Kingdom
User-Agent: MJ12bot/v1.4.3
Disallow: /

User-agent: moget
Disallow: /

#Ichiro  Japan
User-agent: Ichiro
Disallow: /
#Ichiro 3.0  Japan
User-agent: Ichiro 3.0
Disallow: /

User-agent: NaverBot 
Disallow: /

User-agent: Yeti
Disallow: /

#NetcraftSurveyAgent/1.0
User-agent: NetcraftSurveyAgent/1.0
Disallow: /

#OpenWebIndex/Nutch-1.6   Germany
User-agent: OpenWebIndex/Nutch-1.6
Disallow: /
User-agent: OpenWebIndex
Disallow: /

#panoptaStudyBot  checks.panopta.com monitor
User-agent: panoptaStudyBot
Disallow: /

#panoptaStudyBot  checks.panopta.com monitor
User-agent: checks.panopta.com
Disallow: /

#picsearch Sweden  searches for pictures
User-agent: psbot
Disallow: /

#plukkie Dutch (botje.nl)/Belgium (botje.be)/France (botje.fr)/United Kingdom (botje.co.uk) search engine
User-agent: plukkie
Disallow: / 

#SeznamBot Czech Republic
User-agent: SeznamBot
Disallow: /
User-agent: SeznamBot/1.0
Disallow: /
User-agent: SeznamBot/1.1
Disallow: /
#SeznamBot/3.0
User-agent: SeznamBot/3.0
Disallow: /

#SistrixCrawler Germany DE
User-agent: SistrixCrawler
Disallow: /

User-agent: Sistrix
Disallow: /

User-agent: SISTRIX Crawler
Disallow: /

User-agent: SISTRIX
Disallow: /

# Sogou
User-agent: sogou spider
Disallow: /

User-agent: Sogou web spider
Disallow: /

# Sosospider - China http://help.soso.com/webspider.htm
User-agent: Sosospider+
Disallow: /
# Sosospider - China
User-agent: Sosospider
Disallow: /
#Sosospider/2.0 - China  may not obey robots.txt
User-agent: Sosospider/2.0
Disallow: /

#360Spider  China
User-agent: 360Spider
Disallow: /

#SurveyBot
User-agent: SurveyBot
Disallow: /

#Wada.vn Vietnamese Search/2.1
User-agent: Wada.vn
Disallow: /
User-agent: Wada.vn Vietnamese Search
Disallow: /
User-agent: Wada.vn Vietnamese Search/2.1
Disallow: /

#Yandex
User-agent: Yandex
Disallow: /

User-agent: Yandex/1.01.001
Disallow: /

User-agent: YandexBot/3.0-MirrorDetector
Disallow: /

User-agent: YandexImages/3.0
Disallow: /

User-agent: YandexSomething/1.
Disallow: /

User-agent: Yandex.com
Disallow: /

User-agent: YandexBot/3.0
Disallow: /

#YisouSpider  China
User-agent: YisouSpider
Disallow: /

#YoudaoBot/1.0  China
User-agent: YoudaoBot/1.0
Disallow: /
#YoudaoBot China
User-agent: YoudaoBot
Disallow: /

#Zao   - Japan
User-agent: Zao
Disallow: /
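One way to sanity-check rules like these is Python's built-in robots.txt parser. Note that this parser, like most well-behaved ones, strips any version after the "/" in the requesting agent's name and compares case-insensitively, so a plain AhrefsBot entry already covers AhrefsBot/4.0; the version-suffixed entries above are belt and braces for sloppier bots. A minimal check, using a two-entry excerpt rather than the full file:

```python
from urllib.robotparser import RobotFileParser

# A two-entry excerpt of the rules above, plus a catch-all
# record that allows every other crawler.
rules = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("AhrefsBot", "/index.html"))      # False: blocked
print(parser.can_fetch("AhrefsBot/4.0", "/index.html"))  # False: version is stripped before matching
print(parser.can_fetch("Googlebot", "/index.html"))      # True: falls through to the catch-all
```

Running the same checks against your real robots.txt (via `RobotFileParser("https://example.com/robots.txt")` and `read()`) is a quick way to catch typos before a bot does.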