|  Topic: htaccess - blocking Agents containing X, but allowing XY |  |  
| Author |  |  
| syncmaster913n 
 
 
 Joined: 23 Aug 2015
 Posts: 2
 Location: Poland, Warsaw
 
 | 
|  Posted: Sun 23 Aug '15 11:50    Post subject: htaccess - blocking Agents containing X, but allowing XY |   |  
| 
 |  
| Hello everyone, 
 First of all, my websites are on a managed VPS hosting account with a well-regarded provider. I don't know the exact Apache version or operating system version on my "box," but I do know that they keep everything very much up to date. I'm guessing that this information is not necessary for my particular problem anyway, but if I'm wrong, I will open a support ticket with my host, ask them for the details, and then update this post. Hope that's ok.
 
 Goal:
 
 I am trying to block certain bots/crawlers from having any sort of access to my website, while allowing other ones. Many of the bots in question do not obey robots.txt, but they almost always have the word "bot" or "spider" in their user-agent. Therefore, what I'm trying to do is:

 Use htaccess to block all user-agents containing either "bot" or "spider" anywhere in the user-agent string (without regard to capitalization), while still granting access to a handful of specific bots whose names also include the word "bot" (Googlebot and bingbot, to be exact).
 
 My solution:
 
 Here is what I have done:
 
 
 Code:
 BrowserMatchNoCase bot bad_bot
 BrowserMatchNoCase spider bad_bot
 BrowserMatchNoCase google good_bot
 BrowserMatchNoCase bing good_bot
 Order Deny,Allow
 Deny from env=bad_bot
 Allow from env=good_bot

 
 I have tested this on a dummy site set up only for testing purposes, and it seems to work. Using a Chrome browser extension that lets me put any words I want in my User-Agent, I changed my agent name to various things and then tried to access my website:
 
 - bot: could not access site, raw logs return 403
 - BoT: could not access site, raw logs return 403
 - spider: could not access site, raw logs return 403
 - SPIDer: could not access site, raw logs return 403
 - GooglebOt: full website access, raw logs return 200
 - BINGbot: full website access, raw logs return 200
 
 It would appear from the above that my method is working, but I'm still afraid of actually using the above htaccess code on my live site.
 
 Question:
 
 So I guess my question is: should the htaccess rules described earlier by me actually work as I want them to, do they make sense? Is there any reason why my test with the user agent "spoofing" might have only provided a false-positive result? Basically I just want to see what those of you who are intimately familiar with the way htaccess is structured think about the above, in hopes of anticipating any possible future problems that might arise due to these rules.
 
 
 P.S. I have asked my host the exact same question above and they seem to think that the code above is indeed appropriate for the job. I'd still like a "second opinion" though, if possible.
 
 Thank you for your time,
 Mark
 |  |  
| Steffen Moderator
 
 
 Joined: 15 Oct 2005
 Posts: 3130
 Location: Hilversum, NL, EU
 
 | 
|  Posted: Sun 23 Aug '15 12:26    Post subject: |   |  
| 
 |  
| It looks like you are using Apache 2.2. 
 There is no need to define both good and bad bots; defining only the bad bots is enough.

 In the old 2.2 days I had:
 Code:
 SetEnvIf User-Agent archiver noc
 SetEnvIf User-Agent Fetch noc
 SetEnvIf User-Agent DTS noc
 SetEnvIf User-Agent slurp noc
 SetEnvIf User-Agent Baid noc
 SetEnvIf User-Agent Indy noc
 SetEnvIf User-Agent NPBot noc
 SetEnvIf User-Agent turn noc
 SetEnvIf User-Agent grub noc
 SetEnvIf User-Agent ZyBorg noc
 SetEnvIf User-Agent Scheduled noc
 SetEnvIf User-Agent QuepasaCreep noc
 ...
 ....


 deny from env=noc
 
 
 For 2.4 have a look at https://www.apachelounge.com/viewtopic.php?t=5438
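
 For reference, here is a minimal sketch of how the same "deny by environment variable" rule might look in Apache 2.4 syntax. It is an assumption based on the stock mod_authz_core directives, not a copy of what the linked topic contains. The `noc` variable and the `slurp` pattern are simply reused from the 2.2 list above.

```apache
# Flag unwanted agents exactly as in the 2.2 list (one line per pattern).
SetEnvIf User-Agent slurp noc

# Apache 2.4 replacement for "deny from env=noc":
# grant access to everyone except requests carrying the noc variable.
<RequireAll>
    Require all granted
    Require not env noc
</RequireAll>
```

 The old Order/Deny/Allow lines only keep working on 2.4 if mod_access_compat is loaded, so it is worth confirming with the host which version is actually running before choosing a syntax.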
 |  |  
| syncmaster913n 
 
 
 Joined: 23 Aug 2015
 Posts: 2
 Location: Poland, Warsaw
 
 | 
|  Posted: Sun 23 Aug '15 12:43    Post subject: |   |  
| 
 |  
| Hi Steffen, 
 Thank you for the reply.
 
 The problem is that I don't know the exact names of the bots I want to block. They change their names every once in a while, and new crawlers appear from time to time, so I would have to keep monitoring them and constantly adding their names to the htaccess file, which would be too much work. The only thing those bots have in common is that their user-agent includes either the word "bot" or "spider", so I need to block based on those two words while making an exception so that Googlebot and bingbot are not blocked.

 The rules you provided in your message seem appropriate if I have a list of all the bots I want to block, but I don't have such a list. This is why I have Good_bots and Bad_bots.

 I hope my message is understandable. Considering the above, do you think my htaccess rules are appropriate?
 |  |  
| covener 
 
 
 Joined: 23 Nov 2008
 Posts: 60
 
 
 | 
|  Posted: Sun 23 Aug '15 19:26    Post subject: |   |  
| 
 |  
| syncmaster913n wrote:
 Hi Steffen, The rules you provided in your message seem appropriate if I have a list of all the bots I want to block, but I don't have such a list. This is why I have Good_bots and Bad_bots.
 
 | 
 
 His point was that there's no need to list "good bots", because 'Order deny,allow' already allows everyone by default.
 
 edit: After rereading your lengthy first post -- your htaccess doesn't do what you want. You need to do one of:
 - use !badbot
 - write more robust regexes defining badbot
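
 A minimal sketch of the first option, assuming Apache 2.2 and the variable name bad_bot from the original post (the "badbot" spelling above is taken to mean the same variable). In mod_setenvif, a leading "!" on the variable name removes a variable that an earlier line set, and the directives are processed in the order they appear, so the exceptions must come after the broad matches. The "googlebot" and "bingbot" patterns are illustrative; any user-agent containing those strings would be let through, including spoofed ones.

```apache
# Flag every User-Agent containing "bot" or "spider" (case-insensitive).
BrowserMatchNoCase bot bad_bot
BrowserMatchNoCase spider bad_bot

# Clear the flag again for the crawlers that should stay allowed.
# These lines must follow the two lines above.
BrowserMatchNoCase googlebot !bad_bot
BrowserMatchNoCase bingbot !bad_bot

# Apache 2.2 access control: everyone is allowed by default,
# only requests still carrying bad_bot are denied.
Order Deny,Allow
Deny from env=bad_bot
```

 With this layout there is a single variable to reason about, and no separate good_bot list or Allow line is needed.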
 |  |  