4 replies
Hey everyone, just a quick question. Should I let unknown bots crawl my site? There are about five different bots with names like "unknown robot" or "bots". Which robots, besides the obvious ones like Google, Yahoo and the others, should I put in my robots.txt file? Any help would be appreciated.
#bots #unknown
  • Bruce Hearder
    I would strongly suggest against blocking these unknown bots, as they are most likely part of the major search engines (SEs) checking your site.

    There is an increasing number of websites that now use cloaking to artificially increase their search engine rankings. The search engines (especially BigG) have now implemented a range of other bots that come from different IP addresses and don't identify themselves as coming from Google at all.

    IP cloaking works like this: a bot visits a website, the website determines from its IP address that it's a bot, and so it gives it a bunch of keyword-rich text to spider and index.
    A human visits the site, the site determines that the visitor is not from a search engine, and redirects the human visitor to another website (usually an affiliate page).

    So the SEs are now trying to find these pages, by sending in bots that look and behave as humans, and others that have no distinguishing details at all. They want to see if the content they see is substantially different from their previous visit. If so, then the site may come up for a human review.
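
    The "substantially different content" check described above can be sketched roughly like this. This is only an illustration of the idea, not how any search engine actually does it: the function names, the word-level comparison and the 0.5 threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher


def similarity(view_a: str, view_b: str) -> float:
    """Return a 0.0-1.0 similarity score based on the words in each page."""
    # Compare word sequences rather than raw characters, so small
    # markup or whitespace changes do not dominate the score.
    return SequenceMatcher(None, view_a.split(), view_b.split()).ratio()


def looks_cloaked(bot_view: str, human_view: str, threshold: float = 0.5) -> bool:
    """Flag a page when the two views share too little content.

    `threshold` is an arbitrary illustrative value, not a known cutoff.
    """
    return similarity(bot_view, human_view) < threshold


# Hypothetical page text served to a declared crawler vs. a normal visitor.
served_to_bot = "cheap widgets buy widgets best widgets online"
served_to_human = "click here for a special offer"

flagged = looks_cloaked(served_to_bot, served_to_human)
same_page = looks_cloaked(served_to_bot, served_to_bot)
```

    A page that serves the same text to everyone scores 1.0 and is not flagged, while a page with no words in common between the two views scores 0.0 and would be a candidate for the human review mentioned above.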

    So, my recommendation is: don't block these bots if you have nothing to hide.

    Hope this helps

  • DArmbrister
    Thanks for the help Bruce
  • awesometbn
    What Bruce mentioned is valid, but I just wanted to offer another point of view. In the beginning I didn't care who came by my websites, and I was happy to have the visitors. True, a lot of the automated robots (or bots) were related to the search engines, but in the last few years and months I started to see an increasing number of unknown sources. These weren't random; they were hitting the server relentlessly. So I took some advice from a web development company that was fed up with these suspicious connections, and decided to implement a long list for robots.txt, added some rules to .htaccess, and started monitoring everything with a web application firewall called mod_security. Why? Because of the following benefits, which I'm sure you've seen on other websites like botsense.com:
    • Reduced bandwidth costs
    • Reduced server load from illegitimate traffic
    • Stop email scrapers
    • Stop image scrapers
    • Stop copy scrapers
    • Stop snoopers!
    So it is entirely your call. I decided to stop allowing connections I did not understand. If we have nothing to hide, then why do we allow bots to hide themselves and their potentially malicious intentions? I make exceptions if I can fully understand who the source is, why they need the connection to me, and exactly what they are doing to my server.
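
    Before deciding what to block, it helps to see which clients are actually hitting the server the hardest. Here is a minimal sketch that counts requests per user agent. It assumes the access log is in the common "combined" format, where the user agent is the last quoted field on each line; the sample lines and the `hits_per_agent` helper are made up for illustration.

```python
import re
from collections import Counter

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def hits_per_agent(log_lines):
    """Count requests per user-agent string; unparseable lines are skipped."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines (documentation IP range, invented timestamps).
sample_log = [
    '192.0.2.10 - - [01/Jan/2009:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "unknown robot"',
    '192.0.2.10 - - [01/Jan/2009:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "unknown robot"',
    '192.0.2.20 - - [01/Jan/2009:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

counts = hits_per_agent(sample_log)
busiest = counts.most_common(1)[0]
```

    Sorting the counts this way shows at a glance whether an unidentified client accounts for a disproportionate share of the traffic, which is the kind of evidence worth having before adding a rule to robots.txt or .htaccess.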

    Sorry to sound negative, but with some of the security issues I've dealt with, it becomes an advantage to take a defensive position to protect myself and my client's assets.
    • netbie
      Just take a look at some recent trends:

      Google 50%
      Yahoo 23%
      MSN 10%
      AOL 6%
      ASK 2%
      Others 9%

      So as you can see, the other 9% do not make a big difference, with the exception of Alexa. So it is a very good idea to allow the "well-known" spiders and block the rest.
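
      An "allow the well-known spiders, block the rest" robots.txt might look something like the sketch below, checked here with Python's standard `urllib.robotparser`. The user-agent tokens (Googlebot, Slurp for Yahoo, msnbot, ia_archiver for Alexa) are the commonly published ones, but treat the exact list as an assumption and verify each token against the search engine's own documentation.

```python
from urllib.robotparser import RobotFileParser

# An empty "Disallow:" means "nothing is disallowed" for that agent.
# The final "*" group applies to every crawler not named above it.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
google_ok = parser.can_fetch("Googlebot", "/index.html")
alexa_ok = parser.can_fetch("ia_archiver", "/index.html")
unknown_ok = parser.can_fetch("SomeUnknownBot", "/index.html")
```

      Keep in mind that robots.txt is only a request: well-behaved crawlers honor it, but a bot that ignores it has to be stopped at the server level, for example with the .htaccess rules mentioned earlier in the thread.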