robots.txt - is this blocking googlebot?

by GACS
19 replies
  • SEO
I'm not so good with web code so I was hoping someone could help me figure out whether my website's robots.txt is blocking googlebot or other search engines.

This is the robots.txt I see in Webmaster Tools:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Sitemap: greenaircleaningsystems com/sitemap.xml.gz

I'm receiving a message stating:
"
greenaircleaningsystems com/: Googlebot can't access your site
Over the last 24 hours, Googlebot encountered 5 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 83.3%.
"

Thanks, I appreciate any advice.
#blocking #googlebot #robotstxt
  • Profile picture of the author Mkj
    Did you alter all of the robots.txt file? I can see you added the sitemap location. Google doesn't pay any attention to that though as far as I know.

    You are obviously using Wordpress. I am not sure what version you have but the version I have on one of my sites - latest version - doesn't use a robots.txt file at all so I think you could safely remove everything from within yours. Leave it something like this:

    User-agent: *
    Allow: /

    Add your sitemap as instructed in the usual way within Webmaster tools.

    If all you added was the part about the sitemap then just remove that part so that the robots.txt file is back as it was originally.
  • Profile picture of the author GACS
    Thanks Mkj, I changed the robots.txt to the one you suggested and I think it has helped. When I fetch as Google it says "success" but I'm still receiving the error message. Hopefully Google will be able to access the file and index my website. Much appreciated!
    • Profile picture of the author Mkj
      Originally Posted by GACS

      Thanks Mkj, I changed the robots.txt to the one you suggested and I think it has helped. When I fetch as Google it says "success" but I'm still receiving the error message. Hopefully Google will be able to access the file and index my website. Much appreciated!
      You haven't cos I just checked. Your robots.txt file is as follows:

      User-agent: *
      Disallow: /wp-admin/
      Disallow: /wp-includes/
      Sitemap: http://www.greenaircleaningsystems.com/sitemap.xml.gz

      How are you attempting to alter the file?

      Check the read/write permissions of the file.
  • Profile picture of the author UMS
    Your robots.txt file is fine, although if you are using WordPress, you don't need to manually generate one.

    It's more than likely you have rules in .htaccess which are blocking access.
  • Profile picture of the author UMS
    Please note that your robots.txt entries are completely standard for WordPress sites. You don't need to change anything.
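
    If you want to double-check that yourself, here is a minimal sketch using Python's standard urllib.robotparser. It parses the rules quoted in this thread from a string (nothing is fetched), and the test paths are only illustrative. Python's parser is not Googlebot's parser, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted earlier in this thread, pasted in as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Sitemap: http://www.greenaircleaningsystems.com/sitemap.xml.gz
"""

SITE = "http://www.greenaircleaningsystems.com"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative paths only; the last one is a made-up file name.
    for path in ("/", "/wp-admin/", "/wp-includes/js/example.js"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", SITE + path))
```

    With these rules only the two WordPress system directories should come back as blocked; the home page and normal pages stay crawlable.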
    • Profile picture of the author Mkj
      Originally Posted by UMS

      Please note that your robots.txt entries are completely standard for WordPress sites. You don't need to change anything.
      Exactly. As I said he should put it back as it was originally.

      He might have a typo in the robots.txt file as it is.

      The error messages he is getting are all to do with the robots.txt file and nothing else. It is highly unlikely he has messed with the htaccess file.
  • Profile picture of the author jamaks
    Hi, the robots.txt file is a strange one. If you check with robotstxt.org there is no such thing as an allow statement and yet if you look at the google one they use it repeatedly. Personally I would err on the side of caution and use
    Code:
    User-agent: *
    Disallow:
    which is giving unrestricted access to all robots to all of your site. Once you are happy that works correctly you could then add in the additional lines to exclude your directories
    Code:
    User-agent: *
    Disallow: 
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Hope this helps. Jim
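
    For anyone unsure whether the empty "Disallow:" and "Allow: /" really behave the same, here is a rough sketch with Python's standard urllib.robotparser. The helper name can_crawl and the test paths are made up for illustration, and Python's parser can differ from Googlebot's on more complex files, so this only covers the two simple variants.

```python
from urllib.robotparser import RobotFileParser

# Variant 1: an empty Disallow line, as in the original robots.txt standard.
EMPTY_DISALLOW = """\
User-agent: *
Disallow:
"""

# Variant 2: the Allow extension that Google and others support.
ALLOW_ALL = """\
User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules in robots_txt let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/", "/wp-admin/", "/some-page/"):
        print(path, can_crawl(EMPTY_DISALLOW, path), can_crawl(ALLOW_ALL, path))
```

    Both variants should allow every path here, which is why the empty "Disallow:" is the safer choice if you are worried about robots that do not understand Allow.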
    Signature

    jamaks

    • Profile picture of the author keokeo123
      Banned
      Originally Posted by jamaks

      Hi, the robots.txt file is a strange one. If you check with robotstxt.org there is no such thing as an allow statement and yet if you look at the google one they use it repeatedly. Personally I would err on the side of caution and use
      Code:
      User-agent: *
      Disallow:
      which is giving unrestricted access to all robots to all of your site. Once you are happy that works correctly you could then add in the additional lines to exclude your directories
      Code:
      User-agent: *
      Disallow: 
      Disallow: /wp-admin/
      Disallow: /wp-includes/
      Hope this helps. Jim

      This can sometimes be very dangerous: hackers then know your admin page and other important URLs on your site, so they can attack your site more easily.
  • Profile picture of the author GACS
    Thanks for all of the feedback. I'm still stuck on what to do. From the responses here, it seems that my robots.txt file is standard for a wordpress site.

    My site also dropped off Google's search results pages. Do you think it dropped off because Googlebot can't crawl my site anymore? Or is it because of the Penguin or Panda update?

    Another note: Webmaster Tools says there are 36 crawl errors (404 Not Found).

    Thanks for any advice.
  • Profile picture of the author wlasikiewicz
    Have you set the permissions on your .htaccess file to 755?
    • Profile picture of the author paulgl
      It's not a robots.txt issue. That's why replies are
      saying the robots.txt is fine.

      I've answered this question many times. It's a server
      or host issue. The very first file Google looks for is
      the robots.txt file. If, after that, it cannot find any more
      files online, it stops. Since it tried to find the robots.txt
      file and then stopped, that's why the message mentions the
      robots.txt file.

      It's normally Google's way of saying we cannot crawl
      your site because it's not being found. It has nothing to do
      with the actual robots.txt file.

      Normal people don't go tweaking htaccess or robots.txt
      by accident. Virtually impossible. Unless your site got hacked,
      or some arcane plugin. Again, not very likely.

      You don't need a robots.txt file. Google looks for it first, and
      if not found, returns a crawl error. But then it goes crawling
      your site normally. Most webmasters hate crawl errors, but
      this one is moot. I keep stressing that because if no other
      files are found, it stops and gives the first crawl error.

      Your server or host might have hiccuped while google was
      crawling it. Or, the site is offline.
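
      To see what the server actually answers for robots.txt, something like the sketch below can help. The mapping in classify_robots_status is my reading of Google's published guidance (which may change), not Googlebot's real code, and both function names are made up for this example.

```python
import urllib.error
import urllib.request


def classify_robots_status(status):
    """Map an HTTP status for /robots.txt to a rough crawler reaction.

    Assumption: this mirrors Google's published guidance as I understand it.
    `status` is None when no HTTP response arrived at all.
    """
    if status is None:
        return "unreachable: crawl postponed"
    if 200 <= status < 300:
        return "ok: rules are parsed"
    if status == 429 or status >= 500:
        return "server error: crawl postponed"
    if 400 <= status < 500:
        return "missing: crawl without restrictions"
    return "other: check redirects"


def fetch_robots_status(site, timeout=10.0):
    """Return the HTTP status of site + '/robots.txt', or None if unreachable."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


if __name__ == "__main__":
    status = fetch_robots_status("http://www.greenaircleaningsystems.com")
    print(status, classify_robots_status(status))
```

      A plain 404 is harmless under this reading; it is the 5xx answers and the unreachable case that match the "postponed our crawl" message.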

      Paul
      Signature

      If you were disappointed in your results today, lower your standards tomorrow.

  • Profile picture of the author GACS
    Thanks for the advice man. I haven't changed a single thing in terms of robots.txt or .htaccess and wouldn't even know how to.

    I'll contact the host server and hopefully this is something he can fix. My site hasn't been crawled in a long time and I just dropped off google search engine results.
  • Profile picture of the author jamaks
    Hi, I do not know if this is the cause of your problem, but it is worth researching and/or notifying your hosting company.
    Missing nameservers reported by parent FAIL: The following nameservers are listed at your nameservers as nameservers for your domain, but are not listed at the parent nameservers (see RFC2181 5.4.1). You need to make sure that these nameservers are working. If they are not working ok, you may have problems!
    This is from a DNS report on the first named website in your signature. I do not claim to understand the relevance, but this appears to be concerned with ranking data and might well be worth checking out. Jim
    Signature

    jamaks

  • Profile picture of the author bhushan@rancor
    I think you should alter your files. I think you have missed that.
    Signature
    Interactive Bees Pvt Ltd best known for Quality Web Development Solutions and Online Marketing Services.
  • Profile picture of the author caslado1250
    You have a basic thing wrong:
    Sitemap: greenaircleaningsystems com/sitemap.xml.gz
    Bots don't understand this line.
  • Profile picture of the author engagedotscrm
    A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site.
  • Profile picture of the author mangomedia1
    In my view, there is no chance that Googlebot is being blocked, because you didn't mention Googlebot specifically in the robots file:
    User-agent: *
    Disallow:
    Disallow: /wp-admin/
    Disallow: /wp-includes/
  • Profile picture of the author Igal Zeifman
    Hi

    A. This is not a robots.txt issue.

    B. In your case, the error rate may indicate a very high downtime rate and/or wrong security settings.
    (e.g. some providers will block all Chinese traffic, including Googlebot, which may use Chinese IPs...)

    C. For all who suggested not using any Googlebot filters (.htaccess, robots.txt or others...), I must say that I disagree.
    You should always be mindful of your robots.txt, as it can be used to prevent duplicate content issues, to help mask irrelevant or yet-to-be-developed content, etc.

    D. For those who are concerned with leaving "clues" for hackers in robots.txt:
    Generally speaking this is a well-placed concern, but in this case it's an irrelevant one, as every hacker on the planet already knows the default URL for WP admins...
    If you really want to be secure you'll need to use a custom/modified URL and to mask it locally with "Meta-Robots" tags. But even in this scenario, a decent hacker will find a loophole.

    Talk to your provider. If motivated enough, they should be able to help you zero-in on the source of this problem.

    To speed things up you can also use Pingdom to get your own downtime stats, and the Google WMT "Fetch" feature to get more accurate information.
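
    As a rough illustration of how such a rate is worked out: the 83.3% in the Webmaster Tools message quoted in the first post would match 5 failures out of 6 attempts. The 6 is my inference; the message only gives the 5 errors and the percentage. The function names below are made up for this sketch.

```python
def error_rate(failures: int, attempts: int) -> float:
    """Percentage of failed attempts, rounded to one decimal place."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return round(100.0 * failures / attempts, 1)


def summarize(results):
    """Summarize check results, where True means the check succeeded.

    Returns a tuple of (failures, attempts, error rate in percent).
    """
    results = list(results)
    failures = sum(1 for ok in results if not ok)
    return failures, len(results), error_rate(failures, len(results))


if __name__ == "__main__":
    # Assumed example: five failed robots.txt fetches and one success.
    print(summarize([False, False, False, False, False, True]))
```

    Feeding it your own uptime checks over a day gives a number you can compare with what Webmaster Tools reports.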

    Hope this helps.
