robots.txt - What should it look like for wordpress site?

4 replies
Hello warriors,

I am in the process of putting together a site that is using wordpress and I am wondering what a good robots.txt should look like. After some googling and combining it with what I use on other sites of mine, I have this so far:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/


User-agent: Googlebot

Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /archives/
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/

User-agent: Googlebot-Image
Disallow: /wp-includes/

User-agent: Mediapartners-Google*
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /

So, does anyone want to add something or suggest I delete something?

Also, if I want to have an ebook that people can download, how should I go about somewhat protecting it (it's free, but I do want people to sign up to a list to get it) and keeping google from indexing it?
So far I've made a folder in the main directory, put a Disallow: /foldername in robots.txt, and put an empty index.html in the folder. Anything else I can/should do to keep search engines from indexing it?
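For the folder itself, I was also thinking of something like this in an .htaccess file inside it (just a sketch, assuming an Apache server with mod_headers enabled; the folder name is whatever you picked):

```apache
# Hypothetical .htaccess placed inside the ebook folder.
# The X-Robots-Tag header asks compliant crawlers not to index
# or follow anything served from this folder.
# Requires Apache's mod_headers module.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
```

That only discourages well-behaved crawlers, though; real protection would mean putting the file behind the sign-up itself rather than at a guessable URL.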

Thank you for your feedback!
#robotstxt #site #wordpress
  • sober
    Comprehensive tutorials on robots.txt for wordpress are available at

    askapache.com
    yoast.com
  • TheRichJerksNet
    Looks like a great deal of overkill to me ... You do not need any of that if security is what you are after. It will not protect your wordpress from hackers.

    Also, please note: the spiders do not always pay attention to the robots.txt file. I have seen google go right past it and index stuff anyway. Not saying this happens often, but I have seen it done more than once.

    ia_archiver is alexa.com, in case you did not know; not that it matters, as their stats are wrong anyway.

    James
  • KristiDaniels
    In a prior life, I wrote robot code.

    My recommendation as an Internet marketer is to never have a robots.txt file.

    If you ever wrote a robot you would understand why. It is a very ambiguous standard. It is difficult to fully parse and obey completely.

    If you want to write a robot quickly and make sure you don't break any of the robots.txt standards, you often write a routine that looks something like this (Python-style; file_exists and get_file stand in for whatever fetch helpers the robot already has):

    def allowed_by_robots_txt(domain):
        # Start from the most cautious assumption: robots are prohibited
        robots_url = domain + "/robots.txt"
        if not file_exists(robots_url):
            # No robots.txt at all: the robot is allowed on the site
            return True
        contents = get_file(robots_url)
        if len(contents) == 0:
            # robots.txt exists but is empty: no rules, roam freely
            return True
        # Anything else would require real parsing, so stay away
        return False

    If you didn't follow that pseudocode, it basically first assumes that all robots are prohibited from the site.

    But then it checks to see if robots.txt isn't even on the site. If it is missing, then the robot knows that it is actually allowed on the site.

    It also checks to see if it exists, but is empty. It knows that is good too. It can roam the site without any rules.

    But nothing else is implemented. It doesn't bother to parse the robots.txt file at all because once you start doing any parsing, you are pretty much stuck with all of it.

    And it isn't pretty. There are some ambiguities that are still implemented differently from one robot to another and some people claim certain popular robots still aren't fully compliant with the robots.txt standard.

    It is a no-win scenario. So some robots don't even do the above test; they ignore the robots.txt standard entirely. Ironically, those are the ones that you least want on your site. They don't care about breaking rules, so they are probably not doing something that is good for your site.

    Then there are the ones that do the above minimal checking. Those robots you usually DO want on your site. They want to follow the rules, but the rules are too complex. So they follow the most extreme example of the rules possible. If there is no robots.txt or it is empty then everyone agrees that all robots are welcome.

    And then there are the famous robots that everyone knows about. They aren't very consistent, but they do try at least somewhat to follow the rules. They can follow simple robots.txt files so you won't have a major problem with them if you have a robots.txt file.

    But you will miss out on all of the robots you don't really know about that only check for the existence of robots.txt, like the ones I wrote and some others I know about that were written with the same shortcuts.

    That means less exposure and less traffic: fewer unsolicited listings in certain directories, fewer unsolicited links from other sites, less chance that a news site will cover something that you do, and less chance for all of your pages to have a lot of versions in archive.org.

    That isn't compatible with Internet marketing. You want maximum exposure as an Internet marketer.

    So don't use a robots.txt file at all.

    Instead, if you have pages that you don't want to be indexed by robots because of security concerns, then use some real security. Make sure those pages aren't available to the public at all whether human or robot readers. Make sure you have to sign in to get to those pages.

    Lately there is a new concern I haven't really looked into much. There is a new robots.txt standard for pointing at your sitemap. Maybe that is a justification for having a small robots.txt, but I would still keep it minimal. And I know that all of my old robot code that is still being run by some organizations today won't index your site if you have a non-empty robots.txt.

    If there is another way to get your sitemap seen without using robots.txt, I would do that instead. Maybe just link to it from your home page!
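    If you do go the small-robots.txt route, then from what I have seen the sitemap pointer is just one extra line on top of an allow-everything file (example.com is a placeholder; use your own domain and sitemap path):

    ```
    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/sitemap.xml
    ```

    The empty Disallow keeps the file as permissive as no file at all, so even minimal parsers have nothing to trip over.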

    My vote is for no robots.txt for a wordpress blog or any other kind of site.

    I hope that helps.
    • Bruce Hearder
      Interesting comments about robots.txt

      In my experience, I have found that not having a robots.txt seriously slows down the initial spidering of a website.

      It seems that the SEs will hit a site, check for robots.txt.
      If it's not there (404 error), they go away and try again anywhere from 1 hour to 1 day later.

      They will keep doing this until some arbitrary time has been reached, and then they will start spidering the site.

      If I put up a basic robots.txt file, as follows:

      User-agent: *
      Disallow:

      Then I find that the SEs will start spidering my site within minutes of hitting it the first time.


      Bruce
