Go Back   WarriorForum - Internet Marketing Forums > The Warrior Forum > Main Internet Marketing Discussion Forum
Old 10-10-2009, 07:15 PM   #1
HyperActive Warrior
 
Join Date: Jun 2008
Location: , , USA.
Posts: 301
Thanks: 8
Thanked 15 Times in 14 Posts
Default robots.txt - What should it look like for wordpress site?

Hello warriors,

I am in the process of putting together a site that uses WordPress, and I am wondering what a good robots.txt should look like. After some googling and combining the results with what I use on my other sites, I have this so far:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/


User-agent: Googlebot
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /archives/
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/

User-agent: Googlebot-Image
Disallow: /wp-includes/

User-agent: Mediapartners-Google*
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /

So, does anyone want to add something or suggest I delete something?
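For anyone who wants to sanity-check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser on the catch-all group above. The example.com host, the bot name, and the is_allowed helper are made up for illustration. This parser does plain prefix matching, so it says nothing about how Googlebot reads the wildcard lines.

```python
from urllib.robotparser import RobotFileParser

# The catch-all group from the post above.
RULES = """\
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
"""

def is_allowed(path, agent="SomeBot"):
    """Return True if `agent` may fetch `path` under RULES (prefix matching)."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "http://example.com" + path)

if __name__ == "__main__":
    for path in ["/about/", "/wp-admin/", "/wp-content/uploads/logo.png", "/feed/"]:
        print(path, "allowed" if is_allowed(path) else "blocked")
```

One thing this makes visible: under prefix matching, Disallow: /wp- also covers everything under /wp-content/, including uploaded images.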

Also, if I want to offer an ebook that people can download, how should I go about protecting it somewhat (it's free, but I do want people to sign up to a list to get it) and keeping Google from indexing it?
So far I've made a folder in the main directory, put a Disallow: /foldername line in robots.txt, and put an empty index.html in the folder. Is there anything else I can or should do to keep search engines from indexing it?

Thank you for your feedback!
J smith is offline   Reply With Quote
Old 10-10-2009, 11:08 PM   #2
Warrior Member
War Room Member
 
sober's Avatar
 
Join Date: Oct 2009
Location: Toronto
Posts: 28
Thanks: 5
Thanked 1 Time in 1 Post
Default Re: robots.txt - What should it look like for wordpress site?

Comprehensive tutorials for WordPress are available at:

askapache.com
yoast.com
sober is offline   Reply With Quote
Old 10-10-2009, 11:25 PM   #3
TheRichJerksNet
Guest
 
Posts: n/a
Default Re: robots.txt - What should it look like for wordpress site?

Looks like a great deal of overkill to me... You do not need any of that if you are trying to do security. It will not protect your WordPress site from hackers.

Also, please note: spiders do not always pay attention to the robots.txt file. I have seen Google go right past it and index stuff anyway. I am not saying this happens often, but I have seen it more than once.

ia_archiver is alexa.com, in case you did not know. Not that it matters, as their stats are wrong anyway.

James
Old 10-10-2009, 11:35 PM   #4
InternetBusinessBox.com
 
Join Date: Sep 2009
Posts: 391
Thanks: 6
Thanked 102 Times in 62 Posts
Default Re: robots.txt - What should it look like for wordpress site?

In a prior life, I wrote robot code.

My recommendation as an Internet marketer is to never have a robots.txt file.

If you ever wrote a robot you would understand why. It is a very ambiguous standard. It is difficult to fully parse and obey completely.

If you want to write a robot quickly and make sure you don't break any of the robots.txt standards, you often write a routine that looks something like this:

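sub allowedbyrobotstxt(domain:string):boolean
{
  // Assume the robot is prohibited until proven otherwise.
  returnval=false;
  if not fileexists(domain."/robots.txt")
  {
    // No robots.txt at all: the robot is allowed.
    returnval=true;
  }
  else
  {
    contents=getfile(domain."/robots.txt");
    if (strlen(contents)==0)
    {
      // robots.txt exists but is empty: the robot is allowed.
      returnval=true;
    }
  }
  return returnval;
}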

If you didn't follow that pseudocode, it basically first assumes that all robots are prohibited from the site.

But then it checks to see if robots.txt isn't even on the site. If it is missing, then the robot knows that it is actually allowed on the site.

It also checks to see if it exists, but is empty. It knows that is good too. It can roam the site without any rules.

But nothing else is implemented. It doesn't bother to parse the robots.txt file at all because once you start doing any parsing, you are pretty much stuck with all of it.
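For readers who prefer real code, here is a rough Python rendering of that shortcut, as a sketch only. The fetch argument is a hypothetical stand-in for whatever HTTP client the robot uses; it is assumed to return the file's text, or None when robots.txt does not exist.

```python
from typing import Callable, Optional

def allowed_by_robots_txt(domain: str,
                          fetch: Callable[[str], Optional[str]]) -> bool:
    """Minimal check: allowed only if robots.txt is missing or empty.

    `fetch` is a hypothetical helper: it takes a URL and returns the body
    as text, or None if the file does not exist (for example, a 404).
    """
    contents = fetch(domain + "/robots.txt")
    if contents is None:
        return True   # no robots.txt at all: no rules to break
    if len(contents) == 0:
        return True   # file exists but is empty: no rules to break
    return False      # anything else is left unparsed and treated as "keep out"

if __name__ == "__main__":
    # Fake fetchers standing in for real HTTP requests.
    print(allowed_by_robots_txt("http://example.com", lambda url: None))
    print(allowed_by_robots_txt("http://example.com", lambda url: ""))
    print(allowed_by_robots_txt("http://example.com",
                                lambda url: "User-agent: *\nDisallow:"))
```

Note the last case: a robot written this way stays away even when the file's only rule allows everything, because it never reads the rules.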

And it isn't pretty. There are some ambiguities that are still implemented differently from one robot to another and some people claim certain popular robots still aren't fully compliant with the robots.txt standard.
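One concrete example of such an ambiguity is a rule like Disallow: /*.php$. The original robots.txt convention reads the value as a literal path prefix, while some crawlers, Googlebot among them, treat * and $ as pattern characters. The sketch below contrasts the two readings. Both function names are invented for illustration, and neither is any real robot's code.

```python
import re

def prefix_match(rule: str, path: str) -> bool:
    """Literal reading: the rule blocks any path that starts with it."""
    return path.startswith(rule)

def wildcard_match(rule: str, path: str) -> bool:
    """Pattern reading: '*' matches any run of characters and a trailing
    '$' anchors the end of the path; otherwise the rule is still a prefix."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with ".*" where '*' stood.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

if __name__ == "__main__":
    for rule, path in [("/*.php$", "/index.php"),
                       ("/*?", "/post?id=7"),
                       ("/feed/", "/feed/atom/")]:
        print(rule, path, prefix_match(rule, path), wildcard_match(rule, path))
```

The same line blocks /index.php under one reading and blocks nothing under the other, so the site owner cannot know what a given robot will do with it.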

It is a no-win scenario. So some robots don't even do the above test. They ignore the robots.txt standard entirely. Ironically, those are the ones that you least want on your site. They don't care about breaking rules, so they are probably not doing something that is good for your site.

Then there are the ones that do the above minimal checking. Those robots you usually DO want on your site. They want to follow the rules, but the rules are too complex, so they follow the most conservative reading of the rules possible. If there is no robots.txt or it is empty, then everyone agrees that all robots are welcome.

And then there are the famous robots that everyone knows about. They aren't very consistent, but they do try at least somewhat to follow the rules. They can follow simple robots.txt files so you won't have a major problem with them if you have a robots.txt file.

But you will miss out on all of the robots you don't really know about that only check for the existence of robots.txt, like the ones I wrote and some others I know about that were written by other people using the same shortcuts.

That means less exposure and less traffic, fewer unsolicited listings in certain directories, fewer unsolicited links from other sites, less chance that a news site will cover something you do, and less chance that all of your pages will have many versions in archive.org.

That isn't compatible with Internet marketing. You want maximum exposure as an Internet marketer.

So don't use a robots.txt file at all.

Instead, if you have pages that you don't want to be indexed by robots because of security concerns, then use some real security. Make sure those pages aren't available to the public at all whether human or robot readers. Make sure you have to sign in to get to those pages.
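For a file download, one way to get closer to that without building a full sign-in system is to hand out signed, expiring links only after someone joins the list. This is a minimal sketch using Python's standard hmac module. The secret, the function names, and the email addresses are all placeholders for illustration, not a finished design.

```python
import hashlib
import hmac
import time
from typing import Optional

# Placeholder secret for illustration; a real site would keep this private.
SECRET_KEY = b"change-me"

def make_token(email: str, expires_at: int) -> str:
    """Sign the subscriber's email and an expiry time (Unix seconds)."""
    message = f"{email}|{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def is_valid(email: str, expires_at: int, token: str,
             now: Optional[int] = None) -> bool:
    """Accept the download only if the token matches and has not expired."""
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False
    expected = make_token(email, expires_at)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, token)

if __name__ == "__main__":
    expiry = int(time.time()) + 3600  # link is good for one hour
    token = make_token("reader@example.com", expiry)
    print(is_valid("reader@example.com", expiry, token))
    print(is_valid("someone@example.com", expiry, token))
```

A download script would check is_valid before sending the file, so neither a robot nor a person can fetch it from a guessed URL.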

Lately there is a new concern I haven't really looked into much. There is a new robots.txt standard for pointing at your sitemap. Maybe that is a justification for having a small robots.txt, but I would still keep it minimal. And I know that all of my old robot code that is still being run by some organizations today won't index your site if you have a non-empty robots.txt.

If there is another way to get your sitemap seen without using robots.txt, I would do that instead. Maybe just link to it from your home page!

My vote is for no robots.txt for a wordpress blog or any other kind of site.

I hope that helps.
KristiDaniels is offline   Reply With Quote
Old 10-11-2009, 09:14 AM   #5
Advanced Warrior
War Room Member
 
Bruce Hearder's Avatar
 
Join Date: May 2004
Location: Perth, Australia.
Posts: 717
Thanks: 4
Thanked 182 Times in 138 Posts
Default Re: robots.txt - What should it look like for wordpress site?

Interesting comments about robots.txt

In my experience, not having a robots.txt seriously slows down the initial spidering of a website.

It seems that the SEs will hit a site and check for robots.txt.
If it is not there (404 error), they go away and try again anywhere from 1 hour to 1 day later.

They will keep doing this until some arbitrary time has been reached, and then they will start spidering the site.

If I put up a basic robots.txt file, as follows:

User-agent: *
Disallow:

Then I find that the SEs will start spidering my site within minutes of hitting it the first time.
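To confirm that this two-line file really does allow everything, at least for a parser that follows the basic rules, here is a small sketch with Python's standard urllib.robotparser. The bot name, the can_crawl helper, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line "allow everything" file from above.
ALLOW_ALL = ["User-agent: *", "Disallow:"]

def can_crawl(url: str, agent: str = "AnyBot") -> bool:
    """Return True if `agent` may fetch `url` under the ALLOW_ALL rules."""
    parser = RobotFileParser()
    parser.parse(ALLOW_ALL)
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    print(can_crawl("http://example.com/"))
    print(can_crawl("http://example.com/wp-admin/"))
```

An empty Disallow value blocks nothing, so every URL comes back as allowed.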


Bruce

-----------------
Get Your Backlinks indexed quicker at BackLinks2RSS

Create Full Text Feeds from Partial RSS Feeds at FeedExpander.com. See the WarriorForum post about it here
Bruce Hearder is offline   Reply With Quote

All times are GMT -6. The time now is 12:49 PM.