Conversion Rate Optimization Blog

About the Invesp Blog

This blog is brought to you by the team at Invesp Consulting, an e-commerce conversion optimization company.


Meet the authors of the Invesp blog: Ayat, Khalid, and Chris.

By khalid on November 13, 2007 1:10 am
Posted in (Blogging)

The other day, I was examining our robots.txt file and discovered a few things that needed correcting. As I was doing that, I thought this would make a great blog topic, so I decided to share a couple of things that may be helpful to others.

A “robots.txt” is a simple text file that tells search engine crawlers which parts of your site they should not crawl, and therefore should not index. With it, you steer crawlers toward the pages and directories you want crawled and away from the ones they should ignore.

Assuming that WordPress is installed in the directory /blog on your site, your posts can be crawled and then indexed in several ways:

mysite.com/blog/category/title-of-post.html

mysite.com/blog/page/1

mysite.com/blog/trackback/title-of-post.html

mysite.com/blog/tag/some-tag-on-the-post

This dilutes the power of your site, since the same page is indexed under several URLs. And of course you might also suffer a duplicate content penalty from search engines.

How to fix this problem?

Here is a sample robots.txt that fixes the above problem.

User-agent: *
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/trackback/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/images/

The last entry is the images directory, which you may or may not want search engines to crawl.
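One way to sanity-check rules like these before uploading them is to run the example URLs through a parser. The sketch below uses Python's standard urllib.robotparser module. The mysite.com URLs are the placeholders from above; the /blog/wp-admin/ URL, the "ExampleBot" agent name, and the helper name crawl_report are my own additions for illustration. The rules are written with /page/ and /tag/ so that they match the example URL forms. Treat it as a rough check only, since real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for a blog installed under /blog (paths written to match
# the example URL forms; adjust them to your own permalink structure).
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/trackback/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/images/
"""

# Placeholder URLs; the wp-admin one is an assumed extra test case.
EXAMPLE_URLS = [
    "http://mysite.com/blog/category/title-of-post.html",
    "http://mysite.com/blog/page/1",
    "http://mysite.com/blog/trackback/title-of-post.html",
    "http://mysite.com/blog/tag/some-tag-on-the-post",
    "http://mysite.com/blog/wp-admin/",
]


def crawl_report(robots_txt, urls, agent="ExampleBot"):
    """Return {url: bool}, where True means the agent may crawl the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}


if __name__ == "__main__":
    for url, allowed in crawl_report(ROBOTS_TXT, EXAMPLE_URLS).items():
        print("crawl" if allowed else "skip ", url)
```

With these rules, only the category URL should come back as crawlable, which is the single copy of the post we want indexed.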

If WordPress is installed in the root directory, then your robots.txt will look something like this:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /images/
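Since the two versions differ only in the path prefix, the file can be generated from the install directory. This is a minimal sketch, assuming the same list of blocked paths as the samples (with /page/ and /tag/ written to match the example URLs). The names build_robots_txt and BLOCKED_PATHS are hypothetical and are not part of WordPress.

```python
# Paths to block, relative to the WordPress install directory.
# Drop "images/" if you want search engines to crawl your images.
BLOCKED_PATHS = ["wp-", "feed/", "trackback/", "page/", "tag/", "images/"]


def build_robots_txt(install_dir="", blocked=BLOCKED_PATHS):
    """Build robots.txt text for WordPress installed in install_dir.

    install_dir is "" for a root install, or e.g. "blog" for /blog.
    """
    # Normalise "blog", "/blog" and "/blog/" to the same "/blog/" prefix.
    prefix = "/" + install_dir.strip("/")
    if not prefix.endswith("/"):
        prefix += "/"
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}{path}" for path in blocked]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("blog"))  # subdirectory install
    print(build_robots_txt())        # root install
```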

What does this mean?

User-agent: * means that the rules that follow apply to all crawlers.

Disallow: /wp- stops crawlers from fetching any WordPress-specific files and directories (wp-admin, wp-content, wp-includes and so on), because the rule matches every path that starts with /wp-.

The /feed/, /trackback/, /page/ and /tag/ rules instruct crawlers not to index your posts a second time through those paths. The posts will still be indexed directly via their own unique URLs.
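The reason a short rule like /wp- covers so much is that a Disallow value is matched as a prefix of the URL path. Here is a minimal sketch of that matching for a root install. It ignores the wildcard extensions that some crawlers support, and is_blocked is a hypothetical helper, not any real crawler's code.

```python
# Disallow values from the root-install sample (images left out here).
DISALLOWED_PREFIXES = ["/wp-", "/feed/", "/trackback/", "/page/", "/tag/"]


def is_blocked(path, prefixes=DISALLOWED_PREFIXES):
    """Return True if the URL path begins with any disallowed prefix."""
    return any(path.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    for path in ["/wp-admin/", "/wp-content/themes/style.css",
                 "/tag/seo", "/category/title-of-post.html"]:
        print("blocked" if is_blocked(path) else "allowed", path)
```

Note that the trailing slash matters: /tag/ blocks /tag/seo but would not block a post whose address merely starts with /tag, such as /tagged.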

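After you create the robots.txt file, upload it to the root directory of your site, so that it is reachable at mysite.com/robots.txt even if WordPress itself lives in a subdirectory. And there you have it, a few tips that make a world of difference for you!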
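On the upload step: crawlers look for the file at one fixed location, /robots.txt at the root of the host, regardless of where the blog is installed. The small sketch below shows how that location is derived from any page URL. The name robots_txt_url is hypothetical, and mysite.com is the placeholder domain used throughout this post.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://mysite.com/blog/page/1"))
```

A robots.txt uploaded to /blog/robots.txt would not sit at that location, so crawlers would not pick it up.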

Do you have any thoughts on robots.txt, or anything else that we can apply to our systems to avoid potential problems?


18 Responses to “How to create a “robots.txt” file?”

 
One of the Best robots.txt Explanations That I’ve Seen | Another Opinion Among Many Says -- November 13th, 2007 at 7:26 am

[...] of your content indexed by the search engines or image scrapers, then Khalid’s article on the invesp blog is a valuable resource.  It presents the necessary elements of a robots.txt file in the [...]

 
Steven Bradley Says -- November 13th, 2007 at 1:23 pm

Nice review Khalid. A little over a year ago I had some issues with Google indexing the feed of my blog throwing all my pages into the supplemental index. Traffic from Google started to approach nonexistent. The solution to the supplemental problem for me turned out to be some entries in my robots.txt file similar to what you added here.

 
Karl Says -- November 14th, 2007 at 1:12 am

Great tips for beginners, I bookmarked this page for reference!

 
Paul Says -- November 14th, 2007 at 2:00 pm

This is kinda technical but Ill try creating this robot.txt file. Thanks for the tutorial.

 
Propet Says -- November 15th, 2007 at 10:01 pm

Nice Article. I actually never used robots.txt and now I can see there are many benefits for it. Thanks, I will implement it.

 
This Week In SEO - 11/16/07 - TheVanBlog Says -- November 15th, 2007 at 10:44 pm

[...] How to create a “robots.txt” file? [...]

 
This Week In SEO | SEO:Search Says -- November 17th, 2007 at 7:02 pm

[...] How to create a robots.txt file? [...]

 
Chris Says -- November 24th, 2007 at 5:14 am

Thanks a lot for the tip that I’ve been longing for, you saved my day.

 
Leslie Says -- December 1st, 2007 at 1:34 pm

I need to know how to create the robot.text file in the first place, can you help?

 
Phil Booker Says -- December 27th, 2007 at 10:08 am

Ahh… Don’t crawl on certain pages is fine. but how do you stop all random crawls that try and recover emails and stuff from your website?
Please… I want Google, Yahoo and MSN etc but not the spamming crawlers.
What to do?
Phil

 
Quick Optimization Fix for WordPress Links Says -- February 24th, 2008 at 1:39 pm

[...] should automatically change the link structure. *Update:* I’ve found another quick fix at invesp.com with a quick fix to your robots.txt which also stops crawlers from double-indexing pages on your [...]

 
Quick SEO fix for Wordpress blogs Says -- February 26th, 2008 at 11:10 am

[...] Reading: I’ve found another quick fix at invesp.com with a quick fix to your robots.txt which also stops crawlers from double-indexing pages on your [...]

 
 
Quick SEO fix for Wordpress blogs Says -- March 22nd, 2008 at 7:27 pm

[...] change the link structure. Recommended Reading: I’ve found another quick fix at invesp.com with a quick fix to your robots.txt which also stops crawlers from double-indexing pages on your [...]

 
Andy Bolton Says -- March 28th, 2008 at 1:05 pm

Thanks mate… Makes things a little clearer. Keep these blogs coming I may want more.

 
Jason Says -- April 11th, 2008 at 12:58 am

Your blog is very informative, I have learned so much from it. It is like daily newspaper :). Added to fav

 
Afef Says -- May 29th, 2008 at 5:57 am

I just realized i didn’t have a robots text file - good thing i came across your blog post! Thanks for the info! Those are some really helpful tips for beginers! :)

 
xHTML Coding Says -- September 15th, 2008 at 10:23 am

With Wordpress, Google seems to like to dump the dupe content penalty in the tags and the archive pages.

The robots.txt is the best possible way to sort this major problem which could be a real pain in the…

 
