Marketing and Conversion Optimization Blog

About the Invesp Blog

This blog is brought to you by the team at Invesp Consulting, an e-commerce conversion optimization company.

Meet the authors of the Invesp blog: Ayat, Khalid, and Chris.

More about Invesp Consulting


By khalid on November 13, 2007 1:10 am
Posted in (Blogging)

The other day, I was examining our robots.txt file and discovered a few things that should be corrected in it. As I was doing that, I thought this would be a great blog topic. So, I decided to share a couple of things that can be helpful to others.

A “robots.txt” is a simple text file that tells search engine crawlers which parts of your site they should not crawl. By using the file, you direct crawlers to the pages and directories they may crawl and tell them which ones to ignore.

Assuming that WordPress is installed in the directory /blog on your site, your posts can be crawled and then indexed in several ways:

mysite.com/blog/category/title-of-post.html

mysite.com/blog/page/1

mysite.com/blog/trackback/title-of-post.html

mysite.com/blog/tag/some-tag-on-the-post

This dilutes the power of your site since the same content is indexed under several URLs. And of course, you might also suffer a duplicate content penalty from search engines.

How do you fix this problem?

Here is a sample robots.txt that fixes the above problem.

User-agent: *
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/trackback/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/images/

The last line covers the images directory, which you may or may not want search engines to crawl.
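Before uploading a file like this, it can help to check what it actually blocks. Here is a minimal sketch using Python's standard urllib.robotparser module. The is_allowed helper and the mysite.com domain are placeholders for illustration, and the rules are written with the singular /blog/page/ and /blog/tag/ prefixes so that they match the example URLs listed earlier. Note that the directives sit on consecutive lines, because this parser treats a blank line as the end of a group of rules.

```python
from urllib.robotparser import RobotFileParser

# Rules for a WordPress install under /blog (placeholder site: mysite.com).
# Directives are kept on consecutive lines: a blank line ends the group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/wp-
Disallow: /blog/feed/
Disallow: /blog/trackback/
Disallow: /blog/page/
Disallow: /blog/tag/
Disallow: /blog/images/
"""


def is_allowed(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a crawler identifying as `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://mysite.com" + path)


if __name__ == "__main__":
    for path in [
        "/blog/category/title-of-post.html",
        "/blog/page/1",
        "/blog/trackback/title-of-post.html",
        "/blog/tag/some-tag-on-the-post",
    ]:
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under these rules only the first URL, the post's own address, stays open to crawlers. Each Disallow value is matched as a prefix of the URL path, so a rule for /blog/pages/ would not cover /blog/page/1.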

If WordPress is installed in the root directory, then your robots.txt will look something like this:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /images/
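Since the two sample files differ only in the directory prefix, one way to avoid mistakes is to generate them from a single list. The following is just a sketch: build_robots_txt and BLOCKED are names made up for this example, and the list uses the singular page/ and tag/ segments that appear in the example URLs.

```python
# Path segments blocked in the sample files; only the prefix differs per install.
BLOCKED = ["wp-", "feed/", "trackback/", "page/", "tag/", "images/"]


def build_robots_txt(install_dir=""):
    """Build robots.txt text for WordPress installed at `install_dir`.

    Pass "" for a root install, or a name such as "blog" for /blog.
    """
    prefix = "/" + install_dir.strip("/")
    if not prefix.endswith("/"):
        prefix += "/"
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}{segment}" for segment in BLOCKED]
    # No blank lines between directives, so they stay in one group.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("blog"))
    print(build_robots_txt())
```

Drop "images/" from the list if you want search engines to crawl your images.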

What does this mean?

User-agent: * > This means that the rules in the file apply to all crawlers.

Disallow: /wp- > This stops crawlers from crawling any WordPress-specific files and directories whose paths start with /wp-, such as wp-admin and wp-includes.

/feed/

/trackback/

/page/

/tag/

These directives keep crawlers from indexing the same posts a second time through feed, trackback, paging, and tag URLs. The posts will still be indexed through their own unique URLs.

After you create the robots.txt, upload it to the root directory of your site, so that it is reachable at mysite.com/robots.txt even if WordPress itself lives in /blog. And there you have it, a few tips that make a world of difference for you!

Do you have any thoughts on robots.txt or anything else that we can apply to our systems in order to avoid potential problems?
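Crawlers request robots.txt from the root of the host, not from the directory where the blog lives. As a small sketch of that rule (robots_url is a hypothetical helper and mysite.com is a placeholder), this is the address a crawler would look up for any page on the site:

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt URL that applies to the host of `page_url`."""
    # A leading slash makes urljoin replace the whole path of page_url.
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    print(robots_url("http://mysite.com/blog/category/title-of-post.html"))
```

A file uploaded to mysite.com/blog/robots.txt would not be picked up.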