A Blogger’s Pocket Guide to Meta Robots and Robots.txt

Bookmark and save for later.

[This is part of The Blogger’s Essential Guide to Search Engine Optimization Series.]

We’ve previously had an overview of metadata and why these general elements matter for your blog (so feel free to review). In this post, and later in the series, I want to dive into specific meta elements that you, as a blogger, should know about.

First up: robots.txt and the meta robots tag. Not too sure what they are? The key difference is that robots.txt is a text file that sits on your server, while the meta robots tag is a line of code typically found in the header of a page.

So, quick review: One is a physical file while the other is a line of code. Both can control whether a search engine crawler, bot, or spider crawls or indexes a page on your blog or site. They can also tell robots whether they should follow the links on that particular page (similar to nofollow).
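To make the distinction concrete, here is a minimal side-by-side sketch; the path and values are placeholders, not recommendations:

    # robots.txt: a plain text file at the root of your server
    User-agent: *
    Disallow: /example-page/

    <!-- meta robots: a tag inside the <head> of a single page -->
    <meta name="robots" content="noindex, nofollow">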

So why should the typical blogger care about that type of stuff? It’s actually very simple: by becoming a better manager of your pages you are able to control the passing of value (PageRank) and prioritize “link juice” to the pages and blog posts that matter the most.

Here’s a general example:

  1. You have a number of static pages that aren’t used very heavily but still exist for your use or for particular visitors’ use.
  2. These pages begin to accumulate links and acquire link juice, but they do not fit into your blog’s linking architecture.
  3. By using a robots.txt file (or, better in this case, a meta robots tag) you can stop the passing of value by asking the search engines not to crawl or index them in ways that might hurt your overall search engine ranking (see the sketch after this list).
  4. In other words, you don’t want any of these pages coming up in SERPs before your more impactful and important blog posts and pages. Heck, you don’t even want the possibility of diluting your results for relevant search queries.
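Here’s what that might look like for one of those static pages. The tag itself is a sketch; the decision that matters is which page’s head you put it in:

    <!-- placed in the <head> of the rarely-used static page -->
    <meta name="robots" content="noindex">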

Make sense? Here’s another even more practical example:

  1. You have an Archives Page and/or Sitemap (for humans) that has (obviously) tons of links to pages and blog post content.
  2. This page is going to accumulate a lot of value and PageRank, but you don’t want search engines to return your archives or sitemap on a search engine results page, especially before content and blog posts that address the user’s needs more directly.
  3. You specify via robots that you don’t want search engines to index these particular pages.
  4. You now breathe a sigh of relief as you’ve optimized your search engine returns.
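Assuming your archives live at a hypothetical /archives/ address, either mechanism can express this:

    # robots.txt version: ask all crawlers to skip the archives
    User-agent: *
    Disallow: /archives/

    <!-- meta robots version: placed in the <head> of the archives page -->
    <meta name="robots" content="noindex">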

Make sense? This is a very simple example, but it should help you understand the general concept.

How to Create a Robots.txt File:

This is actually quite easy as you can quickly create a robots.txt file on your computer with just a text editor.

Just create a new file, save it as robots.txt and you’re done.

But you’ll want to add some directives to tell crawlers exactly what to do. Here are some of the rules you can include:

Allow All Content for All Crawlers:

    User-agent: *
    Allow: /

Most blogs run with this robots.txt file, as it simply lets everything through.
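A small aside: Allow is a later extension that the major engines support; the original robots.txt convention says the same thing with an empty Disallow line, so you may see this equivalent form in the wild:

    # the classic way to say "allow everything"
    User-agent: *
    Disallow: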

Block All Content from All Crawlers:

    User-agent: *
    Disallow: /

Block a Specific Folder for a Specific Web Crawler:

    User-agent: Googlebot
    Disallow: /so-sad-googlebot/

Please note the following user agents:

  • Google – Googlebot
  • Bing, MSN, Live Search – msnbot
  • Yahoo! – Slurp
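You can also stack groups for several of these crawlers in a single file. A sketch, with made-up folder names:

    # rules for Google's crawler
    User-agent: Googlebot
    Disallow: /no-google/

    # rules for Yahoo!'s crawler
    User-agent: Slurp
    Disallow: /no-yahoo/

    # everyone else may crawl everything
    User-agent: *
    Disallow: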

Block a Specific Page for a Specific Web Crawler:

    User-agent: Googlebot
    Disallow: /so-sorry-googlebot/page.html

Allow a Specific Page for a Specific Web Crawler:

    User-agent: *
    Disallow: /bots/page.html

    User-agent: Googlebot
    Allow: /bots/page.html

Sitemap Parameter:

    User-agent: *
    Disallow:

    Sitemap: http://www.URL.com/sitemap-folder/sitemap.xml

Sweet, right?

OK, OK. I’ll let you in on a secret that’ll make all this much easier: you can have Google create the robots.txt file for you through Google Webmasters.


All you have to do is generate what you need, download the file, and then upload it to the root directory of your blog!

Crawlers only look for the file at the root of your site and/or blog, so that’s where it needs to live.
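Using a made-up domain, that means the file must answer at exactly this address:

    # crawlers look here, and only here
    http://www.yourblog.com/robots.txt

    # they will never find it here
    http://www.yourblog.com/blog/robots.txt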

Easy-peasy.

How to Create Meta Robots Tags:

Not interested in using robots.txt? That’s fine. You can use a meta tag in your header and declare your intentions there via some code. Just make sure that you place all of your content values in one meta tag: this is highly recommended, makes the tag easier to read, and reduces the chance of conflicts (which would be bad).

For example,

    <meta name="robots" content="noindex, nofollow">

will be interpreted the same way as:

    <meta name="robots" content="noindex">
    <meta name="robots" content="nofollow">
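If you’re wondering where exactly the tag lives, here’s a minimal page skeleton (all of the markup below is placeholder):

    <!DOCTYPE html>
    <html>
    <head>
      <title>My Rarely-Used Page</title>
      <!-- the meta robots tag sits alongside your other meta tags -->
      <meta name="robots" content="noindex, nofollow">
    </head>
    <body>
      ...
    </body>
    </html>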

In terms of conflicts, Google, as an example, will only honor the more restrictive declaration. So, if it saw the following pair, it would use the NOINDEX:

    <meta name="robots" content="index">
    <meta name="robots" content="noindex">

Got it? Cool. Here are some other things that you might care about:

Spacing

Googlebot understands spacing differences in your declaration. You must use a comma between each value, but you don’t need a space after it (if you don’t want one). Both of these are the same:

    <meta name="robots" content="noindex, nofollow">
    <meta name="robots" content="noindex,nofollow">

Casing

Googlebot also understands lowercase and uppercase. These are all the same:

    <meta name="robots" content="NOINDEX, NOFOLLOW">
    <meta name="robots" content="NoIndex, NoFollow">
    <meta name="robots" content="noindex, nofollow">

Got it? Good. Here are some of the other tag values that you can use:

  • NOINDEX – This prevents the page from being included in the index.
  • NOFOLLOW – This prevents Googlebot from following any links on the page. Note that this is different from the nofollow attribute applied at the individual link level.
  • NOARCHIVE – This prevents search engines from saving a cached copy of the page and offering it in SERPs.
  • NOSNIPPET – Don’t like a description appearing below the page link in search returns? Stop the crawlers from capturing and displaying it.
  • NOCACHE – Same as NOARCHIVE but used by MSN/Live.
  • NOODP – This blocks the Open Directory Project description from being used in the search page description. I talk more about DMOZ here.
  • NOYDIR – This is like NOODP but for Yahoo!
  • NONE – This is the same as “NOINDEX, NOFOLLOW”.
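These values can be mixed and matched, and you can aim a tag at a single crawler by naming it in place of “robots”. A sketch, with an arbitrary combination of values:

    <!-- applies to all crawlers -->
    <meta name="robots" content="noindex, noarchive">

    <!-- applies to Googlebot only -->
    <meta name="googlebot" content="nosnippet">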

You can, of course, learn more from each of the major search engines, as they all publish documentation of their own as well.

You’re so pro!

So, What is the Difference? What Should I Use?

There are a few varying perspectives on this, and your decision to use one over the other depends on your goals as a webmaster and site architect.

Generally, though, robots.txt tells search engines not to crawl a given URL while still allowing them to keep the page in the index and display it as a result. This happens when other pages have linked to it historically (or presently).

A meta NOINDEX tag tells search engines they can visit the page but are not allowed to present it in search results. Many SEO professionals prefer this approach when doing any site sculpting.

Please remember that the content itself is still available to anyone who has a direct link! Also, anything in your robots.txt file is freely available to be read and interpreted by humans and bots, and in some cases even exploited by someone with malicious intent.
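To see why, remember that anyone can type yourblog.com/robots.txt into a browser. A hypothetical file like this practically hands a snoop a map:

    # anyone (human or bot) can read this file
    User-agent: *
    Disallow: /private-drafts/
    Disallow: /secret-admin/

If a page genuinely needs to stay private, protect it with a password or authentication rather than a robots directive.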

Good luck and let me know if you have any questions!

[This is part of The Blogger’s Essential Guide to Search Engine Optimization Series.]
