Search bots and robots.txt

What is robots.txt and why do we need it?

Search engines run small programs, called bots, that regularly crawl the pages of a website and index them, so that they can be shown to users in search results. These bots periodically go through websites on the internet and crawl all the accessible pages. Each search engine, such as Google, Bing, or Yahoo, has its own bots.

Robots.txt is a file that tells these bots how to crawl the website.

It tells them what is allowed to be crawled, and what is not. It also gives them the location of a file called a sitemap, which lists all the pages that should be crawled. It is good practice to have a robots.txt on your website.
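To make this concrete, here is what a minimal robots.txt might look like (the domain and paths are made up for illustration):

```
User-agent: *
Disallow: /drafts
Sitemap: https://www.example.com/sitemap.xml
```

This hypothetical file tells every bot to skip the /drafts section and points them to the sitemap. Each of these terms is explained in detail below.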

Where is robots.txt kept?

Robots.txt is always kept in the root folder (/) of the website. For example, if the website is www.example.com, the robots.txt will be at

www.example.com/robots.txt

Since the file is meant for bots, it is publicly accessible, and most websites have one. In fact, you can check Google’s robots.txt at https://google.com/robots.txt

Robots.txt of Google.com website.
Fun Exercise:
Try this for any website you regularly visit: just type the main website URL and add "/robots.txt" at the end.

Why do we need a robots.txt?

There are many reasons to have such a file on our website. Here are the most important:

  1. We may not want certain pages to show up on the public internet search.

    For example, certain policy pages that are meant more for employees than the general public, or a demo section, or a part of the site still under development. Robots.txt can tell bots not to crawl these pages.
  2. Most bots do not crawl the pages of a website indefinitely. In fact, they have a budget (intuitively called the crawl budget) for how many pages they will crawl in one go.

    So, to make sure our best pages show up in search results, we want the bots to spend the crawl budget on the right pages. Robots.txt helps with that.

Understanding a robots.txt file

The robots.txt file is actually easy to understand. There are five main terms in it (not all are mandatory), and only one instruction is written per line.

1. User-agent

The term “User-agent:” specifies which bot the instructions below it apply to.

For example:
User-agent: * – means the instructions under it apply to all bots.
User-agent: Googlebot – means the rules below this apply to Google’s bots.
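You can check how User-agent targeting plays out with Python’s built-in urllib.robotparser module. This is just an illustrative sketch; the rules and example.com URLs below are hypothetical:

```python
import urllib.robotparser

# Hypothetical rules: Googlebot may crawl everything except /drafts,
# while all other bots are blocked from the whole site.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /drafts",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Googlebot", "https://www.example.com/news"))         # True
print(rp.can_fetch("Googlebot", "https://www.example.com/drafts/post"))  # False
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/news"))      # False
```

Each bot follows the most specific User-agent group that matches it, so Googlebot ignores the catch-all * block here.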

2. Disallow

Disallow: specifies which sections of the website are not allowed to be crawled.

For example:
Disallow: / – means no part of the website is allowed to be crawled.
Disallow: /images – means nothing under the /images folder of the website may be crawled (e.g. www.example.com/images/toy.png).
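A quick way to verify what a Disallow rule blocks is again urllib.robotparser (the rule and URLs are hypothetical):

```python
import urllib.robotparser

# Hypothetical rule: block everything under /images for all bots.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images",
])

print(rp.can_fetch("*", "https://www.example.com/images/toy.png"))  # False
print(rp.can_fetch("*", "https://www.example.com/about"))           # True
```

Note that Disallow matches by path prefix, so /images blocks every URL underneath it.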

3. Allow

You can also specify which sections of the website are allowed to be crawled using the syntax “Allow:”.
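Allow is typically used to open up a subsection of a disallowed folder. Here is a hedged sketch with urllib.robotparser (paths hypothetical); note that Python’s parser applies the first matching rule in file order, so the more specific Allow line is listed before the broader Disallow:

```python
import urllib.robotparser

# Hypothetical rules: /images is blocked, but /images/public is explicitly allowed.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /images/public",
    "Disallow: /images",
])

print(rp.can_fetch("*", "https://www.example.com/images/public/logo.png"))  # True
print(rp.can_fetch("*", "https://www.example.com/images/secret.png"))       # False
```

Major crawlers such as Googlebot resolve Allow/Disallow conflicts by the most specific (longest) matching rule, so ordering the Allow line first keeps both interpretations consistent.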

Fun exercise: Can you look at the robots.txt of nytimes.com and explain which sections of their website they are allowing and disallowing, and for which bots?

4. Sitemap

This mentions the location of the sitemap on the website. As mentioned earlier, a sitemap lists all the URLs that bots should crawl. It is good practice to include this parameter in the robots.txt file.
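urllib.robotparser (Python 3.8+) can also read Sitemap lines out of a robots.txt; the sitemap URL below is hypothetical:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private",
    "Sitemap: https://www.example.com/sitemap.xml",
])

# site_maps() returns the listed sitemap URLs (or None if there are none).
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```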

If you go to the robots.txt of the NYTimes website, you will see several sitemaps listed at the bottom.

5. Crawl-delay

There is another, less often used parameter called Crawl-delay. It tells a bot how many seconds it should wait after crawling one page before moving to the next.

This is often used to prevent too many bots crawling at once from overwhelming the website and bringing it down.
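urllib.robotparser can read this value too (the 10-second delay below is a made-up example):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# crawl_delay() returns the delay (in seconds) for the given user agent.
print(rp.crawl_delay("*"))  # 10
```

Keep in mind that not all crawlers honour Crawl-delay; Googlebot, for instance, ignores it.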

Final thoughts

This is basically everything you would need to know about a robots.txt file, unless you plan to edit it yourself. As a business owner or a marketing head, you should check whether your website has a robots.txt and whether it is instructing the bots to crawl the right pages.

But if you wish to learn more about it, here are some really good resources:

  1. Semrush on Robots.txt
  2. Google search help doc on Robots.txt
  3. Ahrefs on Robots.txt

Further reading on SG.com

You can read all our SEO blogs here. Some notable mentions you might find useful are on how to improve your website loading speed without spending a bomb and how to set the right SEO budget and measure its ROI.

If you are serious about growing your organic traffic, you must link your website to Google Search Console (GSC). This blog explains how to link your website to GSC.

That’s all for now.

To greater organic traffic on your websites,
Shyam
