Search engines run small programs, called bots, that regularly crawl through all the pages of a website and index them, so that the pages can be shown to users in search engine results. These bots periodically go through every website on the internet and crawl all the accessible pages on it. Each search engine, such as Google Search, Bing, or Yahoo, has its own bots.
Robots.txt is a file that tells these bots how to crawl the website.
It tells them what is allowed to be crawled and what is not. It also gives them the location of a file called a Sitemap, which lists all the pages that need to be crawled. It is good practice to have a robots.txt on your website.
Where is robots.txt kept?
Robots.txt is always kept in the root folder (/) of the website. For example, if the website is www.example.com, the robots.txt will be at www.example.com/robots.txt.
Since the file is for bots, it is publicly accessible. Most websites will have one. In fact, you can check Google’s robots.txt at https://google.com/robots.txt
Fun Exercise: Try this for any of the websites you regularly visit. Just type the main website URL and add "/robots.txt" at the end.
Why do we need a robots.txt?
There are many reasons why we would want such a file on our website. Here are the most important reasons:
- We may not want certain pages to show up in public internet search.
For example, policy pages that are meant more for employees than the general public, a demo section of a website, or a section still under development. Robots.txt can tell bots not to crawl those pages.
- Most bots do not crawl the pages of a website indefinitely. They have a budget (intuitively called the crawl budget) for how many pages they will crawl in one go.
So, to make sure our best pages show up in search results, we want the bots to spend that crawl budget on the right pages. Robots.txt helps with that.
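Putting these two reasons together, a minimal robots.txt might look like the sketch below. The domain, the /internal-policies and /demo paths, and the sitemap URL are all made up for illustration:

```
# Hypothetical robots.txt for www.example.com
User-agent: *
Disallow: /internal-policies/
Disallow: /demo/

Sitemap: https://www.example.com/sitemap.xml
```

This keeps bots away from employee-facing and unfinished sections, so the crawl budget goes to the pages you actually want indexed.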
Understanding a robots.txt file
The robots.txt file is actually easy to understand. There are five main terms in it (not all of them are mandatory), and only one instruction is mentioned per line.
The term “User-agent:” mentions which bot the instructions below it apply to.
User-agent: * – means the instructions under it apply to all bots.
User-agent: Googlebot – means the rules below this line apply only to Google's bot.
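For instance, a single file can contain separate groups of rules for different bots. Googlebot is a real user-agent name; the paths here are made up:

```
# Rules for every bot
User-agent: *
Disallow: /drafts/

# Stricter rules just for Google's crawler
User-agent: Googlebot
Disallow: /drafts/
Disallow: /experiments/
```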
Disallow: mentions which section of the website is not allowed to be crawled.
Disallow: / – means no section of the website is allowed to be crawled.
Disallow: /images – means any page under the folder /images on the website is not allowed to be crawled (e.g., www.example.com/images/toy.png).
You can also mention which sections of the website are allowed to be crawled using the syntax “Allow:”.
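To see how Allow and Disallow interact, here is a small sketch using Python's built-in robots.txt parser from the standard library. The rules and URLs are made up for illustration. Note that Python's parser applies the first matching rule in file order; some real crawlers (such as Googlebot) instead use the most specific matching rule:

```python
# Sketch: evaluating Allow/Disallow rules with Python's built-in parser.
# All rules and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /images/public
Disallow: /images
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Disallowed: the path starts with /images and no earlier rule allows it.
print(rp.can_fetch("*", "https://www.example.com/images/toy.png"))       # False
# Allowed: the more specific Allow rule is listed first and matches.
print(rp.can_fetch("*", "https://www.example.com/images/public/a.png"))  # True
# Allowed by default: no rule mentions /about at all.
print(rp.can_fetch("*", "https://www.example.com/about"))                # True
```

This is also a handy way to sanity-check your own robots.txt before publishing it.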
Fun exercise: Can you look at the New York Times' robots.txt (at nytimes.com/robots.txt) and explain which sections of their website they allow and disallow crawling, and for which bots?
“Sitemap:” mentions the location of the sitemap on the website. As mentioned earlier, a sitemap contains the list of all the URLs that bots should crawl. It is good practice to have this parameter in the robots.txt file.
If you go to the robots.txt of the NYTimes website, you will see its sitemaps mentioned at the bottom.
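A sitemap entry is a single line with a full URL, and a file can list several of them. A hypothetical sketch (these URLs are made up, not the NYTimes' actual ones):

```
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
```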
There is another, less often used parameter called Crawl-delay. It tells the bot how many seconds to wait after crawling a page before moving on to the next one.
This is often used to avoid overwhelming the website with too many rapid requests and bringing the website down.
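A Crawl-delay rule sits inside a User-agent group. This sketch (with a made-up path) asks every bot to wait 10 seconds between pages:

```
User-agent: *
Crawl-delay: 10
Disallow: /search
```

Note that Crawl-delay is not part of the original robots.txt standard, and some major crawlers, including Googlebot, ignore it.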
This is basically everything you need to know about a robots.txt file, unless you plan to edit it yourself. As a business owner or a marketing head, you should check that you have a robots.txt on your website and that it is instructing the bots to crawl the right pages.
But if you wish to learn more about it, here are some really good resources:
That’s all for now.
To greater organic traffic on your websites,