The robots.txt file is a file that helps us manage which pages search engine crawlers can access.
It should be located at the root of the site, and we should not use it to control what we want to position or to manage what we want indexed. The robots.txt helps us control which pages we want crawled and thus optimize, at least in part, our crawl budget.
It is important to take several aspects into account when creating our robots.txt file:
- We must place it at the root of the site, and there must be only one robots.txt per domain
- It must be a plain text file
- It can be made up of one or several rules that allow or block bots’ access to a given route or URL.
When creating our robots.txt file, we must follow a series of steps. If you want to create a robots.txt file that helps you improve your web positioning, keep reading.
What is Robots.txt?
The robots.txt is a plain, unformatted text file that contains a series of directives aimed at the different web crawlers.
The first thing that the bots of the different search engines do when they arrive at a website is look at the root of the domain, which is where the robots.txt file is located, read it, and start crawling the site.
The robots.txt is not a way to prevent a page from appearing in Google. If we want to control what appears in the SERPs we must use the "noindex" tag, leave the page out of the sitemap, or protect the page with a password, a registration form, or a chastity belt not suitable for bots.
Before continuing I am going to leave you this video so you can see the difference between crawling and indexing:
Difference Between Crawling and Indexing
How useful the robots.txt file is in SEO
Many times we know that we have to do things but we do not understand what those things are for.
My goal today is for you to understand what the robots.txt file is for.
The robots.txt will help manage how web pages are crawled by the different bots, spiders, or crawlers (call them whatever you want) of the search engines.
Thanks to the robots.txt file we can also prevent search engines from crawling certain files.
What we must be clear about is that the robots.txt file is not used to prevent indexing; for that we must use the noindex tag and leave the URL out of the sitemap. Another method may be to protect the page with a password.
Many times, forgive me father because I have sinned, we put a nofollow on a page to avoid its indexing. However, if this page is linked from elsewhere on the web, it may end up in Google's search results.
Where is the robots.txt file located? Rules to keep in mind
The robots.txt is located at the root of the domain and applies to that domain, although if there are subdomains each one needs its own robots.txt file.
One of the peculiarities of the robots.txt file is that it is not mandatory; it is only recommended.
Some rules that we must take into account are:
- The file must be named robots.txt
- We should only have one robots.txt file per site
- We must include it at the root of the domain
- These rules also apply to subdomains
- It must be encoded in UTF-8
Therefore, to create our robots.txt file we can use our computer's notepad, but not word processors such as Microsoft Word.
Directives supported by the robots.txt
The directives are the rules that the different crawlers read and they will always be: user-agent, allow, disallow, and sitemap.
Let’s see what each of these rules that make up a robots.txt file consists of.
User-Agent
The user-agent refers to the bots to which the rules that we place just below that line are directed.
Each set of rules that we introduce in the robots.txt must be aimed at a crawler, or in other words: each set of rules must be aimed at a user-agent.
When we put an asterisk (*) as a user-agent, it applies to all crawlers.
Let’s say the asterisk is the wildcard.
Although there is an exception: the asterisk does not apply to ad crawlers (such as AdsBot). For these crawlers, we must create a set of rules that names them explicitly.
When we include several user-agents in the robots.txt, each crawler will pay attention exclusively to its own block of directives and will discard the rest.
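To illustrate (the folder names are just hypothetical examples), a robots.txt with two user-agent blocks could look like this:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/

In this sketch, Googlebot only obeys its own block, so it would crawl /private/ but not /drafts/, while every other crawler would do the opposite.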
Disallow
In each of the groups of directives that we introduce in the robots.txt, there must be at least one line that includes a disallow or an allow.
This rule indicates which directory or page of your domain you do not want the user-agent to crawl.
We must not include the full URL in the same way it appears in the browser. We only have to include what goes after the ".com": it must start with "/", it can be the path of a page or a folder, and a folder should also end with a slash.
Examples:
If I don’t want it to crawl the contact page:
Disallow: /contact/
If I don’t want the SEO podcast folder to be crawled:
Disallow: /category/podcast-SEO/
Allow
As we said before, each rule that we introduce must have at least one line that includes allowing or disallowing.
This directive indicates which directories or pages we want search engine crawlers to crawl.
In addition, it serves to create an exception to a disallow or to override it.
It works the same way as disallow: start with "/"; if it's a page, write it the same way it appears in the browser after the domain, and if it is a directory, include the folder between "/".
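For example, reusing the podcast folder from before (the episode page is a hypothetical example), an allow line can open up a single page inside a blocked directory:

Disallow: /category/podcast-SEO/
Allow: /category/podcast-SEO/episode-100/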
Sitemap
It is interesting to include the sitemap in our robots.txt file; this way we indicate to the crawlers the location of the sitemap and we make it easier for them to find the content that we want taken into account for indexing.
The asterisk can be used in any rule except in the sitemap directive, and we can place it at any point within a path.
The sitemap URL is the only one that must be included in full (as an absolute URL) in the robots.txt.
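A sitemap line, with a hypothetical example domain, would look like this:

Sitemap: https://www.example.com/sitemap.xml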
How to create a robots.txt file in 3 steps
The robots.txt file is made up of groups, and each group is made up of the following elements:
- User-agent
- One or more allow or disallow
That's one block; each robots.txt can have multiple blocks, and each block must target a user-agent. An example of a robots.txt block would be:
User-agent: *
Disallow: /2018/
Allow: /2018/award-we-are-the-best/
What we are saying is that no crawler that goes through our website should crawl the 2018 folder. However, in this folder there is content from an award that we received that I do want crawled.
As we said before, the file must use UTF-8 encoding. You can create it with a notepad.
Now that we know the elements that make up the robots.txt file, we are going to assemble it. To create the robots.txt we must add a series of rules that tell the crawlers what they should or should not crawl.
How to add rules to the robots.txt
- The robots.txt consists of one or more groups
- Each group consists of several rules
- Add one rule per line
- Each group must start by specifying the user-agent
- In each group we must provide the following information:
- Which crawlers the information is directed to (user-agent)
- Which directories or files that crawler cannot access
- What directories or files the crawler can access
- The rules are case-sensitive.
- The hash sign (#) marks the start of a comment.
- Adding the sitemap at the end of all the robots.txt directives helps facilitate the crawling of the web for later indexing; you can see a complete example right after this list.
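Putting those rules together, a complete robots.txt could look like this (the paths and domain are hypothetical examples, typical of a WordPress site):

# Rules for all crawlers
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Rules only for Googlebot
User-agent: Googlebot
Disallow: /internal-search/

# Sitemap at the end, outside any block
Sitemap: https://www.example.com/sitemap.xml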
Test that the robots.txt file works correctly
The best way to check if the robots.txt works is through Google Search Console.
The robots.txt Tester tool helps us check the accessibility of certain pages.
In this way, we will know whether we have configured the robots.txt correctly or whether we have blocked something that we should not have blocked.
After testing that our robots.txt directives are correct, the next thing you have to do is… nothing.
With this, the next time the crawlers access your web page, they will know what to expect when they perform their crawl.
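If you also want to check it from code, here is a minimal sketch using Python's standard urllib.robotparser module; the domain and URLs are hypothetical examples:

from urllib.robotparser import RobotFileParser

# Download and parse the live robots.txt file
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# can_fetch() returns True if the rules let that user-agent crawl the URL
print(parser.can_fetch("Googlebot", "https://www.example.com/2018/award-we-are-the-best/"))
print(parser.can_fetch("*", "https://www.example.com/2018/"))

This does not replace Search Console's tester, but it is handy for quickly checking a batch of URLs.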
How to create a robots.txt file step by step
Everything I have told you about directives, format, and location has to be captured in our robots.txt file. Taking advantage of the fact that I had to optimize my robots.txt, I have recorded this video to show you how it is done:
Learn how to configure your robots file step by step
Limitations of the robots.txt to be aware of
- We must be clear that these rules are not binding: the crawlers themselves are the ones who decide whether to follow the directives or not.
- We should also keep in mind that each crawler may interpret the syntax differently.
- As we said: it is possible that a page we have disallowed still appears on the search engine results page because it is linked from other places.
- If the page is not crawlable, Google cannot read its content. But be careful: it can still show that URL (with a very limited snippet) in the SERPs.
- The only block that does not need to contain a user-agent line is the sitemap block.
URL matching and how to work with it in robots.txt
Within a site, many URLs share a common syntax, and to address them in the robots.txt there is a series of wildcards that make the task easier.
Here are the wildcards that will be very useful when configuring your robots.txt file:
- The asterisk (*). It is the wildcard and indicates that any character is valid.
- The dollar sign ($) indicates the end of a URL.
If we put, for example:
Disallow: /*mark*.jpg > we are telling the crawler that any path containing "mark" and, later on, ".jpg" will not be crawled.
This rule means that, for example, the following will not be crawled:
- content-marketing.jpg
- marketers.jpg
- mark-a-goal.jpg
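The dollar sign works at the other end of the URL. A hypothetical rule to block only URLs that end exactly in ".pdf" would be:

Disallow: /*.pdf$

With this, /guide.pdf would be blocked, while /guide.pdf?download=1 would still be crawlable, because it does not end in ".pdf".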
Priorities when following robots.txt directives
When a crawler reaches our root domain and finds the robots.txt file, it applies the rules according to the following order of priority:
- Specificity rules. In other words: the most specific rule, based on the length of its path, takes priority.
- If there are conflicting policies, Google opts for the more lax one (least restrictive).
We start from the premise that, by default, Google crawls absolutely everything.
Therefore, in the robots.txt we should not add allow directives gratuitously, because allow is only used to override a previous disallow (to mark an exception).
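Going back to the earlier block, we can see both priority rules in action:

Disallow: /2018/
Allow: /2018/award-we-are-the-best/

For the URL /2018/award-we-are-the-best/ both rules match, but the allow wins because its path is longer (more specific), so the award page is crawled while the rest of the /2018/ folder stays blocked.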
Conclusions when configuring the robots.txt file
Configuring the robots.txt file correctly is a very good way to optimize the crawling of our web page and to have control over what we want web crawlers to see.
We should not do it in a rush; to configure it well we must analyze all the URLs we have, including those generated by internal searches or when we share our content on Facebook.
In the same way, we must be especially careful with everything we put in disallow because we may block a URL that is important for our positioning.
For this reason, it is just as important to take time when configuring our robots.txt file as it is to take time to verify that it works perfectly for the pages that are relevant to our positioning.
Now that you know this, do you dare to optimize your robots.txt file?