Search engines use robots (also called user-agents) to crawl your pages. The robots.txt file is a text file that defines which parts of a domain a robot may crawl. In addition, the robots.txt file can include a link to the XML sitemap.
The robots.txt file is also known as the robots exclusion protocol or standard. It is a simple file that tells search engine bots which parts of your website they can and cannot crawl. You can also tell search bots about pages you do not want crawled, such as areas that contain duplicate content or are still under development.
A robots.txt file starts with a User-agent line, under which you can add further directives such as Allow, Disallow, or Crawl-delay. Writing the file manually takes a lot of time, but with this tool you can generate it in seconds.
The basic format of the robots.txt file is:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
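For example, a minimal file that blocks one directory for all crawlers and points to a sitemap might look like this (the domain and the /admin/ directory here are purely illustrative):

```
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
```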
Do not assume this is trivial: one wrong line or tiny mistake can exclude your page from the indexation queue.
Note: Make sure you don't add your main page to a Disallow directive.
If you generate a robots.txt file, you should be aware of a few important terms used in it. There are five standard terms you're likely to come across in a robots.txt file:
- User-agent: The specific web crawler (usually a search engine bot) to which you are giving crawl instructions.
- Disallow: This directive instructs the web crawler not to crawl the specified URL path. Only one Disallow line is allowed per rule.
- Allow: This directive tells the web crawler that it may crawl the specified URL. For Googlebot, it can grant access to a page or subfolder even though its parent folder is disallowed.
- Crawl-delay: This directive specifies how many seconds the web crawler should wait before loading and crawling page content. Different search engines treat Crawl-delay differently. For Bing, it is a time window in which the bot will visit the site only once. For Yandex, it is a wait between successive visits. Googlebot ignores this directive; instead, you can set the crawl rate in Google Search Console.
- Sitemap: This directive calls out the location of any XML sitemap(s) associated with the domain. Currently, Google, Bing, and Yahoo support it.
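You can see how a crawler interprets these directives programmatically. The sketch below uses Python's standard urllib.robotparser module with a made-up rule set and URLs; Allow is listed before the broader Disallow so the exception is matched first:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt for illustration only.
rules = """\
User-agent: *
Crawl-delay: 10
Allow: /private/public-note.html
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Paths outside /private/ are crawlable by default.
print(parser.can_fetch("*", "https://example.com/blog/post"))                 # True
# /private/ is disallowed for every user-agent...
print(parser.can_fetch("*", "https://example.com/private/data.html"))         # False
# ...except the explicitly allowed page.
print(parser.can_fetch("*", "https://example.com/private/public-note.html"))  # True
# The requested delay between fetches, in seconds.
print(parser.crawl_delay("*"))                                                # 10
```

Note that this parser applies rules in file order, which is why the Allow line comes first; Googlebot instead uses the most specific (longest) matching rule.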
How to make a robots.txt file for Google robots using a robots.txt file generator?
Manually creating a robots.txt file is complicated, but online tools make the process relatively easy.

To generate the robots.txt file:
- Open the Robots.txt Generator.
- When you open the tool, you will see several options. Not all of them are mandatory, but choose carefully. The first row contains default values for all robots/web crawlers and a crawl delay. If you want a crawl delay, select the value in seconds according to your requirements.
- The second row is for your sitemap. Make sure you have one, and don't forget to mention it in your robots.txt file.
- The next few lines list the search engine bots. If you want a specific search engine bot to crawl your website, select "Allowed" from the dropdown for that bot; if you don't, select "Refused".
- The last row is for disallowing: use it if you want to restrict crawlers from certain areas of the site. Be sure to add a forward slash before filling in the address of the directory or page.
- After generating the robots.txt file, test your robots.txt with the robots.txt Tester.
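Putting the steps above together, the generated file might look like the following. The directory, sitemap URL, and bot choices are illustrative only; here Googlebot is set to "Allowed" and Bingbot to "Refused" purely to show both dropdown options:

```
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
```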