Why do you need robots.txt?

There are a number of reasons to use a robots.txt file.

  • Your site is in development mode and some of its sections or pages are still under construction.
  • You have a few directories or files that you want to keep private for some reason, such as draft designs or experimental data.
  • You want to disallow search engines from indexing certain files on your website (images, PDFs, etc.), as in the example after this list.
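
For instance, a couple of lines like these would keep robots out of an image directory and away from PDF files (the /images/ path here is just a placeholder, and the /*.pdf$ pattern uses the wildcard syntax explained later in this article):

User-agent: *
Disallow: /images/                              # placeholder directory name
Disallow: /*.pdf$                               # block URLs ending in .pdf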

But any misconfiguration in the robots.txt file can be very costly: if you accidentally disallow Googlebot from crawling and indexing your entire site, your pages can drop out of search results. Review your rule set carefully and make sure you are not making any mistakes.

About Robots.txt

Search engines use web robots (also known as spiders, crawlers or searchbots) that frequently crawl the web. A robots.txt file simply tells these robots which directories and pages of a website they are allowed to crawl and which areas or files are off limits.

For example, if you do not want your page thank-you.html to be scanned by web robots, you can restrict them:

User-agent: *
Disallow: /thank-you.html

How to create a robots.txt file

Creating a robots.txt file is easy. Open any text editor, add your rules and save the file as robots.txt. Then upload it to the root directory of your website so that it is reachable at yoursite.com/robots.txt.
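
As a starting point, a minimal file might contain nothing more than the two lines below (the /drafts/ directory is just a placeholder for whatever you want to keep out of search results):

User-agent: *
Disallow: /drafts/                              # placeholder directory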

User-agent directive

"User-agent:" part specifies which search engine robot you want to block. An asterisk (*) is used as a wildcard with User-agent for all search engines. So the below robots.txt code snippet will disallow all the user-agents to crawl or index the website. 

Disallow all indexing (Universal Match)

User-agent: *
Disallow: /

Disallow specific User-agents

If you want to block a specific user-agent, for example Googlebot, from a particular directory, you write:

User-agent: Googlebot
Disallow: /images/
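
The same approach shuts a particular bot out of the entire site. For example, to keep Bingbot away from everything while leaving other robots unaffected:

User-agent: Bingbot
Disallow: /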

Wildcard Matching

Wildcard patterns let you match groups of URLs rather than exact paths; the * and $ wildcards are an extension supported by major crawlers such as Googlebot and Bingbot, though not by every robot. To block access to all URLs that include a question mark (?), you could use the following entry:

User-agent: *
Disallow: /*?                                   # blocks e.g. /search?q=shoes, but not /search

We can also use $ to block URLs ending with a specific file type. For example, to block all URLs that end with .php:

User-agent: Googlebot
Disallow: /*.php$                               # blocks /index.php, but not /index.php?page=2

What happens if your website has no robots.txt?

Well, in this case, robots are free to visit and index every directory, page and piece of content on the site. The same is true for an empty robots.txt.
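
An empty Disallow value has the same effect: it explicitly permits everything, so the file below is equivalent to having no robots.txt at all:

User-agent: *
Disallow: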

Allow

This one may be a little confusing, so let's walk through an example. Assume our web directory contains a sub-directory with a few pages, and we want only one specific page in it to be crawled and indexed.

User-agent: *
Allow: /site2/special-page.php
Disallow: /site2/

So we disallowed the directory 'site2' but allowed 'special-page.php' inside it. To make this rule work across as many robots as possible, place the Allow directive(s) first, followed by the Disallow: robots that evaluate the rules in order will only honour the Allow if it comes before the matching Disallow. (Googlebot is more forgiving here and applies the most specific matching rule regardless of order.)

Multiple user-agents

We can specify rules for multiple user-agents, as shown in the example below:

User-agent: googlebot                           # all Google services
Disallow: /private/                             # disallow specified directory

User-agent: googlebot-news                      # only the news service
Disallow: /                                     # disallow everything

User-agent: *                                   # any robot
Disallow: /something/                           # disallow specified directory