Google announced early this month that they will be discontinuing their unofficial support of noindex directives within robots.txt files. As of 1 September 2019, publishers who are currently relying on robots.txt crawl-delay, nofollow, and noindex directives will need to find another way to instruct search engine robots how to crawl their sites’ pages.
This policy update means many web publishers will need to act quickly to preserve their approach to search engine optimisation. Are you one of them? Continue reading to learn more about Google’s major policy update and what it will mean moving forward.
Webmasters create robots.txt text files to instruct user agents (web crawlers) how to crawl the pages on their websites. These text files either allow or disallow web robots such as search engine crawlers to engage in specified behaviour.
Robots.txt files can be as short as two lines. The first line names the user agent to which the directives apply; the second gives that user agent a specific instruction, such as Allow or Disallow. Web crawlers disregard robots.txt rules that are not directed at them, and some ignore even the rules that are. Google will soon count themselves among the latter group, at least where crawl-delay, nofollow, and noindex are concerned.
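A minimal example, using the standard directive syntax and hypothetical paths, might look like this:

```
User-agent: Googlebot
Disallow: /private/

User-agent: *
Allow: /
```

Here the first group applies only to Googlebot, telling it not to crawl anything under /private/; the second group applies to every other crawler.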
Google has long discouraged publishers from using crawl-delay, nofollow, and noindex directives within robots.txt files, but has nonetheless honoured most of them despite having no standardised policy toward them.
Google has spent years working to standardise the Robots Exclusion Protocol, laying the groundwork for this change. That is also why Google has long encouraged publishers to find alternatives to these unofficial robots.txt directives.
In their announcement, Google said they were making the Robots Exclusion Protocol (REP) an internet standard. To that end, they have open-sourced the C++ library they use to parse and match rules in robots.txt files. The 20-plus-year-old library, along with a testing tool offered by Google, can help developers create the parsing tools of the future.
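Google’s own library is written in C++, but the behaviour of a REP parser can be sketched with Python’s standard-library urllib.robotparser module. The robots.txt content and URLs below are hypothetical examples:

```python
# Sketch of how a crawler consults robots.txt before fetching a URL,
# using Python's standard-library REP parser.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A disallowed path: a well-behaved crawler skips it.
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
# Anything not disallowed may be crawled.
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
```

Note that this controls crawling, not indexing; it is exactly the distinction Google’s change turns on.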
Dropping support for robots.txt noindex does not leave publishers without means to keep pages out of Google’s index.
Use robots meta tags to noindex
The noindex directive is supported both in robots meta tags in a page’s HTML and in X-Robots-Tag HTTP response headers.
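For example, the following tag keeps a page out of Google’s index (the page itself must remain crawlable so the directive can be seen):

```html
<!-- In the page's <head>: -->
<meta name="robots" content="noindex">
```

The same directive can be sent as an HTTP response header, which also works for non-HTML resources such as PDFs:

```
X-Robots-Tag: noindex
```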
Use HTTP status codes 404 and 410
These status codes tell crawlers that a page no longer exists; once such URLs have been crawled and processed, they are dropped from Google’s index.
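As an illustration, the decision a server makes can be as simple as the sketch below. The paths and the status_for helper are hypothetical, not part of any framework:

```python
# Map permanently removed pages to "410 Gone" so crawlers drop them
# from the index. The paths here are hypothetical examples.
REMOVED = {"/old-page", "/retired-offer"}

def status_for(path: str) -> int:
    """Return the status code a crawler should receive for this path.

    404 Not Found also works, but 410 Gone is the more explicit hint
    that the page has been removed for good.
    """
    return 410 if path in REMOVED else 200

print(status_for("/old-page"))  # 410
print(status_for("/about"))     # 200
```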
Hide content behind password protections
Content concealed behind a login page will generally be removed from Google’s index, unless you have signalled it as subscription or paywalled content with markup.
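That markup uses schema.org’s isAccessibleForFree property. A sketch of the JSON-LD, where the .paywall CSS selector is a placeholder for whatever element wraps your gated content:

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Example subscriber-only article",
  "isAccessibleForFree": "False",
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": "False",
    "cssSelector": ".paywall"
  }
}
```

Without this signal, Google treats gated content like any other hidden content and leaves it out of the index.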
Disallow in robots.txt
If search engines are disallowed from crawling a page, its content cannot be indexed, though the URL itself may still be indexed based on links from other pages.
Use the Search Console Remove URL tool
The Search Console Remove URL tool allows for the temporary removal of a URL from Google’s search results.
If your site has relied on robots.txt noindex directives to avoid search engine indexing, you have until 1 September to make the necessary changes. If you’re not sure whether your site uses noindex directives, it would be wise to double-check. Having pages indexed that you meant to conceal can cost your website in search rankings.