Both .htaccess and robots.txt are text-based configuration files commonly found on web servers. However, this is where the similarities between them end and the differences begin.
.htaccess vs. robots.txt
While .htaccess is mainly used for directory-level access control, URL redirection, and URL rewriting, robots.txt instructs search engine crawlers which pages on a website they can and cannot crawl.

To put it simply: If you want to control the way users can access and view your site's URLs, you need to make changes to the .htaccess file. If you want to control the way search engine bots crawl your website, you must make changes to your robots.txt file.
What Is .htaccess?
.htaccess is a text-based configuration file for Apache servers. It's intended for access control, URL redirection, and URL rewriting, but it has a number of other uses thanks to its ability to set cookies and determine HTTP response headers.
Some of the most common use cases for this file, according to Red Hat, include the following (a combined sketch appears after the list):
- Redirecting specific URLs or URL patterns
- Loading custom error pages, like user-friendly 404 pages
- Forcing the pages on your website to load as HTTPS instead of HTTP
- Allowing or forbidding visits to your website, or to specific pages on it, from specific IP addresses and IP address ranges
- Protecting certain directories on your server with “basic” HTTP authentication
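To make these concrete, here is a minimal sketch of an .htaccess file combining several of the use cases above. The URLs, file paths, and IP address are all placeholders, and the HTTPS block assumes mod_rewrite is enabled:

```apache
# Redirect a specific old URL to a new one (301 = permanent)
Redirect 301 /old-page.html /new-page.html

# Load a custom, user-friendly 404 page
ErrorDocument 404 /errors/not-found.html

# Force pages to load as HTTPS instead of HTTP (requires mod_rewrite)
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]

# Protect this directory with "basic" HTTP authentication,
# and additionally forbid one specific IP address
AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
<RequireAll>
    Require valid-user
    Require not ip 203.0.113.42
</RequireAll>
```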
The name “.htaccess” is short for hypertext access. The prepended dot (.) makes the file hidden on Unix systems.
The .htaccess file is a directory-level configuration file. It allows you to override, for specific directories, the Apache server's default settings in the server-level configuration file, httpd.conf or apache2.conf (found in the server's conf/ directory).

(Directory-level settings can also be set in the <Directory> section of an Apache web server's main configuration file.)
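Note that the server-level configuration decides whether .htaccess overrides are honored in the first place. A minimal sketch, assuming Apache 2.4 and a placeholder document root:

```apache
# In httpd.conf or apache2.conf (server-level configuration)
<Directory "/var/www/example">
    # AllowOverride controls which groups of directives an .htaccess
    # file in this directory (and below) is allowed to override.
    # "None" disables .htaccess processing entirely; "All" permits
    # every group; or you can list specific groups:
    AllowOverride FileInfo AuthConfig Limit
</Directory>
```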
An .htaccess file is written using Apache configuration directives. Regular expressions within those directives use Perl Compatible Regular Expressions (PCRE) syntax, but without the delimiters that mark the beginning and end of an expression in Perl; instead, spaces separate a directive's arguments.
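For example, in a mod_rewrite directive the pattern is a bare regular expression (the URL pattern below is made up for illustration):

```apache
RewriteEngine On
# The pattern ^blog/([0-9]+)$ is a bare PCRE expression: no /.../
# delimiters, and spaces separate the directive's three arguments
# (pattern, substitution, flags).
RewriteRule ^blog/([0-9]+)$ /posts.php?id=$1 [L,QSA]
```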
According to documentation from software giant Oracle, you should avoid .htaccess files whenever possible, as they inevitably impact the server's performance.
.htaccess files are read on every HTTP request. If your website has a complex structure of directories, Apache must look for and parse an .htaccess file in every directory along the requested path, on every request.
If you want to decrease your Apache web server's response time, it is best to set configurations at the server level. If this isn't an option, reduce the number of .htaccess files and set rules optimized for performance.
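A minimal sketch of the server-level approach, reusing the placeholder path from above: the per-directory rules move into a <Directory> block, and AllowOverride None stops Apache from searching for .htaccess files at all.

```apache
# In httpd.conf or apache2.conf
<Directory "/var/www/example">
    # Rules formerly kept in /var/www/example/.htaccess go here:
    Redirect 301 /old-page.html /new-page.html

    # Stop Apache from searching for .htaccess files in this tree
    AllowOverride None
</Directory>
```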
What Is robots.txt?
robots.txt is a text-based configuration file that tells search engine crawlers, such as Googlebot, Bingbot, and DuckDuckBot, which URLs on your website they are allowed to crawl and which ones they are not. The purpose of robots.txt is to prevent search engine crawlers from overloading your website with requests.
Some of the most common use cases for this file include the following (a sample file appears after the list):
- Allowing all or specific crawlers to crawl your whole website or a specific directory
- Disallowing all or specific crawlers from crawling your whole website or a specific directory
- Pointing crawlers to the URLs of the sitemap(s) for the website
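A small sample robots.txt covering these cases might look as follows; ExampleBot, the directory name, and the sitemap URL are placeholders:

```
# Every crawler may crawl the whole site, except one directory
User-agent: *
Disallow: /private/

# Forbid one specific (hypothetical) crawler from the entire site
User-agent: ExampleBot
Disallow: /

# Point crawlers to the sitemap(s) for this website
Sitemap: https://www.example.com/sitemap.xml
```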
The robots.txt file must be placed in the topmost, root directory of a website served over HTTP or HTTPS, so that search engine crawlers can fetch it with an unconditional GET request.

The rules in the file apply to the host and all of its directories, which are indicated by a forward slash (/), but not to its subdomains or to other protocols and port numbers. If the robots.txt file is hosted in a subdirectory, it is usually considered invalid and disregarded.
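For instance, with example.com as a placeholder domain:

```
https://www.example.com/robots.txt       valid: applies to https://www.example.com/
https://www.example.com/pages/robots.txt invalid: hosted in a subdirectory, so ignored
https://blog.example.com/robots.txt      the subdomain needs its own robots.txt file
```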
Google is currently formalizing the most widely used rules in what is called the Robots Exclusion Protocol specification. So far, AOL, Baidu, DuckDuckGo, Google, Yahoo!, and Yandex fully comply with the specification; Bing does not, as it can't inherit settings from the wildcard (*) user agent.
Did you know?
For those of you interested in trivia: robots.txt was created in 1994 by Martijn Koster, a webmaster then working for the cybersecurity company Nexor, after a crawler caused a denial of service on a web server he managed.

Koster posted it to the www-talk mailing list, where the pioneers of the World Wide Web communicated at the time. It was adopted almost immediately and quickly became the de facto standard for mediating between search engine bots and webmasters.