The Beginner's Guide to robots.txt — What It Is and How to Test It

There is a plain-text file sitting at the root of almost every website on the internet that most website owners have never looked at — and that has the power to make your entire site invisible to Google with a single misplaced line. It's called robots.txt, and understanding it is one of the most important (and most neglected) basics of technical SEO.

The good news: it's not complicated. Once you understand what it does and how it works, you can read, write, and test robots.txt files with confidence. This guide covers everything you need to know.

What robots.txt Is and What It Does

robots.txt is a simple text file that lives at the root of your website — always at yourdomain.com/robots.txt. It uses a standardized protocol called the Robots Exclusion Standard to tell web crawlers (Googlebot, Bingbot, and others) which pages or directories on your site they're allowed to access and which they should skip.

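When a search engine crawler arrives at your site, it checks your robots.txt file before requesting any page (major crawlers cache the file for a while, so the check does not happen on every single visit). If the file tells it not to crawl certain areas, a well-behaved crawler such as Googlebot or Bingbot will respect those instructions and move on. Compliance is voluntary, though: robots.txt is a request, not an access control, so it will not stop scrapers or malicious bots. If there's no robots.txt file, crawlers will attempt to access everything.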

It's important to understand what robots.txt does not do: it doesn't prevent pages from being indexed. A page that's blocked in robots.txt can still appear in search results if other sites link to it, because Google may index the URL based on those external links even without crawling the page itself. If you want a page to be definitively excluded from search results, you need a noindex meta tag (or an X-Robots-Tag: noindex HTTP header) on a page that crawlers are allowed to fetch, not a robots.txt block. (We covered this distinction in the guide to how search engines crawl and index your website.)

The Syntax: How robots.txt Is Written

robots.txt uses a straightforward syntax built from just a few directives. Here's what a typical file looks like:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Breaking that down line by line:

User-agent: * — The asterisk is a wildcard meaning "all crawlers." You can target specific crawlers by name instead (e.g. User-agent: Googlebot), but most sites use the wildcard for simplicity.
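Disallow: /admin/ — Tells crawlers not to access anything under the /admin/ directory. Rules are matched as path prefixes, so the trailing slash matters: Disallow: /admin/ covers only that directory and its contents, while Disallow: /admin would also block /admin.html and /administrator.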
Disallow: /private/ — Same pattern for another directory.
Allow: / — Explicitly permits access to everything else. In most cases this line isn't strictly necessary (it's the default), but it's good practice for clarity.
Sitemap: — An optional but strongly recommended line that tells crawlers where to find your XML sitemap. Including it here means crawlers can find your sitemap even if you haven't submitted it to Google Search Console.
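To see how a crawler might read the sample file, here is a minimal sketch using Python's standard-library urllib.robotparser. The file contents are inlined so it runs offline, and the test paths (/blog/first-post, /admin/settings, /private/report.pdf) are made-up examples. One caveat: this parser applies rules in file order, while Google picks the most specific matching rule. For this sample the two approaches agree.

```python
from urllib.robotparser import RobotFileParser

# The sample file from above, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Hypothetical paths, checked as if we were Googlebot.
for path in ("/", "/blog/first-post", "/admin/settings", "/private/report.pdf"):
    verdict = "allowed" if rules.can_fetch("Googlebot", path) else "blocked"
    print(f"{path}: {verdict}")

# site_maps() (Python 3.8+) returns the Sitemap: URLs, or None if none are listed.
print(rules.site_maps())
```

The first two paths come back as allowed and the last two as blocked, and the final line prints the sitemap URL from the file.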

A few syntax rules that trip people up:

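Paths in robots.txt are case-sensitive. Disallow: /Admin/ and Disallow: /admin/ are different rules. (The directive names themselves, such as Disallow and User-agent, are not case-sensitive.)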
Each rule must be on its own line. Combining directives on one line breaks the syntax.
Comments start with a # and run to the end of the line, either on a line of their own or after a directive. They are useful for documenting why a rule exists.
A blank Disallow: with nothing after it means "allow everything" — not "disallow everything." This is a common source of confusion.
Conversely, Disallow: / means "disallow everything" — one of the most dangerous lines you can have in production.
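These rules are easy to check for yourself. The sketch below, again using Python's urllib.robotparser, feeds three tiny robots.txt files to the parser and asks whether a path is allowed. The helper name is_allowed and the test paths are illustrative assumptions, not part of any standard tool.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"
MIXED_CASE = "User-agent: *\nDisallow: /Admin/  # capital A on purpose\n"

# A blank Disallow allows everything.
print(is_allowed(EMPTY_DISALLOW, "/any/page"))   # True

# Disallow: / blocks everything.
print(is_allowed(DISALLOW_ALL, "/any/page"))     # False

# Paths are case-sensitive: only the exact-case prefix is blocked.
print(is_allowed(MIXED_CASE, "/Admin/users"))    # False
print(is_allowed(MIXED_CASE, "/admin/users"))    # True
```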

The staging site trap: The single most common catastrophic robots.txt mistake happens during website launches. Staging and development environments are routinely set to Disallow: / to keep them out of search results while the site is being built — which is correct. The disaster occurs when the staging configuration is copied to the live server during launch without removing that line. The result is a live website telling Google not to crawl anything. This has happened to sites of every size, and it can take weeks to recover from once Google processes the change.
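One way to guard against this is a small pre-launch check that refuses to continue if the robots.txt about to go live blocks the homepage. The following is a rough sketch, not a complete deployment safeguard. The function name homepage_is_blocked and the two sample files are assumptions for illustration, and in a real pipeline you would read the file that is actually being deployed.

```python
from urllib.robotparser import RobotFileParser


def homepage_is_blocked(robots_txt: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` is not allowed to fetch "/" under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Hypothetical file contents for a staging and a production environment.
STAGING_ROBOTS = "User-agent: *\nDisallow: /\n"
PRODUCTION_ROBOTS = "User-agent: *\nDisallow: /admin/\n"

for name, content in (("staging", STAGING_ROBOTS), ("production", PRODUCTION_ROBOTS)):
    if homepage_is_blocked(content):
        print(f"{name}: homepage is blocked, do NOT ship this file to the live site")
    else:
        print(f"{name}: homepage is crawlable")
```

Running a check like this as part of the launch checklist turns a silent mistake into a loud one.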

What to Block — and What Not To

Most sites should block a relatively small set of directories that serve no value to searchers and waste crawl budget. Common legitimate blocks include:

/admin/, /wp-admin/ — CMS administration areas. No searcher needs to find these, and crawlers have no business there.
/cart/, /checkout/ — E-commerce transaction pages with no standalone search value.
Search result pages — Internal search results pages (/search?q=) are usually thin, duplicate-content-heavy pages that add no value to Google's index.
Duplicate content paths — URL parameter variants, print versions, and session-ID URLs that generate duplicate content without canonical tags.
Staging or test directories — Any /staging/, /test/, or /dev/ paths that might exist on a production server.

What you should not block — even though people do it by mistake:

CSS and JavaScript files. Google needs to render your pages to fully understand them, and rendering requires access to your CSS and JS. Blocking these files — common in older SEO advice — prevents Google from seeing your pages as users do.
Image directories. Unless you have specific reasons to exclude images from search, blocking your image folders prevents your images from appearing in Google Image Search.
Any page you want to rank. This sounds obvious, but it happens. A page blocked in robots.txt cannot be crawled, which means Google's understanding of it is severely limited — even if the URL somehow ends up indexed.
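A quick way to catch these accidental blocks is to keep a short list of paths that must stay crawlable and check them against your rules. The sketch below assumes a hypothetical older-style robots.txt and made-up asset paths. It uses plain path prefixes only, since Python's urllib.robotparser does not interpret the * and $ wildcards that Google supports.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the subset of `paths` that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


# Hypothetical file following outdated advice to hide CSS and JavaScript.
OLD_STYLE_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /assets/css/
Disallow: /assets/js/
"""

# Hypothetical paths that should always remain crawlable.
MUST_STAY_CRAWLABLE = [
    "/assets/css/site.css",
    "/assets/js/app.js",
    "/images/hero.jpg",
    "/pricing",
]

print(blocked_paths(OLD_STYLE_ROBOTS, MUST_STAY_CRAWLABLE))
# ['/assets/css/site.css', '/assets/js/app.js']
```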

How to Test Your robots.txt File

Writing a robots.txt file is one thing — confirming it actually does what you intend is another. The syntax is unforgiving: a missing slash, a wrong path, or a misplaced rule can produce results completely different from what you expect.

The Robots.txt Tester fetches the live robots.txt file from any domain and lets you test specific URLs against it — showing you immediately whether Googlebot can or can't access a given page based on your current rules. Run it against:

Your homepage — should always be allowed
Your most important landing pages and blog posts — should always be allowed
Your admin and private directories — should be blocked
Any URL you're unsure about
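If you prefer to script the same checklist, the idea can be sketched in Python: write down what you expect for each URL, then compare that against what the rules actually say. The audit function, the sample rules, and the paths below are all illustrative assumptions. The sample deliberately includes a missing trailing slash to show how a small typo surfaces.

```python
from urllib.robotparser import RobotFileParser


def audit(parser: RobotFileParser, expectations: dict[str, bool],
          agent: str = "Googlebot") -> list[tuple[str, bool, bool]]:
    """Return (path, expected, actual) for every path whose result differs."""
    mismatches = []
    for path, should_be_allowed in expectations.items():
        actual = parser.can_fetch(agent, path)
        if actual != should_be_allowed:
            mismatches.append((path, should_be_allowed, actual))
    return mismatches


# Hypothetical rules: "/private" is missing its trailing slash.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private
"""

# True means "should be crawlable", False means "should be blocked".
EXPECTED = {
    "/": True,
    "/blog/robots-guide": True,
    "/private-events": True,
    "/admin/": False,
    "/private/notes": False,
}

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path, expected, actual in audit(parser, EXPECTED):
    print(f"{path}: expected allowed={expected}, got allowed={actual}")
# /private-events: expected allowed=True, got allowed=False
```

To run the same audit against a live site instead of inline text, replace the parse() call with parser.set_url("https://yourdomain.com/robots.txt") followed by parser.read(), which fetches the file over the network.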

It's also worth checking competitor domains occasionally — their robots.txt files are publicly accessible and sometimes reveal interesting information about their site structure and what they're deliberately keeping out of search results.

robots.txt vs. noindex: Knowing Which to Use

This distinction matters enough to state clearly:

Use robots.txt Disallow when you don't want pages crawled at all — admin areas, private directories, pages that serve no purpose in search.
Use a noindex meta tag when you want pages crawled (so Google can process the noindex directive) but not included in search results — thank-you pages, tag archives, thin content pages.
Never use robots.txt to block a page you've marked noindex. If Googlebot is blocked from crawling the page, it can't read the noindex tag — so the page may still appear in search results based on external links. The two directives work against each other in this combination.
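The conflicting combination can be detected automatically. The sketch below, which relies only on Python's standard library, flags a page when it carries a noindex meta tag and is also blocked in robots.txt. The class and function names, the sample HTML, and the /thank-you/ path are assumptions for illustration. It only looks at meta tags, so a noindex sent through an X-Robots-Tag HTTP header would need a separate check.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page has a robots meta tag containing noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def noindex_is_hidden(robots_txt: str, path: str, html: str,
                      agent: str = "Googlebot") -> bool:
    """True if the page says noindex but the crawler is blocked from reading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return has_noindex(html) and not parser.can_fetch(agent, path)


# Hypothetical thank-you page carrying a noindex meta tag.
PAGE = ('<html><head><meta name="robots" content="noindex, follow"></head>'
        '<body>Thanks!</body></html>')

print(noindex_is_hidden("User-agent: *\nDisallow: /thank-you/\n", "/thank-you/", PAGE))  # True
print(noindex_is_hidden("User-agent: *\nDisallow: /admin/\n", "/thank-you/", PAGE))      # False
```

A True result means the noindex tag will never be seen, so the fix is to remove the Disallow rule for that path and let the tag do its job.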

Keeping both your robots.txt and your indexability situation clean is a core part of technical SEO maintenance. For the full picture of how these elements interact with crawling, the guide to what technical SEO covers ties it all together.

Related Articles

How Search Engines Crawl and Index Your Website

What Is an XML Sitemap and Does Your Site Actually Need One?