
robots.txt: The Complete Guide (with Examples)


A misconfigured robots.txt can take an entire site out of Google's index in one line. A well-tuned one saves crawl budget, hides noisy URLs, and points crawlers at your sitemap. This guide covers the format, the common rules you'll actually use, and the mistakes we see weekly in audits.

What robots.txt is (and isn't)

robots.txt is a plain text file at /robots.txt on your domain. It tells crawlers which URLs they may or may not request. It is not a security mechanism: anything you disallow is still publicly visible, just not crawled. And it does not remove pages from the index: a disallowed page that already has links pointing at it can still appear in search results (without a snippet). To remove pages, use noindex or remove them entirely.

The format in 60 seconds

User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /admin/help.html

Sitemap: https://example.com/sitemap.xml
  • User-agent: which crawler the block applies to. * means everyone. Use a specific name (e.g. Googlebot, Bingbot, GPTBot) to target one crawler.
  • Disallow: paths the crawler should not request. Path-prefix match, case-sensitive. Disallow: / blocks everything; Disallow: (empty) allows everything.
  • Allow: overrides a Disallow for a more specific path. Useful for poking holes in broad blocks.
  • Sitemap: absolute URL of your sitemap. You can list multiple Sitemap: lines.
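You can sanity-check simple rules like these with Python's standard-library parser. One caveat, so you don't trust it too far: urllib.robotparser ignores * and $ wildcards and resolves Allow/Disallow by first match rather than Google's longest-match rule, so use it only for plain prefix rules (the file below is a trimmed version of the example above):

```python
from urllib.robotparser import RobotFileParser

# Parse rules from a string instead of fetching /robots.txt over HTTP
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Prefix matching: /cart also blocks /cart/anything
print(rp.can_fetch("Googlebot", "https://example.com/cart"))        # False
print(rp.can_fetch("Googlebot", "https://example.com/admin/x"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/products/1"))  # True
```

For production-grade checks against Google's actual matching behavior, test in Search Console or with Google's open-source robots.txt parser instead.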

Recipes you'll actually use

Standard ecommerce

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /*?utm_

Sitemap: https://example.com/sitemap.xml

Block staging from all crawlers

User-agent: *
Disallow: /

Better still, return an X-Robots-Tag: noindex header from the staging server and password-protect it. robots.txt is advisory; bad bots ignore it. (Also: never let this file ship to production. We've seen it.)
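On nginx, for instance, the header plus basic auth might look like this (hostname and password file path are placeholders; adapt to your setup):

```nginx
# Staging server block — never ship this to production
server {
    server_name staging.example.com;

    # Crawlers that do fetch pages are told not to index them
    add_header X-Robots-Tag "noindex, nofollow" always;

    # Everything else is kept out entirely
    auth_basic           "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```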

Allow Googlebot, block AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

The mistakes we see most

  1. Blocking CSS or JS. Google needs to render your page to judge mobile-friendliness and Core Web Vitals. Disallow: /assets/ is almost always wrong.
  2. Disallow + noindex on the same URL. If a page is disallowed, Google can't crawl it to see the noindex. The page can still appear in search results. To remove a page, use noindex alone (and let it be crawled).
  3. Trailing slash confusion. Disallow: /admin blocks /admin, /admin/, /admin-tools, and /administrator. If you only want the directory, write Disallow: /admin/.
  4. Wildcards used wrong. Only * (any sequence) and $ (end of URL) are supported. Disallow: /*.pdf$ blocks all PDFs.
  5. Forgetting the sitemap directive. It's the cheapest way to advertise your sitemap to every crawler. See our XML sitemap guide.
  6. Case sensitivity. Disallow: /Admin/ does NOT block /admin/. URLs in directives are case-sensitive.
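The matching behavior behind mistakes 3, 4, and 6 can be sketched in a few lines of Python. This is a hypothetical helper illustrating the rules, not Google's implementation: patterns are case-sensitive prefix matches, * matches any sequence of characters, and a trailing $ anchors the match to the end of the URL:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    Only two metacharacters exist: '*' (any character sequence)
    and a trailing '$' (end of URL). Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex specials, then restore '*' as '.*'
    body = re.escape(pattern).replace(r"\*", ".*")
    # A pattern is a prefix match unless it ended with '$'
    return re.compile("^" + body + ("$" if anchored else ""))

def blocks(pattern: str, path: str) -> bool:
    return robots_pattern_to_regex(pattern).match(path) is not None

print(blocks("/admin", "/administrator"))           # True  (prefix match)
print(blocks("/admin/", "/admin"))                  # False (directory only)
print(blocks("/*.pdf$", "/files/report.pdf"))       # True
print(blocks("/*.pdf$", "/files/report.pdf?x=1"))   # False ($ anchors the end)
print(blocks("/Admin/", "/admin/"))                 # False (case-sensitive)
```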

Testing before you ship

  • Search Console's robots.txt report: shows which version of your file Google last fetched and flags any parse errors. (The old standalone robots.txt Tester has been retired.)
  • Curl: curl -I https://example.com/robots.txt must return 200 OK and Content-Type: text/plain.
  • Crawl your own site with a tool like Screaming Frog after each change to verify the right pages are still reachable.

One last thing: budget

For sites under ~10k URLs, crawl budget is rarely the bottleneck; content quality and links are. For larger sites, every URL Googlebot wastes on faceted navigation or session IDs is one fewer it spends on real content. Your robots.txt is the cheapest lever you have to fix that.

Run an AuditAI scan → to see how your site looks to a crawler right now, including any robots.txt issues we detect.

Ready to audit your site?

Run an AI-powered SEO audit in under 30 seconds. Free, no signup required.

Run a free audit →