SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

Robots.txt: Configure Disallow/Allow, Sitemap, Clean-param; audit for accidental blocks
Crawler access: Path-level crawl control; AI crawler allow/block strategy
Differentiation: robots.txt = crawl control (who accesses what paths); noindex = index control (what gets indexed). See indexing for page-level exclusions.

Initial Assessment

Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.

Identify:

Site URL: Base domain (e.g., https://example.com)
Indexing scope: Full site, partial, or specific paths to exclude
AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

Point	Note
Purpose	Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet)
Advisory	Rules are advisory; malicious crawlers may ignore
Public	robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing

Crawl vs Index vs Link Equity (Quick Reference)

Tool	Controls	Prevents indexing?
robots.txt	Crawl (path-level)	No—blocked URLs may still appear in SERP
noindex (meta / X-Robots-Tag)	Index (page-level)	Yes. See indexing
nofollow	Link equity only	No—does not control indexing

When to Use robots.txt vs noindex

Use	Tool	Example
Path-level (whole directory)	robots.txt	`Disallow: /admin/`, `Disallow: /api/`, `Disallow: /staging/`
Page-level (specific pages)	noindex meta / X-Robots-Tag	Login, signup, thank-you, 404, legal. See indexing for full list
Critical	Do NOT block in robots.txt	Pages that use noindex—crawlers must access the page to read the directive

Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
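The conflict described above (a noindex page that is also blocked from crawling) can be caught with a simple prefix check during an audit. The following is a minimal sketch, not a definitive implementation: the function name `find_noindex_conflicts` and both input lists are hypothetical, and it only handles plain path prefixes, not wildcards.

```python
def find_noindex_conflicts(disallow_prefixes, noindex_paths):
    """Return noindex paths that a Disallow prefix would stop crawlers from reading.

    disallow_prefixes: path prefixes taken from Disallow rules (assumed input).
    noindex_paths: paths of pages that carry a noindex directive (assumed input).
    """
    return [
        path
        for path in noindex_paths
        # An empty Disallow value blocks nothing, so skip empty prefixes.
        if any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)
    ]


if __name__ == "__main__":
    conflicts = find_noindex_conflicts(
        ["/admin/", "/login/"],
        ["/login/", "/signup/", "/thank-you/"],
    )
    print(conflicts)
```

Any path it returns should either lose its Disallow rule or stop relying on noindex.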

Location and Format

Item	Requirement
Path	Site root: `https://example.com/robots.txt`
Encoding	UTF-8 plain text
Standard	RFC 9309 (Robots Exclusion Protocol)

Core Directives

Directive	Purpose	Example
`User-agent:`	Target crawler	`User-agent: Googlebot`, `User-agent: *`
`Disallow:`	Block path prefix	`Disallow: /admin/`
`Allow:`	Allow path (can override Disallow)	`Allow: /public/`
`Sitemap:`	Declare sitemap absolute URL	`Sitemap: https://example.com/sitemap.xml`
`Clean-param:`	Strip query params (Yandex)	See below

Path Format Rules (Critical)

Different directives use different path formats — a common source of errors:

Directive	Format	Correct	Wrong
Disallow / Allow	Root-relative path only (starts with `/`)	`Disallow: /admin/`	`Disallow: https://example.com/admin/`
Sitemap	Absolute URL only	`Sitemap: https://example.com/sitemap.xml`	`Sitemap: /sitemap.xml`
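The two format rules in the table can be checked mechanically when auditing a file. The sketch below is one possible approach under stated assumptions: `lint_robots_paths` is a hypothetical helper, it follows this document's rule that Allow/Disallow values start with `/`, and it accepts only http/https sitemap URLs.

```python
from urllib.parse import urlparse


def lint_robots_paths(robots_txt: str) -> list[str]:
    """Return human-readable problems found in Allow/Disallow/Sitemap values."""
    problems = []
    for lineno, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow"):
            # An empty value is valid; anything else must be root-relative.
            if value and not value.startswith("/"):
                problems.append(
                    f"line {lineno}: {field} needs a root-relative path, got {value!r}"
                )
        elif field == "sitemap":
            parsed = urlparse(value)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(
                    f"line {lineno}: sitemap needs an absolute URL, got {value!r}"
                )
    return problems


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: https://example.com/admin/\n"
        "Sitemap: /sitemap.xml\n"
    )
    for problem in lint_robots_paths(sample):
        print(problem)
```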

Wildcards: * matches any sequence of characters (Disallow: /tmp/*). $ anchors the match to the end of the URL (Allow: /news/*.html$).

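Priority: The most specific (longest) matching path takes precedence, regardless of the order in which rules appear. Allow: /shop/shoes/ overrides Disallow: /shop/. When an Allow and a Disallow rule match equally, RFC 9309 says the Allow should apply. Path matching is case-sensitive: Disallow: /PDF/ does not match /pdf/.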
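To make the wildcard and precedence rules concrete, here is a small sketch of a path matcher for a single user-agent group. It is an illustration under assumptions, not a reference implementation: `is_allowed` and `_pattern_to_regex` are hypothetical names, specificity is approximated by pattern length in characters, and percent-encoding normalization is ignored.

```python
import re


def _pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules: list of (directive, pattern) pairs, e.g. ("Disallow", "/shop/")."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on a tie, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


if __name__ == "__main__":
    rules = [
        ("Disallow", "/shop/"),
        ("Allow", "/shop/shoes/"),
        ("Disallow", "/PDF/"),
    ]
    for path in ("/shop/cart", "/shop/shoes/boots", "/pdf/report.pdf"):
        print(path, is_allowed(rules, path))
```

Matching is done with case-sensitive regexes, so /PDF/ and /pdf/ are treated as different paths, in line with the rule above.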

Critical: Do Not Block

Do not block	Reason
CSS, JS, images	Google needs them to render pages; blocking breaks indexing
`/_next/` (Next.js)	Blocking it breaks CSS/JS loading; static assets showing as "Crawled - currently not indexed" in GSC is expected. See indexing
Pages that use noindex	Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that

Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.

AI Crawler Strategy

robots.txt is honored by most of the AI crawlers listed below; ChatGPT-User is the exception. Set rules per user-agent, and check each vendor's docs for current tokens.

User-agent	Purpose	Typical	Notes
OAI-SearchBot	ChatGPT search	Allow	Respects robots.txt
GPTBot	OpenAI training	Disallow	Respects robots.txt; shares crawl data with OAI-SearchBot if both allowed
ChatGPT-User	User-initiated browsing	N/A	No longer respects robots.txt (Dec 2025); use server-side controls instead
Claude-SearchBot	Claude search	Allow	Respects robots.txt
Claude-User	Anthropic user-initiated browsing	Allow	Respects robots.txt (unlike ChatGPT-User)
ClaudeBot	Anthropic training	Disallow	Respects robots.txt
PerplexityBot	Perplexity search	Allow	Respects robots.txt
Google-Extended	Gemini training	Disallow	Respects robots.txt
CCBot	Common Crawl (LLM training)	Disallow	Respects robots.txt
Bytespider	ByteDance	Disallow	Respects robots.txt
Meta-ExternalAgent	Meta	Disallow	Respects robots.txt
AppleBot	Apple (Siri, Spotlight); renders JS	Allow for indexing	Respects robots.txt

Deprecated: ~~anthropic-ai~~ (retired by Anthropic, replaced by ClaudeBot / Claude-User / Claude-SearchBot). References to anthropic-ai in robots.txt have no effect.

Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training.

Important — ChatGPT-User exemption: As of December 2025, ChatGPT-User no longer respects robots.txt directives. OpenAI considers it a proxy for human-initiated browsing. If you need to block it, use server-side controls (WAF rules, IP rate-limiting), not robots.txt. See site-crawlability for AI crawler optimization (SSR, URL management).
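The allow/block strategy above can be turned into a file with a small generator. This is a hedged sketch: `build_robots_txt`, the policy dictionary, the blocked paths, and the sitemap URL are all illustrative assumptions, and the user-agent tokens should be checked against each vendor's current docs. ChatGPT-User is left out because, per the note above, robots.txt does not control it.

```python
# Hypothetical policy: True = allow crawling, False = block the whole site.
AI_CRAWLER_POLICY = {
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "CCBot": False,
}


def build_robots_txt(policy, blocked_paths, sitemap_url):
    """Build robots.txt text with a default group plus one group per AI crawler."""
    # Rules every allowed crawler should still obey.
    path_rules = [f"Disallow: {p}" for p in blocked_paths] or ["Allow: /"]
    lines = ["User-agent: *", *path_rules]
    for agent, allow in policy.items():
        lines += ["", f"User-agent: {agent}"]
        # A crawler that matches a named group ignores the '*' group,
        # so the path blocks are repeated for allowed crawlers.
        lines += path_rules if allow else ["Disallow: /"]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        build_robots_txt(
            AI_CRAWLER_POLICY,
            ["/admin/", "/api/", "/staging/"],
            "https://example.com/sitemap.xml",
        )
    )
```

Repeating the path blocks inside each named group is deliberate: without them, an allowed search crawler would be free to fetch /admin/ and the other paths that the default group blocks.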

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
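The directive above tells Yandex that URLs differing only in the listed parameters point to the same content. The sketch below approximates that effect by stripping the listed parameters from a URL, which is useful for previewing which URLs would collapse together. It is an assumption-laden illustration: `apply_clean_param` is a hypothetical helper, and the optional path prefix that Clean-param accepts after the parameter list is ignored.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def apply_clean_param(url: str, directive_value: str) -> str:
    """Return url with the query parameters named in a Clean-param value removed."""
    # The parameter list is the first whitespace-separated token, joined by '&'.
    names = set(directive_value.split()[0].split("&"))
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    value = "utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid"
    print(apply_clean_param(
        "https://example.com/shoes?color=red&utm_source=news&gclid=abc", value
    ))
```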

Output Format

Current state (if auditing)
Recommended robots.txt (full file)
Compliance checklist
References: Google robots.txt

Related Skills

indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
page-metadata: Meta robots (noindex, nofollow) implementation
xml-sitemap: Sitemap URL to reference in robots.txt
site-crawlability: Broader crawl and structure guidance; AI crawler optimization
rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers
