Tags: GEO, Technical SEO, AI Crawlers, robots.txt, AI Search

robots.txt for AI Crawlers: The Complete Allow List for 2026

By Luis D. González. 8 min read.

TL;DR

If AI crawlers cannot read your site, AI engines cannot cite it. Allow four categories in robots.txt: training crawlers (GPTBot, ClaudeBot, CCBot), search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot), user-triggered fetchers (ChatGPT-User, Perplexity-User, Google-Agent), and opt-out tokens (Google-Extended, Applebot-Extended). Then verify at the CDN layer — Cloudflare bot protection can silently block AI bots even when robots.txt allows them — and confirm real visits in your server logs.

robots.txt is the gate between your site and AI engines, and most businesses have never checked theirs against the 2026 list of AI user-agents. A single security plugin update or CDN setting can silently cut you out of ChatGPT, Perplexity, and Claude citations entirely.

This guide gives you the complete 2026 allow list by category, the exact robots.txt block to copy, and the two places where AI bots get blocked even when robots.txt says allow.

What are the four categories of AI user-agents in 2026?

Training crawlers, search crawlers, user-triggered fetchers, and opt-out tokens — each with a different job and a different cost if you block it.

Training crawlers. Agents: GPTBot, ClaudeBot, anthropic-ai, CCBot. What blocking costs you: long-term, models never learn your brand exists.

Search crawlers. Agents: OAI-SearchBot, Claude-SearchBot, PerplexityBot. What blocking costs you: immediate, zero live citations on that engine.

User-triggered fetchers. Agents: ChatGPT-User, Perplexity-User, Google-Agent. What blocking costs you: users asking AI about your site get an error.

Opt-out tokens. Agents: Google-Extended, Applebot-Extended. What blocking costs you: directive-only, controls training use, never crawls.

Verdict: allow all four categories unless you have a specific licensing strategy — the citation upside outweighs content-reuse concerns for most businesses.

What should my robots.txt actually say?

Copy this block. It explicitly allows every agent listed in the four categories above:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: CCBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Agent
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

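One warning: a wildcard group (User-agent: * with Disallow: /) does not override the named groups in this block, because a crawler follows the most specific group that matches its name and falls back to the wildcard only when no named group exists. Still, check that you do not have a wildcard Disallow: / shutting out every agent you have not listed.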
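To check a draft before deploying it, Python's standard urllib.robotparser can evaluate a robots.txt string for each agent. The sketch below is illustrative only: the audit helper, the SAMPLE rules, and the example.com URL are assumptions made for the demo, and the standard-library parser only approximates how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Agent names taken from the four categories in this article.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot",
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Perplexity-User", "Google-Agent",
    "Google-Extended", "Applebot-Extended",
]


def audit(robots_txt, agents=AI_AGENTS, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    `url` is a placeholder; pass a real page from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical file: a wildcard block plus one named group.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

if __name__ == "__main__":
    for agent, allowed in audit(SAMPLE).items():
        print(f"{agent:20} {'allowed' if allowed else 'BLOCKED'}")
```

With the sample rules, GPTBot is reported as allowed because its named group wins, while every other agent falls through to the wildcard Disallow. That is the accidental lockout to look for in your own file.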

Why do AI bots get blocked even when robots.txt allows them?

Two silent killers: the CDN layer and JavaScript-only rendering.

The CDN layer. Cloudflare Bot Fight Mode, Super Bot Fight Mode, Sucuri, and similar WAF features can classify AI crawlers as hostile bots and block them at the network layer, so the request never reaches your pages no matter what robots.txt says. If you use Cloudflare, check Security → Bots and add exceptions for verified AI crawlers.
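One rough way to spot CDN-level blocking is to request the same URL twice, once with a browser-style User-Agent and once with a crawler-style one, and compare the status codes. The sketch below assumes that approach; fetch_status, looks_blocked, and the list of denial statuses are illustrative choices, not part of any CDN's API. Many bot-protection products verify crawlers by IP address, so a spoofed User-Agent sent from your own machine may be blocked even when the real crawler is not. Treat a mismatch as a hint to check the dashboard, not as proof.

```python
import urllib.error
import urllib.request

# Status codes commonly returned by bot-protection layers (an assumption;
# your CDN may answer differently, e.g. with a 200 challenge page).
DENIAL_STATUSES = (401, 403, 429, 503)


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is what we want.
        return err.code


def looks_blocked(browser_status, bot_status):
    """True when the browser request succeeds but the bot request is denied."""
    return browser_status == 200 and bot_status in DENIAL_STATUSES
```

A call might look like looks_blocked(fetch_status(url, browser_ua), fetch_status(url, bot_ua)), where browser_ua and bot_ua are the full User-Agent strings you choose to send. Each vendor publishes its crawler's exact string in its own documentation.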

JavaScript-only content. Some AI crawlers do not execute JavaScript. If your page is an empty div until JS runs, the crawler sees nothing — allowed or not. Test with a curl request: if your main content is not in the raw HTML response, you have a rendering problem, not a permissions problem.
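The same check can be scripted. The sketch below fetches the raw HTML without running any JavaScript and looks for a phrase you expect in the main content. The helper names (fetch_raw_html, phrase_in_raw_html) are hypothetical, and ignoring text inside script and style tags is an assumption about what counts as readable content.

```python
import urllib.request
from html.parser import HTMLParser


class _TextOutsideScripts(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def phrase_in_raw_html(html, phrase):
    """True if `phrase` appears in the HTML's text without executing JS."""
    parser = _TextOutsideScripts()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in text


def fetch_raw_html(url, timeout=10):
    """Download the page body as text, as a non-rendering crawler would."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If phrase_in_raw_html(fetch_raw_html(your_url), "a sentence from your main content") returns False, the content is probably being injected by JavaScript after load.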

How do I verify AI bots are actually visiting?

Check your server logs monthly for each agent name — allowing bots in robots.txt does not prove they are visiting. In cPanel raw access logs or your hosting analytics, search for GPTBot, ClaudeBot, OAI-SearchBot, and PerplexityBot. A healthy pattern shows GPTBot and ClaudeBot visiting weekly and PerplexityBot multiple times per day.

If a major bot stops appearing for two weeks or more, investigate immediately: the usual suspects are a CMS security plugin update, a CDN setting change, or an accidental robots.txt edit.
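The monthly check can be automated with a short script. The sketch below assumes access logs in the common Apache/Nginx combined format, with a timestamp such as [10/Jan/2026:13:55:36 +0000] and the user-agent string somewhere on the same line. The function names and the 14-day threshold are illustrative, with the threshold taken from the two-week rule above.

```python
import re
from datetime import datetime, timedelta

# Agents the article suggests searching for in the logs.
AGENTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]

# Combined Log Format timestamp (assumed; adjust for your server's format).
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def summarize(lines, agents=AGENTS):
    """Return (hit counts, most recent visit) per agent from log lines."""
    counts = {agent: 0 for agent in agents}
    last_seen = {agent: None for agent in agents}
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        # %b expects English month abbreviations (default C locale).
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        lowered = line.lower()
        for agent in agents:
            if agent.lower() in lowered:
                counts[agent] += 1
                if last_seen[agent] is None or when > last_seen[agent]:
                    last_seen[agent] = when
    return counts, last_seen


def stale_agents(last_seen, now, max_gap=timedelta(days=14)):
    """Agents never seen, or not seen for at least `max_gap` before `now`."""
    return sorted(
        agent
        for agent, seen in last_seen.items()
        if seen is None or now - seen >= max_gap
    )
```

Feed it an open log file, for example summarize(open(path)) with path pointing at your own access log, then pass last_seen and a timezone-aware datetime to stale_agents. Matching on the name alone also counts requests that merely spoof a crawler's user-agent, so treat the counts as an upper bound.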

What about ChatGPT Atlas and agent-mode browsers?

You cannot control them with robots.txt — and you should not try to block them by user-agent. Agentic browsers like ChatGPT Atlas (~5M monthly users) and Perplexity Comet (~3M) drive real Chrome sessions on behalf of users, sending standard Chrome signatures. Blocking by user-agent pattern would block real human visitors. These are potential customers delegating a task to an AI — your job is to make the site work for them, not to keep them out.


Allow the crawlers, verify at the CDN, confirm in the logs. Fifteen minutes of robots.txt work plus a monthly log check is the cheapest insurance in GEO: every other technique depends on the engines being able to read your site at all.

Frequently asked questions

Which AI crawlers should I allow in my robots.txt file?

Allow GPTBot, ClaudeBot, anthropic-ai, CCBot (training); OAI-SearchBot, Claude-SearchBot, PerplexityBot (live search); ChatGPT-User, Perplexity-User, Google-Agent (user-triggered fetches); and Google-Extended plus Applebot-Extended (training opt-out tokens). Blocking any search crawler makes citation impossible on that engine.

Does blocking AI crawlers protect my content from being used by AI?

Partially — it stops declared crawlers, but it also eliminates your AI citation potential entirely. And agent-mode browsers (ChatGPT Atlas, Perplexity Comet) use regular Chrome signatures that robots.txt cannot control. For most businesses, the citation upside outweighs the content-reuse concern.

Why are AI bots not visiting my site even though robots.txt allows them?

The most common cause is CDN-level blocking: Cloudflare Bot Fight Mode, Sucuri, and some WordPress security plugins block AI bots at the network layer before robots.txt is even read. Check your CDN bot settings and verify actual visits in server logs.

What is the difference between GPTBot and OAI-SearchBot?

GPTBot collects content for training OpenAI models — long-term brand knowledge. OAI-SearchBot builds the live search index that ChatGPT quotes when answering with web results — near-term citations. Most businesses should allow both; blocking OAI-SearchBot kills live citations.

Does the llms.txt file replace robots.txt for AI crawlers?

No. llms.txt is a community convention that major AI citation engines still do not fetch for retrieval — adoption sits around 10% of domains and studies show no measurable citation impact. robots.txt remains the operative file for crawler permissions.
