TL;DR
If AI crawlers cannot read your site, AI engines cannot cite it. Allow four categories in robots.txt: training crawlers (GPTBot, ClaudeBot, CCBot), search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot), user-triggered fetchers (ChatGPT-User, Perplexity-User, Google-Agent), and opt-out tokens (Google-Extended, Applebot-Extended). Then verify at the CDN layer — Cloudflare bot protection can silently block AI bots even when robots.txt allows them — and confirm real visits in your server logs.
Robots.txt is the gate for AI crawlers, and most businesses have never checked theirs against the 2026 list of AI user-agents. A single security plugin update or CDN setting can silently cut you out of ChatGPT, Perplexity, and Claude citations entirely.
This guide gives you the complete 2026 allow list by category, the exact robots.txt block to copy, and the two places where AI bots get blocked even when robots.txt says allow.
What are the four categories of AI user-agents in 2026?
Training crawlers, search crawlers, user-triggered fetchers, and opt-out tokens — each with a different job and a different cost if you block it.
| Category | Agents | What blocking costs you |
|---|---|---|
| Training crawlers | GPTBot, ClaudeBot, anthropic-ai, CCBot | Long-term: models never learn your brand exists |
| Search crawlers | OAI-SearchBot, Claude-SearchBot, PerplexityBot | Immediate: zero live citations on that engine |
| User-triggered fetchers | ChatGPT-User, Perplexity-User, Google-Agent | Users asking AI about YOUR site get an error |
| Opt-out tokens | Google-Extended, Applebot-Extended | Directive-only: controls training use, never crawls |
Verdict: allow all four categories unless you have a specific licensing strategy — the citation upside outweighs content-reuse concerns for most businesses.
What should my robots.txt actually say?
Copy this block — it explicitly allows every major AI agent of 2026:
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: CCBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Agent
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /
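One warning: a wildcard group (User-agent: * with Disallow: /) does not override the named groups above, because a crawler obeys the most specific user-agent group that matches it. Any AI agent you have not listed by name still falls under the wildcard, so check that you do not have one blocking everything by accident.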
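To sanity-check a robots.txt file against the allow list, a minimal sketch using Python's standard `urllib.robotparser` might look like the following. The `blocked_agents` helper and the `https://example.com/` URL are illustrative assumptions, and Python's parser matches agent names by substring and applies rules in file order, so treat the result as a rough check rather than a guarantee of how each crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The agent tokens listed in this guide.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot",
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Perplexity-User", "Google-Agent",
    "Google-Extended", "Applebot-Extended",
]


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the agents from AI_AGENTS that this robots.txt text disallows for url.

    `url` is a placeholder; pass a real page from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


# Illustrative input: GPTBot is named explicitly, everything else hits the wildcard.
sample = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""
print(blocked_agents(sample))  # every agent except GPTBot
```

In this sample only GPTBot has its own group, so every other agent falls through to the wildcard and is reported as blocked. To check a live site, read the text of its /robots.txt and pass it to `blocked_agents`.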
Why do AI bots get blocked even when robots.txt allows them?
Two silent killers: the CDN layer and JavaScript-only rendering.
The CDN layer. Cloudflare Bot Fight Mode, Super Bot Fight Mode, Sucuri, and similar WAF features can classify AI crawlers as hostile bots and block them at the network layer, before robots.txt is ever read. If you use Cloudflare, check Security → Bots and add exceptions for verified AI crawlers.
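One rough way to spot network-layer blocking is to request the same page twice, once with a browser-like User-Agent and once with a string containing a crawler token, then compare status codes. The sketch below is a smoke test under stated assumptions: the User-Agent strings are simplified stand-ins, not the crawlers' real headers, and `probe`, `bot_user_agent`, and the example URL are hypothetical names. CDNs often verify bots by IP address as well, so a spoofed request from your own machine can be treated differently from the real crawler. Server logs remain the authoritative check.

```python
import urllib.error
import urllib.request

# Assumption: a generic browser-like string used for the baseline request.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def bot_user_agent(token: str) -> str:
    """Build a simplified, hypothetical User-Agent containing the crawler token."""
    return f"Mozilla/5.0 (compatible; {token}/1.0)"


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 403, 429, 503 and similar arrive as HTTPError


def looks_blocked(browser_status: int, bot_status: int) -> bool:
    """True when the browser-like request succeeds but the bot-like one does not."""
    return (200 <= browser_status < 300) and not (200 <= bot_status < 300)


def probe(url: str, tokens: list[str]) -> dict[str, bool]:
    """Map each crawler token to whether its request looks blocked at the edge."""
    baseline = fetch_status(url, BROWSER_UA)
    return {
        token: looks_blocked(baseline, fetch_status(url, bot_user_agent(token)))
        for token in tokens
    }


# Example (performs real HTTP requests, so it is left commented out):
# print(probe("https://example.com/", ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

A `True` for a token suggests the edge layer is rejecting that User-Agent before your origin or robots.txt is consulted. A `False` is weaker evidence, for the IP-verification reason above.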
JavaScript-only content. Some AI crawlers do not execute JavaScript. If your page is an empty div until JS runs, the crawler sees nothing — allowed or not. Test with a curl request: if your main content is not in the raw HTML response, you have a rendering problem, not a permissions problem.
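The curl test can also be scripted. The sketch below assumes you already have the raw HTML response as a string (for example, saved from curl) and checks whether a phrase from your main content appears in the markup outside `<script>` and `<style>` blocks. The helper name `raw_html_contains` and the sample pages are illustrative, not from any particular site.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def raw_html_contains(html: str, phrase: str) -> bool:
    """True if phrase appears in the text a non-JavaScript crawler would see."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()


# Made-up pages: one server-rendered, one that only fills in content via JS.
server_rendered = "<html><body><main><p>Pricing starts at $49</p></main></body></html>"
js_only = (
    '<html><body><div id="root"></div>'
    '<script>render("Pricing starts at $49")</script></body></html>'
)
print(raw_html_contains(server_rendered, "Pricing starts at $49"))
print(raw_html_contains(js_only, "Pricing starts at $49"))
```

If the phrase is found only on the server-rendered version of a page, the problem is rendering, not permissions.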
How do I verify AI bots are actually visiting?
Check your server logs monthly for each agent name — allowing bots in robots.txt does not prove they are visiting. In cPanel raw access logs or your hosting analytics, search for GPTBot, ClaudeBot, OAI-SearchBot, and PerplexityBot. A healthy pattern shows GPTBot and ClaudeBot visiting weekly and PerplexityBot multiple times per day.
If a major bot stops appearing for two weeks or more, investigate immediately: the usual suspects are a CMS security plugin update, a CDN setting change, or an accidental robots.txt edit.
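The monthly log check can be automated along these lines. This is a sketch, assuming access logs in the common Apache/Nginx combined format with timestamps such as `[01/Mar/2026:10:00:00 +0000]` and an English locale for month names. The sample lines are invented for illustration, and the matching is a plain substring search over each line, so an agent name appearing in a URL or referrer would also count.

```python
import re
from datetime import datetime, timedelta, timezone

# The four agents this guide suggests searching for in logs.
TRACKED = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def last_seen(log_lines):
    """Map each tracked agent to the newest timestamp of a line mentioning it."""
    seen = {}
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        lowered = line.lower()
        for agent in TRACKED:
            if agent.lower() in lowered and (agent not in seen or when > seen[agent]):
                seen[agent] = when
    return seen


def silent_agents(log_lines, now, max_gap=timedelta(days=14)):
    """Return tracked agents never seen, or not seen within max_gap of now."""
    seen = last_seen(log_lines)
    return [a for a in TRACKED if a not in seen or now - seen[a] >= max_gap]


# Invented sample lines for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [20/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
now = datetime(2026, 3, 21, tzinfo=timezone.utc)
print(silent_agents(sample_log, now))
```

The 14-day default mirrors the two-week threshold above. In practice you would pass the lines of your real access log and `datetime.now(timezone.utc)`.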
What about ChatGPT Atlas and agent-mode browsers?
You cannot control them with robots.txt — and you should not try to block them by user-agent. Agentic browsers like ChatGPT Atlas (~5M monthly users) and Perplexity Comet (~3M) drive real Chrome sessions on behalf of users, sending standard Chrome signatures. Blocking by user-agent pattern would block real human visitors. These are potential customers delegating a task to an AI — your job is to make the site work for them, not to keep them out.
Allow the crawlers, verify at the CDN, confirm in the logs. Fifteen minutes of robots.txt work plus a monthly log check is the cheapest insurance in GEO: every other technique depends on the engines being able to read your site at all.
Frequently asked questions
Which AI crawlers should I allow in my robots.txt file?
Allow GPTBot, ClaudeBot, anthropic-ai, CCBot (training); OAI-SearchBot, Claude-SearchBot, PerplexityBot (live search); ChatGPT-User, Perplexity-User, Google-Agent (user-triggered fetches); and Google-Extended plus Applebot-Extended (training opt-out tokens). Blocking any search crawler makes citation impossible on that engine.
Does blocking AI crawlers protect my content from being used by AI?
Partially — it stops declared crawlers, but it also eliminates your AI citation potential entirely. And agent-mode browsers (ChatGPT Atlas, Perplexity Comet) use regular Chrome signatures that robots.txt cannot control. For most businesses, the citation upside outweighs the content-reuse concern.
Why are AI bots not visiting my site even though robots.txt allows them?
The most common cause is CDN-level blocking: Cloudflare Bot Fight Mode, Sucuri, and some WordPress security plugins block AI bots at the network layer before robots.txt is even read. Check your CDN bot settings and verify actual visits in server logs.
What is the difference between GPTBot and OAI-SearchBot?
GPTBot collects content for training OpenAI models — long-term brand knowledge. OAI-SearchBot builds the live search index that ChatGPT quotes when answering with web results — near-term citations. Most businesses should allow both; blocking OAI-SearchBot kills live citations.
Does the llms.txt file replace robots.txt for AI crawlers?
No. llms.txt is a community convention that major AI citation engines still do not fetch for retrieval — adoption sits around 10% of domains and studies show no measurable citation impact. robots.txt remains the operative file for crawler permissions.