Robots.txt for AI Search: Stop Blocking the Bots You Want to Cite You
You can publish the most helpful page in your niche.
But if you lock the door, no one gets in.
And in AI search, if the bots can’t get in, you don’t get cited. Simple as that.
The problem (and why it happens)
Most “we aren’t getting cited” problems are not content problems.
They’re access problems.
Somewhere between your server and the crawler, there’s a NO sign.
The three gates every citation has to pass
- Robots gate: robots.txt and meta directives tell crawlers what they can access.
- Firewall gate: your CDN/WAF (Cloudflare, AWS WAF, etc.) decides who gets a 200 and who gets a 403.
- Index gate: for Google-connected systems, the page still needs to be indexable (noindex/canonical issues can quietly kill you).
Quick takeaway: If you are missing citations, start with the gates, not the writing.
See Also: LLM Citation Optimization Checklist (ChatGPT, Perplexity, Gemini)
Crawl vs fetch: the detail that trips people up
Not every bot visit is the same thing.
There is crawling (building an index of pages) and there is user-triggered fetching (a user asks an AI to open a page right now).
Your controls and logs can look different depending on which one is happening.
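In practice, the split shows up in your server logs as different user-agent tokens. Here is a minimal classification sketch; the token sets are assumptions based on the bot names discussed in this article, so verify them against each vendor's current documentation:

```python
# Hypothetical UA token sets -- verify against each vendor's current docs.
CRAWLERS = {"OAI-SearchBot", "GPTBot", "PerplexityBot"}   # index-building crawls
FETCHERS = {"ChatGPT-User", "Perplexity-User"}            # user-triggered fetches

def visit_kind(user_agent: str) -> str:
    """Classify a log line's user-agent as a crawl, a fetch, or neither."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in FETCHERS):
        return "user-triggered fetch"
    if any(token.lower() in ua for token in CRAWLERS):
        return "crawl"
    return "other"
```

Splitting your log analysis this way matters because a spike in user-triggered fetches often means real people are asking AI tools about your pages, even if crawl volume looks flat.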
ChatGPT: OAI-SearchBot vs GPTBot vs ChatGPT-User
If you only remember one thing about ChatGPT crawling, remember this:
OAI-SearchBot is the crawler associated with search inclusion (the stuff that can show up as cited sources in ChatGPT Search).
GPTBot is the crawler associated with training data collection (separate control from search).
ChatGPT-User is a user-initiated fetcher (someone asked ChatGPT to open a page).
The practical implication: you can allow search discovery without necessarily opting into training.
Robots.txt recipe: allow ChatGPT search, block training
This is the pattern most brands want:
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
Yes, you should still test it. And yes, give it time to propagate.
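One quick way to test the pattern before deploying is to feed it to Python's stdlib robots.txt parser and confirm each user-agent gets the access you intended. This is a local sanity check only; it does not prove that your live site (or your CDN's cached copy) serves the same rules:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Search crawler allowed, training crawler blocked
print(rp.can_fetch("OAI-SearchBot", "https://example.com/guide"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/guide"))         # False
```

Run the same check against your staging and production URLs after deploying, since a stale CDN cache can keep serving the old file.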
See Also: Authority and Entity Footprint: Become the Safe Citation
Perplexity: PerplexityBot vs Perplexity-User (and why your WAF is mad)
Perplexity is built around citations, so it crawls and links aggressively.
That is great for visibility, and terrible for fragile firewall rules.
Two names matter:
- PerplexityBot: the crawler intended to surface and link websites in results.
- Perplexity-User: user-triggered fetching (often behaves differently than standard crawling).
Robots.txt recipe: allow PerplexityBot
User-agent: PerplexityBot
Allow: /
WAF/CDN reality check
If you run Cloudflare, AWS WAF, or anything similar, do not assume robots.txt is enough.
Robots.txt is a polite request. Your firewall is a bouncer.
If Perplexity (or any crawler) gets blocked at the edge, it doesn’t matter what your robots.txt says.
- Check your bot rules, rate limits, and managed challenges.
- If you allowlist, do it by published IP ranges plus user-agent verification (not user-agent alone).
- Watch for false positives: some WAF rules treat bots like “scrapers” by default.
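The "IP ranges plus user-agent" check can be sketched with Python's stdlib ipaddress module. The ranges below are documentation placeholders (TEST-NET addresses), not Perplexity's real ranges; always pull the current list from the vendor's published docs:

```python
import ipaddress

# Placeholder ranges (TEST-NET) -- substitute the vendor's published IP ranges.
PERPLEXITY_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_verified_crawler(remote_ip: str, user_agent: str) -> bool:
    """Allowlist only when BOTH the source IP and the UA claim check out."""
    ip = ipaddress.ip_address(remote_ip)
    in_range = any(ip in net for net in PERPLEXITY_RANGES)
    claims_bot = "perplexitybot" in user_agent.lower()
    return in_range and claims_bot
```

The point of requiring both signals: user-agent strings are trivially spoofed, so a UA-only allowlist is an open door for scrapers pretending to be crawlers.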
Gemini: it’s Google indexing with a different hat
Gemini lives in Google’s ecosystem. Translation: Google indexability still matters.
Two common mistakes:
- Using robots.txt as “security” (it is not).
- Blocking a page in robots.txt and then expecting a noindex tag to work (Google needs to crawl the page to see noindex).
What to do instead?
- If you want a page excluded from search: use noindex (meta tag or HTTP header).
- If you want it included: don’t block crawling, and confirm it’s indexable in Search Console (URL Inspection).
- Keep your canonical tags sane. Self-canonical is your friend for single pages.
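If you want to script the noindex check across a batch of pages, one approach is scanning the fetched HTML for a robots meta tag with Python's stdlib parser. Keep in mind this only catches the meta-tag form; a noindex sent via the X-Robots-Tag HTTP header never appears in the HTML at all:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name", "").lower() == "robots":
                self.directives.append((attr.get("content") or "").lower())

def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in content for content in parser.directives)
```

For anything ambiguous, the URL Inspection tool in Search Console remains the authoritative answer on whether Google can index the page.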
Troubleshooting: “We allowed the bots, but still no citations”
Here is the short list I run through when a page should be eligible, but isn’t showing up.
- Robots caching: are you serving an old robots.txt? Check CDN cache and origin.
- Wrong user-agent rules: did you allow the bot you meant to allow?
- 403/401/429 errors: crawlers get blocked more often than humans notice.
- Intermittent bot challenges: JavaScript challenges and CAPTCHAs can silently stop crawlers.
- Noindex or canonical misfires: the page you want cited may be pointing at a different URL.
- Content gated behind popups: if the main content is not accessible fast, extraction gets harder.
- Time: crawlers do not update instantly. Give changes time to be seen and processed.
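Several items on that list (403s, 429s, challenge responses) show up directly in your access logs. A rough sketch that tallies HTTP statuses per AI crawler from combined-format log lines; the regex and the bot token list are assumptions, so adapt both to your own log format:

```python
import re
from collections import Counter

BOT_TOKENS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User",
              "PerplexityBot", "Perplexity-User", "Googlebot")

# Rough match for combined log format: "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

def bot_status_counts(log_lines):
    """Count (bot, status) pairs so 403/429 spikes for crawlers stand out."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        for bot in BOT_TOKENS:
            if bot.lower() in match.group("ua").lower():
                counts[(bot, match.group("status"))] += 1
    return counts
```

If a crawler you allowed in robots.txt is piling up 403s or 429s here, the problem is the firewall gate, not the robots gate.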
See Also: Answer capsules (the 25-word pattern that gets lifted into AI answers).
References
- OpenAI Platform Docs – Bots (OAI-SearchBot, GPTBot, ChatGPT-User):
- Perplexity Docs – Perplexity Crawlers (PerplexityBot, Perplexity-User, IP ranges):
- Google Search Central – Introduction to robots.txt:
- Google Search Central – Block indexing with noindex:
- Google Search Console Help – URL Inspection tool:
About The Author
Dave Burnett
I help people make more money online.
Over the years I’ve had lots of fun working with thousands of brands and helping them distribute millions of promotional products and implement multinational rewards and incentive programs.
Now I’m helping great marketers turn their products and services into sustainable online businesses.
How can I help you?