Robots.txt for AI Search: Stop Blocking the Bots You Want to Cite You
You can publish the most helpful page in your niche.
But if you lock the door, no one gets in.
And in AI search, if the bots can’t get in, you don’t get cited. Simple as that.
The problem (and why it happens)
Most “we aren’t getting cited” problems are not content problems.
They’re access problems.
Somewhere between your server and the crawler, there’s a NO sign.
The three gates every citation has to pass
- Robots gate: robots.txt and meta directives tell crawlers what they can access.
- Firewall gate: your CDN/WAF (Cloudflare, AWS WAF, etc.) decides who gets a 200 and who gets a 403.
- Index gate: for Google-connected systems, the page still needs to be indexable (noindex/canonical issues can quietly kill you).
Quick takeaway: If you are missing citations, start with the gates, not the writing.
See Also: LLM Citation Optimization Checklist (ChatGPT, Perplexity, Gemini)
Crawl vs fetch: the detail that trips people up
Not every bot visit is the same thing.
There is crawling (building an index of pages) and there is user-triggered fetching (a user asks an AI to open a page right now).
Your controls and logs can look different depending on which one is happening.
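In practice, the split shows up in your server logs as different user-agent tokens. Here is a minimal classification sketch; the token sets are assumptions based on the bot names discussed in this article, so verify them against each vendor's current documentation:

```python
# Hypothetical UA token sets -- verify against each vendor's current docs.
CRAWLERS = {"OAI-SearchBot", "GPTBot", "PerplexityBot"}   # index-building crawls
FETCHERS = {"ChatGPT-User", "Perplexity-User"}            # user-triggered fetches

def visit_kind(user_agent: str) -> str:
    """Classify a log line's user-agent as a crawl, a fetch, or neither."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in FETCHERS):
        return "user-triggered fetch"
    if any(token.lower() in ua for token in CRAWLERS):
        return "crawl"
    return "other"
```

Splitting your log analysis this way matters because a spike in user-triggered fetches often means real people are asking AI tools about your pages, even if crawl volume looks flat.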
ChatGPT: OAI-SearchBot vs GPTBot vs ChatGPT-User
If you only remember one thing about ChatGPT crawling, remember this:
OAI-SearchBot is the crawler associated with search inclusion (the stuff that can show up as cited sources in ChatGPT Search).
GPTBot is the crawler associated with training data collection (separate control from search).
ChatGPT-User is a user-initiated fetcher (someone asked ChatGPT to open a page).
The practical implication: you can allow search discovery without necessarily opting into training.
Robots.txt recipe: allow ChatGPT search, block training
This is the pattern most brands want:
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
Yes, you should still test it. And yes, give it time to propagate.
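One quick way to test the pattern before deploying is to feed it to Python's stdlib robots.txt parser and confirm each user-agent gets the access you intended. This is a local sanity check only; it does not prove that your live site (or your CDN's cached copy) serves the same rules:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Search crawler allowed, training crawler blocked
print(rp.can_fetch("OAI-SearchBot", "https://example.com/guide"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/guide"))         # False
```

Run the same check against your staging and production URLs after deploying, since a stale CDN cache can keep serving the old file.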
See Also: Authority and Entity Footprint: Become the Safe Citation
Perplexity: PerplexityBot vs Perplexity-User (and why your WAF is mad)
Perplexity is built around citations, so it crawls and links aggressively.
That is great for visibility, and terrible for fragile firewall rules.
Two names matter:
- PerplexityBot: the crawler intended to surface and link websites in results.
- Perplexity-User: user-triggered fetching (often behaves differently than standard crawling).
Robots.txt recipe: allow PerplexityBot
User-agent: PerplexityBot
Allow: /
WAF/CDN reality check
If you run Cloudflare, AWS WAF, or anything similar, do not assume robots.txt is enough.
Robots.txt is a polite request. Your firewall is a bouncer.
If Perplexity (or any crawler) gets blocked at the edge, it doesn’t matter what your robots.txt says.
- Check your bot rules, rate limits, and managed challenges.
- If you allowlist, do it by published IP ranges plus user-agent verification (not user-agent alone).
- Watch for false positives: some WAF rules treat bots like “scrapers” by default.
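The "IP ranges plus user-agent" check can be sketched with Python's stdlib ipaddress module. The ranges below are documentation placeholders (TEST-NET addresses), not Perplexity's real ranges; always pull the current list from the vendor's published docs:

```python
import ipaddress

# Placeholder ranges (TEST-NET) -- substitute the vendor's published IP ranges.
PERPLEXITY_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_verified_crawler(remote_ip: str, user_agent: str) -> bool:
    """Allowlist only when BOTH the source IP and the UA claim check out."""
    ip = ipaddress.ip_address(remote_ip)
    in_range = any(ip in net for net in PERPLEXITY_RANGES)
    claims_bot = "perplexitybot" in user_agent.lower()
    return in_range and claims_bot
```

The point of requiring both signals: user-agent strings are trivially spoofed, so a UA-only allowlist is an open door for scrapers pretending to be crawlers.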
Gemini: it’s Google indexing with a different hat
Gemini lives in Google’s ecosystem. Translation: Google indexability still matters.
Two common mistakes:
- Using robots.txt as “security” (it is not).
- Blocking a page in robots.txt and then expecting a noindex tag to work (Google needs to crawl the page to see noindex).
What to do instead?
- If you want a page excluded from search: use noindex (meta tag or HTTP header).
- If you want it included: don’t block crawling, and confirm it’s indexable in Search Console (URL Inspection).
- Keep your canonical tags sane. Self-canonical is your friend for single pages.
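If you want to script the noindex check across a batch of pages, one approach is scanning the fetched HTML for a robots meta tag with Python's stdlib parser. Keep in mind this only catches the meta-tag form; a noindex sent via the X-Robots-Tag HTTP header never appears in the HTML at all:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name", "").lower() == "robots":
                self.directives.append((attr.get("content") or "").lower())

def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in content for content in parser.directives)
```

For anything ambiguous, the URL Inspection tool in Search Console remains the authoritative answer on whether Google can index the page.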
Troubleshooting: “We allowed the bots, but still no citations”
Here is the short list I run through when a page should be eligible, but isn’t showing up.
- Robots caching: are you serving an old robots.txt? Check CDN cache and origin.
- Wrong user-agent rules: did you allow the bot you meant to allow?
- 403/401/429 errors: crawlers get blocked more often than humans notice.
- Intermittent bot challenges: JavaScript challenges and CAPTCHAs can silently stop crawlers.
- Noindex or canonical misfires: the page you want cited may be pointing at a different URL.
- Content gated behind popups: if the main content is not accessible fast, extraction gets harder.
- Time: crawlers do not update instantly. Give changes time to be seen and processed.
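Several items on that list (403s, 429s, challenge responses) show up directly in your access logs. A rough sketch that tallies HTTP statuses per AI crawler from combined-format log lines; the regex and the bot token list are assumptions, so adapt both to your own log format:

```python
import re
from collections import Counter

BOT_TOKENS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User",
              "PerplexityBot", "Perplexity-User", "Googlebot")

# Rough match for combined log format: "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

def bot_status_counts(log_lines):
    """Count (bot, status) pairs so 403/429 spikes for crawlers stand out."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        for bot in BOT_TOKENS:
            if bot.lower() in match.group("ua").lower():
                counts[(bot, match.group("status"))] += 1
    return counts
```

If a crawler you allowed in robots.txt is piling up 403s or 429s here, the problem is the firewall gate, not the robots gate.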
See Also: Answer capsules (the 25-word pattern that gets lifted into AI answers).
References
- OpenAI Platform Docs – Bots (OAI-SearchBot, GPTBot, ChatGPT-User):
- Perplexity Docs – Perplexity Crawlers (PerplexityBot, Perplexity-User, IP ranges):
- Google Search Central – Introduction to robots.txt:
- Google Search Central – Block indexing with noindex:
- Google Search Console Help – URL Inspection tool:
About The Author
Dave Burnett
I help people make more money online.
Over the years I’ve had lots of fun working with thousands of brands and helping them distribute millions of promotional products and implement multinational rewards and incentive programs.
Now I’m helping great marketers turn their products and services into sustainable online businesses.
How can I help you?