Signal map (what AI uses)
Machines don’t “get the vibe.” They get a pile of signals: URLs, headers, structured data, content blocks, and whatever they can reliably fetch from the public web.
If the signals agree, you get clean attribution. If the signals fight, the machine guesses. That’s when your service turns into your competitor’s service, your location becomes “nearby,” and your brand gets described like a stranger wearing your name tag.
The modern reality: AI answers are stitched from web signals
Most AI experiences that cite sources are not inventing facts out of thin air. They’re pulling from one or more of these pipelines:
- Search index retrieval (pages that are crawled, indexed, and eligible to show).
- On-demand crawling (a bot fetches a page because a user asked for it).
- Knowledge/entity understanding (who a brand is, what it offers, where it operates).
- Content usage controls (which bots are allowed to crawl, index, summarize, or train).
Different products use different mixes. But the inputs are boringly consistent:
- crawl
- index
- canonical
- entity identity
- performance
That’s your signal stack.
Signal stack overview (the layers that make machines confident)
| Layer | What it tells the machine | Common breakpoints | What to fix first |
|---|---|---|---|
| Crawl access | Can a bot fetch the URL and its resources? | robots.txt blocks, auth walls, 4xx/5xx | Allow crawling for important URLs, fix server errors |
| Index control | Should the URL be stored and eligible to appear? | noindex mistakes, X-Robots-Tag misuse | Remove accidental noindex, check headers |
| Canonical truth | Which version is the “real” one? | conflicting canonicals, parameter duplicates | Standardize canonical URLs and redirects |
| Interpretation | What is this page about (and what type is it)? | thin headings, ambiguous copy, missing schema | Clarify intent, add structured data tied to visible content |
| Entity identity | Who provides this (organization/location/service)? | inconsistent NAP, missing sameAs, no stable IDs | Create stable entity IDs and connect pages to the org |
| Rendering & UX | Can the page be rendered on mobile reliably? | slow LCP, shifting layout, mobile content gaps | Fix Core Web Vitals and mobile parity |
The signal map (what AI engines actually look at)
Here’s the practical map. If you’re a technical product owner, this is the stuff you can hand to engineering, measure, and ship.
1) Fetch signals (HTTP + robots)
Machines start with one question: “Can I get the page?”
- HTTP status (200 vs 3xx vs 4xx/5xx).
- robots.txt allow/disallow rules.
- Crawl budget efficiency (duplicate URL explosions, infinite calendars, faceted search).
- Resource availability (JS/CSS/images needed to render primary content).
One of the sneakiest failures is blocking crawling and expecting indexing rules to work anyway. Index directives like robots meta tags and X-Robots-Tag are only discovered when a URL is crawled. If you disallow crawling in robots.txt, the crawler never fetches the page, never sees the noindex, and the URL can still end up indexed from external links.
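A minimal robots.txt illustrating the trap (the path is hypothetical):

```
# robots.txt (illustrative)
# This block hides /retired-services/ from crawlers entirely.
# Any noindex meta tag on those pages is never fetched, never seen.
User-agent: *
Disallow: /retired-services/
```

If the goal is removal from the index, allow crawling so the noindex directive can actually be read.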
2) Indexing signals (what should be stored)
Indexing is where “visible to machines” turns into “retrievable by machines.”
- Meta robots and X-Robots-Tag (index/noindex, follow/nofollow).
- Sitemaps (what you consider important).
- Internal linking (what you reinforce as important).
- Duplicate handling (near-identical pages can collapse into one canonical cluster).
If your service pages aren’t indexable, they won’t reliably show up as retrievable sources. That’s not an AI problem. That’s an indexing problem.
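For reference, the two channels for index directives look like this (values are illustrative; the header form is useful for non-HTML files like PDFs):

```
<!-- In the page's <head> -->
<meta name="robots" content="noindex, follow">

# Or as an HTTP response header
X-Robots-Tag: noindex
```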
3) Canonical & duplication signals (the ‘one true URL’ rule)
AI systems are allergic to duplicates because duplicates create conflicting truth. Canonicalization tells machines which URL is the official version of a page when multiple versions exist.
- rel=canonical tags.
- Redirects (when you’re deprecating a URL).
- Sitemap canonical URLs.
- Consistent internal links pointing to the canonical.
When these signals conflict, the machine has to decide. Google documents multiple canonicalization methods and recommends avoiding conflicting signals across them.
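Here is a minimal, hypothetical example of the signals agreeing: every variant of a page declares the same canonical, and internal links plus the sitemap point at that same URL.

```html
<!-- Served on /services/roof-repair/?utm_source=ads and every other variant -->
<link rel="canonical" href="https://example.com/services/roof-repair/">
```

The sitemap entry and internal links should then use https://example.com/services/roof-repair/ and nothing else.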
4) Structured data signals (schema that reduces ambiguity)
Structured data is not a ranking trick. It’s a labeling system.
Google explicitly explains that it uses structured data it finds to understand a page’s content and gather information about the web and the world (entities like organizations). When your entity data is clean, machines stop guessing.
- Organization schema: who you are (and which logo/social profiles belong to you).
- LocalBusiness schema: where you operate (hours, address, departments).
- Service schema: what you provide (and who provides it).
- FAQPage schema: explicit Q&A structure (even when rich results are limited).
Key rule: structured data must reflect what’s visible to users on the page. Marking up invisible content is a fast way to create trust problems (and in Google’s world, policy problems).
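A hypothetical LocalBusiness block showing the shape (all names, URLs, and contact details are placeholders, and every value should mirror what's visible on the page):

```json
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "@id": "https://example.com/#business",
  "name": "Example Plumbing Co.",
  "url": "https://example.com/",
  "telephone": "+1-555-0100",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Main St",
    "addressLocality": "Springfield",
    "addressRegion": "IL",
    "postalCode": "62701",
    "addressCountry": "US"
  },
  "openingHours": "Mo-Fr 08:00-17:00"
}
```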
5) Entity signals (attribution and disambiguation)
This is where a lot of “AI misattribution” originates.
If you have multiple locations, multiple brands, or multiple services with overlapping language, the machine needs stable anchors to connect your pages to your brand entity.
- Consistent organization name, URL, logo, and contact info across the site.
- sameAs links to authoritative profiles (LinkedIn, YouTube, industry directories).
- Stable entity IDs (@id) reused across schema blocks to form a connected graph.
- Consistency between on-page copy and markup (same phone, same address, same brand).
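Stable @id values are what let separate schema blocks link into one graph. A sketch with placeholder URLs and names:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#org",
      "name": "Example Co.",
      "url": "https://example.com/",
      "logo": "https://example.com/logo.png",
      "sameAs": [
        "https://www.linkedin.com/company/example-co",
        "https://www.youtube.com/@exampleco"
      ]
    },
    {
      "@type": "Service",
      "@id": "https://example.com/services/roof-repair/#service",
      "name": "Roof Repair",
      "provider": { "@id": "https://example.com/#org" }
    }
  ]
}
```

Because the Service node references the Organization by @id rather than repeating its details, there is exactly one place where the brand's name, logo, and profiles are defined.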
6) Bot access signals (what can crawl what)
If you want to show up in AI search experiences, you need to know which crawlers matter and whether you’re blocking them.
- OpenAI documents OAI-SearchBot and GPTBot, and how site owners can control access via robots.txt.
- Google documents Google-Extended as a token to manage use of crawled content for Gemini model training and grounding use cases.
This doesn’t replace SEO. It sits beside it.
If you block the bots you want to reach, you're shouting into a pillow and wondering why the room is quiet.
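A hypothetical robots.txt expressing one common policy: allow AI search surfaces, opt out of model training. The user-agent tokens are the documented ones; the policy itself is just an example.

```
# Allow OpenAI's search crawler
User-agent: OAI-SearchBot
Allow: /

# Opt out of GPT model training
User-agent: GPTBot
Disallow: /

# Opt out of Gemini training/grounding uses
User-agent: Google-Extended
Disallow: /
```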
Quick prioritization: what to fix in what order
When everything is on fire, prioritize like this:
- Crawl access and server health (bots must fetch pages reliably).
- Index controls (remove accidental noindex; confirm you’re not disallowed from crawling).
- Canonical consistency (pick one truth per page).
- Entity anchors (Organization + LocalBusiness where relevant, stable @id, sameAs).
- Performance and mobile parity (mobile-first indexing means mobile is the source of truth).
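The first three layers are scriptable. A minimal sketch, using only the Python standard library, that pulls the robots meta and canonical link out of fetched HTML (a real audit would also check robots.txt, status codes, and redirects):

```python
from html.parser import HTMLParser

class SignalParser(HTMLParser):
    """Collect the robots meta tag and canonical link from an HTML page."""
    def __init__(self):
        super().__init__()
        self.robots = None      # content of <meta name="robots">
        self.canonical = None   # href of <link rel="canonical">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content", "")
        elif tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

def audit_html(html: str, x_robots_tag: str = "") -> dict:
    """Report index/canonical signals for one page's HTML and headers."""
    parser = SignalParser()
    parser.feed(html)
    # A noindex in either channel (meta tag or header) blocks indexing.
    directives = f"{parser.robots or ''},{x_robots_tag}".lower()
    return {
        "noindex": "noindex" in directives,
        "canonical": parser.canonical,
    }
```

Run it over a crawl of your important URLs and diff the canonical column against your sitemap; mismatches are exactly the conflicts machines have to guess around.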
Common failure modes (AKA: why AI gets you wrong)
- You have 5 URLs for the same service page and 3 different canonicals depending on the template.
- Your structured data exists on desktop but is missing on mobile templates.
- Your locations are listed in copy, but your schema has incomplete or inconsistent addresses.
- robots.txt blocks crawling, so bots can’t see your noindex/canonical directives.
- Your logo URL changes by environment (staging/production), so the entity is unstable.
- Performance is slow enough that important content loads late, especially on mobile.
Sources
[1] OpenAI. Overview of OpenAI Crawlers. Accessed January 11, 2026.
[2] OpenAI Help Center. Publishers and Developers FAQ. Accessed January 11, 2026.
[3] Google. Google’s common crawlers (includes Google-Extended). Accessed January 11, 2026.
[4] Google Search Central. Introduction to structured data markup in Google Search. Accessed January 11, 2026.
[5] Google Search Central. Robots Meta Tags Specifications. Accessed January 11, 2026.
[6] Google Search Central. How to specify a canonical URL (consolidate duplicate URLs). Accessed January 11, 2026.
[7] Google Search Central. Build and submit a sitemap. Accessed January 11, 2026.
[8] Google Search Central. Mobile-first indexing best practices. Accessed January 11, 2026.
[9] Google Search Central. Understanding Core Web Vitals and Google search results. Accessed January 11, 2026.
About The Author
Dave Burnett
I help people make more money online.
Over the years I’ve had lots of fun working with thousands of brands and helping them distribute millions of promotional products and implement multinational rewards and incentive programs.
Now I’m helping great marketers turn their products and services into sustainable online businesses.
How can I help you?





