How AI Engines Choose What to Cite: The Retrieval Pipeline Explained (2026)

TL;DR

AI engines pick citations through a four-stage pipeline. (1) Crawl: if bots cannot read your site, nothing else matters. (2) Chunk: pages get split into passages, so writing self-contained 200-400 word sections lets you control your own chunk boundaries. (3) Retrieve: passages compete on relevance to the query, where question-formatted headings and answer-first writing win. (4) Select: the model quotes passages with verifiable specifics from entities it recognizes. Each engine weighs the stages differently, which is why only ~11% of sites are cited by both ChatGPT and Perplexity. "Lost in the Middle" research (Liu et al., 2023) helps explain why 44.2% of ChatGPT citations come from the first 30% of a page's text.

Every AI citation passes through the same four-stage pipeline: crawl, chunk, retrieve, select. Understand what each stage rewards and every GEO technique stops being a trick and becomes obvious engineering. This is the mental model behind everything else in our GEO guide — and the reason seemingly identical websites get wildly different citation results.

Stage 1: Crawl — can the engine read you at all?

Binary gate: if AI crawlers cannot fetch your pages, you do not exist in the pipeline. Three failure points, in order of frequency: robots.txt blocks (sometimes inherited accidentally from a security plugin), CDN-level bot blocking (Cloudflare bot protection silently rejecting GPTBot or PerplexityBot before robots.txt is even read), and JavaScript-only rendering (some AI crawlers never execute JS — if your content is not in the raw HTML, it was never crawled).
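
The first failure point, a robots.txt block, is the easiest one to test for yourself. The sketch below uses Python's standard-library robots.txt parser against an inline sample file. The sample rules and the example.com URL are hypothetical; swap in your own file and pages. It only covers robots.txt, so a clean result says nothing about CDN-level blocking or JavaScript-only rendering.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, load your site's real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# The two crawler names mentioned above; extend the list as needed.
AI_AGENTS = ["GPTBot", "PerplexityBot"]


def blocked_agents(robots_txt, agents, url):
    """Return the user agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(SAMPLE_ROBOTS_TXT, AI_AGENTS, "https://example.com/pricing"))
# prints ['GPTBot']
```

In this sample, the GPTBot-specific rule takes precedence over the permissive wildcard rule, which is exactly how an inherited block can hide behind an otherwise open file.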

What this stage rewards: explicit allow rules for all four AI user-agent categories, CDN exceptions for verified AI bots, server-side rendering, and a monthly server-log check confirming bots actually visit.
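
The monthly log check can be as simple as counting lines that mention each crawler. This is a minimal sketch, assuming your access log records the user-agent string on each line; the sample lines and the function name are made up for illustration. A user-agent string can be spoofed, so treat the counts as a signal that bots are arriving, not as proof of verified traffic.

```python
from collections import Counter

# Crawler names mentioned in this article; extend with the bots you track.
AI_BOT_TOKENS = ("GPTBot", "PerplexityBot")


def count_ai_bot_hits(log_lines):
    """Count access-log lines that mention each AI bot token (case-insensitive)."""
    hits = Counter({token: 0 for token in AI_BOT_TOKENS})
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


# Hypothetical log lines in a combined-log-style layout.
sample_log = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Mar/2026:10:05:00 +0000] "GET / HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

hits = count_ai_bot_hits(sample_log)
never_seen = [bot for bot, count in hits.items() if count == 0]
print(dict(hits))
print("No visits from:", never_seen)
```

A bot that is allowed in robots.txt but never appears in the logs is the pattern worth investigating, since it suggests a block somewhere upstream of your server.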

Stage 2: Chunk — how does your page get split into passages?

Engines do not index your page as one unit — they split it into chunks, and each chunk lives or dies alone. Retrieval pipelines typically segment by structure: headings, paragraphs, length thresholds. A 2,000-word page becomes perhaps 6-10 passages.

This is the stage most websites lose without knowing it. A section that starts with "As we mentioned above..." produces a chunk that makes no sense in isolation — unretrievable regardless of quality. A 900-word wall of text gets split at arbitrary points, orphaning key claims from their evidence.

What this stage rewards: controlling your own chunk boundaries. Sections of 200-400 words under question-formatted headings, each fully self-contained, each opening with its conclusion. You are pre-chunking your content so the pipeline cannot mangle it.
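
These rules are mechanical enough to lint. The sketch below checks one section against the three targets named above: length, question-formatted heading, and no back-reference in the opening words. The function name and the list of back-reference phrases are assumptions for illustration, and real chunkers differ from engine to engine, so read the output as a writing checklist, not a simulation of any particular pipeline.

```python
import re

# A few common back-reference openers; this list is illustrative, not exhaustive.
BACK_REFERENCE = re.compile(
    r"^as (we )?(mentioned|noted|discussed|said) (above|earlier|previously)",
    re.IGNORECASE,
)


def audit_section(heading, body, min_words=200, max_words=400):
    """Return a list of problems that would make this section a weak standalone chunk."""
    problems = []
    word_count = len(body.split())
    if not min_words <= word_count <= max_words:
        problems.append(f"{word_count} words (target {min_words}-{max_words})")
    if not heading.strip().endswith("?"):
        problems.append("heading is not phrased as a question")
    if BACK_REFERENCE.match(body.strip()):
        problems.append("opens with a back-reference")
    return problems


print(audit_section("Pricing", "As we mentioned above, prices vary by project."))
```

The example section fails all three checks: it is far too short, its heading is a label instead of a question, and it opens by pointing at text the chunk will not contain.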

Stage 3: Retrieve — does your passage match the query?

When a user asks, the engine searches its chunk index for the passages most relevant to that question — phrasing match matters enormously. A heading that reads "How much does a bilingual website cost in California?" is a near-exact match for the question a real user asks. A heading that reads "Pricing" is not.
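
A toy scorer makes the gap between those two headings concrete. The sketch below uses plain word overlap (Jaccard similarity) as a stand-in for relevance. That is an assumption made for simplicity: production engines typically use embeddings or hybrid search, which would give "Pricing" some credit for being on topic, so the real gap is smaller than this toy suggests.

```python
import re


def tokens(text):
    """Lowercase word set of a string, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, heading):
    """Jaccard similarity between the word sets of the query and the heading."""
    q, h = tokens(query), tokens(heading)
    if not q or not h:
        return 0.0
    return len(q & h) / len(q | h)


query = "how much does a bilingual website cost in california"
for heading in ("How much does a bilingual website cost in California?", "Pricing"):
    print(f"{overlap_score(query, heading):.2f}  {heading}")
# prints:
# 1.00  How much does a bilingual website cost in California?
# 0.00  Pricing
```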

"Lost in the Middle" research (Liu et al., 2023) adds a positional dimension: models attend to the start and end of context far better than the middle. The citation data agrees — 44.2% of ChatGPT citations come from the first 30% of a page's text (Zyppy, 2025).

What this stage rewards: headings phrased as natural user questions, answer-first sentences under each heading, the thesis and key statistic in the first 200 words, and a restated takeaway at the end.
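
Placement is easy to measure once the page text is extracted. This is a minimal sketch that reports how many words precede a given passage and whether it falls inside the first 200 words and the first 30% of the page, the two thresholds mentioned above. The function names and the filler page are hypothetical.

```python
from typing import Optional


def words_before(page_text: str, passage: str) -> Optional[int]:
    """Number of words on the page before the passage starts, or None if it is absent."""
    start = page_text.find(passage)
    if start == -1:
        return None
    return len(page_text[:start].split())


def placement_report(page_text: str, passage: str) -> dict:
    """Summarize where the passage sits relative to the two thresholds."""
    total = len(page_text.split())
    before = words_before(page_text, passage)
    if before is None or total == 0:
        return {"found": False}
    return {
        "found": True,
        "words_before": before,
        "in_first_200_words": before < 200,
        "in_first_30_percent": before / total < 0.30,
    }


# Hypothetical page: 1,000 filler words with the key statistic at word 50.
filler = [f"w{i}" for i in range(1000)]
filler[50] = "KEY-STATISTIC"
page = " ".join(filler)
print(placement_report(page, "KEY-STATISTIC"))
```

Run it against your thesis sentence and your strongest statistic; if either one reports false on both checks, it is buried.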

Stage 4: Select — why does the model quote one passage over another?

Among retrieved candidates, the model favors passages with verifiable specifics from entities it recognizes. Princeton's GEO-Bench quantified the selection levers: cited statistics lift visibility by 37-41%, expert quotations by up to 40%, and source citations by 31.4% when combined with other techniques, while keyword stuffing performs worse than doing nothing (Aggarwal et al., 2024).

Entity recognition is the quiet multiplier here. A passage from a source the engine can verify — Organization schema chained to Wikidata, real authors with LinkedIn-corroborated profiles — beats an equally good passage from an anonymous domain. And brand mentions across the web (~3× more correlated with AI visibility than backlinks, per Ahrefs) tell the engine which names are authorities before it ever reads your page.
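
"Organization schema chained to Wikidata" usually means a schema.org Organization record whose sameAs property points at your external profiles. The sketch below generates that JSON-LD. Every value in it is a placeholder (the business name, the domain, the Wikidata ID, and the LinkedIn URL are all hypothetical), and how much weight any given engine puts on this markup is this article's claim, not something the code can demonstrate.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build schema.org Organization markup as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,  # external profiles that identify the same entity
    }
    return json.dumps(data, indent=2)


markup = organization_jsonld(
    name="Example Studio",  # placeholder business name
    url="https://example.com",
    same_as=[
        "https://www.wikidata.org/wiki/Q00000000",  # placeholder Wikidata ID
        "https://www.linkedin.com/company/example-studio",  # placeholder profile
    ],
)
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The printed script tag goes in the page's HTML, ideally rendered server-side so that crawlers which skip JavaScript still see it.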

Why do different engines cite different sites?

Because each engine weighs the four stages differently — only ~11% of websites are cited by both ChatGPT and Perplexity. ChatGPT leans on Bing-fed indexes and entity recognition. Perplexity crawls aggressively, returns the most citations per answer (~7+ on average), and favors long-form structured content with comparison tables. Gemini draws on Google's index and Knowledge Graph. Claude searches conservatively and cites sparingly, favoring high-authority editorial sources.

The practical consequence: optimize the pipeline stages — which are common to all engines — and then measure per engine, because the selection weights are not.
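
Measuring per engine starts with comparing which domains each one cites. The sketch below computes the share of cited domains that two engines have in common. The URL lists are invented, and the overlap definition (shared domains divided by all domains cited by either engine) is an assumption; it may not be how the ~11% figure was calculated.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Hostname of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def citation_overlap(urls_a, urls_b):
    """Share of all cited domains that appear in both engines' citations."""
    domains_a = {domain_of(u) for u in urls_a}
    domains_b = {domain_of(u) for u in urls_b}
    union = domains_a | domains_b
    if not union:
        return 0.0
    return len(domains_a & domains_b) / len(union)


# Hypothetical citations collected for the same set of prompts.
chatgpt_urls = ["https://www.a.com/x", "https://b.com/y"]
perplexity_urls = ["https://b.com/z", "https://c.com/", "https://d.com/"]
print(f"{citation_overlap(chatgpt_urls, perplexity_urls):.0%}")
# prints 25%
```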

Crawlable, chunkable, retrievable, citable — in that order, because each stage gates the next. Audit your site against the four stages and the gaps become a prioritized to-do list: fix crawl access before structure, structure before phrasing, phrasing before authority. That sequence is the entire discipline of GEO in one sentence.
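
The gating order can be written down directly. This is a small sketch, assuming you have already scored each stage as pass or fail by whatever audit you use; the function name and the sample audit are hypothetical.

```python
# Pipeline order from this article; earlier stages gate later ones.
STAGES = ("crawl", "chunk", "retrieve", "select")


def prioritized_fixes(audit):
    """Return failing stages in pipeline order, earliest gate first.

    A stage missing from the audit is treated as failing.
    """
    return [stage for stage in STAGES if not audit.get(stage, False)]


# Hypothetical audit result for one site.
site_audit = {"crawl": True, "chunk": False, "retrieve": True, "select": False}
print(prioritized_fixes(site_audit))
# prints ['chunk', 'select']
```

Treating an unaudited stage as failing is a deliberate choice here: an untested gate should not be assumed open.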

Frequently asked questions

What is RAG and why does it matter for getting cited by AI?

RAG (Retrieval-Augmented Generation) is the pipeline AI engines use to answer with web content: they retrieve relevant passages from an index, then generate an answer citing the best ones. Every GEO technique maps to a RAG stage — crawl access, chunk-friendly structure, retrievable phrasing, or citable specifics.

Why does the same question give different citations each time I ask an AI?

Because generation is probabilistic and retrieval often returns more candidates than the answer can cite. Small variations in sampling change which retrieved passages make the final cut. This is why measurement requires running each prompt 3 times — and why no one can guarantee a specific citation.
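
Repeated runs turn a yes-or-no result into a rate. The sketch below runs a prompt three times and reports the share of runs in which your domain was cited. The ask callable is a hypothetical stand-in for however you query an engine and collect its cited URLs; no real engine API is used or implied, and the canned responses are invented.

```python
from urllib.parse import urlparse


def citation_rate(ask, prompt, domain, runs=3):
    """Share of runs in which `domain` (or a subdomain) appears among the cited URLs.

    `ask` is a caller-supplied function: prompt -> list of cited URLs.
    """
    cited_runs = 0
    for _ in range(runs):
        hosts = {urlparse(url).netloc.lower() for url in ask(prompt)}
        if any(host == domain or host.endswith("." + domain) for host in hosts):
            cited_runs += 1
    return cited_runs / runs


# Hypothetical engine that returns different citations on each call.
canned = iter([
    ["https://example.com/pricing", "https://other.org/a"],
    ["https://other.org/a"],
    ["https://www.example.com/guide"],
])


def fake_ask(prompt):
    return next(canned)


rate = citation_rate(fake_ask, "how much does a bilingual website cost", "example.com")
print(f"{rate:.2f}")
# prints 0.67
```

Tracking the rate over time, per prompt and per engine, is more informative than any single answer.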

Why do ChatGPT and Perplexity cite such different websites?

Different indexes and different selection weights. ChatGPT draws partly on Bing's index and favors recognized entities; Perplexity crawls aggressively, retrieves more citations per answer (~7+ on average), and rewards long-form structured content. Only ~11% of sites get cited by both, so each engine needs its own optimization attention.

What is the "Lost in the Middle" problem and how do I use it?

LLM research (Liu et al., 2023) showed that models recall information at the start and end of their context far better than information in the middle. Applied to your pages: put the thesis and key statistic in the first 200 words and restate the takeaway in the last 100. Citation data points the same way, with 44.2% of ChatGPT citations coming from the first 30% of page text (Zyppy, 2025).

Do AI engines verify facts before citing a source?

Increasingly, yes — selection favors passages whose claims are specific, attributed, and corroborated by other sources and by entity records (Knowledge Graph, Wikidata). Vague claims and unattributed statistics lose to named, sourced specifics. This is why fabricating data is both unethical and self-defeating.
