Most AEO advice online is written at the level of “publish good content and earn citations.” That’s true as far as it goes, but it leaves out the actual mechanism — what happens inside the AI system when it decides to mention your brand. Without understanding the mechanism, the advice becomes a set of superstitions: “I tried this and it worked” versus “I tried this and it didn’t,” with no way to explain why.

This post goes one layer deeper. Not deep enough to satisfy an ML researcher, but deep enough to make the tactical advice actually make sense. (For the practical playbook, see the AEO guide for 2026.)

The two pathways

When you ask a language model a question and it produces an answer that names specific brands, the brand names got into the answer through one of two pathways. Sometimes both.

Pre-training pathway. The model was trained on a massive corpus of internet text. During training, the model learned statistical associations between words and concepts. If your brand name appeared frequently in the training corpus alongside certain topics, products, or descriptions, the model learned those associations. When a user later asks a question that activates those associations, the model generates an answer that includes your brand name. This pathway doesn’t require any live internet access. A model that’s been cut off from the internet since its training date can still mention your brand if you were in the corpus.

Retrieval pathway. At query time, the AI system runs a search against a live web index or a specialized retrieval corpus, pulls a handful of relevant documents, and passes them into the model’s context window along with the original question. The model generates an answer that’s grounded in those documents, and typically cites them with inline links. This pathway does require live internet access, and the citations reflect what exists on the web right now rather than what existed when the model was trained.

Different AI products weight these pathways differently. Perplexity and Google’s AI Overviews lean heavily on retrieval; a model answering without browsing relies entirely on pre-training; most consumer assistants blend the two depending on the query.

Knowing which product you’re targeting changes which pathway you should optimize for. (For a focused look at one platform, see how to get cited in ChatGPT.)

Pre-training: how brands get baked in

During pre-training, a language model processes a huge corpus of text — hundreds of billions of tokens from crawled web pages, books, code, and specialized datasets. For every token in the corpus, the model updates its parameters to make the token more predictable given the surrounding context.

For a brand like “Acme Payroll” to become something the model will mention in an answer about small business payroll, two things need to be true in the training corpus.

First, the brand name needs to appear frequently. A brand that’s mentioned 10 times in a trillion-token corpus is statistical noise. A brand that’s mentioned 10,000 times has signal. The specific threshold varies by how distinctive the brand name is and how clustered the mentions are, but the general rule is that volume matters.

Second, the mentions need to appear in contexts that associate the brand with the topics you want to be recalled for. If Acme Payroll is mentioned 10,000 times but 9,000 of those are in a single subreddit thread about a specific controversy, the model learns to associate the brand with the controversy, not with the payroll category. You want your brand mentioned in contexts that look like “Acme Payroll, a payroll platform for small businesses, processed X in transactions this year” — topic-aligned mentions, across many different sources, with consistent framing.
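
The underlying statistic is just co-occurrence at enormous scale. As a toy illustration (the corpus, brand, and topic words below are invented for this post’s running “Acme Payroll” example, and real LLM training is far more than counting), here is the kind of association that frequent, topic-aligned mentions create:

```python
# Toy illustration: count how often a brand co-occurs with topic words in
# a tiny "corpus". A model's learned associations reflect this same basic
# statistic, at vastly larger scale and through gradient updates rather
# than explicit counting.
from collections import Counter

corpus = [
    "Acme Payroll, a payroll platform for small businesses, raised prices",
    "best payroll software for small business: Acme Payroll tops the list",
    "Acme Payroll review: payroll automation for small teams",
    "the subreddit thread about the Acme Payroll controversy",
]

brand = "acme payroll"
topic_words = {"payroll", "small", "business", "businesses"}

cooccurrence = Counter()
for doc in corpus:
    text = doc.lower()
    if brand in text:  # only documents that actually mention the brand
        for word in text.split():
            word = word.strip(":,.")
            if word in topic_words:
                cooccurrence[word] += 1

print(cooccurrence.most_common(3))
```

If the controversy thread dominated the corpus instead, the top co-occurring words would be controversy-related, which is exactly the failure mode described above.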

This is exactly what press coverage, trade publication features, review roundups, and knowledge panel establishment do. Not one giant mention but thousands of small, consistent mentions in trusted sources over time. That’s the pattern that moves pre-training.

Retrieval: how citations get picked at query time

Retrieval is more tractable because it’s happening in real time and the decisions are observable.

When a retrieval-based AI product receives a query, the workflow is roughly:

  1. The query gets rewritten or expanded into one or more search queries.
  2. Those queries get sent to a search index (Google’s index for AI Overviews, Bing’s for Copilot, their own crawler for Perplexity).
  3. The top N results are retrieved. N is usually between 5 and 20.
  4. The retrieved pages are fetched, parsed, and ranked for relevance to the original question.
  5. The top few pages (often 3 to 8) are passed into the model’s context window as source material.
  6. The model generates an answer that’s grounded in those sources, with inline citations.
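
The six steps above can be sketched as a single function. Everything here is a hypothetical stand-in — the function names, the query-expansion templates, and the callables for search, fetching, ranking, and generation are placeholders, since each product (AI Overviews, Copilot, Perplexity) has its own internals:

```python
# Minimal sketch of the retrieval workflow, with the search backend,
# ranker, and generator injected as callables. Illustrative only.

def answer_with_retrieval(question, search, fetch, rank, generate,
                          top_n=10, context_k=5):
    """Steps 1-6: rewrite, search, retrieve, rank, ground, generate."""
    # 1. Rewrite/expand the user question into one or more search queries.
    queries = [question, f"{question} best options", f"{question} comparison"]

    # 2-3. Run each query against the index; keep the top N results each.
    results = []
    for q in queries:
        results.extend(search(q, limit=top_n))

    # 4. Fetch and parse the candidate pages, then rank them for
    #    relevance to the original question.
    pages = [fetch(r) for r in results]
    ranked = rank(question, pages)

    # 5. Pass only the top few pages into the model's context window.
    sources = ranked[:context_k]

    # 6. Generate an answer grounded in (and citing) those sources.
    return generate(question, sources)
```

The structural point is step 5: however well you rank in step 3, only a handful of pages survive into the context window, so the competition for citations is tighter than the competition for a first-page ranking.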

Winning citations in this pathway is essentially a two-step problem. First, your page needs to rank in the underlying search. Second, your page needs to look useful to the model when it’s reading the retrieved content.

The first step is classical SEO. The second step is where AEO-specific tactics come in. The model is looking for pages that directly answer the question, use clean structure, contain specific facts and numbers, and look like authoritative sources. A page that ranks fifth on Google but has the clearest direct answer to the question sometimes gets cited over the page that ranks first with a hedged marketing answer.
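
These signals are mechanical enough to check. The scoring function below is purely illustrative — the weights and patterns are invented, not drawn from any real system — but it shows that “directly answers, clean structure, specific facts, no hedging” are properties you can audit on your own pages:

```python
import re

# Purely illustrative heuristic: the weights are made up. The point is
# only that extractability signals are checkable, not subjective.
def extractability_score(page_text):
    first_para = page_text.strip().split("\n\n")[0]
    score = 0
    # Direct answer up front: a short first paragraph that commits to a claim.
    if len(first_para) < 400 and not first_para.endswith("?"):
        score += 2
    # Specific facts: numbers, prices, dates (capped so length doesn't win).
    score += min(len(re.findall(r"\d+", page_text)), 3)
    # Clean structure: headings or list items anywhere on the page.
    if re.search(r"^#+ |^- |^\d+\. ", page_text, re.MULTILINE):
        score += 2
    # Hedged marketing language hurts extraction.
    for hedge in ("industry-leading", "best-in-class", "revolutionary"):
        if hedge in page_text.lower():
            score -= 1
    return score

direct = "Acme Payroll costs $40 per month for up to 10 employees.\n\n## Pricing\n- Base plan: $40"
vague = "Our revolutionary, industry-leading platform transforms payroll."
print(extractability_score(direct), extractability_score(vague))
```

The page with a priced, committed answer scores well; the marketing sentence scores near zero despite being about the same product.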

This is why featured snippet optimization and AEO optimization overlap so heavily. The signals that make a page extractable for a featured snippet also make it extractable for a retrieval-based AI citation.

The hybrid case

Most consumer AI products use both pathways, weighted differently for different query types.

On a factual query with a clear single answer (“what year did WWII end”), retrieval usually dominates. The model pulls a reliable source and uses it.

On a subjective or exploratory query (“what are some good podcasts about history”), pre-training matters more. The model draws on associations baked in during training, with retrieval supplementing.

On a brand-specific query (“tell me about Acme Payroll”), both matter. The model uses pre-training associations to frame the answer and retrieval to get current facts.

Your AEO program has to address both. The on-site content and structural work addresses the retrieval pathway. The press, review, and citation work addresses the pre-training pathway. Neither alone is enough.

Why schema helps less than you think

Structured data markup (Schema.org / JSON-LD) is often pitched as a major AEO lever. The honest answer is that it helps, but less than the marketing suggests.

Schema directly helps retrieval-based systems that parse it — Google’s AI Overviews, and some of Bing’s systems. It helps those systems understand page structure, entity types, and relationships.
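
For concreteness, here is what that markup looks like for the running Acme Payroll example (the company, URL, and description are invented; the schema.org `@context`, `@type`, and property names are real vocabulary):

```python
import json

# A minimal schema.org Organization block of the kind retrieval systems
# parse. The company details are this post's fictional running example.
markup = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme Payroll",
    "url": "https://example.com",
    "description": "Payroll platform for small businesses.",
}

# Embedded in a page as a JSON-LD script tag:
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(markup)
    + "</script>"
)
print(script_tag)
```

A parser that understands JSON-LD gets the entity type and relationships for free; a language model reading the raw page just sees more tokens.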

Schema does not directly help pre-training. Language models trained on raw web text don’t get any special signal from JSON-LD markup because the markup looks like weird JSON to them, not structured data.

That said, the indirect effect is real. Pages with good schema tend to rank better in search, which means they get retrieved more often, which means the text of those pages ends up in more AI contexts. Schema is more a ranking input than a direct AEO lever.

Worth doing. Not worth overhyping. Spend 10 percent of your AEO effort on schema and 90 percent on content and off-site work.

The retrieval signal list

Here’s the concrete list of things the retrieval-based systems seem to weight highly, based on observed behavior and some public documentation:

  1. Ranking in the underlying search index for the queries the system generates.
  2. A direct, committed answer to the question near the top of the page.
  3. Clean structure: descriptive headings, short paragraphs, lists and tables where they fit.
  4. Specific facts and numbers rather than hedged marketing language.
  5. Signals of authority: the page reads like a source, not a pitch.

If your page checks most of these boxes, it will start showing up as a retrieved source for queries in its topic area.

The pre-training signal list

This pathway is less observable, but here’s what the pattern suggests, based on brand-frequency analysis in answers from models trained on different snapshots:

  1. Raw mention volume across the corpus.
  2. Topic-aligned context: mentions that tie the brand to the category you want recall for.
  3. Source diversity: many small mentions across many trusted sites, not one big cluster.
  4. Consistent framing of what the company is and does.

This is the harder pathway to influence because the feedback loop is slow (training happens every 6 to 18 months) and you can’t see what’s in the corpus directly. But the tactical moves are clear: earn press, build entity recognition, publish on authoritative sources, be consistent in how you describe your company.

The takeaway

AEO is two different optimization problems layered on top of each other. One is near-real-time retrieval optimization that rewards the same signals SEO has always rewarded, plus some new structural ones around question-answer formatting. The other is long-cycle brand building through citation density in authoritative sources, which happens over months and shows up in future model training runs.

The brands winning at AEO are doing both. The ones that treat it as pure content work are getting some retrieval wins but missing the pre-training pathway. The ones that only do PR are getting the pre-training work but failing at retrieval because their on-site content isn’t extractable.

The combined program is harder, but it’s also more defensible — the companies that establish strong positions in both pathways compound their advantage over competitors who only do one.