2026-06-09

How to get cited by ChatGPT, Perplexity and more

ChatGPT, Perplexity, Gemini, Copilot and Grok each pick sources their own way. Here is what every major answer engine rewards, and how to earn a citation.

Being cited is not the same as ranking

When someone asks Perplexity a question, it returns a short answer and lists the handful of sources it leaned on. ChatGPT, when it browses, does the same: a paragraph of synthesis with a few links underneath. Getting your site into that short list is a different job from ranking on Google's page one.

ChatGPT and Perplexity get most of the attention, but they are two of several. Google's Gemini and AI Overviews, Microsoft Copilot, and xAI's Grok all answer questions and all draw on the web, each with its own crawler and its own habits. The shared foundation is good SEO and readable pages. The differences are in how each engine fetches, what it trusts, and whether you can steer it at all.

This piece covers both: the universal moves that work everywhere, then the specifics for each answer engine one by one. If you want the plumbing-level detail on which bot does what, "AI crawlers, explained" is the companion piece.

How these engines pick sources

Every retrieval-based engine works roughly the same way under the hood, whether it is ChatGPT, Perplexity, Gemini, or Copilot. They run your question against a search step, pull back a set of candidate pages, read them, and then write an answer grounded in what they read. The pages they cite are the ones the model could (a) retrieve, (b) read cleanly, and (c) lift a confident, self-contained statement from.

That last point is the one most people miss. A model is not ranking ten links and picking the best. It is looking for a passage it can stand behind and attribute. If your page makes that easy, you get named. If it makes the model work to extract a claim, it quietly pulls from a competitor who made it easy.

So the question to ask of every important page is not "does this rank?" but "can a model lift one clean sentence from this and feel safe citing it?"

The engines, one by one

The mechanics above are shared. The details are not. Here is what is actually distinct about each major engine, and what it means for you.

ChatGPT

ChatGPT does not use one crawler; it uses several, and the difference matters. OAI-SearchBot is the one that powers ChatGPT's search results; allowing it is what makes your pages eligible to be surfaced and cited. GPTBot is for model training, ChatGPT-User handles user-initiated fetches and GPT Actions, and OAI-AdsBot validates ad landing pages. OpenAI's bots documentation lists each one and the controls that apply.

The practical takeaway: if you want to appear in ChatGPT answers, do not blanket-block OpenAI in robots.txt. You can disallow GPTBot if you object to training while still allowing OAI-SearchBot to retrieve and cite you. They are separate decisions. And be careful with ChatGPT-User: because it acts on a person's direct request, robots.txt may not apply to it the way it does to the autonomous crawlers.
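One way to sanity-check that split before you deploy is to run a draft robots.txt through Python's standard-library parser. This is a minimal sketch, not a definitive test: the draft rules, the example.com URL, and the helper name `check_policy` are all assumptions made for illustration, and `urllib.robotparser` uses simple token matching that may differ from how a given crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft policy: opt out of training, stay eligible for search.
DRAFT_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user-agent token, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = check_policy(
        DRAFT_ROBOTS_TXT,
        ["GPTBot", "OAI-SearchBot"],
        "https://example.com/pricing",  # placeholder URL
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Pass the short product tokens (GPTBot, OAI-SearchBot) rather than full user-agent strings, since this parser only looks at the text before the first slash.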

One more thing that trips people up: GPTBot does not run JavaScript. It reads raw HTML. Vercel and MERJ tracked over 500 million GPTBot fetches and saw zero JavaScript execution. If your content only appears after a client-side render, OpenAI's crawlers see an empty shell. Server-render it.
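A rough way to see the "empty shell" problem is to check whether a key fact is present in the raw HTML text, ignoring anything inside script tags. The sketch below uses only Python's standard library; the class name `VisibleText`, the helper `fact_in_raw_html`, and both sample pages are invented for illustration, and it approximates a non-rendering reader rather than reproducing any real crawler.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is present in raw HTML without running scripts."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside script/style/template elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fact_in_raw_html(raw_html: str, fact: str) -> bool:
    """True if `fact` appears in the page text without any JavaScript running."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return fact.lower() in text.lower()


# Invented sample pages: one server-rendered, one assembled by script.
SERVER_RENDERED = (
    "<html><body><main><h1>Refunds</h1>"
    "<p>Refunds are available within 30 days of purchase.</p>"
    "</main></body></html>"
)
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Refunds are available within 30 days of purchase.")</script>'
    "</body></html>"
)
```

Run against the two samples, the fact is found in the server-rendered page and missing from the client-rendered one, even though a browser would show the same sentence on both.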

Perplexity

Perplexity is the most citation-forward of the lot: it shows its sources right under the answer, every time, which makes it the clearest test of whether your effort to be quotable is paying off. It runs two agents, PerplexityBot for indexing and Perplexity-User for fetches triggered by a specific query.

Like GPTBot, PerplexityBot reads raw HTML and does not execute JavaScript. The same Vercel and MERJ study found that the major AI crawlers, Perplexity included, do not render JavaScript. So the contained-answer and real-text advice below is what moves the needle here. Perplexity rewards a page it can read in one pass and lift a clean line from.

Google (Gemini and AI Overviews)

Google is the one most people get wrong, so be precise. AI Overviews and AI Mode are fed by Googlebot and the regular search index, the same one that powers blue links. There is no separate AI crawler to court. Google's own AI optimization guide says the work is simply good, crawlable, helpful SEO.

The confusion centres on Google-Extended. It controls whether your content is used to train Gemini models and to ground Gemini's answers. It does not control AI Overviews. Blocking Google-Extended keeps you out of Gemini training and grounding but leaves you fully eligible for AI Overviews, because those run on the normal index. So if your goal is to be cited in AI Overviews, the move is ordinary SEO, not a robots directive.

Googlebot is also the one major crawler that fully renders JavaScript. So client-side-rendered content can rank fine on Google yet be invisible to ChatGPT, Perplexity, and Claude. Do not let Google's tolerance lull you into shipping JS-only content; the rest of the ecosystem cannot see it.

Microsoft Copilot

Copilot has no crawler of its own. It is powered by Bing, so the relevant bot is bingbot and the relevant index is Bing's. If you have spent years optimising only for Google, this is the gap: pages that are strong on Google but thin on Bing can be missing from Copilot answers entirely.

The fix is unglamorous. Verify your site in Bing Webmaster Tools, make sure Bing is crawling and indexing your key pages, and apply the same readable-text and clean-structure fundamentals. Good Bing coverage is the price of admission to Copilot.

xAI Grok

Grok is the awkward one. xAI documents crawlers like GrokBot, xAI-Grok, and Grok-DeepSearch, but in practice it rarely sends them. Instead it tends to spoof ordinary user-agents (Go-http-client, Chrome, iPhone Safari) and rotate through residential IP addresses.

The consequence is blunt: you cannot reliably use robots.txt user-agent rules to either court Grok or block it, because the requests do not announce themselves as Grok. There is no clean lever here. What you can do is the only thing that works regardless: keep your important content in plain, server-rendered HTML so that any agent arriving as a normal browser can read it. With Grok, discoverability is the whole strategy.

Write contained answers, not buried ones

The single highest-leverage change is to answer the obvious question directly, near the top of the relevant section, in one or two complete sentences.

Lead with the answer. If a page is about your refund window, the sentence "Refunds are available within 30 days of purchase" should appear early and stand on its own. A model can quote that. It cannot quote a paragraph that circles the topic for four sentences before getting there.
Make sentences self-contained. "It takes about two business days" is useless out of context. "Bank transfers clear in about two business days" survives being lifted onto a results page. Assume every sentence might be quoted with nothing around it.
Pose the question as a heading. A heading such as "## How long do refunds take?" followed by a direct answer mirrors how people ask and how engines retrieve. This is the same instinct that makes a good FAQ block work.
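The self-contained test can be roughly automated. The sketch below flags sentences that open with a pronoun pointing back at earlier text, which is usually a sign the sentence will not survive being quoted alone. It is a crude heuristic under stated assumptions: the opener list, the naive sentence splitter, and the function names are all illustrative, and it will miss some cases and over-flag others.

```python
import re

# Openers that usually lean on a previous sentence (assumed, incomplete list).
DANGLING_OPENERS = {"it", "this", "that", "they", "these", "those", "he", "she"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_dependent_sentences(text: str) -> list[str]:
    """Return sentences whose first word is a context-dependent pronoun."""
    flagged = []
    for sentence in split_sentences(text):
        first_word = re.match(r"[A-Za-z]+", sentence)
        if first_word and first_word.group(0).lower() in DANGLING_OPENERS:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    draft = (
        "Bank transfers clear in about two business days. "
        "It takes longer on weekends. "
        "Refunds are available within 30 days of purchase."
    )
    for sentence in flag_dependent_sentences(draft):
        print("Needs a subject:", sentence)
```

Treat each flagged sentence as a prompt to name the subject explicitly, not as an error to fix mechanically.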

You are not writing for a skim-reader anymore. You are writing for something that will extract a fragment and put your name next to it. Contained, factual, standalone sentences win.

The text has to be real and readable

None of this matters if the engine cannot read the page. The failure modes here are the same ones that quietly sink AI visibility everywhere, and they are worth checking directly.

Real text, not images of text. A price inside a screenshot, a key spec baked into a hero graphic, a policy that only exists in a PDF: invisible. If you cannot select it with your cursor in a browser, a model probably cannot read it either.
Content present without heavy JavaScript. Most AI crawlers do not execute JavaScript at all, whereas Googlebot does. If your main content only appears after a client-side fetch, it is likely to be missing when the engine reads the page.
Clean structure around the body. Semantic HTML and a clear heading hierarchy help the model tell your actual answer apart from nav, cookie banners, and footer noise.
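For the heading-hierarchy point, a small script can list a page's headings and flag places where the outline jumps a level, such as an h3 directly after an h1. This is a sketch using Python's standard library; `HeadingOutline` and `find_skipped_levels` are hypothetical names, and a skipped level is only a hint that the structure deserves a look, not proof that an engine will misread the page.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Record (level, text) for every h1-h6 element in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.outline: list[tuple[int, str]] = []
        self._level = 0  # 0 means "not currently inside a heading"
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level and tag == f"h{self._level}":
            text = " ".join("".join(self._buffer).split())
            self.outline.append((self._level, text))
            self._level = 0


def find_skipped_levels(raw_html: str) -> list[tuple[int, int, str]]:
    """Return (previous_level, level, text) wherever the outline jumps a level."""
    parser = HeadingOutline()
    parser.feed(raw_html)
    parser.close()
    problems = []
    previous = 0  # treat the start of the page as level 0
    for level, text in parser.outline:
        if level > previous + 1:
            problems.append((previous, level, text))
        previous = level
    return problems
```

Each tuple in the result names the level the outline came from, the level it jumped to, and the heading text, so the offending heading is easy to find in the page.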

We go deeper on each of these in the guide to making your content discoverable to AI. The short version: discoverability is the floor. You cannot be cited from text that was never parsed.

Authority is what gets you trusted enough to name

Reading your page is necessary but not sufficient. When an engine cites a source, it is staking a small amount of its own credibility on you. It prefers sources it has reason to trust.

You cannot game that, but you can earn it the same way you earn it with people:

Show who wrote it and why they would know. A named author with real expertise, a clear "we tested this" or "here is our data" framing, an about page that establishes who you are. This is plain Experience, Expertise, Authoritativeness, and Trustworthiness, and trust is the part that matters most.
Be the primary source for something. Original data, a real pricing table, a methodology you actually run, a definition you can defend. Engines reach for primary sources over rehashed summaries.
Get mentioned in places that already have authority. Citations cluster. Sites that other trusted sites reference get pulled into answers more often.

What you should not do is chase shortcuts. Do not spin up thin pages for every query variation, do not buy inauthentic mentions, and do not rewrite real content into model-bait. Those tactics get described as scaled content abuse for a reason, and they do not survive the engines getting smarter.

One expectation to set straight: the authority you earn today affects retrieval, not training. The search bots (OAI-SearchBot, Perplexity-User, Googlebot, bingbot) fetch live pages and cite them in real time, so today's work shows up in answers quickly. The training bots (GPTBot, ClaudeBot, Google-Extended) collect data that is frozen at a model's cutoff. Getting cited more next month does not retroactively change what a model was trained on. Aim your effort at retrieval; that is where citations are won.
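To see how your own traffic splits between the two groups, you can tally the user-agent strings in your server logs by role. The sketch below is illustrative: the token-to-role map follows the split described above, the sample user-agent strings are simplified stand-ins rather than exact vendor strings, and the names `BOT_ROLES`, `classify`, and `tally` are assumptions.

```python
from collections import Counter

# Token-to-role map following the retrieval-vs-training split; illustrative
# only, and likely to drift as vendors add or rename bots.
BOT_ROLES = {
    "OAI-SearchBot": "retrieval",
    "Perplexity-User": "retrieval",
    "Googlebot": "retrieval",
    "bingbot": "retrieval",
    "GPTBot": "training",
    "ClaudeBot": "training",
}
# Google-Extended is left out on purpose: it is a robots.txt control token
# and does not appear as a user-agent string in request headers.


def classify(user_agent: str) -> str:
    """Return 'retrieval', 'training', or 'other' for one user-agent string."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in lowered:
            return role
    return "other"


def tally(user_agents: list[str]) -> Counter:
    """Count requests per role across a list of user-agent strings."""
    return Counter(classify(ua) for ua in user_agents)


if __name__ == "__main__":
    # Simplified, made-up samples; real strings are longer.
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.1)",
        "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
        "Mozilla/5.0 (compatible; bingbot/2.0)",
        "Mozilla/5.0 (iPhone) Safari/604.1",
    ]
    print(tally(sample))
```

The "other" bucket contains ordinary visitors as well as any agent that does not announce itself, so read it as unknown rather than human.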

What about llms.txt?

You will hear that an llms.txt file is the key to AI citations. Be precise about this. For Google's AI features, it does nothing: Google has said it does not use it, and you should not create one for Google's benefit. For the wider ecosystem of AI crawlers and agents, llms.txt is an emerging convention that some tools are starting to respect. It is a reasonable, low-cost signal of intent for the non-Google world, not a magic switch, and never a substitute for readable pages and real authority.

Where to start

You do not need a separate AI strategy. You need pages a model can read and a reason to be trusted:

Pick your ten most important pages and open each one. Can you select the key facts with your cursor? Fix anything that is locked in an image.
For each page, find the obvious question it answers and make sure that answer appears early, in one self-contained sentence, under a heading that sounds like the question.
Add a named author and a line about how you know what you are claiming.
Check that your main content is in the HTML, not assembled by script after load.

Do those four things and you are doing most of the job. The deeper playbook, covering both Google and everything else, lives in our guide to getting found by AI search.
