AI crawlers, explained: ChatGPT to Grok

A plain-language reference on how OpenAI, Anthropic, Perplexity, Google, Copilot, and Grok crawl your site, and how to let the right bots in to cite you.


Why this matters now

Every AI company that answers questions about the web sends bots to read it. Some of those bots train models. Some fetch live pages to cite in an answer. Some are triggered by a user clicking a link. They behave differently, they obey different rules, and a few barely obey rules at all.

If you do not know which bot does what, you end up either blocking the ones that could be citing you, or leaving the door open to ones you would rather keep out. This is the reference. One company at a time.

First, two distinctions that the rest of this hangs on.

Training vs retrieval. Training crawlers (GPTBot, ClaudeBot, Google-Extended) collect data to train a model. That data is frozen at the model's cutoff. Retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot) fetch pages in real time to answer questions and cite sources. Earning a mention today influences retrieval and citation now. It does not retroactively change what a model already learned.
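To make the split concrete, here is a minimal Python sketch that maps the crawler tokens named in this article to a role. The `BOT_ROLES` table and the `role_of` helper are illustrative names, not part of any vendor's tooling. The substring match is a simplification: it trusts whatever user-agent the client declares.

```python
from typing import Optional

# Roles for crawler tokens named in this article (hypothetical lookup table;
# extend it from each vendor's own documentation).
# Google-Extended is left out on purpose: it is a robots.txt control token
# and does not show up as a user-agent string in server logs.
BOT_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "retrieval",
    "Claude-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "Googlebot": "retrieval",
    "ChatGPT-User": "user-initiated",
    "Claude-User": "user-initiated",
    "Perplexity-User": "user-initiated",
}


def role_of(user_agent: str) -> Optional[str]:
    """Return the role of the first known token found in a user-agent string."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in lowered:
            return role
    return None  # unknown, or a client that does not identify itself


# Example user-agent strings, shortened for illustration.
print(role_of("Mozilla/5.0 (compatible; GPTBot/1.1)"))         # training
print(role_of("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))  # retrieval
print(role_of("Mozilla/5.0 (iPhone) Safari/604.1"))            # None
```

A lookup like this is only useful for reading logs. It says nothing about whether the declared user-agent is genuine.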

JavaScript. Most AI crawlers read raw HTML and do not execute JavaScript. Vercel and MERJ tracked over 500 million GPTBot fetches and saw zero JavaScript execution (see "The rise of the AI crawler"). As a result, content that only appears after client-side rendering can rank fine on Google and still be invisible to ChatGPT and Perplexity. Server-side rendering fixes that. More on this in our piece on JavaScript SEO.
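A rough way to see what a raw-HTML crawler gets is to look for a key phrase in the HTML as served, ignoring anything inside script tags. The sketch below uses only Python's standard library. `visible_without_js` and the two sample pages are hypothetical, and real crawlers may extract text differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_without_js(raw_html: str, phrase: str) -> bool:
    """True if `phrase` is in the text of the HTML as served, before any JavaScript runs."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()


# Made-up pages: one server-rendered, one filled in by client-side script.
server_rendered = "<html><body><h1>Pricing</h1><p>Plans start at $10.</p></body></html>"
client_rendered = (
    "<html><body><div id='root'></div>"
    "<script>document.getElementById('root').innerText = 'Plans start at $10.'</script>"
    "</body></html>"
)

print(visible_without_js(server_rendered, "Plans start at $10."))  # True
print(visible_without_js(client_rendered, "Plans start at $10."))  # False
```

To try it on a live page, fetch the HTML with `urllib.request.urlopen` and pass the decoded body in. The check stays the same.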

OpenAI (ChatGPT)

OpenAI runs several bots, and the difference between them is the whole point.

  • OAI-SearchBot powers ChatGPT search results. If you want to be eligible to appear when ChatGPT cites sources, this is the one to allow. It respects robots.txt.
  • GPTBot is for model training. It also respects robots.txt. Allowing or blocking it has no effect on whether you get cited in live answers, only on whether your content feeds future training.
  • ChatGPT-User handles user-initiated fetches and GPT Actions, triggered when someone asks ChatGPT to go look at a specific page. robots.txt may not apply here.
  • OAI-AdsBot validates ad landing pages.

The full breakdown is in OpenAI's bots documentation. The practical takeaway: allow OAI-SearchBot if you want to show up in ChatGPT answers, and decide on GPTBot separately, based on how you feel about training. The two are independent. For more, see our explainers on what GPTBot is and what OAI-SearchBot is.

Anthropic (Claude)

Anthropic splits its bots the same clean way, and lets you control them independently.

  • ClaudeBot is training. (What ClaudeBot is.)
  • Claude-SearchBot is retrieval, the one that fetches pages so Claude can cite them.
  • Claude-User is user-initiated, triggered by someone in a Claude conversation.

Because retrieval and training are separately controllable, you can allow Claude-SearchBot to be cited while disallowing ClaudeBot from training on you. That is a real choice, not a side effect.
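One way to keep that choice explicit is to write the policy down as data and render robots.txt groups from it. This is a minimal sketch. `render_robots_txt` and the sample policy are hypothetical, and it only handles whole-site allow or disallow.

```python
from typing import Dict


def render_robots_txt(policy: Dict[str, bool]) -> str:
    """Render one robots.txt group per token: True gives 'Allow: /', False gives 'Disallow: /'."""
    groups = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Hypothetical policy: stay citable by Claude, stay out of training.
policy = {"Claude-SearchBot": True, "ClaudeBot": False}
print(render_robots_txt(policy))
```

Because each token gets its own group, flipping one value in the policy changes one bot's access and leaves the other untouched.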

Perplexity

Perplexity keeps it simple with two bots:

  • PerplexityBot indexes pages so they can surface in Perplexity answers. (What PerplexityBot is.)
  • Perplexity-User is the user-initiated fetch.

PerplexityBot reads raw HTML and does not run JavaScript, so the SSR point applies in full.

Google (Search and AI features)

Google is the one people get wrong, so read this carefully.

  • Googlebot feeds the regular search index. It also feeds Google's AI features, AI Overviews and AI Mode, through retrieval-augmented generation. Unlike most of the AI-specific crawlers above, it fully renders JavaScript (Googlebot docs).
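  • Google-Extended is a robots.txt control token rather than a separate crawler with its own user-agent. It controls whether content Google has already crawled is used to train Gemini and for grounding. (What Google-Extended is.)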

Here is the trap: blocking Google-Extended does not remove you from AI Overviews. AI Overviews are built from the normal search index, which Googlebot populates, while Google-Extended only governs Gemini training and grounding. Google states plainly that there is no separate optimization for its AI features, because they draw from the same index as ordinary search (Google's AI optimization guide).

Microsoft Copilot

Copilot is powered by Bing, so the crawler that matters is bingbot. There is no separate Copilot bot to manage. If bingbot can read your site, Copilot can draw on it.

xAI (Grok)

Grok is the exception. xAI documents GrokBot, xAI-Grok, and Grok-DeepSearch, but in practice it rarely sends them. Instead it presents generic or browser-like user-agents, such as Go-http-client, Chrome on desktop, or Safari on an iPhone, and rotates through residential IP addresses.

The consequence is blunt: robots.txt user-agent rules do not reliably control Grok, because Grok does not reliably identify itself as Grok. You can write the rules, and a compliant bot would obey them, but you cannot count on Grok being the bot reading them. Plan around that rather than trusting a directive.

How to allow the right bots

For most sites the right move is to let the search and retrieval bots in, since those are the ones that can cite you, and decide training on your own terms. Here is a copy-pasteable robots.txt that welcomes the retrieval crawlers. (What robots.txt is.)

# Retrieval / search bots — allow these to be eligible for citations
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# Training bots — allow or disallow on your own terms.
# Example below opts OUT of training while keeping retrieval open.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Two caveats come with that file. First, disallowing Google-Extended does not take you out of AI Overviews, because those run off the regular search index that Googlebot builds, not off Gemini grounding. If that surprises you, re-read the Google section. Second, none of this reliably controls Grok, which often does not identify itself and so slips past user-agent rules.
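Before deploying a file like this, it can help to check that it says what you think it says. The sketch below feeds the same rules to Python's standard `urllib.robotparser` and reports which tokens may fetch the site root. The `audit` helper is a hypothetical name, and Python's matcher is only one implementation, so individual crawlers may interpret edge cases differently.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt in this article, comments omitted.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def audit(robots_txt: str, tokens: List[str], path: str = "/") -> Dict[str, bool]:
    """Map each user-agent token to whether the rules let it fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in tokens}


# A client that declares a browser user-agent matches no group at all.
BROWSER_LIKE = "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)"

results = audit(
    ROBOTS_TXT,
    ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot",
     "bingbot", "GPTBot", "ClaudeBot", "Google-Extended", BROWSER_LIKE],
)
for token, allowed in results.items():
    print(f"{token:<55}{'allowed' if allowed else 'blocked'}")
```

The retrieval tokens come back allowed, the three training tokens come back blocked, and the browser-like client comes back allowed because no group names it. That last row is the Grok caveat in miniature: a rule can only apply to a bot that announces itself under the token the rule targets.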

A robots.txt only decides who gets in. It does not make your pages readable once they are inside. Raw-HTML crawlers still need real text, clear headings, and server-rendered content, or there is nothing useful for them to take. That is the rest of the job, and it is what our guide to getting found by AI search walks through end to end.
