How to check if AI crawlers are actually visiting your site

Want to know if ChatGPT, Perplexity, and Claude crawl your site? Read your server logs. Here are the user-agents to grep, how to read the hits, and what nothing means.


You want to know if the AI bots are really reading you

You allowed the crawlers in robots.txt, you wrote the answer-forward content, and now you want proof that ChatGPT, Perplexity, and Claude are actually fetching your pages. Not a vibe. Evidence.

Good news: there is a definitive way to know. Every well-behaved AI crawler identifies itself when it requests a page, and your server writes that down. So the answer is sitting in your access logs right now. Here is how to find it, read it, and what to do when there is nothing there.

The real answer: your server logs

When a bot fetches a URL, your web server records the request, including the user-agent string the client sent. AI crawlers announce themselves with stable tokens. Find those tokens in your logs and you have caught them in the act.

These are the main user-agent tokens worth searching for:

  • GPTBot — OpenAI's training crawler. (What GPTBot is.)
  • OAI-SearchBot — OpenAI's retrieval bot, the one that fetches pages for ChatGPT search. This is the one you most want to see. (What OAI-SearchBot is.)
  • ChatGPT-User — a user-initiated fetch, fired when someone asks ChatGPT to go read a specific page.
  • ClaudeBot — Anthropic's training crawler. (What ClaudeBot is.)
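  • Claude-Web — an older Anthropic user-agent token. Anthropic's current documentation lists Claude-User for user-initiated fetches and Claude-SearchBot for search indexing, so search for those as well.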
  • PerplexityBot — indexes pages for Perplexity answers. (What PerplexityBot is.)
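  • Google-Extended — governs whether Google may use your content for Gemini training and grounding. It is a robots.txt token, not a separate crawler: the fetching is done by Google's existing crawlers under their usual user-agents, so Google-Extended never appears in your logs, only in robots.txt.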
  • Bytespider — ByteDance, the TikTok parent. Aggressive, and known to ignore robots.txt.
  • CCBot — Common Crawl, whose dataset feeds many models downstream.
  • Amazonbot — Amazon's crawler, used for Alexa and other products.
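  • Applebot-Extended — controls whether Apple uses your content for its generative models. Like Google-Extended, it is a robots.txt token rather than a crawler; the requests in your logs come from Applebot.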

The fastest way to check is a single grep over your access log:

grep -iE 'GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-Web|Claude-User|Claude-SearchBot|PerplexityBot|Bytespider|CCBot|Amazonbot|Applebot' access.log

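The -i flag makes the match case-insensitive; -E enables extended regular expressions, so | works as "or" between the tokens. On nginx the file is usually /var/log/nginx/access.log; on Apache, /var/log/apache2/access.log (Debian and Ubuntu), /var/log/httpd/access_log (RHEL and derivatives), or wherever your CustomLog points. Each matching line is one request that identified itself as one of these bots.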

Want a quick tally of who visits most? Pull just the user-agent and count:

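# The second grep is case-sensitive on purpose: several of these user-agents
# also carry a lowercase URL (for example +https://openai.com/gptbot), which
# -i would count as a second hit on the same line.
grep -iE 'GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot' access.log \
  | grep -oE 'GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot' \
  | sort | uniq -c | sort -rn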

That prints something like 42 GPTBot, 9 PerplexityBot, and you immediately know who is paying attention.

How to read what you find

A list of hits is a start. The useful signal is in three things.

Frequency. A handful of fetches a month from GPTBot is normal for a small site. Daily visits from OAI-SearchBot or PerplexityBot mean your content is in active rotation for live answers, which is the outcome you actually want. Retrieval beats training: a training crawl feeds a future model, while a retrieval crawl can cite you today. Being fetched is not the same as being cited, though. If the bots are visiting but your name still isn't in the answers, our article "Why ChatGPT isn't citing your website" covers what sits between the two.
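
To put numbers on frequency, a per-day tally helps. This is a minimal sketch, assuming the combined log format that nginx and Apache commonly use (timestamp in square brackets, user-agent as the last quoted field). hits_per_day is a hypothetical helper name, and the token list is a subset you can extend.

```shell
# hits_per_day LOGFILE
# Prints "YYYY-MM-DD bot count", one line per day and bot, oldest day first.
# Assumption: combined log format. Matching is case-sensitive so the
# lowercase URL inside some user-agents is not mistaken for the token.
hits_per_day() {
  awk -F'"' '
    BEGIN {
      split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= 12; i++) month[names[i]] = sprintf("%02d", i)
    }
    match($6, /OAI-SearchBot|PerplexityBot|ChatGPT-User|GPTBot|ClaudeBot/) {
      bot = substr($6, RSTART, RLENGTH)
      # Field 1 ends with the timestamp, e.g. [12/Mar/2025:06:25:24 +0000]
      if (match($1, /\[[^:]+/)) {
        split(substr($1, RSTART + 1, RLENGTH - 1), d, "/")
        count[d[3] "-" month[d[2]] "-" d[1] " " bot]++
      }
    }
    END { for (key in count) print key, count[key] }
  ' "$1" | LC_ALL=C sort
}
```

Call it as hits_per_day /var/log/nginx/access.log. A bot that shows up on most days is in regular rotation; one that appears twice a month is not.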

Which URLs. Look at the path on each request. Are the bots fetching your money pages, your guides, your product docs? Or only the homepage and sitemap.xml? If they never reach your best content, that content is not earning citations no matter how good it is.
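
To see which pages a given bot actually requests, count the paths. Another sketch under the same combined-log-format assumption; top_paths is a hypothetical helper, and the path is taken as the second word of the quoted request line, so query strings are counted as part of the path.

```shell
# top_paths LOGFILE BOT
# Prints "count path" for every path the given bot requested, busiest first.
# Assumption: combined log format, with the user-agent as the last quoted field.
top_paths() {
  awk -F'"' -v bot="$2" '
    index($6, bot) {
      split($2, req, " ")      # req[1] = method, req[2] = path, req[3] = protocol
      count[req[2]]++
    }
    END { for (path in count) print count[path], path }
  ' "$1" | LC_ALL=C sort -k1,1nr -k2
}
```

For example, top_paths access.log OAI-SearchBot. If the top of that list is only / and /sitemap.xml, the bot is not reaching the pages you care about.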

Status codes. This is the one people skip, and it matters most. The status code is the number after the request in each log line. A 200 means the bot got the page. A 403 or 401 means something blocked it. A 404 means the URL is dead. A run of 200s is healthy. A run of 403s usually means your firewall, CDN bot-protection, or a WAF rule is quietly turning AI crawlers away, and you would never know without reading the code. Plenty of sites "allow" bots in robots.txt and then 403 them at the edge.
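
A quick way to spot edge blocking is to break the hits down by bot and status code. A minimal sketch, again assuming the combined log format, where the status is the first number after the quoted request; status_by_bot is a hypothetical helper name.

```shell
# status_by_bot LOGFILE
# Prints "bot status count", one line per combination, sorted by bot.
# Assumption: combined log format (status code follows the quoted request).
status_by_bot() {
  awk -F'"' '
    match($6, /OAI-SearchBot|PerplexityBot|ChatGPT-User|GPTBot|ClaudeBot/) {
      bot = substr($6, RSTART, RLENGTH)
      split($3, resp, " ")     # resp[1] = status code, resp[2] = bytes sent
      count[bot " " resp[1]]++
    }
    END { for (key in count) print key, count[key] }
  ' "$1" | LC_ALL=C sort
}
```

A line such as "OAI-SearchBot 403 57" next to no 200s for the same bot is the pattern to look for: allowed in robots.txt, refused before it reaches the page.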

If you can't get to the logs

Managed hosting, serverless platforms, and some PaaS setups do not give you raw log files. You still have options.

  • Cloudflare exposes verified-bot and AI-crawler analytics in the dashboard, plus an AI Audit view (since renamed AI Crawl Control) that breaks down crawl activity by operator. If your site is behind Cloudflare, this is the easiest read available.
  • Fastly and most other CDNs offer real-time log streaming or a bot-traffic dashboard; check your provider's analytics or logging tab.
  • A log-drain or observability service (Datadog, Logtail, Papertrail, and similar) lets you ship logs off the server and run the same user-agent search in a query box instead of over SSH.
  • Your platform's own dashboard. Vercel, Netlify, and friends surface request logs you can filter by user-agent, even when there is no file to grep.

The search is identical wherever you run it: filter the user-agent field for the same tokens listed above.

If you see nothing

No AI crawler hits at all is itself a finding. It usually means one of three things.

You blocked them, on purpose or by accident. robots.txt is the gate. A compliant crawler that is disallowed will not appear in your logs beyond the occasional request for robots.txt itself, because it read the rule and stayed out. Check your file for a Disallow: / under any AI user-agent, and check that your CDN or WAF is not 403-ing them before they reach the app. Our robots.txt glossary entry covers the syntax, and if you're nervous that allowing them will cost you anything in Search, our article "Whether blocking GPTBot hurts your SEO" settles that worry.
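
Both checks can be scripted. The sketch below is a simplified reading of robots.txt, not a full parser: disallowed_agents (a hypothetical helper) lists every user-agent whose group contains a bare Disallow: / and ignores wildcards and longer path rules. edge_status (also hypothetical) asks your own site what status it returns to a request carrying a given user-agent.

```shell
# disallowed_agents ROBOTS_FILE
# Prints every user-agent whose group contains a bare "Disallow: /".
# Simplification: only whole-site blocks are reported.
disallowed_agents() {
  awk '
    {
      sub(/#.*/, "")                     # drop comments
      gsub(/^[ \t]+|[ \t\r]+$/, "")      # trim whitespace and stray CRs
    }
    tolower($0) ~ /^user-agent[ \t]*:/ {
      if (in_rules) { n = 0; in_rules = 0 }   # previous group ended
      value = $0; sub(/^[^:]*:[ \t]*/, "", value)
      agents[++n] = value
      next
    }
    tolower($0) ~ /^(allow|disallow)[ \t]*:/ {
      in_rules = 1
      value = $0; sub(/^[^:]*:[ \t]*/, "", value)
      if (tolower($0) ~ /^disallow/ && value == "/")
        for (i = 1; i <= n; i++) print agents[i]
    }
  ' "$1"
}

# edge_status URL USER_AGENT
# Prints the HTTP status returned to a request sent with that user-agent.
edge_status() {
  curl -s -o /dev/null -w '%{http_code}\n' -A "$2" "$1"
}
```

Run disallowed_agents against a downloaded copy of your robots.txt; a * in the output means every crawler without its own group is shut out. Treat edge_status as a hint only: bot protection that verifies crawler IPs may answer your spoofed request differently from how it answers the real bot.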

Your site is new or low-authority. Crawlers prioritise pages that are linked to and talked about. A site nobody references yet gets crawled rarely, if at all. The fix is the slow one: earn mentions and links, and the bots follow.

You are looking at too short a window. AI crawlers do not visit hourly. Search a month of logs, not an afternoon, before concluding you are invisible.
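
Most servers rotate and compress old logs, so a month of history is usually spread across files such as access.log, access.log.1, and access.log.2.gz. A minimal sketch that tallies across all of them, assuming rotated files are either plain text or gzip-compressed; tally_bots is a hypothetical helper name.

```shell
# tally_bots LOGFILE...
# Accepts plain and gzip-compressed log files and prints one combined tally.
# The grep is case-sensitive so lowercase URLs in user-agents are not counted.
tally_bots() {
  for f in "$@"; do
    case "$f" in
      *.gz) gzip -dc "$f" ;;   # decompress to stdout, leave the file alone
      *)    cat "$f" ;;
    esac
  done \
    | grep -oE 'GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|PerplexityBot' \
    | sort | uniq -c | sort -rn
}
```

For example, tally_bots /var/log/nginx/access.log*. If the tally is still empty after a month of files, the absence is real.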

One honest caveat on all of this: some bots lie. A user-agent string is self-reported, and scrapers routinely impersonate GPTBot or ChatGPT-User to slip past blocks, while others (xAI's Grok reportedly among them) send ordinary browser user-agents and never identify as a bot at all. Several operators publish the IP ranges their crawlers use (OpenAI, Perplexity, and Google among them; check each operator's documentation for the current list), so you can verify a hit by checking the source IP against the published ranges. Where the operator supports it, as Google does, a reverse DNS lookup on the IP followed by a forward lookup on the returned hostname works too. If a "GPTBot" request comes from an IP outside OpenAI's published ranges, it is not GPTBot. For the full picture of who runs what, see AI crawlers, explained.
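
Checking an IP against a list of ranges takes a little arithmetic. The sketch below handles IPv4 only and assumes you have already extracted an operator's published ranges into a text file with one A.B.C.D/N entry per line (operators typically publish them in other formats, such as JSON). All three function names are hypothetical.

```shell
# ip_to_int A.B.C.D
# Prints the IPv4 address as a 32-bit integer. Runs in a subshell so the
# IFS change does not leak into the caller.
ip_to_int() (
  IFS=.
  set -- $1
  printf '%s\n' "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# ip_in_cidr IP CIDR
# Exit status 0 if IP falls inside CIDR (for example 192.0.2.0/24).
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # 4294967295 is 0xFFFFFFFF; shifting it left clears the host bits.
  mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
  [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
)

# ip_in_ranges IP RANGES_FILE
# Exit status 0 if IP matches any IPv4 CIDR listed in RANGES_FILE.
ip_in_ranges() (
  while read -r cidr; do
    case "$cidr" in ''|*:*) continue ;; esac   # skip blanks and IPv6 entries
    ip_in_cidr "$1" "$cidr" && exit 0
  done < "$2"
  exit 1
)
```

Usage looks like ip_in_ranges 192.0.2.77 gptbot-ranges.txt && echo verified, where both the address and the file name are placeholders. Take the source IP from the first field of the log line and the ranges from the operator's own documentation.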

Let something watch the logs for you

Reading logs once is useful. Doing it every week, spotting the day OAI-SearchBot starts getting 403s, and tying crawl activity to whether you actually get cited: that is a standing job.

Rankport tracks AI-crawler activity and your mentions across the answer engines, so you see who is visiting, whether they are getting clean 200s, and where you show up in ChatGPT and Perplexity answers, without living in a terminal. Check your AI visibility and skip the grep.
