What is an AI Crawler?

An AI crawler is a web crawler operated by a generative AI platform (OpenAI, Anthropic, Google, Perplexity, etc.) for one of three distinct purposes: model training, search indexing, or user-initiated fetching. Unlike traditional search engine crawlers (Googlebot, bingbot), AI crawlers can be allowed or blocked per purpose through user-agent or token entries in robots.txt.

Three-layer separation

OpenAI and Anthropic publish a separate user-agent for each purpose.

Layer	Purpose	When it runs
Training	Collecting model training data	Background crawl
Search Indexing	Building the index that AI retrieves from at answer time	Background crawl
User Fetch	A user asks the AI to fetch a specific URL	Real-time, user-initiated

The point of this split is that training opt-out and visibility maintenance are separable decisions. Blocking only the training bot keeps your content out of the model's training data while still keeping it visible in AI search results.

Comparison across the four platforms

Platform	Training	Search Indexing	User Fetch
OpenAI	GPTBot	OAI-SearchBot	ChatGPT-User
Anthropic	ClaudeBot	Claude-SearchBot	Claude-User
Google	Google-Extended (token)	Googlebot (unchanged)	—
Perplexity	(none — no training)	PerplexityBot	Perplexity-User

Three structural patterns emerge.

OpenAI · Anthropic — symmetric three-layer separation. The most fine-grained opt-out control.
Google — token-based. Google-Extended is a robots.txt token (not a user-agent) that opts out only of Gemini training.
Perplexity — does not train its own LLMs, so there is no training bot — a two-bot structure.
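
The table and the three patterns above can be written down as data and turned into a robots.txt file. The following is a minimal sketch, not an official tool: the dictionary name `AI_CRAWLERS`, the function `build_robots_txt`, and the choice to emit an explicit `Allow: /` group for every non-blocked bot are assumptions made for illustration. The bot names come from the comparison table, and Google-Extended is written on a `User-agent:` line because that is how robots.txt tokens are addressed.

```python
# Bot names copied from the comparison table; None marks a layer with no bot.
AI_CRAWLERS = {
    "OpenAI": {"training": "GPTBot", "search": "OAI-SearchBot", "user_fetch": "ChatGPT-User"},
    "Anthropic": {"training": "ClaudeBot", "search": "Claude-SearchBot", "user_fetch": "Claude-User"},
    "Google": {"training": "Google-Extended", "search": "Googlebot", "user_fetch": None},
    "Perplexity": {"training": None, "search": "PerplexityBot", "user_fetch": "Perplexity-User"},
}


def build_robots_txt(blocked_layers=("training",)):
    """Return robots.txt text that disallows the chosen layers and allows the rest."""
    groups = []
    for bots in AI_CRAWLERS.values():
        for layer, token in bots.items():
            if token is None:
                continue  # this platform has no bot for this layer
            rule = "Disallow: /" if layer in blocked_layers else "Allow: /"
            groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Default call: opt out of training only, keep search and user-fetch open.
robots_txt = build_robots_txt()
print(robots_txt)
```

Calling `build_robots_txt(("training", "search"))` would block both background crawls while still allowing user-initiated fetches, which shows how each layer can be decided separately.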

Role in RanketAI Score

RanketAI evaluates robots.txt access for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended independently within the AI Infra pillar. Blocking any of these means blocking the GEO (Generative Engine Optimization) measurement surface itself, which is why this is the first layer to inspect.
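
A per-bot access check of this kind can be approximated with Python's standard-library robots.txt parser. This is an illustrative sketch only and not RanketAI's actual implementation: the names `AUDITED_TOKENS`, `audit_robots_txt`, and the `SAMPLE` file are hypothetical, and the standard-library parser's matching rules may differ in edge cases from the parser each crawler runs.

```python
from urllib.robotparser import RobotFileParser

# The four tokens the article says are evaluated independently.
AUDITED_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def audit_robots_txt(robots_txt, url="https://example.com/"):
    """Map each audited token to True (allowed) or False (blocked) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in AUDITED_TOKENS}


# Hypothetical robots.txt: GPTBot is blocked, every other agent is allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

report = audit_robots_txt(SAMPLE)
for token, allowed in report.items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

Each token gets its own verdict, so a site that blocks one bot and allows the other three is reported that way instead of as a single pass or fail.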

Frequently Asked Questions

Q. What happens if I block all AI crawlers?

Your content disappears from ChatGPT, Claude, Gemini, and Perplexity search results. Even if you implement all nine GEO strategies validated by Aggarwal et al. (KDD 2024), they will have no effect, because robots.txt is the first gate a crawler must pass.

Q. Can I block only the training bots and allow the search and user-fetch bots?

Yes. OpenAI and Anthropic separate their training, search, and user-fetch bots, so you can Disallow GPTBot and ClaudeBot while leaving OAI-SearchBot, Claude-SearchBot, ChatGPT-User, and Claude-User allowed. That achieves both training opt-out and continued visibility.

Q. Why is Google-Extended a token instead of a user-agent?

Google search indexing is still done by the existing Googlebot. Only use of your content for Gemini training can be opted out of, through the separate robots.txt token Google-Extended. As a result, blocking Google-Extended leaves your Google search exposure intact.

Q. Do AI crawlers always honor robots.txt?

Compliance rests on the voluntary-compliance model of IETF RFC 9309 (the Robots Exclusion Protocol), and the four major platforms (OpenAI, Anthropic, Google, Perplexity) have all publicly committed to honoring it. That said, some data-collection bots may ignore robots.txt, so WAF-level IP verification is recommended as a secondary safeguard.
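
The IP verification step can be sketched as a range check: a request that claims to be a given crawler is trusted only if its source address falls inside a range listed for that crawler. This is a minimal sketch under stated assumptions. `PUBLISHED_RANGES` and `is_verified_crawler` are hypothetical names, the CIDR blocks are reserved documentation ranges rather than real crawler addresses, and it assumes the platform in question publishes its address ranges at all.

```python
import ipaddress

# Placeholder ranges from reserved documentation blocks, NOT real crawler IPs.
# In practice these would be loaded from whatever ranges a platform publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified_crawler(user_agent_token, remote_ip):
    """True only if remote_ip falls in a range listed for the claimed token."""
    address = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(user_agent_token, [])
    return any(address in ipaddress.ip_network(cidr) for cidr in networks)


# A request claiming to be GPTBot from inside the listed range.
print(is_verified_crawler("GPTBot", "192.0.2.17"))
# The same claim from an address outside the listed range.
print(is_verified_crawler("GPTBot", "203.0.113.5"))
```

A token with no listed ranges is treated as unverified here. Whether such traffic should be blocked or only logged is a policy choice for the WAF rule, not something this check decides.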
