AI Infrastructure

AI Crawler

Web crawlers operated by generative AI platforms (ChatGPT, Claude, Gemini, Perplexity, etc.) that separate training, search indexing, and user-fetch into distinct layers

Tags: AI Crawler, GPTBot, ClaudeBot, PerplexityBot, Google-Extended, training bot, search bot, RFC 9309

What is an AI Crawler?

An AI crawler is a web crawler operated by a generative AI platform (OpenAI, Anthropic, Google, Perplexity, etc.). These platforms split crawling across three separate purposes: model training, search indexing, and user fetch. Unlike traditional search engine crawlers (Googlebot, bingbot), AI crawlers can be opted out of per purpose, using user-agent or token entries in robots.txt.

Three-layer separation

OpenAI and Anthropic publish a separate user-agent for each purpose.

| Layer | Purpose | When it runs |
|---|---|---|
| Training | Collecting model training data | Background crawl |
| Search Indexing | Building the index that AI retrieves from at answer time | Background crawl |
| User Fetch | A user asks the AI to fetch a specific URL | Real-time, user-initiated |

The point of this split is that training opt-out and visibility maintenance are separable decisions. Blocking only the training bot keeps your content out of the model's training data while still keeping it visible in AI search results.
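As a minimal sketch of that separation, the snippet below pairs a robots.txt policy that blocks only the training bots with a check using Python's standard-library `urllib.robotparser`. The policy text and the `example.com` URL are illustrative assumptions, not a template published by any platform.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy (an assumption, not an official template):
# block the training bots, explicitly allow the search-indexing bots.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/article"  # hypothetical page
AGENTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot", "ChatGPT-User"]

# Map each user-agent token to whether the policy lets it fetch URL.
report = {agent: parser.can_fetch(agent, URL) for agent in AGENTS}

for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Agents that the file does not name (here, ChatGPT-User) fall through as allowed. Note that `urllib.robotparser` applies rules in file order, whereas RFC 9309 specifies that the most specific matching rule wins; for a simple policy like this one the two give the same answer.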

Comparison across the four platforms

| Platform | Training | Search Indexing | User Fetch |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (token) | Googlebot (unchanged) | |
| Perplexity | (none; no training) | PerplexityBot | Perplexity-User |

Three structural patterns emerge.

  • OpenAI and Anthropic: symmetric three-layer separation, which gives the most fine-grained opt-out control.
  • Google: token-based. Google-Extended is a robots.txt token, not a separate crawler user-agent, and it opts content out of Gemini training only.
  • Perplexity: a two-bot structure. Perplexity does not operate a crawler for training its own LLMs, so there is no training bot.
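The comparison table can also be used to label crawler traffic in server logs. The sketch below encodes the table as a lookup and classifies a raw User-Agent header by case-insensitive substring match. The names `CrawlerInfo`, `CRAWLER_TOKENS`, and `classify_user_agent` are hypothetical, and the sample header strings are simplified illustrations rather than the exact strings the platforms send. Google-Extended is omitted because it is a robots.txt token and never appears in a request header.

```python
from typing import NamedTuple, Optional


class CrawlerInfo(NamedTuple):
    platform: str
    layer: str  # "training", "search", or "user-fetch"


# Tokens taken from the comparison table above.
CRAWLER_TOKENS = {
    "GPTBot": CrawlerInfo("OpenAI", "training"),
    "OAI-SearchBot": CrawlerInfo("OpenAI", "search"),
    "ChatGPT-User": CrawlerInfo("OpenAI", "user-fetch"),
    "ClaudeBot": CrawlerInfo("Anthropic", "training"),
    "Claude-SearchBot": CrawlerInfo("Anthropic", "search"),
    "Claude-User": CrawlerInfo("Anthropic", "user-fetch"),
    "Googlebot": CrawlerInfo("Google", "search"),
    "PerplexityBot": CrawlerInfo("Perplexity", "search"),
    "Perplexity-User": CrawlerInfo("Perplexity", "user-fetch"),
}


def classify_user_agent(header: str) -> Optional[CrawlerInfo]:
    """Return the platform and layer for a User-Agent header, or None."""
    lowered = header.lower()
    for token, info in CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return info
    return None


# Simplified, illustrative header strings (assumptions).
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; Claude-SearchBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

A header match only shows what a client claims to be; the string is trivially spoofable, so it should not be treated as proof of identity.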

Role in RanketAI Score

RanketAI evaluates robots.txt access for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended independently within the AI Infra pillar. Blocking any of them blocks the surface on which GEO (Generative Engine Optimization) is measured, which is why this is the first layer to inspect.
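A per-token check of this kind can be approximated with the Python standard library. The sketch below is not RanketAI's actual scoring logic, which is not described here; `audit_ai_access`, `AUDITED_TOKENS`, and the sample robots.txt are hypothetical names and data chosen for illustration.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

# The four tokens named in the paragraph above.
AUDITED_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def audit_ai_access(robots_txt: str, path: str = "/") -> Dict[str, bool]:
    """Report, per token, whether robots.txt permits fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in AUDITED_TOKENS}


# Hypothetical site policy: GPTBot is blocked, everything else is allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = audit_ai_access(SAMPLE_ROBOTS_TXT)
blocked = [token for token, allowed in result.items() if not allowed]
print("Blocked tokens:", blocked)
```

Each token is evaluated on its own, so a site that blocks one of the four still gets a separate pass or fail for the other three.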

Frequently Asked Questions

Q. What happens if I block all AI crawlers?

Crawlers that honor robots.txt stop fetching your pages, and your content disappears from ChatGPT, Claude, Gemini, and Perplexity search results. Even if you implement all nine GEO strategies validated by Aggarwal et al. (KDD 2024), the impact will be zero, because robots.txt is the first gate.

Q. Can I block only the training bots and allow the search and user-fetch bots?

Yes. OpenAI and Anthropic separate their training, search, and user-fetch bots. Disallow GPTBot and ClaudeBot while leaving OAI-SearchBot and Claude-SearchBot allowed, along with ChatGPT-User and Claude-User if user-initiated fetches should keep working. That achieves both training opt-out and continued visibility at the same time.

Q. Why is Google-Extended a token instead of a user-agent?

Google search indexing is still performed by the existing Googlebot. Google-Extended has no crawler of its own: it is a separate robots.txt token that only controls whether your content may be used for Gemini training. As a result, blocking Google-Extended leaves your Google Search exposure intact.

Q. Do AI crawlers always honor robots.txt?

Not necessarily. robots.txt is standardized as the Robots Exclusion Protocol in IETF RFC 9309, which relies on voluntary compliance: the file expresses the site owner's wishes but does not enforce them. The four major platforms (OpenAI, Anthropic, Google, Perplexity) have all publicly committed to compliance. That said, some data-collection bots may ignore it, so WAF-level IP verification is recommended as a secondary safeguard.
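The IP verification step can be sketched as follows: a request that claims a crawler's user-agent is trusted only if its source address falls inside the ranges the operator publishes for that crawler. The CIDR blocks below are reserved documentation ranges (RFC 5737) used as placeholders, not real crawler ranges, and `PUBLISHED_RANGES` and `is_verified_crawler` are hypothetical names. A real deployment would load the current ranges from each operator's published list.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler IPs.
# In practice these would be loaded from each operator's published list.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified_crawler(token: str, remote_ip: str) -> bool:
    """True if remote_ip lies in a range published for the claimed token."""
    address = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(token, [])
    return any(address in ipaddress.ip_network(cidr) for cidr in networks)


# A request claiming to be GPTBot from inside and outside the placeholder range.
print(is_verified_crawler("GPTBot", "192.0.2.10"))
print(is_verified_crawler("GPTBot", "203.0.113.5"))
```

Requests that present a known crawler token but fail this check are candidates for blocking or rate limiting at the WAF, since the user-agent string alone is easy to forge.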
