AI Crawler
Web crawlers operated by generative AI platforms (ChatGPT, Claude, Gemini, Perplexity, etc.) that separate training, search indexing, and user-fetch into distinct layers
What is an AI Crawler?
An AI crawler is a web crawler operated by a generative AI platform (OpenAI, Anthropic, Google, Perplexity, etc.). It serves one of three separate purposes: model training, search indexing, or user-fetch. Unlike traditional search engine crawlers (Googlebot, bingbot), AI crawlers can be opted out of per purpose, using a separate user-agent or token entry in robots.txt for each.
Three-layer separation
OpenAI and Anthropic publish a separate user-agent for each purpose.
| Layer | Purpose | When it runs |
|---|---|---|
| Training | Collecting model training data | Background crawl |
| Search Indexing | Building the index that AI retrieves from at answer time | Background crawl |
| User Fetch | A user asks the AI to fetch a specific URL | Real-time, user-initiated |
The point of this split is that training opt-out and visibility maintenance are separable decisions. Blocking only the training bot keeps your content out of the model's training data while still keeping it visible in AI search results.
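As a minimal sketch of that split, the snippet below embeds a robots.txt that disallows only the OpenAI and Anthropic training bots, then checks the result with Python's standard `urllib.robotparser`. The helper name `allowed`, the constant `ROBOTS_TXT`, and the `example.com` URL are illustrative assumptions; the bot names are the ones listed in the comparison table in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training, keep search and user-fetch open.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
Allow: /
"""


def allowed(robots_txt: str, agent: str,
            url: str = "https://example.com/article") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-User"):
        print(agent, "allowed" if allowed(ROBOTS_TXT, agent) else "blocked")
```

Checking the file with a parser before deploying it is a cheap way to catch a rule that accidentally blocks the search or user-fetch layer as well.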
Comparison across the four platforms
| Platform | Training | Search Indexing | User Fetch |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (token) | Googlebot (unchanged) | — |
| Perplexity | (none — no training) | PerplexityBot | Perplexity-User |
Three structural patterns emerge.
- OpenAI and Anthropic: symmetric three-layer separation. The most fine-grained opt-out control.
- Google: token-based. Google-Extended is a robots.txt token (not a user-agent) that opts out only of Gemini training.
- Perplexity: two-bot structure. It does not train its own LLMs, so there is no training bot.
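The table above can also be read as a lookup structure, for example when labelling requests in server logs. The following is a rough sketch, not an official parser: `CrawlerInfo`, `CRAWLER_TOKENS`, and `classify` are hypothetical names, and the sample User-Agent strings in the usage lines are made up for illustration. Google-Extended is left out on purpose because it is a robots.txt token and does not appear in request headers.

```python
from typing import NamedTuple, Optional


class CrawlerInfo(NamedTuple):
    platform: str
    layer: str


# Tokens taken from the comparison table; Google-Extended is omitted because
# it is a robots.txt token rather than a request user-agent.
CRAWLER_TOKENS = {
    "GPTBot": CrawlerInfo("OpenAI", "training"),
    "OAI-SearchBot": CrawlerInfo("OpenAI", "search indexing"),
    "ChatGPT-User": CrawlerInfo("OpenAI", "user fetch"),
    "ClaudeBot": CrawlerInfo("Anthropic", "training"),
    "Claude-SearchBot": CrawlerInfo("Anthropic", "search indexing"),
    "Claude-User": CrawlerInfo("Anthropic", "user fetch"),
    "Googlebot": CrawlerInfo("Google", "search indexing"),
    "PerplexityBot": CrawlerInfo("Perplexity", "search indexing"),
    "Perplexity-User": CrawlerInfo("Perplexity", "user fetch"),
}


def classify(user_agent: str) -> Optional[CrawlerInfo]:
    """Map a User-Agent header to (platform, layer), or None if no token matches."""
    lowered = user_agent.lower()
    for token, info in CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return info
    return None


if __name__ == "__main__":
    # Illustrative header values, not the platforms' exact strings.
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"))
    print(classify("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

Because a User-Agent header is self-reported, a match here only says what the client claims to be.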
Role in RanketAI Score
RanketAI evaluates GPTBot, ClaudeBot, PerplexityBot, and Google-Extended robots.txt access independently within the AI Infra pillar. Blocking any of these means blocking the GEO measurement surface itself, which is why this is the first layer to inspect.
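A similar four-token check can be approximated locally. The sketch below is not RanketAI's scoring logic; it only reports, for each of the four tokens named above, whether a given robots.txt text permits access to a path. `AUDITED_TOKENS`, `audit_access`, and `SAMPLE` are hypothetical names chosen for this example.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

# The four tokens named in this section.
AUDITED_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

# Hypothetical robots.txt used only to exercise the function.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def audit_access(robots_txt: str, path: str = "/") -> Dict[str, bool]:
    """Return {token: True/False} for whether each token may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in AUDITED_TOKENS}


if __name__ == "__main__":
    for token, ok in audit_access(SAMPLE).items():
        print(f"{token}: {'allowed' if ok else 'blocked'}")
```

Checking each token separately, rather than returning a single pass/fail, mirrors the idea that each one is an independent decision.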
Frequently Asked Questions
Q. What happens if I block all AI crawlers?
Your content disappears from ChatGPT, Claude, Gemini, and Perplexity search results. Even if you implement all nine GEO strategies validated by Aggarwal et al. KDD 2024, the impact will be zero — robots.txt is the first gate.
Q. Can I block only the training bots and allow the search and user-fetch bots?
Yes. OpenAI and Anthropic separate their training, search, and user-fetch bots, so you can Disallow GPTBot and ClaudeBot while keeping OAI-SearchBot, ChatGPT-User, Claude-SearchBot, and Claude-User allowed. That achieves both training opt-out and continued visibility at the same time.
Q. Why is Google-Extended a token instead of a user-agent?
Google search indexing is still handled by the existing Googlebot. Google-Extended is a separate robots.txt token that controls only whether your content is used for Gemini training. As a result, blocking Google-Extended keeps your Google search exposure intact.
Q. Do AI crawlers always honor robots.txt?
robots.txt rests on the voluntary-compliance model described in IETF RFC 9309, and the four major platforms (OpenAI, Anthropic, Google, Perplexity) have all publicly committed to compliance. That said, some data-collection bots may ignore it, so WAF-level IP verification is recommended as a secondary safeguard.
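The IP verification step might look roughly like the sketch below. It assumes the crawler operator publishes its address ranges and that you have copied them into a local table. The ranges shown are placeholder documentation blocks (RFC 5737), not real crawler addresses, and `PUBLISHED_RANGES` and `is_verified_crawler` are hypothetical names.

```python
import ipaddress
from typing import Dict, List

# PLACEHOLDER ranges from the RFC 5737 documentation blocks. Replace these with
# the ranges each operator actually publishes; these are NOT real crawler IPs.
PUBLISHED_RANGES: Dict[str, List[str]] = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified_crawler(token: str, remote_ip: str,
                        ranges: Dict[str, List[str]] = PUBLISHED_RANGES) -> bool:
    """True only if `remote_ip` falls inside a range listed for `token`."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address: treat as unverified
    return any(address in ipaddress.ip_network(cidr)
               for cidr in ranges.get(token, []))


if __name__ == "__main__":
    print(is_verified_crawler("GPTBot", "192.0.2.10"))   # inside the placeholder range
    print(is_verified_crawler("GPTBot", "203.0.113.5"))  # outside it
```

A request that claims a crawler's user-agent but fails this check can then be handled by whatever rule the WAF applies to unverified bots.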