RanketAI Guide #05: The Four AI Crawler Policies — GPTBot · ClaudeBot · Google-Extended · PerplexityBot
Building on IETF RFC 9309, the four major AI platforms — OpenAI, Anthropic, Google, and Perplexity — publish bot policies that separate training, search indexing, and user-fetch layers. This guide compares all four and maps them to the RanketAI probe measurement surface in a single frame.
AI-assisted draft · Editorially reviewed. This blog content may use AI tools for drafting and structuring, and is published after editorial review by the RanketAI Editorial Team.
TL;DR: The first gate of GEO measurement is not content quality — it is robots.txt. ChatGPT (three OpenAI bots), Claude (three Anthropic bots), Gemini (the Google-Extended token), and Perplexity (two bots) each separate training, search indexing, and user-fetch into distinct user-agents. This guide compares the four platforms on top of IETF RFC 9309 and maps each policy decision to the visibility surface that the RanketAI probe actually measures.
Why robots.txt is the first GEO gate
In #04 — GEO Academia × Industry × Measurement, academia and industry agreed that earned media (third-party sources) outweighs brand-owned pages (Chen et al. 2025 · Ahrefs 2026). But there is an even more fundamental gate before that — whether AI can include your pages in the training, indexing, or live-fetch surface in the first place.
Through the GEO measurement frame, visibility forms in three sequential stages.
- Crawl gate — does robots.txt allow the bot through?
- Index/train gate — does the bot absorb the content into training data or a search index?
- Citation gate — does the model cite that content when answering a user?
The nine strategies in Aggarwal et al. KDD 2024 and the earned-media bias in Chen et al. 2025 are about optimizing stages 2 and 3. Yet none of that matters if stage 1 (the crawl gate) is closed. robots.txt policy is the binary switch that decides whether your GEO surface exists at all — with the crawl gate closed, every downstream indicator reads zero. Knowing the four AI platform policies precisely is therefore the starting point for any GEO work.
Axis 1 — IETF RFC 9309 (the official robots.txt standard)
The official standard for robots.txt is IETF RFC 9309 (Robots Exclusion Protocol). When the IETF formalized it in September 2022, nearly three decades of de-facto robots.txt practice became a recognized internet standard. Three directives matter:
User-agent: <bot name or *>
Disallow: <blocked path>
Allow: <explicitly permitted path>
Match precedence is longest-match wins; when an Allow and a Disallow rule match with equal specificity, Allow takes precedence. All four platform docs explicitly honor this RFC — meaning the "robots.txt does what you wrote" guarantee rests on the voluntary-compliance model that RFC 9309 defines.
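The precedence rule is easy to get wrong by hand, so here is a minimal sketch of how an RFC 9309 matcher resolves one user-agent group's rules. The rule list is illustrative, and wildcard (`*`) and end-anchor (`$`) handling is omitted:

```python
# Minimal sketch of RFC 9309 rule precedence: pick the longest matching
# path, and let Allow win when specificity is tied.

def is_allowed(rules, path):
    """Return True if `path` is crawlable under RFC 9309 precedence.
    `rules` is a list of (directive, path) pairs from one user-agent group."""
    best_len = -1
    allowed = True  # no matching rule means allowed by default
    for directive, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            if len(rule_path) > best_len:
                best_len = len(rule_path)
                allowed = (directive == "Allow")
            elif len(rule_path) == best_len and directive == "Allow":
                allowed = True  # equal specificity: Allow beats Disallow
    return allowed

# Hypothetical group: block /private/ but carve out the press pages
gptbot_rules = [("Disallow", "/private/"), ("Allow", "/private/press/")]
print(is_allowed(gptbot_rules, "/private/report.html"))   # False
print(is_allowed(gptbot_rules, "/private/press/launch"))  # True
print(is_allowed(gptbot_rules, "/blog/post"))             # True
```

The longer `/private/press/` rule overrides the shorter Disallow, which is exactly the behavior production crawlers are expected to implement.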
⚠ No legal force. As RFC 9309 §1.1 notes, the standard is built on voluntary compliance. Some training-data scrapers ignore robots.txt, but OpenAI · Anthropic · Google · Perplexity have all publicly committed to RFC 9309 compliance.
Axis 2 — OpenAI's three-bot split (GPTBot · OAI-SearchBot · ChatGPT-User)
OpenAI's official bots documentation operates three separate user-agents.
| User-agent | Purpose | When it runs |
|---|---|---|
| GPTBot | Foundation model (GPT family) training | Background crawl |
| OAI-SearchBot | ChatGPT search indexing | Background crawl |
| ChatGPT-User | Triggered when a user asks ChatGPT to fetch a specific URL | Real-time, user-initiated |
Block training only, allow search and user-initiated fetches:
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
The point of this three-way split is that training opt-out and visibility maintenance are separable decisions. Blocking GPTBot alone keeps your content out of model training data while still allowing OAI-SearchBot to index it for ChatGPT search results — the GEO visibility surface is preserved.
OpenAI also publishes official IP ranges (JSON endpoint) so that WAFs (Cloudflare, etc.) can block user-agent spoofing at the network layer. Of the four platforms, OpenAI offers the most standardized verification infrastructure.
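As a sketch of what that WAF-layer check does, the standard-library `ipaddress` module is enough to validate a request IP against a published CIDR list. The ranges below are documentation placeholders, not OpenAI's real prefixes — load the actual list from the platform's published JSON endpoint before relying on this:

```python
# Network-layer bot verification: a user-agent header is trivial to spoof,
# so check the request IP against the platform's published CIDR ranges.
import ipaddress

def is_verified_bot(request_ip, published_cidrs):
    """Return True if request_ip falls inside any published CIDR range."""
    ip = ipaddress.ip_address(request_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in published_cidrs)

# Placeholder ranges (RFC 5737 documentation networks), for illustration only
cidrs = ["192.0.2.0/24", "198.51.100.0/24"]
print(is_verified_bot("192.0.2.17", cidrs))   # True
print(is_verified_bot("203.0.113.5", cidrs))  # False
```

A request claiming to be GPTBot but failing this check can be dropped or rate-limited without touching robots.txt.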
Axis 3 — Anthropic's three-bot split (ClaudeBot · Claude-SearchBot · Claude-User)
Anthropic's official help article follows the same three-bot pattern.
| User-agent | Purpose | When it runs |
|---|---|---|
| ClaudeBot | Claude model training | Background crawl |
| Claude-SearchBot | Claude search indexing | Background crawl |
| Claude-User | Triggered when a user asks Claude to fetch a URL | Real-time, user-initiated |
The equivalent training-only block:
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
The same three-layer split as OpenAI — meaning separating training from search from user-fetch has become an industry-standard pattern. The main difference is that Anthropic's IP-range publication is less standardized than OpenAI's; some operators rely on Cloudflare's AI bot rules as a secondary measure.
⚠ Watch for legacy user-agents. Some older robots.txt guides list anthropic-ai or claude-web as the user-agent to block. Those are pre-2024 names from before Anthropic consolidated its crawler identity; the current canonical name is ClaudeBot. For safety, list both for backwards compatibility.
Axis 4 — Google-Extended (a token, not a bot)
Google's policy is structurally different from the other three. Google-Extended is a robots.txt token, not a user-agent (Google's announcement · Search Central guide).
# Opt out of Gemini training only — search indexing untouched
User-agent: Google-Extended
Disallow: /
# Googlebot (search) is separate — blocking it removes your search visibility
User-agent: Googlebot
Allow: /
Two key points.
(a) Googlebot itself does not change. Google search indexing is still done by the same crawler (Googlebot), and only the AI-training fetches are opt-outable via the separate token (Google-Extended). Blocking Google-Extended therefore preserves your Google search results.
(b) AI Overviews exposure is a separate system. Google-Extended is a training opt-out only. When AI Overviews retrieves from the Googlebot index to compose answers, that retrieval is governed by different mechanisms. Training opt-out and AI Overviews visibility are separate decisions.
This asymmetry creates a subtle decision burden for GEO work. If you want to be excluded from training but still appear in AI Overviews, a Google-Extended Disallow is enough. But if you want to disappear from AI Overviews too, you need an additional mechanism — a single policy decision will not control both outcomes.
Axis 5 — Perplexity's two-bot structure (PerplexityBot · Perplexity-User)
Perplexity's official documentation uses a two-bot structure.
| User-agent | Purpose | When it runs |
|---|---|---|
| PerplexityBot | Perplexity search indexing | Background crawl |
| Perplexity-User | Real-time fetches when answering user questions | Real-time, user-initiated |
To block Perplexity entirely (both search indexing and user-fetch):
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
Important difference: Perplexity does not run a separate AI-training bot. Perplexity uses external models with a retrieve-and-generate architecture rather than training its own LLMs, so the training opt-out decision simply does not exist in this ecosystem.
For GEO visibility, blocking PerplexityBot means immediate zero results in Perplexity. It is the most immediately costly policy decision among the four — if you want earned-media citations (Chen et al. 2025), blocking PerplexityBot deserves serious thought.
Sidebar — Microsoft / BingBot (no dedicated AI bot)
We did not put Microsoft in the "big four" frame, but its operational impact is far from small. According to the Microsoft Bing webmaster guide, Microsoft runs the following bots:
| User-agent | Purpose |
|---|---|
| bingbot | Bing search indexing (also serves Copilot · Bing Chat backend) |
| MicrosoftPreview | Link-preview fetches triggered by Copilot and similar features |
Two key points.
(a) No training-vs-search separation. Unlike OpenAI or Anthropic, Microsoft does not publish a dedicated training bot. Copilot and Bing Chat retrieve directly from the general bingbot search index when generating answers — there is no separate training opt-out decision to make.
(b) Wide visibility blast radius. Blocking bingbot removes you from Bing search + Copilot answers + (in some setups) ChatGPT search that uses Bing as its backend, all at once. Microsoft sits outside the "big four" frame, but its GEO impact is not negligible.
To opt out of both:
User-agent: bingbot
Disallow: /
User-agent: MicrosoftPreview
Disallow: /
The reason this guide keeps the "big four" frame is that the cross-LLM authority mapping showed OpenAI · Anthropic · Google · Perplexity cited consistently as Tier 1 authorities by all four LLMs (ChatGPT · Claude · Gemini · Perplexity), while Microsoft appeared only in some answers — Tier 2 in that mapping. Refer to this section when making Bing- or Copilot-related policy decisions.
The 4-platform × 3-purpose matrix
Compiling the four platforms into a single matrix makes the structural differences obvious.
| Platform | Training | Search indexing | User fetch |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (token) | Googlebot (unchanged) | — |
| Perplexity | (none — no training) | PerplexityBot | Perplexity-User |
Three patterns emerge.
- OpenAI · Anthropic are symmetric — three-bot separation across training, search, and user-fetch. The most fine-grained opt-out controls.
- Google is token-based — a single Googlebot crawler plus the Google-Extended robots.txt token. Search and AI-training decisions are decoupled by token, not by user-agent.
- Perplexity has no training — a two-bot structure where training opt-out is simply not on the table.
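To make the matrix operational, it can be encoded as plain data and used to generate a "block training only" policy file. The bot names come from the platform docs cited above; the generator itself is an illustrative sketch, not a complete policy tool:

```python
# Encode the 4x3 matrix as data; None means the platform has no bot
# (or token) for that purpose.
BOTS = {
    "OpenAI":     {"training": "GPTBot",          "search": "OAI-SearchBot",    "user": "ChatGPT-User"},
    "Anthropic":  {"training": "ClaudeBot",       "search": "Claude-SearchBot", "user": "Claude-User"},
    "Google":     {"training": "Google-Extended", "search": "Googlebot",        "user": None},
    "Perplexity": {"training": None,              "search": "PerplexityBot",    "user": "Perplexity-User"},
}

def block_training_only():
    """Disallow every training agent/token, explicitly allow the rest."""
    groups = []
    for bots in BOTS.values():
        for purpose, agent in bots.items():
            if agent is None:
                continue
            rule = "Disallow: /" if purpose == "training" else "Allow: /"
            groups.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(groups)

print(block_training_only())
```

Keeping the matrix as data means a policy change (say, a renamed user-agent) is a one-line edit rather than a hand-audit of the whole file.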
Citing the four policy URLs above in the subjectOf or knowsAbout field of your Schema.org Organization markup helps your site be identified as part of the GEO authority cluster — structured data itself is an authority signal.
Practical robots.txt design — three patterns
Three patterns, organized by operating purpose.
Pattern A — Protect (block training, keep visibility)
When you want to keep your content out of model training but stay visible in search and user-initiated answers.
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search and user-fetch
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
Suited for media, news, legal, and other content-rights-sensitive areas. Whether blocking training has long-term effects on brand visibility is still an open academic question — Aggarwal et al. KDD 2024 does not directly measure the training-data surface.
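Before deploying a pattern, it is worth verifying that the file answers the way you intend. Python's standard-library robots.txt parser is sufficient for a quick self-check; a trimmed two-bot file is used here for illustration:

```python
# Pattern A self-check: parse the robots.txt and confirm each bot
# gets the intended answer before deploying.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "/blog/post"))         # False: training blocked
print(rp.can_fetch("OAI-SearchBot", "/blog/post"))  # True: search indexing kept
```

Running the same check against your full Pattern A file, one user-agent at a time, catches ordering and typo mistakes that are easy to miss by eye.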
Pattern B — Balance (allow all + IP verification)
Visibility-first, with WAF-level IP verification of legitimate bots only. Register the published IP ranges in your WAF to block user-agent spoofing.
User-agent: *
Allow: /
# IP verification handled at the WAF layer, not in robots.txt
Suited for GEO measurement tools, SaaS, and other B2B domains where visibility directly affects revenue. RanketAI itself uses this pattern, because forming the earned-media citation surface is the priority.
Pattern C — Open (everything allowed)
No explicit blocking in robots.txt — every bot allowed. Lowest tracking cost, but no protection against content-rights risk.
# No robots.txt, or:
User-agent: *
Disallow:
Suited for general informational sites, blogs, and documentation where content protection is not a primary concern.
Mapping to the RanketAI probe
The four RanketAI probe measurement areas covered in #04 — brand recall · top placement · citation authority · answer quality — are all downstream surfaces of robots.txt policy. When a probe area is weak, the diagnostic priority follows naturally.
| Probe weakness | What to check |
|---|---|
| Weak brand recall (your site does not appear in answers) | Possible blocking of GPTBot · PerplexityBot · Googlebot indexing |
| Weak top placement (other sources outrank you) | Earned-media gap, plus a thin training-data surface for your domain |
| Weak citation authority (your domain is not cited as a source) | Insufficient OAI-SearchBot · PerplexityBot · Googlebot indexing |
| Low answer quality (negative or neutral tone) | Content-signal weakness — unrelated to robots.txt |
In other words, the area where your RanketAI probe grade drops also tells you which robots.txt entries to inspect first. The four-platform comparison is not an abstract guideline — it is a diagnostic frame for interpreting measurement results.
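One low-effort way to ground that diagnosis is to check which AI crawlers are actually reaching the site. The sketch below counts known bot tokens in access-log user-agent strings; the log lines are fabricated examples, not real crawler traffic:

```python
# Count AI-crawler hits by scanning user-agent strings for known bot tokens.
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "Claude-SearchBot", "Claude-User",
           "PerplexityBot", "Perplexity-User", "Googlebot", "bingbot"]

def count_bot_hits(user_agents):
    """Tally which known AI bots appear in a list of UA strings."""
    hits = Counter()
    for ua in user_agents:
        for bot in AI_BOTS:
            if bot.lower() in ua.lower():
                hits[bot] += 1
    return hits

# Fabricated log lines for illustration
log_uas = [
    "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0; compatible; GPTBot/1.2",
]
print(count_bot_hits(log_uas))
```

If a bot you believe is allowed never appears in the logs, the problem may sit upstream of robots.txt (WAF rules, CDN bot management) rather than in the file itself.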
Conclusion — three-axis agreement: academia + standards + platforms
GEO policy decisions, like in #04, should be read through a three-axis frame.
- Academia — Aggarwal et al. KDD 2024 · Chen et al. 2025 — quantify the citation-gate dynamics that sit downstream of robots.txt, making the crawl gate the precondition for every measured strategy.
- Standards — IETF RFC 9309 · Schema.org Organization — the official standards that codify bot policy.
- Platforms — OpenAI Bots · Anthropic help · Google-Extended · Perplexity Crawlers — the four publicly published policies.
Where the three axes agree:
- Three-layer separation (training vs search vs user-fetch) is the emerging platform standard — OpenAI and Anthropic implement it fully, Perplexity drops the training layer — so training opt-out and visibility maintenance can be decided independently.
- Only Google uses a token instead of a user-agent (Google-Extended) — decoupling Googlebot indexing from Gemini training opt-out.
- Blocking PerplexityBot equals immediate zero visibility — the most immediately costly policy decision among the four.
- No legal force, but explicit voluntary compliance — the model defined by RFC 9309.
Reviewing robots.txt policy is the first task in any GEO workflow. When your own measurement (the four RanketAI probe area grades) shows weakness, this is the layer to check first. Recommended order: (1) audit your current robots.txt → (2) map it onto the four-platform matrix → (3) cross-check against the weak probe area → (4) pick Pattern A / B / C and apply.
⚠ Policies change frequently. The user-agent names, URLs, and policy details in this guide reflect the state as of 2026-05-07. Verify against the four references above before applying any of this in production.
Read more: #01 — Why SEO Alone Is Not Enough in the AI Search Era · #02 — Anatomy of LLM Citation Algorithms · #03 — Korea's AI Visibility Gap · #04 — GEO Academia × Industry × Measurement
Data Basis
- IETF RFC 9309 — Robots Exclusion Protocol (published 2022-09). Formalizes nearly three decades of de-facto robots.txt practice into an internet standard, defining User-agent · Disallow · Allow directives, longest-match precedence, and a voluntary-compliance model that all four AI platforms explicitly honor.
- OpenAI Bots official documentation (platform.openai.com/docs/bots) — three separate user-agents: GPTBot (foundation model training), OAI-SearchBot (ChatGPT search indexing), and ChatGPT-User (real-time fetches triggered by users), plus a published IP-range JSON for WAF verification.
- Anthropic crawler help article (support.claude.com 8896518) — three user-agents: ClaudeBot (training), Claude-SearchBot (search indexing), Claude-User (user-triggered fetches). Legacy user-agents anthropic-ai and claude-web are retained for compatibility.
- Google-Extended announcement (blog.google/technology/ai/an-update-on-web-publisher-controls/) plus Search Central guide (developers.google.com/search/docs/crawling-indexing/google-extended) — a robots.txt token (not a separate user-agent) that opts out of Gemini and future AI training while leaving Googlebot search indexing untouched.
- PerplexityBot documentation (docs.perplexity.ai/guides/bots) — two user-agents: PerplexityBot (search indexing) and Perplexity-User (user-triggered fetches). No dedicated training crawler because Perplexity does not train its own foundation models.
- Aggarwal et al. "GEO: Generative Engine Optimization" (Princeton · IIT Delhi · Georgia Tech, KDD 2024, arXiv:2311.09735) — the academic origin of the GEO measurement frame. If the robots.txt crawl gate is closed, all nine validated strategies become moot.
- Chen · Wang · Chen · Koudas. "How to Dominate AI Search" (2025-09, arXiv:2509.08919) — quantitatively shows AI search is systematically and overwhelmingly biased toward earned media (third-party sources). robots.txt is the entry gate to that earned-media surface.
- Microsoft Bing webmaster guide (bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0) — bingbot (Bing search + Copilot backend) and MicrosoftPreview (link previews). No dedicated AI training bot; Copilot reuses the general search index.
Key Claims and Sources
This section maps key claims to their supporting sources one by one for fast verification. Review each claim together with its original reference link below.
Claim: IETF RFC 9309 (2022-09) is the official standard for robots.txt
Source: IETF RFC 9309
Claim: OpenAI separates training, search, and user-fetch into GPTBot, OAI-SearchBot, and ChatGPT-User
Source: OpenAI Bots official documentation
Claim: Anthropic separates training, search, and user-fetch into ClaudeBot, Claude-SearchBot, and Claude-User
Source: Anthropic crawler help article
Claim: Google-Extended is a robots.txt token (not a user-agent) that controls Gemini training opt-out only
Source: Google blog — web publisher controls
Claim: Perplexity uses a two-bot structure (PerplexityBot for indexing, Perplexity-User for user requests) with no training crawler
Source: Perplexity Crawlers documentation
Claim: GEO measurement starts at robots.txt — once crawling or indexing is blocked, the nine strategies (Aggarwal et al.) cannot help
Source: Aggarwal et al. KDD 2024 (arXiv:2311.09735)
Claim: AI search is systematically biased toward earned media — robots.txt is the entry gate to that surface
Source: Chen et al. 2025 (arXiv:2509.08919)
External References
The links below are original sources directly used for the claims and numbers in this post. Checking source context reduces interpretation gaps and speeds up re-validation.
- IETF RFC 9309 — Robots Exclusion Protocol
- OpenAI — Overview of OpenAI Crawlers
- Anthropic — Does Anthropic crawl data from the web?
- Google — An update on web publisher controls (Google-Extended)
- Google Search Central — Google-Extended user agent token
- Perplexity — Perplexity Crawlers
- Microsoft Bing — Which crawlers does Bing use?
- Aggarwal et al. — GEO: Generative Engine Optimization (KDD 2024)
- Chen et al. — How to Dominate AI Search (2025)
- Schema.org Organization
Related Posts
These related posts are selected to help validate the same decision criteria in different contexts. Read them in order below to broaden comparison perspectives.
RanketAI Guide #04: GEO Academia × Industry × Measurement — Mapping 9 Strategies to User Signals
Aggarwal et al. (KDD 2024) defined nine GEO strategies. Chen et al. (2025) found AI search is biased toward earned media. Similarweb 2026 GenAI Brand Visibility Index and Ahrefs Brand Radar 2026 (75K brands) confirmed authority-over-scale. This guide aligns all three axes into four user-facing measurement areas.
GEO Playbook — 5 Steps to Win AI Answer Share + Live Test Results (2026)
GEO (Generative Engine Optimization) is the practice of getting your domain cited inside AI answers. This guide covers the 5 core steps, AthenaHQ's +45% answer share live test, and the measure-publish-verify cycle.
GEO Analysis Tool vs AEO Analysis Tool: Which to Use, When (2026)
GEO and AEO analysis tools measure different surfaces. Compare scope, six tool categories, scenario-based selection, the Coverage × Depth × Locale framework, and where RanketAI fits.
What Is an AEO Analysis Tool? 6 Signals, 4 KPIs, and a Self-Audit Checklist (2026)
An AEO analysis tool measures the likelihood that ChatGPT, Gemini, and Perplexity will quote your page inside an answer. Learn the definition, the 6 measured signals, 4 core KPIs, and a 7-step self-audit checklist.
What Is a GEO Analysis Tool? Definition, 5 Signals, and Adoption Guide (2026)
A GEO analysis tool measures how likely ChatGPT, Gemini, and Perplexity are to cite or recommend your site. Learn the definition, the 5 signals it measures, a 4-step adoption workflow, and a selection checklist.