AI Business, Funding & Market · Author: RanketAI Editorial · Updated: 2026-05-07

RanketAI Guide #05: The Four AI Crawler Policies — GPTBot · ClaudeBot · Google-Extended · PerplexityBot

Building on IETF RFC 9309, the four major AI platforms — OpenAI, Anthropic, Google, and Perplexity — publish bot policies that separate training, search indexing, and user-fetch layers. This guide compares all four and maps them to the RanketAI probe measurement surface in a single frame.

AI-assisted draft · Editorially reviewed

This blog content may use AI tools for drafting and structuring, and is published after editorial review by the RanketAI Editorial Team.

TL;DR: The first gate of GEO measurement is not content quality — it is robots.txt. ChatGPT (three OpenAI bots), Claude (three Anthropic bots), Gemini (the Google-Extended token), and Perplexity (two bots) each separate training, search indexing, and user-fetch into distinct user-agents. This guide compares the four platforms on top of IETF RFC 9309 and maps each policy decision to the visibility surface that the RanketAI probe actually measures.

Why robots.txt is the first GEO gate

In #04 — GEO Academia × Industry × Measurement, academia and industry agreed that earned media (third-party sources) outweighs brand-owned pages (Chen et al. 2025 · Ahrefs 2026). But there is an even more fundamental gate before that — whether AI can include your pages in the training, indexing, or live-fetch surface in the first place.

In the GEO measurement frame, visibility forms through three sequential gates.

  1. Crawl gate — does robots.txt allow the bot through?
  2. Index/train gate — does the bot absorb the content into training data or a search index?
  3. Citation gate — does the model cite that content when answering a user?

The nine strategies in Aggarwal et al. KDD 2024 and the earned-media bias in Chen et al. 2025 are about optimizing stages 2 and 3. Yet none of that matters if stage 1 (the crawl gate) is closed. robots.txt policy is the binary switch that decides whether your GEO measurement surface exists at all. Knowing the four AI platform policies precisely is therefore the starting point for any GEO work.
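Stage 1 can be checked mechanically. A minimal sketch using Python's standard-library robots.txt parser; the bot list and sample rules are illustrative, not an official registry:

```python
# Check the crawl gate (stage 1) for the major AI user-agents against a
# robots.txt body, using only the standard library.
import urllib.robotparser

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "Claude-SearchBot", "Claude-User",
           "Google-Extended", "PerplexityBot", "Perplexity-User"]

def crawl_gate(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {bot: allowed?} for each AI user-agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_BOTS}

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""
gates = crawl_gate(sample)
print(gates["GPTBot"], gates["OAI-SearchBot"], gates["PerplexityBot"])
# → False True True  (unlisted bots default to allowed)
```

Note that `urllib.robotparser` implements the older first-match semantics rather than RFC 9309 longest-match precedence, so treat it as a quick audit tool, not a reference implementation.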

Axis 1 — IETF RFC 9309 (the official robots.txt standard)

The official standard for robots.txt is IETF RFC 9309 (Robots Exclusion Protocol). When the IETF formalized it in September 2022, nearly three decades of de-facto robots.txt practice became a recognized internet standard. Three directives matter:

User-agent: <bot name or *>
Disallow: <blocked path>
Allow: <explicitly permitted path>

Match precedence is longest-match wins; when matching Allow and Disallow rules are equally specific, Allow beats Disallow. All four platform docs explicitly honor this RFC — meaning the "robots.txt does what you wrote" guarantee rests on the voluntary compliance model that RFC 9309 defines.
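The precedence rule can be sketched in a few lines. This is a simplified resolver for a single user-agent group, assuming plain path prefixes (RFC 9309 wildcard and `$` matching are omitted for brevity):

```python
# RFC 9309 match precedence for one user-agent group:
# the longest matching rule wins, and on a tie Allow beats Disallow.

def allowed(rules, path):
    """rules: list of ("allow" | "disallow", path_prefix) pairs."""
    best_len, best_allow = -1, True  # no matching rule => allowed
    for directive, prefix in rules:
        if prefix == "" and directive == "disallow":
            continue  # an empty "Disallow:" blocks nothing
        if path.startswith(prefix):
            is_allow = (directive == "allow")
            # longer match wins; on equal length, Allow wins
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, best_allow = len(prefix), is_allow
    return best_allow

rules = [("disallow", "/"), ("allow", "/blog/")]
print(allowed(rules, "/blog/post-1"))   # → True  (longer Allow wins)
print(allowed(rules, "/private/data"))  # → False (only Disallow: / matches)
```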

No legal force. As stated in RFC 9309 §1.1, the standard is built on voluntary compliance. Some training-data scrapers ignore robots.txt, but OpenAI · Anthropic · Google · Perplexity have all publicly committed to RFC 9309 compliance.

Axis 2 — OpenAI's three-bot split (GPTBot · OAI-SearchBot · ChatGPT-User)

OpenAI's official bots documentation operates three separate user-agents.

| User-agent | Purpose | When it runs |
| --- | --- | --- |
| GPTBot | Foundation model (GPT family) training | Background crawl |
| OAI-SearchBot | ChatGPT search indexing | Background crawl |
| ChatGPT-User | Triggered when a user asks ChatGPT to fetch a specific URL | Real-time, user-initiated |

Block training only, allow search and user-initiated fetches:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

The point of this three-way split is that training opt-out and visibility maintenance are separable decisions. Blocking GPTBot alone keeps your content out of model training data while still allowing OAI-SearchBot to index it for ChatGPT search results — the GEO visibility surface is preserved.

OpenAI also publishes official IP ranges (JSON endpoint) so that WAFs (Cloudflare, etc.) can block user-agent spoofing at the network layer. Of the four platforms, OpenAI offers the most standardized verification infrastructure.
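A WAF-side check against such a range list can be sketched with the standard ipaddress module. The JSON shape and the sample prefixes below are assumptions for illustration only; fetch the real payload from the endpoint linked in OpenAI's bots documentation:

```python
# Verify a client IP against a published CIDR-range list, to catch requests
# that spoof an AI crawler's user-agent from an unlisted network.
import ipaddress

def ip_in_ranges(ip: str, ranges_json: dict) -> bool:
    """True if `ip` falls inside any published CIDR prefix."""
    addr = ipaddress.ip_address(ip)
    for prefix in ranges_json.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr and addr in ipaddress.ip_network(cidr):
            return True
    return False

# Hypothetical payload; real prefixes come from the published endpoint.
sample_ranges = {"prefixes": [{"ipv4Prefix": "203.0.113.0/24"}]}
print(ip_in_ranges("203.0.113.7", sample_ranges))   # → True  (inside block)
print(ip_in_ranges("198.51.100.1", sample_ranges))  # → False (spoofed UA)
```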

Axis 3 — Anthropic's three-bot split (ClaudeBot · Claude-SearchBot · Claude-User)

Anthropic's official help article follows the same three-bot pattern.

| User-agent | Purpose | When it runs |
| --- | --- | --- |
| ClaudeBot | Claude model training | Background crawl |
| Claude-SearchBot | Claude search indexing | Background crawl |
| Claude-User | Triggered when a user asks Claude to fetch a URL | Real-time, user-initiated |

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

The same three-layer split as OpenAI — meaning separating training from search from user-fetch has become an industry-standard pattern. The main difference is that Anthropic's IP-range publication is less standardized than OpenAI's; some operators rely on Cloudflare's AI bot rules as a secondary measure.

Watch for legacy user-agents. Some older robots.txt guides list anthropic-ai or claude-web as the user-agent to block. Those are pre-2024 names from before Anthropic consolidated its crawler identity; the current canonical name is ClaudeBot. For safety, list both for backwards compatibility.
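A backwards-compatible training opt-out that lists the current name alongside the legacy ones might look like:

```
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: claude-web
Disallow: /
```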

Axis 4 — Google-Extended (a token, not a bot)

Google's policy is structurally different from the other three. Google-Extended is a robots.txt token, not a user-agent (Google's announcement · Search Central guide).

# Opt out of Gemini training only — search indexing untouched
User-agent: Google-Extended
Disallow: /

# Googlebot (search) is separate — blocking it removes your search visibility
User-agent: Googlebot
Allow: /

Two key points.

(a) Googlebot itself does not change. Google search indexing is still done by the same crawler (Googlebot), and only the AI-training fetches are opt-outable via the separate token (Google-Extended). Blocking Google-Extended therefore preserves your Google search results.

(b) AI Overviews exposure is a separate system. Google-Extended is a training opt-out only. When AI Overviews retrieves from the Googlebot index to compose answers, that retrieval is governed by different mechanisms. Training opt-out and AI Overviews visibility are separate decisions.

This asymmetry creates a subtle decision burden for GEO work. If you want to be excluded from training but still appear in AI Overviews, a Google-Extended Disallow is enough. But if you want to disappear from AI Overviews too, you need an additional mechanism — a single policy decision will not control both outcomes.
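One such additional mechanism is Google's preview controls. Google's Search Central documentation describes nosnippet and related directives as limiting what Google may display from a page, including in AI features; verify current behavior against that documentation before relying on it. A minimal example:

```html
<!-- The page stays indexed in Google Search, but snippet text is withheld,
     which also limits what AI-generated features can display from it. -->
<meta name="robots" content="nosnippet">
```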

Axis 5 — Perplexity's two-bot structure (PerplexityBot · Perplexity-User)

Perplexity's official documentation uses a two-bot structure.

| User-agent | Purpose | When it runs |
| --- | --- | --- |
| PerplexityBot | Perplexity search indexing | Background crawl |
| Perplexity-User | Real-time fetches when answering user questions | Real-time, user-initiated |

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

Important difference: Perplexity does not run a separate AI-training bot. Perplexity uses external models in a retrieve-and-generate architecture rather than pre-training its own foundation models, so the training opt-out decision simply does not exist in this ecosystem.

For GEO visibility, blocking PerplexityBot means immediate zero results in Perplexity. It is the most immediately costly policy decision among the four — if you want earned-media citations (Chen et al. 2025), blocking PerplexityBot deserves serious thought.

Beyond the big four — Microsoft (bingbot · MicrosoftPreview)

We did not put Microsoft in the "big four" frame, but its operational impact is far from small. According to the Microsoft Bing webmaster guide, Microsoft runs the following bots:

| User-agent | Purpose |
| --- | --- |
| bingbot | Bing search indexing (also serves Copilot · Bing Chat backend) |
| MicrosoftPreview | Link-preview fetches triggered by Copilot and similar features |

Two key points.

(a) No training-vs-search separation. Unlike OpenAI or Anthropic, Microsoft does not publish a dedicated training bot. Copilot and Bing Chat retrieve directly from the general bingbot search index when generating answers — there is no separate training opt-out decision to make.

(b) Wide visibility blast radius. Blocking bingbot removes you from Bing search + Copilot answers + (in some setups) ChatGPT search that uses Bing as its backend, all at once. Microsoft sits outside the "big four" frame, but its GEO impact is not negligible.

User-agent: bingbot
Disallow: /

User-agent: MicrosoftPreview
Disallow: /

The reason this guide keeps the "big four" frame is that the cross-LLM authority mapping showed OpenAI · Anthropic · Google · Perplexity cited consistently as Tier 1 authorities by all four LLMs (ChatGPT · Claude · Gemini · Perplexity), while Microsoft appeared only in some answers — Tier 2 in that mapping. Refer to this section when making Bing- or Copilot-related policy decisions.

The 4-platform × 3-purpose matrix

Compiling the four platforms into a single matrix makes the structural differences obvious.

| Platform | Training | Search indexing | User fetch |
| --- | --- | --- | --- |
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (token) | Googlebot (unchanged) | — |
| Perplexity | (none — no training) | PerplexityBot | Perplexity-User |

Three patterns emerge.

  • OpenAI · Anthropic are symmetric — three-bot separation across training, search, and user-fetch. The most fine-grained opt-out controls.
  • Google is token-based — a single Googlebot crawler plus the Google-Extended robots.txt token. Search and AI-training decisions are decoupled by token, not by user-agent.
  • Perplexity has no training — a two-bot structure where training opt-out is simply not on the table.
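The matrix is small enough to carry as data. A sketch that derives, for example, the training-opt-out block list (user-agent names are the ones published in each platform's bot documentation; the helper itself is illustrative):

```python
# The 4-platform × 3-purpose matrix as data, plus a helper that answers
# "which user-agents (or tokens) do I list for a given purpose?"
MATRIX = {
    "OpenAI":     {"training": "GPTBot", "search": "OAI-SearchBot", "user": "ChatGPT-User"},
    "Anthropic":  {"training": "ClaudeBot", "search": "Claude-SearchBot", "user": "Claude-User"},
    "Google":     {"training": "Google-Extended", "search": "Googlebot", "user": None},
    "Perplexity": {"training": None, "search": "PerplexityBot", "user": "Perplexity-User"},
}

def bots_for(purpose: str) -> list[str]:
    """All published user-agents (or tokens) serving one purpose."""
    return [row[purpose] for row in MATRIX.values() if row[purpose]]

print(bots_for("training"))
# → ['GPTBot', 'ClaudeBot', 'Google-Extended']  (the Pattern A block list)
```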

Citing the four policy URLs above in the subjectOf or knowsAbout fields of your Schema.org Organization markup helps AI systems identify your site as part of the GEO authority cluster — structured data is itself an authority signal.
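A minimal JSON-LD sketch of that suggestion, using policy URLs already cited in this guide (the organization name is a placeholder):

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Co",
  "knowsAbout": [
    "https://platform.openai.com/docs/bots",
    "https://developers.google.com/search/docs/crawling-indexing/google-extended",
    "https://docs.perplexity.ai/guides/bots"
  ]
}
```

Whether knowsAbout or subjectOf fits better depends on your existing markup; both are valid properties on the Organization type.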

Practical robots.txt design — three patterns

Operating-purpose patterns.

Pattern A — Protect (block training, keep visibility)

When you want to keep your content out of model training but stay visible in search and user-initiated answers.

# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Allow search and user-fetch
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /

Suited for media, news, legal, and other content-rights-sensitive areas. Whether blocking training has long-term effects on brand visibility is still an open academic question — Aggarwal et al. KDD 2024 does not directly measure the training-data surface.

Pattern B — Balance (allow all + IP verification)

Visibility-first, with WAF-level IP verification of legitimate bots only. Register the published IP ranges in your WAF to block user-agent spoofing.

User-agent: *
Allow: /

# IP verification handled at the WAF layer, not in robots.txt

Suited for GEO measurement, SaaS, and B2B domains where visibility directly affects revenue. RanketAI itself uses this pattern, because forming the earned-media citation surface is the priority.
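Where no published IP-range list exists, Pattern B's WAF-layer verification usually means forward-confirmed reverse DNS, the method Google and Bing document for verifying Googlebot and bingbot. A sketch with injectable resolvers so the logic runs offline; the hostnames and IPs are illustrative:

```python
# Forward-confirmed reverse DNS: resolve the client IP to a hostname, check
# the hostname suffix, then resolve the hostname forward and require the
# same IP. Resolver functions are injected so the check is testable offline;
# in production they would wrap socket.gethostbyaddr / socket.gethostbyname.

def verify_crawler(ip, allowed_suffixes, reverse_dns, forward_dns):
    try:
        host = reverse_dns(ip)
    except (OSError, KeyError):
        return False  # no PTR record for this IP
    if not any(host.endswith(sfx) for sfx in allowed_suffixes):
        return False  # e.g. genuine Googlebot hosts end in .googlebot.com
    return forward_dns(host) == ip  # forward-confirm to defeat PTR spoofing

# Offline check with fake resolvers.
fake_ptr = {"192.0.2.10": "crawl-192-0-2-10.googlebot.com"}
fake_a = {"crawl-192-0-2-10.googlebot.com": "192.0.2.10"}
print(verify_crawler("192.0.2.10", (".googlebot.com",),
                     fake_ptr.__getitem__, fake_a.__getitem__))  # → True
```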

Pattern C — Open (everything allowed)

No explicit blocking in robots.txt — every bot allowed. Lowest tracking cost, but no protection against content-rights risk.

# No robots.txt, or:
User-agent: *
Disallow:

Suited for general informational sites, blogs, and documentation where content protection is not a primary concern.

Mapping to the RanketAI probe

The four RanketAI probe measurement areas covered in #04 — brand recall · top placement · citation authority · answer quality — are all downstream surfaces of robots.txt policy. When a probe area is weak, the diagnostic priority follows naturally.

| Probe weakness | What to check |
| --- | --- |
| Weak brand recall (your site does not appear in answers) | Possible blocking of GPTBot · PerplexityBot · Googlebot indexing |
| Weak top placement (other sources outrank you) | Earned-media gap, plus a thin training-data surface for your domain |
| Weak citation authority (your domain is not cited as a source) | Insufficient OAI-SearchBot · PerplexityBot · Googlebot indexing |
| Low answer quality (negative or neutral tone) | Content-signal weakness — unrelated to robots.txt |

In other words, the area where your RanketAI probe grade drops also tells you which robots.txt entries to inspect first. The four-platform comparison is not an abstract guideline — it is a diagnostic frame for interpreting measurement results.

Conclusion — three-axis agreement: academia + standards + platforms

GEO policy decisions, like in #04, should be read through a three-axis frame.

  1. Academia — Aggarwal et al. KDD 2024 · Chen et al. 2025 — the measurement frame in which every citation-gate strategy presupposes an open crawl gate.
  2. Standards — IETF RFC 9309 · Schema.org Organization — the formal standards that codify bot policy.
  3. Platforms — OpenAI Bots · Anthropic help · Google-Extended · Perplexity Crawlers — the four publicly published policies.

Where the three axes agree:

  • Three-layer separation (training vs search vs user-fetch) is the four-platform standard (Perplexity uses two layers) — training opt-out and visibility maintenance can be decided independently.
  • Only Google uses a token instead of a user-agent (Google-Extended) — decoupling Googlebot indexing from Gemini training opt-out.
  • Blocking PerplexityBot equals immediate zero visibility — the most immediately costly policy decision among the four.
  • No legal force, but explicit voluntary compliance — the model defined by RFC 9309.

Reviewing robots.txt policy is the first task in any GEO workflow. When your own measurement (the four RanketAI probe area grades) shows weakness, this is the layer to check first. Recommended order: (1) audit your current robots.txt → (2) map it onto the four-platform matrix → (3) cross-check against the weak probe area → (4) pick Pattern A / B / C and apply.

Policies change frequently. The user-agent names, URLs, and policy details in this guide reflect the state as of 2026-05-07. Verify against the four references above before applying any of this in production.

Read more: #01 — Why SEO Alone Is Not Enough in the AI Search Era · #02 — Anatomy of LLM Citation Algorithms · #03 — Korea's AI Visibility Gap · #04 — GEO Academia × Industry × Measurement

Execution Summary

| Item | Practical guideline |
| --- | --- |
| Core topic | RanketAI Guide #05: The Four AI Crawler Policies — GPTBot · ClaudeBot · Google-Extended · PerplexityBot |
| Best fit | Prioritize for AI Business, Funding & Market workflows |
| Primary action | Define a measurable success KPI (cost, time, or quality) before starting any AI initiative |
| Risk check | Validate ROI assumptions with a small pilot before committing the full budget |
| Next step | Establish a quarterly review cadence to track KPI movement and adjust scope |

Frequently Asked Questions

How does the approach described in "RanketAI Guide #05: The Four AI Crawler Policies…" apply to real-world workflows?

Start with an input contract that requires objective, audience, source material, and output format for every request.

Is RanketAI suitable for individual practitioners, or does it require a full team effort?

Teams with repetitive workflows and high quality variance, such as AI Business, Funding & Market, usually see faster gains.

What are the most common mistakes when first adopting RanketAI?

Before rewriting prompts again, verify that context layering and post-generation validation loops are actually enforced.

Data Basis

  • IETF RFC 9309 — Robots Exclusion Protocol (published 2022-09). Formalizes nearly three decades of de-facto robots.txt practice into an internet standard, defining User-agent · Disallow · Allow directives, longest-match precedence, and a voluntary-compliance model that all four AI platforms explicitly honor.
  • OpenAI Bots official documentation (platform.openai.com/docs/bots) — three separate user-agents: GPTBot (foundation model training), OAI-SearchBot (ChatGPT search indexing), and ChatGPT-User (real-time fetches triggered by users), plus a published IP-range JSON for WAF verification.
  • Anthropic crawler help article (support.claude.com 8896518) — three user-agents: ClaudeBot (training), Claude-SearchBot (search indexing), Claude-User (user-triggered fetches). Legacy user-agents anthropic-ai and claude-web are retained for compatibility.
  • Google-Extended announcement (blog.google/technology/ai/an-update-on-web-publisher-controls/) plus Search Central guide (developers.google.com/search/docs/crawling-indexing/google-extended) — a robots.txt token (not a separate user-agent) that opts out of Gemini and future AI training while leaving Googlebot search indexing untouched.
  • PerplexityBot documentation (docs.perplexity.ai/guides/bots) — two user-agents: PerplexityBot (search indexing) and Perplexity-User (user-triggered fetches). No dedicated training crawler because Perplexity does not train its own foundation models.
  • Aggarwal et al. "GEO: Generative Engine Optimization" (Princeton · IIT Delhi · Georgia Tech, KDD 2024, arXiv:2311.09735) — the academic origin of the GEO measurement frame. If the robots.txt crawl gate is closed, all nine validated strategies become moot.
  • Chen · Wang · Chen · Koudas. "How to Dominate AI Search" (2025-09, arXiv:2509.08919) — quantitatively shows AI search is systematically and overwhelmingly biased toward earned media (third-party sources). robots.txt is the entry gate to that earned-media surface.
  • Microsoft Bing webmaster guide (bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0) — bingbot (Bing search + Copilot backend) and MicrosoftPreview (link previews). No dedicated AI training bot; Copilot reuses the general search index.


Related Posts

These related posts are selected to help validate the same decision criteria in different contexts. Read them in order below to broaden comparison perspectives.

RanketAI Guide #04: GEO Academia × Industry × Measurement — Mapping 9 Strategies to User Signals

Aggarwal et al. (KDD 2024) defined nine GEO strategies. Chen et al. (2025) found AI search is biased toward earned media. Similarweb 2026 GenAI Brand Visibility Index and Ahrefs Brand Radar 2026 (75K brands) confirmed authority-over-scale. This guide aligns all three axes into four user-facing measurement areas.

2026-05-04

GEO Playbook — 5 Steps to Win AI Answer Share + Live Test Results (2026)

GEO (Generative Engine Optimization) is the practice of getting your domain cited inside AI answers. This guide covers the 5 core steps, AthenaHQ's +45% answer share live test, and the measure-publish-verify cycle.

2026-05-05

GEO Analysis Tool vs AEO Analysis Tool: Which to Use, When (2026)

GEO and AEO analysis tools measure different surfaces. Compare scope, six tool categories, scenario-based selection, the Coverage × Depth × Locale framework, and where RanketAI fits.

2026-05-05

What Is an AEO Analysis Tool? 6 Signals, 4 KPIs, and a Self-Audit Checklist (2026)

An AEO analysis tool measures the likelihood that ChatGPT, Gemini, and Perplexity will quote your page inside an answer. Learn the definition, the 6 measured signals, 4 core KPIs, and a 7-step self-audit checklist.

2026-04-30

What Is a GEO Analysis Tool? Definition, 5 Signals, and Adoption Guide (2026)

A GEO analysis tool measures how likely ChatGPT, Gemini, and Perplexity are to cite or recommend your site. Learn the definition, the 5 signals it measures, a 4-step adoption workflow, and a selection checklist.

2026-04-29