Don't block AI crawlers wholesale. That's the reflex when your bandwidth graph spikes, but it also cuts you out of ChatGPT and Perplexity search results, and it doesn't touch the crawler that ignores robots.txt anyway. The bots aren't really the problem. The hosting billing model that quietly turns their traffic into your invoice is.
On 3 June 2026, Cloudflare CEO Matthew Prince posted that bots had passed humans as a share of web traffic for the first time: roughly 57% of HTTP requests to HTML pages, against 43% from people. The number is real. What it measures is widely misread, and that misreading is exactly what leads to the wrong fix.
TL;DR
- The ~57% bot figure (Cloudflare, June 2026) is all automated traffic. AI crawlers were only about 4.2% of HTML requests in December 2025, but they're the fastest-growing and most expensive slice.
- The cost lands on you as bandwidth and CPU, and increasingly as billable "visits". Read the Docs ate a $5,000 bandwidth bill in one month from a single crawler.
- Hosts handle bot billing very differently. WP Engine excludes known and suspected bots; Kinsta excludes only what it can confidently identify, so novel AI crawlers can still count against you.
- Blanket-blocking is the wrong reflex. Training crawlers and search crawlers are separate, documented bots. Block GPTBot and you still appear in ChatGPT search; block OAI-SearchBot and you don't.
- robots.txt is a polite request, not a fence. Enforcement happens at the server or edge: rate limits, IP-range rules, and a WAF.
Table of contents
- The 57% headline is real, but it's not measuring what you think
- The real cost is compute and bandwidth, not page views
- Who actually pays: the billing models
- Why blanket-blocking is the wrong reflex
- robots.txt is a polite request, not a fence
- A tiered approach that actually holds
- When blocking everything is the right call
- Key takeaways
The 57% headline is real, but it's not measuring what you think
The crossover happened faster than Cloudflare predicted: Prince had expected it around 2027. But the ~57% is every automated request, and the thing that pushed it past half isn't crawlers indexing pages. It's agentic traffic, AI agents executing multi-page tasks on a user's behalf and visiting far more pages than a person would.
AI crawlers proper are a smaller slice. In Cloudflare's 2025 Year in Review, AI bots were about 4.2% of HTML requests, just under Googlebot's 4.5%. Small, but growing fast and weighted heavily toward one job: training crawls account for roughly 80% of AI bot crawling.
The asymmetry shapes every decision that follows. Cloudflare measured how many times each company crawls your site for every visitor it sends back. In August 2025, Anthropic's ClaudeBot ran at nearly 50,000:1. OpenAI's GPTBot sat at 887:1, Perplexity at 118:1. Training crawlers take everything and return almost nothing. I'm not moralizing about that; it's a cost structure, and it tells you which bots earn their place.
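If you want the same ratio for your own site, the arithmetic is simple enough to run against your logs. This is a minimal sketch, not Cloudflare's methodology: the function name and the monthly counts are made up for illustration, and it assumes you can already count crawler requests and referred visits per provider.

```python
def crawl_to_referral_ratio(crawl_requests: int, referrals: int) -> float:
    """Pages crawled per visitor referred back; infinite if nothing is referred."""
    if referrals == 0:
        return float("inf")
    return crawl_requests / referrals


# Hypothetical monthly counts (crawler requests, referred visits), not measured data.
monthly = {
    "training-bot": (50_000, 1),
    "search-bot": (1_180, 10),
}

ratios = {
    name: crawl_to_referral_ratio(crawls, refs)
    for name, (crawls, refs) in monthly.items()
}

# Highest ratio first: the bots that take the most per visitor they return.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:,.0f}:1")
```

Sorting by the ratio gives you a rough ranking of which bots earn their place and which only cost you.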
The real cost is compute and bandwidth, not page views
A crawler doesn't cost you a "visit". It costs you the bytes it pulls and the PHP workers it ties up generating uncached pages. For anyone running WordPress, that second part is the killer: an aggressive crawler hitting uncached, parameterized URLs drives PHP and database load far out of proportion to its request count.
The numbers from people who measured it are blunt. Read the Docs watched a single crawler pull 73 TB of zipped HTML in one month, nearly 10 TB in one day, for a bandwidth bill over $5,000. After blocking AI crawlers, their traffic dropped 75%. The Wikimedia Foundation reported multimedia bandwidth up 50% since January 2024, with 65% of its most expensive traffic coming from bots that make up only about 35% of pageviews. On his Diaspora instance, Dennis Schubert found 70% of two months of traffic came from LLM bots that returned every six hours regardless of robots.txt.
It gets worse when crawlers stop identifying themselves. Drew DeVault wrote that LLM scrapers were eating anywhere from 20 to 100% of his week, arriving from tens of thousands of residential IPs with user agents that overlap with real browsers, causing dozens of brief outages. GNOME's GitLab got so swamped that of 81,000 requests in one window, only 3.2% were human. Whether that load shows up as a bandwidth overage, a CPU-limit suspension, or just a slow site depends entirely on how your host bills you.
Who actually pays: the billing models
This is where the "tax" gets levied, and hosts split sharply. The catch is the metered "visit", usually defined as one unique IP per 24 hours. Whether a crawler counts as a visit is a policy choice, and most providers make it quietly.
WP Engine has the clearest exclusion. A visit is one unique IP per UTC day, and known bot user agents aren't counted. Since September 2025 they also exclude suspected bots: user agents not on the known list that don't behave like humans. Requests returning 403 and static assets don't count either. That's the strongest documented protection I've found.
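To make the definition concrete, here is a rough sketch of how a meter of that shape could count visits. It is not WP Engine's implementation: the bot markers, the static-asset suffixes and the request tuple layout are all assumptions for illustration, and it models only the "known bot" exclusion, not the behavioral "suspected bot" one.

```python
from datetime import datetime, timezone

# Hypothetical known-bot substrings; a real host maintains a far longer list.
KNOWN_BOT_MARKERS = ("gptbot", "claudebot", "googlebot")
# Hypothetical set of static-asset suffixes.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg", ".woff2")


def count_billable_visits(requests):
    """Count unique (UTC day, IP) pairs, skipping known bots, 403s and static assets.

    Each request is a tuple: (unix_timestamp, ip, user_agent, path, status).
    """
    seen = set()
    for ts, ip, user_agent, path, status in requests:
        ua = user_agent.lower()
        if any(marker in ua for marker in KNOWN_BOT_MARKERS):
            continue  # known bot user agent: not a visit
        if status == 403 or path.lower().endswith(STATIC_SUFFIXES):
            continue  # blocked requests and static assets: not a visit
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        seen.add((day, ip))
    return len(seen)


# Made-up requests, all on the same UTC day, using documentation IP ranges.
sample = [
    (1_750_000_000, "203.0.113.5", "Mozilla/5.0", "/", 200),
    (1_750_000_600, "203.0.113.5", "Mozilla/5.0", "/about", 200),        # same IP, same day
    (1_750_000_900, "198.51.100.7", "GPTBot/1.0", "/", 200),             # known bot
    (1_750_001_000, "198.51.100.9", "NewCrawler/0.1", "/", 200),         # unlisted crawler
    (1_750_001_100, "198.51.100.10", "Mozilla/5.0", "/style.css", 200),  # static asset
    (1_750_001_200, "198.51.100.11", "Mozilla/5.0", "/wp-admin", 403),   # blocked
]

billable = count_billable_visits(sample)
```

In this sample the unlisted crawler is billed as a visit alongside the one human. That is the gap a "known bots only" policy leaves open until the list catches up.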
Kinsta is softer than its reputation suggests. It excludes "most well-known" bots from visit counts, but its own docs are explicit that bot traffic still consumes RAM, CPU and bandwidth and that, "as such, some bots might be included in your total visits count". Under its bot-protection feature, only traffic it can confidently identify as automated is excluded; anything merely "likely automated", which is exactly where a new AI crawler lands before it's on anyone's list, can still count. Kinsta's November 2025 move to bandwidth-based pricing promises "lower risk" of bot overages, but that's not the same as excluding bot bandwidth from the bill.
Pressable and Flywheel exclude "known bots" without spelling out whether specific AI crawlers are on those lists. SiteGround takes a different route entirely and blocks training AI crawlers at the infrastructure level while allowing the ones used in live chat sessions, so much of that traffic never reaches your origin. And classic cPanel shared hosting, the "unlimited bandwidth" tier, hides the cost in CPU and process limits: a hungry crawler doesn't blow a bandwidth cap, it trips resource limits and gets your account throttled or suspended for "excessive resource use". Predictable cost is one of the stronger arguments for managed hosting over a cheap shared plan.
Why blanket-blocking is the wrong reflex
The reflex when you see the bill is one line in robots.txt: Disallow: / for everything AI. It's a mistake, because "AI bot" isn't one thing. Each major provider runs separate, documented crawlers for separate jobs, and they cost and repay you very differently.
OpenAI runs three: GPTBot for training, OAI-SearchBot for its search index, and ChatGPT-User for real-time fetches a user triggered. Their own docs are clear that blocking OAI-SearchBot means your site "will not be shown in ChatGPT search answers", while blocking GPTBot only stops training collection. Anthropic mirrors this with ClaudeBot (training), Claude-User, and Claude-SearchBot. Perplexity splits PerplexityBot (search, respects robots.txt) from Perplexity-User. Google bundles all AI-training control into the Google-Extended token, which it states does not affect Search ranking and has no separate user agent you can even see in logs.
Now combine that with the crawl-to-referral ratios. Blocking a training crawler at 50,000:1 costs you essentially zero referral traffic, because it never sent any. Blocking a search crawler costs you something concrete, because that's the one that puts you in front of users. And the users it sends are unusually valuable: Ahrefs found AI search was just 0.5% of its visitors but 12.1% of its signups, converting at 23 times the rate of traditional organic search. Blanket-blocking throws that away to stop traffic that was already cheap to serve. If AI-search visibility matters to you, that trade-off deserves more than a reflex, and it connects directly to why GEO (generative engine optimization) is still mostly SEO.
robots.txt is a polite request, not a fence
Even when you decide correctly, robots.txt only stops bots that choose to obey it. It's a sign on the lawn, not a fence. The well-behaved crawlers honor it. The ones costing you the most often don't.
Cloudflare alleged in August 2025 that Perplexity used stealth, undeclared crawlers to evade no-crawl directives, generating 3 to 6 million daily requests from a generic Chrome user agent and IPs outside its published ranges, reaching honeypot domains that explicitly disallowed it. Cloudflare de-listed Perplexity from its verified-bots program; Perplexity disputed the analysis. It wasn't the first such report: WIRED found in 2024 that an undisclosed Perplexity-linked IP kept hitting Condé Nast sites it had blocked. TollBit measured bots ignoring robots.txt rising from 3.3% to 12.9% in a single quarter.
Two related controls are weaker still. Crawl-delay is advisory, and Google has never supported it; Bing and Yandex do, but a crawler that ignores it faces no consequence. And llms.txt, the proposed file for steering models at inference time, is currently fiction in practice: an August 2025 audit across 1,000 domains recorded zero visits to llms.txt from GPTBot, ClaudeBot or PerplexityBot, and Google's John Mueller said plainly that no AI system uses it yet. Treat it as a bet on the future, not a control today.
A tiered approach that actually holds
So you need layers, each doing what it can. The principle: use robots.txt to express intent to honest bots, and enforce the rest where bots can't argue.
Start with intent. In robots.txt, disallow the training crawlers you don't want feeding models (GPTBot, ClaudeBot, Google-Extended, CCBot) while explicitly allowing the search crawlers that earn their keep (OAI-SearchBot, Claude-SearchBot, PerplexityBot). That one distinction does most of the work.
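One way to write that policy down and check that it parses the way you expect is sketched below. The robots.txt body uses the crawler tokens named above. The `allowed` helper and the example.com URL are mine. Python's standard-library parser only approximates how each crawler reads the file, so treat this as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers are disallowed; search crawlers are explicitly allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/post") -> bool:
    """Hypothetical helper: would this user agent be allowed to fetch the URL?"""
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot"):
    print(agent, "allowed" if allowed(agent) else "blocked")
```

Running a check like this before you deploy catches the common mistake of a stray wildcard group that blocks the search crawlers along with the training ones.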
Then enforce. At the web server, rate-limit by user agent so even a misbehaving crawler can't saturate your workers:
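# Map AI crawler user agents to a rate-limit key (http context).
# Non-matching requests get an empty key, which limit_req_zone does not count.
map $http_user_agent $ai_limit_key {
    default      "";
    ~*GPTBot     $binary_remote_addr;
    ~*ClaudeBot  $binary_remote_addr;
    ~*Bytespider $binary_remote_addr;
}

# Cap flagged crawlers to a few requests per minute per IP
limit_req_zone $ai_limit_key zone=ai_zone:10m rate=6r/m;

server {
    location / {
        # limit_req is not permitted inside an "if" block; the keyed zone above does the selecting
        limit_req zone=ai_zone burst=4 nodelay;
        limit_req_status 429;
    }
}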
User-agent matching catches honest bots; it won't catch the spoofers. For those, block by published IP range (OpenAI, Anthropic and Perplexity all publish theirs) and lean on a WAF that does the heavy lifting. Cloudflare has offered a one-click AI-bot block on every plan, including free, since July 2024, plus per-provider controls and, since July 2025, a pay-per-crawl model that returns HTTP 402 to crawlers that won't pay. By December 2025 it had blocked 416 billion AI bot requests. Edge enforcement matters because it stops the request before it ever reaches your origin, which is the only point where blocking saves you compute.
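The IP-range check itself is a few lines of logic. Here is a minimal sketch, assuming you have already downloaded each provider's published ranges. The CIDR blocks are placeholders from the reserved documentation ranges, not anyone's real addresses, and the `classify` function and its labels are hypothetical.

```python
import ipaddress

# Placeholder CIDR blocks from the RFC 5737 documentation ranges.
# Replace them with the ranges each provider actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/25"],
}

NETWORKS = {
    name: [ipaddress.ip_network(cidr) for cidr in cidrs]
    for name, cidrs in PUBLISHED_RANGES.items()
}


def classify(user_agent: str, remote_ip: str) -> str:
    """Return 'verified', 'spoofed', or 'unclaimed' for one request."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for name, networks in NETWORKS.items():
        if name.lower() in ua:
            # The request claims to be this bot: does the source IP back that up?
            return "verified" if any(ip in net for net in networks) else "spoofed"
    return "unclaimed"


print(classify("Mozilla/5.0; compatible; GPTBot/1.1", "192.0.2.44"))
print(classify("GPTBot/1.1", "203.0.113.9"))
print(classify("Mozilla/5.0 Chrome/124.0", "203.0.113.9"))
```

This only catches a request that claims a bot's name from the wrong address. A crawler hiding behind a plain browser user agent comes back "unclaimed", which is the case behavioral detection in a WAF exists for.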
When blocking everything is the right call
The selective approach is correct for most content-driven sites, but not all. If you run a brochure site, a portfolio, or anything that earns nothing from search discovery and never will, the calculus flips: there's no referral upside to protect, so default-deny is simpler and cheaper. Block the lot at the edge and move on.
The same goes for private or internal applications, staging environments, and anything behind a login. No crawler, training or search, has any business there, and the AI-search trade-off doesn't exist. For those, Disallow: / plus an edge block is exactly right, and the only mistake would be applying that same blanket rule to a public site you want discovered.
For everything in between, the question isn't "bots or no bots". It's which bots pay their way. A training crawler that takes 50,000 pages per referral is a cost with no return. A search crawler that converts at 23 times organic is worth serving, even at scale. The tax isn't the crawlers. It's paying for the wrong ones, on a plan that bills you for the privilege.
Key takeaways
- The ~57% bot-majority figure is real but is all automated traffic; AI crawlers were ~4.2% of HTML requests in late 2025, growing fast and skewed ~80% toward training.
- Crawler cost is bandwidth and CPU, not page views. Uncached, parameterized URLs make WordPress especially exposed. Read the Docs hit a $5,000 monthly bill from one crawler.
- Hosts differ a lot: WP Engine excludes known and suspected bots; Kinsta excludes only confidently-identified ones, so novel AI crawlers can still count; shared cPanel hides the cost in CPU limits.
- Training and search crawlers are separate, documented bots. Block training (50,000:1, no referral) and keep search (converts at 23x organic). Don't block both by reflex.
- robots.txt, Crawl-delay and llms.txt are requests, not enforcement. Rate-limit by user agent, block by IP range, and use a WAF at the edge for the rest.