Crawl Budget — How Google Indexes a Large Site
Google won't crawl your site infinitely. It has a limited "budget" for crawling, and if you waste it on junk URLs, your most important pages get indexed slowly or not at all. Small sites can ignore crawl budget, but for large sites and stores it decides visibility. Drawing on Google's official documentation, this guide explains what crawl budget is, who it affects and how to optimize it.
Google's definition: capacity + demand
Google defines crawl budget as "the set of URLs Googlebot can and wants to crawl". It comes down to two elements:
- Crawl capacity limit: how much Googlebot can crawl without overloading your server. It rises when the site responds fast and reliably; it falls with 5xx errors and slow responses.
- Crawl demand: how much Google *wants* to crawl your pages. It depends on URL popularity, freshness, and how Google perceives your site's "inventory" (the more junk URLs it sees, the more of that demand is spent on pages that don't matter).
Rule: even if the server can take more, low demand means less crawling. Both must be present.
Crawl budget = capacity + demand

Capacity:
- Server speed and stability
- No 5xx errors / timeouts
- Google's resource limits

Demand:
- URL popularity
- Content freshness
- Perceived "inventory" (less junk)

Both must be present: even with a fast server, low demand = less crawling.
Who it really affects (Google's thresholds)
Google's official guide is "advanced" and intended for:
- sites with 1,000,000+ unique pages whose content changes about weekly,
- sites with 10,000+ pages that change daily,
- or sites with a large share of URLs in the "Discovered – currently not indexed" status.
Google stresses these are rough thresholds. Small sites (up to a few thousand pages), especially those indexed the same day they publish, don't need to worry.
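As a rough self-check, the thresholds above can be written down as a small function. This is only a sketch: the function name, its parameters and especially the `share_threshold` default of 0.2 are my assumptions (Google speaks of a "large share" without giving a number), so treat the result as a hint, not a verdict.

```python
def crawl_budget_likely_matters(
    unique_pages: int,
    change_interval_days: float,
    discovered_not_indexed_share: float = 0.0,
    share_threshold: float = 0.2,  # assumption: Google gives no exact figure
) -> bool:
    """Apply the rough thresholds from Google's guide to one site."""
    # 1,000,000+ unique pages whose content changes about weekly
    large_site = unique_pages >= 1_000_000 and change_interval_days <= 7
    # 10,000+ pages that change daily
    fast_changing_site = unique_pages >= 10_000 and change_interval_days <= 1
    # A large share of URLs stuck in "Discovered - currently not indexed"
    backlog = discovered_not_indexed_share >= share_threshold
    return large_site or fast_changing_site or backlog
```

For example, `crawl_budget_likely_matters(3_000, 1)` is false for a small daily-updated blog, while the same site with half of its URLs stuck in "Discovered" would be flagged.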
What wastes crawl budget (Google's list)
| Problem | Effect |
|---|---|
| Faceted navigation and session IDs in URLs | Near-infinite duplicates instead of real pages |
| Duplicate content | Repeatedly crawling the same thing |
| Soft 404 (an empty page returning 200) | Crawling with no value |
| Hacked pages and "infinite spaces" (e.g. calendars) | Googlebot gets stuck |
| Low-quality content and spam | Wasted demand |
| Long redirect chains | Wasted requests |
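The last row, long redirect chains, is one of the easier problems to fix mechanically. Below is a minimal sketch that assumes you can export your redirects as a simple source-to-target mapping (the `redirects` dictionary and the paths in it are hypothetical); it rewrites every source to point straight at its final destination and refuses to continue if it finds a loop.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Point every redirect source directly at its final target (one hop)."""
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until the target is no longer a redirect itself.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


# Hypothetical chain: /old -> /newer -> /newest
chain = {"/old": "/newer", "/newer": "/newest"}
flattened = flatten_redirects(chain)
```

Here both `/old` and `/newer` end up redirecting directly to `/newest`, so Googlebot needs one request instead of two.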
Faceted navigation — enemy number one
In online stores, a single problem accounts for most of the wasted budget: faceted navigation (color, size, price, brand filters). Every filter combination creates a new URL, and there are millions of combinations — Googlebot can drown in an infinite space of addresses that add nothing new.
The strategy depends on each facet's search value:
| Faceted URL type | Example | What to do with it |
|---|---|---|
| Valuable for SEO (there's demand) | /shoes/nike, /shoes/running | Index it — own content, own title, internal link |
| Not valuable, but needed by users | sorting, list/grid view | canonical to the base version; don't link it for bots |
| Junk combinations and parameters | ?color=red&size=42&sort=price | Block in robots.txt or don't generate crawlable links |
| "Empty result" filters | combinations with no products | Don't generate the link; return a sensible status |
The rule: deliberately pick a handful of valuable facets to index, and cut the rest off from crawling. The most common mistake is leaving every combination as a crawlable link — then even a big server can't keep up with indexing your real products.
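The table above can be turned into a simple decision function. What follows is a sketch under stated assumptions, not a drop-in policy: the parameter names in `FILTER_PARAMS` and `PRESENTATION_PARAMS`, the `INDEXABLE_PATHS` whitelist and the returned labels are all illustrative and would need to match your own URL scheme.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumptions for illustration: adjust to your store's real URL scheme.
INDEXABLE_PATHS = {"/shoes/nike", "/shoes/running"}   # facets with search demand
PRESENTATION_PARAMS = {"sort", "view"}                # sorting, list/grid view
FILTER_PARAMS = {"color", "size", "price", "brand"}   # junk filter combinations


def facet_policy(url: str, product_count: Optional[int] = None) -> str:
    """Return a crawl policy label for one faceted URL."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if product_count == 0:
        return "no-link"             # empty result: don't generate the link
    if params & FILTER_PARAMS:
        return "block"               # robots.txt or no crawlable link
    if params & PRESENTATION_PARAMS:
        return "canonical-to-base"   # same content, different presentation
    if not params and parts.path in INDEXABLE_PATHS:
        return "index"               # valuable facet: own content and title
    return "review"                  # not covered by the rules above
```

The order of the checks matters: a URL that mixes filters with sorting, such as `?color=red&size=42&sort=price`, is treated as a junk combination rather than as a harmless sort variant.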
Crawl budget in e-commerce — common traps
Stores are where crawl budget hurts most. The most common sources of waste beyond facets:
- Unavailable / discontinued products. Thousands of "out of stock" pages consume crawl budget. What to do depends on whether the product will return: keep the page (200) if it will, return 404/410 if it's gone for good, or redirect to a successor.
- Category pagination. Deep pages like /category?page=87 rarely have value — make sure you reach the most important products another way and consider limiting depth.
- Product variants as separate URLs (color, size) — consolidate with canonical to the parent product page if they differ only by an attribute.
- Session IDs and tracking parameters in URLs — the classic generator of infinite duplicates.
Cleaning up these four areas frees crawl budget, so Googlebot spends it where you want: on real, revenue-driving product and category pages.
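For the last two traps (variants and session/tracking parameters), the value you put into `rel=canonical` can be computed by stripping the parameters that don't change the content. A minimal sketch follows; the parameter lists are assumptions chosen for illustration, and whether a variant parameter really belongs in `VARIANT_PARAMS` depends on whether the variants differ only by an attribute.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; replace with the ones your platform really emits.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "gclid", "fbclid", "sessionid", "sid"}
VARIANT_PARAMS = {"color", "size"}


def canonical_url(url: str) -> str:
    """Drop tracking, session and variant parameters; keep everything else."""
    parts = urlsplit(url)
    ignored = TRACKING_PARAMS | VARIANT_PARAMS
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    # The fragment is dropped as well: it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With these lists, `https://example.com/p/boot?color=red&utm_source=news&sid=abc` collapses to `https://example.com/p/boot`, while a pagination parameter such as `page=2` is left alone.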
How to optimize (Google's recommendations)
- Tidy your URL inventory. Consolidate duplicates with `rel=canonical`; remove dead pages with 404/410; eliminate soft 404s.
- Block unimportant URLs in `robots.txt` (filters, actions, infinite spaces). Remember: `Disallow` does not remove a page from the index — `noindex` does that (and it in turn requires the page to be crawlable).
- Return 304 for unchanged pages. Googlebot sends an `If-Modified-Since` header; if content is unchanged, reply 304 Not Modified with no body — you save resources and Google crawls more real URLs.
- Keep a clean sitemap with accurate `lastmod` — only canonical, indexable URLs.
- Flatten redirect chains to a single hop.
- Speed up the server — faster, stable responses raise the capacity limit. Mind your Core Web Vitals.
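The 304 recommendation from the list above can be sketched as a framework-independent function. The name `conditional_response` and its return shape are hypothetical; in a real application the same comparison would live in your web framework or CDN configuration. The sketch assumes `last_modified` is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_response(if_modified_since: Optional[str],
                         last_modified: datetime,
                         body: bytes):
    """Return (status, headers, body) for a GET with an optional If-Modified-Since."""
    # HTTP dates have one-second resolution, so compare without microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers, b""  # unchanged: no body is sent
    return 200, headers, body
```

A request carrying a date equal to or later than the page's last change gets an empty 304; anything else, including a malformed header, gets the full 200 response.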
Example `robots.txt`:

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /search

Sitemap: https://yourdomain.com/sitemap.xml
```

Note: the old **crawl-rate slider in Search Console was removed on January 8, 2024**. Google now adjusts the rate automatically (slowing down on 5xx/429 errors and rising response times).
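Before deploying rules like these, it helps to test them against real URLs. The sketch below is a simplified matcher for Google-style patterns (`*` as a wildcard, `$` as an end anchor, everything else a prefix match). It is my own approximation for quick checks: it ignores `Allow` rules and rule precedence, and the helper names are hypothetical.

```python
import re

# The Disallow patterns from the example robots.txt above.
DISALLOW = ["/*?sort=", "/*?filter=", "/cart", "/search"]


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate one robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, disallow=DISALLOW) -> bool:
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in disallow)
```

Running a few URLs through it exposes a gap worth knowing about: `/shoes?sort=price` is blocked, but `/shoes?color=red&sort=price` is not, because `/*?sort=` only matches when `sort` is the first parameter. A broader pattern such as `/*sort=` would catch both, at the cost of also matching any other parameter whose name ends in `sort`. Note also that `/search` is a prefix, so it blocks `/searchable-page` too.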
Sitemaps and lastmod — steering demand
A sitemap isn't just a list of addresses: it's a demand signal and a map of what you want indexed. Keep only canonical, indexable URLs in it (no redirects, 404s, `noindex` or robots-blocked addresses). The `lastmod` element should reflect the real date of a meaningful content change, because Google uses it as a hint about what to refresh. Artificially setting `lastmod` to "today" across the whole sitemap erodes Google's trust in the signal, and the hint stops working. For very large sites, split sitemaps by topic (e.g. per category), because GSC then shows immediately which section indexes worse.
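The filtering rules above can be sketched as a small generator. The shape of the `pages` records (`url`, `canonical`, `status`, `indexable`, `content_changed`) is an assumption about what your CMS or crawler export might provide; only the XML namespace and element names come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[dict]) -> str:
    """Emit a sitemap with only canonical, indexable, 200-status URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        is_clean = (
            page["status"] == 200
            and page["indexable"]                   # no noindex, not robots-blocked
            and page["canonical"] == page["url"]    # self-canonical only
        )
        if not is_clean:
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        # lastmod is the date of the last meaningful content change,
        # not the date the sitemap was generated.
        ET.SubElement(url, "lastmod").text = page["content_changed"].isoformat()
    return ET.tostring(urlset, encoding="unicode")
```

To split by section, you could call `build_sitemap` once per category and list the resulting files in a sitemap index.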
JavaScript and crawl budget — the hidden cost
On large sites, JavaScript rendering can quietly eat the budget. Googlebot crawls in two stages: first it fetches the HTML, then JS rendering (running the page to see script-loaded content) goes into a separate queue and costs far more resources than plain HTML. The budget consequences:
- Content injected only via JS is "more expensive" to crawl and can be indexed with delay — on a large site, a real bottleneck.
- SSR or static generation (SSG) instead of pure client-side rendering hands Googlebot ready HTML and relieves the budget.
- Every unnecessary resource (heavy scripts, redundant files) raises the cost of a single fetch — cleaning up rendering is also cleaning up the budget.
Rule: the more content is visible in raw HTML without executing JS, the more efficiently Google spends your crawl budget. It's the same foundation that improves Core Web Vitals.
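A quick way to apply this rule is to check whether the content that matters is present in the HTML as served, before any script runs. The sketch below assumes you already have the raw response body as a string (for example from a plain HTTP fetch); the function name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def missing_from_raw_html(raw_html: str, must_have: list[str]) -> list[str]:
    """Return the phrases that are NOT present in the un-rendered HTML text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Normalize whitespace and case before searching.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in must_have if phrase.lower() not in text]
```

If the product name comes back as present but the price is reported missing, the price is probably injected client-side and depends on the rendering queue.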
Crawl budget isn't indexing — and isn't ranking
Two common myths. First, crawling ≠ indexing: a page can be crawled and still not make the index ("Crawled – currently not indexed" is usually a quality signal, not a budget one). Page value, not budget alone, decides indexing. Second, crawl budget is not a ranking factor — it's a prerequisite (you must be crawled to rank), but a bigger budget doesn't raise positions.
In short: crawling ≠ indexing ≠ ranking. Crawl budget is a prerequisite, not a ranking factor, and a crawled page can still fail to be indexed.
How to monitor
In Search Console use the Crawl Stats report: number of requests, download size, average response time, plus breakdowns by response code, file type and purpose (Discovery vs Refresh). In the "Pages" report watch two statuses: "Discovered – currently not indexed" (Google knows the URL but hasn't crawled it yet — a classic budget/demand symptom) and "Crawled – currently not indexed" (crawled but not indexed — a quality signal).
Server log analysis — the deepest insight
Search Console shows an aggregated picture; server logs show the truth — every Googlebot request individually. What to look at:
- Hit distribution by section — how much crawl goes to products/categories vs junk (parameters, cart, filters). That's your "crawl waste" indicator.
- Response codes for Googlebot — the share of 404/5xx and redirects; a high ratio is a budget leak.
- Visit frequency of key pages — are your most important products visited regularly or once a quarter.
- Verifying the real Googlebot — via reverse DNS, because many bots impersonate it.
The goal is simple: see where Googlebot wastes time and cut that part off — via `robots.txt`, canonicalization or removing crawlable links.
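The checks above can be prototyped in a few lines before reaching for a dedicated log analyzer. This is a sketch under several assumptions: logs are in the common "combined" access-log format, "junk" means any URL with a query string or under `/cart` and `/search`, and the Google hostname suffixes are the two I rely on for Googlebot itself. Adjust all three to your setup.

```python
import re
import socket
from collections import Counter

# Assumes the "combined" access log format used by default in Apache/nginx.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
JUNK_PREFIXES = ("/cart", "/search")  # assumption: your low-value sections


def section_of(path: str) -> str:
    """Bucket a request path into 'junk' or its first path segment."""
    if "?" in path or path.startswith(JUNK_PREFIXES):
        return "junk"
    return path.strip("/").split("/")[0] or "home"


def googlebot_report(lines) -> dict:
    """Summarize hits claiming to be Googlebot by section and status class."""
    sections, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        sections[section_of(match["path"])] += 1
        statuses[match["status"][0] + "xx"] += 1
    total = sum(sections.values())
    return {
        "hits": total,
        "by_section": dict(sections),
        "by_status": dict(statuses),
        "crawl_waste": sections["junk"] / total if total else 0.0,
    }


def is_google_hostname(hostname: str) -> bool:
    """Check the reverse-DNS name against the assumed Google crawler domains."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))


def verify_googlebot(ip: str) -> bool:
    """Reverse DNS, domain check, then forward DNS back to the same IP (IPv4)."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        if not is_google_hostname(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

The report filters on the user-agent string only, so it includes impersonators; for a trustworthy number, run `verify_googlebot` on the distinct IPs first (it makes live DNS lookups, so cache the results) and drop the ones that fail.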
---
I optimize crawl budget for large sites and stores as part of technical SEO. I teach it in the SEO & GEO course. Get in touch — I'll start by analyzing your crawl stats and server logs.
SEO & GEO specialist and AI engineer from Białystok. 10 years building search visibility for recognized brands and 3 years delivering AI — agents, automation and LLM integrations (Next.js, React, Node.js).