Updated: AI & SEO 15 min

Crawl Budget — How Google Indexes a Large Site

Paweł Wiszniewski
SEO & GEO Specialist · AI Engineer

Google won't crawl your site indefinitely. It has a limited crawling "budget", and if you waste it on junk URLs, your most important pages get indexed slowly or not at all. Small sites can ignore crawl budget, but for large sites and stores it decides visibility. Based on Google's official guide, this article explains what crawl budget is, who it affects and how to optimize it.


Google's definition: capacity + demand

Google defines crawl budget as "the set of URLs Googlebot can and wants to crawl". It comes down to two elements:

  • Crawl capacity limit: how much Googlebot can crawl without overloading your server. It rises when the site responds fast and reliably; it falls with 5xx errors and slow responses.
  • Crawl demand: how much Google *wants* to crawl your pages. It depends on URL popularity, freshness, and how Google perceives your site's "inventory" (the more junk URLs it contains, the more of that demand is wasted on them).

Rule: even if the server can take more, low demand means less crawling. Both must be present.

/// CRAWL BUDGET = CAPACITY + DEMAND

Capacity limit
  • Server speed and stability
  • No 5xx errors / timeouts
  • Google's resource limits
+
Crawl demand
  • URL popularity
  • Content freshness
  • Perceived "inventory" (less junk)

* Both must be present: even with a fast server, low demand = less crawling.

Who it really affects (Google's thresholds)

Google's official guide is "advanced" and intended for:

  • sites with 1,000,000+ unique pages whose content changes about weekly,
  • sites with 10,000+ pages that change daily,
  • or sites with a large share of URLs in the "Discovered – currently not indexed" status.

Google stresses these are rough thresholds. Small sites (up to a few thousand pages), especially those indexed the same day they publish, don't need to worry.

What wastes crawl budget (Google's list)

| Problem | Effect |
| --- | --- |
| Faceted navigation and session IDs in URLs | Near-infinite duplicates instead of real pages |
| Duplicate content | Repeatedly crawling the same thing |
| Soft 404 (an empty page returning 200) | Crawling with no value |
| Hacked pages and "infinite spaces" (e.g. calendars) | Googlebot gets stuck |
| Low-quality content and spam | Wasted demand |
| Long redirect chains | Wasted requests |

Faceted navigation — enemy number one

In online stores, a single problem accounts for most of the wasted budget: faceted navigation (color, size, price, brand filters). Every filter combination creates a new URL, and there are millions of combinations — Googlebot can drown in an infinite space of addresses that add nothing new.

The strategy depends on each facet's search value:

| Faceted URL type | Example | What to do with it |
| --- | --- | --- |
| Valuable for SEO (there's demand) | `/shoes/nike`, `/shoes/running` | Index it: own content, own title, internal link |
| Not valuable, but needed by users | sorting, list/grid view | canonical to the base version; don't link it for bots |
| Junk combinations and parameters | `?color=red&size=42&sort=price` | Block in robots.txt or don't generate crawlable links |
| "Empty result" filters | combinations with no products | Don't generate the link; return a sensible status |

The rule: deliberately pick a handful of valuable facets to index, and cut the rest off from crawling. The most common mistake is leaving every combination as a crawlable link — then even a big server can't keep up with indexing your real products.
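The decision table above can be written down as a small policy function. This is only a sketch: the parameter names, the whitelist of indexable paths and the action labels are hypothetical placeholders, not something Google or any store platform defines.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical policy data: replace with the facets of your own store.
INDEXABLE_PATHS = {"/shoes/nike", "/shoes/running"}   # facets with search demand
PRESENTATION_PARAMS = {"sort", "view"}                # needed by users, not by bots
FILTER_PARAMS = {"color", "size", "price", "brand"}   # junk when combined freely


def facet_action(url: str, product_count: int) -> str:
    """Map a listing URL to one of the actions from the table above."""
    parts = urlsplit(url)
    params = {key for key, _ in parse_qsl(parts.query, keep_blank_values=True)}

    if product_count == 0:
        return "no-link"              # empty result: don't generate the link
    if params & FILTER_PARAMS:
        return "block"                # robots.txt or no crawlable link
    if params & PRESENTATION_PARAMS:
        return "canonical-to-base"    # sorting / view switches
    if parts.path in INDEXABLE_PATHS and not params:
        return "index"                # deliberately chosen facet
    return "review"                   # anything else needs a human decision


print(facet_action("https://example.com/shoes/nike", 120))                          # index
print(facet_action("https://example.com/shoes?sort=price", 300))                    # canonical-to-base
print(facet_action("https://example.com/shoes?color=red&size=42&sort=price", 12))   # block
```

The explicit whitelist is the point of the design: a facet becomes indexable only because someone decided it should be, never by default.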

Crawl budget in e-commerce — common traps

Stores are where crawl budget hurts most. The most common sources of waste beyond facets:

  • Unavailable / discontinued products. Thousands of "out of stock" pages consume crawl requests. The decision depends on whether the product returns: keep the page (200) if it will, return 404/410 if it's gone for good, or redirect to a successor.
  • Category pagination. Deep pages like /category?page=87 rarely have value. Make sure the most important products are reachable another way and consider limiting pagination depth.
  • Product variants as separate URLs (color, size) — consolidate with canonical to the parent product page if they differ only by an attribute.
  • Session IDs and tracking parameters in URLs — the classic generator of infinite duplicates.

Cleaning up these four areas returns budget to Googlebot that lands where you want it: on real, revenue-driving product and category pages.
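For the session-ID and tracking-parameter trap, one common approach is to compute a single canonical form of each URL and use it for `rel=canonical` and internal links. A minimal sketch follows; the parameter names in `STRIP_PARAMS` and `STRIP_PREFIXES` are assumptions, so check what your platform and analytics tools actually append.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of session and tracking parameters; adjust to your own stack.
STRIP_PARAMS = {"sessionid", "sid", "phpsessid", "gclid", "fbclid"}
STRIP_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Drop session/tracking parameters and sort the rest for a stable URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in STRIP_PARAMS
        and not key.lower().startswith(STRIP_PREFIXES)
    ]
    kept.sort()  # same parameters in a different order map to one URL
    # The fragment is dropped as well: it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/shoes?utm_source=news&sid=abc&page=2"))
# https://example.com/shoes?page=2
```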

How to optimize (Google's recommendations)

  • Tidy your URL inventory. Consolidate duplicates with `rel=canonical`; remove dead pages with 404/410; eliminate soft 404s.
  • Block unimportant URLs in `robots.txt` (filters, actions, infinite spaces). Remember: `Disallow` does not remove a page from the index — `noindex` does that (and it in turn requires the page to be crawlable).
  • Return 304 for unchanged pages. Googlebot sends an `If-Modified-Since` header; if content is unchanged, reply 304 Not Modified with no body — you save resources and Google crawls more real URLs.
  • Keep a clean sitemap with accurate `lastmod` — only canonical, indexable URLs.
  • Flatten redirect chains to a single hop.
  • Speed up the server — faster, stable responses raise the capacity limit. Mind your Core Web Vitals.
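The 304 recommendation in the list above comes down to one comparison on the server: is the page's last meaningful change newer than the date in `If-Modified-Since`? Here is a framework-independent sketch. The function name and return shape are hypothetical, and it assumes `last_modified` is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET with an optional If-Modified-Since value.

    Assumes last_modified is an aware datetime in UTC.
    """
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1 s precision
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: ignore it and send the full page
    if since is not None and since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)

    if since is not None and last_modified <= since:
        return 304, headers  # unchanged: reply without a body
    return 200, headers      # changed or unconditional: send the page


page_changed = datetime(2024, 5, 1, 10, 0, 0, tzinfo=timezone.utc)
status, headers = conditional_response("Wed, 01 May 2024 10:00:00 GMT", page_changed)
print(status, headers["Last-Modified"])  # 304 Wed, 01 May 2024 10:00:00 GMT
```

In practice most web servers and frameworks already implement this logic; the important part is feeding them a `Last-Modified` value that reflects a real content change.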
robots.txt — cut off junk paths
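    User-agent: *
    # Sorting and filter parameters, whether first in the query string (?) or later (&)
    Disallow: /*?sort=
    Disallow: /*&sort=
    Disallow: /*?filter=
    Disallow: /*&filter=
    # Cart and internal search results
    Disallow: /cart
    Disallow: /search

    Sitemap: https://yourdomain.com/sitemap.xml

Note: the old **crawl-rate slider in Search Console was removed on January 8, 2024**. Google now adjusts the rate automatically, slowing down on 5xx/429 errors and rising response times.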
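Wildcard rules are easy to get subtly wrong: a pattern like `/*?sort=` only matches when `sort` is the first parameter in the query string. Before deploying, it helps to run sample URLs through the patterns. Python's built-in `urllib.robotparser` does not handle `*` wildcards, so the sketch below uses a small hand-written matcher. It approximates Google-style matching for `Disallow` lines only and ignores `Allow` rules and rule precedence.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL, and everything else is a literal matched from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches the path plus query string."""
    return any(pattern_to_regex(p).search(path_and_query) for p in disallow_patterns)


RULES = ["/*?sort=", "/*?filter=", "/cart", "/search"]

print(is_blocked("/shoes?sort=price", RULES))                            # True
print(is_blocked("/shoes?color=red&sort=price", RULES))                  # False: sort is not first
print(is_blocked("/shoes?color=red&sort=price", RULES + ["/*&sort="]))   # True
print(is_blocked("/shoes/nike", RULES))                                  # False
```

Feeding this a list of real URLs from your logs quickly shows which junk addresses slip through and whether any valuable page is blocked by accident.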

Sitemaps and lastmod — steering demand

A sitemap isn't just a list of addresses: it's a demand signal and a map of what you want indexed. Keep only canonical, indexable URLs in it (no redirects, 404s, `noindex` or robots-blocked addresses). The `lastmod` element should reflect the real date of a meaningful content change, because Google uses it as a hint about what to refresh. Artificially setting `lastmod` to "today" across the whole sitemap erodes trust in the signal, and Google stops relying on it. For very large sites, split sitemaps by topic (e.g. per category), so that Search Console's per-sitemap data shows which section is indexed worse.
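These rules are simple to enforce when the sitemap is generated from your own page data. The sketch below assumes each page is a dict with hypothetical keys `url`, `status`, `indexable`, `canonical` and `changed` (the date of the last meaningful edit). Only the `urlset`, `url`, `loc` and `lastmod` elements come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> str:
    """Emit only canonical, indexable 200 pages, with their real change date."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if page["status"] != 200 or not page["indexable"]:
            continue  # redirects, 404s, noindex and blocked pages stay out
        if page["canonical"] != page["url"]:
            continue  # list the canonical target, not its duplicates
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page["url"]
        # lastmod comes from the content record, never from "today"
        ET.SubElement(entry, "lastmod").text = page["changed"].isoformat()
    return ET.tostring(urlset, encoding="unicode")


pages = [
    {"url": "https://example.com/shoes/nike", "status": 200, "indexable": True,
     "canonical": "https://example.com/shoes/nike", "changed": date(2024, 5, 1)},
    {"url": "https://example.com/shoes?sort=price", "status": 200, "indexable": True,
     "canonical": "https://example.com/shoes", "changed": date(2024, 5, 2)},
]
print(build_sitemap(pages))
```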

JavaScript and crawl budget — the hidden cost

On large sites, JavaScript rendering can quietly eat the budget. Googlebot crawls in two stages: first it fetches the HTML, then JS rendering (running the page to see script-loaded content) goes into a separate queue and costs far more resources than plain HTML. The budget consequences:

  • Content injected only via JS is "more expensive" to crawl and can be indexed with delay — on a large site, a real bottleneck.
  • SSR or static generation (SSG) instead of pure client-side rendering hands Googlebot ready HTML and relieves the budget.
  • Every unnecessary resource (heavy scripts, redundant files) raises the cost of a single fetch — cleaning up rendering is also cleaning up the budget.

Rule: the more content is visible in raw HTML without executing JS, the more efficiently Google spends your crawl budget. It's the same foundation that improves Core Web Vitals.
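A quick way to apply this rule is to test whether the content you care about is present in the HTML as delivered, before any script runs. The sketch below works on an HTML string you have already fetched. The helper name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(html: str, required_phrases: list[str]) -> list[str]:
    """Return the phrases that only appear after JS runs (or not at all)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in text]


raw = ('<html><body><h1>Trail Runner 3</h1><div id="price"></div>'
       '<script>render("129.00 EUR")</script></body></html>')
print(missing_from_raw_html(raw, ["Trail Runner 3", "129.00 EUR"]))  # ['129.00 EUR']
```

Anything this check reports as missing depends on the rendering queue to be seen at all, which makes it a natural candidate for SSR or static generation.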

Crawl budget isn't indexing — and isn't ranking

Two common myths. First, crawling ≠ indexing: a page can be crawled and still not make the index ("Crawled – currently not indexed" is usually a quality signal, not a budget one). Page value, not budget alone, decides indexing. Second, crawl budget is not a ranking factor — it's a prerequisite (you must be crawled to rank), but a bigger budget doesn't raise positions.

/// CRAWLING ≠ INDEXING ≠ RANKING

Crawling
Googlebot fetches the URL
Indexing
Google decides whether to store it (quality decides)
Ranking
The page competes for position

* Crawl budget is a prerequisite, not a ranking factor. A crawled page can still fail to be indexed.

How to monitor

In Search Console use the Crawl Stats report: number of requests, download size, average response time, plus breakdowns by response code, file type and purpose (Discovery vs Refresh). In the "Pages" report watch two statuses: "Discovered – currently not indexed" (Google knows the URL but hasn't crawled it yet — a classic budget/demand symptom) and "Crawled – currently not indexed" (crawled but not indexed — a quality signal).

Server log analysis — the deepest insight

Search Console shows an aggregated picture; server logs show the truth — every Googlebot request individually. What to look at:

  • Hit distribution by section — how much crawl goes to products/categories vs junk (parameters, cart, filters). That's your "crawl waste" indicator.
  • Response codes for Googlebot — the share of 404/5xx and redirects; a high ratio is a budget leak.
  • Visit frequency of key pages — are your most important products visited regularly or once a quarter.
  • Verifying the real Googlebot — via reverse DNS, because many bots impersonate it.
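The reverse DNS check in the last point has two steps: resolve the IP to a hostname, confirm the hostname belongs to Google, then resolve that hostname forward and confirm it returns the same IP. A minimal sketch follows. The accepted suffixes are my assumption based on Google's verification guidance, and the lookups are injectable so the logic can be tested without network access.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip: str, reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse DNS, suffix check, then forward-confirm the hostname."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    # gethostbyname_ex returns IPv4 addresses only; IPv6 would need getaddrinfo.
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False              # the PTR record points somewhere else
        return ip in forward_lookup(host)
    except OSError:                   # lookup failed: treat as unverified
        return False


# Offline demonstration with stubbed lookups (hostnames are illustrative).
print(is_real_googlebot(
    "66.249.66.1",
    reverse_lookup=lambda ip: "crawl-66-249-66-1.googlebot.com",
    forward_lookup=lambda host: ["66.249.66.1"],
))  # True
print(is_real_googlebot(
    "203.0.113.9",
    reverse_lookup=lambda ip: "bot.example.net",
    forward_lookup=lambda host: ["203.0.113.9"],
))  # False
```

DNS lookups are slow, so on real logs it makes sense to verify each distinct IP once and cache the result.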

The goal is simple: see where Googlebot wastes time and cut that part off — via `robots.txt`, canonicalization or removing crawlable links.
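A first pass over the logs needs little more than a regex and two counters. The sketch below assumes the common "combined" access log format and buckets hits by the first path segment. Both are assumptions to adapt to your server and URL structure, and the user-agent match should be combined with IP verification before you trust the numbers.

```python
import re
from collections import Counter

# Assumes the "combined" log format used by default in Apache and nginx.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def section_of(path: str) -> str:
    """Bucket a URL: anything with a query string counts as 'parameters'."""
    if "?" in path:
        return "parameters"
    first_segment = path.strip("/").split("/", 1)[0]
    return first_segment or "home"


def crawl_summary(lines):
    """Count Googlebot hits per section and per status class (2xx, 3xx, ...)."""
    sections, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match["ua"]:
            continue
        sections[section_of(match["path"])] += 1
        statuses[match["status"][0] + "xx"] += 1
    return sections, statuses


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [01/May/2024:10:00:00 +0000] "GET /product/trail-runner-3 HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/May/2024:10:00:02 +0000] "GET /shoes?color=red&sort=price HTTP/1.1" 200 4096 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/May/2024:10:00:05 +0000] "GET /cart HTTP/1.1" 302 0 "-" "{GOOGLEBOT_UA}"',
    '198.51.100.7 - - [01/May/2024:10:00:06 +0000] "GET /product/trail-runner-3 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
sections, statuses = crawl_summary(sample)
print(dict(sections))   # {'product': 1, 'parameters': 1, 'cart': 1}
print(dict(statuses))   # {'2xx': 2, '3xx': 1}
```

The share of hits landing in `parameters`, `cart` and similar buckets is a rough crawl-waste figure you can track before and after each cleanup.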

---

I optimize crawl budget for large sites and stores as part of technical SEO. I teach it in the SEO & GEO course. Get in touch — I'll start by analyzing your crawl stats and server logs.

About the author: Paweł Wiszniewski

SEO & GEO specialist and AI engineer from Białystok. 10 years building search visibility for recognized brands and 3 years delivering AI — agents, automation and LLM integrations (Next.js, React, Node.js).
