Does crawl budget apply to my small site?

Usually not. Per Google, the guide is aimed at sites with 1M+ pages (content changing weekly) or 10k+ pages (changing daily), or sites with a large share of URLs in the "Discovered – currently not indexed" status. Small sites indexed the same day they publish don't need to worry.

Does a robots.txt block remove a page from Google?

No. `Disallow` blocks crawling, but the page can still appear in results (without a snippet) if others link to it. To remove from the index, use `noindex` — which requires the page to be crawlable (so don't combine it with a robots.txt block).

Is crawl budget a ranking factor?

No. It's a prerequisite — you must be crawled to be indexed and ranked — but a bigger budget by itself doesn't raise positions. Crawling is also not the same as indexing.

What wastes crawl budget the most?

Per Google, mainly faceted navigation and session IDs in URLs, duplicate content, soft 404s, infinite spaces (e.g. calendars), low-quality content and long redirect chains.

Does JavaScript affect crawl budget?

Yes, especially on large sites. Googlebot fetches HTML first, and JS rendering goes into a separate, more expensive queue. Content injected only via JS is costlier to crawl and can be indexed with delay. The fix: server-side rendering (SSR) or static generation (SSG), which hand Googlebot ready HTML and relieve the budget.

How does crawl budget relate to AI visibility?

Indirectly, but the effect is real. AI engines (especially ChatGPT and Copilot via the Bing index) can only cite what's been crawled and indexed. If you waste budget on junk URLs and important pages wait a long time for indexing, they're also unavailable as a source for AI. A clean URL inventory and a fast server serve both SEO and AI visibility.

How do I tame crawl budget in a store with faceted navigation?

Deliberately pick a handful of valuable facets (those with real demand, e.g. "running shoes", "Nike") and keep them indexable with their own title and internal linking. Cut the rest — sorting, views, filter combinations, tracking parameters — off from crawling: canonical to the base version, a robots.txt block, or simply not generating crawlable links. The most common mistake is leaving every combination as a link, so Googlebot can't keep up with your real products.

What should I do about crawl budget for out-of-stock products?

It depends on whether the product returns. If it will — keep the page (200), perhaps with an unavailability note and alternatives. If it's gone for good — 404 or 410, or a 301 to the closest successor. Thousands of "out of stock" pages left without a decision consume budget that should go to available products.

Does the sitemap affect crawl budget?

Indirectly — it's a demand signal and a map of what you want indexed. Keep only canonical, indexable addresses in it (no 404s, redirects, noindex), and set `lastmod` to the real date of a content change. A clean, topically split sitemap helps Googlebot prioritize and makes it easy to diagnose which section indexes worse.

2026-06-30 · Updated: 2026-06-30 · AI & SEO · 15 min

Crawl Budget — How Google Indexes a Large Site

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Google won't crawl your site without limit. It has a limited "budget" for crawling, and if you waste it on junk URLs, your most important pages get indexed slowly or not at all. Crawl budget is a topic small sites can ignore, but it decides the visibility of large sites and stores. This guide explains, based on Google's official guide, what it is, who it affects and how to optimize it.

Google's definition: capacity + demand

Google defines crawl budget as "the set of URLs Googlebot can and wants to crawl". It comes down to two elements:

Crawl capacity limit: how much Googlebot can crawl without overloading your server. It rises when the site responds fast and reliably; it falls with 5xx errors and slow responses.
Crawl demand: how much Google *wants* to crawl your pages. It depends on URL popularity, freshness, and how Google perceives your site's "inventory" (the more junk URLs it contains, the more of that demand is spent on pages that don't matter).

Rule: even if the server can take more, low demand means less crawling. Both must be present.

/// CRAWL BUDGET = CAPACITY + DEMAND

Capacity limit

›Server speed and stability
›No 5xx errors / timeouts
›Google's resource limits

Crawl demand

›URL popularity
›Content freshness
›Perceived "inventory" (less junk)

* Both must be present: even with a fast server, low demand = less crawling.

Who it really affects (Google's thresholds)

Google's official guide is "advanced" and intended for:

sites with 1,000,000+ unique pages whose content changes about weekly,
sites with 10,000+ pages that change daily,
or sites with a large share of URLs in the "Discovered – currently not indexed" status.

Google stresses these are rough thresholds. Small sites (up to a few thousand pages), especially those indexed the same day they publish, don't need to worry.
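As a rough illustration, the thresholds above can be written as a small check. This is only a sketch: the function name, the `changes` labels and the `large_share` cut-off are assumptions made for the example (Google gives no number for a "large share"), and Google itself calls the thresholds approximate.

```python
def crawl_budget_guide_applies(
    unique_pages: int,
    changes: str,  # hypothetical labels: "daily", "weekly", "monthly"
    discovered_not_indexed_share: float = 0.0,
    large_share: float = 0.2,  # assumption: no official number exists
) -> bool:
    """Rough check of a site against the thresholds listed above."""
    # 1,000,000+ unique pages whose content changes about weekly (or faster)
    if unique_pages >= 1_000_000 and changes in ("weekly", "daily"):
        return True
    # 10,000+ pages whose content changes daily
    if unique_pages >= 10_000 and changes == "daily":
        return True
    # a large share of URLs stuck in "Discovered – currently not indexed"
    return discovered_not_indexed_share >= large_share


if __name__ == "__main__":
    print(crawl_budget_guide_applies(3_000, "daily"))
    print(crawl_budget_guide_applies(50_000, "daily"))
```

A small site can still return `True` here if much of its inventory sits undiscovered, which matches the third condition in the list.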

What wastes crawl budget (Google's list)

Problem	Effect
Faceted navigation and session IDs in URLs	Near-infinite duplicates instead of real pages
Duplicate content	Repeatedly crawling the same thing
Soft 404 (an empty page returning 200)	Crawling with no value
Hacked pages and "infinite spaces" (e.g. calendars)	Googlebot gets stuck
Low-quality content and spam	Wasted demand
Long redirect chains	Wasted requests

Faceted navigation — enemy number one

In online stores, a single problem accounts for most of the wasted budget: faceted navigation (color, size, price, brand filters). Every filter combination creates a new URL, and there are millions of combinations — Googlebot can drown in an infinite space of addresses that add nothing new.

The strategy depends on each facet's search value:

Faceted URL type	Example	What to do with it
Valuable for SEO (there's demand)	/shoes/nike, /shoes/running	Index it — own content, own title, internal link
Not valuable, but needed by users	sorting, list/grid view	canonical to the base version; don't link it for bots
Junk combinations and parameters	?color=red&size=42&sort=price	Block in robots.txt or don't generate crawlable links
"Empty result" filters	combinations with no products	Don't generate the link; return a sensible status

The rule: deliberately pick a handful of valuable facets to index, and cut the rest off from crawling. The most common mistake is leaving every combination as a crawlable link. Then, even with a fast server, Googlebot can't keep up with crawling your real products.
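The facet strategy can be sketched as a simple URL classifier. Everything store-specific here is hypothetical: the allowlist of indexable paths, the parameter names and the `facet_action` function are illustrative assumptions, not a standard. The point is only that the decision is made per URL pattern, up front.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical rules for an example store; adapt to your own URL scheme.
INDEXABLE_PATHS = {"/shoes", "/shoes/nike", "/shoes/running"}  # facets with demand
PRESENTATION_PARAMS = {"sort", "view"}  # needed by users, not by search


def facet_action(url: str) -> str:
    """Return 'index', 'canonical', 'block' or 'review' for a store URL."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if not params:
        # Clean path: index it only if it was deliberately chosen.
        return "index" if parts.path in INDEXABLE_PATHS else "review"
    if params - PRESENTATION_PARAMS:
        # Any filter or tracking parameter: junk combination.
        return "block"  # robots.txt rule or no crawlable link
    return "canonical"  # sorting/view only: canonical to the base version


if __name__ == "__main__":
    for u in (
        "https://example.com/shoes/nike",
        "https://example.com/shoes?sort=price",
        "https://example.com/shoes?color=red&size=42&sort=price",
    ):
        print(u, "->", facet_action(u))
```

The `review` result is a deliberate choice in this sketch: a clean path that nobody put on the allowlist should trigger a decision rather than be indexed by default.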

Crawl budget in e-commerce — common traps

Stores are where crawl budget hurts most. The most common sources of waste beyond facets:

Unavailable / discontinued products. Thousands of "out of stock" pages consume crawl. The decision depends on whether the product returns: keep it (200) if it will, 404/410 if it's gone for good, or redirect to a successor.
Category pagination. Deep pages like /category?page=87 rarely have value. Make sure the most important products are reachable another way, and consider limiting depth.
Product variants as separate URLs (color, size) — consolidate with canonical to the parent product page if they differ only by an attribute.
Session IDs and tracking parameters in URLs — the classic generator of infinite duplicates.

Cleaning up these four areas frees budget that Googlebot then spends where you want it: on real, revenue-driving product and category pages.
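For the unavailable-product case, the decision described above fits in a few lines. This is a minimal sketch; the function name and its two inputs are assumptions about what a store's catalogue could tell you, and 410 is used for "gone for good" although 404 would serve equally well.

```python
from typing import Optional, Tuple


def unavailable_product_response(
    returns_to_stock: bool, successor_url: Optional[str] = None
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) for an out-of-stock product page."""
    if returns_to_stock:
        return 200, None  # keep the page, show a note and alternatives
    if successor_url:
        return 301, successor_url  # permanent redirect to the closest successor
    return 410, None  # gone for good and nothing to redirect to


if __name__ == "__main__":
    print(unavailable_product_response(True))
    print(unavailable_product_response(False, "/shoes/nike/new-model"))
    print(unavailable_product_response(False))
```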

How to optimize (Google's recommendations)

Tidy your URL inventory. Consolidate duplicates with `rel=canonical`; remove dead pages with 404/410; eliminate soft 404s.
Block unimportant URLs in `robots.txt` (filters, actions, infinite spaces). Remember: `Disallow` does not remove a page from the index — `noindex` does that (and it in turn requires the page to be crawlable).
Return 304 for unchanged pages. Googlebot sends an `If-Modified-Since` header; if content is unchanged, reply 304 Not Modified with no body — you save resources and Google crawls more real URLs.
Keep a clean sitemap with accurate `lastmod` — only canonical, indexable URLs.
Flatten redirect chains to a single hop.
Speed up the server — faster, stable responses raise the capacity limit. Mind your Core Web Vitals.
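The 304 recommendation from the list above can be sketched as a framework-independent check. It assumes the application knows the last content change of each page as a timezone-aware `datetime`; the name `conditional_status` is hypothetical, and a real server would also handle `ETag` / `If-None-Match`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the page is unchanged since the date the client sent, else 200."""
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    return 304 if last_modified.replace(microsecond=0) <= since else 200


if __name__ == "__main__":
    changed = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(conditional_status(changed, "Wed, 01 Jan 2025 12:00:00 GMT"))
    print(conditional_status(changed, "Wed, 01 Jan 2025 11:00:00 GMT"))
```

On a 304 the response carries no body, which is where the saving comes from.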

robots.txt — cut off junk paths

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /search
Sitemap: https://yourdomain.com/sitemap.xml

Note: the old **crawl-rate slider in Search Console was removed on January 8, 2024**. Google adjusts the rate automatically (slowing down on 5xx/429 errors and rising response times).
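Before deploying rules like these, it helps to test which URLs they actually match. The sketch below is a simplified matcher written for this example, not Google's parser: it handles only `*` and a trailing `$`, and ignores `Allow` rules and precedence. The function names and test URLs are assumptions.

```python
import re
from urllib.parse import urlsplit

# The Disallow patterns from the robots.txt example above.
DISALLOW = ["/*?sort=", "/*?filter=", "/cart", "/search"]


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_disallowed(url: str, patterns=DISALLOW) -> bool:
    """True if the URL's path and query match any Disallow pattern."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(pattern_to_regex(p).match(target) for p in patterns)


if __name__ == "__main__":
    for u in (
        "https://yourdomain.com/shoes?sort=price",
        "https://yourdomain.com/shoes?color=red&sort=price",
        "https://yourdomain.com/shoes/nike",
    ):
        print(u, "->", "blocked" if is_disallowed(u) else "crawlable")
```

One thing such a test tends to surface: `/*?sort=` matches only when `sort` is the first parameter. A URL like `?color=red&sort=price` slips through unless a companion pattern such as `/*&sort=` is added. Patterns are also prefix matches, so `/cart` covers `/cartoon` as well.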

Sitemaps and lastmod — steering demand

A sitemap isn't just a list of addresses. It's a demand signal and a map of what you want indexed. Keep only canonical, indexable URLs in it (no redirects, 404s, `noindex` or robots-blocked addresses). The `lastmod` element should reflect the real date of a meaningful content change; Google uses it as a hint about what to refresh. Artificially setting `lastmod` to "today" across the whole sitemap erodes trust in the signal, and Google stops relying on it. For very large sites, split sitemaps by topic (e.g. per category), because GSC then shows immediately which section indexes worse.
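A minimal sketch of generating such a sitemap, assuming your CMS can supply, for each URL, the date of the last real content change and whether the page is canonical and indexable. The `build_sitemap` name and the tuple layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Iterable, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, date, bool]]) -> str:
    """pages: (url, date of last real content change, canonical-and-indexable flag)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, changed, indexable in pages:
        if not indexable:
            continue  # redirects, 404s, noindex and blocked URLs stay out
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # lastmod comes from the content itself, never from "today"
        ET.SubElement(node, "lastmod").text = changed.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://yourdomain.com/shoes/nike", date(2026, 6, 1), True),
        ("https://yourdomain.com/cart", date(2026, 6, 2), False),
    ]))
```

Calling it once per category, with a separate output file each time, gives the topical split described above.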

JavaScript and crawl budget — the hidden cost

On large sites, JavaScript rendering can quietly eat the budget. Googlebot crawls in two stages: first it fetches the HTML, then JS rendering (running the page to see script-loaded content) goes into a separate queue and costs far more resources than plain HTML. The budget consequences:

Content injected only via JS is "more expensive" to crawl and can be indexed with delay — on a large site, a real bottleneck.
SSR or static generation (SSG) instead of pure client-side rendering hands Googlebot ready HTML and relieves the budget.
Every unnecessary resource (heavy scripts, redundant files) raises the cost of a single fetch — cleaning up rendering is also cleaning up the budget.

Rule: the more content is visible in raw HTML without executing JS, the more efficiently Google spends your crawl budget. It's the same foundation that improves Core Web Vitals.
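A quick way to apply this rule is to check whether a key phrase (a product name, say) appears in the raw HTML before any script runs. The sketch below works on an HTML string you have already fetched; the helper names and the sample markup are assumptions for illustration, and text inside `<script>` and `<style>` is deliberately ignored.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the HTML text without executing any JS."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()


if __name__ == "__main__":
    ssr = "<html><body><h1>Nike Pegasus 41</h1></body></html>"
    csr = '<html><body><div id="root"></div><script>render("Nike Pegasus 41")</script></body></html>'
    print(visible_in_raw_html(ssr, "Nike Pegasus 41"))
    print(visible_in_raw_html(csr, "Nike Pegasus 41"))
```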

Crawl budget isn't indexing — and isn't ranking

Two common myths. First, crawling ≠ indexing: a page can be crawled and still not make the index ("Crawled – currently not indexed" is usually a quality signal, not a budget one). Page value, not budget alone, decides indexing. Second, crawl budget is not a ranking factor — it's a prerequisite (you must be crawled to rank), but a bigger budget doesn't raise positions.

/// CRAWLING ≠ INDEXING ≠ RANKING

Crawling

Googlebot fetches the URL

Indexing

Google decides whether to store it (quality decides)

Ranking

The page competes for position

* Crawl budget is a prerequisite, not a ranking factor. A crawled page can still fail to be indexed.

How to monitor

In Search Console use the Crawl Stats report: number of requests, download size, average response time, plus breakdowns by response code, file type and purpose (Discovery vs Refresh). In the "Pages" report watch two statuses: "Discovered – currently not indexed" (Google knows the URL but hasn't crawled it yet — a classic budget/demand symptom) and "Crawled – currently not indexed" (crawled but not indexed — a quality signal).

Server log analysis — the deepest insight

Search Console shows an aggregated picture; server logs show the truth — every Googlebot request individually. What to look at:

Hit distribution by section: how much crawl goes to products/categories vs junk (parameters, cart, filters). That's your "crawl waste" indicator.
Response codes for Googlebot: the share of 404/5xx and redirects; a high ratio is a budget leak.
Visit frequency of key pages: whether your most important products are visited regularly or only once a quarter.
Verifying the real Googlebot: via reverse DNS (followed by a forward lookup to confirm the IP), because many bots impersonate it.

The goal is simple: see where Googlebot wastes time and cut that part off — via `robots.txt`, canonicalization or removing crawlable links.
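The checks above can be sketched against an access log in the common "combined" format. The regular expression, the `JUNK_MARKERS` list and both function names are assumptions for this example; adjust them to your log format and URL scheme. The accepted hostname suffixes should be checked against Google's current documentation.

```python
import re
import socket
from collections import Counter

LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
JUNK_MARKERS = ("?", "/cart", "/search")  # assumption: what counts as crawl waste


def crawl_report(lines):
    """Summarize Googlebot hits by top-level section and status class."""
    sections, statuses = Counter(), Counter()
    total = junk = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or "Googlebot" not in m["agent"]:
            continue
        total += 1
        path = m["path"]
        statuses[m["status"][0] + "xx"] += 1
        sections["/" + path.lstrip("/").split("/", 1)[0].split("?", 1)[0]] += 1
        if any(marker in path for marker in JUNK_MARKERS):
            junk += 1
    return {"total": total, "sections": sections, "statuses": statuses,
            "waste_share": junk / total if total else 0.0}


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse DNS, then a forward lookup that must return the same IP."""
    try:
        host = reverse(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in forward(host)[2]
    except OSError:  # lookup failed
        return False
```

The user-agent filter alone is not proof of origin, so for anything that matters, run the IPs through `is_verified_googlebot` (it performs live DNS lookups by default) before trusting the numbers.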

---

I optimize crawl budget for large sites and stores as part of technical SEO. I teach it in the SEO & GEO course. Get in touch — I'll start by analyzing your crawl stats and server logs.

About the author: Paweł Wiszniewski

SEO & GEO specialist and AI engineer from Białystok. 10 years building search visibility for recognized brands and 3 years delivering AI — agents, automation and LLM integrations (Next.js, React, Node.js).
