Google explains how crawling works in 2026

Gary Illyes from Google shared more details on Googlebot, Google's crawling ecosystem, fetching, and how Google processes the bytes it downloads.
The article is titled Inside Googlebot: demystifying crawling, fetching, and the bytes we process.
Googlebot. Google does not have one singular crawler; it has many crawlers for many purposes. So referring to Googlebot as a single crawler may no longer be accurate. Google documents many of its crawlers and user agents in its developer documentation.
Limits. Google recently spoke about its crawling limits, and now Gary Illyes has dug into them further. He said:
- Googlebot currently fetches up to 2MB for any individual URL (excluding PDFs).
- This means it crawls only the first 2MB of a resource, including the HTTP headers.
- For PDF files, the limit is 64MB.
- Image and video crawlers typically have a wide range of threshold values, which largely depend on the product they are fetching for.
- For any other crawlers that don’t specify a limit, the default is 15MB regardless of content type.
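The limits above amount to a simple per-content-type lookup with a fallback. Here is a minimal sketch of that idea; the byte values come from the article, but the mapping and function names are purely illustrative, not Google's actual configuration:

```python
# Fetch-size caps as described in the article (values in bytes).
# This table is an illustrative summary, not Google's real config.
MB = 1024 * 1024

FETCH_LIMITS = {
    "html": 2 * MB,     # Googlebot: first 2MB of any URL, headers included
    "pdf": 64 * MB,     # PDFs get a much larger cap
    "default": 15 * MB, # crawlers without a documented limit
}

def fetch_limit(content_type: str) -> int:
    """Return the byte cap for a content type, falling back to the default."""
    return FETCH_LIMITS.get(content_type, FETCH_LIMITS["default"])

print(fetch_limit("html"))  # 2097152  (2MB)
print(fetch_limit("pdf"))   # 67108864 (64MB)
print(fetch_limit("xml"))   # 15728640 (15MB default)
```

Note that image and video crawlers don't fit a static table like this, since, per the article, their thresholds vary by product.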
Then what happens when Google crawls?
- Partial fetching: If your HTML file is larger than 2MB, Googlebot doesn’t reject the page. Instead, it stops the fetch exactly at the 2MB cutoff. Note that the limit includes HTTP request headers.
- Processing the cutoff: That downloaded portion (the first 2MB of bytes) is passed along to Google's indexing systems and the Web Rendering Service (WRS) as if it were the complete file.
- The unseen bytes: Any bytes that exist after that 2MB threshold are entirely ignored. They aren’t fetched, they aren’t rendered, and they aren’t indexed.
- Bringing in resources: Every resource referenced in the HTML (excluding media, fonts, and a few exotic file types) is fetched by WRS using Googlebot, just like the parent HTML. Each resource has its own, separate, per-URL byte counter and doesn't count toward the size of the parent page.
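The key behavior in the steps above is that an oversized document is truncated, not rejected, and that each resource is counted independently. A small sketch of that behavior, assuming a fixed 2MB cap per URL (this mirrors the described behavior; it is not Googlebot code):

```python
# Illustrative partial fetching: stop reading at a fixed byte cap
# instead of rejecting oversized documents.
CAP = 2 * 1024 * 1024  # 2MB cutoff per URL

def partial_fetch(payload: bytes, cap: int = CAP) -> bytes:
    """Return only the first `cap` bytes; anything after is never seen."""
    return payload[:cap]

# A page larger than the cap is truncated, not dropped:
page = b"<html>" + b"x" * (3 * 1024 * 1024)
seen = partial_fetch(page)
assert len(seen) == CAP        # downstream systems get exactly the first 2MB
assert seen == page[:CAP]      # bytes past the cutoff are simply ignored

# A referenced resource gets its own independent counter:
css = b"body{margin:0}"
assert partial_fetch(css) == css  # small resources are fetched in full
```

The design choice worth noting is that truncation happens silently: downstream systems treat the first 2MB as the complete file, which is why content placed after the cutoff never reaches indexing.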
How Google renders these bytes. When the crawler accesses these bytes, it then passes it over to WRS, the web rendering service. “The WRS processes JavaScript and executes client-side code similar to a modern browser to understand the final visual and textual state of the page. Rendering pulls in and executes JavaScript and CSS files, and processes XHR requests to better understand the page’s textual content and structure (it doesn’t request images or videos). For each requested resource, the 2MB limit also applies,” Google explained.
Best practices. Google listed these best practices:
- Keep your HTML lean: Move heavy CSS and JavaScript to external files. While the initial HTML document is capped at 2MB, external scripts and stylesheets are fetched separately (subject to their own limits).
- Order matters: Place your most critical elements, like meta tags, <title> elements, <link> elements, canonicals, and essential structured data, higher up in the HTML document. This ensures they are unlikely to fall below the cutoff.
- Monitor your server logs: Keep an eye on your server response times. If your server is struggling to serve bytes, Google's fetchers will automatically back off to avoid overloading your infrastructure, which will reduce your crawl frequency.
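Following the "order matters" advice, you can roughly self-check whether critical head tags land within the first 2MB of your own HTML. This is a hypothetical helper, not a Google tool, and simple byte search is only an approximation of real HTML parsing:

```python
# Rough self-check: do critical tags appear before the 2MB cutoff?
CAP = 2 * 1024 * 1024

def tags_within_cap(html: bytes, tags=(b"<title", b'rel="canonical"')) -> dict:
    """Report whether each tag occurs anywhere in the first 2MB of the HTML."""
    head = html[:CAP]
    return {tag.decode(): tag in head for tag in tags}

html = (
    b'<html><head><title>Example</title>'
    b'<link rel="canonical" href="/"></head>'
    b"<body>" + b"x" * (3 * 1024 * 1024) + b"</body></html>"
)
print(tags_within_cap(html))
# -> {'<title': True, 'rel="canonical"': True}
```

Here the page body pushes the document past 3MB, but because the critical tags sit in the head, they remain safely above the cutoff.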
Podcast. Google also published a podcast episode on the topic.