Technical SEO: crawl budget, site speed, and indexing

Published by: Shrey Bhardwaj, Safalta Expert | Updated Thu, 13 Jun 2024 11:46 AM IST

Highlights

  • Crawl budget is the number of pages a search engine bot will crawl and index on a site within a given timeframe.

  • Small sites rarely need to worry about it, but enterprise websites with millions of pages can struggle to get all of their content indexed.

  • Optimizing the crawl budget helps ensure that a site's most important content gets crawled and indexed.

The number of pages a search engine bot will visit and index on a website in a given amount of time is known as the crawl budget, and it is essential to a website's overall SEO performance. A variety of crawl-related issues can deplete the budget before the search engine ever reaches the site's most important content. The result? Google never finds those content assets, and never recognises that they have been updated recently.
 

Table Of Contents

  • What is a crawl budget?

  • Why is a crawl budget important for SEO?

  • What does crawl budget mean for Googlebot?

  • Factors that affect crawl budget

  • Various techniques for maximizing crawl budget

  • Strategies to Monitor Your Crawl Budget


What is a crawl budget?

A crawl budget is the number of requests for assets that a web crawler is allowed to make on a website. The search engine determines the crawl budget for a site, and once it is used up, the crawler will stop accessing content on the site. The crawl budget varies for each website, as search engines use various criteria to decide how much time a crawler should spend on a particular website. Several factors influence the crawl budget, including:

Website performance: Slower websites are likely to receive a smaller budget than well-optimized ones.

Website size: Larger websites will be allocated a greater budget.

Content freshness: Sites that frequently publish or update their content will be given more time for crawling.

Links: The number of links on the site also plays a role, among other factors.

It is recommended that the number of requests a crawler needs to access all of a site’s content be lower than the crawl budget. Unfortunately, this is not always the case, leading to serious indexation problems.
 

Why is a crawl budget important for SEO?

The internet is an enormous place with countless websites and pages, which poses a challenge for search engines to crawl and index everything. Search engines like Google need to prioritize pages, choose which content to crawl, and decide how often to re-crawl resources. These decisions can impact how search engines crawl individual websites. For small sites, this isn't typically an issue, but for larger enterprise websites with millions of pages, it can be challenging to get all the content indexed. This is where optimizing for the crawl budget becomes important. By consistently monitoring and optimizing, an enterprise website can maximize its crawl budget and ensure that its most important content gets crawled and indexed.

 

What does crawl budget mean for Googlebot?

The Crawl Rate:

The crawl budget, which we've been discussing from the perspective of website owners and marketers, also matters from the search engine's side. According to Google's Gary Illyes, Googlebot's crawl budget consists of two main elements: the crawl rate and the crawl demand. When Googlebot crawls a site, it requests access to various assets, such as pages, images, or other files, much as a web browser does when used by a human. This process consumes server resources and bandwidth allocated to the website by its host. Excessive crawling can overload the site, leading to slower performance or even a complete breakdown. The crawl rate is designed to prevent the bot from making too many requests too frequently, thus avoiding disruption to the site's performance. It's worth noting that Google allows webmasters to control their site's crawl rate through Google Search Console.

This setting lets site owners advise the web crawler on how frequently it should visit the site. However, manually setting the rate has its drawbacks: if the rate is set too low, it affects how often Google discovers new content, while setting it too high can overload the server.
 

The site may experience two issues as a result:

1. Low Crawl Rate: New content may remain unindexed by the search engine for extended periods of time.
2. High Crawl Rate: This can unnecessarily consume the monthly crawl budget by crawling content that hasn't changed or doesn't need to be accessed by Googlebot frequently.

Unless you know exactly what you are doing, it's recommended to let Google handle the crawl rate. Instead, focus on ensuring that the crawler can access all critical content within the available crawl budget.


The Crawl Demand:

If there's no demand for indexing, there will be low activity from Googlebot, regardless of whether the crawl rate limit is reached. Crawl demand helps crawlers determine if it's worthwhile to access the website again.

Two factors affect crawl demand:

1. URL Popularity: More popular pages tend to be crawled more frequently.

2. Stale URLs: Google also tries to prevent URLs from becoming stale in the index.

Factors that affect crawl budget

According to Google, the most significant issue affecting crawl budget is low-value URLs. When there are too many URLs that offer minimal or no value, but are still within the crawler’s reach, they consume the available budget and hinder Googlebot from accessing more crucial assets. The problem is that you may not even be aware of the existence of many low-value URLs, as they are often generated without your direct involvement. Let's explore how this typically occurs.
 

How Websites Generate Low-Value URLs

1. Faceted Navigation

Faceted navigation allows users to filter or sort web page results based on different criteria, such as the filters a retailer like Ross-Simons uses to fine-tune product listings.

Faceted navigation, while helpful for users, can cause problems for search engines. Filters often create dynamic URLs that may appear as individual URLs to Googlebot, leading to excessive crawling and indexing. This can deplete your crawl budget and result in duplicate content issues on the site. Additionally, faceted navigation can dilute link equity by directing it to dynamic URLs that you don’t want to be indexed.

 

To address these issues, there are several options:

1. Use a "nofollow" tag on faceted navigation links to minimize the discovery of unnecessary URLs and reduce the crawl space.

2. Employ a "noindex" tag to indicate which pages should not be included in the index. However, this may still waste the crawl budget and dilute link equity.

3. Utilize a robots.txt disallow to prevent crawling of URLs with unnecessary parameters, specifying the patterns or directories to be disallowed. For example, disallowing the "$0 - $100" price facet in the robots file (a quick way to sanity-check such a pattern is sketched after this list):

   Disallow: *?prefn1=priceRank&prefv1=%240%20-%20%24100

4. Implement canonical tags to specify a preferred version of a group of pages, consolidating link equity into the chosen preferred page. Note that this method may still waste the crawl budget.
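
If you go the robots.txt route, it is worth sanity-checking the pattern before deploying it. Below is a minimal, pure-Python sketch (the faceted URLs are made up for illustration) that converts a Google-style wildcard pattern like the Disallow rule above into a regular expression and tests it against sample paths:

    import re

    def disallow_pattern_to_regex(pattern: str) -> re.Pattern:
        """Convert a Google-style robots.txt path pattern ('*' wildcard,
        optional '$' end anchor) into a compiled regular expression."""
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        # Escape everything, then turn the escaped '*' back into '.*'
        regex = re.escape(body).replace(r"\*", ".*")
        return re.compile(regex + ("$" if anchored else ""))

    # The Disallow rule from the example above
    rule = disallow_pattern_to_regex("*?prefn1=priceRank&prefv1=%240%20-%20%24100")

    # Hypothetical faceted URLs to sanity-check
    for url in ["/jewelry/rings?prefn1=priceRank&prefv1=%240%20-%20%24100",
                "/jewelry/rings"]:
        blocked = rule.match(url) is not None
        print(f"{url} -> {'blocked' if blocked else 'crawlable'}")

Running a handful of real URLs from your crawl logs through a check like this helps confirm that the rule blocks only the faceted variations, not the category pages themselves.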

2. Session Identifiers/On-site Duplicate Content

URL parameters, such as session IDs or tracking IDs, and forms that send information using the GET method can create numerous unique versions of the same URL. These dynamic URLs can lead to duplicate content problems on the website and consume a significant portion of the crawl budget, despite the fact that none of these assets are genuinely unique.
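
As a rough illustration, the sketch below uses Python's standard urllib.parse to strip hypothetical session and tracking parameters so that equivalent URLs collapse to a single form; the parameter names are assumptions and would need to match whatever your site actually appends:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Hypothetical parameters that create duplicate URLs on this site
    TRACKING_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}

    def normalize_url(url: str) -> str:
        """Drop session/tracking parameters and sort the rest so that
        equivalent URLs collapse to a single canonical form."""
        parts = urlparse(url)
        kept = sorted(
            (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS
        )
        return urlunparse(parts._replace(query=urlencode(kept)))

    print(normalize_url("https://example.com/shoes?utm_source=mail&color=red&sessionid=abc123"))
    # -> https://example.com/shoes?color=red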

3. Soft 404

A "soft 404" happens when a web server responds with a 200 OK HTTP status code instead of the 404 Not Found, even though the page doesn't exist. In this situation, the Googlebot will try to crawl the page, using up the allocated budget, instead of moving on to actual, existing URLs.

4. Hacked Pages

If your website has been hacked, it can lead to an increase in the number of URLs that a web crawler might try to access. If this happens, it's important to remove the hacked pages from your site and make sure that when Googlebot tries to access them, it receives a "404 Not Found" response code. Google is familiar with hacked pages and will promptly remove them from its index, but only if you serve a 404 response.

5. Infinite Spaces and Proxies

Nearly unlimited lists of URLs that Googlebot will try to crawl are referred to as infinite spaces. They can occur in various ways, with the most common being auto-generated URLs from the site search. Some websites expose on-site search results as crawlable pages, creating an almost infinite number of low-value URLs that Google will consider crawling. A calendar with a "next month" link is another such situation: every generated page links to yet another page, so the calendar can produce thousands of unnecessary URLs.

Google has recommended ways to handle infinite spaces, such as removing entire categories of those links in the robots.txt file. By doing this, you can prevent Googlebot from accessing those URLs right from the beginning and conserve your crawl budget for other pages.


More factors that affect crawl budget

1. Broken and Redirected Links

A broken link is a hyperlink that directs to a page that does not exist. This can occur due to an incorrect URL in the link or if the page has been removed while the internal link pointing to it remains. 

A redirected link, by contrast, reaches its target through one or more redirects; sometimes the chain is several hops long, and sometimes it even ends at a page that no longer exists.

Both issues can impact the crawl budget, with redirected links being particularly problematic. They can force the crawler through a chain of redirects, consuming the available budget for unnecessary redirects. 

For more information on URL redirects, you can visit our guide: "A Technical SEO Guide to URL Redirects."

2. Issues with Site Speed

It's essential to consider site speed for the crawl budget. If a page takes too long to load when Googlebot tries to access it, Googlebot might give up and move to another website. A response time of two seconds or more significantly reduces the crawl budget for the site. You may receive the following message:

"We’re noticing an extremely high response time for requests made to your site (sometimes over 2 seconds to fetch a single URL). As a result, we have significantly limited the number of URLs we’ll crawl from your site, and you'll see this reflected in Fetch as well as Google."

3. Issues with the Hreflang Tag

Alternate URLs defined with the hreflang tag also consume the crawl budget, for a straightforward reason: Google will crawl them to make sure those assets are the same or comparable and do not point to spam or other content.

4. CSS and JavaScript

In addition to HTML content, CSS or JavaScript files also consume a crawl budget. Years ago, Google didn't crawl these files, so it wasn't a big issue. However, since Google started crawling these files, especially for rendering pages for elements such as ad placement, content above the fold, and hidden content, many people haven't taken the time to optimize them.

5. The Sitemap

The XML sitemap is crucial for optimizing the crawl budget. Google gives priority to crawling URLs included in the sitemap over those it discovers while crawling the site. However, it's important to note that not all pages should be added to the sitemap. Including all pages will cause Google to prioritize all content, potentially wasting your crawl budget by accessing unnecessary assets.

6. AMP Pages 

Many websites are now creating AMP versions of their content. As of May 2018, there were over 6 billion AMP pages on the web, and that number has likely grown even more since then. Google has confirmed that AMP pages also use crawl budgets because Googlebot needs to crawl these assets to check for errors and to make sure that the content is consistent between the regular page and its AMP version.

 

Various techniques for maximizing crawl budget

Based on the information provided, it's clear that severe issues with your site's crawl budget can have a significant impact. The good news is that you have the ability to maximize the time crawlers allocate to your website. 

Some broad actions that can be taken to assist include increasing the performance of the website overall, minimizing duplicate information, removing broken pages, and streamlining the site's architecture.

However, there are additional factors that you should optimize to prevent the crawl budget from being wasted.

Reduce the number of crawlable URLs

The key to optimizing the crawl budget is to make sure that the number of URLs that can be crawled does not exceed the budget. If you have fewer URLs to crawl than the allocated number of requests, there is a much better chance that search engine crawlers will be able to access all of your content. There are several different ways to achieve this, but here are some of the most common approaches:

1. Fix 30x Redirects

It's crucial to remember that any broken link or redirect can be a dead end for Googlebot. When it hits a broken link, the crawler may conclude there's nowhere else to go and move on to another website. With redirects, it can travel through a few hops, but Google recommends not exceeding five; beyond that, the crawler moves on. To avoid these issues, make sure that all redirected URLs point directly to their final destination and fix any broken links.
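
To see how long a given redirect chain actually is, you can follow the Location headers one hop at a time. The sketch below assumes the third-party requests library is available and uses a placeholder URL:

    import requests  # third-party HTTP client, assumed to be installed
    from urllib.parse import urljoin

    def redirect_chain(url: str, max_hops: int = 10) -> list[str]:
        """Follow Location headers one hop at a time and return every URL in the chain."""
        chain = [url]
        for _ in range(max_hops):
            response = requests.head(chain[-1], allow_redirects=False, timeout=10)
            if response.status_code not in (301, 302, 303, 307, 308):
                break
            chain.append(urljoin(chain[-1], response.headers["Location"]))
        return chain

    # Placeholder URL; anything beyond a handful of hops should point straight to the destination
    chain = redirect_chain("https://www.example.com/old-page")
    print(" -> ".join(chain), f"({len(chain) - 1} hop(s))")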

2. Remove Links to 4xx (No Longer Active) URLs

The crawl budget usage is also optimized by eliminating any links to 404 pages. The likelihood of having internal links on your website pointing to inactive URLs increases with the age of your website.



3. Optimize the Faceted Navigation

We've already discussed the issue of faceted navigation. Filters on a page can create many low-value URLs that consume the crawl budget. However, this doesn't mean that you can't use faceted navigation. On the contrary, you can use it, but you need to take steps to prevent crawlers from trying to access dynamic URLs created by the navigation.

 

To address this faceted navigation issue, there are several solutions you can implement based on which parts of your site should be indexed.

  • NOINDEX:

You can implement "noindex" tags to inform bots about which pages to exclude from the index. This method will remove pages from the index, but there will still be a crawl budget spent on them and link equity that is diluted.

  • CANONICALIZATION:

Canonical tags allow you to tell Google that a group of similar pages has a preferred version of the page.

  • NOFOLLOW:

The simplest solution is to add the "nofollow" tag to those internal links. It will stop crawlers from trying to access the information by clicking on those links.

4. Remove Outdated Content

You don't need to physically delete those pages. Simply preventing crawlers from reaching outdated content quickly reduces the number of crawlable URLs and frees up crawl budget.


5. Block Crawlers from Accessing URLs that Should Not Be Indexed

To save your crawl budget from being wasted, you can block crawlers from accessing URLs that don't need to be indexed. These could include pages with legal information, tags, content categories, or other assets that provide little value to searchers. The easiest way to do this is by adding the "noindex" tag to those assets, or a canonical tag pointing to the page you want indexed instead; keep in mind, though, that crawlers still have to fetch a page to see those tags, so a robots.txt disallow is the only way to stop the requests entirely.
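
To audit which of these pages already carry a noindex or canonical tag, you can scan their HTML. The sketch below uses Python's built-in html.parser and a stand-in HTML snippet; in practice you would feed it the fetched page source:

    from html.parser import HTMLParser

    class IndexingTagScanner(HTMLParser):
        """Collect the meta robots directive and canonical URL from a page's <head>."""
        def __init__(self):
            super().__init__()
            self.robots = None
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "robots":
                self.robots = attrs.get("content", "")
            elif tag == "link" and attrs.get("rel", "").lower() == "canonical":
                self.canonical = attrs.get("href")

    # Stand-in HTML; in practice, feed in the fetched page source
    scanner = IndexingTagScanner()
    scanner.feed('<head><meta name="robots" content="noindex,follow">'
                 '<link rel="canonical" href="https://example.com/rings"></head>')
    print("robots:", scanner.robots, "| canonical:", scanner.canonical)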

6. Cleaning the Sitemap

As previously discussed, Google gives priority to URLs in the sitemap over those it finds while crawling the site. However, without regular updates, the sitemap can become filled with inactive URLs or pages that don't need to be indexed. Regularly updating the sitemap and removing unwanted URLs will help free up crawl budget.
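
A simple way to spot candidates for removal is to parse the sitemap and flag every URL that no longer returns a 200. The following sketch uses only the Python standard library and a placeholder sitemap location; on a large sitemap you would want to throttle the requests:

    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def stale_sitemap_urls(sitemap_url: str) -> list[str]:
        """Return every <loc> in the sitemap that no longer answers with HTTP 200."""
        with urllib.request.urlopen(sitemap_url, timeout=30) as response:
            root = ET.fromstring(response.read())
        stale = []
        for loc in root.findall(".//sm:loc", NS):
            url = loc.text.strip()
            try:
                with urllib.request.urlopen(url, timeout=10) as page:
                    if page.status != 200:
                        stale.append(url)
            except urllib.error.HTTPError:
                stale.append(url)  # 404s, 410s, 5xx - candidates for removal
        return stale

    # Placeholder sitemap location
    # print(stale_sitemap_urls("https://www.example.com/sitemap.xml"))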

7. Using the Robots.txt File

A robots.txt file informs search engine crawlers about the pages or files they can or can't request from your site. Usually, it is used to prevent crawlers from overwhelming sites with requests, but it can also guide Googlebot away from specific sections of the site and free up the crawl budget. One important thing to note is that while Googlebot respects robots.txt rules when crawling, a disallowed URL can still end up in the index (without its content) if other pages link to it.

8. Improve the Site Speed

Google has openly stated that improving site speed not only enhances user experience but also increases the crawl rate. Therefore, making pages load faster is likely to improve the crawl budget usage. Optimizing page speed is a broad topic that involves working on various technical SEO factors. At seoClarity, we recommend at least enabling compression, removing render-blocking JavaScript, leveraging browser caching, and optimizing images to ensure that Googlebot has sufficient time to visit and index all of your pages.
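
As a quick spot check for the first of those items, you can ask the server for a gzip-encoded response and see whether it complies. A minimal sketch with a placeholder URL:

    import urllib.request

    def is_compressed(url: str) -> bool:
        """Request a page with gzip accepted and check whether the server complies."""
        request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
        with urllib.request.urlopen(request, timeout=10) as response:
            return "gzip" in (response.headers.get("Content-Encoding") or "")

    # Placeholder URL
    print("compression enabled:", is_compressed("https://www.example.com/"))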

9. Improve the Internal Linking Structure

Search engine bots find content on a site in two ways: First, they consult the sitemap. But they also navigate the site by following internal links.

This means that if a certain page is well connected to other content through internal links, it has a better chance of being discovered by the bot. A page with few or no internal links may go unnoticed by the bot.

However, you can use internal links to guide crawlers to pages or content clusters that you want to index. For instance, you can link these pages from content with many backlinks and high crawl frequency. This will increase the likelihood that Googlebot will reach and index those pages quickly.

Optimizing the entire site’s architecture can also help to make the most of the crawl budget. A flat but wide architecture, where the most important pages are only a few clicks from the homepage, makes it easier for Googlebot to reach those assets within the available crawl budget.
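
One way to see how flat your architecture really is: model internal links as a graph and compute the click depth of every page from the homepage with a breadth-first search. The sketch below uses a small, made-up link graph purely for illustration:

    from collections import deque

    # Hypothetical internal-link graph: page -> pages it links to
    links = {
        "/": ["/category/rings", "/category/necklaces", "/blog"],
        "/category/rings": ["/product/gold-ring", "/product/silver-ring"],
        "/category/necklaces": ["/product/pearl-necklace"],
        "/blog": ["/blog/ring-buying-guide"],
        "/blog/ring-buying-guide": ["/product/gold-ring"],
    }

    def click_depth(graph: dict, start: str = "/") -> dict:
        """Breadth-first search from the homepage; returns clicks needed to reach each page."""
        depth = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for target in graph.get(page, []):
                if target not in depth:
                    depth[target] = depth[page] + 1
                    queue.append(target)
        return depth

    for page, clicks in sorted(click_depth(links).items(), key=lambda item: item[1]):
        print(f"{clicks} click(s): {page}")

Pages that turn out to be many clicks deep, or unreachable from the homepage entirely, are good candidates for additional internal links.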

 

Strategies to Monitor Your Crawl Budget

 

Google Search Console

Google Search Console provides a wide range of information about how your website is indexed and its performance in search results. It also offers insights into your website's crawl budget. In the Legacy tools section, you can access the Crawl Stats report, which displays Googlebot's activity on your site over the last 90 days.

Suppose the report shows that Google crawls an average of 48 pages per day on your site. To calculate the average monthly crawl budget for the site, you can use the following formula:

average daily pages * 30 days = crawl budget

This is how the computation would look in this scenario:

48 pages per day * 30 days = 1440 pages per month.

This is a rough estimate, but it can provide some insight into your available crawl budget. 

It's important to note that optimizing the crawl budget using the tips mentioned above should increase this number over time.

Additionally, the Coverage report in GSC will indicate how many pages Google has indexed on the site and how many it has excluded from indexation. To find out which pages Googlebot overlooked, you can compare that figure with the total number of content assets on the site.
 

Server Log File Analysis

The server log file is undoubtedly one of the most important sources of information regarding a site's crawl budget. This is due to the fact that the server log file provides precise information about when search engine bots are accessing your site. Furthermore, the file discloses which pages they visit most frequently and the size of the files that are crawled.
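
If you don't use a dedicated tool, even a short script can surface the basics. The sketch below assumes a combined-format access log at a placeholder path and counts which URLs Googlebot requests most often (it matches on the user-agent string only and does not verify the bot via reverse DNS):

    import re
    from collections import Counter

    # Combined log format: IP - - [date] "METHOD /path HTTP/1.1" status size "referer" "user-agent"
    LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$')

    googlebot_hits = Counter()
    statuses = Counter()

    # Placeholder log path; adjust to your server's log location and format
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LINE.search(line)
            if match and "Googlebot" in match.group("agent"):
                googlebot_hits[match.group("path")] += 1
                statuses[match.group("status")] += 1

    print("Most-crawled URLs:", googlebot_hits.most_common(10))
    print("Status codes served to Googlebot:", dict(statuses))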

With Bot Clarity, you can:

  • Understand the most important pages on your site for search engine crawling.

  • Optimize crawl budgets to ensure that bots crawl and index as many important pages on your site as possible.

  • Identify broken links and errors encountered by search engine bots while crawling your site.

  • Audit your redirects.

  • Connect bot activity to performance, indicating which areas of the site you should focus your efforts on.

Managing your website's crawl budget is a crucial aspect of technical SEO that can significantly impact your site's performance in search engine rankings. By understanding and optimizing the factors that affect your crawl budget, such as website speed, content freshness, and internal linking structure, you can ensure that search engine bots efficiently index your most important content. Implementing strategies like fixing broken links, optimizing the sitemap, and using the robots.txt file to block low-value URLs can help maximize your crawl budget.

Additionally, regular monitoring using tools like Google Search Console and server log file analysis provides valuable insights into bot activity and helps identify areas for improvement. Ultimately, a well-managed crawl budget ensures that search engines discover and prioritize the most valuable pages on your site, leading to better visibility and higher rankings in search results. By prioritizing these technical SEO practices, you can enhance your site's crawl efficiency and overall SEO performance, driving more organic traffic and achieving your digital marketing goals.

FAQs

Do URLs that I block using robots.txt have an impact on my crawl budget?

No. URLs disallowed in robots.txt do not affect the crawl budget.

Does the meta noindex tag "save" crawl budget?

No, but it can reduce the number of crawlable URLs over time. Googlebot still has to crawl a page to see the noindex tag.

Does crawl budget get impacted by the nofollow directive?

It depends. Any URL that is crawled affects the crawl budget, so even if you mark a link as nofollow, the URL can still be crawled, and spend budget, if another page on your website, or any other page on the web, links to it without nofollow.

Can I use the "crawl-delay" directive to regulate Googlebot?

No. Googlebot ignores the "crawl-delay" directive; use the crawl rate settings in Google Search Console instead.

Do embedded content and alternate URLs count toward my crawl budget?

Yes. Alternate URLs (such as AMP or hreflang versions) and embedded content (such as CSS and JavaScript files) are crawled by Googlebot and consume the site's crawl budget.
