robots.txt Best Practices

Master crawl control to optimize search engine access and protect your crawl budget

What is robots.txt?

A simple text file that controls how search engine crawlers access your website

robots.txt is a text file placed in the root directory of your website (e.g., example.com/robots.txt) that tells search engine crawlers which pages or sections of your site they should and shouldn't access. It's part of the Robots Exclusion Protocol (REP), a standard followed by all major search engines.

What robots.txt Does

  • Controls crawler access to specific URLs or patterns
  • Prevents crawling of low-value or duplicate pages
  • Protects server resources from excessive crawling
  • References your XML sitemap location
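
A minimal robots.txt showing these directives might look like this (the blocked paths and the sitemap URL are placeholders, not recommendations for any particular site):

  # Applies to all crawlers
  User-agent: *
  # Keep crawlers out of low-value areas
  Disallow: /admin/
  Disallow: /cart
  # Everything else stays crawlable (this is also the default)
  Allow: /

  # Point crawlers at the XML sitemap
  Sitemap: https://example.com/sitemap.xml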

What robots.txt Does NOT Do

  • Does NOT prevent URLs from being indexed
  • Does NOT provide security (use authentication instead)
  • Does NOT guarantee compliance (crawlers can ignore it)
  • Does NOT remove already indexed pages

Why robots.txt Matters for SEO

Proper robots.txt configuration is crucial for crawl budget optimization and site performance

1. Crawl Budget Optimization

Search engines allocate a limited "crawl budget" to each site—the number of pages they'll crawl in a given timeframe. Wasting it on low-value pages means important content gets crawled less frequently.

Example: a site has 100,000 paginated URLs (page=2, page=3, ...) and 5,000 product pages. Without robots.txt rules blocking pagination, Google may spend 95% of its crawl budget on duplicates instead of new products.
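
If pagination like this were blocked, the rules could look roughly like the following, assuming the parameter is literally page= and the crawler supports * wildcards (Googlebot and Bingbot do):

  User-agent: *
  # Block URLs where page= appears as the first or a later query parameter
  Disallow: /*?page=
  Disallow: /*&page=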

2. Preventing Duplicate Content Discovery

Many sites generate infinite URL variations through filters, sorting, and parameters. robots.txt prevents crawlers from discovering these duplicates in the first place.

Common culprits: ?sort=price, ?filter=color-red, ?page=47, /search?q=shoes, /cart, /checkout, /thank-you
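
A sketch of rules targeting these patterns (the parameter names and paths mirror the examples above; adjust them to how your site actually builds URLs):

  User-agent: *
  # Parameter-driven duplicates (wildcard matching)
  Disallow: /*?sort=
  Disallow: /*?filter=
  # Internal search and transactional pages
  Disallow: /search
  Disallow: /cart
  Disallow: /checkout
  Disallow: /thank-you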

3. Reducing Server Load

Aggressive crawlers can overload servers, especially for database-heavy pages like search results or complex filters. robots.txt provides a first line of defense.

High-risk pages: Internal search, faceted navigation, user-generated content pages, calendar views, archive pages
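
As a first line of defense for these page types, a sketch might look like this. Crawl-delay is ignored by Googlebot but honored by some crawlers such as Bingbot, and a crawler obeys only the single most specific group that matches it, so shared rules must be repeated in each group:

  # Default group for all crawlers
  User-agent: *
  Disallow: /search
  Disallow: /calendar/

  # Bingbot honors Crawl-delay (seconds between requests); repeat shared rules here
  User-agent: Bingbot
  Crawl-delay: 10
  Disallow: /search
  Disallow: /calendar/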

How Search Engines Read robots.txt

Understanding the crawler perspective

1. First Thing Checked

Before crawling any page on your site, search engines request example.com/robots.txt. If it doesn't exist (404), they assume everything is allowed.

2. Parse Rules by User-Agent

Each crawler (Googlebot, Bingbot, etc.) looks for rules specific to its user-agent, falling back to User-agent: * (all bots) if no specific rules exist.
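
For example, with the groups below, Googlebot follows only its own group and ignores the * group, while every other crawler falls back to *:

  # Googlebot reads only this group
  User-agent: Googlebot
  Disallow: /search

  # All other crawlers fall back to this group
  User-agent: *
  Disallow: /search
  Disallow: /archive/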

3. Match Most Specific Rule

When multiple rules could apply, the longest matching pattern wins: Disallow: /admin/ beats Allow: / for /admin/settings.
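
Sketching that example:

  User-agent: *
  Allow: /
  Disallow: /admin/
  # For /admin/settings, the pattern /admin/ (7 characters) is longer than / (1 character),
  # so the Disallow rule wins and the URL is blocked.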

4. Cached for Hours

robots.txt is cached by search engines (typically for up to 24 hours), so changes won't take effect immediately. Check the file with Google Search Console's robots.txt report before relying on new rules.

Major Crawlers

  • Googlebot: Respects all directives
  • Bingbot: Respects all + crawl-delay
  • Yandexbot: Respects all + crawl-delay
  • Applebot: Respects basic directives

Bad Bots

  • Scrapers often ignore robots.txt
  • Spambots don't respect it
  • Security: robots.txt is NOT protection
  • Use server-side blocking for bad actors

Common Misconceptions That Hurt SEO

MYTH: "Blocking in robots.txt prevents indexing"

Reality: Blocked URLs can still appear in search results if other sites link to them (shown as "A description for this result is not available"). Use a noindex meta tag instead, and leave the page crawlable in robots.txt, because a crawler that is blocked never sees the tag.

MYTH: "Block CSS/JS to speed up crawling"

Reality: Google needs CSS/JS to render pages properly. Blocking them can hurt rankings for JavaScript-heavy sites. Only block if absolutely necessary.
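
The problematic rules usually look something like the following (paths are illustrative); if similar lines exist in your robots.txt, removing them is generally the safer default:

  # Anti-pattern: blocking rendering resources
  User-agent: *
  Disallow: /assets/
  Disallow: /*.js$
  Disallow: /*.css$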

MYTH: "robots.txt provides security"

Reality: robots.txt is publicly accessible. It actually reveals URLs you're trying to hide! Use proper authentication and server-side access controls for sensitive pages.
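
For instance, a line like this (the path is made up for illustration) does nothing except announce the location to anyone who downloads the file:

  User-agent: *
  Disallow: /private-admin/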