robots.txt Best Practices
Master crawl control to optimize search engine access and protect your crawl budget
What is robots.txt?
A simple text file that controls how search engine crawlers access your website
robots.txt is a text file placed in the root directory of your website (e.g., example.com/robots.txt) that tells search engine crawlers which pages or sections of your site they should and shouldn't access. It's part of the Robots Exclusion Protocol (REP), a standard followed by all major search engines.
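A minimal robots.txt might look like the sketch below; the domain, paths, and sitemap URL are placeholders, not recommendations for any specific site.

```
# https://example.com/robots.txt
User-agent: *        # this group applies to every crawler
Disallow: /admin/    # don't crawl anything under /admin/
Allow: /             # everything else may be crawled (this is also the default)

Sitemap: https://example.com/sitemap.xml
```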
What robots.txt Does
- Controls crawler access to specific URLs or patterns
- Prevents crawling of low-value or duplicate pages
- Protects server resources from excessive crawling
- References your XML sitemap location
What robots.txt Does NOT Do
- Does NOT prevent URLs from being indexed
- Does NOT provide security (use authentication instead)
- Does NOT guarantee compliance (crawlers can ignore it)
- Does NOT remove already indexed pages
Use noindex meta tags to prevent indexing.
Why robots.txt Matters for SEO
Proper robots.txt configuration is crucial for crawl budget optimization and site performance
1. Crawl Budget Optimization
Search engines allocate a limited "crawl budget" to each site—the number of pages they'll crawl in a given timeframe. Wasting it on low-value pages means important content gets crawled less frequently.
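For illustration, an online store might keep crawlers focused on product and category pages by blocking utility URLs; the paths below are hypothetical.

```
User-agent: *
Disallow: /search/      # internal site-search result pages
Disallow: /cart/        # cart and checkout steps add nothing to search
Disallow: /checkout/
Disallow: /print/       # printer-friendly duplicates of existing pages
```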
2. Preventing Duplicate Content Discovery
Many sites generate near-infinite URL variations through filters, sorting, and parameters. robots.txt keeps crawlers from wasting crawl budget fetching these duplicates in the first place.
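Google and Bing support * wildcards and the $ end-of-URL anchor, which makes it practical to block whole families of parameterized URLs; the parameter names here are hypothetical.

```
User-agent: *
Disallow: /*?sort=        # any URL whose query string begins with sort=
Disallow: /*&sort=        # ...or that carries sort= after another parameter
Disallow: /*?filter=      # faceted-navigation variations
Disallow: /*sessionid=    # session-ID duplicates
```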
3. Reducing Server Load
Aggressive crawlers can overload servers, especially for database-heavy pages like search results or complex filters. robots.txt provides a first line of defense.
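Crawl-delay is one option for non-Google bots: Bingbot and YandexBot honor it, while Googlebot ignores it. A sketch with an arbitrary delay value:

```
User-agent: Bingbot
Crawl-delay: 10      # ask Bingbot to pause roughly 10 seconds between requests

User-agent: *
Disallow: /internal-search/
```

Googlebot, which ignores Crawl-delay, adjusts its crawl rate based on how quickly your server responds rather than on robots.txt.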
How Search Engines Read robots.txt
Understanding the crawler perspective
First Thing Checked
Before crawling any page on your site, search engines request example.com/robots.txt. If it doesn't exist (404), they assume everything is allowed.
Parse Rules by User-Agent
Each crawler (Googlebot, Bingbot, etc.) looks for rules specific to its user-agent, falling back to User-agent: * (all bots) if no specific rules exist.
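For instance, given the groups below, Googlebot obeys only the group addressed to it and ignores the wildcard group; rules are not merged across groups. Paths are illustrative.

```
User-agent: Googlebot
Disallow: /beta/          # Googlebot reads only this group

User-agent: *
Disallow: /beta/
Disallow: /staging/       # other bots are blocked from /staging/; Googlebot is not
```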
Match Most Specific Rule
When multiple rules could apply, the longest matching pattern wins: Disallow: /admin/ beats Allow: / for /admin/settings.
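A sketch of that precedence with illustrative paths; the comments trace which rule wins for two URLs.

```
User-agent: *
Allow: /
Disallow: /admin/

# /admin/settings -> blocked: "Disallow: /admin/" (7 characters) beats "Allow: /" (1 character)
# /about/         -> allowed: only "Allow: /" matches
```

Google documents that when matching Allow and Disallow rules tie on length, the less restrictive (Allow) rule wins.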
Cached for Hours
robots.txt is cached by search engines (typically up to 24 hours), so changes won't take effect immediately. Validate changes with Google Search Console's robots.txt report before relying on them.
Crawlers that respect robots.txt:
- Googlebot: respects all standard directives
- Bingbot: respects all directives, plus Crawl-delay
- YandexBot: respects all directives, plus Crawl-delay
- Applebot: respects the basic directives

Crawlers and bots that don't:
- Scrapers often ignore robots.txt
- Spambots don't respect it
- Security: robots.txt is NOT protection
- Use server-side blocking for bad actors
Common Misconceptions That Hurt SEO
MYTH: "Blocking in robots.txt prevents indexing"
Reality: Blocked URLs can still appear in search results if other sites link to them (shown as "A description for this result is not available"). Use noindex meta tags instead.
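A page you want excluded from results has to carry the signal itself, not in robots.txt; a minimal sketch:

```
<!-- in the page's <head>; compliant crawlers will drop the page from their index -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the X-Robots-Tag: noindex HTTP response header does the same job. Either way, the URL must remain crawlable: if robots.txt blocks it, crawlers never see the noindex directive.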
MYTH: "Block CSS/JS to speed up crawling"
Reality: Google needs CSS/JS to render pages properly. Blocking them can hurt rankings for JavaScript-heavy sites. Only block if absolutely necessary.
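If render-critical assets sit under a path that is already blocked, a narrower Allow can carve them out instead of unblocking everything; the paths below are hypothetical.

```
User-agent: *
Disallow: /includes/         # blocked in general
Allow: /includes/*.css$      # the longer Allow rules win, so stylesheets...
Allow: /includes/*.js$       # ...and scripts stay fetchable for rendering
```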
MYTH: "robots.txt provides security"
Reality: robots.txt is publicly accessible. It actually reveals URLs you're trying to hide! Use proper authentication and server-side access controls for sensitive pages.