Robots.txt and XML Sitemaps: Controlling Crawl and Index
Before a search system evaluates your content, it reads two files: robots.txt and your XML sitemap. These files function as your crawl directive layer — they tell search systems what to access, what to skip, and what matters most. Google's crawl scheduling patent (US Patent 7,593,932) describes how these directives influence crawl priority allocation.
Robots.txt: The Access Control Layer
What Robots.txt Actually Controls
Robots.txt controls crawl access, not indexation. A disallowed URL can still appear in search results if other pages link to it. To prevent indexation, you need a noindex meta tag or an X-Robots-Tag HTTP header — and the page must remain crawlable, because a crawler can only honor a directive it is allowed to read.
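For instance, a page you want excluded from the index carries the directive in its own markup (the page here is hypothetical):

```html
<!-- In the <head> of the page to be excluded from the index -->
<!-- Do NOT also disallow this page in robots.txt, or the tag is never read -->
<meta name="robots" content="noindex">
```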
The Optimal Robots.txt Structure
A well-structured robots.txt for most sites:
- Allow all content pages, blog posts, product pages, category pages
- Disallow admin pages, search results pages, filter/sort parameter URLs, staging environments, API endpoints
- Reference your XML sitemap(s)
- Specify crawl-delay only if your server cannot handle frequent crawling (note that Googlebot ignores this directive; some other crawlers, such as Bingbot, do honor it)
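Put together, a minimal robots.txt following this structure might look like the sketch below. The disallowed paths are placeholders — substitute the admin, search, parameter, and API paths your own site actually uses:

```
# Allow everything by default; block only non-content paths
User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /*?sort=
Disallow: /api/

# Reference the sitemap (absolute URL required)
Sitemap: https://yourdomain.com/sitemap.xml
```

Note there is no blanket Allow rule: anything not matched by a Disallow line is crawlable by default.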
Common Robots.txt Errors We Fix
1. Blocking CSS/JS files — Search systems need these to render your pages. Blocking them means your content is evaluated without styling or functionality context.
2. Blocking image directories — Your images cannot appear in image search if the directory is disallowed.
3. Overly broad disallow rules — A Disallow: / rule on staging that accidentally deploys to production blocks your entire site.
4. Missing sitemap directive — Every robots.txt should include a line such as Sitemap: https://yourdomain.com/sitemap.xml.
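Errors like an over-broad asset block can be caught before deployment with Python's standard-library robots.txt parser. A small sketch, using a hypothetical robots.txt that disallows an assets directory:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with an overly broad CSS/JS block
rules = """
User-agent: *
Disallow: /assets/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Render-critical resources are blocked for Googlebot...
print(parser.can_fetch("Googlebot", "https://example.com/assets/site.css"))  # False
# ...while ordinary content pages remain crawlable
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

Running the same check against a list of your CSS, JS, and image URLs flags any that a crawler cannot fetch.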
XML Sitemap: The Priority Signal
Beyond the Basics
Your XML sitemap is not just a list of URLs. It is a prioritization hint that tells search systems which pages you consider most important and how frequently they change — though crawlers treat these values as suggestions, not commands.
Sitemap Best Practices
1. Only include indexable pages — Every URL in your sitemap should return 200, have a self-referencing canonical, and carry no noindex tag.
2. Use lastmod accurately — Only update lastmod when the content meaningfully changes. Search systems track lastmod reliability and ignore it on sites that bump it on every crawl.
3. Treat priority as advisory — Google has said it ignores the priority field, but for crawlers that read it, a consistent scheme works: homepage 1.0, category pages 0.8, key content pages 0.7, supporting pages 0.5.
4. Split large sitemaps — Over 10,000 URLs? Use a sitemap index file that references multiple smaller sitemaps organized by content type (the protocol itself caps each file at 50,000 URLs or 50 MB uncompressed).
5. Include image sitemaps — If images are important to your visibility, add image sitemap tags within your main sitemap.
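A single sitemap entry combining these practices might look like the following sketch (the URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://yourdomain.com/guides/technical-seo</loc>
    <!-- lastmod reflects the last meaningful content change, not the last crawl -->
    <lastmod>2024-05-01</lastmod>
    <priority>0.7</priority>
    <!-- image extension entry for image search visibility -->
    <image:image>
      <image:loc>https://yourdomain.com/images/crawl-diagram.png</image:loc>
    </image:image>
  </url>
</urlset>
```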
Sitemap Errors That Waste Crawl Budget
- URLs returning 301, 302, 404, or 410
- URLs with noindex tags
- URLs that canonical to a different page
- Parameter-variant URLs that duplicate content
- Non-indexable file types (PDFs, unless intentionally indexed)
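The exclusion rules above collapse into a single predicate. This is an illustrative sketch, not a library API — the function name and parameters are assumptions:

```python
def belongs_in_sitemap(status_code: int, has_noindex: bool,
                       url: str, canonical_url: str) -> bool:
    """Return True only for URLs worth a crawler's time: 200 status,
    no noindex directive, and a self-referencing canonical."""
    if status_code != 200:       # excludes 301, 302, 404, 410
        return False
    if has_noindex:              # excludes noindex'd pages
        return False
    if canonical_url != url:     # excludes URLs canonicalized elsewhere
        return False
    return True

print(belongs_in_sitemap(200, False, "https://example.com/a", "https://example.com/a"))  # True
print(belongs_in_sitemap(301, False, "https://example.com/b", "https://example.com/b"))  # False
```

Running every sitemap entry through a check like this, with live status codes and extracted canonicals, surfaces exactly the entries wasting crawl budget.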
How We Audit and Implement
Our Technical Health dimension includes a complete crawl directive audit:
1. Crawl every URL in your sitemap — verify status codes, canonical consistency, and index directives
2. Compare sitemap to actual site structure — identify pages that should be in the sitemap but are not
3. Audit robots.txt — ensure no critical resources are blocked
4. Monitor crawl stats — use Search Console data to verify that search systems are spending crawl budget on your priority pages
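The first audit step — walking every sitemap URL — begins by extracting the loc entries. A minimal standard-library sketch; the sitemap content is inlined here, whereas in practice you would fetch it over HTTP:

```python
import xml.etree.ElementTree as ET

# Sitemaps declare this XML namespace; ElementTree needs it for lookups
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

root = ET.fromstring(sitemap_xml)
urls = [u.findtext("sm:loc", namespaces=NS) for u in root.findall("sm:url", NS)]
print(urls)  # ['https://example.com/', 'https://example.com/pricing']
```

Each extracted URL can then be requested to record its status code, canonical tag, and index directives for the audit.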
We implement fixes directly — updating your sitemap, correcting robots.txt directives, and ensuring your crawl directive layer aligns with your content priorities.