Crawl Baseline Generation for Technical Site Migrations

Context: This playbook establishes the foundational crawl architecture required before executing a domain cutover or platform migration. Webmasters, SEO engineers, site architects, and technical project managers must capture a complete, unaltered snapshot of the production environment. Baseline generation prevents indexation loss, preserves link equity, and anchors enterprise risk matrices. Align your technical crawl scope with Pre-Migration Auditing & Risk Assessment frameworks before initiating extraction.

Pre-flight Checks

Establish crawl parameters, authentication handling, and URL discovery limits before baseline extraction. Mitigate common crawl traps and server overload risks.

Configure user-agent strings to match target search engine bots. Set crawl depth to unlimited.
Exclude staging subdomains via server-level robots.txt directives to prevent index contamination.
Define thread concurrency limits (5-10) and enforce a 1-second request delay to prevent origin server overload.
Disable faceted navigation parameters, calendar queries, and session IDs in the crawler configuration to avoid infinite loops.
Verify authentication credentials for gated conversion paths. Ensure crawler access bypasses member-only walls.
Validate headless browser execution settings. Omitting JS rendering guarantees missing DOM links and metadata.

Execution Steps

Capture complete URL inventory, HTTP status codes, response times, and metadata for migration mapping. Correlate technical data with business metrics to prioritize risk tiers.

Execute headless browser rendering for JS-heavy frameworks. Enforce a 3-5 second wait time to capture hydrated DOM links.
Export CSV/JSON datasets containing canonical tags, redirect chains, and meta directives. Consult How to Export Full Crawl Data Before Migration for schema-compliant field mapping and dataset validation.
Join crawl datasets with GA4/GTM event tracking and GSC index coverage reports. Apply Traffic & Conversion Mapping to flag revenue-critical URLs and high-conversion funnels.
Segment baseline inventory by content architecture (product, blog, utility, legal) for targeted redirect planning.
Calculate indexation velocity and canonical consistency scores. Pre-identify duplicate content risks before cutover.
Cross-reference export integrity against raw server access logs. Identify orphaned or unindexed assets immediately.

Configs/Commands

Deploy these CLI parameters and validation loops to standardize baseline extraction across environments.

# ScreamingFrog CLI
java -jar screamingfrogseo.jar --crawl https://production-origin.com --export-csv baseline.csv --max-depth 0 --render-js --user-agent 'Googlebot/2.1'

# Sitebulb CLI
sitebulb.exe --project baseline --url https://production-origin.com --export baseline.json --include-redirect-chains --ignore-robots --headless

# cURL Validation Loop
curl -sI -o /dev/null -w '%{http_code} %{time_total} %{url_effective}' -K url_list.txt

# DNS TTL Pre-Change
dig production-origin.com SOA | grep TTL
# Set authoritative TTL to 300s exactly 48 hours pre-cutover

// WordPress/Headless CMS Cache Flush
define('WP_CACHE', false);
// Flush CDN edge via API
curl -X POST https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache \
 -H 'Authorization: Bearer {token}' \
 -H 'Content-Type: application/json' \
 -d '{"purge_everything":true}'

Validation

Communicate baseline findings, establish rollback triggers, and execute final QA before domain cutover.

Run automated diff checks between baseline and staging environments. Verify structural parity.
Compare baseline URL count against sitemap.xml line count using wc -l.
Verify absence of X-Robots-Tag: noindex on production paths via grep -i 'noindex' headers.txt.
Validate DNS propagation status and CDN cache purge schedules prior to final production switch.
Distribute crawl discrepancy reports to engineering, content, and marketing teams. Implement Stakeholder Communication Plans for phased rollout sign-offs and incident escalation paths.

Rollback Triggers

Halt the cutover and revert DNS immediately if any of the following conditions are detected during pre-launch validation:

Baseline URL count deviates >5% from historical sitemap or server log totals.
Critical 301/302 redirect chains are broken or missing, risking broken link equity transfer.
Canonical discrepancies exceed acceptable thresholds, triggering duplicate content mapping and indexation dilution.
Authentication walls unexpectedly block crawler access to revenue-generating paths.
DNS propagation fails or CDN cache purge schedules remain unverified.

FAQ

How do I handle JavaScript-rendered URLs during baseline generation? Deploy a headless browser crawler (Puppeteer, Sitebulb, or Screaming Frog with JS rendering enabled) and enforce a 3-5 second wait time to ensure complete DOM hydration before link extraction and metadata parsing.

What is the optimal crawl depth and thread count for enterprise sites? Set depth to unlimited (--max-depth 0) and configure 5-10 concurrent threads with a 1-second request delay. This prevents origin server overload while ensuring deep archive structures and paginated series are fully captured.

How do I validate baseline accuracy against server logs? Cross-reference the exported crawl CSV with 30-day Apache/Nginx access logs using awk or Python pandas. Filter for 200/301 status codes, identify orphaned URLs, and flag high-traffic paths missing from the baseline inventory.

When should DNS TTL be lowered relative to crawl baseline completion? Lower the authoritative DNS TTL to 300 seconds exactly 48 hours before cutover. This must occur after baseline validation confirms all redirect mappings, canonical tags, and CMS configurations are production-ready.