Crawl Baseline Generation
Context
A crawl baseline is the frozen, unaltered picture of the legacy estate that every post-launch validation is measured against. Webmasters, SEO engineers, and technical project managers capture it in the diagnostic window — after content freeze, before any DNS, CMS, or redirect change — so that “did the migration break anything?” becomes a diff against a known dataset rather than a guess. Without a baseline you cannot prove indexation parity, cannot find orphaned-but-trafficked URLs, and cannot tier redirects by value. This work sits at the front of the Pre-Migration Auditing & Risk Assessment sequence and feeds both the risk matrix and the redirect source map.
The baseline must be complete in two dimensions: every URL a crawler can reach by following links, and every URL real traffic actually requested. Link-following alone misses orphaned pages that still earn sessions; logs alone miss freshly published pages with no inbound traffic yet. You need both, joined.
Pre-flight Checks
Set crawl parameters, authentication, and exclusions before extraction so the baseline is complete and the origin survives the load.
- Set the crawler user-agent to match the target bot (Googlebot Smartphone) and crawl depth to unlimited.
- Enable JavaScript rendering with a 3–5 s render wait; omitting it loses DOM links and metadata on client-rendered frameworks.
- Cap concurrency at 5–10 threads with a 1 s delay to avoid origin overload; crawl the origin IP, not the CDN edge, to bypass cached responses.
- Exclude faceted-navigation parameters, calendar queries, and session IDs to avoid infinite crawl traps.
- Keep staging out of the index with X-Robots-Tag: noindex response headers (more reliable than robots.txt during DB syncs).
- Supply authentication for gated conversion paths so member-only revenue funnels are captured.
- Gather 30–90 days of raw access logs and the live sitemap.xml to merge with the link crawl.
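The exclusion rule for faceted parameters, calendar queries, and session IDs can be sketched as a simple URL predicate. The parameter names and the function name `is_crawl_trap` below are hypothetical placeholders; substitute whatever the legacy CMS actually emits.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical parameter names; replace with the ones the legacy CMS actually emits.
TRAP_PARAMS = {"color", "size", "sort", "sessionid", "sid", "phpsessid", "month", "year"}

def is_crawl_trap(url: str) -> bool:
    """Return True when any query parameter name is on the exclusion list."""
    query = urlsplit(url).query
    names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
    return bool(names & TRAP_PARAMS)

urls = [
    "https://production-origin.com/shoes",
    "https://production-origin.com/shoes?color=red&size=9",
    "https://production-origin.com/events?month=2031-01",
    "https://production-origin.com/cart?PHPSESSID=abc123",
    "https://production-origin.com/search?q=boots",
]
# Only URLs without a trap parameter stay in crawl scope.
kept = [u for u in urls if not is_crawl_trap(u)]
```

Matching on parameter names rather than full URL patterns keeps the rule short, but it will also drop legitimate pages that reuse one of those names, so review the excluded set before freezing the baseline.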
Execution Steps
1. Run the Render-Enabled Crawl
Execute a full crawl with JavaScript rendering against the legacy origin to capture the link-reachable URL set. Set unlimited depth and a 3–5 s render wait so hydrated DOM links and client-side metadata land in the dataset. Crawl the origin directly to avoid edge-cached responses masking real status codes.
2. Merge Logs and Sitemap for Completeness
Extend the link crawl with URLs that real traffic requested but no internal link exposes. Parse 30–90 days of access logs and the live sitemap, then union them with the crawl output and deduplicate. Anything present in logs but absent from the crawl is an orphaned candidate that needs an explicit redirect decision.
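As a rough illustration of the union-and-dedupe step, the sketch below assumes access logs in Combined Log Format and uses the origin host from the commands in this guide. The function names and the sample log lines are made up for the example.

```python
import re

ORIGIN = "https://production-origin.com"  # host used in this guide's commands

# Request field in Combined Log Format, e.g. "GET /path HTTP/1.1" (assumed log format)
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def log_urls(lines):
    """Extract requested paths from access-log lines and make them absolute."""
    urls = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            urls.add(ORIGIN + match.group(1))
    return urls

def build_inventory(crawl, logs, sitemap):
    """Union the three sources; return (inventory, orphan candidates)."""
    inventory = set(crawl) | set(logs) | set(sitemap)
    orphans = set(logs) - set(crawl)  # requested by traffic, never reached by links
    return sorted(inventory), sorted(orphans)

sample_log = [
    '203.0.113.7 - - [10/Jan/2031:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Jan/2031:10:00:05 +0000] "GET /old-landing HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Jan/2031:10:00:09 +0000] "POST /api/login HTTP/1.1" 200 64 "-" "Mozilla/5.0"',
]
crawl = {ORIGIN + "/", ORIGIN + "/pricing"}
sitemap = {ORIGIN + "/", ORIGIN + "/pricing", ORIGIN + "/new-post"}

inventory, orphans = build_inventory(crawl, log_urls(sample_log), sitemap)
```

Here `/old-landing` surfaces as the orphan candidate: it was requested in the logs but no crawl path reached it. POST requests are skipped on the assumption that only retrievable pages belong in the inventory.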
3. Export the Full Dataset
Serialise the inventory with every field downstream work needs: status code, canonical, redirect target, meta robots, and indexability. Follow How to Export Full Crawl Data Before Migration for schema-compliant field mapping so the export drops straight into redirect tooling.
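A minimal sketch of the export, with one row per URL carrying the fields listed above. The column names and the indexability rule are assumptions made for illustration, not the schema from the export guide.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class BaselineRow:
    url: str
    status_code: int
    canonical: str
    redirect_target: str
    meta_robots: str
    indexable: bool

def make_row(url, status_code, canonical="", redirect_target="", meta_robots=""):
    """Build a row and derive indexability with one simple, assumed rule:
    200 status, no noindex directive, and a self-referencing or absent canonical."""
    indexable = (
        status_code == 200
        and "noindex" not in meta_robots.lower()
        and canonical in ("", url)
    )
    return BaselineRow(url, status_code, canonical, redirect_target, meta_robots, indexable)

def write_baseline(rows, handle):
    """Write rows as CSV with a header taken from the dataclass fields."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(BaselineRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))

rows = [
    make_row("https://production-origin.com/pricing", 200,
             canonical="https://production-origin.com/pricing", meta_robots="index,follow"),
    make_row("https://production-origin.com/old-landing", 301,
             redirect_target="https://production-origin.com/pricing"),
    make_row("https://production-origin.com/thanks", 200, meta_robots="noindex,follow"),
]
buffer = io.StringIO()  # stands in for an open file handle
write_baseline(rows, buffer)
export_lines = buffer.getvalue().splitlines()
```

Deriving the header from the dataclass keeps the column order and the row type in step, so adding a field later cannot silently desynchronise the export.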
4. Join to Analytics and Tier by Value
Attach business value to each URL so later prioritisation is objective. Join the inventory to GA4 sessions and conversions and to Search Console clicks, then segment by content type. Apply Traffic & Conversion Mapping to flag revenue-critical paths that must migrate as single-hop 301s.
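One way to make the tiering concrete is to rank URLs by revenue and cut the ranking into quintiles. The sketch below assumes a per-URL revenue figure already exported from GA4; the figures, paths, and the name `tier_by_value` are illustrative only.

```python
def tier_by_value(revenue_by_url):
    """Rank URLs by revenue and assign quintile tiers (1 = top 20%)."""
    ranked = sorted(revenue_by_url.items(), key=lambda item: item[1], reverse=True)
    total = len(ranked)
    tiers = {}
    for position, (url, _) in enumerate(ranked):
        tiers[url] = position * 5 // total + 1
    return tiers

# Paths shown for brevity; real inventories hold absolute URLs.
inventory = ["/", "/pricing", "/blog/a", "/old-landing", "/about"]
ga4_revenue = {"/pricing": 9000.0, "/": 4000.0, "/old-landing": 1500.0, "/blog/a": 200.0}

# Left join: every inventory URL keeps a row, with zero revenue when analytics has none.
joined = {url: ga4_revenue.get(url, 0.0) for url in inventory}
tiers = tier_by_value(joined)
revenue_critical = [url for url, tier in tiers.items() if tier == 1]
```

The left join matters: URLs without analytics data stay in the inventory at the lowest tier instead of disappearing from the redirect map.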
5. Validate Integrity Against Logs
Confirm the baseline is trustworthy before it becomes the reference for every gate. Cross-check the export against raw logs for coverage gaps, recompute canonical-consistency and indexation scores, and circulate the discrepancy report through Stakeholder Communication Plans for sign-off.
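The integrity check can be sketched as two small functions. The definitions are assumptions: a coverage gap is a URL the logs served with 200 or 301 that the export lacks, and a canonical discrepancy is a 200 page whose canonical points somewhere other than itself. Adjust both to whatever scoring the project has agreed on.

```python
def coverage_gaps(export_urls, log_hits):
    """URLs the logs served with 200 or 301 that the export does not contain."""
    served = {url for url, status in log_hits if status in (200, 301)}
    return sorted(served - set(export_urls))

def canonical_discrepancy_rate(rows):
    """Share of 200 pages whose canonical is set and differs from the page URL."""
    pages = [r for r in rows if r["status"] == 200]
    if not pages:
        return 0.0
    mismatched = sum(1 for r in pages if r["canonical"] not in ("", r["url"]))
    return mismatched / len(pages)

# Illustrative data only.
export_rows = [
    {"url": "/", "status": 200, "canonical": "/"},
    {"url": "/pricing", "status": 200, "canonical": "/pricing"},
    {"url": "/pricing?ref=nav", "status": 200, "canonical": "/pricing"},
    {"url": "/about", "status": 200, "canonical": ""},
    {"url": "/old", "status": 301, "canonical": ""},
]
log_hits = [("/", 200), ("/pricing", 200), ("/legacy-offer", 200), ("/missing", 404)]

gaps = coverage_gaps([r["url"] for r in export_rows], log_hits)
rate = canonical_discrepancy_rate(export_rows)
```

In this sample `/legacy-offer` is the only coverage gap, and one of four 200 pages carries a non-self canonical.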
Configs / Commands
# Screaming Frog CLI: unlimited-depth crawl with JS rendering, full export
# Requires the Screaming Frog SEO Spider (headless mode on Linux)
screamingfrogseospider \
  --crawl https://production-origin.com \
  --headless --save-crawl \
  --export-tabs "Internal:All,Response Codes:All,Canonicals:All" \
  --output-folder /tmp/baseline/ \
  --config crawl.seospiderconfig  # rendering = JavaScript, AJAX timeout 5 s
# Merge crawl URLs with log + sitemap URLs, then dedupe to one inventory
# Logs hold paths only, so prefix the origin to match the absolute crawl and sitemap URLs
awk '{print "https://production-origin.com" $7}' access.log | sort -u > log_urls.txt
grep -oP '(?<=<loc>)[^<]+' sitemap.xml | sort -u > sm_urls.txt  # -P needs GNU grep
sort -u crawl_urls.txt log_urls.txt sm_urls.txt > inventory.txt
wc -l inventory.txt # frozen baseline count
# cURL validation loop: record the status code for every URL in the inventory
while IFS= read -r url; do
  code=$(curl -sI -o /dev/null -w '%{http_code}' "$url")  # HEAD request, headers only
  echo "$code,$url"                                        # comma-separated to match the .csv name
done < inventory.txt > status_baseline.csv
# dig: capture the current TTLs now so rollback can restore the exact values later
dig production-origin.com SOA +noall +answer
dig production-origin.com A +noall +answer  # TTL on the record that will be repointed
# Then lower the record TTL to 300 s exactly 48 h before cutover via the DNS API
Validation
Concrete pass/fail checks the baseline must clear before it is accepted.
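- URL count parity: grep -c '<loc>' sitemap.xml vs wc -l inventory.txt
- No baseline URL serves X-Robots-Tag: noindex: grep -i 'noindex' status_baseline_headers.txt returns nothing
- Every non-200 row in status_baseline.csv is explained (intentionally retired or flagged for fix)
- Log coverage is cross-checked against the export with awk/pandas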
Rollback Triggers
Halt the cutover and revert if any condition appears during pre-launch validation.
- Baseline URL count deviates >5% from sitemap or 30-day log totals.
- Any top-quintile-revenue URL is missing a single-hop 301 mapping.
- Canonical discrepancy rate exceeds 2% across migrated templates.
- Authentication walls unexpectedly block crawler access to revenue paths.
- DNS propagation fails or CDN cache-purge schedules are unverified at the gate.
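The triggers above can be expressed as a single gate check. This is a sketch under stated assumptions: the `GateMetrics` fields and the function name are hypothetical, the thresholds come from the list above, and the metric values are expected to be computed elsewhere.

```python
from dataclasses import dataclass

@dataclass
class GateMetrics:
    baseline_count: int
    reference_count: int          # sitemap or 30-day log total
    unmapped_top_quintile: int    # top-quintile URLs lacking a single-hop 301
    canonical_discrepancy: float  # fraction, e.g. 0.013 for 1.3%
    blocked_revenue_paths: int    # revenue paths the crawler could not authenticate into
    dns_and_purge_verified: bool

def rollback_reasons(m: GateMetrics) -> list:
    """Return every triggered rollback condition; an empty list means proceed."""
    reasons = []
    if m.reference_count == 0:
        reasons.append("reference URL count is zero; cannot compute deviation")
    elif abs(m.baseline_count - m.reference_count) / m.reference_count > 0.05:
        reasons.append("baseline URL count deviates more than 5%")
    if m.unmapped_top_quintile > 0:
        reasons.append("top-quintile URL missing a single-hop 301")
    if m.canonical_discrepancy > 0.02:
        reasons.append("canonical discrepancy rate exceeds 2%")
    if m.blocked_revenue_paths > 0:
        reasons.append("authentication blocks crawler on revenue paths")
    if not m.dns_and_purge_verified:
        reasons.append("DNS propagation or CDN purge schedule unverified")
    return reasons

clean = rollback_reasons(GateMetrics(10450, 10000, 0, 0.013, 0, True))
failing = rollback_reasons(GateMetrics(10600, 10000, 2, 0.031, 0, True))
```

Returning the full list of reasons, rather than stopping at the first failure, gives the sign-off meeting one complete discrepancy report per gate run.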
FAQ
How do I handle JavaScript-rendered URLs during baseline generation? Use a render-capable crawler (Screaming Frog in JavaScript rendering mode, or Puppeteer/Playwright) and enforce a 3–5 s render wait so the DOM fully hydrates before link and metadata extraction; otherwise client-side routes never enter the inventory.
What crawl depth and thread count suit enterprise sites? Set depth to unlimited and run 5–10 concurrent threads with a 1 s delay. That keeps deep archives and paginated series in scope while staying under the origin’s overload threshold.
How do I validate baseline accuracy against server logs? Join the export to 30-day Nginx/Apache logs with awk or pandas, filter for 200/301 responses, and surface URLs that appear in logs but not the crawl; those orphaned, traffic-earning paths are the highest-risk omissions.
When should DNS TTL be lowered relative to baseline completion? Lower the authoritative TTL to 300 s exactly 48 h before cutover, and only after baseline validation confirms redirect mappings, canonical tags, and CMS config are production-ready.
Related
- Pre-Migration Auditing & Risk Assessment
- Risk Assessment Frameworks
- Traffic & Conversion Mapping
- How to Export Full Crawl Data Before Migration