How to Export Full Crawl Data Before Migration
Problem/Symptom
Migrating without a complete URL inventory guarantees broken redirects and lost link equity. Incomplete datasets trigger immediate 404 spikes and organic traffic collapse. You must capture every HTTP status, redirect chain, and canonical tag before touching DNS.
Exact Execution/Config
Define strict crawl boundaries before starting extraction: the goal is to capture the full site architecture without overloading the server. Configure the baseline crawl as follows; a page-budget sketch follows the list.
- Configure robots.txt compliance flags to bypass Disallow directives for baseline capture.
- Set max_depth to 0 (unlimited) and max_pages to the exact sitemap count + 15% buffer.
- Apply rate limiting (--delay 2000ms) to prevent 429/503 errors during extraction.
- Enable JavaScript rendering (--render-js true) for SPA/CSR route capture.
- Export internal and external link tables separately to isolate migration-critical paths.
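As a minimal sketch of the page-budget rule above, the following assumes a single sitemap.xml at the URL shown (a sitemap index would need an extra pass over its child sitemaps):

import math
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://production-site.com/sitemap.xml"  # assumption: one flat sitemap file, not an index

# Count <loc> entries in the sitemap to size the crawl.
with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap_count = len(root.findall("sm:url/sm:loc", ns))

# max_pages = exact sitemap count + 15% buffer, per the boundaries above.
max_pages = math.ceil(sitemap_count * 1.15)
print(f"sitemap URLs: {sitemap_count}, max_pages budget: {max_pages}")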
Run the crawler using optimized concurrency. Export raw datasets immediately for downstream processing.
# Screaming Frog's CLI exports via --export-tabs rather than a single CSV flag; the custom user agent is set inside the saved config
screamingfrogseospider --crawl https://production-site.com --headless --save-crawl --output-folder ./baseline_export --export-tabs "Internal:All" --config pre_migration.seospiderconfig
# wget's recursive mode follows HTML links, not <loc> entries in an XML sitemap, so mirror from the site root
wget -r -l inf -nd -A html,htm --wait=2 -o crawl.log https://production-site.com/
python3 -c "import pandas as pd; df=pd.read_csv('baseline.csv'); df.to_parquet('baseline.parquet')"
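As a quick completeness check, a short sketch, assuming baseline.csv keeps the crawler's default Address column, counts the captured URLs for comparison against the sitemap-derived budget:

import pandas as pd

# Assumption: baseline.csv from the export above, with an "Address" column of crawled URLs.
df = pd.read_csv("baseline.csv")
print(f"unique URLs captured: {df['Address'].nunique()}")
# Compare this figure against the sitemap count (plus buffer) before accepting the baseline.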
Validation
Transform the raw exports into migration-ready mapping tables. Apply the regex patterns and column transformations below to preserve data integrity; a consolidated pandas sketch follows the list.
- Apply the regex ^https?://(?:www\.)?olddomain\.com(/.*)$ to isolate legacy URL paths.
- Map CSV columns: Address → Source_URL, Status Code → HTTP_Status, Canonical → Target_Canonical.
- Use awk or Python pandas to deduplicate 301/302 chains: df.drop_duplicates(subset=['Source_URL'], keep='last').
- Validate redirect loops via the --max-redirects 3 flag during a secondary verification crawl.
- Apply transformation rules: LOWER(source_url), REGEX_REPLACE(redirect_target, '^https?://[^/]+', ''), TRIM(meta_desc).
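A consolidated pandas sketch of the rules above. The Address, Status Code, and Canonical names follow the mapping listed; Redirect URL and Meta Description 1 are assumed export column names that may differ per crawler:

import pandas as pd

df = pd.read_csv("baseline.csv")

# Column mapping from the export schema to the mapping-table schema above.
df = df.rename(columns={"Address": "Source_URL",
                        "Status Code": "HTTP_Status",
                        "Canonical": "Target_Canonical"})

# Isolate legacy URL paths on the old domain.
df["Legacy_Path"] = df["Source_URL"].str.extract(r"^https?://(?:www\.)?olddomain\.com(/.*)$", expand=False)

# Deduplicate redirect chains, keeping the final hop per source URL.
df = df.drop_duplicates(subset=["Source_URL"], keep="last")

# Transformation rules: lowercase source, strip scheme/host from the redirect target, trim metadata.
df["Source_URL"] = df["Source_URL"].str.lower()
if "Redirect URL" in df.columns:  # assumption: redirect-target column name
    df["Redirect_Target"] = df["Redirect URL"].str.replace(r"^https?://[^/]+", "", regex=True)
if "Meta Description 1" in df.columns:  # assumption: meta-description column name
    df["Meta_Desc"] = df["Meta Description 1"].str.strip()

df.to_parquet("mapping_table.parquet")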
Once normalized, integrate the dataset into your broader Pre-Migration Auditing & Risk Assessment workflow to identify structural gaps. Avoid these critical failures during validation (a quick coverage check follows the list):
- Ignoring JavaScript-rendered routes causes SPA content loss post-migration.
- Exporting only top-level URLs misses deep pagination and filter parameters.
- Failing to preserve hreflang and x-default annotations corrupts international targeting.
- Overwriting baseline files without versioning eliminates rollback capability.
- Misconfiguring crawler concurrency triggers WAF blocks and incomplete datasets.
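To catch the parameter and hreflang pitfalls before sign-off, a rough check, assuming the mapping_table.parquet produced earlier and an export that carries hreflang columns:

import pandas as pd

df = pd.read_parquet("mapping_table.parquet")

# Deep pagination / filter parameters should be present, not just top-level URLs.
parameterized = df["Source_URL"].str.contains(r"\?", regex=True, na=False).sum()
print(f"parameterized URLs captured: {parameterized}")

# hreflang / x-default annotations must survive into the baseline export.
hreflang_cols = [c for c in df.columns if "hreflang" in c.lower()]
if not hreflang_cols:
    print("WARNING: no hreflang columns found; international targeting data is missing")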
Rollback/Emergency Steps
Implement immutable backup strategies and rapid restoration commands, and revert migration changes immediately if critical SEO metrics degrade. A threshold-check sketch follows the list.
- Commit raw CSV exports to Git with SHA-256 checksums: sha256sum crawl_raw.csv > crawl_raw.sha256.
- Maintain DNS TTL at 300s pre-migration to enable rapid IP failover.
- Store .htaccess/Nginx rewrite rules in version-controlled YAML and validate on reload: nginx -t && systemctl reload nginx.
- Define rollback trigger thresholds: >5% organic traffic drop or >10% 404 spike within 24h of launch.
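A minimal sketch of the trigger evaluation, using hypothetical session and 404 counts in place of real analytics and log data:

# Hypothetical metric values; in practice pull these from analytics and server logs.
baseline_sessions, post_launch_sessions = 120_000, 110_500
baseline_404s, post_launch_404s = 800, 950

traffic_drop = (baseline_sessions - post_launch_sessions) / baseline_sessions
not_found_spike = (post_launch_404s - baseline_404s) / baseline_404s

# Trigger thresholds from the list above: >5% organic traffic drop or >10% 404 spike within 24h.
if traffic_drop > 0.05 or not_found_spike > 0.10:
    print("ROLLBACK: threshold breached; revert DNS and restore the version-controlled rewrite rules")
else:
    print("HOLD: metrics within tolerance")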
FAQ
What is the optimal crawl depth for capturing migration-critical URLs?
Set depth to 0 (infinite) with a hard page limit matching sitemap.xml count + 15%. Use --max-depth 0 in CLI tools to prevent arbitrary truncation of deep archive or parameterized URLs.
How do I handle 302 redirects in the baseline export?
Flag all 302s for manual review. Convert them to 301s pre-migration if permanent, or preserve them in a redirect_type CSV column to prevent unintended canonicalization during DNS cutover.
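A short sketch of that flagging step, assuming the normalized Source_URL/HTTP_Status columns from the mapping table built earlier:

import pandas as pd

df = pd.read_parquet("mapping_table.parquet")

# Record redirect semantics explicitly instead of collapsing 302s silently.
df["redirect_type"] = df["HTTP_Status"].map({301: "permanent", 302: "temporary"})

# Queue temporary redirects for manual review before cutover.
df[df["redirect_type"] == "temporary"].to_csv("302_review_queue.csv", index=False)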
Can I automate CSV normalization for 500k+ URLs?
Yes. Use pandas with chunking (chunksize=50000) or duckdb for out-of-core processing. Apply vectorized regex replacements and export to Parquet for faster I/O during mapping.
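A chunked-pandas sketch under the same baseline.csv assumption; duckdb would achieve the same with a single out-of-core query over the CSV:

import pandas as pd

# Stream the export in 50k-row chunks so memory stays bounded on 500k+ URL crawls.
for i, chunk in enumerate(pd.read_csv("baseline.csv", chunksize=50_000)):
    chunk["Address"] = chunk["Address"].str.lower()  # vectorized normalization per chunk
    chunk.to_parquet(f"baseline_part_{i:03d}.parquet")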