Scheduling Recurring Crawls to Detect Pre-Migration Drift

Problem Statement

A single pre-migration crawl is a photograph, but a site is a film: editors publish, developers ship, and parameters multiply during the weeks before cutover. If your redirect map is built against a crawl from three weeks ago, every URL added since is an unmapped 404 waiting to happen, and every page that quietly turned into a redirect or changed its canonical is a mismatch your map does not know about. You need recurring, scheduled crawls that diff against the last run so drift surfaces while there is still time to remap, rather than discovering it from a spike in 404s on launch night. The goal is not to crawl more, but to crawl on a cadence and compare automatically so a human only looks when something actually moved. This page is part of Crawl Baseline Generation; start there for the initial inventory.

Each scheduled run exports a dated snapshot and diffs it against the prior one; drift raises an alert and the loop repeats until freeze.

When to Use This Approach

The site changes frequently (active editorial, e-commerce catalogue, or rapid release cadence) in the run-up to migration.
The migration freeze window is weeks away and the URL set is still moving.
You want to detect new URLs, status-code changes, and canonical changes automatically rather than re-auditing by hand.
You have a host that can run a headless crawler on a schedule (a build agent, a small VM, or a cron box).
You need a dated trail of snapshots to show stakeholders exactly when a URL appeared or broke.

Step-by-Step Instructions

1. Make the Crawl Fully Headless and Repeatable

A scheduled crawl cannot prompt for input, so run the Screaming Frog CLI in headless mode with a saved config and a dated output folder. Pinning the config guarantees every run is comparable.

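# Screaming Frog CLI: headless nightly crawl into a dated folder
# (the launcher name varies by platform, e.g. screamingfrogseospider on Linux)
STAMP=$(date +%Y-%m-%d)
mkdir -p "/var/crawls/$STAMP"   # make sure the output folder exists before the crawl writes to it
ScreamingFrogSEOSpider \
  --crawl https://oldshop.example.com \
  --headless \
  --save-crawl \
  --export-tabs "Internal:All" \
  --output-folder "/var/crawls/$STAMP" \
  --config /etc/crawls/pre_migration.seospiderconfig   # pinned config = comparable runs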

2. Schedule It with cron

Wrap the crawl in a small script and schedule it nightly so drift is caught within 24 hours. Run it at a low-traffic hour and add a modest crawl delay in the config to avoid stressing the origin. Have the wrapper script set the date stamp, create the output folder, run the crawl, and then call the diff, so the entire pipeline is a single cron entry with one log file to watch. Pin the timezone explicitly in the cron environment: if the host runs on UTC while the team thinks in local time, or its clock shifts with daylight saving, the “02:00” crawl can land hours away from the quiet window you intended, possibly in peak traffic.

# /etc/cron.d/pre-migration-crawl: nightly at 02:00, appending to a log file (rotate it with logrotate)
0 2 * * * deploy /usr/local/bin/nightly_crawl.sh >> /var/log/crawl.log 2>&1
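Before the wrapper can call the diff, it has to know which folder is "previous" and which is "current". A minimal sketch of that selection, assuming the snapshots live in folders named YYYY-MM-DD as in step 1; the function name `latest_two_snapshots` is hypothetical and not part of any tool mentioned here:

```python
from datetime import datetime
from pathlib import Path


def latest_two_snapshots(root):
    """Return (previous, current) dated snapshot folders under root.

    Only folders whose names parse as YYYY-MM-DD are considered, so stray
    directories are ignored and ordering never depends on modification times.
    """
    dated = []
    for child in Path(root).iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, "%Y-%m-%d")
        except ValueError:
            continue  # not a dated snapshot folder
        dated.append((stamp, child))
    dated.sort(key=lambda pair: pair[0])  # oldest first
    if len(dated) < 2:
        raise RuntimeError("need at least two dated snapshots to diff")
    return dated[-2][1], dated[-1][1]
```

Sorting on the parsed date rather than on file timestamps is a deliberate choice: copying or restoring an old folder changes its mtime but not its name, so the pair stays in the right order.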

3. Diff the Latest Snapshot Against the Previous One

Compare the two most recent exports to surface added URLs, removed URLs, and status-code flips. A simple key of URL plus status code makes new and broken pages fall out of a set difference, and joining the two frames on URL exposes pages that changed status without changing address — the silent 200-to-301 flips that break redirect maps. Keep the diff dumb and deterministic so its output is trustworthy; resist the urge to “smartly” suppress changes, because the whole point is that a human reviews every real delta.

# Diff today's crawl against yesterday's on URL + status code
# Usage: python diff_crawl.py PREV.csv CURR.csv
import sys

import pandas as pd

prev, curr = sys.argv[1], sys.argv[2]
cols = ['Address', 'Status Code']
a = pd.read_csv(prev, usecols=cols).drop_duplicates('Address')  # one row per URL
b = pd.read_csv(curr, usecols=cols).drop_duplicates('Address')
new_urls = set(b['Address']) - set(a['Address'])    # appeared since last run
gone_urls = set(a['Address']) - set(b['Address'])   # disappeared since last run
# Inner join keeps only URLs present in both runs, then compares their status codes
merged = a.merge(b, on='Address', suffixes=('_old', '_new'))
flips = merged[merged['Status Code_old'] != merged['Status Code_new']]  # status drift
print(f"new={len(new_urls)} gone={len(gone_urls)} status_changes={len(flips)}")
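The counts above say how much moved, but a reviewer also needs to know which URLs changed, and the script does not yet cover canonical changes. The sketch below lists per-URL status and canonical changes using only the standard library. The canonical column name `Canonical Link Element 1` is an assumption about the export layout, and the function names are hypothetical; check both against your own CSV before relying on this.

```python
import csv

URL_COL = "Address"
STATUS_COL = "Status Code"
CANONICAL_COL = "Canonical Link Element 1"  # assumed column name; verify in your export


def load_snapshot(path):
    """Map each URL to its (status code, canonical) pair from one export."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return {
            row[URL_COL]: (
                (row.get(STATUS_COL) or "").strip(),
                (row.get(CANONICAL_COL) or "").strip(),
            )
            for row in csv.DictReader(handle)
        }


def changed_in_place(prev, curr):
    """Return status and canonical changes for URLs present in both snapshots."""
    status_changes, canonical_changes = [], []
    for url in sorted(prev.keys() & curr.keys()):
        old_status, old_canon = prev[url]
        new_status, new_canon = curr[url]
        if old_status != new_status:
            status_changes.append((url, old_status, new_status))
        if old_canon != new_canon:
            canonical_changes.append((url, old_canon, new_canon))
    return status_changes, canonical_changes
```

Values are compared as stripped strings, which keeps the comparison deterministic: a blank canonical and a missing column both read as an empty string, and no change is suppressed.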

4. Alert on Drift Above a Threshold

For alerting that people keep reading, have any drift above the threshold write a report and exit non-zero so a CI step (or cron mail) flags it. Keep the threshold low during freeze so even small changes are reviewed, but allow a small tolerance early in the run-up when the site is still legitimately changing, so the team is not desensitised by nightly alerts that are all expected. The exit code is the integration point for pipelines: a CI runner marks the job red on any non-zero exit. Cron works differently: it mails whatever the job prints to stdout or stderr rather than reacting to the exit code, so if the cron entry redirects all output to a log file, have the wrapper send the alert itself or leave stderr unredirected. Archive each diff report next to its dated crawl folder so the history of what changed, and when, is permanent.

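# Fail the job when total drift exceeds the threshold so the run is flagged
THRESHOLD=${THRESHOLD:-0}   # keep at 0 during freeze; raise slightly early in the run-up
SUMMARY=$(python diff_crawl.py "$PREV" "$CURR") || exit 2   # a failed diff must not pass silently
# Sum new + gone + status_changes so unmapped new URLs also count as drift
DRIFT=$(echo "$SUMMARY" | grep -oP '(new|gone|status_changes)=\K\d+' | awk '{s+=$1} END {print s+0}')
if [ "$DRIFT" -gt "$THRESHOLD" ]; then
  echo "Drift detected ($SUMMARY): review /var/crawls" >&2
  exit 1   # non-zero exit fails a CI job; the stderr message is what cron can mail
fi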
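Archiving the report beside the crawl can be done in the same step that decides the exit code. A minimal sketch, assuming the three counts have already been parsed into a dict; the function name and the `diff_report.json` file name are hypothetical choices for illustration:

```python
import json
from pathlib import Path


def archive_and_exit_code(snapshot_dir, counts, threshold=0):
    """Write the diff summary beside the crawl and return a process exit code.

    counts is a dict such as {"new": 42, "gone": 3, "status_changes": 7}.
    Returns 1 when total drift exceeds threshold, otherwise 0.
    """
    total = sum(counts.values())
    report = {
        "counts": counts,
        "total": total,
        "threshold": threshold,
        "alert": total > threshold,
    }
    report_path = Path(snapshot_dir) / "diff_report.json"  # hypothetical file name
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return 1 if report["alert"] else 0
```

A caller would typically finish with `sys.exit(archive_and_exit_code(folder, counts))`, so the report is always written, including on quiet nights, and the history has no gaps.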

Worked Example

A publisher on oldnews.example.com plans to migrate to news.example.com in five weeks. Nightly Screaming Frog runs land in /var/crawls/2026-06-12 through /var/crawls/2026-06-19. On the night of the 17th the diff reports new=42 gone=3 status_changes=7: an editor relaunched a tag taxonomy, creating 42 new /topics/ URLs and turning 7 old tag pages into 301s.

Because the alert fired, those 42 URLs are added to the redirect map before freeze instead of becoming 404s on launch day, and the 7 status flips are traced to old tag pages that now 301 internally — chains that would have stacked on top of the migration redirects had they gone unnoticed. Two nights later the diff reports status_changes=0, confirming the taxonomy has settled and the map is back in sync.

The dated folders give the team an exact audit trail: they can point to /var/crawls/2026-06-17 as the night the taxonomy changed and show stakeholders that the drift was caught and remapped within 24 hours. The same snapshots also feed the Core Web Vitals baseline work and the broader Pre-Migration Auditing & Risk Assessment record.

Verification

Confirm the schedule actually runs and the diff catches known changes.

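# Confirm cron executed and produced a dated folder in the last 24h
find /var/crawls -mindepth 1 -maxdepth 1 -type d -mtime -1

# Sanity-check the diff on two consecutive snapshots (older file first);
# for a stricter test, plant a known new URL in a copy of the newer export
python diff_crawl.py /var/crawls/2026-06-18/internal_all.csv \
                     /var/crawls/2026-06-19/internal_all.csv

# Verify the alert exits non-zero on drift: diff_crawl.py alone always exits 0,
# so run the wrapper that applies the threshold (note: this re-runs the crawl)
/usr/local/bin/nightly_crawl.sh; echo "exit code: $?"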
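Planting a known URL is easiest on a copy of an export, so the real snapshot stays untouched. The helper below is a sketch under the assumption that the export has `Address` and `Status Code` columns, as the diff script expects; `plant_url` is a hypothetical name.

```python
import csv


def plant_url(src_csv, dst_csv, url, status="200"):
    """Write a copy of src_csv to dst_csv with one synthetic row appended."""
    with open(src_csv, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Columns other than Address and Status Code are left blank via restval
    rows.append({"Address": url, "Status Code": status})
    with open(dst_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Diffing the original export against the planted copy should then report exactly one new URL, no gone URLs, and no status changes; any other result points at a problem in the diff rather than on the site.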

Watch for these failures: a non-headless config that silently hangs waiting for a dialog; comparing folders out of order so “new” and “gone” are swapped; and crawling without a delay and tripping the origin’s rate limiting, which injects phantom 429 drift. Watch too for a crawl that quietly truncates — if a run hits a page cap or times out, it will report a flood of “gone” URLs that are not really gone, so assert a minimum expected URL count before trusting the diff. Finally, make sure the cron user has write access to the output and log paths, or the job will fail silently and you will believe the site is stable when nothing has actually run.
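The truncation and rate-limiting checks above can be expressed as two small guards that run before the diff is trusted. This is a sketch only: the function names are hypothetical, and the defaults (`min_urls=1000`, `max_drop=0.2`, `max_share=0.01`) are illustrative placeholders to tune for your site, not recommendations.

```python
def crawl_looks_complete(prev_count, curr_count, min_urls=1000, max_drop=0.2):
    """Return True when the current crawl is plausibly complete.

    Fails if the crawl is below an absolute floor, or if it shrank by more
    than max_drop (a fraction) relative to the previous run.
    """
    if curr_count < min_urls:
        return False
    if prev_count > 0 and (prev_count - curr_count) / prev_count > max_drop:
        return False
    return True


def rate_limited(status_codes, max_share=0.01):
    """Return True when the share of 429 responses exceeds max_share."""
    codes = list(status_codes)
    if not codes:
        return False
    # Accept ints, strings, or floats such as 429.0 from a CSV reader
    hits = sum(1 for code in codes if str(code).strip().split(".")[0] == "429")
    return hits / len(codes) > max_share
```

When either guard trips, it is usually safer to flag the run as invalid and keep the previous snapshot as the comparison baseline than to report the resulting "gone" or 429 rows as drift.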

FAQ

How often should the recurring crawl run before migration? Nightly is the practical default during the final month, dropping to weekly when the site is stable and freeze is distant. The cadence should be short enough that drift is caught within one editorial cycle so there is always time to remap before cutover.

Will nightly crawling overload the production origin? Not if you add a crawl delay and schedule for a low-traffic window. Set a delay of one to two seconds in the saved config and run at, say, 02:00 local time; at that rate the crawler sends roughly one request every second or two, which for most sites is a small fraction of normal traffic.

Can I run this without the paid Screaming Frog licence? Not with Screaming Frog itself: the headless CLI and tab exports require a licence. You can, however, substitute an open-source crawler that emits CSV and feed the same diff script. The diff logic only needs a URL column and a status-code column, so any crawler that produces those will work once the columns are renamed to Address and Status Code (or the script is adjusted to the crawler's own column names).
