Writing a Migration Go-Live Runbook

Problem Statement

On cutover night, ambiguity is the enemy: if two people each assume the other is flipping DNS, the migration stalls or doubles up. A go-live runbook removes improvisation by naming who does what, in what order, by when, and what evidence must be true before each step. Without it, a 30-minute cutover becomes a three-hour scramble of chat threads, half-applied changes, and a rollback that nobody is sure how to trigger. The runbook is also the artefact that lets a calm planning session, days earlier, do the hard thinking so that the high-pressure night is pure execution — reading and confirming, not deciding and debating. This page is part of Stakeholder Communication Plans; start there for the wider communications framework.

The runbook is a timed sequence; the go/no-go gate either advances to cutover or routes back to a hold-and-fix loop.

When to Use This Approach

The migration involves more than one team (engineering, SEO, infra, content, support) acting on the same night.
DNS, redirects, and search-console handover must happen in a strict order.
Leadership requires a documented go/no-go decision before traffic is moved.
You need an artefact that an on-call engineer who was not in planning can execute correctly.
The change is high-risk enough that a clear rollback path must be one step away at all times.

Step-by-Step Instructions

1. Define Roles and a Single Decision-Maker

List every role with a named person and a backup, and designate exactly one go/no-go decision-maker. Diffuse ownership is how steps get skipped, so each line item maps to one accountable name. Name backups for every role too, because cutovers run late and key people drop offline; a role with only a primary is a single point of failure on the riskiest night of the project. Confirm contact details and availability windows for everyone listed, and put the on-call rollback owner at the top so reverting is never blocked on hunting for who has the access.

# roles.yml — named owners for the cutover (one decision-maker only)
decision_maker: "A. Rivera (Migration Lead)"     # sole go/no-go authority
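roles:
  rollback_owner: { primary: "<name>", backup: "<name>" }   # placeholder: listed first so a revert is never blocked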
  dns_operator:   { primary: "J. Okafor", backup: "T. Lin" }
  redirect_owner: { primary: "M. Costa",  backup: "P. Shah" }
  gsc_handover:   { primary: "R. Devi",   backup: "S. Müller" }
  comms_lead:     { primary: "K. Brandt", backup: "L. Nguyen" }
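One way to catch a role with no backup before cutover night is a small shell check over roles.yml. This is a minimal sketch rather than a real YAML parser: it assumes every role sits on one indented line in the { primary: "...", backup: "..." } layout shown above, and the function name check_roles is hypothetical.

```sh
# check_roles FILE: fail if any role line lacks a quoted primary or backup name.
check_roles() {
  # Role lines are indented "name: { ... }" entries; keep those that do NOT
  # carry both a non-empty primary and a non-empty backup.
  missing=$(grep -E '^[[:space:]]+[a-z_]+:[[:space:]]*\{' "$1" \
    | grep -Ev 'primary:[[:space:]]*"[^"]+".*backup:[[:space:]]*"[^"]+"' || true)
  if [ -n "$missing" ]; then
    printf 'role without both primary and backup:\n%s\n' "$missing"
    return 1
  fi
  echo "all roles have a primary and a backup"
}
```

Running check_roles roles.yml during the rehearsal, rather than on the night, leaves time to name the missing person.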

2. Sequence the Steps with Times and Dependencies

Write the cutover as a timed checklist where each step has a start time, an owner, and a precondition. Order matters: lowering TTL and confirming the redirect map must precede the DNS flip, and search-console handover must follow verified 200s, not precede them. Express times relative to T-0 (the go/no-go gate) rather than wall-clock, so the same runbook works regardless of which evening the migration actually runs and a slip in one step shifts the rest predictably.

# cutover_sequence.txt — T-relative schedule (precondition must be true to start)
T-60  redirect_owner  Deploy redirect rules to staging        pre: map signed off
T-30  dns_operator    Confirm TTL already lowered to 300s     pre: 24h since TTL drop
T-00  decision_maker  GO/NO-GO gate                           pre: all checks green
T+05  dns_operator    Flip A/AAAA records to new origin       pre: GO recorded
T+10  redirect_owner  Promote redirects to production         pre: DNS change applied
T+30  gsc_handover    Submit change of address + new sitemap  pre: 200s verified
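Because the schedule is T-relative, someone has to translate it into wall-clock times once the evening is fixed. The sketch below does that arithmetic in plain shell. The function name t_to_clock is hypothetical, and it assumes zero-padded HH:MM gate times and two-digit minute offsets written as in the schedule above (T-60, T+05).

```sh
# t_to_clock GATE OFFSET: print the wall-clock HH:MM for a T-relative offset.
# The body runs in a subshell ( ... ) so its variables do not leak.
t_to_clock() (
  t0=$1
  offset=${2#T}                 # "T+05" -> "+05"
  h=${t0%%:*}; m=${t0##*:}
  h=${h#0}; m=${m#0}            # drop one leading zero so "08" is not read as octal
  case $offset in
    -*) sign=-1 ;;
    *)  sign=1 ;;
  esac
  mins=${offset#[+-]}
  mins=${mins#0}
  # add a full day before the modulo so offsets that cross midnight stay positive
  total=$(( ($h * 60 + $m + $sign * $mins + 1440) % 1440 ))
  printf '%02d:%02d\n' $(( $total / 60 )) $(( $total % 60 ))
)

# Print the schedule above for a gate at 22:00.
for step in T-60 T-30 T-00 T+05 T+10 T+30; do
  printf '%s  %s\n' "$step" "$(t_to_clock 22:00 "$step")"
done
```

With a 22:00 gate the loop prints 21:00, 21:30, 22:00, 22:05, 22:10 and 22:30 beside the six offsets. If the gate slips, rerunning it with the new time shifts every step together.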

3. Wire Comms Triggers to Steps

Every meaningful step emits a status message so stakeholders are never guessing. Pre-write the messages so the comms lead pastes rather than composes under pressure. Decide in advance which channel carries operational chatter and which carries the clean status updates leadership reads, so executives are not scrolling past debugging noise to find whether the migration is on track. Timestamp every message and keep them factual — “DNS flipped, propagation monitoring underway” — because a calm, regular cadence of updates is itself a signal that the operation is under control.

# notify.sh — post a templated status line to the migration channel
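: "${WEBHOOK:?set WEBHOOK to the channel webhook URL}"   # fail fast if the URL is missing
post() {   # timestamped status to stakeholders; -f surfaces HTTP errors instead of hiding them
  # messages must not contain double quotes or backslashes (no JSON escaping is done here)
  curl -fsS -X POST "$WEBHOOK" -H 'Content-Type: application/json' \
    -d "{\"text\":\"[$(date +%H:%M)] $1\"}"
}
post "T-0 GO recorded by A. Rivera; DNS flip starting"
post "T+5 DNS flipped; propagation monitoring underway"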

4. Embed the Go/No-Go Gate and Rollback Reference

The gate is a hard stop with an explicit checklist; if any item is red, the decision is no-go. Reference the rollback procedure inline so it is one click away, not buried.

# go_no_go.txt — ALL must be green to proceed
[ ] Full backup verified and restorable
[ ] Redirect map signed off and tested on staging
[ ] TTL lowered ≥24h ago and confirmed via dig
[ ] Monitoring + alerting live for new origin
[ ] Rollback owner on call and procedure open
Rollback procedure: <link or path to the rollback runbook>
Decision: GO / NO-GO  by: ______  at: ______
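To make the gate blocking rather than advisory, the checklist file itself can be checked by a script that later steps depend on. The sketch below assumes a convention the checklist above does not state: a green item is ticked as [x] at the start of its line and anything still showing [ ] is red. The function name gate is hypothetical.

```sh
# gate FILE: print GO and succeed only if every checklist item is ticked.
gate() {
  checklist=$1
  # any line still starting with an empty box "[ ]" is a red item
  if grep -q '^\[ \]' "$checklist"; then
    echo "NO-GO: unchecked items:"
    grep '^\[ \]' "$checklist"
    return 1
  fi
  # refuse an empty or malformed file: at least one ticked "[x]" item must exist
  if ! grep -q '^\[x\]' "$checklist"; then
    echo "NO-GO: no checked items found in $checklist"
    return 1
  fi
  echo "GO: all checklist items are green"
}
```

A cutover script can then start with gate go_no_go.txt || exit 1, so the DNS step cannot run while any box is empty. The decision itself still belongs to the named decision-maker; the script only stops a GO from being recorded against a red checklist.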

Worked Example

A SaaS company migrates app.oldco.com to app.newco.com on a Tuesday at 22:00 local. The runbook names J. Okafor as DNS operator and A. Rivera as the sole decision-maker. At the T-0 gate the checklist shows four greens and one red: monitoring for the new origin is not yet firing alerts. Rivera records a NO-GO, the comms lead posts the hold, and the team fixes alerting in 18 minutes.

At the reconvened T-0 all items are green, GO is recorded, and notify.sh posts the timestamped status. The DNS flip happens at T+5, redirects are promoted at T+10 once the change is applied, and the change-of-address submission waits until T+30 when the DNS operator has confirmed 200s on the new origin. Each step is gated by its precondition, so when propagation runs slightly slow the redirect promotion simply waits rather than firing against the old origin.
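The "simply waits" behaviour can be sketched as a polling helper that retries a precondition check until it passes or a limit is reached. This is an illustration under assumptions, not the team's actual tooling: wait_for is a hypothetical name, and the commands in the usage comment stand in for whatever check and action the runbook defines.

```sh
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns 1 if it never succeeds.
wait_for() {
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (hypothetical commands): poll every 10 s for up to 5 minutes, then promote.
# wait_for 30 10 dns_points_to_new_origin && promote_redirects
```

Bounding the wait matters: if the precondition never becomes true the helper fails, which is the cue to consult the rollback procedure rather than keep waiting.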

Because every status line was pre-written, the comms lead pastes updates within seconds of each milestone, and at no point does an executive have to ask “where are we?” The recorded decision and timeline feed the executive status dashboard and the broader Pre-Migration Auditing & Risk Assessment record, giving leadership an auditable account of why the first attempt was held and how the second succeeded.

Verification

Confirm the runbook is executable and the gate evidence is real.

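# Confirm TTL precondition: +short hides the TTL, so print the full answer line.
# The TTL is the second column; a caching resolver shows the time remaining, so expect 300 or less.
dig +noall +answer app.oldco.com A
dig +noall +answer app.oldco.com A @1.1.1.1   # cross-check against a second resolver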

# Confirm new origin returns 200 before submitting change of address
curl -sI https://app.newco.com/ | head -1

# Confirm a backup exists and the repository is intact (only a test restore proves restorability)
restic snapshots --latest 1 && restic check && echo "backup present, repository consistent"
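The curl check above prints a status line for a person to read. If the "200s verified" precondition should be machine-checkable, a small filter can turn it into an exit code. This is a sketch with a hypothetical function name, is_200; it reads response headers on standard input and looks only at the first status line, so it does not follow redirects.

```sh
# is_200: read HTTP response headers on stdin; succeed only if the status is 200.
is_200() {
  # first header line only, with the trailing carriage return removed
  status_line=$(head -n 1 | tr -d '\r')
  # word-split "HTTP/2 200" or "HTTP/1.1 200 OK"; the second field is the status code
  set -- $status_line
  [ "${2:-}" = "200" ]
}

# Usage: curl -sI https://app.newco.com/ | is_200 && echo "200 verified"
```

Stripping the carriage return matters because HTTP/2 status lines have no reason phrase, so the code is the last field on the line.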

Watch for these failures: a runbook with roles but no times, so steps fire out of order; a go/no-go gate that is advisory rather than blocking; and comms messages composed live, which delay updates exactly when stakeholders most want them. Equally damaging is a runbook that was never rehearsed — a dry run against staging surfaces the missing access, the step nobody owns, and the precondition that cannot actually be verified, all while there is still time to fix them. Version the runbook in the same repository as the migration config so the document and the change it describes stay in lockstep, and stamp it with the rehearsal date so everyone knows it has been exercised, not just written.

FAQ

Who should hold the final go/no-go authority? Exactly one person, usually the migration lead, with a named backup. Splitting the decision across a committee produces hesitation and contradictory instructions at the worst possible moment; a single accountable decision-maker reading a checklist keeps the gate fast and unambiguous.

How detailed should the timed sequence be? Detailed enough that an engineer who missed planning can execute it correctly. Each line needs a time, an owner, and a precondition, but avoid narrating obvious sub-steps; the runbook is an operational checklist, not a tutorial.

Where does rollback fit in the runbook? Rollback is referenced at the go/no-go gate and kept open throughout cutover, with its owner on call. The runbook should link straight to the rollback procedure so reverting is a known, rehearsed sequence rather than an improvised reaction.
