Database migration is one of the highest-risk operations any engineering team can undertake. A single misconfigured type mapping, an unexpected charset collision, or a poorly timed cutover can bring down production systems and corrupt years of data. Yet it's unavoidable — every growing organisation eventually needs to move its data to a faster, cheaper, or more scalable platform.

This guide covers everything you need to design and execute a zero-downtime migration, from architectural patterns to cutover choreography. We've drawn on lessons from 500+ enterprise migrations across financial services, healthcare, and technology companies.

💡 TL;DR: Zero-downtime migration requires dual-write architecture, a validated change data capture (CDC) stream, and a controlled cutover window measured in seconds — not hours.

1. Why Traditional Migration Fails

The traditional approach — dump, transform, restore — requires a maintenance window. You stop writes to the source, export everything, transform it, import it to the target, and then cut traffic over. For anything larger than a few gigabytes, this window stretches to hours or days, which is simply not acceptable for production systems.

Even "fast" export-import pipelines at 1 Gbps of effective throughput will take 13+ minutes per 100 GB, and that's before transformation, index rebuilding, and verification. Databases with foreign key constraints, triggers, or complex partitioning can multiply that 5–10x.
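The wire-time arithmetic is worth checking against your own link speed. A minimal pure-Python sketch (raw transfer time only, ignoring transformation and indexing):

```python
def transfer_minutes(gigabytes: float, gbps: float) -> float:
    """Raw wire time to move `gigabytes` at `gbps` of effective throughput."""
    return gigabytes * 8 / gbps / 60  # GB -> gigabits, then seconds -> minutes

print(round(transfer_minutes(100, 1), 1))   # 100 GB at 1 Gbps: ~13.3 minutes
print(round(transfer_minutes(100, 10), 1))  # 100 GB at 10 Gbps: ~1.3 minutes
```

Note that "effective throughput" is usually well below the link's rated speed once export, compression, and network overhead are accounted for.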

2. The Dual-Write Architecture

The only truly zero-downtime approach is to never stop writing to the source database. Instead, you run both databases in parallel, keeping them in sync, until you're confident the target is fully caught up and verified. Only then do you switch traffic over — and even then, the source remains live as a fallback.

The pattern has three phases:

  • Backfill: Full initial data load from source to target. This takes time but runs against a live source.
  • Sync: Change Data Capture (CDC) streams ongoing writes from source to target. The lag closes to near-zero.
  • Cutover: Once lag is <100ms, switch application writes to target. The cutover itself is a config change, not a data operation.

On a PostgreSQL source, the Sync phase is typically built on logical replication:
-- PostgreSQL: Enable logical replication for CDC
-- (changing wal_level requires a server restart to take effect)
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;

-- Create a replication slot using the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot(
  'dflux_migration_slot',
  'pgoutput'
);
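
The routing half of the pattern lives in the application layer. A minimal sketch of a phase-aware write router — the class and phase names here are illustrative, not from any particular library, and `source`/`target` stand in for any DB-API-style handles with an `execute` method:

```python
from enum import Enum

class Phase(Enum):
    BACKFILL = 1   # writes go to source; bulk copy runs in the background
    SYNC = 2       # writes go to source; CDC replays them onto the target
    CUTOVER = 3    # writes go to target; source kept in sync as fallback

class WriteRouter:
    """Routes application writes according to the current migration phase."""

    def __init__(self, source, target, phase=Phase.BACKFILL):
        self.source, self.target, self.phase = source, target, phase

    def execute(self, sql, params=()):
        # Only the CUTOVER phase changes the write primary; BACKFILL and
        # SYNC keep the source authoritative while CDC does the copying.
        primary = self.target if self.phase is Phase.CUTOVER else self.source
        return primary.execute(sql, params)
```

The point of isolating this decision in one place is that cutover becomes a single state change rather than a scattered set of connection-string edits.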

3. Validation: Beyond Row Counts

Most migration tools check row counts and declare success. This catches gross errors but misses a significant class of silent corruption. We've seen migrations where row counts match perfectly but numeric precision is silently degraded (DECIMAL(10,4) → FLOAT), or where timezone-naive timestamps are shifted by UTC offset, or where unicode normalisation converts characters invisibly.

Recommended validation stack: Row counts + column checksums + statistical distribution sampling + null rate comparison + sample record deep comparison.
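
As an illustration of the checksum layer, here is a hedged sketch that fingerprints a column by hashing a canonicalised rendering of its values. The function names are ours; production tools push this work into the database for speed, but the idea is the same:

```python
import hashlib

def column_checksum(values) -> str:
    """Order-insensitive fingerprint of a column.

    Values are canonicalised via repr(), so silent precision loss
    (DECIMAL(10,4) -> FLOAT) or invisible unicode changes alter the
    digest even when row counts still agree.
    """
    h = hashlib.sha256()
    for rendered in sorted(repr(v) for v in values):
        h.update(rendered.encode("utf-8"))
    return h.hexdigest()

def columns_match(source_col, target_col) -> bool:
    # Row count first (cheap), then the checksum (catches silent corruption)
    return (len(source_col) == len(target_col)
            and column_checksum(source_col) == column_checksum(target_col))
```

Checksums answer "did anything change?"; the distribution sampling and deep record comparison layers then answer "what changed, and does it matter?".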

4. Cutover Choreography

The actual cutover — the moment you switch production traffic from source to target — should be a non-event. If you've designed it correctly, it's a config change, not a migration operation. Your application reads a connection string from a config store; you change it. Bake in a 5-minute "drain" period where new connections go to target but active sessions on source are allowed to complete.
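
A sketch of that drain logic, assuming the application resolves its DSN from a config store at connect time — the class, DSNs, and method names are illustrative:

```python
import time

class CutoverConfig:
    """Toy config store: new connections read their DSN at connect time."""

    def __init__(self, source_dsn: str, target_dsn: str):
        self.source_dsn, self.target_dsn = source_dsn, target_dsn
        self.cutover_at = None  # monotonic timestamp of the flip

    def flip(self, now=None):
        """The cutover itself: a single state change, not a data operation."""
        self.cutover_at = time.monotonic() if now is None else now

    def dsn_for_new_connection(self) -> str:
        return self.source_dsn if self.cutover_at is None else self.target_dsn

    def drain_complete(self, drain_seconds=300, now=None) -> bool:
        """True once active source sessions have had the full drain window."""
        if self.cutover_at is None:
            return False
        now = time.monotonic() if now is None else now
        return now - self.cutover_at >= drain_seconds
```

New connections land on the target the instant `flip()` is called; `drain_complete()` gates any teardown of source-side sessions until the 5-minute window has passed.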

⚠️ Always have rollback ready: Keep your CDC sync running bidirectionally during the cutover window. If something goes wrong in the first 30 minutes, you can switch back without any data loss.


This is a framework, not a prescription. Every migration is different — the right approach for a 100GB OLTP database serving 1,000 req/s is very different from a 10TB data warehouse with nightly ETL jobs. Start with your constraints, then design backwards.

If you'd like a migration assessment for your specific database stack, our team is happy to review your architecture and give a concrete recommendation.