Les pubs vous déplaisent ? Aller Sans pub Auj.

Zero-Downtime Database Migrations Schema Changes Without the Scary 3am Rollback

Mis à jour le

The expand-contract pattern for renaming columns, dropping old ones, and adding NOT NULL constraints without locking your Postgres table or breaking rolling deployments. A step-by-step walkthrough with real SQL.

Zero-Downtime Database Migrations: Schema Changes Without the Scary 3am Rollback 1
ANNONCE · Supprimer ?

At 2am, you’re staring at a migration you ran ten minutes ago. The deploy is half-done. Old app pods are crashing because the column they expect is gone. New pods are up but can’t write because the backfill isn’t finished. You’re about to roll back, which means running a reverse migration you haven’t tested, while your on-call teammate types “what happened” in Slack.

This is a solved problem. The expand-contract pattern exists specifically so you never have to choose between “run the migration and hope” and “schedule a maintenance window at 3am.” Here’s how it works, concretely, with a real example.

The scenario that breaks

Say your users table has a column called username. Product wants it renamed to display_name to match the new API spec. Naively, you write this migration:

ALTER TABLE users RENAME COLUMN username TO display_name;

Then you deploy your app. If your deploy is instantaneous—one pod, zero downtime requirement, just restart—this works fine. But if you’re doing a rolling deploy across 6 app servers, or a blue-green switch that takes 30 seconds to drain, you have a window where old code and new code are both running against the same database. Old code queries username. New code queries display_name. At least one of them is wrong at every moment until the deploy completes.

The migration and the deploy need to be two separate, independently safe things. That’s what expand-contract gives you.

The other thing that bites you: table locks

Before diving into the pattern, a second failure mode worth understanding: lock wait time.

Every DDL statement in Postgres acquires an AccessExclusiveLock—the most aggressive lock type, which blocks all reads and writes until it completes. For most schema changes in Postgres 12+, the lock is held for milliseconds. But there are exceptions that can take it from milliseconds to minutes:

OpérationPostgres behaviorSafe alternative
ADD COLUMN NOT NULL DEFAULT now()Full table rewrite (volatile default)Add nullable column, backfill, then constrain
ALTER COLUMN TYPEFull table rewrite, alwaysNew column + backfill + drop old
CREATE INDEXShareLock blocks writes for the full buildCREATE INDEX CONCURRENTLY
ADD COLUMN NOT NULL DEFAULT 'active'Postgres 11+: catalog-only (fast). Postgres 10-: full rewriteOn PG 11+, this is already safe
ADD COLUMN (nullable, no default)Catalog-only, fast on all versionsThis is the expand step
DROP COLUMNCatalog-only, fastThis is the contract step

There’s also the queue problem: even a fast DDL that only needs the lock for 50ms will queue behind any open long-running transaction. And everything behind the DDL queues behind it. This is the “migration took 2 seconds but caused 45 seconds of downtime” situation. Setting lock_timeout et statement_timeout on your migration connection helps here—fail fast rather than queue indefinitely.

The expand-contract pattern

The core idea: never remove anything that running code depends on. Always add before you remove. Every step must be backward-compatible with the previous deploy.

Here’s the full walkthrough for renaming username à display_name.

Step 1: Add the new column (expand)

-- Migration 1: add the new column, nullable
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

Adding a nullable column with no default is a catalog-only operation in Postgres—it doesn’t touch existing rows, takes no time proportional to table size, and the lock is held for milliseconds. After this migration, the column exists but is NULL for all existing rows. Old code still reads and writes username. Nothing breaks.

Step 2: Deploy code that writes to both columns

This is the step most people underestimate. Before you start reading from the new column, you need every write to also populate it.

# v1 (before)
user.username = new_name
db.save(user)

# v2 (dual-write) - safe to deploy alongside v1 during rolling update
user.username = new_name      # old column, still written
user.display_name = new_name  # new column, also written
db.save(user)

Deploy v2. Now every new write populates both columns. Old rows still have NULL in display_name.

Step 3: Backfill existing rows

Don’t run a single UPDATE on a large table—it holds row-level locks for the entire duration of the transaction, and on a table with 10 million rows, that’s a long time while your app is also trying to write to those rows.

Do it in batches:

-- Run this repeatedly until 0 rows updated
UPDATE users
SET display_name = username
WHERE display_name IS NULL
  AND id IN (
    SELECT id FROM users
    WHERE display_name IS NULL
    ORDER BY id
    LIMIT 10000
  );

How small to make batches depends on your table write rate and row size. For most OLTP workloads, 1,000–10,000 rows per batch with a short sleep between runs keeps individual transactions brief enough that they don’t compete badly with app traffic. Once this completes, every row has a value in display_name.

Step 4: Deploy code that reads from the new column

Now that all rows are populated, deploy code that reads from display_name. Keep writing to both columns during this deploy—this version needs to stay compatible with v2 pods still running during the rolling update.

# v3 - reads new column, still writes both
name = user.display_name  # now reading from new column
user.username = new_name      # still writing old (temporary)
user.display_name = new_name
db.save(user)

Step 5: Stop writing to the old column, then drop it

# v4 - only reads and writes new column
name = user.display_name
user.display_name = new_name
db.save(user)

Once v4 is fully deployed and no pods are writing to username:

-- Migration 2: drop the old column (catalog-only, fast)
ALTER TABLE users DROP COLUMN username;

That’s 4 deploys and 2 migrations instead of 1 migration and 1 deploy. Yes, it takes more calendar time. It also means you can run every step during business hours, roll back any individual step without data loss, and sleep through what would have been your 3am maintenance window.

Making the new column NOT NULL

After the backfill, you probably want display_name NOT NULL—it’s semantically a required field now. The naive approach:

-- Dangerous on large tables
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;

This triggers a full table scan to verify no nulls exist, while holding AccessExclusiveLock. On a large table: minutes of downtime.

The safe alternative uses a NOT VALID check constraint, which skips the full scan on the existing rows and validates them separately with a weaker lock:

-- Step 1: Add the constraint without scanning existing rows (fast, AccessExclusiveLock held briefly)
ALTER TABLE users ADD CONSTRAINT users_display_name_not_null
  CHECK (display_name IS NOT NULL) NOT VALID;

-- Step 2: Validate the constraint (scans existing rows, but only needs ShareUpdateExclusiveLock --
-- concurrent reads and writes are still allowed during this scan)
ALTER TABLE users VALIDATE CONSTRAINT users_display_name_not_null;

ShareUpdateExclusiveLock is the key difference—it only conflicts with other schema changes, not with regular reads and writes. The validation scan can take a while on large tables, but it won’t block your app. Once validated, the query planner can use the constraint for optimization, same as a native NOT NULL.

This is Postgres-specific. MySQL doesn’t have an equivalent—see the next section.

What your migration files look like

If you use any versioned migration tool, the pattern maps to numbered files:

migrations/
  V001__add_display_name_column.sql           # ALTER TABLE ADD COLUMN (safe, run anytime)
  V002__backfill_display_name.sql             # UPDATE in batches
  V003__add_not_null_constraint_not_valid.sql # NOT VALID constraint
  V004__validate_not_null_constraint.sql      # VALIDATE CONSTRAINT
  V005__drop_username_column.sql              # only after v4 code ships

V001–V004 are safe to run while the old app is still deployed. V005 can only run after v4 code is fully rolled out and confirmed. If you want to compare the schema DDL before and after a migration run to make sure it matches your expectations, the IO Tools text diff is useful for a quick side-by-side check on your schema dumps.

Postgres vs MySQL

Most of this guide targets Postgres 12+, but the expand-contract discipline applies anywhere. A few MySQL-specific differences:

  • MySQL 8.0+ Online DDL: ALTER TABLE ADD COLUMN supports ALGORITHM=INSTANT for most cases—equivalent to Postgres’s fast catalog-only add. Check the MySQL 8.0 Online DDL docs for which operations support INSTANT vs INPLACE vs COPY.
  • MySQL 5.7 and older: Almost every ALTER TABLE triggers a full table rebuild. The expand-contract pattern is even more important here, and the backfill batching is even more conservative.
  • No NOT VALID equivalent: MySQL doesn’t have the two-phase constraint validation trick. To safely add a NOT NULL constraint on a large MySQL table, your realistic options are: (1) backfill + set a default + alter in a low-traffic window, or (2) use gh-ost ou pt-online-schema-change, which implement expand-contract internally via shadow tables and triggers.

Migration tooling

The expand-contract pattern is a discipline, not a feature. Any migration tool will work—they all version and sequence migration files. What differs is how much ceremony is involved:

  • Flyway (Java, SQL files) — minimal setup, works. The free version covers most teams. Doesn’t enforce safe migration patterns, but the versioning is solid and the tooling ecosystem is mature.
  • Liquibase (XML/YAML/SQL) — explicit rollback support built in. More ceremony than Flyway. Worth it if you need to define down-migrations explicitly and have a team that will maintain them.
  • golang-migrate (Go CLI + library, SQL up/down pairs) — lightweight and language-agnostic. Config is often JSON or YAML; if one gets mangled during a CI pipeline copy-paste, running it through a JSON formatter surfaces the issue immediately.
  • Alembic (Python/SQLAlchemy) — autogenerate support that diffs your models against the live schema. Useful, but the autogenerate output often needs manual cleanup before it’s safe to run in production. Don’t auto-apply without reviewing.
  • Atlas — declarative schema management. You describe the target state, Atlas diffs and generates the migration. Different mental model from imperative migration files—some teams find it cleaner, others find the generated SQL harder to audit.

None of these prevent you from writing a migration that takes an exclusive lock for 20 minutes. They sequence and version migrations—the safety discipline is still on you.

The mental model

Every schema migration exists in a triangle: the database state before the migration, the app version currently deployed, and the app version about to deploy. A migration is safe only if both the current and next app version can run against both the before and after database state.

That’s it. Before every migration, draw the triangle mentally. If any cell doesn’t work, you need to split the migration. Expand-contract is just a systematic way to always satisfy that constraint.

The 3am rollback happens when someone assumes the migration and the deploy are the same operation. They’re not. Treat them as independent, coordinate them explicitly, and the scary part stops being scary.

Envie d'une expérience sans pub ? Passez à la version sans pub

Installez nos extensions

Ajoutez des outils IO à votre navigateur préféré pour un accès instantané et une recherche plus rapide

Sur Extension Chrome Sur Extension de bord Sur Extension Firefox Sur Extension de l'opéra

Le Tableau de Bord Est Arrivé !

Tableau de Bord est une façon amusante de suivre vos jeux, toutes les données sont stockées dans votre navigateur. D'autres fonctionnalités arrivent bientôt !

ANNONCE · Supprimer ?
ANNONCE · Supprimer ?
ANNONCE · Supprimer ?

Coin des nouvelles avec points forts techniques

Impliquez-vous

Aidez-nous à continuer à fournir des outils gratuits et précieux

Offre-moi un café
ANNONCE · Supprimer ?