Why Deleting a Column Takes Hours in a Live Database
You stare at a database schema and spot a column that hasn’t been used in two years. user_referral_code, nullable, empty in 94% of rows. You type ALTER TABLE users DROP COLUMN user_referral_code and hit enter. On your laptop’s test database with 500 rows, it finishes before you blink. On production with 80 million rows, you’ve just kicked off a process that might not finish until your users are filing support tickets.
This isn’t a bug. It’s the predictable consequence of how relational databases actually work, and understanding it changes how you think about schema changes entirely.
1. The Database Has to Rewrite the Whole Table
In most relational databases, rows aren’t stored as loose key-value pairs. They’re packed into fixed-format pages on disk, and the column layout is baked into that format. When you drop a column, the database can’t just flip a bit and move on. It has to read every page, reconstruct every row without the dropped column, and write everything back.
MySQL's InnoDB storage engine is the classic example. Before MySQL 8.0.29 added instant column drops, a DROP COLUMN meant rebuilding the table, either by copying it or rebuilding it in place, and the wall-clock time scales roughly linearly with table size and disk I/O speed. An 80-gigabyte table on spinning disks can genuinely take hours.
PostgreSQL takes a different path: since version 7.3 it has simply marked the column as dropped in the system catalog, without rewriting anything, which is why its DROP COLUMN returns quickly. But the space isn't reclaimed. The dead column's bytes linger inside existing rows until those rows are rewritten by later updates or by a VACUUM FULL, which does rewrite the whole table under an ACCESS EXCLUSIVE lock. And even the catalog-only drop needs that same lock, if only briefly. The rewrite is deferred, not eliminated.
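You can see the catalog-only nature of PostgreSQL's drop directly. A sketch, using the users table from above (column names assumed from the example):

```sql
-- PostgreSQL: the drop itself is a catalog update, not a data rewrite.
ALTER TABLE users DROP COLUMN user_referral_code;

-- The column's slot survives in the catalog as a dropped placeholder;
-- existing data pages are untouched until rows are rewritten later.
SELECT attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'users'::regclass
  AND attnum > 0;
```

The dropped column shows up with attisdropped = true rather than vanishing from the catalog, which is exactly why no page on disk had to change.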
2. Locks Block Everything Behind Them
The rewrite problem would be bad enough on its own. The lock problem compounds it. A DROP COLUMN needs an exclusive lock on the table: for the entire duration when a rewrite is involved, and at least briefly even for a catalog-only change. Any query that was already running against that table when the lock was requested finishes first. Your ALTER TABLE queues behind them.
Here’s where it gets worse. While your ALTER TABLE is queued and waiting, every subsequent query against that table also queues behind it. You now have a lock queue that grows for as long as the ALTER TABLE is waiting, and then for the entire duration of the rewrite. A table that handles thousands of queries per second can accumulate a backlog that doesn’t drain for minutes after the schema change finishes. This is how a schema migration that “should have been fast” causes a multi-minute production outage.
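One standard defense is to refuse to wait. In PostgreSQL, a lock_timeout makes the ALTER TABLE fail fast instead of parking at the head of a growing queue. A sketch:

```sql
-- Bound how long the ALTER TABLE may wait for its lock. If a long-running
-- query holds the table, the migration errors out after 2 seconds instead
-- of blocking every query that arrives behind it.
SET lock_timeout = '2s';
ALTER TABLE users DROP COLUMN user_referral_code;
-- On timeout: wait, then retry from the migration runner.
```

The migration may need several attempts, but each failed attempt costs two seconds of queueing rather than minutes.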
The tooling to handle this safely (pt-online-schema-change for MySQL, pg_repack for PostgreSQL) works by creating a shadow copy of the table, applying changes there, and using triggers to sync ongoing writes. The switchover happens in a brief, controlled window. The indirection is unglamorous but necessary.
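In outline, the shadow-copy approach these tools automate looks something like this (PostgreSQL syntax; the sync triggers and batched backfill are elided):

```sql
-- 1. Build a shadow table with the desired schema.
CREATE TABLE users_new (LIKE users INCLUDING ALL);
ALTER TABLE users_new DROP COLUMN user_referral_code;

-- 2. Install triggers on users that mirror INSERT/UPDATE/DELETE into
--    users_new, then backfill existing rows in small batches.

-- 3. Swap the tables inside one brief transaction.
BEGIN;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;
```

The only exclusive locks taken are the short ones for the renames, which is the whole point of the indirection.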
3. Indexes Have to Be Rebuilt
A column that nobody directly queries can still be part of an index. A composite index on (user_id, referral_code, created_at) can't survive the loss of referral_code: PostgreSQL drops any index that includes the column outright, while MySQL removes the column from the index definition as part of its rebuild. Either way, restoring the index coverage your queries still need is its own significant I/O operation.
Index rebuilds are expensive in proportion to table size and the number of affected indexes. On a table with a dozen composite indexes, several of which touch the column you're dropping, rebuilding adds substantial time to an already lengthy operation. And unlike the table rewrite, which at least runs once, each affected index is a separate rebuild, and running those rebuilds in parallel takes deliberate, version-specific configuration rather than happening for free.
The practical implication: before you drop any column, run a query against the information schema (or your database’s equivalent) to find every index that includes it. Knowing the scope of the rebuild before you start is the difference between a planned maintenance window and an emergency.
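A sketch of that check, in both dialects (table and column names assumed from the example above):

```sql
-- MySQL: every index that touches the column, via the information schema.
SELECT index_name, seq_in_index
FROM information_schema.statistics
WHERE table_schema = DATABASE()
  AND table_name = 'users'
  AND column_name = 'user_referral_code';

-- PostgreSQL equivalent, via the system catalogs.
SELECT i.relname AS index_name
FROM pg_index x
JOIN pg_class i ON i.oid = x.indexrelid
JOIN pg_attribute a ON a.attrelid = x.indrelid
                   AND a.attnum = ANY (x.indkey)
WHERE x.indrelid = 'users'::regclass
  AND a.attname = 'user_referral_code';
```

An empty result means the drop touches no indexes; anything else is I/O you should budget for.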
4. Foreign Keys and Constraints Need Validation
If the column you're dropping is referenced by a foreign key in another table, the column can't go until the constraint does: PostgreSQL refuses the DROP COLUMN unless you drop the foreign key first or specify CASCADE. Dropping the constraint is itself a catalog change, but it needs locks on both tables simultaneously, and if the plan ever calls for re-adding that foreign key, the validation scan across millions of rows on both sides of the relationship is where the real time goes.
Constraint handling is a frequently underestimated cost. Dropping a column that carries a NOT NULL constraint, a CHECK constraint, or a unique constraint means the database has to update its constraint catalog, and a unique constraint takes its backing index with it. PostgreSQL lets you add constraints as NOT VALID and validate them separately, which keeps the expensive scan off the critical path on the way in; there is no equivalent trick for the locks around removing them as part of a column drop.
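Making those dependencies explicit in the migration beats discovering them via a CASCADE. A sketch, with a hypothetical referrals table and constraint name:

```sql
-- PostgreSQL: drop the dependent foreign key first, in its own short
-- transaction, rather than letting CASCADE take both locks at once.
ALTER TABLE referrals
  DROP CONSTRAINT referrals_user_referral_code_fkey;  -- hypothetical name

-- Only then drop the column itself.
ALTER TABLE users DROP COLUMN user_referral_code;
```

Splitting the steps keeps each lock window short and makes each failure individually retryable.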
This is also where application code creates hidden dependencies. Code that queries SELECT * and maps results to structs will silently pick up or lose columns. An ORM model that references the column by name will throw errors the moment the column disappears, regardless of whether the database finished the operation cleanly. Schema changes and application deploys have to be sequenced carefully, which usually means deploying code that tolerates the column’s absence before removing it.
5. Production Traffic Makes Everything Slower
All of the above assumes a quiet database. Production databases are not quiet. A table rewrite that takes 20 minutes at 2am with minimal traffic might take four hours at 2pm with full load, because the database is competing with live queries for I/O bandwidth, CPU, and buffer cache.
The buffer cache point is particularly sharp. Databases aggressively cache frequently-accessed pages in memory. A full table rewrite blows through that cache, thrashing it with pages from the table being rebuilt. This degrades performance for every other query running against that database instance, not just the table being modified. At sufficient scale, a poorly timed schema migration can degrade an entire database cluster.
This is why the standard advice is to use online schema change tools, run migrations during low-traffic windows, and test migration time against a production-sized snapshot before touching live data. The tooling exists. The processes exist. The reason teams still get burned is that “just drop the column” feels equivalent to “just delete a file,” and at 500 rows it is. The abstraction leaks at 80 million.
A column is not a file. A table rewrite is not a rename. The database is doing an enormous amount of physical work that the SQL syntax completely conceals, and the concealment is a feature right up until it isn’t.