
About a year ago we started seeing slow query alerts on a table with around 120 million rows and heavy write traffic — a few thousand UPDATEs per minute during peak hours. Dead tuples were accumulating faster than autovacuum could clean them up, which meant table bloat, which meant slower sequential scans, which meant alerts.

The fix was autovacuum tuning, which I expected to take a weekend. It took six months to fully validate and I made at least three changes that I later had to partially revert. Here's what I learned.

The default scale factor is too conservative for large tables

Autovacuum triggers when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * table_rows. The default scale factor is 0.2, meaning a table with 120M rows has to accumulate 24 million dead tuples before autovacuum kicks in. That's a lot.
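To see how far away that trigger point is with the defaults, you can compute it from pg_class. This is a rough estimate: it ignores per-table reloptions overrides and relies on reltuples, which is only as fresh as the last ANALYZE.

SELECT c.relname,
       c.reltuples::bigint AS approx_rows,
       (current_setting('autovacuum_vacuum_threshold')::int
        + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples)::bigint
         AS dead_tuples_to_trigger
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
ORDER BY c.reltuples DESC
LIMIT 10;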

We dropped the scale factor for that specific table to 0.01 using per-table storage parameters, which means autovacuum runs at 1.2M dead tuples instead of 24M:

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_scale_factor = 0.005
);
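To confirm the settings stuck, the per-table overrides show up in pg_class.reloptions:

SELECT reloptions
FROM pg_class
WHERE relname = 'events';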

This helped significantly, but introduced a new problem: autovacuum was now running much more frequently and consuming more I/O, which we noticed in the disk throughput graphs.
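If you want to see those extra runs in flight rather than infer them from disk graphs, pg_stat_progress_vacuum shows each autovacuum worker's current phase and progress. We leaned on the throughput graphs at the time, so this is just a pointer:

-- One row per vacuum currently running, with how far through the heap it is
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_vacuumed, p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;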

Cost delay tuning is where things got complicated

Autovacuum is intentionally throttled to avoid overwhelming I/O. The throttle is controlled by autovacuum_vacuum_cost_delay (default 2ms since PostgreSQL 12) and the cost limits. Lower the delay and vacuum runs faster, at the cost of more I/O.

We dropped the delay to 1ms. This helped with the dead tuple accumulation but caused our p95 query latency to increase slightly during vacuum runs — not dramatically, but enough to show on graphs and trigger some "is something wrong" questions from the team.
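The change itself is a one-liner. This is roughly what it looks like cluster-wide (ALTER SYSTEM needs superuser and a config reload; it can also be set per table as a storage parameter), so treat the exact form as illustrative rather than a transcript of our change:

-- Lower the autovacuum throttle delay from the 2ms default to 1ms
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms';
SELECT pg_reload_conf();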

We ended up at 2ms delay during business hours and scheduled a manual VACUUM ANALYZE during the low-traffic window (02:00-04:00 UTC) via a cron job. The manual vacuum doesn't respect cost delay, so it catches up faster. This felt a bit like giving up on tuning in favor of brute force, but it actually works well.
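The nightly job just issues a plain VACUUM; the statement looks something like this (the exact options in our cron wrapper may differ, so take it as illustrative):

-- Manual VACUUM uses vacuum_cost_delay, which defaults to 0,
-- so unlike autovacuum it runs unthrottled
VACUUM (ANALYZE, VERBOSE) events;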

What I watched to validate

The key query for monitoring dead tuple accumulation:

SELECT relname, n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup,0) * 100, 1) AS dead_pct,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

I ran this every 15 minutes for the first month after each config change, watching the dead_pct trend. The hard part: changes take days or weeks to fully show their effect because you need enough write traffic to go through multiple vacuum cycles.

Table bloat is a separate problem

Even after tuning autovacuum to run more aggressively, the table was still bloated from the months of under-vacuuming before. Regular VACUUM reclaims dead tuple space for reuse within the table but doesn't shrink the file on disk. VACUUM FULL shrinks the file but takes an exclusive lock and rewrites the entire table, which wasn't feasible for a 400GB table with high traffic.

We used pg_repack (not affiliated, just a tool that works) to do an online repack of the table. It took about 4 hours on our hardware and reduced the table size from roughly 380GB to 260GB. We're planning to repeat it annually as bloat re-accumulates.
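To confirm the space actually came back, a before/after size check is enough (table name matches the earlier examples):

SELECT pg_size_pretty(pg_relation_size('events'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('events')) AS with_indexes_and_toast;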
