Operations: The 2AM Debugging Toolkit

This guide is part of a collection of findings drawn from community meetups. For more real-world solutions and insights, browse by specific problem. Suffering from high operational costs? Check out the Cost Optimization community insights guide.

When Everything is Broken: Emergency Diagnostics

Community philosophy: "If something looks odd, even just slightly, something is wrong. Investigate before it gets worse."

The nightmare scenario: "One of the worst days of my life over the past three years... it locked the database: you couldn't insert anything, couldn't read anything, CPU went crazy, memory usage went crazy."

EMERGENCY: Production Incident Queries (Copy-Paste Ready)

When your ClickHouse is down at 2AM, run these in order:
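Below is a minimal sketch of that sequence against the standard system tables (system.errors, system.disks, system.replicas, system.processes); the 300-second thresholds are illustrative, not gospel.

```sql
-- 1. Any non-zero error counters with a recent last_error_time = active incident.
SELECT name, code, value, last_error_time
FROM system.errors
WHERE value > 0
ORDER BY last_error_time DESC
LIMIT 20;

-- 2. Disk space: the "as simple as low disk" check.
SELECT
    name,
    path,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;

-- 3. Replication lag over 5 minutes (300 seconds).
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 300
ORDER BY absolute_delay DESC;

-- 4. Queries that have been running (and holding memory) for over 5 minutes.
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head
FROM system.processes
WHERE elapsed > 300
ORDER BY elapsed DESC;
```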

Learning: Incident Pattern Recognition

Understand the failure modes with working examples:

Memory Exhaustion Detection
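A minimal sketch of the detection side, assuming system.query_log is enabled (the default): look back an hour for queries that died on memory limits.

```sql
-- Queries that recently failed on memory limits, and how much they used.
SELECT
    event_time,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head,
    exception
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND exception LIKE '%Memory limit%'
ORDER BY event_time DESC
LIMIT 20;
```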

Bad Data Detection
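A minimal sketch against system.parts; the date bounds are illustrative, and the min_date/max_date columns are only meaningful for tables partitioned by a Date column.

```sql
-- Active partitions whose date range falls outside a sane window.
-- min_date = 1970-01-01 (no Date partition key) is skipped to avoid noise.
SELECT database, table, partition, min_date, max_date, rows
FROM system.parts
WHERE active
  AND min_date != toDate('1970-01-01')
  AND (min_date < toDate('2015-01-01') OR max_date > today() + 365)
ORDER BY max_date DESC;
```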

The 2AM Methodology

Follow this exact sequence when everything is broken:

Phase 1: Immediate Triage (30 seconds)

  1. Run system.errors - any non-zero = active incident
  2. Check disk space - "It took us from 12 to 4 AM... it was as simple as low disk"
  3. Look for replication lag > 5 minutes

Phase 2: Resource Investigation (2 minutes)

  1. Find memory-hungry queries in system.processes
  2. Check for stuck merges running >1 hour (see the query sketch after this list)
  3. Kill obviously problematic queries
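A minimal sketch of these three steps against system.processes and system.merges; the thresholds and the query_id are placeholders to adapt.

```sql
-- Memory-hungry queries currently running, biggest first.
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;

-- Merges that have been running for more than an hour.
SELECT database, table, elapsed, progress, num_parts
FROM system.merges
WHERE elapsed > 3600
ORDER BY elapsed DESC;

-- Kill an obviously problematic query ('<query_id>' is a placeholder).
-- KILL is cooperative: the query stops at its next cancellation check.
KILL QUERY WHERE query_id = '<query_id>' SYNC;
```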

Phase 3: Data Quality Check (5 minutes)

  1. Look for bad partitions (1998, 2050 dates)
  2. Check for parts explosion (>1000 parts per table); see the query sketch after this list
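The bad-partition check (item 1) is the same query shown under Bad Data Detection above; below is a minimal sketch for the parts-explosion check.

```sql
-- Tables with a suspicious number of active parts
-- (1000 per table is the rule of thumb from above).
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING count() > 1000
ORDER BY active_parts DESC;
```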

Emergency Actions Reference

Production-tested solutions:

| Problem | Detection Query | Solution |
| --- | --- | --- |
| Memory OOM | SELECT * FROM system.processes WHERE memory_usage > 8000000000 | Enable external aggregation (max_bytes_before_external_group_by) |
| Disk Full | SELECT formatReadableSize(sum(bytes_on_disk)) FROM system.parts WHERE active | Delete old partitions, expand disk |
| Replication Lag | SELECT * FROM system.replicas WHERE absolute_delay > 300 | Check network, restart the lagging replica |
| Stuck Query | SELECT * FROM system.processes WHERE elapsed > 300 | KILL QUERY WHERE query_id = '...' |
| Parts Explosion | SELECT table, count() FROM system.parts WHERE active GROUP BY table | Enable async_insert, increase batch sizes |

Note: memory_usage is reported in bytes, so the 8 GB threshold is written as 8000000000.
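The async_insert fix in the last row maps to two session settings; a minimal sketch (whether to wait for the flush is a workload decision):

```sql
-- Collapse many small client-side INSERTs into fewer parts on the server.
SET async_insert = 1;            -- buffer small inserts server-side
SET wait_for_async_insert = 1;   -- acknowledge only after the buffer is flushed
```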

The golden rule: "Problems very rarely just pop out of nowhere; there are signs... investigate before it goes from 15 milliseconds to 30 seconds."

Community War Stories & Lessons

Disk Space Issues:

  • "Default AWS EBS limit of 16TB hits you when you least expect it"
  • "Other nodes keep sending data to the full node creating cascading failure"

Memory Exhaustion:

  • "Out of memory typically appears when you have a big aggregation with a lot of keys"
  • "Enable external aggregation - query will be slower but won't crash"

Bad Data:

  • "Developers send data from 1998 or 2050 causing partition chaos"
  • "Always validate timestamps before they hit production"

The key insight: Most 2AM incidents are preventable if you recognize the warning signs and have ready-to-use diagnostic queries.