Operations: The 2AM Debugging Toolkit

This guide is part of a collection of findings drawn from community meetups. For more real-world solutions and insights, browse by specific problem. Suffering from high operational costs? Check out the Cost Optimization community insights guide.

When Everything is Broken: Emergency Diagnostics

Community philosophy: "If something looks odd, even just slightly, something is wrong. Investigate before it gets worse."

The nightmare scenario: "One of the worst days of my life over the past three years... it locked the database: you couldn't insert anything, couldn't read anything, CPU went crazy, memory usage went crazy."

EMERGENCY: Production Incident Queries (Copy-Paste Ready)

When your ClickHouse is down at 2AM, run these in order:
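Below is a minimal sketch of that sequence against the standard system tables (system.errors, system.disks, system.replicas, system.processes); the 300-second thresholds are illustrative, not gospel.

```sql
-- 1. Any non-zero error counters with a recent last_error_time = active incident.
SELECT name, code, value, last_error_time
FROM system.errors
WHERE value > 0
ORDER BY last_error_time DESC
LIMIT 20;

-- 2. Disk space: the "as simple as low disk" check.
SELECT
    name,
    path,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;

-- 3. Replication lag over 5 minutes (300 seconds).
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 300
ORDER BY absolute_delay DESC;

-- 4. Queries that have been running (and holding memory) for over 5 minutes.
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head
FROM system.processes
WHERE elapsed > 300
ORDER BY elapsed DESC;
```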

Learning: Incident Pattern Recognition

Understand the failure modes with working examples:

Memory Exhaustion Detection
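A minimal sketch of the detection side, assuming system.query_log is enabled (the default): look back an hour for queries that died on memory limits.

```sql
-- Queries that recently failed on memory limits, and how much they used.
SELECT
    event_time,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head,
    exception
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND exception LIKE '%Memory limit%'
ORDER BY event_time DESC
LIMIT 20;
```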

Bad Data Detection
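A minimal sketch against system.parts; the date bounds are illustrative, and the min_date/max_date columns are only meaningful for tables partitioned by a Date column.

```sql
-- Active partitions whose date range falls outside a sane window.
-- min_date = 1970-01-01 (no Date partition key) is skipped to avoid noise.
SELECT database, table, partition, min_date, max_date, rows
FROM system.parts
WHERE active
  AND min_date != toDate('1970-01-01')
  AND (min_date < toDate('2015-01-01') OR max_date > today() + 365)
ORDER BY max_date DESC;
```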

The 2AM Methodology

Follow this exact sequence when everything is broken:

Phase 1: Immediate Triage (30 seconds)

  1. Run system.errors - any non-zero = active incident
  2. Check disk space - "It took us from 12 to 4 AM... it was as simple as low disk"
  3. Look for replication lag > 5 minutes

Phase 2: Resource Investigation (2 minutes)

  1. Find memory-hungry queries in system.processes
  2. Check for stuck merges running >1 hour (see the query sketch after this list)
  3. Kill obviously problematic queries
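A minimal sketch of these three steps against system.processes and system.merges; the thresholds and the query_id are placeholders to adapt.

```sql
-- Memory-hungry queries currently running, biggest first.
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120)         AS query_head
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;

-- Merges that have been running for more than an hour.
SELECT database, table, elapsed, progress, num_parts
FROM system.merges
WHERE elapsed > 3600
ORDER BY elapsed DESC;

-- Kill an obviously problematic query ('<query_id>' is a placeholder).
-- KILL is cooperative: the query stops at its next cancellation check.
KILL QUERY WHERE query_id = '<query_id>' SYNC;
```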

Phase 3: Data Quality Check (5 minutes)

  1. Look for bad partitions (1998, 2050 dates)
  2. Check for parts explosion (>1000 parts per table); see the query sketch after this list
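The bad-partition check (item 1) is the same query shown under Bad Data Detection above; below is a minimal sketch for the parts-explosion check.

```sql
-- Tables with a suspicious number of active parts
-- (1000 per table is the rule of thumb from above).
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING count() > 1000
ORDER BY active_parts DESC;
```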

Emergency Actions Reference

Production-tested solutions:

| Problem | Detection Query | Solution |
| --- | --- | --- |
| Memory OOM | SELECT * FROM system.processes WHERE memory_usage > 8000000000 | Enable external aggregation (max_bytes_before_external_group_by) |
| Disk Full | SELECT formatReadableSize(sum(bytes_on_disk)) FROM system.parts WHERE active | Delete old partitions, expand disk |
| Replication Lag | SELECT * FROM system.replicas WHERE absolute_delay > 300 | Check network, restart the lagging replica |
| Stuck Query | SELECT * FROM system.processes WHERE elapsed > 300 | KILL QUERY WHERE query_id = '...' |
| Parts Explosion | SELECT table, count() FROM system.parts WHERE active GROUP BY table | Enable async_insert, increase batch sizes |

Note: memory_usage is reported in bytes, so the 8 GB threshold is written as 8000000000.
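The async_insert fix in the last row maps to two session settings; a minimal sketch (whether to wait for the flush is a workload decision):

```sql
-- Collapse many small client-side INSERTs into fewer parts on the server.
SET async_insert = 1;            -- buffer small inserts server-side
SET wait_for_async_insert = 1;   -- acknowledge only after the buffer is flushed
```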

The golden rule: "Problems very rarely just pop out of nowhere; there are signs... investigate before it goes from 15 milliseconds to 30 seconds."

Community War Stories & Lessons

Disk Space Issues:

  • "Default AWS EBS limit of 16TB hits you when you least expect it"
  • "Other nodes keep sending data to the full node creating cascading failure"

Memory Exhaustion:

  • "Out of memory typically appears when you have a big aggregation with a lot of keys"
  • "Enable external aggregation - query will be slower but won't crash"

Bad Data:

  • "Developers send data from 1998 or 2050 causing partition chaos"
  • "Always validate timestamps before they hit production"

The key insight: Most 2AM incidents are preventable if you recognize the warning signs and have ready-to-use diagnostic queries.