The most glamorous parts of any distributed system are obvious: the database holding your user records, the application servers processing requests, the CDN pushing bytes to browsers around the world. Engineers obsess over their latency, tune their throughput, and page each other at 3 a.m. when they fall over. Nobody talks much about the small, quiet process sitting in the corner, consuming almost no CPU, generating almost no traffic, doing almost nothing.

That process is often the most important thing running.

The Coordinator Problem

Distributed systems have a coordination problem that doesn’t get enough attention in architecture discussions. When you have multiple nodes doing work, something has to decide which node is authoritative, what happens when a node disappears, and whether a given piece of state is safe to read or write. The natural instinct is to spread this coordination logic across the nodes themselves, letting them gossip and vote and reach consensus organically.

This works, up to a point. But the complexity scales badly, and the failure modes get exotic fast. What you actually want, in many cases, is a dedicated coordination service: something that does nothing except track state and arbitrate between nodes. Apache ZooKeeper was built precisely for this purpose. So was etcd, which Kubernetes uses as its brain. Neither does anything resembling application work. They store small amounts of data, respond to watches, and broker locks. On a busy cluster, etcd might be handling a few thousand operations per second while your application servers are fielding hundreds of thousands. Its resource footprint is laughably small by comparison.

But when etcd goes down, the Kubernetes API server can no longer read or write cluster state, so the control plane stops being able to schedule pods, reschedule failed workloads, or update configuration. The cluster doesn’t crash immediately. It keeps running whatever it was already running. Then, gradually, it starts getting confused. Nodes that fail don’t get replaced. Deployments stall. Alerts that depend on fresh cluster state may never fire. The system doesn’t fail loudly; it fails quietly, and by the time anyone notices, the situation is worse than it needed to be.
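To make the lock-brokering role concrete, here is a toy sketch of a time-limited lease, the basic mechanism a coordination service uses to decide which node is authoritative and to notice when that node disappears. It is an in-memory illustration only: `LeaseTable`, `acquire`, and `holder` are hypothetical names, not the ZooKeeper or etcd API, and a real service would replicate this state across several machines.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LeaseTable:
    """Toy lease broker: one holder per key, ownership lapses after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so the behavior can be tested
        self._leases: Dict[str, Tuple[str, float]] = {}  # key -> (holder, expires_at)

    def acquire(self, key: str, node: str, ttl: float) -> bool:
        """Grant or renew the lease unless another node holds a live one."""
        now = self._clock()
        current = self._leases.get(key)
        if current is not None:
            holder, expires_at = current
            if holder != node and expires_at > now:
                return False  # someone else is still authoritative
        self._leases[key] = (node, now + ttl)
        return True

    def holder(self, key: str) -> Optional[str]:
        """Return the current holder, or None if the lease has lapsed."""
        current = self._leases.get(key)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]
```

A node that stops renewing simply loses the lease when the TTL runs out, which is how a crashed leader gets replaced without anyone detecting the crash directly. It also shows why losing the broker is so quiet: nothing errors, but no lease can be granted or taken over.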

The services generating the least traffic often carry the most risk when they fail.

The Watcher That Never Sleeps

There’s a related pattern in observability infrastructure. Prometheus, the monitoring system that became a de facto standard in Kubernetes environments, runs a scrape loop. It reaches out to instrumented services at regular intervals, collects metrics, and stores them locally. It also evaluates alerting rules and fires alerts to Alertmanager when conditions are met.

Under normal operating conditions, this loop is almost invisible. It consumes modest CPU, a predictable amount of disk, and generates internal network traffic that nobody would notice. Engineers rarely think about it. The dashboards it powers are the thing that gets attention, not the process producing the data underneath them.

Until it stops. When Prometheus goes down, your metrics keep being generated by your applications. They just aren’t being collected. Your dashboards go blank or show stale data. Your alerts don’t fire. You are now flying blind, and whatever incident caused Prometheus to fall over is now happening in the dark. The very system that should tell you something is wrong is itself the broken thing.

This is why serious infrastructure teams treat their monitoring infrastructure with the same redundancy and operational care as their primary application infrastructure. Not because Prometheus is fragile (it isn’t, particularly), but because the failure mode of a silent watcher is categorically worse than the failure mode of a loud application. A broken API server returns 500 errors. A broken alerting system returns nothing at all.

Small Processes, Catastrophic Blast Radius

The pattern extends beyond coordination and monitoring. Consider certificate management. A process like cert-manager, running in a Kubernetes cluster, watches for certificates approaching expiration and renews them automatically. It is not in any request path. It handles no user traffic. For weeks at a time it may do essentially nothing visible. Then a certificate approaches its expiration date, cert-manager renews it silently, and life continues. The only evidence it worked is the absence of an outage.

When cert-manager fails, or when it’s misconfigured, or when the cluster it runs in has a permissions problem that prevents it from completing renewals, nothing immediately breaks. The existing certificate is still valid. The countdown continues. Then the certificate expires, TLS handshakes start failing, and users see connection errors. The gap between cause and effect, sometimes measured in weeks, makes these failures genuinely hard to diagnose. Engineers find themselves debugging TLS errors without immediately thinking to check whether their automated renewal process has been silently broken for a month.
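One way to shorten that gap is to check the certificate actually being served, independently of the renewal process. The sketch below is a minimal example using only the Python standard library; the 21-day threshold and the function names are assumptions chosen for illustration, not anything cert-manager prescribes.

```python
import socket
import ssl

RENEWAL_WARNING_DAYS = 21  # assumed threshold; tune to your renewal schedule


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the served certificate's notAfter string.

    With the default context, an already-expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and the certificate's expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400.0


def renewal_looks_stalled(not_after: str, now: float,
                          warning_days: float = RENEWAL_WARNING_DAYS) -> bool:
    """True if the certificate is closer to expiry than renewal should allow."""
    return days_until_expiry(not_after, now) < warning_days
```

Run on a schedule with something like `renewal_looks_stalled(fetch_not_after("example.com"), time.time())`, this turns a silent month-long failure into a warning that arrives while the old certificate is still valid.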

The blast radius of these quiet processes is disproportionate to their resource consumption, and that disproportionality is exactly what makes them dangerous. Engineers allocate attention roughly in proportion to visibility. A process that handles millions of requests per day gets watched closely. A process that renews certificates once every 60 days gets forgotten between renewals. The fastest code is often the code that never runs, but code that rarely runs also rarely gets tested in production conditions.

The Operational Lesson

There’s a practical principle buried in all of this. The services that sit outside your main request path deserve operational investment in proportion to what breaks when they fail, not in proportion to the resources they consume.

This means a few concrete things. Quiet coordination and monitoring services need their own monitoring. This sounds circular, but it isn’t: you can monitor your monitoring system from a separate, independent system. Many teams use an external uptime check that simply verifies Prometheus is responding and that alerts are flowing through to their destination. The check itself is trivial. The protection it provides is not.
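The two halves of that external check can be sketched roughly as follows. The first function polls Prometheus’s `/-/healthy` endpoint; the base URL is an assumption about your deployment. The second part is a dead man’s switch: it assumes you route an always-firing alert to some external receiver that calls `heartbeat()`, so silence means the alerting pipeline is broken somewhere. `DeadMansSwitch` and its method names are hypothetical.

```python
import http.client
import time
import urllib.request
from typing import Callable


def prometheus_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/-/healthy answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses.
        return False


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within max_silence_seconds."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_heartbeat = clock()

    def heartbeat(self) -> None:
        """Call whenever the always-firing alert is delivered."""
        self._last_heartbeat = self._clock()

    def is_tripped(self) -> bool:
        """True once the pipeline has been silent for too long."""
        return self._clock() - self._last_heartbeat > self._max_silence
```

The important design choice is where this runs: on a host and network that do not share a failure domain with the monitoring stack it watches. The health check catches a dead Prometheus; the switch catches the subtler case where Prometheus is up but alerts never reach anyone.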

It also means these services deserve space in your runbooks and incident playbooks. When a system behaves strangely during an incident, one of the first questions should be whether any coordination or monitoring services are degraded. Too often they’re checked last, after engineers have already spent time debugging symptoms whose root cause was a broken watcher.

Finally, it means being suspicious of services that are easy to ignore. The cron job that runs once a day, the background worker that compacts your database, the health-check sidecar that decides whether a node is fit to receive traffic: these are not afterthoughts. They are load-bearing parts of your infrastructure that happen to be quiet about it. The server doing nothing is holding everything up.