Running your own Kubernetes control plane sounds empowering. No vendor lock-in. Full control. No managed service markup. And then six months later you’re paged at 2 AM because your etcd is out of disk space and half your control plane is dead.
This is not a post to talk you out of self-managed Kubernetes — there are legitimate reasons to run it. But if you’re seriously considering it, you deserve an honest accounting of what you’re signing up for.
The Control Plane Is a Distributed System You Now Own
With a managed service like EKS or GKE, the control plane is someone else’s problem. Self-managed means the API server, controller manager, scheduler, and etcd are yours. All of them.
etcd is the sharpest edge. It’s the source of truth for your entire cluster, and it requires real operational care:
- etcd is extremely sensitive to disk latency. Run it on spinning disks and you’ll see mysterious leader elections and timeouts. SSDs or bust.
- The database grows over time. You need scheduled compaction and defragmentation, or it will bloat and eventually hit storage limits.
- Backups are not automatic. You need to snapshot etcd on a schedule and — critically — test that you can actually restore from those snapshots. “We have backups” and “we can restore from backups” are very different things.
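A minimal snapshot-and-verify sketch with etcdctl (endpoints, cert paths, and backup locations are illustrative; adjust for your topology):

```shell
# Take a snapshot (endpoint and cert paths assume a typical kubeadm layout)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd/snap-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable — a snapshot you can't read is not a backup
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd/snap-$(date +%F).db -w table

# Restore into a NEW data dir, ideally on a clean machine during a drill
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/snap-$(date +%F).db \
  --data-dir=/var/lib/etcd-restored
```

The restore step is the one teams skip; run it on a schedule, not just once.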
High availability means more complexity. A production control plane needs at least three nodes for etcd quorum. That also means a load balancer sitting in front of the API server — one you have to provision, configure, and maintain.
Upgrades are manual and order-dependent. Kubernetes has strict component skew rules. You upgrade etcd first, then the API server, then the controller manager and scheduler, then your worker nodes. Get the order wrong or let components drift too far apart in versions and you’re in unsupported territory. Kubernetes also drops old versions fast — you’ll be doing this upgrade dance roughly every 6-9 months to stay on a supported release. And every upgrade means checking compatibility matrices for your CNI, ingress controller, storage driver, and any other add-ons. They all have their own version support windows.
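A sketch of what one upgrade cycle looks like on a kubeadm-built cluster (the version number and node name are illustrative; read the release notes before every upgrade):

```shell
# On the first control-plane node:
sudo kubeadm upgrade plan            # shows available versions and warnings
sudo kubeadm upgrade apply v1.31.1   # upgrades stacked etcd and control-plane components in order

# On each remaining control-plane node:
sudo kubeadm upgrade node

# Then each worker, one at a time:
kubectl drain worker-1 --ignore-daemonsets
# ...upgrade the kubelet and kubectl packages on the node, then:
sudo systemctl restart kubelet
kubectl uncordon worker-1
```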
Networking Is Your Problem (All of It)
Kubernetes gives you a networking model. It does not give you a network.
Your CNI choice matters more than you think. Flannel is simple and works, but has no NetworkPolicy support. Calico adds NetworkPolicy but introduces its own operational surface. Cilium is powerful — eBPF-based, excellent observability, great performance — but has a steeper learning curve and requires a modern kernel. Most teams pick a CNI at setup and never revisit it, which is fine until they need something it doesn’t support.
Ingress is a blank slate. You need to pick, install, configure, and maintain an ingress controller. NGINX Ingress and Traefik are common choices. Then you need cert-manager for TLS, plus whatever DNS management you’re using. None of this ships with the cluster.
CoreDNS tuning is often skipped. Default CoreDNS settings work fine at small scale. At higher pod counts or request rates, DNS becomes a bottleneck that’s genuinely hard to diagnose. Know how to tune it before you need to.
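One concrete knob: the cache plugin in the CoreDNS Corefile, edited through the coredns ConfigMap in kube-system. A trimmed sketch of where that setting lives (not a recommendation for your cluster):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    cache 30          # seconds; raising this trades freshness for query load
    forward . /etc/resolv.conf
    loop
    reload
}
```

At larger scale, also look at NodeLocal DNSCache, which moves caching onto each node and cuts traffic to the central CoreDNS pods.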
Certificates Will Expire at the Worst Time
Kubernetes clusters have a lot of TLS certificates — the API server cert, etcd peer certs, kubelet client certs, the front-proxy cert, and more. They all have expiry dates.
The default expiry for most certificates generated by kubeadm is one year. kubeadm will automatically rotate these when you run an upgrade, so if you upgrade at least once a year, you’re mostly covered. If you don’t upgrade (maybe the cluster is “stable” and nobody touches it), those certs will expire and your cluster will stop working with very little warning.
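kubeadm ships tooling for exactly this; a quick sketch (run on a control-plane node):

```shell
# Show every kubeadm-managed cert and when it expires
sudo kubeadm certs check-expiration

# Renew the leaf certs (not the CA); restart the control-plane
# static pods afterward so they pick up the new certs
sudo kubeadm certs renew all
```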
The cluster CA itself has a 10-year default expiry. That sounds like plenty of time — until year ten actually arrives, and you discover that rotating a cluster CA is an involved, multi-step procedure you do not want to learn under deadline.
Build cert expiry monitoring into your observability stack from day one. kube-state-metrics doesn’t track cert expiry out of the box; you’ll need something like x509-certificate-exporter or custom alerting.
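If you'd rather not run another exporter, even a cron job plus openssl catches the worst case. A self-contained sketch (it generates a throwaway demo cert; in real life you'd point the check at /etc/kubernetes/pki/*.crt):

```shell
# Demo only: create a throwaway self-signed cert valid for 365 days
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 365 2>/dev/null

# The actual check: alert if a PEM cert expires within the next 30 days
warn_seconds=$((30 * 24 * 3600))
if openssl x509 -checkend "$warn_seconds" -noout -in /tmp/demo.crt >/dev/null; then
  echo "OK: cert valid for at least 30 more days"
else
  echo "ALERT: cert expires within 30 days"
fi
```

Wire the ALERT branch into whatever pages you, and run it against every cert on every control-plane node.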
Node Lifecycle Is a Real Workflow
In a managed service, node upgrades and replacements are largely automated. Self-managed means you own the node lifecycle:
- OS patching: drain the node, apply patches, reboot, uncordon. At scale, this is a workflow you need to build and maintain. Do it node by node to avoid taking down too much capacity at once.
- Node autoscaling is not built in. You can run Cluster Autoscaler with a cloud provider integration, but you configure and operate it yourself.
- Configuration drift across nodes is sneaky. If kubelet configs, containerd settings, or kernel parameters diverge between nodes, you’ll get weird, hard-to-reproduce bugs.
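The patching workflow above can be sketched as a rolling loop (flags will vary, and this iterates over every node including control-plane nodes, so filter as needed):

```shell
# Rolling OS patch, one node at a time
for node in $(kubectl get nodes -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=5m
  # ...apply patches and reboot the node here (ssh, Ansible, etc.)...
  kubectl uncordon "$node"
done
```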
You Get Zero Observability Out of the Box
A fresh Kubernetes cluster has no metrics, no logs, no dashboards, and no alerts. You’re starting from scratch.
At minimum you need:
- metrics-server for kubectl top and HPA to function
- kube-state-metrics and node-exporter for cluster and node metrics
- Prometheus + Alertmanager for metric collection and alerting
- Something for logs — Loki, EFK stack, or forwarding to a cloud service
- Grafana or equivalent for dashboards
This is a project, not a one-hour task. Budget time to build it properly and test that your alerts actually fire when they should.
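One common starting point (not an endorsement): the kube-prometheus-stack Helm chart bundles Prometheus, Alertmanager, node-exporter, kube-state-metrics, and Grafana. Release and namespace names here are illustrative:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```

You still own logs, dashboards, and alert routing; the chart just saves you the initial assembly.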
Security Defaults Are Underwhelming
Kubernetes ships with RBAC enabled, which is good. But several other defaults require active hardening:
Secrets are not encrypted at rest. By default, Kubernetes secrets are stored as base64-encoded data in etcd — base64 is encoding, not encryption. Anyone with access to etcd (or its backups) can read your secrets in plaintext. Encryption at rest requires an explicit EncryptionConfiguration file on the control-plane host, passed to the API server via the --encryption-provider-config flag.
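A sketch of what that configuration file looks like (the key is a placeholder; generate your own with something like `head -c 32 /dev/urandom | base64`):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}   # fallback so existing unencrypted data stays readable
```

After enabling it, rewrite existing secrets (kubectl get secrets -A -o json | kubectl replace -f -) so they actually get encrypted.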
Default service account permissions are often too broad. Many workloads run with the default service account, which may have more permissions than needed. Audit RBAC regularly.
Network policies are opt-in. Without NetworkPolicy resources, all pods can talk to all other pods. This is a fully flat network by default. Write policies early, before you have dozens of services that are hard to reason about.
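A common first policy is default-deny ingress per namespace; a sketch with an illustrative namespace name:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app        # illustrative
spec:
  podSelector: {}          # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```

From there, explicitly allow only the flows each service needs.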
Pod Security Admission (which replaced the deprecated PodSecurityPolicy) is available but not enforced by default. Decide on your policy (privileged/baseline/restricted per namespace) and apply it intentionally.
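Applying Pod Security Admission is just namespace labels; a sketch (namespace name is illustrative):

```shell
kubectl label namespace my-app \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted
```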
Backup and Disaster Recovery Is Your Entire Responsibility
Beyond etcd snapshots, you need a story for workload-level backup and restore.
Velero is the standard tool — it backs up Kubernetes resources and, with provider plugins, persistent volume snapshots. Install it, configure it, schedule backups, and then test restoring from those backups into a clean cluster. If you haven’t done a restore drill, you don’t have a backup strategy, you have backup theater.
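An illustrative Velero flow, assuming the server and a storage provider are already configured (schedule and backup names are made up):

```shell
# Daily backup at 03:00, retained for 30 days
velero schedule create daily-backup --schedule="0 3 * * *" --ttl 720h

# List what exists, then restore one into the current cluster
velero backup get
velero restore create --from-backup daily-backup-20250101030000
```

The drill that matters is restoring into a clean cluster, not the one the backup came from.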
When Does Self-Managed Kubernetes Actually Make Sense?
Given all of the above, it’s worth being honest about when to run vanilla Kubernetes vs. a managed service:
Good reasons to self-manage:
- Air-gapped or on-premises environments where managed services aren’t available
- Hard data sovereignty or compliance requirements that rule out cloud providers
- You want to deeply understand Kubernetes internals (legitimate, but expensive in time)
- Extreme scale where control plane costs from managed services are significant
Reasons that sound good but aren’t:
- “We want full control” — you can get a lot of control from managed services with less operational burden
- “We don’t want to be locked in” — Kubernetes workloads are portable; the control plane isn’t the lock-in risk
- “It’s cheaper” — until you factor in engineering time
If you have a small platform team or no dedicated platform engineers, a managed service will almost certainly serve you better. The toil of self-managed Kubernetes is real, recurring, and roughly fixed in size: a five-person team pays about the same operational tax as a fifty-person one, and feels it far more.
The Bottom Line
Self-managed Kubernetes is a legitimate choice for the right organization. But go in clear-eyed about what you’re taking on: a distributed system you need to operate, upgrade, secure, back up, and monitor. None of that is automatic, and the failure modes are painful.
If you do choose to run it, invest in automation early. Treat cluster upgrades as routine, not heroic. Build your observability before you need it. And please, for the love of etcd, test your backups.
Have you run vanilla Kubernetes in production? What gotcha hit you hardest? I’d love to hear it.