While we’re still on the topic of scaling, is CPU really the metric you want to use here?
You might think about scaling replicas based on utilization metrics like CPU using the built-in K8s HorizontalPodAutoscaler (HPA) resource.
If pod utilization reaches a threshold, create more replicas. If utilization drops, reduce replicas. It makes sense, right?
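That approach looks something like this minimal sketch, assuming a Deployment named `my-app` and a 70% CPU target (both illustrative):

```yaml
# Hypothetical example: a CPU-based HPA targeting a Deployment called "my-app".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add replicas when average CPU crosses 70%
```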
The problem with HPA?
CPU is a lagging signal: by the time utilization crosses the threshold, your users are already feeling the load.
K8s node autoscalers need a minute or two to add capacity and schedule newly created replicas. While utilization stays pegged during that gap, HPA often overcompensates, creating more replicas than needed.
Next thing you know, the node autoscaler supplies the needed capacity, HPA starts reducing replicas – and your cluster is left with terrible utilization.
Bad data inputs lead to bad outputs, meaning scaling by simplistic utilization metrics causes subpar scaling results.
How do you solve this?
High-maturity teams scale workloads based on business metrics: number of user sessions, jobs unprocessed, etc.
The best tool for the job right now is KEDA, which can scale based on message queues like Pub/Sub, database queries, and more (there are dozens of built-in scalers, or you can write your own).
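As a sketch of what queue-based scaling looks like, here is a hypothetical KEDA ScaledObject using the RabbitMQ scaler. The deployment name, queue name, and target queue length are all illustrative assumptions:

```yaml
# Hypothetical example: scale "worker" based on RabbitMQ queue depth instead of CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker          # illustrative Deployment name
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs   # illustrative queue name
        mode: QueueLength
        value: "20"       # target ~20 unprocessed jobs per replica
        hostFromEnv: RABBITMQ_HOST  # connection string read from an env var
```

Because the trigger fires on unprocessed jobs rather than CPU, scaling kicks in as soon as work queues up, before pods are saturated.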
This means replicas are created earlier and scaling decisions happen faster, without overcompensation. Users get snappy performance, and you get much lower infrastructure costs.