While we’re still on the topic of scaling, is CPU really the metric you want to use here?
You might think about scaling replicas based on utilization metrics like CPU using the built-in K8s HorizontalPodAutoscaler (HPA) resource.
If pod utilization reaches a threshold, create more replicas. If utilization drops, reduce replicas. It makes sense, right?
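That approach looks something like this minimal sketch, assuming a Deployment named `my-app` and a 70% CPU target (both illustrative):

```yaml
# Hypothetical example: a CPU-based HPA targeting a Deployment called "my-app".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add replicas when average CPU crosses 70%
```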
The problem with HPA?
CPU is a lagging signal: by the time utilization crosses the threshold, your users are already feeling the load.
K8s node autoscalers need a minute or two to add capacity and schedule newly created replicas. While utilization stays pegged during that gap, HPA often overcompensates, creating more replicas than needed.
Next thing you know, the node autoscaler supplies the needed capacity, HPA starts reducing replicas – and your cluster is left with terrible utilization.
Bad data inputs lead to bad outputs, meaning scaling by simplistic utilization metrics causes subpar scaling results.
How do you solve this?
High-maturity teams scale workloads based on business metrics: number of user sessions, jobs unprocessed, etc.
The best tool for the job right now is KEDA, which can scale based on message queues like Pub/Sub, database queries, and more (there are dozens of built-in scalers, or you can write your own).
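As a sketch of what queue-based scaling looks like, here is a hypothetical KEDA ScaledObject using the RabbitMQ scaler. The deployment name, queue name, and target queue length are all illustrative assumptions:

```yaml
# Hypothetical example: scale "worker" based on RabbitMQ queue depth instead of CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker          # illustrative Deployment name
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs   # illustrative queue name
        mode: QueueLength
        value: "20"       # target ~20 unprocessed jobs per replica
        hostFromEnv: RABBITMQ_HOST  # connection string read from an env var
```

Because the trigger fires on unprocessed jobs rather than CPU, scaling kicks in as soon as work queues up, before pods are saturated.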
This means replicas are created earlier and scaling decisions happen faster, without overcompensation. Users get snappy performance, and you get much lower infrastructure costs.