Kubernetes Autoscaling in 2026: HPA, VPA, KEDA, Cluster Autoscaler, and Karpenter

Autoscaling in Kubernetes does not operate at a single layer. HPA, VPA, KEDA, Cluster Autoscaler, and Karpenter solve different problems: some change the number of pods, others help determine appropriate requests and limits, and others add or remove cluster nodes.

More specifically:

  • HPA (Horizontal Pod Autoscaler) scales the number of replicas when the load is clearly reflected in CPU, memory, or the application's internal metrics.
  • VPA (Vertical Pod Autoscaler) helps determine the appropriate pod size: requests and limits, which affect scheduling, cost, and load calculation for HPA.
  • KEDA (Kubernetes Event-driven Autoscaling) is needed when the primary signal is outside the pod: a queue, Kafka lag, events, a schedule, or a cloud metric.
  • Cluster Autoscaler adds or removes nodes within predefined node groups when pods do not have enough capacity.
  • Karpenter selects capacity more flexibly to meet pod requirements: instance types, zones, spot/on-demand, NodePool limits, and consolidation.

The main mistake is choosing a tool for the wrong signal. HPA can increase replicas, but it will not add nodes. KEDA can detect a queue, but it will not fix incorrect requests. VPA can recommend a pod size, but it will not protect against a sudden traffic spike. Cluster Autoscaler and Karpenter can provide capacity, but they will not decide how many replicas the application needs.

Cost savings also do not appear simply because an autoscaler has been enabled. Cost is affected by requests, minReplicas, maxReplicas, node pool limits, warm-up time, scale-down policy, SLO, and whether cold starts are acceptable. The right sequence is therefore to first define the load signal, then the scaling object, then the upper and lower scaling boundaries, and only after that choose HPA, VPA, KEDA, Cluster Autoscaler, or Karpenter.

Why HPA Alone Is Usually Not Enough

A team enables HPA, sees the number of replicas increase, and considers scaling handled. A week later, the picture looks different: some pods are stuck in Pending, the Kafka queue is growing faster than CPU usage, nighttime scale-to-zero causes cold starts, and the AWS, Google Cloud, or Azure bill barely decreases.

The problem is usually not HPA itself. It simply operates at its own layer: the number of replicas. If pods have nowhere to run, you need node-level autoscaling. If load is reflected not in CPU usage but in queue length or consumer lag, you need an event-driven signal. If pods have been running for years with incorrect resource requests, autoscaling decisions will be based on distorted inputs.

That is why the key question is not “which autoscaler is better,” but “what exactly needs to be scaled”: the number of pods, the size of a pod, or the cluster nodes.

The Three Levels of Autoscaling in Kubernetes

With the summary above in place, the basic idea needs to be stated explicitly: in Kubernetes, autoscaling does not scale an abstract “load”; it scales specific objects. A tool may work correctly but still fail to solve the problem if the wrong scaling level is chosen.

For example, an API service sees an increase in traffic. HPA scales the Deployment from 5 to 20 replicas, but some of the new pods remain Pending: the current nodes do not have enough free CPU and memory when requests are taken into account. In this case, HPA has already done its job: it requested more replicas. The problem then moves to the cluster level: Cluster Autoscaler or Karpenter is needed to add compute capacity.
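The Pending situation in this example can be reproduced with a small sketch. It is a simplified first-fit placement, not the real scheduler: the `Node` class, the `schedule` function, and all numbers are hypothetical, and only CPU and memory requests are considered.

```python
from dataclasses import dataclass


@dataclass
class Node:
    free_cpu_m: int   # free CPU in millicores (assumed value)
    free_mem_mi: int  # free memory in MiB (assumed value)


def schedule(nodes, replicas, cpu_m, mem_mi):
    """Greedy first-fit placement by requests; returns (scheduled, pending)."""
    free = [[n.free_cpu_m, n.free_mem_mi] for n in nodes]
    scheduled = 0
    for _ in range(replicas):
        for slot in free:
            if slot[0] >= cpu_m and slot[1] >= mem_mi:
                slot[0] -= cpu_m
                slot[1] -= mem_mi
                scheduled += 1
                break
    return scheduled, replicas - scheduled


# HPA asks for 15 extra replicas (5 -> 20), each requesting 500m CPU and 512 MiB.
nodes = [Node(2000, 4096), Node(2000, 4096), Node(2000, 4096)]
print(schedule(nodes, replicas=15, cpu_m=500, mem_mi=512))  # (12, 3)
```

Three replicas stay Pending even though HPA behaved correctly: the limit here is free CPU on the nodes, which only a node-level autoscaler can change.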

Before choosing a tool, it is useful to separate the areas of responsibility:

Level | What it changes | Tools | Typical signal
Application replicas | Number of pods | HPA, KEDA | CPU, memory, external metrics, queue, events
Pod size | requests and limits | VPA | Historical CPU/RAM usage
Cluster nodes | Available compute capacity | Cluster Autoscaler, Karpenter | Pending pods, underutilized nodes

This table helps avoid a common mistake: trying to cover all levels with a single autoscaler. HPA and KEDA are responsible for the number of replicas, but they do not guarantee that there will be capacity to schedule them. VPA helps determine the right pod size, but it does not create more replicas when traffic spikes. Cluster Autoscaler and Karpenter add or remove nodes, but they do not decide how many instances of the application need to run.

Therefore, HPA, VPA, KEDA, Cluster Autoscaler, and Karpenter should not be viewed as direct substitutes for one another. They answer different questions: how many pods are needed, what size they should be, and whether the cluster has enough capacity to run them.

Application-level autoscaling: HPA, VPA, and KEDA

After separating the layers, you can move on to the first layer: the application. This is where teams typically choose between HPA, VPA, and KEDA, but these tools answer different questions. HPA and KEDA change the number of replicas, while VPA helps determine the appropriate pod size. If these are conflated, autoscaling will start responding to the wrong problem.

HPA: when the signal is visible from within the pod

HPA scales a Deployment, ReplicaSet, or StatefulSet by changing the number of pods. It is typically used for web/API services where load is well reflected in CPU, memory, or custom metrics, such as requests per second, latency, or the size of an internal queue.

An important detail: CPU utilization is calculated relative to requests. If a pod consumes 500 millicores with a request of 1 CPU, HPA sees that as 50%. If the same pod consumes 500 millicores with a request of 250 millicores, that is already 200%.
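The arithmetic above can be checked with a short sketch. The replica formula follows the rule documented for HPA, ceil(currentReplicas × currentMetric / desiredMetric). The function names, the 70% target, and the starting count of 5 replicas are assumptions for illustration, and the sketch ignores tolerance, min/max bounds, and stabilization.

```python
import math


def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization as HPA sees it: usage relative to the request, not to node size."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Simplified HPA rule: ceil(currentReplicas * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


usage = 500  # the same real consumption in both cases
print(cpu_utilization_percent(usage, 1000))  # 50.0
print(cpu_utilization_percent(usage, 250))   # 200.0

# With an assumed 70% target and 5 running replicas:
print(desired_replicas(5, 50.0, 70.0))   # 4
print(desired_replicas(5, 200.0, 70.0))  # 15
```

The same consumption leads to a scale-down in one case and a threefold scale-up in the other, purely because of the request value.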

As a result, incorrect requests distort HPA decisions. Overestimated values lead to delayed scale-up and unnecessary reserved capacity. Underestimated values lead to aggressive replica growth, CPU throttling, unstable metrics, and noisy alerts. HPA works well only when the baseline pod size is set close to reality.

VPA: when the problem is pod sizing

VPA solves a different problem: it adjusts requests and limits. It is useful when a service has been running with outdated limits for a long time, consistently requests too little memory, or, conversely, reserves too much CPU and RAM.

However, VPA is not a replacement for HPA when traffic increases. If the number of requests grows fivefold, a single properly sized pod may still not be enough. VPA helps determine how large a pod should be, but it does not answer the question of how many replicas the application needs.
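To make the idea of sizing from history concrete, here is a minimal sketch that derives a CPU request from usage samples. It is not VPA's actual recommender, which is more involved; the 90th percentile, the 15% margin, the function names, and the sample values are all assumptions chosen for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def recommend_request(samples_millicores, percentile=90, margin_percent=15):
    """Nearest-rank percentile of observed usage plus a safety margin."""
    ordered = sorted(samples_millicores)
    rank = max(1, ceil_div(percentile * len(ordered), 100))
    return ceil_div(ordered[rank - 1] * (100 + margin_percent), 100)


# Hypothetical CPU usage samples in millicores, including one outlier.
history = [120, 150, 180, 200, 210, 220, 250, 260, 300, 900]
print(recommend_request(history))  # 345
```

The result describes how large one pod should be. It says nothing about how many pods are needed when traffic grows fivefold.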

Care is required when using VPA together with HPA based on CPU or memory. VPA changes requests, while HPA calculates utilization relative to those values. Without agreed rules (for example, keeping VPA in recommendation mode, or driving HPA with a metric other than the one VPA adjusts), one tool changes the calculation baseline while the other reacts to the shifted metric.

KEDA: when the signal comes from outside

KEDA is needed when the primary signal is not inside the pod. This may be the length of a RabbitMQ or SQS queue, Kafka lag, a schedule, an event, a cloud metric, or another external source.

This scenario is common for consumers. CPU usage may still be low, but the queue is already growing, and what matters to the business is not CPU utilization but the time it takes to process messages. In this situation, CPU-based HPA reacts too late, while KEDA scales replicas based on the actual load signal.
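A rough sketch of lag-based scaling looks like this. It mirrors the general idea of an average-value target per replica and is not KEDA's implementation: the function name and parameters are hypothetical, and activation thresholds, cooldown, and partition limits are left out.

```python
import math


def replicas_for_lag(total_lag, lag_per_replica, min_replicas, max_replicas):
    """Replicas needed so that each one owns at most lag_per_replica messages."""
    wanted = math.ceil(total_lag / lag_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Assumed target: 500 messages of lag per replica, bounds 1..20.
print(replicas_for_lag(3000, 500, 1, 20))   # 6
print(replicas_for_lag(12000, 500, 1, 20))  # 20 (capped by max_replicas)
print(replicas_for_lag(0, 500, 1, 20))      # 1 (floor at min_replicas)
```

CPU never appears in the calculation, which is why this signal reacts while CPU-based scaling still sees an idle consumer.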

KEDA often does not compete with HPA; instead, it manages HPA under the hood, adding signal sources that are not available in a standard CPU-based configuration. For batch scenarios, KEDA provides ScaledJob: it starts jobs based on events, but this is a separate mode, not a universal replacement for a Deployment.

At the application level, the choice can be summarized as follows:

  • Signal inside the pod — HPA;
  • Wrong pod size — VPA;
  • Signal outside the pod — KEDA;
  • Batch events — KEDA ScaledJob;
  • Pods have no nodes to run on — node-level autoscaling is required.

If HPA or KEDA has already created new pods but they remain Pending, the issue is no longer at the application level. The next step is to look at cluster capacity: whether there are enough nodes, which node groups or NodePools are available, and which component is responsible for adding capacity.

Node-Level Autoscaling: Cluster Autoscaler and Karpenter

Cluster Autoscaler and Karpenter manage the cluster’s compute capacity, not the number of application replicas. Their main signals are pods in the Pending state that could not be scheduled due to insufficient capacity, or nodes that can be removed after the load decreases.
 

Cluster Autoscaler: predictable node groups

Cluster Autoscaler, or CA, is the classic node-level autoscaler. It looks at pods that failed to be scheduled, checks whether a node can be added to an existing node group, and increases the size of that group.

When load decreases, CA looks for nodes whose pods can be safely moved elsewhere and scales down the node group. This approach works well for clusters with predefined groups, such as general-purpose, GPU, memory-optimized, and others.

CA’s main limitation is that it operates within the node groups that have already been defined. If a group is restricted to specific instance types, zones, or sizes, CA scales that exact set. This is predictable and easy to control, but it can be less flexible: the cluster may wait for a specific node type even though a cheaper or faster-to-provision alternative is available in the cloud.

Karpenter: flexible capacity provisioning

Karpenter addresses the same problem more dynamically. It provisions nodes based on the actual requirements of pods: CPU, memory, architecture, zone, instance type, and cost and availability constraints.

A key concept in Karpenter is the NodePool. It is a set of rules for creating nodes for specific workloads. For example, you can restrict instance types, zones, the Spot/On-Demand strategy, resource limits, and placement requirements.

Another important mechanism is consolidation: packing workloads more densely and removing underutilized nodes. Karpenter can determine that pods can be placed at lower cost or more densely, create a new node, move the workload, and remove the excess capacity.
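The packing question behind consolidation can be illustrated with a first-fit-decreasing sketch. This is a generic bin-packing heuristic, not Karpenter's algorithm. It looks only at CPU requests and ignores memory, affinity, PodDisruptionBudgets, and pricing. The function name and all numbers are hypothetical.

```python
def nodes_needed(pod_requests_m, node_capacity_m):
    """First-fit decreasing: how many nodes of one size the pods would need."""
    free_per_node = []
    for req in sorted(pod_requests_m, reverse=True):
        if req > node_capacity_m:
            raise ValueError(f"pod request {req}m exceeds node capacity")
        for i, free in enumerate(free_per_node):
            if free >= req:
                free_per_node[i] = free - req
                break
        else:
            # No existing node has room, so open a new one.
            free_per_node.append(node_capacity_m - req)
    return len(free_per_node)


# Pods currently spread over 4 nodes with 4000m allocatable CPU each.
pods = [1500, 1000, 1000, 500, 500, 500, 250, 250]
current_nodes = 4
needed = nodes_needed(pods, 4000)
print(needed, current_nodes - needed)  # 2 2
```

In this example two of the four nodes could be removed, but only by evicting and rescheduling pods.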

However, this flexibility requires constraints. Consolidation involves voluntary evictions, so it must be coordinated with PodDisruptionBudget, disruption budgets, graceful shutdown, and availability requirements. Otherwise, savings on nodes can turn into application instability.

How to choose between CA and Karpenter

In practice, the difference can be boiled down to three rules:

  • Cluster Autoscaler is simpler for clusters with clear node groups and moderate dynamics.
  • Karpenter is more useful when fast capacity selection, a mix of instance types, spot/on-demand, expensive node pools, and actively reducing idle capacity are important.
  • Neither tool replaces HPA, VPA, or KEDA: they provide placement for pods, but they do not decide how many pods an application should have.

Therefore, node-level autoscaling should be viewed as an extension of application autoscaling. HPA or KEDA can request more pods, and VPA can refine their size, but only Cluster Autoscaler or Karpenter can provide the capacity those pods need to run.

After that, you can move on to a practical choice based on the load signal: CPU, memory, queues, batch jobs, volatile traffic, or expensive node pools require different combinations of tools.

Decision matrix: which autoscaler to choose based on the load signal

Once the autoscaling layers are separated, it is easier to choose the right tool based on the signal. You need to identify where the bottleneck is: inside the application, in an external event stream, in the pod size, or at the node level.

The selection matrix looks like this:

Signal or scenario | Primary tool | What else to check
CPU load for web/API | HPA | requests, minReplicas, maxReplicas, pod startup delay
Memory load | HPA based on memory or VPA | Memory requests/limits, memory leaks, cache behavior
Queue, Kafka lag, consumer backlog | KEDA | maxReplicaCount, cooldown, processing rate of a single replica
Event-driven load with idle periods | KEDA scale-to-zero | Cold start, application warm-up, acceptable latency
Batch jobs | KEDA ScaledJob or Kubernetes Jobs | Parallelism, quotas, backoff, impact on services with SLOs
Pods in Pending | Cluster Autoscaler or Karpenter | requests, quotas, affinity, node group or NodePool limits
Expensive node pools | Karpenter | NodePool limits, spot/on-demand, consolidation, PDB
Persistently incorrect pod sizing | VPA | min/max allowed, VPA mode, compatibility with HPA

This matrix helps assign ownership of the decision. HPA and KEDA handle the number of replicas. VPA helps determine the right pod size. Cluster Autoscaler and Karpenter handle cluster capacity.

For a web/API service with CPU load, HPA is usually enough, but only if requests are set close to actual usage. If requests are too high, HPA may react too late when load increases. If they are too low, you may get aggressive scale-up, throttling, and noisy alerts.

For consumers, it is more important to look not only at CPU, but also at the queue or lag. If the queue is growing while CPU is still low, CPU-based HPA will react too late. In this case, KEDA better reflects the real problem: how many messages have accumulated and how quickly they need to be processed.

For batch jobs, it is important to limit parallelism. Otherwise, autoscaling can create many short-lived pods, quickly provision expensive nodes, and displace services with SLOs. This requires quotas, Job limits, a backoff policy, and clear boundaries for the impact on the cluster.

If pods have already been created but remain in Pending, changing HPA or KEDA thresholds will not help. The problem has moved to the node level: you need to look at Cluster Autoscaler or Karpenter, node group or NodePool limits, available zones, instance types, quotas, and placement constraints.

In practice, the choice is almost always a combination of tools. An API may use HPA based on CPU, VPA in recommendation mode to refine requests, and Karpenter to control expensive node pools. A consumer may scale with KEDA based on lag, but as the number of replicas grows, it still needs a node-level autoscaler.
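One way to see why the answer is usually a combination is to express the matrix as a small lookup. This is only a sketch of the article's reasoning: the function and flag names are hypothetical, and real decisions also depend on the checks listed in the matrix.

```python
def pick_autoscalers(*, cpu_or_memory_signal=False, external_signal=False,
                     batch_events=False, sizing_off=False, pods_pending=False):
    """Return the tools the matrix suggests for the observed signals."""
    tools = []
    if cpu_or_memory_signal:
        tools.append("HPA")
    if external_signal:
        tools.append("KEDA")
    if batch_events:
        tools.append("KEDA ScaledJob")
    if sizing_off:
        tools.append("VPA")
    if pods_pending:
        tools.append("Cluster Autoscaler or Karpenter")
    return tools


# An API with CPU load, stale requests, and pods stuck in Pending:
print(pick_autoscalers(cpu_or_memory_signal=True, sizing_off=True, pods_pending=True))
# ['HPA', 'VPA', 'Cluster Autoscaler or Karpenter']

# A queue consumer that outgrew the cluster:
print(pick_autoscalers(external_signal=True, pods_pending=True))
# ['KEDA', 'Cluster Autoscaler or Karpenter']
```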

Autoscaling and Cost Optimization: Where It Saves Money and Where Hidden Overspending Occurs

Autoscaling can reduce costs, but it does not guarantee savings automatically. It removes excess capacity only when the boundaries are configured correctly: requests, minReplicas, maxReplicas, node pool limits, the scale-down policy, and the acceptable warm-up time.

The challenge is that the same setting affects two things at once: cost and reliability. If you keep too many replicas and oversized nodes, the infrastructure becomes expensive. If you reduce everything too aggressively, you can end up with cold starts, higher p95/p99 latency, and pods stuck in Pending during peak load.

A simple example: an API runs across three zones. If minReplicas=6, the service maintains warm spare capacity and handles the morning peak faster. If the minimum is reduced to 2, the bill will go down, but when load begins, pods and nodes may need several minutes to start. For an internal service, this may be acceptable; for a critical API, it already creates a risk of an SLA violation.

It is better to view the main settings as trade-offs:

Setting | What it provides | Where the risk is
minReplicas | Maintains a minimum buffer or reduces costs during idle periods | A value that is too low causes cold starts and increased latency
maxReplicas | Limits costs and protects downstream systems | A ceiling that is too low may not handle peak load
requests | Affect scheduling, HPA, and placement density | Overstating them leads to overspending; understating them leads to throttling and OOM
cooldown / stabilization window | Smooths replica reduction after a peak | Scale-down that is too fast causes flapping
node pool / NodePool limits | Prevent expensive nodes from growing unchecked | With strict limits, pods may remain in Pending
Consolidation | Removes underutilized nodes | Aggressive evictions can impact availability

Resource requests are especially important. They affect three things at once: how HPA calculates utilization, how the scheduler places pods, and how densely the workload fits onto nodes. Overstated requests buy unused capacity. Understated requests look cost-efficient only until CPU throttling, OOM, restarts, and unexpected autoscaler decisions begin.
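A quick way to see both failure modes is to compare the request with observed usage. The sketch below is illustrative only: the function name, the choice of p95 usage as the reference, and all numbers are assumptions.

```python
def request_report(request_m, p95_usage_m, replicas):
    """Compare reserved CPU with observed p95 usage across all replicas."""
    reserved = request_m * replicas
    idle = max(0, request_m - p95_usage_m) * replicas
    return {
        "reserved_m": reserved,
        "idle_m": idle,
        "idle_share": idle / reserved,
        "understated": p95_usage_m > request_m,
    }


# Overstated: 1000m requested, about 300m used, 10 replicas.
print(request_report(1000, 300, 10))
# {'reserved_m': 10000, 'idle_m': 7000, 'idle_share': 0.7, 'understated': False}

# Understated: 250m requested, about 500m used.
print(request_report(250, 500, 10))
# {'reserved_m': 2500, 'idle_m': 0, 'idle_share': 0.0, 'understated': True}
```

In the first case 70% of the reserved CPU is paid for but unused. In the second nothing looks wasted, yet the pods consume twice what the scheduler and HPA assume.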

minReplicas and maxReplicas define the boundaries of behavior. The minimum determines how much warm capacity the team is willing to keep in advance. The maximum limits costs and protects dependencies: the database, broker, and external APIs. But if the maximum is too strict, autoscaling itself can become the cause of an SLA violation during peak load.

Warm-up time should also be treated as part of scaling. An autoscaler does not create usable capacity instantly: first a signal appears, then a pod is created, a node is brought up if needed, the image is pulled, readiness probes pass, connections are opened, and caches are warmed. If the full path takes 3–5 minutes, minReplicas=0 is not suitable for every workload.
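The path described above can be added up explicitly. The step names follow the paragraph, but every duration is a hypothetical placeholder that should be replaced with measurements from a real workload.

```python
# Hypothetical step durations in seconds; measure your own workload.
WARMUP_STEPS = {
    "metric_detected": 30,
    "pod_scheduled": 5,
    "node_provisioned": 90,
    "image_pulled": 40,
    "readiness_passed": 20,
    "caches_warmed": 30,
}


def time_to_capacity(steps, node_needed=True):
    """Seconds from the load signal to a replica that actually serves traffic."""
    return sum(sec for name, sec in steps.items()
               if node_needed or name != "node_provisioned")


def scale_to_zero_ok(steps, tolerated_delay_s):
    """Scale-to-zero is only reasonable if the full cold path fits the tolerance."""
    return time_to_capacity(steps) <= tolerated_delay_s


print(time_to_capacity(WARMUP_STEPS))                     # 215
print(time_to_capacity(WARMUP_STEPS, node_needed=False))  # 125
print(scale_to_zero_ok(WARMUP_STEPS, tolerated_delay_s=60))   # False
print(scale_to_zero_ok(WARMUP_STEPS, tolerated_delay_s=300))  # True
```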

That is why saving money in Kubernetes is not about “scaling everything to zero at night.” The right goal is to manage the buffer: keep it where it is needed for the SLA, and remove it where it merely hides incorrect requests, minimums that are too high, or expensive nodes with no real load.

The next step is to apply this logic to common scenarios: APIs, queues, batch jobs, unstable traffic, and expensive node pools.

Configuration for common scenarios

API services

For a critical API, you should not blindly reduce minReplicas. The calculation should account not only for CPU, but for the entire startup sequence: pulling the image, starting the pod, readiness probes, connecting to dependencies, and warming up the cache.

If latency increases before CPU usage does, an HPA based on CPU alone may not be enough. In that case, consider adding a requests-per-second metric, latency, or another application-level metric that better reflects the actual load on the service.

Kafka, RabbitMQ, and SQS queues

For queues, CPU is less important than the rate at which messages accumulate and are processed. You need to understand how many messages a single replica can handle and how quickly the queue needs to return to normal.

In KEDA, maxReplicaCount should not be set “with maximum headroom.” It must account for the limits of downstream systems: databases, external APIs, brokers, or the service to which the consumer sends the result. Scaling consumers too quickly may not speed up processing and may instead overload a dependency.
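Both constraints can be put side by side in a small calculation: how many replicas are needed to drain the backlog in time, and how many the downstream system can tolerate. The function names and all rates are hypothetical, and the model assumes that throughput scales linearly with replicas.

```python
import math


def replicas_to_drain(backlog, arrival_rate, per_replica_rate, drain_seconds):
    """Replicas needed to keep up with arrivals and clear the backlog in time."""
    required_rate = arrival_rate + backlog / drain_seconds
    return math.ceil(required_rate / per_replica_rate)


def downstream_cap(downstream_budget, per_replica_load):
    """Largest replica count the dependency can absorb."""
    return downstream_budget // per_replica_load


# 60,000 queued messages, 200 msg/s arriving, 50 msg/s per replica, 10-minute goal.
wanted = replicas_to_drain(60_000, 200, 50, 600)
# The database tolerates about 250 writes/s; each replica produces about 50 writes/s.
cap = downstream_cap(250, 50)
print(wanted, cap, min(wanted, cap))  # 6 5 5
```

Here the dependency, not the queue, sets the ceiling: a maxReplicaCount of 5 means the backlog drains more slowly than desired, which is still preferable to overloading the database.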

Batch jobs and event-driven workloads

For batch scenarios, you need to limit concurrency, the number of active jobs, and namespace quotas in advance. Otherwise, background processing can crowd out API services or create expensive short-lived capacity.

Limits on Jobs, carefully chosen backoffLimit and TTL settings, separate node pools or NodePool resources, and priorities via PriorityClass help here. The main goal is to prevent a batch workload from consuming the entire cluster simply because a large number of events arrived.

Volatile traffic

Short spikes usually require rapid scale-up and more cautious scale-down. Scale-up should react quickly enough, while scale-down is best smoothed with a stabilization window and a pause before reducing replicas.

This reduces flapping: a situation where pods are created and deleted faster than they can deliver value. With volatile traffic, overly aggressive cost savings often degrade p95/p99 more than they reduce costs.
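The smoothing effect of a stabilization window can be sketched as a rolling maximum over recent recommendations, which is how HPA's scale-down stabilization is generally described. The function name, the window expressed in evaluation steps rather than seconds, and the sample series are assumptions.

```python
def stabilized(recommendations, window):
    """At each step, apply the highest recommendation seen in the trailing window."""
    applied = []
    for i in range(len(recommendations)):
        start = max(0, i - window + 1)
        applied.append(max(recommendations[start:i + 1]))
    return applied


# Raw desired replica counts under spiky traffic (hypothetical).
raw = [10, 4, 9, 3, 8, 3, 3, 3, 3]
print(stabilized(raw, window=3))
# [10, 10, 10, 9, 9, 8, 8, 3, 3]
```

The raw series swings between 3 and 10 replicas on almost every step. The stabilized series drifts down gradually and only drops to 3 once the low value has persisted for the whole window.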

Expensive node pools

For GPU, memory-optimized, or large on-demand instances, it is important to set limits on the NodePool or node group. Otherwise, regular workloads may accidentally land on expensive capacity, and autoscaling can quickly drive up the bill.

This calls for taints/tolerations, affinity, NodePool limits, separate rules for spot/on-demand capacity, and careful consolidation. Karpenter is useful for flexible instance selection, but it must be constrained with PDBs, disruption budgets, and availability requirements.

The same overall check applies to all scenarios: if new pods go into Pending, replica scaling has already hit the cluster’s capacity limit. At that point, you need to check Cluster Autoscaler or Karpenter, node pool limits, quotas, and available instance types instead of continuing to adjust HPA or KEDA thresholds.

Conclusion

In practice, choosing an autoscaling approach in Kubernetes starts not with the tool, but with the signal and the level of control. If load is increasing within the application, you need HPA or request adjustments via VPA. If a queue or external event stream is growing, KEDA is the more logical choice. If pods have already been created but cannot be scheduled, the issue has moved to the node layer, where Cluster Autoscaler or Karpenter comes into play.

The main risk is that an autoscaler may be enabled but controlling the wrong entity. The practical sequence is to identify the signal, determine the scaling target, define scale-up and scale-down limits, and verify startup latency, SLOs, and cost. Only then does autoscaling become a controlled capacity planning mechanism rather than a source of unpredictable expenses.

FAQ

How does HPA differ from KEDA?

HPA scales the number of replicas based on CPU, memory, or metrics available through the Kubernetes metrics API. KEDA adds external signals: queues, Kafka lag, events, schedules, and cloud metrics. In practice, KEDA often manages HPA under the hood, but provides it with load sources that are not available in a standard CPU-based configuration.

Can HPA and VPA be used together?

Yes, but with caution. VPA changes resource requests, while HPA calculates CPU and memory utilization relative to those requests. If both tools affect the same workload without clear rules, HPA behavior can become unstable. VPA is often used in recommendation mode, while HPA is left to scale replicas.

Why do pods stay in Pending even though the HPA is working?

Because the HPA increases the number of pods, but it does not add nodes. If the scheduler cannot find available CPU or memory that satisfies the configured requests, new pods remain in Pending. In this case, check Cluster Autoscaler or Karpenter, node group or NodePool limits, quotas, affinity, and available instance types.

When should you choose Karpenter instead of Cluster Autoscaler?

Karpenter is useful when you need more flexible capacity management: different instance types, expensive node pools, spot/on-demand, NodePool limits, consolidation, and rapid node selection to match pod requirements. Cluster Autoscaler is simpler and more predictable for traditional node groups with moderate scaling dynamics.

Is scale-to-zero safe?

Yes, if the workload can tolerate a cold start. For internal jobs, infrequent events, and non-critical handlers, this can be a reasonable cost-saving measure. For APIs with strict SLAs or scenarios where the time to first response matters, it is better to keep the minimum number of replicas above zero.

What has the greatest impact on autoscaling costs?

The biggest factors are requests, minReplicas, maxReplicas, node pool or NodePool limits, warm-up time, the scale-down policy, and consolidation. Simply enabling an autoscaler does not guarantee savings: if the boundaries are configured incorrectly, scaling can even increase the bill.

