
Optimizing Kubernetes Costs: Requests/Limits, Quotas, Spot Nodes, and Rightsizing

Last updated: Jun 16, 2026 · 18 minutes reading time

In Kubernetes, even an “underutilized” cluster can be expensive. kubectl top and Grafana show actual CPU and memory consumption, but the scheduler places Pods based not on average usage, but on declared requests. If teams allocate resources “with headroom,” Kubernetes treats that capacity as occupied, even if the application does not use it most of the time.

The cost chain looks like this:

container requests → pod requests → scheduler → node allocatable → number of nodes → cloud bill

Therefore, reducing limits by itself usually does not lower the bill. Limits cap a container’s consumption after it starts, but they do not make the cluster keep fewer nodes. For cost savings, it is more important to configure requests, Pod packing density, node pools, and autoscaling correctly.

The main levers work together:

Rightsizing — brings over-provisioned requests closer to the actual load profile, while leaving headroom for peaks and SLA compliance;
ResourceQuota — limits the aggregate requests in a namespace and helps align technical limits with team budgets;
LimitRange — sets default values and limits for Pods without proper requests and limits;
Autoscaling and node pools — turn freed-up capacity into a smaller number of billable nodes;
Spot nodes — reduce the cost of part of the capacity, but are suitable only for workloads that can tolerate interruptions.

The correct sequence is as follows: first, understand how much CPU and memory Kubernetes already considers reserved; then identify overprovisioned requests and check peaks, throttling, OOMKilled events, evictions, and SLAs. After that, you can introduce quotas, review node pools, configure autoscaling, and move suitable workloads to Spot.

The main criterion for successful optimization is not an attractive graph showing low utilization, but a reduction in paid-for but unused capacity without degrading the service.

How Kubernetes requests translate into cluster cost

The key question is why a cluster can be expensive even when the charts show CPU and memory to spare. The reason is that Kubernetes makes Pod placement decisions based not on average actual usage, but on declared requests.

The scheduler looks at requests, not average usage

For the scheduler, what matters is how much CPU and memory a Pod requested up front. If a container is configured with 500m CPU and 1Gi of memory, Kubernetes assumes that exactly that amount must be reserved for the container, even if the application uses less most of the time.

A Pod’s total request is the sum of the requests from all containers in it. This means you need to account not only for the main application, but also for sidecar containers: a service mesh proxy, logging agents, a security agent, and metrics containers. A lightweight service can become noticeably heavier for the scheduler because of its infrastructure overhead.
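
The summation above can be sketched in a few lines of Python. This is a minimal illustration, not how the scheduler is implemented: the container names and values are hypothetical, and the sketch ignores init containers and Pod overhead, which Kubernetes also takes into account.

```python
from decimal import Decimal

# Suffixes for Kubernetes memory quantities (binary first, then decimal).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as '500m' or '1.5' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(Decimal(value) * 1000)


def parse_memory_bytes(value: str) -> int:
    """Convert a memory quantity such as '512Mi' or '2Gi' to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(Decimal(value[: -len(suffix)]) * factor)
    return int(value)  # plain number of bytes


def pod_requests(containers: list[dict]) -> tuple[int, int]:
    """Sum the requests of all regular containers: (millicores, bytes)."""
    cpu = sum(parse_cpu_millicores(c["requests"]["cpu"]) for c in containers)
    mem = sum(parse_memory_bytes(c["requests"]["memory"]) for c in containers)
    return cpu, mem


# Hypothetical Pod: a small application plus two sidecars.
containers = [
    {"name": "app", "requests": {"cpu": "500m", "memory": "1Gi"}},
    {"name": "mesh-proxy", "requests": {"cpu": "100m", "memory": "128Mi"}},
    {"name": "log-agent", "requests": {"cpu": "50m", "memory": "64Mi"}},
]

cpu_m, mem_bytes = pod_requests(containers)
print(f"Pod request: {cpu_m}m CPU, {mem_bytes // 2**20}Mi memory")
```

In this example the sidecars add 150m CPU and 192Mi to every replica, which is what the scheduler reserves regardless of how little the sidecars actually use.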

The scheduler then looks for a node with enough available capacity. But available capacity is not the full size of the virtual machine. A node has node allocatable: some CPU and memory has already been subtracted for the kubelet, system reserves, the container runtime, and DaemonSets. So a node with 8 vCPUs does not mean that all 8 vCPUs can be allocated to user Pods.
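
As a rough sketch of that subtraction, the snippet below derives the capacity left for user Pods on one node. All reservation values are hypothetical and differ between providers and kubelet configurations. Strictly speaking, Kubernetes computes node allocatable as capacity minus kubelet and system reservations and the eviction threshold; DaemonSet Pods then consume part of that allocatable amount.

```python
def capacity_for_user_pods(
    capacity: int,
    kube_reserved: int,
    system_reserved: int,
    eviction_threshold: int,
    daemonset_requests: int,
) -> int:
    """Capacity left for user Pods on one node, in the caller's unit."""
    allocatable = capacity - kube_reserved - system_reserved - eviction_threshold
    return allocatable - daemonset_requests


# Hypothetical node with 8 vCPU and 32 GiB (CPU in millicores, memory in MiB).
cpu_left = capacity_for_user_pods(8000, 200, 200, 0, 400)
mem_left = capacity_for_user_pods(32768, 1024, 1024, 512, 1536)

print(f"CPU for user Pods: {cpu_left}m")
print(f"Memory for user Pods: {mem_left / 1024:.1f} GiB")
```

With these assumed reservations, the 8 vCPU node offers about 7.2 vCPU and 28 GiB to user workloads.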

Where the gap between charts and the bill appears

The same cost chain applies here: container requests → pod requests → scheduler → node allocatable → number of nodes → cloud bill.

This is exactly where the gap appears between “the cluster is barely utilized” and “the infrastructure is expensive.” A process may actually consume little CPU and memory, but to the scheduler, those resources are already taken by requests. It cannot place a new Pod on CPU or RAM that is formally reserved by another Pod.

For example, a Pod requests 2 vCPU and 8 GiB RAM but typically consumes 400m CPU and 2 GiB RAM. On a chart, this looks uneventful. But for scheduling purposes, that Pod occupies exactly 2 vCPU and 8 GiB. If there are many such Pods, the cluster runs out of capacity based on requests before it runs out based on actual consumption.

Memory often becomes the limiting factor. CPU may appear to be available, but a new Pod still cannot be scheduled if there are no suitable nodes left in terms of requested memory.

Why the autoscaler adds nodes

If Pods do not fit on the current nodes, Cluster Autoscaler or Karpenter can add new ones. In the cloud, this is a direct cost: you pay for nodes or VMs, not for the average resource consumption of the processes inside Pods.

As a result, costs increase not only because of the millicores and gigabytes of memory that are actually used. They also increase because of the capacity the cluster must maintain for the declared requests.

For cost calculations, this is easy to understand through a simple rule: if a request is higher than actual usage, the unused portion becomes idle cost — paid-for capacity that sits unused.
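
The idle cost rule can be written as a small calculation. The sketch below uses the Pod from the earlier example (2 vCPU and 8 GiB requested, 400m and 2 GiB used). The hourly prices are made-up placeholders, not figures from any provider, and splitting a node price into per-vCPU and per-GiB rates is itself a modeling assumption.

```python
def idle_cost_per_hour(
    cpu_request: float,
    cpu_usage: float,
    mem_request_gib: float,
    mem_usage_gib: float,
    cpu_price: float,
    mem_price: float,
) -> float:
    """Hourly cost of the reserved-but-unused part of a request."""
    idle_cpu = max(cpu_request - cpu_usage, 0.0)          # vCPU
    idle_mem = max(mem_request_gib - mem_usage_gib, 0.0)  # GiB
    return idle_cpu * cpu_price + idle_mem * mem_price


# Hypothetical prices: 0.03 per vCPU-hour and 0.004 per GiB-hour.
cost = idle_cost_per_hour(2.0, 0.4, 8.0, 2.0, cpu_price=0.03, mem_price=0.004)
print(f"Idle cost: {cost:.3f} per hour, about {cost * 730:.2f} per month")
```

The `max(..., 0.0)` guard keeps the result at zero when usage exceeds the request; that situation is a stability concern rather than an idle cost.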

This also defines the right starting point for optimization: first, identify Pods, containers, and sidecars with overprovisioned requests. Only then should you move on to limits, quotas, node pools, and spot nodes. Otherwise, the team will be fighting visible utilization while the billable problem is elsewhere — in the capacity Kubernetes already considers allocated.

Requests and limits: what affects scheduling and what constrains consumption

With the requests → scheduler → nodes → bill chain in mind, a practical question arises: which manifest parameters actually affect cost, and which only control container behavior after startup?

At first glance, it may seem that lowering limits should also reduce the bill. But the cloud provider bills for nodes, while Kubernetes schedules Pods based on requests. As a result, savings do not come from the mere fact of lowering limits, but from changing node requirements: their number, size, or type.

Requests are the resource amounts that Kubernetes treats as required to run a Pod. CPU is expressed in CPU units: 1 CPU corresponds to one vCPU or core, and 500m (500 millicores) is half of that. Memory is usually specified in Mi and Gi, for example 512Mi or 2Gi. These are the values the scheduler uses for placement.

limits serve a different purpose. They define the upper bound on a container’s resource consumption after it starts. They are not a direct cost-saving tool, but a protective mechanism: they constrain a process if it begins to consume too many resources and interferes with neighboring workloads on the node.

The difference can be summarized as follows:

Parameter	What it means	Affects scheduling	Main risk
CPU request	Declared CPU requirement	Yes	Overstating it leaves capacity idle; understating it leads to unstable performance
Memory request	Declared memory requirement	Yes	Overstating it causes expensive overprovisioning; understating it can lead to eviction under memory pressure
CPU limit	CPU upper bound	Not directly	Throttling, higher latency, reduced throughput
Memory limit	Memory upper bound	Not directly	OOMKilled, restarts, instability

Requests determine how much capacity Kubernetes considers occupied, while limits define protective consumption boundaries. To reduce costs, you need to adjust reservations and drive changes in node requirements. To maintain stability, do not set limits aggressively without analyzing peaks, SLAs, and throttling or OOMKilled events.

CPU limits for API services and background workers require particular care. A team may lower the CPU limit expecting to reduce costs, but the nodes will remain the same. Instead, the application will hit the limit more often: throttling will increase, latency will rise, throughput will decrease, and the financial impact may be zero.

With memory, the risk is more severe. Exceeding the memory limit usually results in OOMKilled and a container restart. Memory is therefore often both a scheduling constraint and a source of instability during spikes. CPU on the nodes may appear available, but a Pod will not be scheduled if no node has enough unreserved memory to cover its memory request. And a memory limit that is too low can turn a normal load spike into a restart.

Correctly separating these parameters gives you two different control levers. Through requests, the team manages cost and scheduling density. Through limits, it manages isolation and runtime risks.
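
To make the two levers concrete, here is a hypothetical container fragment expressed as a Python dictionary, with two helper functions that separate what the scheduler reserves from what the runtime enforces. The helper names are invented for illustration. One detail the sketch reflects: when a limit is set without a request and no admission-time default (such as a LimitRange) supplies one, Kubernetes uses the limit as the request.

```python
# Hypothetical container spec fragment; the numbers are illustrative only.
container = {
    "name": "api",
    "resources": {
        "requests": {"cpu": "500m", "memory": "1Gi"},  # used for scheduling
        "limits": {"cpu": "1", "memory": "2Gi"},       # enforced at runtime
    },
}


def scheduling_view(container: dict) -> dict:
    """Resources the scheduler reserves for this container."""
    resources = container.get("resources", {})
    requests = dict(resources.get("requests", {}))
    # A limit without a matching request is treated as the request.
    for name, limit in resources.get("limits", {}).items():
        requests.setdefault(name, limit)
    return requests


def runtime_view(container: dict) -> dict:
    """Upper bounds applied after the container starts."""
    return dict(container.get("resources", {}).get("limits", {}))


print("Scheduler reserves:", scheduling_view(container))
print("Runtime enforces:  ", runtime_view(container))
```

Lowering only the values in `limits` changes the second view and leaves the first untouched, which is why it does not by itself change how many nodes the cluster needs.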

Practical example: rightsizing a namespace before and after

After separating requests and limits, it is useful to look at how rightsizing affects cost in practice. Important: optimization does not necessarily reduce the application’s actual resource consumption. It reduces the capacity that Kubernetes treats as occupied for scheduling.

Consider a hypothetical namespace. Based on the graphs, it looks quiet: about 12 vCPU and 60 GiB of RAM in actual usage. However, the Pods in the namespace request a total of 40 vCPU and 160 GiB of RAM. For the scheduler, this is a heavy load, even if the processes actually use less.

After revising the requests, the picture might look like this:

Metric	Before rightsizing	After rightsizing	What changes
Namespace CPU request	40 vCPU	18–22 vCPU	Less CPU is treated as occupied
Namespace memory request	160 GiB	80–90 GiB	Memory is freed up for scheduling Pods
Actual CPU usage	about 12 vCPU	no significant change	The application has not started consuming less CPU
Actual memory usage	about 60 GiB	no significant change	The workload profile stayed the same
Headroom for peaks	excessive	closer to the actual profile	Headroom remains, but it no longer ties up excess capacity
Potential impact on nodes	more nodes are kept allocated because of requests	some nodes can be freed up	Savings appear at the level of paid VMs

The conclusion from the table is straightforward: rightsizing does not “speed up” the application or make it consume fewer resources. It removes excess reservation that causes the cluster to keep more paid capacity than it needs.

Now add a hypothetical calculation. Suppose the namespace is placed in a separate node pool or occupies a sufficiently large share of a shared pool. One node has 8 vCPU and 32 GiB of RAM, but after system reservations, DaemonSets, and system processes, about 7.2 vCPU and 28 GiB of RAM are available for user Pods.

Before rightsizing, the namespace requests 40 vCPU and 160 GiB of RAM. That is 40 / 7.2 ≈ 5.6 nodes for CPU and 160 / 28 ≈ 5.7 nodes for memory, so about 6 nodes in both cases. After rightsizing, we take the upper bound of the new requests: 22 vCPU and 90 GiB of RAM. That is 22 / 7.2 ≈ 3.1 nodes for CPU and 90 / 28 ≈ 3.2 nodes for memory, which fits into about 4 nodes.

This gives a modeled effect: the pool can shrink from 6 to 4 nodes. The hypothetical savings are 2 nodes, or about 33% of the cost of this pool.
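
The same model can be reproduced with a short calculation. It is deliberately naive: it assumes requests pack perfectly onto nodes and ignores fragmentation, zones, and anti-affinity, so it gives a lower bound on the node count rather than a forecast.

```python
import math


def nodes_needed(
    cpu_request: float,
    mem_request_gib: float,
    node_cpu: float,
    node_mem_gib: float,
) -> int:
    """Minimum node count, driven by whichever resource needs more nodes."""
    by_cpu = math.ceil(cpu_request / node_cpu)
    by_mem = math.ceil(mem_request_gib / node_mem_gib)
    return max(by_cpu, by_mem)


# Capacity available for user Pods per node, as in the example above.
NODE_CPU, NODE_MEM_GIB = 7.2, 28

before = nodes_needed(40, 160, NODE_CPU, NODE_MEM_GIB)
after = nodes_needed(22, 90, NODE_CPU, NODE_MEM_GIB)
savings_pct = round(100 * (before - after) / before)

print(f"Nodes before: {before}, after: {after}, modeled savings: {savings_pct}%")
```

Taking the maximum of the CPU-driven and memory-driven counts reflects the point that either resource can be the one that keeps a node in the pool.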

This is not a guaranteed savings amount. In a real environment, the result depends on Pod fragmentation, the workload mix, availability zone requirements, anti-affinity, the minimum node pool size, and autoscaler settings. But the model shows the main point: savings appear not when a number in the manifest is reduced, but when paid nodes can then be removed, or at least the next scale-out event can be postponed.

It is important not to use average consumption as the new request. In the example, 12 vCPU and 60 GiB of RAM are a lower bound for analysis, not a safe new request. You need headroom for peaks, deployments, background tasks, and daily fluctuations. Therefore, CPU requests are reduced not to 12 vCPU, but to the 18–22 vCPU range; memory is reduced not to 60 GiB, but to 80–90 GiB.

CPU and memory should be checked separately. You may free up a lot of CPU but still be unable to remove a node if memory remains fragmented. Or the opposite may happen: reducing memory requests can sharply improve scheduling, even though CPU already looked available in the graphs.

If nodes are not removed after rightsizing, this is not always a failure. The cluster may simply take longer to reach the scaling threshold. The next step is to examine Pod distribution, node sizes, resource fragmentation, and autoscaler settings.

Quotas and overprovisioned namespaces: how to curb teams’ appetite

In a shared cluster with several namespaces, each team tends to keep extra capacity "just in case." As a result, teams compete not for actual consumption, but for reserved capacity, and the cluster grows even without a noticeable increase in load.

ResourceQuota helps set an upper bound on the total CPU/RAM requests and limits allowed in a namespace. But a quota is not an automatic optimization mechanism. It curbs appetite, but it does not verify whether specific requests are justified.

For example, a team is given a namespace with a quota of 80 vCPU. In practice, its services use about 20 vCPU, but their combined requests are already set to 60 vCPU. Formally, the team is still within the limit. But for the scheduler, that capacity is already reserved, other Pods are scheduled less efficiently, the autoscaler adds nodes, and costs rise because of inflated reservations within the “approved” budget.
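
The numbers in this example can be turned into a simple per-namespace report. The function and field names below are hypothetical; in practice the inputs would come from the quota object and from monitoring data.

```python
def namespace_reservation_report(
    quota_vcpu: float, requested_vcpu: float, used_vcpu: float
) -> dict:
    """Compare quota, requests, and actual usage for one namespace."""
    return {
        # Share of the approved quota already taken by requests.
        "quota_used_pct": round(100 * requested_vcpu / quota_vcpu, 1),
        # How many times larger the reservation is than actual usage.
        "request_to_usage_ratio": round(requested_vcpu / used_vcpu, 2),
        # Capacity that is reserved for scheduling but not consumed.
        "idle_reserved_vcpu": max(requested_vcpu - used_vcpu, 0),
    }


report = namespace_reservation_report(quota_vcpu=80, requested_vcpu=60, used_vcpu=20)
print(report)
```

The team is at 75% of its quota, so nothing is formally wrong, yet it reserves three times what it uses and holds 40 vCPU of idle reservation.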

For quotas to work as a management tool, you need not only numbers, but also rules:

Caps on requests and limits at the namespace level;
Regular quota reviews based on consumption data;
Cost visibility for the team: how much CPU/RAM it reserves;
A clear procedure for requesting an increase if the workload has actually grown;
A link between quotas and rightsizing, not just an administrative restriction.

Without a process, quotas quickly become either a formality or a source of blocked releases. A ceiling that is too loose establishes overprovisioning as the norm. One that is too strict can stop a deployment at the wrong moment.

LimitRange can also be used: it sets default values and acceptable bounds for Pods without properly configured requests and limits. But LimitRange also does not replace workload profile analysis. It helps ensure resource settings are not left unspecified or uncontrolled, but it does not determine how much a specific service actually needs.
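
As an illustration, the two objects can be generated as JSON manifests from Python, a format that kubectl also accepts. The namespace, object names, and every number below are placeholders chosen for the example, not recommended values.

```python
import json


def resource_quota(namespace: str, hard: dict) -> dict:
    """Build a ResourceQuota manifest capping aggregate requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-budget", "namespace": namespace},
        "spec": {"hard": hard},
    }


def limit_range(namespace: str) -> dict:
    """Build a LimitRange with per-container defaults and an upper bound."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container declares no requests.
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    # Applied when a container declares no limits.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                    # Largest limits a single container may declare.
                    "max": {"cpu": "4", "memory": "8Gi"},
                }
            ]
        },
    }


quota = resource_quota(
    "team-a",  # hypothetical namespace
    {
        "requests.cpu": "80",
        "requests.memory": "320Gi",
        "limits.cpu": "120",
        "limits.memory": "400Gi",
    },
)
defaults = limit_range("team-a")

print(json.dumps(quota, indent=2))
print(json.dumps(defaults, indent=2))
```

The quota caps the namespace total, while the LimitRange fills in and bounds individual containers. Neither object says whether 80 vCPU is the right number for the team.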

The takeaway is simple: quotas define budget boundaries for teams, while rightsizing verifies that resources within those boundaries are not reserved unnecessarily. After configuring quotas, you therefore need to move on to a safe process for reviewing requests and limits, so that cost savings do not turn into SLA violations.

How to rightsize without degrading the service

ResourceQuota sets a ceiling for a namespace, but it does not determine which requests and limits are safe for a specific service. That is the purpose of rightsizing: not a one-time “cutting of headroom,” but a regular data-driven review that accounts for peaks, SLAs, and the actual ability to free up nodes.

The main mistake is looking only at average usage. An API service may use little CPU for most of the day, but during a peak, deployment, or batch processing window it can become resource-constrained and cause latency to increase. For CPU, this mistake often shows up as throttling. For memory, the consequences are more severe: values that are too low can lead to OOMKilled events, evictions, and restarts.

Safe rightsizing is best performed as a procedure:

Collect historical usage data for the service’s full operating cycle;
Look not only at averages, but also at p95/p99, peaks, seasonality, cron jobs, and batch processing windows;
Analyze CPU and memory separately;
Check throttling, OOMKilled events, evictions, latency, error rate, and SLO violations;
Account for the workload type: API, queue consumer, cron job, batch/ML, or stateful service;
Change requests and limits gradually, through releases and observation;
After the changes, check whether nodes have been freed up or whether scaling has at least been deferred.
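
One step of this procedure, deriving a candidate request from a high percentile plus headroom instead of the average, can be sketched as follows. The nearest-rank percentile, the choice of p95, and the 20% headroom are assumptions for illustration; the right values depend on the workload type and its SLA, and the result is a starting point for review rather than a value to apply blindly.

```python
import math


def percentile(samples: list[int], p: int) -> int:
    """Nearest-rank percentile of usage samples (p between 1 and 100)."""
    if not samples:
        raise ValueError("no usage samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def candidate_request(samples: list[int], p: int = 95, headroom_pct: int = 20) -> int:
    """Candidate request: chosen percentile of usage plus headroom."""
    return math.ceil(percentile(samples, p) * (100 + headroom_pct) / 100)


# Hypothetical CPU usage samples in millicores over a full operating cycle.
usage_m = list(range(1, 101))

print("average:", sum(usage_m) // len(usage_m))
print("p95:", percentile(usage_m, 95))
print("candidate request:", candidate_request(usage_m))
```

For these samples the average is about 50m while the candidate request is 114m, which shows how far an average-based request would sit below the observed peaks.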

Autoscaling tools help, but they do not replace this process. VPA can provide recommendations for requests, but automatic mode should be enabled carefully in production: it does not know all business peaks and SLA requirements. HPA changes the number of replicas, but it does not fix oversized requests. Cluster Autoscaler or Karpenter will remove nodes only when enough schedulable capacity has been freed to remove an entire node.
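
Why an autoscaler needs an entire node's worth of freed capacity can be shown with a simplified simulation. This is not the algorithm used by Cluster Autoscaler or Karpenter. It is a first-fit check under strong assumptions: only CPU and memory requests matter, and affinity rules, PodDisruptionBudgets, and zones are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_m: int   # CPU request in millicores
    mem_mi: int  # memory request in MiB


@dataclass
class Node:
    name: str
    cpu_m: int   # capacity available for user Pods, millicores
    mem_mi: int  # capacity available for user Pods, MiB
    pods: list = field(default_factory=list)

    def free(self) -> list:
        """Unreserved [cpu_m, mem_mi] according to Pod requests."""
        return [
            self.cpu_m - sum(p.cpu_m for p in self.pods),
            self.mem_mi - sum(p.mem_mi for p in self.pods),
        ]


def can_remove(candidate: Node, others: list) -> bool:
    """True if every Pod on the candidate fits on the remaining nodes."""
    free = {n.name: n.free() for n in others}
    # Place the largest Pods first to reduce fragmentation.
    for pod in sorted(candidate.pods, key=lambda p: (p.mem_mi, p.cpu_m), reverse=True):
        for cap in free.values():
            if pod.cpu_m <= cap[0] and pod.mem_mi <= cap[1]:
                cap[0] -= pod.cpu_m
                cap[1] -= pod.mem_mi
                break
        else:
            return False  # this Pod has nowhere to go
    return True


node_a = Node("a", 7200, 28672, [Pod("x", 6000, 20000)])
node_b = Node("b", 7200, 28672, [Pod("y", 5000, 26000)])

oversized = Node("c", 7200, 28672, [Pod("p1", 2000, 8000), Pod("p2", 2000, 8000)])
rightsized = Node("c", 7200, 28672, [Pod("p1", 1000, 4000), Pod("p2", 1000, 2000)])

print("removable with oversized requests: ", can_remove(oversized, [node_a, node_b]))
print("removable with rightsized requests:", can_remove(rightsized, [node_a, node_b]))
```

In this hypothetical setup the node becomes removable only after the requests of its Pods shrink enough to fit into the free capacity of the other two nodes.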

Correct rightsizing is therefore not simply “we reduced requests and it became cheaper.” Two outcomes need to be confirmed: the service has not lost stability, and the cluster can actually place Pods more efficiently. If nodes are not removed immediately, the effect can still be useful: the cluster will take longer to reach the scaling threshold.

Once requests, quotas, and rightsizing are in order, you can move on to the next cost-saving lever: spot nodes. But they should be considered only after basic optimization: a cheaper node will not fix an oversized Pod request.

Spot nodes: where they reduce costs and where they create risk

Spot nodes are lower-cost but interruptible capacity from a cloud provider. The provider can reclaim such a node when capacity is constrained or conditions change. In Kubernetes, this usually means Pod eviction, node draining or termination, and restarting the workload on other capacity.

For this reason, the decision to move workloads to spot should not be based on price alone. It is important to understand whether the application can tolerate the sudden loss of a node.

Before migrating, you should check:

Whether the service has multiple replicas;
Whether the replicas are distributed across different nodes or zones;
Whether the application can terminate cleanly with graceful shutdown;
Whether the task can be safely rerun after an interruption;
Whether operations are idempotent, especially in queues and batch processing;
Whether a PodDisruptionBudget, anti-affinity, or topology spread constraints are configured where needed;
How an interruption will affect SLA compliance, latency, and error rate.

In practice, spot is best treated as a separate node pool with clear rules for workload placement. Taints and tolerations keep unsuitable Pods off the spot pool, and node affinity or node selectors steer suitable Pods onto it. This gives the team explicit control over which Pods can run on interruptible capacity and which must remain on regular or on-demand nodes.
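
A placement rule of this kind might look like the following Pod spec fragment, built here as a Python dictionary. The label and taint key `example.com/capacity-type` and its value `spot` are hypothetical; real clusters use keys defined by the cloud provider or by the platform team.

```python
import copy

SPOT_KEY = "example.com/capacity-type"  # hypothetical label and taint key
SPOT_VALUE = "spot"


def with_spot_placement(pod_spec: dict) -> dict:
    """Return a copy of the Pod spec that tolerates and targets spot nodes."""
    spec = copy.deepcopy(pod_spec)
    # The toleration allows scheduling onto nodes tainted as spot.
    spec.setdefault("tolerations", []).append(
        {"key": SPOT_KEY, "operator": "Equal", "value": SPOT_VALUE, "effect": "NoSchedule"}
    )
    # Node affinity restricts the Pod to nodes labeled as spot.
    spec["affinity"] = {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": SPOT_KEY, "operator": "In", "values": [SPOT_VALUE]}
                        ]
                    }
                ]
            }
        }
    }
    return spec


base_spec = {"containers": [{"name": "batch-worker", "image": "example/worker:1.0"}]}
spot_spec = with_spot_placement(base_spec)
print(spot_spec["tolerations"])
```

The two parts do different jobs. The toleration only permits the Pod to land on tainted spot nodes, while the affinity rule is what keeps it off on-demand nodes. Pods without the toleration are kept off the spot pool by the taint.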

A brief classification looks like this:

Category	Examples	Decision
Usually suitable	CI runners, temporary environments, batch jobs, cron jobs, ML tasks, some queue consumers	Can be moved to spot if the tasks support reruns and graceful termination
Suitable with conditions	Stateless services with multiple replicas, non-critical background workers	Requires replicas on different nodes/zones, readiness/liveness probes, graceful shutdown, and SLA metrics monitoring
Better not to move	Databases, critical stateful services, single-replica applications, latency-sensitive APIs, cluster system components	An interruption can lead to downtime, loss of availability, higher p95/p99, or recovery issues
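
The table and the pre-migration checklist can be combined into a rough decision helper. The attribute names and the order of the checks are assumptions made for this sketch. It encodes the classification above, not an official rule, and a real decision still needs failure testing.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    replicas: int = 1
    stateful: bool = False
    latency_sensitive: bool = False
    system_component: bool = False
    rerunnable: bool = False           # task can be safely rerun (idempotent)
    graceful_shutdown: bool = False    # terminates cleanly on eviction
    spread_across_nodes: bool = False  # replicas on different nodes or zones


def spot_category(w: Workload) -> str:
    """Map a workload to one of the three categories from the table."""
    if w.stateful or w.latency_sensitive or w.system_component:
        return "better not to move"
    if w.rerunnable and w.graceful_shutdown:
        return "usually suitable"
    if w.replicas >= 2 and w.spread_across_nodes and w.graceful_shutdown:
        return "suitable with conditions"
    return "better not to move"


workloads = [
    Workload("ci-runner", rerunnable=True, graceful_shutdown=True),
    Workload("web-frontend", replicas=3, graceful_shutdown=True, spread_across_nodes=True),
    Workload("postgres", stateful=True),
    Workload("single-replica-app", graceful_shutdown=True),
]

for w in workloads:
    print(f"{w.name}: {spot_category(w)}")
```

The helper is conservative by design: anything that does not clearly meet the conditions falls back to "better not to move".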

Spot nodes are suitable for workloads that can tolerate the loss of a replica or a task restart. If a service cannot withstand sudden eviction, the savings can quickly turn into an incident.

It is best to introduce a spot pool gradually. A good starting point is CI, batch jobs, and some queue consumers. After the migration, you need to monitor not only cost savings, but also eviction frequency, recovery time, queue growth, task restarts, and the impact on SLA compliance.

If moving to spot results in unstable processing, increased latency, or manual intervention, those savings may cost more than keeping the workload on predictable capacity.

Conclusion

Cost optimization in Kubernetes does not start with cheaper nodes, but with understanding how much capacity the cluster already considers reserved. Requests affect Pod placement and the number of nodes required, so rightsizing is useful when it improves scheduling density, frees up VMs, or at least delays the next scaling event.

The practical sequence is as follows: identify over-reserved CPU and memory, recalculate requests and limits with the SLA in mind, set boundaries through quotas, review node pools and autoscaling, and then move suitable workloads to spot instances. Successful optimization is not simply low utilization on a chart; it means less paid-for idle capacity without an increase in throttling, OOMKilled events, evictions, or SLO violations.

FAQ

Why is the cluster expensive if CPU and memory are barely utilized?

Because Kubernetes schedules Pods based on requests, not average actual usage. If requests are set too high, the scheduler considers the nodes occupied even when the processes are actually using very few resources.

Which matters more for cost savings: requests or limits?

For costs, requests usually matter more because they affect Pod placement and the number of nodes required. Limits restrict a container’s resource consumption after it starts, but they do not reduce the bill by themselves if the node mix does not change.

How can you tell if requests are set too high?

Compare historical usage against requested resources: CPU and memory separately, looking not only at averages but also at p95/p99, spikes, cron jobs, and deployment windows. After making changes, it is important to monitor throttling, OOMKilled events, evictions, latency, and SLA violations.

Can ResourceQuotas help reduce costs?

Yes, but indirectly. Quotas cap the total requests and limits in a namespace and help align resource consumption with a team’s budget. However, they do not automatically optimize services: if requests are overprovisioned within the quota, overspending will persist.

What workloads can be moved to spot nodes?

Stateless services with multiple replicas, background workers, batch jobs, CI runners, queue consumers, and other workloads that can tolerate interruption and restart are usually suitable.

Which workloads are best left on regular or on-demand nodes?

Databases, critical stateful services, single-replica applications, latency-sensitive services, system components, and workloads with strict SLAs should be run on predictable capacity. They should be moved to spot only after a separate risk assessment and failure testing.

Sources

1. Kubernetes Documentation — Resource Management for Pods and Containers

2. Kubernetes Documentation — Resource Quotas

3. OpenCost Specification

4. AWS EKS Best Practices — Cost Optimization
