Table of contents
- What you will have at the end
- Prerequisites
- How Cluster Autoscaler decides to scale
- Configure on EKS
- Configure on GKE
- Configure on AKS
- Choose an expander strategy
- Tune scale-down timing
- Diagnose why a node will not scale down
- Diagnose why a Pending pod does not trigger scale-up
- Balance similar node groups across zones
- Observe and debug Cluster Autoscaler
- When to use Karpenter instead
- When to escalate
What you will have at the end
A working Cluster Autoscaler (CA) deployment that automatically adds nodes when pods cannot be scheduled and removes underutilized nodes after a configurable cooldown. You will know how to tune the key timing parameters, unblock stuck scale-down operations, and read CA's status output to understand what it is doing and why.
Prerequisites
- kubectl connected to a Kubernetes 1.28+ cluster on EKS, GKE, or AKS
- Cluster admin permissions (RBAC) and cloud IAM permissions to manage node groups/ASGs/VMSS
- Familiarity with resource requests and limits. CA uses requests, not actual utilization, for every scheduling decision
- PodDisruptionBudgets configured on production workloads (CA respects PDBs during scale-down)
How Cluster Autoscaler decides to scale
Scale-up
CA polls the API server every scan-interval (default 10 seconds). When it finds pods in Pending state with reason: Unschedulable, it simulates adding a node from each available node group and picks one using the configured expander strategy.
The decision is based entirely on resource requests. A pod requesting 2 CPU and 4Gi memory on a node with 4 CPU allocatable will register as 50% utilized, even if actual CPU usage is 3%.
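To make the arithmetic concrete, here is a minimal sketch of the calculation CA performs, using the hypothetical numbers from above (2 CPU requested, 4 CPU allocatable):

```shell
# CA's simulated utilization is requests / allocatable, never live metrics.
requested_cpu=2
allocatable_cpu=4
util=$(awk -v r="$requested_cpu" -v a="$allocatable_cpu" 'BEGIN { printf "%.0f", r / a * 100 }')
echo "CA-perceived utilization: ${util}%"   # 50%, even if actual CPU usage is 3%
```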
Scale-down
A node becomes a scale-down candidate when three conditions are true simultaneously:
- The sum of all pod requests on the node is below `scale-down-utilization-threshold` (default 0.5, i.e. 50%)
- Every pod on the node can be rescheduled onto other existing nodes
- The node has been in this state for at least `scale-down-unneeded-time` (default 10 minutes)
On top of that, scale-down-delay-after-add (default 10 minutes) blocks all scale-down after any scale-up event. This is deliberate thrash prevention. Under default settings, the effective minimum time from "node becomes underutilized" to "node is removed" is 10-20 minutes.
Version matching
CA's minor version must match the Kubernetes minor version. A cluster running Kubernetes 1.32.x needs CA 1.32.x. Patch version does not need to match. Update CA immediately after upgrading the control plane.
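One way to check both sides of the version match, assuming jq is installed and the self-managed deployment is named `cluster-autoscaler` in `kube-system` (as in the EKS section below):

```shell
# Control plane minor version
kubectl version -o json | jq -r '.serverVersion.gitVersion'
# CA version, read from the deployment's image tag
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

The minor versions (e.g. 1.32) should agree; the patch components may differ.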
Configure on EKS
On EKS you deploy the open-source Cluster Autoscaler yourself. It needs IAM permissions and ASG discovery tags.
Step 1: create an IAM policy
CA needs read access to describe ASGs and EC2 instances, and write access (scoped to tagged ASGs) to set desired capacity and terminate instances.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:DescribeTags",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"autoscaling:ResourceTag/k8s.io/cluster-autoscaler/enabled": "true",
"autoscaling:ResourceTag/k8s.io/cluster-autoscaler/production-main": "owned"
}
}
}
]
}
Replace production-main with your cluster name. The Condition block ensures CA can only modify ASGs tagged for your cluster.
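Assuming the JSON above is saved locally as `ca-policy.json` (filename and policy name are placeholders), the policy can be created with the AWS CLI:

```shell
# Create the policy; note the ARN in the output for Step 2.
aws iam create-policy \
  --policy-name ClusterAutoscalerPolicy \
  --policy-document file://ca-policy.json
```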
Step 2: associate IAM with the CA service account
Option A: EKS Pod Identity (preferred for new clusters)
eksctl create podidentityassociation \
--cluster production-main \
--namespace kube-system \
--service-account-name cluster-autoscaler \
  --role-arn arn:aws:iam::123456789012:role/ClusterAutoscalerRole

Pod Identity associations take a role ARN, not a policy ARN: create an IAM role with the Step 1 policy attached and pass that role here.
Option B: IRSA (older clusters)
eksctl create iamserviceaccount \
--cluster=production-main \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::123456789012:policy/ClusterAutoscalerPolicy \
--approve
Step 3: tag your ASGs
Every ASG (or EKS Managed Node Group) that CA should manage needs two tags:
| Tag key | Value |
|---|---|
| `k8s.io/cluster-autoscaler/enabled` | `true` |
| `k8s.io/cluster-autoscaler/production-main` | `owned` |
eksctl adds these automatically. Terraform and CloudFormation require manual addition.
For scaling from zero (node group min=0), add template tags so CA can infer node properties without a live node:
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/os = linux
k8s.io/cluster-autoscaler/node-template/label/workload-type = batch
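If the ASG is managed outside eksctl, the template tags can be added with the AWS CLI; `my-batch-asg` is a placeholder name:

```shell
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-batch-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/workload-type,Value=batch,PropagateAtLaunch=true"
```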
Step 4: deploy Cluster Autoscaler
Pin the image tag to match your cluster's Kubernetes minor version:
# For Kubernetes 1.32.x
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=production-main \
--set awsRegion=eu-west-1 \
--set image.tag=v1.32.0 \
--set extraArgs.balance-similar-node-groups=true \
--set extraArgs.skip-nodes-with-local-storage=false \
--set podAnnotations."cluster-autoscaler\.kubernetes\.io/safe-to-evict"='"false"' \
--set priorityClassName=system-cluster-critical
The safe-to-evict: "false" annotation prevents CA from evicting its own pod. The system-cluster-critical priority class ensures CA is not preempted under node pressure.
Step 5: verify
kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler --tail=20
Expected output includes lines like Cluster Autoscaler version v1.32.0 and periodic Calculating unneeded nodes. No ERROR lines about IAM or STS.
Mixed instance policies on EKS
All instance types in a MixedInstancePolicy ASG must have the same vCPU count and RAM. CA uses the first instance type listed for scheduling simulation. Larger types waste resources (CA schedules fewer pods than the node can fit). Smaller types cause scheduling failures (CA promises more capacity than the node delivers).
Separate On-Demand and Spot into distinct ASGs. Do not use a base capacity + spot overflow strategy within a single ASG.
Configure on GKE
GKE's Cluster Autoscaler is a fully managed component running in the control plane. You do not deploy the open-source CA.
Enable on a new cluster
gcloud container clusters create production-cluster \
--num-nodes=2 \
--location=europe-west4-a \
--node-locations=europe-west4-a,europe-west4-b \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=8
Enable on an existing node pool
gcloud container clusters update production-cluster \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=8 \
--node-pool=default-pool
For multi-zone node pools on GKE 1.24+, use --total-min-nodes and --total-max-nodes to control the count across all zones rather than per zone.
Choose an autoscaling profile
GKE exposes two autoscaling profiles:
| Profile | Behavior |
|---|---|
| `balanced` (default) | Moderate scale-down, keeps some buffer for incoming workloads |
| `optimize-utilization` | Aggressive scale-down for cost savings, may increase scheduling latency |
gcloud container clusters update production-cluster \
--autoscaling-profile=optimize-utilization
GKE limitations
- Individual node pools can scale to zero, but the cluster as a whole cannot: at least one node must remain to run system pods
- Maximum cluster size: 15,000 nodes
- Node auto-provisioning (NAP) is a separate feature that creates new node pools dynamically; it is distinct from the autoscaler managing existing pools
Configure on AKS
AKS manages CA internally. Configure it through az CLI, never by editing VMSS autoscaling settings in the Azure portal (they conflict).
Enable on a new cluster
az aks create \
--resource-group production-rg \
--name production-aks \
--node-count 2 \
--vm-set-type VirtualMachineScaleSets \
--load-balancer-sku standard \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 8 \
--generate-ssh-keys
Update autoscaler on an existing node pool
az aks nodepool update \
--resource-group production-rg \
--cluster-name production-aks \
--name nodepool1 \
--update-cluster-autoscaler \
--min-count 1 \
--max-count 10
Configure the autoscaler profile
AKS exposes a cluster-wide autoscaler profile that applies to all node pools:
az aks update \
--resource-group production-rg \
--name production-aks \
--cluster-autoscaler-profile \
scan-interval=30s \
scale-down-delay-after-add=5m \
scale-down-unneeded-time=5m \
scale-down-utilization-threshold=0.5 \
max-graceful-termination-sec=300 \
balance-similar-node-groups=true
The profile cannot be set per node pool.
Choose an expander strategy
When multiple node groups can satisfy a pending pod, the expander selects which group to expand:
| Expander | Behavior | Best for |
|---|---|---|
| `random` (default) | Picks a random eligible group | Equivalent groups, simple setups |
| `least-waste` | Picks the group leaving the least unused CPU/memory | General-purpose cost efficiency |
| `most-pods` | Picks the group that fits the most pods | Burst workloads |
| `priority` | User-defined ranking via ConfigMap | Spot before on-demand, preferred instance types |
Expanders can be chained: --expander=priority,least-waste applies priority rules first, then breaks ties by least waste.
Priority ConfigMap example (prefer Spot over on-demand):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*
    10:
      - .*ondemand.*
Higher numbers are tried first. Regex patterns match node group names.
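On a self-managed deployment (as in the EKS section), one way to enable this is to apply the ConfigMap and switch the expander flag; the release name and manifest filename are assumptions:

```shell
kubectl apply -f priority-expander.yaml
# The comma must be escaped inside a --set value so Helm does not split it.
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system --reuse-values \
  --set extraArgs.expander='priority\,least-waste'
```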
Tune scale-down timing
The most impactful flags for cost optimization:
| Flag | Default | What it controls |
|---|---|---|
| `--scale-down-unneeded-time` | 10m | How long a node must be underutilized before removal |
| `--scale-down-delay-after-add` | 10m | Cooldown after any scale-up event |
| `--scale-down-delay-after-delete` | scan-interval | Cooldown after a node deletion |
| `--scale-down-delay-after-failure` | 3m | Cooldown after a failed scale-down attempt |
| `--scale-down-utilization-threshold` | 0.5 | Utilization below which a node is "underutilized" |
For development clusters where cost matters more than stability:
--scale-down-unneeded-time=3m \
--scale-down-delay-after-add=2m \
--scale-down-utilization-threshold=0.4
For production clusters, keep the defaults or increase them. The 10-minute buffers exist because premature scale-down followed by immediate scale-up is more expensive (node boot time, pod migration, potential service disruption) than keeping an idle node for a few extra minutes.
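On a Helm-managed deployment like the EKS one earlier, the development-cluster values could be applied as a sketch like this (release name assumed; on GKE and AKS use the managed profile settings instead):

```shell
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system --reuse-values \
  --set extraArgs.scale-down-unneeded-time=3m \
  --set extraArgs.scale-down-delay-after-add=2m \
  --set extraArgs.scale-down-utilization-threshold=0.4
```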
Diagnose why a node will not scale down
CA evaluates nodes for removal every scan-interval. If a node stays despite low utilization, one of these is blocking it:
PodDisruptionBudgets at zero
kubectl get pdb --all-namespaces -o wide
# Look for ALLOWED DISRUPTIONS = 0
A PDB with maxUnavailable: 0 or minAvailable equal to the total replica count blocks eviction on that node. Fix by using percentage-based PDBs (minAvailable: 75%) or by increasing replica counts.
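As a sketch of the percentage-based fix, a PDB for a hypothetical 4-replica Deployment labeled `app: api` (all names illustrative):

```shell
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 75%   # with 4 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app: api
EOF
```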
Pods with local storage
By default, pods using emptyDir or hostPath volumes block eviction because data would be lost. Set --skip-nodes-with-local-storage=false globally, or annotate specific pods:
metadata:
annotations:
# CA 1.26+ — mark specific volumes as safe to evict
cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes: "cache-volume,tmp-dir"
System pods in kube-system
--skip-nodes-with-system-pods=true (default on most providers) prevents removing nodes running kube-system pods other than DaemonSets. Add-ons like metrics-server or coredns can trap nodes. Either set the flag to false or annotate individual system pods:
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
The safe-to-evict annotation
Any pod annotated with cluster-autoscaler.kubernetes.io/safe-to-evict: "false" blocks scale-down of its node. This is appropriate for the CA pod itself and for long-running batch jobs with expensive restart costs.
Node affinity constraints
If pods on a node have requiredDuringSchedulingIgnoredDuringExecution affinity rules that no other node satisfies, the node cannot be drained. Prefer preferredDuringSchedulingIgnoredDuringExecution where possible.
Timing delays
Both scale-down-delay-after-add and scale-down-unneeded-time must have elapsed. After a scale-up event, CA waits the full delay-after-add period before evaluating any node for removal, regardless of how underutilized it is.
Diagnose why a Pending pod does not trigger scale-up
A pod stuck in Pending does not guarantee CA will act. CA only helps if adding a node from an existing node group would allow the pod to schedule. Check these causes in order:
1. No node group matches the pod's constraints. If the pod has a `nodeSelector`, `nodeAffinity`, or taint toleration that no node group provides, CA cannot help. Check CA events:

   kubectl get events --field-selector source=cluster-autoscaler,reason=NotTriggerScaleUp

2. Node group maximum reached. The ASG or VMSS `max` count is a hard ceiling. CA logs this as `max node group size reached`.
3. Scale-from-zero without template tags. A node group at zero nodes needs ASG tags describing its labels, taints, and resources. Without them, CA cannot evaluate whether scaling the group would help.
4. Too many unready nodes. If more than `max-total-unready-percentage` (default 45%) of nodes are NotReady, CA halts all operations.
5. `new-pod-scale-up-delay` is set. Pods younger than this value are ignored. Useful for bursty workloads where the scheduler handles spikes before CA needs to intervene.
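For the node-group-maximum case on EKS, you can compare desired capacity against the ceiling directly (the ASG name is a placeholder):

```shell
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-node-group-asg \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Min:MinSize,Max:MaxSize}'
```

If Desired equals Max, CA has no headroom and raising the node group maximum is the fix.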
Balance similar node groups across zones
--balance-similar-node-groups=true makes CA distribute nodes evenly across node groups that have matching instance types, labels, and taints. This is the recommended setting for multi-AZ clusters.
Without it, CA picks a single group based on the expander strategy, which can concentrate all new nodes in one availability zone. During a zone failure, that concentration becomes a single point of failure.
On AKS, set it in the autoscaler profile:
az aks update --resource-group production-rg --name production-aks \
--cluster-autoscaler-profile balance-similar-node-groups=true
On GKE, this behavior is managed internally by the autoscaler.
Observe and debug Cluster Autoscaler
Status ConfigMap
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
Reports the last scale-up and scale-down times, node group states, and overall health.
Logs
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=50
Key log patterns:
| Pattern | Meaning |
|---|---|
| `Scale-up: no suitable node group found` | Pod constraints unsatisfiable by any group |
| `pod has local storage` | Local storage blocking scale-down |
| `pod has PDB` | PDB blocking scale-down |
| `Node X was unneeded for X min` | Countdown to scale-down |
| `Scale-down: removing node X` | Active node removal |
Events
# Scale-up decisions
kubectl get events --field-selector source=cluster-autoscaler,reason=ScaleUp
# Scale-up not triggered (constraint mismatches)
kubectl get events --field-selector source=cluster-autoscaler,reason=NotTriggerScaleUp
# Warnings
kubectl get events --field-selector source=cluster-autoscaler,type=Warning
EKS: verify API access
The CA image is distroless and does not include the AWS CLI, so run the check from a short-lived pod that borrows the CA service account:

kubectl run ca-iam-test -n kube-system --rm -it --restart=Never \
  --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"cluster-autoscaler"}}' \
  -- autoscaling describe-auto-scaling-groups --region eu-west-1 \
  --output text --query 'length(AutoScalingGroups)'

If this fails, IRSA or Pod Identity is misconfigured.
AKS: enable control plane logs
Enable the cluster-autoscaler category under AKS control plane resource logs and query them in Log Analytics.
When to use Karpenter instead
Karpenter is a fundamentally different approach to node autoscaling. Where CA works with predefined node groups and cloud-provider scaling APIs, Karpenter calls the EC2 Fleet API directly, picking the best instance type per pending pod batch.
| Dimension | Cluster Autoscaler | Karpenter |
|---|---|---|
| Provisioning speed | 3-5 min (ASG spin-up) | 45-60 s (EC2 Fleet direct) |
| Instance selection | Predefined per node group | Any instance matching NodePool requirements |
| Consolidation | Removes idle nodes only | Empty, multi-node, and single-node consolidation |
| Cloud support | 24+ cloud providers | EKS only (as of early 2026) |
Choose CA when: you run GKE or AKS (where Karpenter is not available), you need strict AMI control per node group, you operate in regulated environments with predefined instance pools, or you run a multi-cloud setup.
Choose Karpenter when: you run EKS and want faster provisioning, better bin-packing, automated consolidation, and Spot diversification without managing ASGs. AWS recommends Karpenter for new EKS clusters as of 2025.
Both can run simultaneously during migration. Deploy Karpenter alongside CA, then gradually scale CA down as workloads shift to Karpenter-provisioned nodes.
When to escalate
Collect this information before asking for help:
- CA logs: `kubectl logs -n kube-system deployment/cluster-autoscaler --tail=100`
- Status ConfigMap: `kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml`
- PDB status: `kubectl get pdb --all-namespaces -o wide`
- Pending pods and their events: `kubectl get pods --field-selector=status.phase=Pending -A` and `kubectl describe pod <name>`
- CA events: `kubectl get events --field-selector source=cluster-autoscaler`
- Node group configuration (ASG tags, VMSS settings, GKE node pool config)
- CA version and Kubernetes version (`kubectl version`; the `--short` flag was removed in kubectl 1.28)
- Cloud provider and region