Table of contents
- What you will have at the end
- Prerequisites
- Why Karpenter instead of Cluster Autoscaler
- Install Karpenter on EKS
- Create a NodePool and EC2NodeClass
- Spot and on-demand with weighted NodePools
- Disruption, consolidation, and drift
- Migrate from Cluster Autoscaler
- Monitor Karpenter with Prometheus and Grafana
- Production hardening checklist
- Common gotchas
- When to escalate
What you will have at the end
A running Karpenter installation on EKS that provisions nodes based on actual pod requirements, consolidates underutilized capacity automatically, handles Spot interruptions, and exposes metrics to your Prometheus monitoring stack.
Prerequisites
- An EKS cluster running Kubernetes 1.28+
- `kubectl`, `helm`, and `aws` CLI installed locally
- IAM permissions to create roles, policies, and EC2 tags
- Subnets and security groups tagged for Karpenter discovery (covered in the install steps)
- Prometheus and Grafana installed if you want the monitoring section to work immediately
Why Karpenter instead of Cluster Autoscaler
Cluster Autoscaler (CA) scans for pending pods on a timer, then asks an Auto Scaling Group to add a node from a predefined set of instance types. That round-trip takes 3–5 minutes on a good day.
Karpenter skips the ASG entirely. It watches for unschedulable pods event-by-event, computes which instance type fits the batch best from up to 60 candidates, and calls the EC2 Fleet API directly. The result: nodes ready in 45–60 seconds.
| Dimension | Cluster Autoscaler | Karpenter |
|---|---|---|
| Trigger | Periodic scan (10+ s) | Event per pending pod |
| Node selection | Predefined node groups | All instance types matching NodePool |
| Provisioning time | 3–5 min (ASG delay) | 45–60 s (EC2 Fleet direct) |
| Consolidation | Removes idle nodes only | Empty, multi-node, and single-node consolidation |
| Instance flexibility | Limited to node group types | Any instance satisfying requirements |
Teams switching from CA to Karpenter commonly report 20–40% cost reduction from better bin-packing and automated consolidation alone.
One clarification that trips people up: Karpenter replaces Cluster Autoscaler, not HPA or VPA. HPA scales pods, VPA right-sizes resource requests, Karpenter provisions the nodes those pods land on. They are complementary layers, not alternatives.
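To make the layering concrete, here is a minimal HPA sketch (the Deployment name `web` is hypothetical). HPA adds replicas when CPU climbs; once those replicas no longer fit on existing nodes, Karpenter provisions capacity for them:

```yaml
# hpa.yaml -- pod-level scaling; Karpenter handles the node level
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```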
Install Karpenter on EKS
Step 1: set environment variables
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="1.11.1" # latest stable as of April 2026
export K8S_VERSION="1.31" # match your EKS cluster version
export CLUSTER_NAME="production-main"
export AWS_DEFAULT_REGION="eu-west-1"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
Step 2: create IAM roles
Karpenter needs two IAM roles:
- KarpenterNodeRole for the EC2 instances it launches. Attach `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`, and `AmazonSSMManagedInstanceCore`.
- KarpenterControllerRole for the controller pod itself, using IRSA. This role needs scoped EC2 permissions: `ec2:RunInstances`, `ec2:CreateFleet`, `ec2:TerminateInstances`, `ec2:DescribeInstances`, `ec2:DescribeSubnets`, `ec2:DescribeSecurityGroups`, `ec2:DescribePlacementGroups` (required since v1.11), `ec2:CreateTags`, `ec2:DeleteTags`, `iam:PassRole`, `iam:ListInstanceProfiles`, `ssm:GetParameter`, `sqs:ReceiveMessage`, `sqs:DeleteMessage`, among others.
Security note: any principal that can create or delete the tags karpenter.sh/managed-by, karpenter.sh/nodepool, and kubernetes.io/cluster/${CLUSTER_NAME} can indirectly influence what Karpenter provisions. Restrict tag CRUD in your IAM policies.
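As one illustration, tag creation can be limited to instance launch time with the `ec2:CreateAction` condition key — a sketch of a single policy statement, not a complete controller policy:

```json
{
  "Sid": "AllowScopedInstanceTagging",
  "Effect": "Allow",
  "Action": "ec2:CreateTags",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringEquals": {
      "ec2:CreateAction": ["RunInstances", "CreateFleet"]
    }
  }
}
```

This allows the controller to tag instances it is launching, while denying freestanding tag edits on existing instances by omission.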
Step 3: tag subnets and security groups
# Karpenter discovers subnets and security groups by tag
aws ec2 create-tags \
--resources subnet-0abc1234 subnet-0def5678 sg-0aabb1122 \
--tags Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}
Karpenter picks the subnet with the most available IPs per availability zone.
Step 4: install with Helm
helm registry logout public.ecr.aws # clear stale tokens
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
  --wait
Step 5: verify the controller is running
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=20
Expected output includes lines like controller started and watching for pending pods. No ERROR lines about IAM or STS.
Create a NodePool and EC2NodeClass
A NodePool defines what kind of nodes Karpenter may provision (instance families, capacity types, architectures, limits). An EC2NodeClass defines how to provision them on AWS (AMI, subnets, security groups, IAM, disk).
# ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-production-main"
  amiSelectorTerms:
    - alias: "al2023@v20250301" # pin in production; never use @latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "production-main"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "production-main"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
  metadataOptions:
    httpEndpoint: enabled
    httpTokens: required # IMDSv2 only
    httpPutResponseHopLimit: 1
# nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        team: platform
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # broad categories for bin-packing flexibility
      expireAfter: 720h # 30 days; forces node refresh
      terminationGracePeriod: 48h # hard deadline on draining
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"
  limits:
    cpu: "1000"
    memory: "1000Gi"
  weight: 50
Apply both:
kubectl apply -f ec2nodeclass.yaml -f nodepool.yaml
Verify Karpenter sees the NodePool:
kubectl get nodepools
Expected output:
NAME NODECLASS WEIGHT AGE
default default 50 12s
Spot and on-demand with weighted NodePools
For workloads that tolerate interruption, split into two NodePools with weight-based priority:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  weight: 100 # higher weight = tried first
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # broad categories for price-capacity-optimized
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "500"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 10
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "500"
Karpenter tries the weight-100 Spot pool first. If Spot capacity is unavailable, it falls back to the weight-10 on-demand pool. For Spot, Karpenter uses the price-capacity-optimized allocation strategy, which balances price and interruption probability rather than blindly picking the cheapest pool. Keep instance families broad: restricting to fewer than 15 instance types blocks single-node Spot consolidation.
For GPU workloads, use a separate NodePool with taints to isolate expensive GPU nodes from general workloads.
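A sketch of such a pool, reusing the default EC2NodeClass; the instance families, taint key, and GPU limit are illustrative and should be adjusted to your hardware and device plugin:

```yaml
# gpu-nodepool.yaml -- isolate GPU capacity behind a taint
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule # only pods with a matching toleration land here
  limits:
    nvidia.com/gpu: "16" # cap total GPUs this pool may provision
```

Pods that need a GPU must both tolerate the taint and request `nvidia.com/gpu` in their resource requests; everything else stays off these nodes.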
Disruption, consolidation, and drift
Karpenter's disruption model has two categories:
Voluntary (rate-limited by disruption budgets):
- Consolidation runs in three tiers: delete empty nodes first, then try multi-node consolidation (merge workloads from several nodes onto fewer), then single-node consolidation (replace a node with a smaller one).
`WhenEmptyOrUnderutilized` enables all three; `WhenEmpty` enables only empty-node deletion.
- Drift detects when a running node no longer matches the desired NodePool or EC2NodeClass spec (changed AMI, updated requirements, modified security groups). Karpenter replaces drifted nodes gracefully.
Forceful (not rate-limited):
- Expiration drains and terminates nodes when `expireAfter` elapses (default 720h / 30 days).
- Interruption handles EC2 lifecycle events: Spot 2-minute warnings, scheduled maintenance, instance stop signals. Karpenter pre-provisions a replacement during the warning window.
Disruption budgets by reason
Since v1.0 you can scope budgets per disruption reason:
disruption:
  budgets:
    - nodes: "20%"
      reasons: ["Drifted"]
    - nodes: "10%"
      reasons: ["Underutilized"]
    - nodes: "0"
      reasons: ["Empty"]
      schedule: "0 9 * * mon-fri" # freeze empty-node removal during business hours
      duration: 8h
Protecting specific pods
Add karpenter.sh/do-not-disrupt: "true" as a pod annotation to block voluntary disruption (consolidation, drift) on that pod's node. This does not block expiration or Spot interruption. Pair it with PodDisruptionBudgets for broader availability guarantees.
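In a Deployment, the annotation belongs on the pod template, not on the Deployment object itself — a minimal sketch (the app name `etl-worker` and image are hypothetical):

```yaml
# deployment.yaml -- protect a long-running batch worker from consolidation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etl-worker
  template:
    metadata:
      labels:
        app: etl-worker
      annotations:
        karpenter.sh/do-not-disrupt: "true" # blocks voluntary disruption of this pod's node
    spec:
      containers:
        - name: worker
          image: etl-worker:latest
```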
Migrate from Cluster Autoscaler
Karpenter and Cluster Autoscaler can run simultaneously. Zero-downtime migration follows this sequence:
Step 1: prepare workloads
Add PodDisruptionBudgets to every production Deployment. Without PDBs, scaling down old node groups causes immediate eviction of all replicas. Set accurate resource requests so Karpenter can bin-pack effectively.
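A minimal PDB sketch, assuming a Deployment labeled `app: web` running at least three replicas; the name and selector are illustrative:

```yaml
# pdb.yaml -- keep at least two replicas up during any node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

With this in place, Karpenter (and any other drainer) evicts the third replica first and waits for it to reschedule before touching the next one.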
Step 2: deploy Karpenter alongside CA
Install Karpenter as described above. Use nodeAffinity on the Karpenter controller Deployment to pin it to nodes in your existing managed node group. Karpenter must not run on nodes it manages (circular dependency if it evicts its own controller).
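One way to express that pinning is through the chart's `affinity` value — a sketch assuming a managed node group named `eks-managed-ng-1`; the `karpenter.sh/nodepool DoesNotExist` term additionally keeps the controller off any node Karpenter itself creates:

```yaml
# helm values snippet for the Karpenter chart
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/nodegroup
              operator: In
              values: ["eks-managed-ng-1"]
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
```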
Step 3: create NodePool and EC2NodeClass
Apply the manifests from the previous sections. Karpenter starts watching for unschedulable pods immediately but does not touch existing CA-managed nodes.
Step 4: scale Cluster Autoscaler to zero
kubectl scale deployment cluster-autoscaler -n kube-system --replicas=0
Step 5: gradually reduce node group capacity
Lower the minSize and desiredCapacity of your ASGs incrementally. As workloads naturally churn (deployments, scaling events, pod restarts), pods land on Karpenter-provisioned nodes. Old nodes drain through normal turnover.
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name eks-managed-ng-1 \
--min-size 2 \
--desired-capacity 2
Maintain at least 2 nodes per AZ in the initial node group until you have confirmed Karpenter handles all workloads.
Step 6: verify and clean up
kubectl get nodes -L karpenter.sh/nodepool
Nodes with a karpenter.sh/nodepool label are Karpenter-managed. Once no workloads remain on unmanaged nodes, delete the old ASGs.
Salesforce migrated over 1,000 EKS clusters to Karpenter using this exact phased approach.
Monitor Karpenter with Prometheus and Grafana
Karpenter exposes Prometheus metrics at karpenter.kube-system.svc.cluster.local:8080/metrics. If you run kube-prometheus-stack, add a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      path: /metrics
Key metrics to track
| Metric | What it tells you |
|---|---|
| `karpenter_nodes_created_total` | How many nodes Karpenter has provisioned |
| `karpenter_nodes_terminated_total` | How many nodes were removed (consolidation, expiry, interruption) |
| `karpenter_pods_startup_duration_seconds` | Time from pod creation to running state |
| `karpenter_scheduler_queue_depth` | Pending pod batches waiting for nodes |
| `karpenter_voluntary_disruption_decisions_total` | Consolidation and drift decisions |
| `karpenter_nodes_termination_duration_seconds` | Drain time; high p95 signals stuck PDBs |
| `karpenter_voluntary_disruption_eligible_nodes` | Nodes eligible for consolidation that are not being acted on |
Full reference: Karpenter metrics documentation.
Grafana dashboards
Import these from Grafana Labs:
- Karpenter Overview (ID 21699) for NodePool, node, and pod counts
- Karpenter Performance (ID 22173) for cloud provider errors and pod startup latency
- Karpenter Activity (ID 18862) for scale-up/down event timelines
Alerts worth configuring
- Sustained high queue depth: `karpenter_scheduler_queue_depth > 5` for more than 2 minutes means Karpenter cannot find capacity. Check NodePool limits, instance availability, and IAM permissions.
- Slow termination: `histogram_quantile(0.95, rate(karpenter_nodes_termination_duration_seconds_bucket[10m])) > 600` means draining takes longer than 10 minutes. Look for blocking PDBs or `do-not-disrupt` annotations.
- Provisioning surge: a sudden spike in `rate(karpenter_nodeclaims_created_total[5m])` may indicate a broken HPA loop or a rogue Deployment.
- Consolidation blocked: `karpenter_voluntary_disruption_eligible_nodes` stays high while `karpenter_voluntary_disruption_decisions_total` stays flat. Disruption budgets or PDBs are preventing cleanup.
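For kube-prometheus-stack users, the first two alerts could be expressed as a PrometheusRule — a sketch only; the alert names, thresholds, and rate windows are illustrative and should be tuned to your environment:

```yaml
# karpenter-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter
  namespace: kube-system
spec:
  groups:
    - name: karpenter
      rules:
        - alert: KarpenterQueueDepthHigh
          expr: karpenter_scheduler_queue_depth > 5
          for: 2m # sustained, not a momentary batch
          labels:
            severity: warning
        - alert: KarpenterSlowNodeTermination
          expr: histogram_quantile(0.95, rate(karpenter_nodes_termination_duration_seconds_bucket[10m])) > 600
          for: 10m
          labels:
            severity: warning
```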
Production hardening checklist
- Pin AMI versions. Use `al2023@v20250301`, not `@latest`. Test AMI updates in staging before rolling them via drift.
- Run Karpenter on Fargate or a dedicated managed node group, never on Karpenter-managed nodes. A circular dependency means Karpenter can evict its own controller.
- Require IMDSv2. Set `httpTokens: required` in EC2NodeClass `metadataOptions` to block SSRF-based credential theft.
- Set NodePool resource limits. Always define `limits.cpu` and `limits.memory` to cap spending per NodePool.
- Use IRSA for the controller role. Never attach IAM permissions via EC2 instance metadata.
- Keep instance types broad for Spot. Fewer than 15 instance type options blocks single-node Spot consolidation.
- Set `terminationGracePeriod` when using `expireAfter`. Without it, a pod annotated with `do-not-disrupt` can block node drain indefinitely.
Common gotchas
Pods stuck in Pending despite available NodePools. The pod's requirements (resource requests, node selectors, tolerations) do not fit within any NodePool's requirements. Run kubectl describe pod <name> and check the Events section for scheduling failure reasons. Karpenter can only provision nodes that satisfy the intersection of NodePool constraints and pod constraints.
Nodes created then immediately terminated. The EC2 instance launches but fails to join the cluster. Common causes: missing VPC endpoints for STS or SSM in private clusters, incorrect security group rules blocking kubelet communication, or wrong IAM instance profile.
Consolidation not happening. Check karpenter_voluntary_disruption_eligible_nodes. If it is high but decisions are zero, disruption budgets or PDBs are blocking. Also verify consolidationPolicy is set to WhenEmptyOrUnderutilized, not WhenEmpty.
Windows node slowness. Windows nodes take ~6 minutes to join the cluster plus 15–20 minutes to pull the base image. This is an inherent platform limitation, not a Karpenter issue. Do not expect sub-minute provisioning for Windows workloads.
v1.0 migration failures. If you are upgrading from Karpenter v0.x, the Provisioner and AWSNodeTemplate CRDs are removed in v1.0. Run karpenter-convert -f provisioner.yaml > nodepool.yaml before upgrading. v1.1 drops v1beta1 entirely.
When to escalate
Collect this information before asking for help:
- Karpenter controller logs: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100`
- NodePool and EC2NodeClass specs: `kubectl get nodepools -o yaml` and `kubectl get ec2nodeclasses -o yaml`
- Pending pods and their events: `kubectl get pods --field-selector=status.phase=Pending -A`
- NodeClaim status: `kubectl get nodeclaims -o wide`
- Karpenter version: `helm list -n kube-system | grep karpenter`
- EKS cluster version and platform version
- IAM role ARNs for both controller and node roles