What you will have at the end
A Kubernetes cluster where non-critical workloads run on spot or preemptible nodes at a 60–80% discount, with interruption handlers that drain pods gracefully before reclamation, PodDisruptionBudgets that protect availability, and instance type diversification that keeps interruption rates below 5%.
Prerequisites
- `kubectl` connected to an EKS (1.28+) or GKE (1.25+) cluster
- `helm` installed locally for handler deployments
- Familiarity with Karpenter NodePool configuration (for the Karpenter path) or Cluster Autoscaler (for the managed node group path)
- PodDisruptionBudgets configured on production workloads
- IAM permissions to create SQS queues and EventBridge rules (AWS) or node pool management permissions (GCP)
How interruption notices work
The cloud provider decides to reclaim your spot capacity. What happens next depends on the provider.
AWS: 2-minute warning via IMDS
AWS emits a Spot Instance interruption notice exactly 2 minutes before terminating or stopping the instance. The notice is available at the IMDS endpoint:
http://169.254.169.254/latest/meta-data/spot/instance-action
When an interruption is scheduled, the endpoint returns:
{"action": "terminate", "time": "2026-04-09T14:22:00Z"}
When nothing is pending, it returns HTTP 404. That 404 is the normal state.
AWS also sends a separate rebalance recommendation when a spot instance is at elevated risk. This arrives before the 2-minute notice (sometimes significantly earlier) and gives you time to proactively move workloads before actual reclamation.
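To see these signals yourself while debugging a handler, you can query IMDS directly from a node. A minimal sketch using IMDSv2 (the handlers covered in Step 2 do this polling for you):

# Request a short-lived IMDSv2 token
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

# Interruption notice: 404 when nothing is pending, JSON when scheduled
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

# Rebalance recommendation: also 404 until one is issued
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/events/recommendations/rebalance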
GCP: 30-second ACPI signal
GCP sends an ACPI G2 Soft Off signal giving the VM up to 30 seconds to shut down gracefully. If the instance has not stopped within that window, GCP sends a hard kill signal.
That 30-second window is tight. Plan for it in your terminationGracePeriodSeconds.
GCP terminology note: preemptible VMs have a 24-hour maximum runtime and Google recommends migrating to Spot VMs, which have no inherent runtime cap. Both share the same 30-second interruption window.
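For debugging on GCP, the metadata server exposes a preemption flag you can poll from inside the VM (informational only; GKE's kubelet handles the actual signal, as covered in Step 2):

# Returns "TRUE" once the instance has been preempted, "FALSE" otherwise
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted" \
  -H "Metadata-Flavor: Google"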
Step 1: taint spot nodes and configure tolerations
Spot nodes need a taint to prevent non-spot-tolerant workloads from landing on them. Without the taint, your database primary could end up on a spot node.
On EKS with Karpenter, add the taint in the NodePool spec (covered in Step 3).
On GKE, apply the taint when creating the spot node pool:
gcloud container node-pools create spot-pool \
  --cluster=production-main \
  --spot \
  --node-taints=cloud.google.com/gke-spot="true":NoSchedule
GKE automatically labels spot nodes with cloud.google.com/gke-spot: "true" and cloud.google.com/gke-provisioning: "spot" (GKE 1.25.5+).
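To confirm the taint and labels landed, list the spot nodes with their taints (standard kubectl column syntax):

kubectl get nodes -l cloud.google.com/gke-spot=true \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints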
Workloads that should run on spot need a matching toleration:
# For GKE spot nodes
tolerations:
- key: cloud.google.com/gke-spot
  operator: Equal
  value: "true"
  effect: NoSchedule

# For Karpenter-managed spot nodes (custom taint)
tolerations:
- key: "spot"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
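A toleration only permits scheduling on spot nodes; it does not steer pods toward them. To require spot placement, pair the toleration with a nodeSelector on the capacity labels from this step (a sketch; pick the variant matching your platform):

# Karpenter clusters
nodeSelector:
  karpenter.sh/capacity-type: spot

# GKE clusters
# nodeSelector:
#   cloud.google.com/gke-spot: "true"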
Step 2: install an interruption handler
Something needs to watch for the cloud provider's reclamation signal and drain pods before the node disappears. The right tool depends on your setup.
Option A: Karpenter native handling (EKS, recommended)
If you already run Karpenter, you do not need a separate handler. Karpenter has handled Spot interruption notifications natively since v0.19.0 via an SQS queue and EventBridge.
When an interruption arrives:
- EventBridge forwards the event to the SQS queue
- Karpenter's interruption controller taints the node `NoSchedule` and begins draining pods
- In parallel, Karpenter provisions a replacement node from the NodePool
- Replacement is typically ready before the 2-minute window expires
Enable it by passing the queue name to Karpenter:
karpenter --interruption-queue=karpenter-interruption-queue
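If you deploy Karpenter with Helm, the same setting is exposed as a chart value (recent chart versions; kube-system namespace assumed):

helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --set settings.interruptionQueue=karpenter-interruption-queue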
One limitation: Karpenter publishes events for Spot rebalance recommendations but does not act on them proactively. If you want rebalance-triggered replacement, you still need NTH in Queue mode alongside Karpenter.
Option B: aws-node-termination-handler (EKS, without Karpenter)
The aws-node-termination-handler (NTH) runs in two mutually exclusive modes:
IMDS mode (DaemonSet): polls the IMDS endpoint every 5 seconds. No extra AWS infrastructure required. Best for simple spot setups.
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableRebalanceMonitoring=true \
  --set enableScheduledEventDraining=true
Queue mode (Deployment): monitors an SQS queue fed by EventBridge. Supports ASG lifecycle hooks with grace periods up to 48 hours via RecordLifecycleActionHeartbeat. Required for long-running batch jobs.
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.eu-west-1.amazonaws.com/123456789012/nth-queue
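Queue mode assumes the SQS queue and the EventBridge rule feeding it already exist. A minimal sketch of the interruption rule with the AWS CLI (the queue ARN matches the example URL above; substitute your account and region):

aws events put-rule --name nth-spot-interruption \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'

aws events put-targets --rule nth-spot-interruption \
  --targets 'Id=nth-queue,Arn=arn:aws:sqs:eu-west-1:123456789012:nth-queue'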
| Factor | IMDS mode | Queue mode |
|---|---|---|
| Infrastructure needed | None | SQS + EventBridge + IAM |
| ASG lifecycle hooks | No | Yes (up to 48h) |
| Long-running batch jobs | No | Yes |
| Simple spot nodes | Recommended | Overkill |
Option C: GKE kubelet graceful shutdown
GKE 1.20+ handles preemption automatically. The kubelet intercepts the ACPI signal and gracefully terminates pods. No separate handler needed.
For GKE clusters older than 1.20, deploy the k8s-node-termination-handler from GoogleCloudPlatform.
Step 3: configure Karpenter for spot (EKS)
This NodePool tells Karpenter to prefer spot, fall back to on-demand when capacity is unavailable, and diversify across instance types to reduce interruption risk:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      taints:
        - key: "spot"
          value: "true"
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # spot preferred, on-demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.xlarge # diversify across families and generations
            - m5a.xlarge
            - m6i.xlarge
            - m6a.xlarge
            - m5d.xlarge
            - m5n.xlarge
            - m4.xlarge
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "20%"
        reasons: ["Empty", "Drifted"]
      - nodes: "5" # max 5 nodes disrupted at once for other reasons
Karpenter prioritizes capacity types in order: reserved > spot > on-demand. When spot capacity is unavailable, Karpenter caches the InsufficientCapacityError for 3 minutes and falls back to on-demand.
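To watch the fallback in practice, list nodes with the capacity-type label as a column; during a spot capacity shortage you will see on-demand nodes appear:

kubectl get nodes -L karpenter.sh/capacity-type -L node.kubernetes.io/instance-type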
SpotToSpotConsolidation
Karpenter can replace a running spot node with a cheaper spot node. Enable it via Helm:
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --set settings.featureGates.spotToSpotConsolidation=true
This triggers only when 15+ instance type options with lower pricing exist, preventing convergence on a single high-interruption type. It uses the price-capacity-optimized strategy. I would recommend validating consolidation patterns in a staging environment first.
Step 4: set PDBs and graceful termination periods
Spot interruptions trigger the Eviction API, which respects PodDisruptionBudgets. A well-configured PDB prevents all replicas from being evicted simultaneously.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
  unhealthyPodEvictionPolicy: AlwaysAllow # prevents stalled drains
Set unhealthyPodEvictionPolicy: AlwaysAllow so unhealthy pods do not block eviction when the PDB budget is already consumed.
Termination grace periods
The terminationGracePeriodSeconds covers both the preStop hook and the container's SIGTERM handler. Match it to your cloud provider's interruption window:
- AWS (2-minute window): set `terminationGracePeriodSeconds: 90` for most workloads. Batch jobs on Queue mode NTH with lifecycle hooks can go higher.
- GCP (30-second window): set `terminationGracePeriodSeconds: 25` to stay within the hard cutoff.
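A sketch of where the field sits in a Deployment's pod template, using the AWS value from the list above (container name and image are placeholders):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90 # must cover preStop + SIGTERM handling
      containers:
      - name: web # placeholder
        image: web:1.0 # placeholder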
A minimal preStop hook that allows load balancer deregistration:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"] # wait for endpoint removal
Production services behind load balancers should verify deregistration is complete instead of relying on a fixed sleep.
Minimum replicas
Run at least 2 replicas of any spot-tolerant Deployment. A single replica on a spot node means a full outage on interruption. No PDB saves you from that.
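Replicas only help if they do not share a node. A sketch of a topology spread constraint that keeps the two replicas on separate nodes (reusing the app: web label from the PDB example):

spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway # prefer spread, do not block scheduling
        labelSelector:
          matchLabels:
            app: web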
Step 5: select instance types for low interruption rates
Interruption is not random. The AWS Spot Instance Advisor shows per-type interruption frequency in bands: <5%, 5–10%, 10–15%, 15–20%, >20%. Across all types, the historical average is below 5%.
Three rules for instance type selection:
- Diversify across 5–10 types of the same vCPU and memory class (e.g., `m5.xlarge`, `m5a.xlarge`, `m6i.xlarge`, `m6a.xlarge`). The more pools Karpenter or the ASG can draw from, the lower the chance all pools are exhausted simultaneously.
- Span multiple Availability Zones. The same instance type can have very different interruption rates across AZs.
- Use `capacity-optimized` or `price-capacity-optimized` allocation. These strategies select from the deepest capacity pools rather than the cheapest price alone.
The spotinfo CLI lets you browse interruption rates and savings percentages per instance type and region from the terminal.
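For example (flags per the spotinfo README; exact output columns may vary by version):

# Compare m5-family xlarge variants in one region
spotinfo --type "m5a?\.xlarge" --region eu-west-1 --output table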
Verify the setup
After deploying, confirm each layer works:
# Check spot nodes are running and tainted
kubectl get nodes -l karpenter.sh/capacity-type=spot -o wide
kubectl describe node <spot-node> | grep -A2 Taints
# Check NTH or Karpenter interruption controller is running
kubectl get pods -n kube-system | grep -E 'node-termination|karpenter'
# Check PDBs are in place
kubectl get pdb --all-namespaces
Expected state: spot nodes carry the correct taint, the interruption handler pod is Running, and PDBs show disruptionsAllowed > 0 for workloads with multiple replicas.
Workload suitability checklist
Not every workload belongs on spot. Use this checklist:
- Good fit: stateless web servers, API workers, CI/CD runners, batch data processing pipelines, non-production environments, ML training with checkpointing
- Possible with care: StatefulSets with persistent volumes when the application supports graceful checkpoint/restore (Kafka, Redis with RDB snapshots, distributed ML training)
- Not a fit: single-node databases (PostgreSQL, MySQL in primary mode), workloads that cannot tolerate a 2-minute (AWS) or 30-second (GCP) restart window, latency-sensitive services with no on-demand fallback
The rule: if your application can survive a process restart with zero data loss, it can run on spot.
When to escalate
Collect the following before reaching out:
- `kubectl get events --field-selector reason=Eviction -A` output
- `kubectl describe node <affected-node>` (look for conditions and taints)
- NTH or Karpenter controller logs: `kubectl logs -n kube-system deploy/aws-node-termination-handler` or `kubectl logs -n kube-system deploy/karpenter`
- The PDB status: `kubectl get pdb -A -o wide`
- Instance type and AZ of the interrupted node
- Whether workloads restarted successfully on replacement nodes