Goal
At the end of this article you will know how to use taints, tolerations, node affinity, pod anti-affinity, and topology spread constraints to control exactly which pods run on which nodes in a Kubernetes cluster.
Prerequisites
- A Kubernetes cluster running v1.28 or later, with `kubectl` access and permission to taint nodes and create Deployments
- At least two nodes in the cluster (scheduling constraints are meaningless with a single node)
- Familiarity with Kubernetes Services and pod labels. If a pod gets stuck in Pending after applying these rules, the pod Pending troubleshooting guide covers every scheduler failure in detail.
Taints and tolerations: repelling pods from nodes
A taint is a property on a node that repels pods unless a pod explicitly tolerates it. Think of it as a "keep out" sign. A toleration on a pod says "I can handle that sign."
A taint has three parts: a key, an optional value, and an effect.
Adding and removing taints
```bash
# Add a taint to a node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Verify taints on a node
kubectl describe node gpu-node-1 | grep Taints

# Remove a specific taint (note the trailing minus)
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-
```
The three taint effects
NoSchedule is the most common. New pods without a matching toleration will not land on the node. Pods already running are not evicted. Use this for reserving GPU nodes, spot pools, or tenant-dedicated hardware.
PreferNoSchedule is the soft version. The scheduler tries to avoid the node but will place pods there if no alternatives exist. Existing pods stay put. Use this when you want to steer traffic away from certain nodes without hard-blocking.
NoExecute is the strictest. Pods without a matching toleration are evicted immediately. New pods without a toleration cannot schedule either. This is the effect the node lifecycle controller applies automatically when a node becomes not-ready or unreachable.
When a node has multiple taints, Kubernetes evaluates them as a combined filter: if any unmatched taint has NoSchedule, the pod is blocked. If the only unmatched taints are PreferNoSchedule, the scheduler avoids but may proceed. If any unmatched taint has NoExecute, running pods get evicted.
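Concretely, a pod must carry a toleration for every NoSchedule taint on a node before it can land there. A minimal sketch for a hypothetical node carrying both the GPU taint from above and a spot taint:

```yaml
# Node is tainted with BOTH of these:
#   nvidia.com/gpu=present:NoSchedule
#   spot=true:NoSchedule
# The pod needs both tolerations, or it is filtered out:
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Drop either toleration and the node is excluded, because one NoSchedule taint remains unmatched.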
Writing tolerations
Tolerations go in spec.tolerations on the pod (or in the pod template of a Deployment). Two operators control matching:
Equal matches a specific key, value, and effect:
```yaml
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
```
Exists matches any value for the given key:
```yaml
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
An empty key with Exists tolerates all taints on all keys. DaemonSets like monitoring agents sometimes use this so they run on every node regardless of taints. Use it sparingly.
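The blanket form looks like this (a minimal sketch; reserve it for agents that genuinely must run on every node):

```yaml
# No key + Exists = tolerate every taint on every key
tolerations:
  - operator: "Exists"
```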
tolerationSeconds with NoExecute
When a NoExecute taint is added to a node (during maintenance or after a node condition), pods with a matching toleration can specify how long they stay before eviction:
```yaml
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120  # pod gets 2 minutes to shut down gracefully
```
Built-in automatic taints
The node lifecycle controller adds taints automatically when conditions are detected. These are the most common:
| Taint | Trigger | Effect |
|---|---|---|
| `node.kubernetes.io/not-ready` | Node Ready condition is False | NoExecute |
| `node.kubernetes.io/unreachable` | Node Ready condition is Unknown | NoExecute |
| `node.kubernetes.io/memory-pressure` | MemoryPressure condition | NoSchedule |
| `node.kubernetes.io/disk-pressure` | DiskPressure condition | NoSchedule |
| `node.kubernetes.io/pid-pressure` | PIDPressure condition | NoSchedule |
The DaemonSet controller automatically adds tolerations for these built-in taints so system pods (CNI plugins, log collectors, monitoring agents) keep running. But if you add custom taints to nodes, DaemonSets will not tolerate them unless you add the toleration to the DaemonSet spec explicitly.
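A minimal sketch of what that looks like, assuming a hypothetical log-collector DaemonSet (name and image are placeholders) that must also run on the GPU nodes tainted earlier:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector          # hypothetical agent
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      tolerations:
        # built-in node-condition taints are tolerated automatically;
        # custom taints like this one must be listed explicitly
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: collector
          image: log-collector:v1.0   # placeholder image
```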
Node affinity: attracting pods to nodes
Taints repel. Node affinity does the opposite: it attracts pods toward nodes with specific labels. It replaces the older nodeSelector with more expressive matching.
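For comparison, the older nodeSelector form supports only exact key-value matches, with no operators, no OR logic, and no soft preferences:

```yaml
# nodeSelector: every listed label must match exactly
spec:
  nodeSelector:
    accelerator: nvidia-a100
```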
Required vs. preferred
requiredDuringSchedulingIgnoredDuringExecution is a hard rule. The pod stays Pending if no node matches. IgnoredDuringExecution means the pod is not evicted if node labels change after scheduling.
preferredDuringSchedulingIgnoredDuringExecution is a soft preference. The scheduler tries to find a match but will schedule elsewhere if needed. Each preference carries a weight between 1 and 100. Higher weight means stronger preference.
YAML example
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present        # must have a GPU
                operator: In
                values:
                  - "true"
              - key: topology.kubernetes.io/zone   # and be in one of these zones
                operator: In
                values:
                  - eu-west-1a
                  - eu-west-1b
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - p4d.24xlarge                   # prefer the bigger GPU instance
  containers:
    - name: training
      image: ml-training:v2.3
      resources:
        limits:
          nvidia.com/gpu: 1
```
Operators and term logic
Node affinity supports six operators: In, NotIn, Exists, DoesNotExist, Gt, and Lt. The last two compare integer label values.
Multiple nodeSelectorTerms are ORed: the pod schedules if any term matches. Multiple expressions within a single matchExpressions list are ANDed: all must match. If you combine nodeSelector and nodeAffinity, both must be satisfied.
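A sketch of the term logic, assuming a hypothetical integer-valued label `example.com/cpu-count`:

```yaml
# Matches nodes that are EITHER arm64 OR have more than 8 CPUs.
# Separate nodeSelectorTerms are ORed; expressions inside one term are ANDed.
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:                 # term 1
          - key: kubernetes.io/arch
            operator: In
            values:
              - arm64
      - matchExpressions:                 # term 2, ORed with term 1
          - key: example.com/cpu-count    # hypothetical integer label
            operator: Gt
            values:
              - "8"
```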
Well-known node labels
These labels are set automatically on most cloud-provisioned nodes and are safe to target with affinity rules:
```yaml
kubernetes.io/arch: amd64                   # or arm64
kubernetes.io/os: linux                     # or windows
node.kubernetes.io/instance-type: m5.large
topology.kubernetes.io/region: eu-west-1
topology.kubernetes.io/zone: eu-west-1a
```
The full list is in the Kubernetes reference documentation.
Why you need both taints and affinity for exclusive placement
This is the point most guides skip. Tolerations alone let a pod run on a tainted node, but nothing stops that pod from landing on a non-tainted node instead. Node affinity alone attracts a pod to specific nodes, but nothing blocks other pods from consuming capacity on those same nodes.
For exclusive placement (GPU pods on GPU nodes only, nothing else on GPU nodes), you need both:
- Taint the node so non-GPU pods cannot land there
- Add a toleration so GPU pods are allowed
- Add node affinity so GPU pods are directed there specifically
Without all three, either the wrong pods land on the node or the right pods land on the wrong node.
Pod affinity and anti-affinity
Inter-pod affinity and anti-affinity constrain scheduling based on labels of pods already running on a node, not labels on the node itself.
topologyKey
The topologyKey field defines what "same location" means. It is a node label key whose value defines the topology domain:
| topologyKey | Meaning |
|---|---|
| `kubernetes.io/hostname` | Same node |
| `topology.kubernetes.io/zone` | Same availability zone |
| `topology.kubernetes.io/region` | Same region |
Pod affinity: co-locate for latency
Force a pod onto the same node as a Redis cache it depends on:
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - redis-cache
          topologyKey: kubernetes.io/hostname
```
Pod anti-affinity: spread for availability
Prevent two replicas of the same Deployment from landing on the same node:
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - web-server
          topologyKey: kubernetes.io/hostname
```
Use preferredDuringSchedulingIgnoredDuringExecution instead of required when the replica count might exceed the node count. A hard anti-affinity rule with 5 replicas and 3 nodes leaves 2 pods Pending forever.
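The preferred form wraps the term in a weight. A sketch reusing the same web-server selector:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                  # strongest preference
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - web-server
            topologyKey: kubernetes.io/hostname
```

The scheduler spreads replicas while it can, then doubles up on nodes instead of leaving pods Pending.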
Performance note
Pod affinity and anti-affinity require the scheduler to check pod labels across the cluster on every scheduling cycle. In clusters with hundreds of nodes or thousands of pods, this adds significant latency to scheduling decisions. For simple spreading, topology spread constraints are more efficient.
Topology spread constraints
Topology spread constraints distribute pods evenly across failure domains (zones, nodes, regions) without the per-pod label scanning overhead of pod anti-affinity.
Core fields
```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                        # max difference in pod count between domains
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule  # or ScheduleAnyway
      labelSelector:
        matchLabels:
          app: web
```
maxSkew is the maximum allowed difference in pod count between the busiest and emptiest topology domain. Must be greater than 0. A maxSkew of 1 with 3 zones and 6 replicas means 2-2-2 distribution.
whenUnsatisfiable controls what happens when the constraint cannot be met. DoNotSchedule keeps the pod Pending (use for HA-critical services). ScheduleAnyway places the pod anyway but prefers nodes that minimize skew.
minDomains (GA since Kubernetes 1.30) sets the minimum number of topology domains that must exist before the constraint applies. Only valid with DoNotSchedule.
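A sketch combining minDomains with the zone constraint above:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 3                      # treat fewer than 3 zones as a violation
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # minDomains is only valid with DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
```

With only two zones available, a "missing" third domain counts as having zero pods, so scheduling more pods into the existing zones would violate maxSkew and the pods stay Pending.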
Dual-constraint example: spread across zones and nodes
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # strict zone balance
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname        # soft node balance
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```
This gives strict zone distribution (no zone has more than 1 extra pod) while allowing some node imbalance within a zone. In practice, this is the pattern I see most often in production clusters running stateless web services.
Practical patterns
GPU node isolation
Expensive GPU nodes should only run GPU workloads. The combination of taint + toleration + affinity enforces this:
```bash
# Label and taint GPU nodes
kubectl label nodes gpu-node-1 accelerator=nvidia-a100
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
```
GKE automatically adds the taint nvidia.com/gpu=present:NoSchedule when you create GPU node pools.
The GPU Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                      - nvidia-a100
      containers:
        - name: training
          image: ml-training:v2.3
          resources:
            limits:
              nvidia.com/gpu: 1   # GPU resources go in limits only, not requests
```
GPU resources are specified under limits; if you also set requests, the value must equal the limit (when you set only the limit, Kubernetes defaults the request to it). The NVIDIA, AMD, or Intel device plugin must be installed on the cluster for GPU scheduling to work.
Spot / preemptible node isolation
Fault-tolerant batch workloads run on cheaper spot instances. Critical workloads stay on on-demand nodes. Each cloud provider uses different taint keys:
| Cloud | Spot node label | Automatic taint |
|---|---|---|
| AKS | kubernetes.azure.com/scalesetpriority=spot |
kubernetes.azure.com/scalesetpriority=spot:NoSchedule |
| GKE | cloud.google.com/gke-spot=true |
cloud.google.com/gke-spot=true:NoSchedule |
| EKS | eks.amazonaws.com/capacityType=SPOT |
Custom (typically spot=true:NoSchedule) |
A batch worker that tolerates spot nodes and handles eviction gracefully:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 8
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120   # 2-minute grace period when spot is reclaimed
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/capacity-type
                    operator: In
                    values:
                      - spot
      containers:
        - name: worker
          image: batch-processor:v1.8
```
Combine with a PodDisruptionBudget to limit how many pods can be unavailable simultaneously when spot nodes are reclaimed:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: batch-worker
```
If you use Karpenter instead of the cluster autoscaler, define separate NodePools for spot and on-demand with different karpenter.sh/capacity-type requirements and matching taints.
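A minimal sketch of such a spot NodePool. Field names follow the `karpenter.sh/v1` API and the pool name is illustrative; a real NodePool also needs a `nodeClassRef` for your provider, so check the Karpenter docs for your version:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool                      # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
      taints:                          # keep non-tolerating workloads off spot capacity
        - key: spot
          value: "true"
          effect: NoSchedule
      # nodeClassRef: (provider-specific, omitted here)
```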
Multi-tenant dedicated nodes
Each tenant gets nodes that only their workloads can use:
```bash
kubectl label nodes tenant-a-node-1 tenant=team-alpha
kubectl taint nodes tenant-a-node-1 tenant=team-alpha:NoSchedule
```
```yaml
spec:
  tolerations:
    - key: "tenant"
      operator: "Equal"
      value: "team-alpha"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tenant
                operator: In
                values:
                  - team-alpha
```
A critical caveat: taints and tolerations are not a security boundary. A misconfigured pod can add a matching toleration and bypass the restriction. For real multi-tenant isolation, enforce toleration restrictions with a policy engine like Kyverno or OPA/Gatekeeper. For label security, use the node-restriction.kubernetes.io/ prefix so kubelet cannot self-modify those labels.
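As an illustration, a Kyverno ClusterPolicy along these lines could deny pods outside an allowed namespace that tolerate the tenant taint. This is a rough sketch, not a production policy: the policy name, excluded namespace, and JMESPath expression are assumptions to adapt:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-tenant-tolerations    # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-tenant-toleration
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - team-alpha           # hypothetical: the only namespace allowed this toleration
      validate:
        message: "Only the team-alpha namespace may tolerate the tenant taint."
        deny:
          conditions:
            any:
              # count tolerations whose key is 'tenant'; deny if any exist
              - key: "{{ request.object.spec.tolerations[?key=='tenant'] || `[]` | length(@) }}"
                operator: GreaterThan
                value: 0
```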
Verify the result
After applying taints, tolerations, and affinity rules, verify everything landed correctly:
```bash
# Check all taints on all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Confirm the pod is running on the expected node
kubectl get pod <pod-name> -o wide

# If a pod is stuck in Pending, read the scheduler events
kubectl describe pod <pod-name>
```
The Events section in `kubectl describe pod` shows exactly which scheduler filter failed. Look for messages like `0/5 nodes are available: 3 node(s) had untolerated taint {nvidia.com/gpu: present}`. The pod Pending troubleshooting guide covers every scheduler failure message in detail.
Common troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Pod Pending with "untolerated taint" | Missing toleration in pod spec | Add the matching toleration |
| Pod Pending with "didn't match node affinity" | Node lacks the required label | Add the label to the node or relax to preferred |
| Pod Pending with "didn't match pod anti-affinity" | More replicas than nodes | Switch to preferred anti-affinity or add nodes |
| Pod lands on wrong node despite affinity | Toleration without node affinity | Add node affinity to direct the pod, not just allow it |
| DaemonSet not running on tainted node | Custom taint without toleration in DaemonSet | Add the custom taint toleration to the DaemonSet spec |