Goal
At the end of this article you will know how to use taints, tolerations, node affinity, pod anti-affinity, and topology spread constraints to control exactly which pods run on which nodes in a Kubernetes cluster.
Prerequisites
- A Kubernetes cluster running v1.28 or later, with `kubectl` access and permission to taint nodes and create Deployments
- At least two nodes in the cluster (scheduling constraints are meaningless with a single node)
- Familiarity with Kubernetes Services and pod labels. If a pod gets stuck in Pending after applying these rules, the pod Pending troubleshooting guide covers every scheduler failure in detail.
Taints and tolerations: repelling pods from nodes
A taint is a property on a node that repels pods unless a pod explicitly tolerates it. Think of it as a "keep out" sign. A toleration on a pod says "I can handle that sign."
A taint has three parts: a key, an optional value, and an effect.
Adding and removing taints
```bash
# Add a taint to a node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Verify taints on a node
kubectl describe node gpu-node-1 | grep Taints

# Remove a specific taint (note the trailing minus)
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-
```
The three taint effects
NoSchedule is the most common. New pods without a matching toleration will not land on the node. Pods already running are not evicted. Use this for reserving GPU nodes, spot pools, or tenant-dedicated hardware.
PreferNoSchedule is the soft version. The scheduler tries to avoid the node but will place pods there if no alternatives exist. Existing pods stay put. Use this when you want to steer traffic away from certain nodes without hard-blocking.
NoExecute is the strictest. Pods without a matching toleration are evicted immediately. New pods without a toleration cannot schedule either. This is the effect the node lifecycle controller applies automatically when a node becomes not-ready or unreachable.
When a node has multiple taints, Kubernetes evaluates them as a combined filter: if any unmatched taint has NoSchedule, the pod is blocked. If the only unmatched taints are PreferNoSchedule, the scheduler avoids but may proceed. If any unmatched taint has NoExecute, running pods get evicted.
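Concretely, a pod must carry a toleration for every NoSchedule taint on a node before it can land there. A minimal sketch for a hypothetical node carrying both the GPU taint from above and a spot taint:

```yaml
# Node is tainted with BOTH of these:
#   nvidia.com/gpu=present:NoSchedule
#   spot=true:NoSchedule
# The pod needs both tolerations, or it is filtered out:
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Drop either toleration and the node is excluded, because one NoSchedule taint remains unmatched.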
Writing tolerations
Tolerations go in spec.tolerations on the pod (or in the pod template of a Deployment). Two operators control matching:
Equal matches a specific key, value, and effect:
```yaml
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
```
Exists matches any value for the given key:
```yaml
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
An empty key with Exists tolerates all taints on all keys. DaemonSets like monitoring agents sometimes use this so they run on every node regardless of taints. Use it sparingly.
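The blanket form looks like this (a minimal sketch; reserve it for agents that genuinely must run on every node):

```yaml
# No key + Exists = tolerate every taint on every key
tolerations:
  - operator: "Exists"
```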
tolerationSeconds with NoExecute
When a NoExecute taint is added to a node (during maintenance or after a node condition), pods with a matching toleration can specify how long they stay before eviction:
```yaml
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120  # pod gets 2 minutes to shut down gracefully
```
Built-in automatic taints
The node lifecycle controller adds taints automatically when conditions are detected. These are the most common:
| Taint | Trigger | Effect |
|---|---|---|
| `node.kubernetes.io/not-ready` | Node Ready condition is False | NoExecute |
| `node.kubernetes.io/unreachable` | Node Ready condition is Unknown | NoExecute |
| `node.kubernetes.io/memory-pressure` | MemoryPressure condition | NoSchedule |
| `node.kubernetes.io/disk-pressure` | DiskPressure condition | NoSchedule |
| `node.kubernetes.io/pid-pressure` | PIDPressure condition | NoSchedule |
The DaemonSet controller automatically adds tolerations for these built-in taints so system pods (CNI plugins, log collectors, monitoring agents) keep running. But if you add custom taints to nodes, DaemonSets will not tolerate them unless you add the toleration to the DaemonSet spec explicitly.
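A minimal sketch of what that looks like, assuming a hypothetical log-collector DaemonSet (name and image are placeholders) that must also run on the GPU nodes tainted earlier:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector          # hypothetical agent
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      tolerations:
        # built-in node-condition taints are tolerated automatically;
        # custom taints like this one must be listed explicitly
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: collector
          image: log-collector:v1.0   # placeholder image
```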
Node affinity: attracting pods to nodes
Taints repel. Node affinity does the opposite: it attracts pods toward nodes with specific labels. It replaces the older nodeSelector with more expressive matching.
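For comparison, the older nodeSelector form supports only exact key-value matches, with no operators, no OR logic, and no soft preferences:

```yaml
# nodeSelector: every listed label must match exactly
spec:
  nodeSelector:
    accelerator: nvidia-a100
```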
Required vs. preferred
requiredDuringSchedulingIgnoredDuringExecution is a hard rule. The pod stays Pending if no node matches. IgnoredDuringExecution means the pod is not evicted if node labels change after scheduling.
preferredDuringSchedulingIgnoredDuringExecution is a soft preference. The scheduler tries to find a match but will schedule elsewhere if needed. Each preference carries a weight between 1 and 100. Higher weight means stronger preference.
YAML example
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present        # must have a GPU
                operator: In
                values:
                  - "true"
              - key: topology.kubernetes.io/zone   # and be in one of these zones
                operator: In
                values:
                  - eu-west-1a
                  - eu-west-1b
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - p4d.24xlarge                   # prefer the bigger GPU instance
  containers:
    - name: training
      image: ml-training:v2.3
      resources:
        limits:
          nvidia.com/gpu: 1
```
Operators and term logic
Node affinity supports six operators: In, NotIn, Exists, DoesNotExist, Gt, and Lt. The last two compare integer label values.
Multiple nodeSelectorTerms are ORed: the pod schedules if any term matches. Multiple expressions within a single matchExpressions list are ANDed: all must match. If you combine nodeSelector and nodeAffinity, both must be satisfied.
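A sketch of the term logic, assuming a hypothetical integer-valued label `example.com/cpu-count`:

```yaml
# Matches nodes that are EITHER arm64 OR have more than 8 CPUs.
# Separate nodeSelectorTerms are ORed; expressions inside one term are ANDed.
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:                 # term 1
          - key: kubernetes.io/arch
            operator: In
            values:
              - arm64
      - matchExpressions:                 # term 2, ORed with term 1
          - key: example.com/cpu-count    # hypothetical integer label
            operator: Gt
            values:
              - "8"
```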
Well-known node labels
These labels are set automatically on most cloud-provisioned nodes and are safe to target with affinity rules:
```yaml
kubernetes.io/arch: amd64                   # or arm64
kubernetes.io/os: linux                     # or windows
node.kubernetes.io/instance-type: m5.large
topology.kubernetes.io/region: eu-west-1
topology.kubernetes.io/zone: eu-west-1a
```
The full list is in the Kubernetes reference documentation.
Why you need both taints and affinity for exclusive placement
This is the point most guides skip. Tolerations alone let a pod run on a tainted node, but nothing stops that pod from landing on a non-tainted node instead. Node affinity alone attracts a pod to specific nodes, but nothing blocks other pods from consuming capacity on those same nodes.
For exclusive placement (GPU pods on GPU nodes only, nothing else on GPU nodes), you need both:
- Taint the node so non-GPU pods cannot land there
- Add a toleration so GPU pods are allowed
- Add node affinity so GPU pods are directed there specifically
Without all three, either the wrong pods land on the node or the right pods land on the wrong node.
Pod affinity and anti-affinity
Inter-pod affinity and anti-affinity constrain scheduling based on labels of pods already running on a node, not labels on the node itself.
topologyKey
The topologyKey field defines what "same location" means. It is a node label key whose value defines the topology domain:
| topologyKey | Meaning |
|---|---|
| `kubernetes.io/hostname` | Same node |
| `topology.kubernetes.io/zone` | Same availability zone |
| `topology.kubernetes.io/region` | Same region |
Pod affinity: co-locate for latency
Force a pod onto the same node as a Redis cache it depends on:
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - redis-cache
          topologyKey: kubernetes.io/hostname
```
Pod anti-affinity: spread for availability
Prevent two replicas of the same Deployment from landing on the same node:
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - web-server
          topologyKey: kubernetes.io/hostname
```
Use preferredDuringSchedulingIgnoredDuringExecution instead of required when the replica count might exceed the node count. A hard anti-affinity rule with 5 replicas and 3 nodes leaves 2 pods Pending forever.
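The preferred form wraps the term in a weight. A sketch reusing the same web-server selector:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                  # strongest preference
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - web-server
            topologyKey: kubernetes.io/hostname
```

The scheduler spreads replicas while it can, then doubles up on nodes instead of leaving pods Pending.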
Performance note
Pod affinity and anti-affinity require the scheduler to check pod labels across the cluster on every scheduling cycle. In clusters with hundreds of nodes or thousands of pods, this adds significant latency to scheduling decisions. For simple spreading, topology spread constraints are more efficient.
Topology spread constraints
Topology spread constraints distribute pods evenly across failure domains (zones, nodes, regions) without the per-pod label scanning overhead of pod anti-affinity.
Core fields
```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                        # max difference in pod count between domains
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule  # or ScheduleAnyway
      labelSelector:
        matchLabels:
          app: web
```
maxSkew is the maximum allowed difference in pod count between the busiest and emptiest topology domain. Must be greater than 0. A maxSkew of 1 with 3 zones and 6 replicas means 2-2-2 distribution.
whenUnsatisfiable controls what happens when the constraint cannot be met. DoNotSchedule keeps the pod Pending (use for HA-critical services). ScheduleAnyway places the pod anyway but prefers nodes that minimize skew.
minDomains (GA since Kubernetes 1.30) sets the minimum number of topology domains that must exist before the constraint applies. Only valid with DoNotSchedule.
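A sketch combining minDomains with the zone constraint above:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 3                      # treat fewer than 3 zones as a violation
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # minDomains is only valid with DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
```

With only two zones available, a "missing" third domain counts as having zero pods, so scheduling more pods into the existing zones would violate maxSkew and the pods stay Pending.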
Dual-constraint example: spread across zones and nodes
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # strict zone balance
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname        # soft node balance
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```
This gives strict zone distribution (no zone has more than 1 extra pod) while allowing some node imbalance within a zone. In practice, this is the pattern I see most often in production clusters running stateless web services.
Practical patterns
GPU node isolation
Expensive GPU nodes should only run GPU workloads. The combination of taint + toleration + affinity enforces this:
```bash
# Label and taint GPU nodes
kubectl label nodes gpu-node-1 accelerator=nvidia-a100
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
```
GKE automatically adds the taint nvidia.com/gpu=present:NoSchedule when you create GPU node pools.
The GPU Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                      - nvidia-a100
      containers:
        - name: training
          image: ml-training:v2.3
          resources:
            limits:
              nvidia.com/gpu: 1   # GPU resources go in limits only, not requests
```
GPU resources are specified under limits; if you also set requests, the value must equal the limit (when you set only the limit, Kubernetes defaults the request to it). The NVIDIA, AMD, or Intel device plugin must be installed on the cluster for GPU scheduling to work.
Spot / preemptible node isolation
Fault-tolerant batch workloads run on cheaper spot instances. Critical workloads stay on on-demand nodes. Each cloud provider uses different taint keys:
| Cloud | Spot node label | Automatic taint |
|---|---|---|
| AKS | kubernetes.azure.com/scalesetpriority=spot |
kubernetes.azure.com/scalesetpriority=spot:NoSchedule |
| GKE | cloud.google.com/gke-spot=true |
cloud.google.com/gke-spot=true:NoSchedule |
| EKS | eks.amazonaws.com/capacityType=SPOT |
Custom (typically spot=true:NoSchedule) |
A batch worker that tolerates spot nodes and handles eviction gracefully:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 8
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120   # 2-minute grace period when spot is reclaimed
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/capacity-type
                    operator: In
                    values:
                      - spot
      containers:
        - name: worker
          image: batch-processor:v1.8
```
Combine with a PodDisruptionBudget to limit how many pods can be unavailable simultaneously when spot nodes are reclaimed:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: batch-worker
```
If you use Karpenter instead of the cluster autoscaler, define separate NodePools for spot and on-demand with different karpenter.sh/capacity-type requirements and matching taints.
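A minimal sketch of such a spot NodePool. Field names follow the `karpenter.sh/v1` API and the pool name is illustrative; a real NodePool also needs a `nodeClassRef` for your provider, so check the Karpenter docs for your version:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool                      # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
      taints:                          # keep non-tolerating workloads off spot capacity
        - key: spot
          value: "true"
          effect: NoSchedule
      # nodeClassRef: (provider-specific, omitted here)
```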
Multi-tenant dedicated nodes
Each tenant gets nodes that only their workloads can use:
```bash
kubectl label nodes tenant-a-node-1 tenant=team-alpha
kubectl taint nodes tenant-a-node-1 tenant=team-alpha:NoSchedule
```
```yaml
spec:
  tolerations:
    - key: "tenant"
      operator: "Equal"
      value: "team-alpha"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tenant
                operator: In
                values:
                  - team-alpha
```
A critical caveat: taints and tolerations are not a security boundary. A misconfigured pod can add a matching toleration and bypass the restriction. For real multi-tenant isolation, enforce toleration restrictions with a policy engine like Kyverno or OPA/Gatekeeper. For label security, use the node-restriction.kubernetes.io/ prefix so kubelet cannot self-modify those labels.
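As an illustration, a Kyverno ClusterPolicy along these lines could deny pods outside an allowed namespace that tolerate the tenant taint. This is a rough sketch, not a production policy: the policy name, excluded namespace, and JMESPath expression are assumptions to adapt:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-tenant-tolerations    # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-tenant-toleration
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - team-alpha           # hypothetical: the only namespace allowed this toleration
      validate:
        message: "Only the team-alpha namespace may tolerate the tenant taint."
        deny:
          conditions:
            any:
              # count tolerations whose key is 'tenant'; deny if any exist
              - key: "{{ request.object.spec.tolerations[?key=='tenant'] || `[]` | length(@) }}"
                operator: GreaterThan
                value: 0
```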
Verify the result
After applying taints, tolerations, and affinity rules, verify everything landed correctly:
```bash
# Check all taints on all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Confirm the pod is running on the expected node
kubectl get pod <pod-name> -o wide

# If a pod is stuck in Pending, read the scheduler events
kubectl describe pod <pod-name>
```
The Events section in `kubectl describe pod` shows exactly which scheduler filter failed. Look for messages like `0/5 nodes are available: 3 node(s) had untolerated taint {nvidia.com/gpu: present}`. The pod Pending troubleshooting guide covers every scheduler failure message in detail.
Common troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Pod Pending with "untolerated taint" | Missing toleration in pod spec | Add the matching toleration |
| Pod Pending with "didn't match node affinity" | Node lacks the required label | Add the label to the node or relax to preferred |
| Pod Pending with "didn't match pod anti-affinity" | More replicas than nodes | Switch to preferred anti-affinity or add nodes |
| Pod lands on wrong node despite affinity | Toleration without node affinity | Add node affinity to direct the pod, not just allow it |
| DaemonSet not running on tainted node | Custom taint without toleration in DaemonSet | Add the custom taint toleration to the DaemonSet spec |