Pod Pending: "didn't match pod topology spread constraints" with untolerated taints

A pod with topology spread constraints stays Pending even though the cluster has free capacity. The scheduler reports both untolerated taints and unsatisfied topology spread constraints in the same FailedScheduling event. The cause is the default nodeTaintsPolicy: Ignore, which counts unreachable tainted nodes in the spread math and creates a deadlock in multi-tenant clusters. The fix is to set nodeTaintsPolicy: Honor on the constraint.

The symptom

A pod stays in Pending. The cluster has free capacity. kubectl describe pod returns a FailedScheduling event that mixes two seemingly unrelated complaints in one message:

Warning  FailedScheduling  default-scheduler
0/12 nodes are available:
  3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane},
  3 node(s) had untolerated taint {dedicated: tenant-a},
  6 node(s) didn't match pod topology spread constraints.
preemption: 0/12 nodes are available: 12 Preemption is not helpful for scheduling.

The pod has a topologySpreadConstraints block on topology.kubernetes.io/region with maxSkew: 1 and whenUnsatisfiable: DoNotSchedule. Two of the three replicas are already running. The third has nowhere to go. Scaling up the worker pool does not help. Adding more replicas to the deployment does not help.
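For reference, the constraint that produces this symptom looks roughly like the sketch below; the labelSelector values are placeholders, the rest matches the description above:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region
    whenUnsatisfiable: DoNotSchedule
    # nodeTaintsPolicy is unset, so the default Ignore applies
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: <your-app-label>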

What this actually means

The scheduler's topology spread plugin calculates pod distribution across "eligible domains". By default, nodeTaintsPolicy is Ignore, which means the plugin counts every node carrying the topology label, regardless of whether the pod can actually land there. In a multi-tenant cluster where one region's nodes are tainted for another tenant, that region appears as a domain with zero pods and pulls the global minimum to zero. The constraint then forces scheduling toward the unreachable domain. The taint blocks it. Deadlock.

How the spread math actually works

maxSkew is the maximum allowed difference between any eligible domain's pod count and the global minimum. With whenUnsatisfiable: DoNotSchedule, the scheduler refuses to place a pod if doing so would push the difference past maxSkew. The "eligible domain" definition is where this article's bug lives.

Consider a 12-node cluster split across three regions. The pod tolerates the worker taints in eu-west-1 and eu-central-1 but not the dedicated taint in us-east-1:

Region         Worker nodes   Pods running   Reachable for this pod?
eu-west-1      3              1              Yes
eu-central-1   3              1              Yes
us-east-1      3              0              No (taint dedicated=tenant-a)

With the default nodeTaintsPolicy: Ignore, all three regions are eligible domains. Global minimum across eligible domains is zero (the count in us-east-1). Placing in eu-west-1 or eu-central-1 would yield 2-0=2, exceeding maxSkew: 1. Placing in us-east-1 would satisfy the math (1-0=1) but the taint rejects the binding. No placement works.

With nodeTaintsPolicy: Honor, the plugin filters us-east-1 out of the eligible domain set before counting. The math is now over two domains: eu-west-1 (1) and eu-central-1 (1). Placing the third pod in either yields 2-1=1, which equals maxSkew: 1. The pod schedules.

Why the default is Ignore

Before Kubernetes 1.25, the topology spread plugin had no taint awareness at all. Tainted nodes were always counted as domain members. Defaulting nodeTaintsPolicy to Ignore preserves that pre-1.25 behavior and avoids silently changing scheduling outcomes for workloads that happen to work under the old counting logic. The graduation timeline for NodeInclusionPolicyInPodTopologySpread is alpha in 1.25, beta (enabled by default) in 1.26, and GA via PR #130920 in 1.33.

This deadlock pattern was reported as kubernetes/kubernetes#107464 ("Pod Topology Spread takes into account unschedulable tainted nodes") and the cordon-during-upgrade variant as #106127. The fix was the new policy fields, not a default change.

The companion field nodeAffinityPolicy defaults to Honor, which is the opposite default. The asymmetry exists because the scheduler already filtered nodeAffinity-mismatched nodes via a separate plugin chain before topology spread ran. Defaulting to Honor preserved that pre-existing implicit behavior, while taint awareness was genuinely new and got the conservative Ignore default.
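If you prefer the constraint to be explicit about both policies rather than relying on the asymmetric defaults, the two fields sit side by side on the same entry. A sketch, with the labelSelector as a placeholder:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region
    whenUnsatisfiable: DoNotSchedule
    nodeAffinityPolicy: Honor   # the default, stated here for clarity
    nodeTaintsPolicy: Honor     # not the default; must be set explicitly
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: <your-app-label>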

Diagnosis

Confirm the deadlock with the following checks, run in order.

1. Read the FailedScheduling event. On clusters running Kubernetes 1.28 or newer, the modern command is kubectl events (GA in 1.28):

kubectl events --for pod/<pod-name> -n <namespace> --types=Warning

On older clusters, fall back to kubectl describe pod <pod-name> and read the Events section. The composite message will contain both untolerated taint and didn't match pod topology spread constraints as separate clauses if the deadlock applies.

2. Map nodes to regions and taints. This is the diagnostic that confirms the deadlock structure:

kubectl get nodes -L topology.kubernetes.io/region \
  -o custom-columns=NAME:.metadata.name,\
REGION:'.metadata.labels.topology\.kubernetes\.io/region',\
TAINTS:.spec.taints

You should see one or more regions where every node carries a taint that the pending pod does not tolerate. That region is the ghost domain in the math.

3. Check the pod's tolerations against those taints.

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'

Cross-reference each NoSchedule and NoExecute taint from step 2 against the pod's tolerations. If the tainted region's taints are absent from the pod's toleration list, you have the deadlock.
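To pull the taints for every node in the suspect region in one pass, a jsonpath sketch (us-east-1 is the example region from the table above; substitute your own region value):

kubectl get nodes -l topology.kubernetes.io/region=us-east-1 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}:{.effect} {end}{"\n"}{end}'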

4. Reconstruct what the scheduler is counting. List the pods matched by the same labelSelector the constraint uses and the nodes they run on. Pods do not carry the region label themselves, so map each node to its region using the step 2 output:

kubectl get pods -n <namespace> \
  -l <label-selector-from-constraint> \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

If the resulting per-region pod counts match the structure in the math table above (a tainted region at zero pulling the global minimum down), the deadlock is confirmed.
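For a quick tally, the same query can emit only node names and count them with sort and uniq; a sketch (again, map nodes to regions via the step 2 output):

kubectl get pods -n <namespace> -l <label-selector-from-constraint> \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c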

5. Check the current nodeTaintsPolicy value on the constraint.

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.topologySpreadConstraints[*].nodeTaintsPolicy}'

An empty result means the field is unset and the scheduler is falling back to the default Ignore, which confirms that the fix below applies.

Solution: set nodeTaintsPolicy: Honor

The minimum change is one line on the constraint:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor       # excludes nodes with untolerated taints
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: cache-replica

After applying, the tainted region drops out of the eligible domain set and the spread math is recalculated over the reachable regions only.
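For a one-off test on a Deployment or StatefulSet that is not managed through a chart, the same field can be added with a JSON patch; a sketch with the object name as a placeholder (patching the pod template triggers a rolling update):

kubectl patch statefulset <statefulset-name> -n <namespace> --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/topologySpreadConstraints/0/nodeTaintsPolicy", "value": "Honor"}]'

A hand-applied patch like this is overwritten on the next Helm upgrade or GitOps sync, which is why the next section bakes the change into the render instead.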

When the Helm chart does not expose the field

Many Helm charts expose topologySpreadConstraints as a values block but stop at maxSkew, topologyKey, and whenUnsatisfiable. As of April 2026, the DandyDeveloper redis-ha chart is one example: nodeTaintsPolicy is not in the values schema and not in the StatefulSet template.

In that case, patch the rendered manifest with Kustomize JSON 6902:

# kustomization.yaml
patches:
  - target:
      kind: StatefulSet
      name: redis-ha-server
    patch: |-
      - op: add
        path: /spec/template/spec/topologySpreadConstraints/0/nodeTaintsPolicy
        value: Honor

The add operation creates the field because it does not exist on the rendered object. The path index 0 assumes one constraint; bump it for additional entries.
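For completeness, the patch sits next to the chart render in the same kustomization.yaml. A sketch assuming the chart is rendered with kustomize's helmCharts generator; the repo URL and version are illustrative, so check the chart's own documentation:

# kustomization.yaml (same file as the patch above)
helmCharts:
  - name: redis-ha
    repo: https://dandydeveloper.github.io/charts
    version: <chart-version>
    releaseName: redis-ha
    namespace: <namespace>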

Verifying the fix took effect

Check the rendered StatefulSet before applying:

kustomize build --enable-helm . | \
  yq 'select(.kind == "StatefulSet" and .metadata.name == "redis-ha-server") | .spec.template.spec.topologySpreadConstraints'

After applying, confirm the running pod sees the field:

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.topologySpreadConstraints[*].nodeTaintsPolicy}'

Output should be Honor. Trigger a rescheduling for the still-pending pod with kubectl delete pod <pending-pod> so the controller recreates it with the patched spec. You will know it worked when the pod transitions from Pending to Running and lands on a node in a reachable region.
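To watch the transition and the landing node in one view, cross-checking the NODE column against the node-to-region mapping from step 2:

kubectl get pods -n <namespace> -l <label-selector-from-constraint> -o wide -w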

What does not solve this: whenUnsatisfiable: ScheduleAnyway

Switching whenUnsatisfiable from DoNotSchedule to ScheduleAnyway makes the symptom go away but is not equivalent to the Honor fix. It downgrades the constraint to a soft preference, which has four real consequences:

  • The scheduler tries to reduce skew but other scoring factors (resource utilization, affinity scoring) compete. Distribution becomes non-deterministic.
  • Existing imbalances are not corrected. Pods that landed suboptimally stay there until rescheduled by a descheduler or eviction.
  • Karpenter treats ScheduleAnyway constraints as real constraints during consolidation. Nodes will not be consolidated if doing so would violate even a soft constraint, which can keep underutilized nodes alive longer than necessary.
  • The original distribution intent is lost. The constraint exists to enforce a hard distribution rule, and ScheduleAnyway silently abandons that intent.

The recommendation in community discussion is to keep DoNotSchedule and add nodeTaintsPolicy: Honor. That preserves the strict-distribution intent and resolves the deadlock at its root.

When this pattern bites in production

Multi-tenant clusters generate this deadlock in a handful of recurring shapes:

  • Dedicated tenant pools. Node groups tainted with dedicated=<tenant>:NoSchedule. Workloads from other tenants cannot land there but the region or zone still counts in spread math.
  • GPU node pools. Tainted with nvidia.com/gpu:NoSchedule. Non-GPU workloads with region-level spread constraints see GPU regions as zero-pod domains.
  • Node cordoning during upgrades. A kubectl cordon applies node.kubernetes.io/unschedulable:NoSchedule. Pods evicted from cordoned nodes try to reschedule, but the cordoned node's region remains in the spread accounting at zero pods. This is the most common production trigger and the one in #106127.
  • virtual-kubelet nodes sharing a topology label with real nodes.
  • vCluster, Capsule, or other multi-tenant abstractions using taints to isolate tenants.

The deadlock is more likely with topology.kubernetes.io/region and topology.kubernetes.io/zone than with kubernetes.io/hostname. Region and zone topologies have few domains; one ghost domain breaks the math instantly. Hostname topology has hundreds of domains, so a single tainted host has a much smaller distorting effect.

When to escalate

If applying nodeTaintsPolicy: Honor does not resolve the Pending state, collect this information before reaching out:

  • Full output of kubectl describe pod <pod-name> -n <namespace>
  • kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone -o wide
  • kubectl get pod <pod-name> -o yaml (full spec including final tolerations and constraints)
  • All FailedScheduling events: kubectl events --for pod/<pod-name> -n <namespace> --types=Warning
  • Per-region pod distribution for the labelSelector (the step 4 command above)
  • Kubernetes version: kubectl version
  • The Helm chart name and version, or the manifest source, plus the rendered spec

This is enough to diagnose any remaining filter chain that is rejecting the pod, including stacked constraints (multiple topologySpreadConstraints entries), unsatisfiable nodeAffinity, or pod anti-affinity collisions.

How to prevent recurrence

  • In any multi-tenant cluster, set nodeTaintsPolicy: Honor on every constraint that uses region or zone topology keys. The GA in 1.33 only makes the field unconditionally available; the API default remains Ignore even after GA, so treat Ignore as wrong-by-default for multi-tenant scheduling on every version.
  • Audit cordoned-node behavior during cluster upgrades. The same deadlock surfaces transiently when a node gets cordoned and the eviction triggers a reschedule under the old default.
  • Add nodeTaintsPolicy: Honor to any base Helm chart values you maintain internally so engineers do not need to remember it per service (a sketch of such a values fragment follows this list). For charts you do not own, keep a Kustomize overlay ready that patches it in.
  • Cross-check this against the broader Pod Pending diagnosis flow when the FailedScheduling event mixes resource and constraint clauses. The deadlock pattern is rare in single-tenant clusters but routine in shared ones, including during eviction-driven reschedules.
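
A sketch of the shared values fragment mentioned above, assuming the chart passes topologySpreadConstraints through to the pod template verbatim; the values key and the app label are placeholders, so check the chart's schema:

# values.yaml fragment kept in the internal base
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: <service-name>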

