Kubernetes DaemonSet: node-level workloads and use cases

A Deployment puts pods wherever the scheduler finds room. A DaemonSet does the opposite: it places exactly one pod on every node that matches its selector and tracks the node count automatically. That distinction is what makes log collectors, CNI plugins, and monitoring agents possible. This article explains what a DaemonSet guarantees, when you need one, and the design choices that catch people out in production.

What a DaemonSet is

A DaemonSet is a workload controller (apiVersion: apps/v1, kind: DaemonSet) that ensures a copy of a pod runs on every node that matches its selector. When a node joins the cluster, the controller creates a pod on it. When a node leaves, the pod is garbage-collected. You never set a replica count; the count is whatever the node count happens to be.

Three guarantees follow from that:

  • One pod per matching node. Not two, not zero. The controller maintains the invariant.
  • Automatic scaling with the cluster. Add a node, get a pod. Drain a node, the pod goes with it.
  • Node-local placement. Each pod is bound to a specific node from the moment it is created, so it sees that node's filesystem, network, and devices when those are mounted in the spec.

That is the whole job of a DaemonSet. It does not balance load. It does not roll between nodes. It does not run multiple replicas per node. It is a "one of these per machine" controller, and its API surface is built around that single idea.
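
You can see the invariant directly in kubectl's DaemonSet columns. The numbers below are illustrative (a five-node cluster running the stock kube-proxy DaemonSet), but the column layout is what any DaemonSet reports:

kubectl get daemonset kube-proxy -n kube-system
# NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
# kube-proxy   5         5         5       5            5           kubernetes.io/os=linux   21d

DESIRED is the matching node count; the controller's whole job is to keep the other columns equal to it.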

Common use cases

A DaemonSet is the right answer when the workload's value comes from being on every node, not from horizontal scaling. The patterns that fit:

  • Log collectors. Fluent Bit deployed as a DaemonSet reads /var/log/containers/*.log from each node and forwards to a central store. One pod missing a node means logs from that node disappear.
  • Monitoring agents. Prometheus node-exporter, Datadog, New Relic, and the OpenTelemetry Collector in agent mode all run as DaemonSets so each node's metrics are scraped locally.
  • CNI plugins. Cilium, Calico, and Flannel install their data plane on every node via a DaemonSet. Pods cannot get networking on a node where the CNI agent has not landed yet.
  • Storage drivers. CSI node plugins run as DaemonSets so any pod that mounts a volume can talk to a local agent that handles the mount syscall. Compare that with the CSI controller plugin, which is a Deployment because it runs centrally.
  • Security and compliance scanners. Falco, Trivy Operator node-collector, and most runtime-security tools need a process on every node that watches syscalls or kernel events.
  • Ingress data planes on dedicated node pools. Some teams pin their ingress controller pods to a specific node pool with a DaemonSet plus a node selector, so each ingress node has exactly one ingress pod. (More common is a Deployment on labelled nodes, but the DaemonSet pattern is valid when you want strict 1:1.)

The common thread: each node has work to do. The work is not "handle 1/N of the traffic" but "be the agent for this specific machine."

Anatomy of a DaemonSet manifest

A minimal DaemonSet spec has the same skeleton as a Deployment minus the replica count:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true        # bind to the node's network namespace
      hostPID: true            # see all PIDs on the node
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.2
          args:
            - --path.rootfs=/host
          ports:
            - name: metrics
              containerPort: 9100
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /

The fields that matter most:

  • spec.selector is required and immutable. Once a DaemonSet is created, you cannot change which pods it manages. To switch selectors you delete and recreate the DaemonSet.
  • spec.template is the pod spec applied to every selected node. The labels in spec.template.metadata.labels must match spec.selector.matchLabels.
  • spec.template.spec.restartPolicy must be Always (the default). DaemonSets do not support Never or OnFailure.
  • hostNetwork, hostPID, hostPath volumes appear in DaemonSets far more often than in regular workloads, because the agent's whole job is to inspect or modify the node.

The DaemonSet controller also writes a spec.affinity.nodeAffinity rule into each created pod that pins it to the node it is meant for, and sets tolerations for the controller-injected taints described below. You will see those fields on the pods even if you did not write them in the template.
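
You can verify that on any running pod. Assuming the node-exporter DaemonSet from above, the injected pin looks roughly like this (the node name is whichever node the pod was created for):

kubectl get pod <node-exporter-pod> -n monitoring -o yaml
# The controller-injected half, approximately:
#   affinity:
#     nodeAffinity:
#       requiredDuringSchedulingIgnoredDuringExecution:
#         nodeSelectorTerms:
#           - matchFields:
#               - key: metadata.name
#                 operator: In
#                 values:
#                   - worker-1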

Targeting a subset of nodes with nodeSelector and affinity

By default a DaemonSet places one pod on every node. If you want only a subset, use spec.template.spec.nodeSelector for the simple case:

spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd

Now the DaemonSet runs only on nodes labelled disktype=ssd. The controller continues to manage exactly one pod per matching node and zero pods on the rest.
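
Labelling and unlabelling a node is enough to move the agent on and off it. A sketch, assuming a node named worker-3:

kubectl label node worker-3 disktype=ssd   # the controller creates a pod on worker-3
kubectl label node worker-3 disktype-      # label removed: the controller deletes the pod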

For richer selection (multi-zone targeting, "either of these instance types," exclusion logic), use nodeAffinity instead. A typical pattern for a GPU monitoring agent:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"

A DaemonSet's affinity rules combine with the rules the controller injects. You write the "which kind of nodes" half. The controller writes the "this specific node" half. Both have to match for a pod to be created.

If you need a refresher on the difference between selectors and affinity, see taints, tolerations, and node affinity.

Tolerations and the control-plane question

The DaemonSet controller automatically injects tolerations for the well-known node-condition taints, so DaemonSet pods keep running on nodes with health issues. The injected set:

  Toleration key                             Effect
  node.kubernetes.io/not-ready               NoExecute
  node.kubernetes.io/unreachable             NoExecute
  node.kubernetes.io/disk-pressure           NoSchedule
  node.kubernetes.io/memory-pressure         NoSchedule
  node.kubernetes.io/pid-pressure            NoSchedule
  node.kubernetes.io/unschedulable           NoSchedule
  node.kubernetes.io/network-unavailable     NoSchedule (only when hostNetwork: true)

This is why a log collector on a node with disk pressure keeps running. The whole point is that the agent must report on the broken state, not be evicted by it.
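
To confirm the injected set on a live pod (names here follow the node-exporter example), dump its tolerations; you should see the table above plus whatever you wrote in the template:

kubectl get pod <node-exporter-pod> -n monitoring -o jsonpath='{.spec.tolerations}'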

What the controller does not inject is a toleration for node-role.kubernetes.io/control-plane:NoSchedule. That taint is added to control-plane nodes during cluster bootstrap (kubeadm adds it explicitly; managed offerings typically hide their control-plane nodes from you entirely). A DaemonSet that should run on control-plane nodes has to tolerate the taint itself:

spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

Whether you want this depends on what the agent does. A log collector or monitoring agent usually does need control-plane logs and metrics, so the toleration is correct. A workload-specific agent (a database backup helper, for example) usually has no business on the control plane.

The same logic applies to any custom taint your platform team has set on dedicated node pools (GPU, spot, tenant-isolated). The DaemonSet controller does not know about those, so you add the matching tolerations explicitly.
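
For example, if the GPU pool carries a hypothetical dedicated=gpu:NoSchedule taint, the DaemonSet needs the matching toleration alongside any others:

spec:
  template:
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: gpu
          effect: NoSchedule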

Resource requests and limits matter on a DaemonSet

A common misconception is that DaemonSets are exempt from the normal resource-request discipline because "they are infrastructure." They are not. Every DaemonSet pod consumes capacity that could have gone to application workloads, and it does so on every node at once: a sloppy DaemonSet can easily eat 1-2 GB of memory and a full CPU per node before anyone notices.

Two consequences:

  • Set realistic requests so the scheduler accounts for the agent. If node-exporter is configured without requests and a heavy pod schedules onto a node, the agent can be evicted under memory pressure. That breaks the "one per node" guarantee silently.
  • Set limits to bound worst-case behaviour. A misbehaving log collector that allocates without bound on every node turns into a cluster-wide incident, not a per-node one.

A 50-node cluster with a DaemonSet using 100m CPU per pod is using 5 full CPUs of capacity continuously. That is not a rounding error, and it should appear in your capacity planning. The full set of mechanics is in Kubernetes resource requests and limits.
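
A quick way to see what the agents cost on a given node, assuming one named worker-1 (the Allocated resources section of the describe output sums requests and limits for every pod on the node, DaemonSet pods included):

kubectl describe node worker-1 | grep -A 8 'Allocated resources'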

Rolling updates: maxUnavailable and maxSurge

DaemonSets support two updateStrategy.type values:

  • RollingUpdate is the default and the one you want in nearly every case. It replaces pods node-by-node when the template changes.
  • OnDelete is a manual strategy: the controller will not touch existing pods when you change the template. New pods only adopt the new template after you delete them yourself.

RollingUpdate became the default when DaemonSet graduated to apps/v1 GA in Kubernetes 1.9. Before that, the rolling update strategy existed (since 1.6) but OnDelete was the default, which surprised a lot of teams during cluster upgrades.

The rolling update is configured under spec.updateStrategy.rollingUpdate:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # default
      maxSurge: 0         # default

maxUnavailable is how many pods can be unavailable simultaneously across the cluster during a rollout. The default is 1. Set it higher (or to a percentage like 25%) to roll faster on large clusters, at the cost of a wider blast radius if the new version is broken.

maxSurge is the newer of the two, stable since Kubernetes 1.25. It lets the controller temporarily run two pods on the same node during a rollout, so the new pod is up before the old one is gone. The default is 0 (the historical behaviour). Set it to 1 or a small percentage when you need zero-downtime rollouts on a system that genuinely cannot afford a gap.

Three things to know about maxSurge:

  • It cannot be used together with hostPort. Two pods on the same node would conflict on the host port. The Kubernetes API rejects the combination.
  • maxSurge and maxUnavailable cannot both be 0. At least one of them has to be non-zero or no rollout can make progress.
  • Surging consumes node capacity for the duration of the rollout. On packed nodes the new pod may go Pending until the old one terminates, which negates the benefit.
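
The standard rollout tooling works on DaemonSets the same way it does on Deployments, which matters when a broken agent version needs rolling back fast:

kubectl rollout status daemonset/<name> -n <namespace>    # watch node-by-node progress
kubectl rollout history daemonset/<name> -n <namespace>   # revisions are stored as ControllerRevisions
kubectl rollout undo daemonset/<name> -n <namespace>      # revert to the previous template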

For the underlying scheduling story (why DaemonSets used to be scheduled by the controller and now go through the default scheduler), the ScheduleDaemonSetPods feature gate went alpha in 1.11, beta-by-default in 1.12, GA in 1.17, and the gate was removed in 1.18. Modern clusters always go through the kube-scheduler, which is why DaemonSet pods now respect priority, preemption, and resource fit the same way other pods do.

Pod Security Standards for privileged DaemonSets

A lot of DaemonSets are privileged in some way: they mount hostPath, set hostNetwork: true, add capabilities, or run as root. That puts them in tension with the Pod Security Standards, which Kubernetes ships as built-in admission policies.

The three levels:

  • Restricted is the strictest. No host namespaces, no privileged containers, no hostPath. A typical CNI agent or node-exporter cannot run under Restricted at all.
  • Baseline prevents known privilege escalations but allows enough flexibility that some monitoring agents can run under it.
  • Privileged is unrestricted. System-level DaemonSets that need host access usually run here.

The right pattern is to put system DaemonSets in their own namespace (kube-system, monitoring, cilium-system) labelled with pod-security.kubernetes.io/enforce: privileged, and keep application namespaces at baseline or restricted. That way the privileged behaviour stays scoped to where it belongs and application teams cannot create privileged pods by accident. For the deeper picture see Kubernetes Pod Security Standards.
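
A minimal sketch of that namespace setup; the enforce label is what actually blocks pods, while audit and warn are optional but useful:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged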

Debugging a DaemonSet (why a node has no pod)

The most common DaemonSet bug is "I expected a pod on every node but node worker-3 has none." The diagnostic loop:

# How many pods are desired vs scheduled vs ready
kubectl get daemonset <name> -n <namespace>

# Which nodes have pods
kubectl get pods -n <namespace> -l app=<name> -o wide

# Why is this node missing
kubectl describe daemonset <name> -n <namespace>

The kubectl describe output shows events from the controller, including messages like 0/N nodes are available: 1 node(s) didn't match Pod's node affinity/selector or 1 node(s) had untolerated taint. Match the message to one of the four root causes:

  1. The node has a taint the DaemonSet does not tolerate. A custom taint on a GPU pool, a tenant-isolation taint, or a control-plane taint with a DaemonSet that does not tolerate it. Add the toleration.
  2. The node label does not match the nodeSelector or nodeAffinity. Verify with kubectl get nodes --show-labels and compare against the DaemonSet's selector. Either label the node or relax the selector.
  3. The node is at capacity. If the DaemonSet has resource requests and the node has no room, the pod stays Pending. Look for 0/N nodes are available: ... Insufficient cpu/memory. The fix is usually to drop another workload from that node, not to lower the agent's requests below what it needs to do its job.
  4. A hostPort collision. If two DaemonSets both bind to host port 9100, only one wins per node. The other stays Pending with node(s) didn't have free ports.
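
The node-side checks for causes 1 and 2, using the worker-3 example:

# Cause 1: what taints does the node carry?
kubectl describe node worker-3 | grep -i taints

# Cause 2: do its labels match the DaemonSet's selector?
kubectl get node worker-3 --show-labels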

The pod Pending troubleshooting guide covers the scheduler-side diagnostics in more depth, and the same techniques apply to DaemonSet pods because they go through the default scheduler like everything else.

What a DaemonSet is NOT

This is the section most teams actually need. The four misconceptions I see in production:

A DaemonSet is not "for monitoring." It is for any workload whose unit of replication is the node, not the request. Monitoring is the most common case, but storage drivers, CNI agents, security scanners, and per-node admission helpers are equally valid. Conversely, building a "metrics aggregator" as a DaemonSet because monitoring fits there is wrong: aggregators should be Deployments, because they handle traffic, not nodes.

A DaemonSet does not inherently skip the control plane. It skips control-plane nodes only because those nodes carry a node-role.kubernetes.io/control-plane:NoSchedule taint that the DaemonSet controller does not auto-tolerate. If your DaemonSet adds the toleration explicitly, it will run there. So "DaemonSets never touch control-plane nodes" is wrong: the exclusion is a taint convention you can opt out of with a single toleration, not a property of the controller.

A DaemonSet is not exempt from resource accounting. The pods consume CPU, memory, and ephemeral storage like any others. The scheduler enforces the same fit logic. A DaemonSet without resource requests still gets QoS class BestEffort and is the first thing evicted under node pressure, which silently breaks the "one per node" promise.

DaemonSet pods do not scale independently. You do not set replicas, and a Horizontal Pod Autoscaler cannot target a DaemonSet because it exposes no scale subresource. The pod count is whatever the matching node count is, and the only way to "scale" a DaemonSet is to add or remove matching nodes. If you want per-node count flexibility, you actually want a Deployment with pod anti-affinity or topology spread constraints, not a DaemonSet, as sketched below.
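
A minimal sketch of that alternative: a Deployment with required pod anti-affinity on the hostname topology, so no two replicas share a node while the replica count stays independent of node count (all names here are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-agent
spec:
  replicas: 3                      # scaled independently of node count
  selector:
    matchLabels:
      app: edge-agent
  template:
    metadata:
      labels:
        app: edge-agent
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: edge-agent
              topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
        - name: agent
          image: registry.example.com/edge-agent:1.0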

Where to go next

  • For the full mechanics of node selection and tolerations as they apply to any workload (DaemonSet, Deployment, StatefulSet), see taints, tolerations, and node affinity.
  • For a complete worked example of a production-grade DaemonSet, the cluster logging tutorial with Fluent Bit walks through RBAC, host-path mounts, control-plane tolerations, and resource limits in context.
  • For the contrast that makes DaemonSet design clearer, the StatefulSets article covers the other "non-Deployment" workload controller and where its identity guarantees fit.

