Kubernetes StatefulSets: when pod identity and persistent storage matter

A Deployment treats every pod as interchangeable. A StatefulSet does the opposite: it assigns each pod a stable name, a stable hostname, and its own persistent volume. That distinction is what makes it possible to run databases, message brokers, and consensus-based systems on Kubernetes. This article explains the guarantees a StatefulSet provides, when you need them, and when you do not.

What a StatefulSet is

A StatefulSet is a workload controller (apiVersion: apps/v1, kind: StatefulSet) that manages pods with a sticky identity. Each pod gets an ordinal integer (0, 1, 2, ...), a stable hostname derived from that ordinal, and optionally its own PersistentVolumeClaim. These identities survive restarts, rescheduling, and node migrations.
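
A minimal manifest sketch (the names, image, and label are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web        # must match an existing headless Service (see below)
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27    # illustrative image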

The four guarantees:

  • Stable network identity. Pod postgres-0 is always postgres-0, regardless of which node it runs on.
  • Stable persistent storage. Each pod gets its own PVC, reattached automatically after rescheduling.
  • Ordered deployment and scaling. Pods are created sequentially (0, then 1, then 2) and deleted in reverse order.
  • Ordered rolling updates. Updates proceed from the highest ordinal down, so replicas update before the primary.

Any workload where the identity of individual instances matters (not just the count) is a candidate for a StatefulSet.

StatefulSet vs Deployment

Feature              StatefulSet                                         Deployment
Pod naming           Ordinal: kafka-0, kafka-1                           Random hash: nginx-7df9f9cdd8-xkj2b
Storage              Per-pod PVC via volumeClaimTemplates                Shared PVC or none
DNS                  Per-pod FQDN via headless Service                   No per-pod DNS
Scaling order        Sequential; reverse-order deletion                  Parallel, unordered
Interchangeability   Non-interchangeable; each may have a distinct role  Fully interchangeable

Use a Deployment when pods are identical, carry no individual state, and any replica can handle any request: REST API servers, frontend containers, queue consumers.

Use a StatefulSet when pods need a stable identity for peer communication, leader election, or replication: PostgreSQL primary/replica clusters, Kafka brokers, ZooKeeper ensembles, etcd members, Elasticsearch nodes, Redis Cluster.

Running a StatefulSet for a stateless workload adds overhead (sequential scaling, required headless Service) with no benefit. If your pods are interchangeable, a Deployment is simpler and scales faster.

Stable pod identity

For a StatefulSet named web with replicas: 3, Kubernetes creates web-0, web-1, and web-2. Each pod's hostname matches its name. If web-1 is rescheduled to a different node, it is still web-1.

This matters because distributed systems depend on known member addresses. Kafka uses the broker ID for partition assignment and replication; a broker cannot be anonymous. etcd members bootstrap by listing all peer addresses. PostgreSQL streaming replication connects to a known primary hostname. In each case, a changing hostname breaks the application.

As of Kubernetes v1.31, .spec.ordinals.start is stable and lets you choose a custom starting ordinal, numbering pods 5, 6, 7 instead of 0, 1, 2.
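
A sketch of the field (the values are illustrative):

spec:
  ordinals:
    start: 5   # pods become web-5, web-6, web-7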

The headless Service requirement

A standard Service with a ClusterIP load-balances across all backing pods behind a single VIP. That is wrong for a database replica set: you cannot send writes to "any replica." You need to address the primary by name.

A headless Service (clusterIP: None) skips the VIP and creates individual DNS records per pod instead. For a StatefulSet named postgres with a headless Service named postgres in namespace default, CoreDNS resolves:

  • postgres-0.postgres.default.svc.cluster.local
  • postgres-1.postgres.default.svc.cluster.local
  • postgres-2.postgres.default.svc.cluster.local

The pattern is $(pod-name).$(service-name).$(namespace).svc.cluster.local, as documented in DNS for Services and Pods.

One common mistake: the StatefulSet controller does not create the headless Service for you. It must exist before the StatefulSet, and the StatefulSet's .spec.serviceName must reference it. If the Service is missing, pods run but get no stable DNS entries, and peer-to-peer communication fails silently.
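
A sketch of the pairing, using the postgres names from above:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None        # headless: per-pod DNS records instead of a VIP
  selector:
    app: postgres
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres  # must reference the headless Service above
  replicas: 3
  # selector and pod template as usual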

volumeClaimTemplates: per-pod storage

In a Deployment, replicas that share a PVC all read and write the same data. Two PostgreSQL instances writing to the same data directory would corrupt it immediately.

A StatefulSet's volumeClaimTemplates provision an individual PersistentVolumeClaim per pod. For a StatefulSet named postgres with a template named data and 3 replicas, Kubernetes creates PVCs data-postgres-0, data-postgres-1, and data-postgres-2. Each pod mounts only its own.

When a pod is deleted and recreated (node failure, rolling update), the controller recreates it with the same ordinal and reattaches the same PVC. Data survives.

Scaling down does not delete PVCs by default. This is intentional: data safety over automatic cleanup. Scaling back up reattaches the existing PVCs. A persistentVolumeClaimRetentionPolicy field controls this behavior (beta in Kubernetes v1.27, stable in v1.32):

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # delete PVCs when StatefulSet is deleted
    whenScaled: Retain    # keep PVCs when scaling down (default)

For access modes, the official docs recommend ReadWriteOncePod over ReadWriteOnce where the CSI driver supports it. The older mode restricts attachment per node, not per pod: two pods scheduled onto the same node can mount the volume simultaneously, which for databases is a data-corruption risk.
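
Putting per-pod storage and the access-mode recommendation together, a sketch of the template (the storage class name and size are assumptions):

spec:
  volumeClaimTemplates:
    - metadata:
        name: data                         # yields PVCs data-postgres-0, data-postgres-1, ...
      spec:
        accessModes: ["ReadWriteOncePod"]  # single-pod attachment; needs CSI driver support
        storageClassName: fast-ssd         # assumed storage class
        resources:
          requests:
            storage: 10Gi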

Ordered deployment and scaling

With the default podManagementPolicy: OrderedReady, the controller creates pods sequentially: postgres-1 only starts after postgres-0 is Running and Ready. Scale-down deletes in reverse: postgres-2 first, then postgres-1, then postgres-0.

This protects initialization-dependent systems. A ZooKeeper ensemble member needs its leader available before joining. A database replica needs the primary running before attempting replication.

The trade-off: if a pod enters a crash loop or fails its readiness probe, the controller stalls entirely. No subsequent pods are created, deleted, or updated until the stuck pod is fixed or manually deleted. This is a known production risk and the most common operational surprise with StatefulSets.

For applications that handle their own initialization (via init containers or leader election), set podManagementPolicy: Parallel to create and delete all pods simultaneously. Identity and storage guarantees still apply; only the ordering is relaxed.
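
The field sits at the top level of the StatefulSet spec:

spec:
  podManagementPolicy: Parallel   # default is OrderedReady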

Update strategies

RollingUpdate (default): when you change .spec.template, the controller updates pods in reverse ordinal order. postgres-2 updates first; postgres-0 (typically the primary) updates last, minimizing downtime in leader-based systems.

The partition field enables staged rollouts. With partition: 2 and 3 replicas, only postgres-2 gets the new template. Pods 0 and 1 stay on the old version. Lower the partition value to expand the blast radius: a canary deployment built into the controller. maxUnavailable (alpha since v1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many pods can be down simultaneously during the update.
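
A sketch of a staged rollout with 3 replicas:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only pods with ordinal >= 2 receive the new template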

OnDelete: the controller does not touch pods automatically. Updates only happen when you manually delete a pod. The replacement picks up the new template. This is common in operator-managed workloads where the operator orchestrates its own update sequence.

Running databases in Kubernetes: trade-offs

This is the question behind most StatefulSet interest. The short answer: it is viable, but requires more operational maturity than running stateless workloads.

In favor: unified management plane (same RBAC, monitoring, GitOps pipelines), cost savings compared to cloud DBaaS, and multi-cloud portability. The ecosystem has matured: CloudNativePG is a CNCF project, Strimzi manages production Kafka at scale, and CrunchyData PGO has been production-proven since 2017.

Against: Kubernetes was designed around ephemeral workloads, and pods can be evicted at any time. Network-attached storage adds latency compared to local SSDs. Troubleshooting requires dual expertise in both Kubernetes internals and database internals. And volume snapshots are crash-consistent at best, not application-consistent, so database-native backup tools remain necessary.

For production, raw StatefulSets are usually not enough. They provide the primitives (identity, storage, ordering) but not the domain-specific logic (automated failover, backup scheduling, replica promotion). That is what a Kubernetes operator adds. For PostgreSQL, CloudNativePG is the most widely adopted operator as of 2025. Some operators, like CloudNativePG, deliberately avoid using raw StatefulSets and implement custom pod management to overcome limitations like inflexible volume resizing.

A practical split: use raw StatefulSets (via community Helm charts such as Bitnami's) for development and staging. Use a mature operator for production. Evaluate managed cloud databases if your team lacks deep expertise in both Kubernetes and the specific database.

What StatefulSets are NOT

  • Not a database operator. A StatefulSet gives you stable identity and storage. It does not understand replication lag, quorum, backup schedules, or failover. That logic lives in the operator layer above it.
  • Not required for all persistent storage. A Deployment with a ReadWriteMany PVC works fine when all pods share the same data (a CMS media directory, a shared configuration file). StatefulSets are for workloads that need per-pod storage isolation.
  • Not a DaemonSet. A DaemonSet places exactly one pod per node. A StatefulSet places a fixed number of pods across the cluster. Both produce "one pod per thing," but the thing differs: DaemonSet is per-node, StatefulSet is per-ordinal.
  • Not a substitute for application-level HA. Kubernetes can restart a failed pod. It cannot promote a database replica to primary, re-balance Kafka partitions, or rejoin an etcd member to its cluster. Application-level health management is separate.

