Why most Kubernetes backup strategies don't survive real disaster recovery

Most production Kubernetes clusters have a backup strategy that has never been exercised against the disaster it is meant to handle. What separates the teams who survive a DR event from the teams who don't is operational discipline, not tooling choice.

Most production Kubernetes clusters in 2026 have a backup strategy: Velero schedules fire on time, etcd snapshots land in object storage, CSI volume snapshots tick over in the background. Whether any of it works when the disaster actually arrives is a separate question, and one most teams cannot answer.

The pattern repeats across organisations: backups are scheduled, retention policies are set, dashboards report green. Then etcd corrupts during an upgrade, a namespace gets deleted against the wrong kube context, or ransomware hits a stateful workload, and the restore produces partial failures, PVCs with mangled names, finalizers that refuse to release, or a working cluster missing the data that mattered.

This piece is not a tool review. The tooling is, by and large, fine. The gap is operational discipline: most teams cannot tell you when they last successfully restored from a backup, what classes of failure their current setup actually covers, or how the strategy holds up when GitOps is reconciling on top of the restore. That gap is what turns a routine outage into a multi-day incident.

TL;DR

  • Backups that are never restored from are theatre. The Veeam 2026 Data Trust and Resilience Report found only 28% of ransomware victims fully recovered their data, despite 90% being confident they could.
  • Velero in 2026 has known restore failure modes around StatefulSets, CRDs with finalizers, operator-managed workloads, and GitOps reconciliation. Most are documented, none are surprises.
  • etcd encryption-at-rest key rotation makes older snapshots unrestorable. Teams discover this during the restore, not before.
  • CSI volume snapshots are crash-consistent, region-local, and tied to the volume's lifecycle. They are a useful primitive, not a backup.
  • The right question is not "do we have backups," it is "when did we last successfully restore from one and prove the workload came back consistent."

The three disasters Kubernetes backup needs to handle

Backup planning collapses if you treat "disaster" as one thing. In practice, Kubernetes clusters face three quite different failure classes, and the tooling for each is different.

Cluster-wide corruption. etcd corrupts during an upgrade. A control plane node fails in a way that takes the cluster state with it. A cloud provider has a regional outage. The recovery primitive is an etcd snapshot or a parallel cluster.

Namespace or workload-scoped data loss. A PVC gets deleted, a database is wiped by a botched migration, an application corrupts its own data. Recovery is at the object and volume level: Velero restores the namespace, the PVCs get rebound, the application restarts.

Accidental destructive operations. An operator runs kubectl delete -f against the wrong context. A pipeline applies a manifest that removes resources it should not. A helm uninstall takes out more than expected. The disaster looks identical to data loss, but the failure mode is human, and the response window is minutes, not hours.

Each class needs a different recovery model. Teams that buy one tool and assume it covers all three end up surprised by the disaster the tool was never designed to handle.

Velero in 2026: what actually works

Velero is the dominant open-source Kubernetes backup tool and the basis for several commercial offerings. Its governance position changed in March 2026, when Broadcom donated Velero to the CNCF as a Sandbox project, making it formally vendor-neutral for the first time since the Heptio days. The latest stable release is v1.18.0 (March 2, 2026), bringing concurrent backup processing and data mover cache volumes; the long-deprecated restic path is on its way out and is fully removed in v1.19.

For straightforward scenarios, Velero does what it says on the tin. Backing up a stateless namespace, capturing the Kubernetes objects with their ConfigMaps and Secrets, snapshotting attached PVCs through the CSI driver, and storing the result in object storage all work as advertised. Restoring a deleted namespace into the same cluster, or migrating a non-stateful workload across clusters with the AWS or Azure plugin, are well-trodden paths.
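
For reference, that happy path is a handful of CLI calls. A minimal sketch, with the namespace and backup names as placeholders:

    # Back up one namespace, including CSI snapshots of its PVCs
    velero backup create my-app-backup --include-namespaces my-app --wait

    # Inspect what was actually captured, not just the top-level phase
    velero backup describe my-app-backup --details

    # Restore it, into the same cluster or one pointed at the same object store
    velero restore create --from-backup my-app-backup --wait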

The trouble starts when the workload has any of the properties that production workloads actually have: persistent state owned by an operator, StatefulSet ordering requirements, CRDs with finalizers, or GitOps reconciliation. Almost every interesting workload has at least one of these.

Where Velero restoration breaks

Velero's GitHub tracker is a useful read for anyone planning a DR strategy. A few of the recurring patterns:

PVC name mangling on restore. Issue #9401 (November 2025): after restore, PVCs are created with auto-generated names like velero-viya-restore-fs-j6vcn-pvc-dc8f85ba-… instead of their original names. Workloads that bind PVCs by name (Deployments referencing a named PVC, Helm-managed releases, anything with a stable claim) break on restore. The data is there. The binding is not.

DataMover stuck waiting on plugin operations. Issue #6813: restores complete only a fraction of the operations for StatefulSet workloads, leaving the restore in WaitingForPluginOperations indefinitely. The dashboard reports progress; the workload never comes up.

CRDs with finalizers hanging. Issue #7207: a connection blip during restore leaves the Restore CR stuck "In Progress" forever because Velero failed to retry the update. The restore did not fail. It never finished.

Operator-managed databases. CloudNativePG with Velero supports snapshot recovery only, with no point-in-time recovery available. If you are running WordPress on Kubernetes or any other database-backed application via an operator, the operator's idea of a backup is not the same as Velero's. After a Velero restore, the operator reconciles its state and may overwrite the freshly restored objects because its control loop does not know a restore happened.

GitOps drift after restore. This is the one most teams discover the hard way. Velero finishes restoring the namespace. Flux or Argo CD is still running. The GitOps controller sees the cluster state diverging from Git and reconciles, overwriting the Velero restore within seconds. The official guidance is to suspend reconciliation explicitly: flux suspend kustomization before the restore, or set argocd.argoproj.io/skip-reconcile: "true" on the affected Applications. Teams that forget this step watch the restored state evaporate in real time.
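
A restore runbook for a GitOps-managed namespace therefore brackets the restore with an explicit suspend and resume. A rough sketch, with Flux first and the Argo CD annotation from the guidance above; all names are placeholders:

    # Stop Flux reconciling the workload before the restore starts
    flux suspend kustomization my-app -n flux-system

    # Argo CD equivalent: mark the Application so the controller skips it
    kubectl -n argocd annotate application my-app \
      argocd.argoproj.io/skip-reconcile="true" --overwrite

    velero restore create --from-backup my-app-backup --wait

    # Resume only after the restored state has been verified
    flux resume kustomization my-app -n flux-system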

None of these are bugs in the strict sense. They are predictable interactions between Velero and the workload patterns most teams run. They show up in drills. They do not show up on the backup dashboard.

The etcd snapshot lifecycle most teams get wrong

etcd snapshots are a control-plane disaster recovery primitive. The mechanics are simple: etcdctl snapshot save writes the state, you ship it to object storage, you can restore against a fresh etcd member. The Kubernetes project documents the procedure clearly.
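
On a kubeadm-style control plane the whole lifecycle is a few commands. A sketch, with the certificate paths and bucket as assumptions that vary by distribution:

    # Take a snapshot from a healthy etcd member
    ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    # Ship it somewhere that does not share the control plane's fate
    aws s3 cp /var/backups/etcd-snapshot.db s3://my-dr-bucket/etcd/

    # Restore into a fresh data directory for a rebuilt etcd member
    etcdutl snapshot restore /var/backups/etcd-snapshot.db --data-dir /var/lib/etcd-restored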

The lifecycle most teams get wrong is encryption-at-rest key rotation. If your cluster encrypts secrets at rest with a KMS provider, the secrets in the etcd snapshot are encrypted with whatever key was primary at snapshot time. Rotate the KMS key, then try later to restore that snapshot against a cluster using the new key configuration, and the API server cannot decrypt the secrets. From the Rancher RKE documentation:

The snapshot is taken before the keys are rotated and restore is attempted after. In this case, the old keys used for encryption at the time of the snapshot no longer exist in the cluster state file.

Kubernetes 1.29 made KMS v2 generally available, which improves the encryption pipeline but does not change the underlying truth: if you rotate the KEK and lose the prior key material, snapshots taken before the rotation become only partially restorable. The cluster boots, but secrets are unreadable, and the workloads that depend on them fail in subtle ways. Kubernetes's own guidance is blunt: if a resource cannot be decrypted because keys were changed, "your only recourse is to delete that entry from the underlying etcd directly."

The fix is not complicated. Take a fresh snapshot immediately after any change to the encryption configuration, never delete the prior key material before re-encrypting all secrets, and document which key generation each snapshot corresponds to. Teams that skip the documentation step are the teams that discover the problem during the restore.
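
The re-encryption step itself is small. After the new key is first in the provider list and before the old key is dropped from the configuration, rewriting every secret forces re-encryption with the current primary key, and a fresh snapshot then matches the new key generation:

    # Rewrite all secrets so they are re-encrypted with the new primary key
    kubectl get secrets --all-namespaces -o json | kubectl replace -f -

    # Then capture a snapshot that corresponds to the new key generation
    # (endpoint and certificate flags as in the earlier snippet)
    ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-post-rotation.db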

CSI snapshots are not backups

VolumeSnapshot has been GA since Kubernetes 1.20 and is the default primitive for volume-level recovery. It is also routinely treated as a backup, which it is not.

Four properties make CSI snapshots unfit as a standalone backup strategy.

Crash-consistent, not application-consistent. A snapshot taken while the database is running captures whatever was written to disk plus whatever was in the kernel page cache. The engine recovers on restart, sometimes successfully. For Postgres, MySQL, or any other database with write-ahead logging, a snapshot taken without pg_start_backup() or equivalent quiescing is a coin flip.

Same failure domain as the source. AWS EBS snapshots are stored in the same region as the source volume unless explicitly copied across regions. The same is true for GCP Persistent Disk snapshots and Azure Disk snapshots. A regional outage takes out your source volume and your snapshots together. Portworx puts it bluntly in its own documentation: "Since snapshot data is stored in the same place as the original data, snapshots are no substitute for a backup."

Tied to the volume's lifecycle. In some configurations, deleting the source volume removes the snapshots. Misconfigured retention or an accidental delete can destroy both copies at once.

No Kubernetes object state. A CSI snapshot captures volume bytes. It does not capture Deployments, Services, ConfigMaps, Secrets, RBAC, Ingress configuration, or anything else that turns a PV into a running workload. Restoring "the data" without the surrounding objects gets you a blank cluster with a populated disk.

Snapshots are a fast, cheap primitive. Treat them as the first step in a backup, not the whole thing.
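
When Velero is the thing triggering the CSI snapshot, its backup hooks are one way to narrow the crash-consistency gap described above: quiesce the application just before the snapshot and thaw it just after. A sketch using pod annotations; the namespace, label, container name, and mount path are illustrative, and for databases a logical dump or operator-native backup remains the safer layer:

    # Freeze the data filesystem before the snapshot, unfreeze afterwards
    kubectl -n my-app annotate pod -l app=my-db --overwrite \
      pre.hook.backup.velero.io/container=db \
      pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/lib/data"]' \
      post.hook.backup.velero.io/container=db \
      post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/lib/data"]'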

The wrong-context disaster

The most common disaster I see referenced in DevOps incident retrospectives is not infrastructure failure. It is kubectl delete -f or helm uninstall run against a context the operator believed was a sandbox. The entire ecosystem of context-isolation tools (kubie, kubert, the various kubectx hardening forks) exists because traditional kubectx makes the context change global. Run kubectx prod in one terminal, switch to another window an hour later, type kubectl delete -f manifest.yaml, and you have produced a production outage from a workflow the operator considered safe.

The tooling fix is straightforward: tools like kubie spawn a subshell per context so the change is scoped to that shell. The procedural fix matters more. Prompt indicators that show the current context, mandatory confirmation for destructive operations against production contexts, and policy-as-code admission rules that reject cluster-wide deletes from non-platform service accounts all reduce the blast radius.
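
One cheap procedural guard is a shell wrapper that demands confirmation before destructive verbs run against anything that looks like production. A minimal sketch for bash, assuming production contexts contain "prod" in their names:

    # Add to ~/.bashrc: intercept destructive verbs against prod-looking contexts
    kubectl() {
      local ctx verb="$1"
      ctx=$(command kubectl config current-context 2>/dev/null)
      case "$verb" in
        delete|drain|replace|scale)
          if [[ "$ctx" == *prod* ]]; then
            read -r -p "Context is '$ctx'. Really run 'kubectl $*'? [y/N] " answer
            [[ "$answer" == "y" ]] || return 1
          fi
          ;;
      esac
      command kubectl "$@"
    }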

When the wrong-context disaster does land, the recovery is usually a Velero restore against the affected namespace, plus a re-reconciliation of GitOps state. That is exactly the path the Velero section above described as failing in interesting ways, which is why this anti-pattern matters disproportionately for DR planning: the disaster you are most likely to cause yourself is the one your tooling is least good at recovering from cleanly.

The drill cadence that actually catches problems

Industry data on actually-tested backups is unflattering. The Veeam 2026 report found 90% of organizations were confident they could recover from a cyber incident, while only 28% of ransomware victims actually recovered fully. The Unitrends 2025 State of Backup and Recovery Report put the same gap in restoration terms: 60% of respondents thought they could recover in under a day; only 35% actually could.

The cadence that catches the kind of problems described above is layered, and none of the layers are optional.

Weekly automated canary restore. Pick a representative workload (a stateful database, a stateless API, an operator-managed application). On a schedule, restore it into a parallel namespace and run a validation script that checks data integrity and reachability. This catches Velero version regressions, plugin issues, and silent backup corruption before they compound. A minimal sketch of this drill follows at the end of this section.

Monthly namespace-scoped drill. Pick one namespace at random. Restore it against a parallel cluster. Time the restore. Measure how long it took to identify and resolve any problems. This catches the operator and GitOps interactions that the automated single-workload restore misses.

Quarterly full-cluster drill. etcd snapshot restore into a parallel cluster, including a deliberate encryption-configuration reset. This is the only drill that catches the KMS rotation problem before it bites in production.

Teams that run all three usually find at least one broken assumption per quarter. Teams that run none of them find their broken assumptions during the actual incident, when the cost is in customer-visible downtime instead of an afternoon of debugging.
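
The weekly canary needs very little machinery. A sketch of a cron-driven script, assuming a Velero schedule named orders-daily, a namespace called orders, and a Postgres-backed workload; every name and the validation query are placeholders for whatever your representative workload needs:

    #!/usr/bin/env bash
    set -euo pipefail

    # Find the newest backup produced by the orders-daily schedule
    backup=$(kubectl -n velero get backups.velero.io \
      -l velero.io/schedule-name=orders-daily \
      --sort-by=.metadata.creationTimestamp -o name | tail -n 1 | cut -d/ -f2)

    # Restore it into a scratch namespace alongside the live one
    velero restore create "canary-$(date +%Y%m%d)" \
      --from-backup "$backup" \
      --namespace-mappings orders:orders-canary \
      --wait

    # Prove the workload comes up and the data is intact, not just present
    kubectl -n orders-canary rollout status deploy/orders --timeout=5m
    kubectl -n orders-canary exec deploy/orders-db -- \
      psql -U app -d orders -c "SELECT count(*) FROM orders;"

    # Tear down so next week's run starts from a clean slate
    kubectl delete namespace orders-canary --wait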

When a standby cluster is cheaper than perfect backups

For some workloads, the operationally cheapest answer is to skip the sophisticated backup-and-restore tooling and run a warm or active-active standby instead. The AWS Well-Architected Reliability Pillar lays out the trade-offs cleanly:

Strategy                   RPO          RTO                Continuous cost
Backup and restore         Hours        Up to 24 hours     Lowest
Pilot light                Minutes      Tens of minutes    Low
Warm standby               Seconds      Minutes            Medium
Multi-site active-active   Near zero    Near zero          Highest

The cost framing is misleading on its own. A warm standby requires you to pay for compute and storage continuously. A sophisticated backup-and-restore strategy requires you to pay for the engineering effort to make it work, the drills to prove it works, and the recovery time when an incident hits. For mission-critical stateful workloads, the second bill is often larger than the first, especially once the drills start surfacing failures.

Multi-cluster tooling in 2026 makes the standby model more accessible than it was. Karmada (CNCF) supports active-active and remote DR policies natively, and AWS publishes guidance for Karmada with EKS covering automatic failover between clusters. For multi-tenant clusters where Velero's per-tenant restore model is awkward, the standby pattern can be operationally simpler.
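
The wiring on the Karmada side is declarative: a PropagationPolicy says which resources should run on which member clusters, and placement (and, with the right policy, failover) is handled from there. A minimal sketch, assuming two registered member clusters and the default karmada-apiserver context name from the quickstart:

    # Applied against the Karmada control plane, not a member cluster
    kubectl --context karmada-apiserver apply -f - <<'EOF'
    apiVersion: policy.karmada.io/v1alpha1
    kind: PropagationPolicy
    metadata:
      name: orders-propagation
    spec:
      resourceSelectors:
        - apiVersion: apps/v1
          kind: Deployment
          name: orders
      placement:
        clusterAffinity:
          clusterNames:
            - member1
            - member2
    EOF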

The decision is rarely "one or the other." A reasonable production setup uses Velero for namespace and workload-scoped recovery, etcd snapshots for control-plane corruption, and a warm or active-active standby for the workloads where minutes of downtime are not acceptable. The honest move is to be precise about which workload sits in which category, rather than asserting that one tool covers everything.

Key takeaways

  • Schedule-green is not restore-green. A backup that has not been restored from in the last quarter is a hypothesis, not a recovery plan.
  • Velero v1.18 is solid for what it does, but PVC name mangling (#9401), CRD finalizer hangs (#7207), and the operator-restore conflict are predictable failure modes. Suspend GitOps reconciliation before any restore.
  • etcd snapshots become unrestorable if you rotate encryption-at-rest keys without re-encrypting secrets first. Take a fresh snapshot after every encryption change.
  • CSI snapshots are crash-consistent, region-local, and lifecycle-tied. They are a primitive, not a backup.
  • The drill cadence that catches problems is weekly canary, monthly namespace, quarterly full cluster. Teams that skip the cadence find their problems during the incident.
  • For workloads where minutes of downtime are unacceptable, a warm or active-active standby is often operationally cheaper than a perfect restore story.

Recurring server or deployment issues?

I help teams make production reliable with CI/CD, Kubernetes, and cloud—so fixes stick and deploys stop being stressful.

Explore DevOps consultancy
