Kubernetes backup and restore with Velero: application-level disaster recovery

Velero gives Kubernetes operators application-level backup and restore: namespace-scoped protection, scheduled backups to object storage, volume snapshots, and cross-cluster migration. This tutorial walks through the full lifecycle, from installing Velero and taking your first backup to scheduled retention policies, database consistency hooks, restore operations, and production monitoring.

What you will learn

By the end of this tutorial you will be able to install Velero, back up Kubernetes namespaces and persistent volumes to object storage, schedule those backups with tiered retention, add pre-backup hooks so database volumes are application-consistent, restore a namespace into the same cluster or a different one, and set up Prometheus monitoring for backup health.

Assumed starting point

This tutorial assumes a running Kubernetes cluster that you can administer with kubectl, the velero CLI installed on your workstation, and access to an S3-compatible object storage bucket.

If you run a managed cluster (EKS, GKE, AKS), you already have an object storage service available. For self-managed clusters, MinIO works as an S3-compatible target inside or outside the cluster.

How Velero fits into the Kubernetes backup picture

Velero (formerly Heptio Ark) is an open-source tool maintained by VMware Tanzu that backs up and restores Kubernetes API objects and persistent volume data. It queries the Kubernetes API server for resource definitions and uploads them as tarballs to object storage. It does not touch etcd directly.

That distinction matters. An etcd snapshot captures the entire control-plane state in one binary blob: every Deployment, Secret, RBAC rule, and CRD. It is the right tool for pre-upgrade rollback or full control-plane recovery. But an etcd snapshot cannot restore a single deleted namespace, cannot filter by label, and cannot move workloads from one cluster to another. On managed clusters (EKS, GKE, AKS), you do not have etcd access at all.

Velero fills that gap. Its backup operates at the API level, which means you can back up individual namespaces, filter by resource type or label, exclude stale status fields, and restore into a completely different cluster. The two tools are complementary, not interchangeable.
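That API-level filtering looks like this in practice. A sketch, with an illustrative backup name and an assumed app=web label:

```shell
# Back up only the web tier of one namespace: filter by resource
# type and label (backup name and label are illustrative)
velero backup create web-tier-backup \
  --include-namespaces production \
  --include-resources deployments,services,configmaps \
  --selector app=web
```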

Two-part architecture. Velero runs a Deployment (the controller) and an optional DaemonSet (the Node Agent, for file-system-level volume backups) inside the cluster. The CLI on your workstation sends commands through the Kubernetes API.

Install Velero

This section shows the CLI install for AWS. The supported providers page lists plugins for GCP, Azure, vSphere, and community-maintained providers.

Step 1: create a bucket and credentials

Create an S3 bucket and an IAM user with a policy that grants s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket on that bucket. Store the access key in a credentials file:
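A minimal IAM policy for that grant might look like the following, assuming the bucket name my-velero-backups used later in this tutorial. Note that the object actions apply to the bucket contents while s3:ListBucket applies to the bucket itself:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
```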

cat > /tmp/credentials-velero <<EOF
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF

Replace the example values with your actual IAM credentials.

Step 2: install Velero with CSI support and Node Agent

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-velero-backups \
  --secret-file /tmp/credentials-velero \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --use-node-agent \
  --features=EnableCSI

--use-node-agent deploys the DaemonSet needed for file system backup and CSI snapshot data movement. --features=EnableCSI enables the CSI snapshot integration.

Expected output:

Velero is installed! ⛵ Use 'velero client config set ...' to configure the CLI.

Step 3: verify the installation

kubectl get deployments -n velero
kubectl get daemonsets -n velero

You should see the velero Deployment with 1/1 available and the node-agent DaemonSet with pods running on every schedulable node.

NAME     READY   UP-TO-DATE   AVAILABLE
velero   1/1     1            1

NAME         DESIRED   CURRENT   READY
node-agent   3         3         3

Checkpoint. If the Velero pod is not Ready, check the logs: kubectl logs deploy/velero -n velero. Common issues: wrong S3 region, invalid credentials, missing bucket.

Alternative: Helm install

For production environments managed through GitOps, Helm is the better path. Configuration lives in a values.yaml committed alongside your other manifests.

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  -f velero-values.yaml

See the Helm chart documentation for the full values.yaml reference.

Take your first backup

Step 4: back up a namespace

velero backup create first-backup \
  --include-namespaces production \
  --wait

--wait blocks until the backup completes. Without it, the command returns immediately and you poll with velero backup describe.

Expected output:

Backup request "first-backup" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c...
Backup completed with status: Completed.

Step 5: inspect the backup

velero backup describe first-backup --details

This shows which resources were captured, which volumes were snapshotted, any warnings or errors, and the hook execution results. Pay attention to the Phase (should be Completed) and the Errors count (should be 0).
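For scripted verification, the same fields are available on the Backup resource itself. A sketch, assuming the default velero install namespace:

```shell
# Fail loudly if the backup did not complete cleanly.
# status.errors may be absent when the count is zero, hence the default.
phase=$(kubectl -n velero get backup first-backup -o jsonpath='{.status.phase}')
errors=$(kubectl -n velero get backup first-backup -o jsonpath='{.status.errors}')
if [ "$phase" != "Completed" ] || [ "${errors:-0}" -ne 0 ]; then
  echo "backup unhealthy: phase=$phase errors=$errors" >&2
fi
```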

Step 6: verify backup contents

velero backup logs first-backup | head -30

The log lists every resource processed. You will see lines like backup/production/Deployment/my-app, confirming that API objects were written to the tarball. Volume snapshots appear as separate lines showing the snapshot mechanism used.

Checkpoint. You now have a working backup of one namespace in your S3 bucket. The next section adds automation.

Schedule backups with retention

One-off backups are a start, but production workloads need automated, recurring backups with defined retention windows.

Step 7: create a daily schedule

velero schedule create daily-production \
  --schedule="CRON_TZ=Europe/Amsterdam 0 2 * * *" \
  --include-namespaces production \
  --ttl 168h

This runs a backup at 02:00 Amsterdam time every day and keeps each backup for 168 hours (7 days). Velero's garbage collector deletes expired backups hourly.

Timezone support in the CRON_TZ= prefix is available since v1.18. Older versions use UTC only.
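Backups created by a schedule are named after the schedule plus a timestamp and carry the velero.io/schedule-name label, so you can list them directly:

```shell
# List all backups generated by the daily-production schedule
kubectl -n velero get backups -l velero.io/schedule-name=daily-production
```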

Step 8: verify the schedule

velero schedule get
NAME                STATUS    CREATED                       SCHEDULE                                     BACKUP TTL
daily-production    Enabled   2026-04-09 14:30:00 +0200     CRON_TZ=Europe/Amsterdam 0 2 * * *           168h0m0s

To trigger a backup from a schedule immediately (useful for testing):

velero backup create --from-schedule daily-production

Tiered retention strategy

For production, consider multiple schedules with different retention windows:

# Hourly backups, kept for 24 hours
velero schedule create hourly-production \
  --schedule="0 * * * *" \
  --include-namespaces production \
  --ttl 24h

# Weekly backups, kept for 28 days
velero schedule create weekly-production \
  --schedule="CRON_TZ=Europe/Amsterdam 0 3 * * 0" \
  --include-namespaces production \
  --ttl 672h

# Monthly backups, kept for 1 year (compliance)
velero schedule create monthly-production \
  --schedule="CRON_TZ=Europe/Amsterdam 0 4 1 * *" \
  --include-namespaces production \
  --ttl 8760h

The default TTL is 30 days when --ttl is not specified.

Volume backup mechanisms: which one to pick

Velero supports three mechanisms for backing up persistent volume data. They are mutually exclusive per volume.

Native cloud provider snapshots. Uses cloud APIs (EBS, GCE Persistent Disk, Azure Managed Disks) to create point-in-time snapshots. Fastest, lowest overhead, but snapshots are region-locked. You cannot use them for cross-region or cross-cloud migration.

CSI snapshot data movement. Creates a CSI VolumeSnapshot, then extracts the data and uploads it to your object storage bucket via Kopia. After upload, the local CSI snapshot is deleted. This is the recommended mechanism when you need durability beyond the storage system or cross-cloud portability.

velero backup create my-backup \
  --include-namespaces production \
  --snapshot-move-data

Monitor upload progress with:

kubectl -n velero get datauploads -l velero.io/backup-name=my-backup

File system backup (FSB). Reads volume data directly from running pods via the Node Agent DaemonSet. Uses Kopia for deduplication, compression, and upload. Use FSB for volumes without snapshot support: NFS, EFS, AzureFile, local volumes, emptyDir. Opt in per pod:

metadata:
  annotations:
    backup.velero.io/backup-volumes: data-volume

Or install with FSB as the default for all volumes:

velero install --use-node-agent --default-volumes-to-fs-backup
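With FSB as the default, individual volumes can still be opted out per pod using the exclusion annotation. A sketch with a hypothetical cache volume that is not worth backing up:

```yaml
metadata:
  annotations:
    backup.velero.io/backup-volumes-excludes: cache-volume
```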

Limitation. FSB reads from live PVs, so data is not captured at a single point in time. For databases, this means crash-consistent at best. Application-consistent backup requires hooks.

Kopia replaced Restic

Velero's file system backup originally used Restic as the data mover. That changed over several releases:

  • v1.10: Kopia integrated alongside Restic
  • v1.12: default uploader switched to Kopia
  • v1.15: Restic deprecated (warnings emitted)
  • v1.17: --uploader-type=restic removed for new backups

Existing Restic-created backups can still be restored in v1.17+. New backups use Kopia exclusively.

Database consistency with pre-backup hooks

Backing up a database volume without quiescing the database first produces a crash-consistent snapshot. That is roughly equivalent to pulling the power cord. The database will likely recover, but it is not guaranteed, and recovery takes time.

Velero pre-backup and post-backup hooks solve this by running commands inside pod containers before and after backup processing.

Step 9: add PostgreSQL backup hooks

Apply these annotations to your PostgreSQL pod (or the pod template in a StatefulSet):

metadata:
  annotations:
    pre.hook.backup.velero.io/command: >-
      ["/bin/bash", "-c",
      "psql -U postgres -c \"SELECT pg_backup_start('velero', true);\""]
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/timeout: 5m
    pre.hook.backup.velero.io/on-error: Fail
    post.hook.backup.velero.io/command: >-
      ["/bin/bash", "-c",
      "psql -U postgres -c \"SELECT pg_backup_stop();\""]
    post.hook.backup.velero.io/container: postgres
    post.hook.backup.velero.io/timeout: 2m
    post.hook.backup.velero.io/on-error: Continue
    backup.velero.io/backup-volumes: pgdata

on-error: Fail on the pre-hook means the backup aborts if the database cannot enter backup mode. on-error: Continue on the post-hook means the backup data is preserved even if the resume command fails.

PostgreSQL version note. pg_backup_start() / pg_backup_stop() is the current API since PostgreSQL 15. Older versions (14 and below) use pg_start_backup() / pg_stop_backup().

MySQL/MariaDB hooks

metadata:
  annotations:
    pre.hook.backup.velero.io/command: >-
      ["/bin/bash", "-c",
      "mysql -u root -p$MYSQL_ROOT_PASSWORD -e 'FLUSH TABLES WITH READ LOCK;'"]
    pre.hook.backup.velero.io/container: mysql
    pre.hook.backup.velero.io/timeout: 3m
    pre.hook.backup.velero.io/on-error: Fail
    post.hook.backup.velero.io/command: >-
      ["/bin/bash", "-c",
      "mysql -u root -p$MYSQL_ROOT_PASSWORD -e 'UNLOCK TABLES;'"]
    post.hook.backup.velero.io/container: mysql
    post.hook.backup.velero.io/timeout: 1m
    post.hook.backup.velero.io/on-error: Continue

FLUSH TABLES WITH READ LOCK blocks all writes for the duration of the backup. For high-traffic databases, back up a read replica instead of locking the primary.

Alternative: centralized hooks in the Backup spec

Instead of annotating every pod, you can define hooks centrally in the Backup or Schedule resource:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-with-hooks
  namespace: velero
spec:
  includedNamespaces:
    - production
  hooks:
    resources:
    - name: postgres-consistency
      labelSelector:
        matchLabels:
          app: postgres
      pre:
      - exec:
          container: postgres
          command:
            - /bin/bash
            - -c
            - "psql -U postgres -c \"SELECT pg_backup_start('velero', true);\""
          onError: Fail
          timeout: 5m
      post:
      - exec:
          container: postgres
          command:
            - /bin/bash
            - -c
            - "psql -U postgres -c \"SELECT pg_backup_stop();\""
          onError: Continue
          timeout: 2m

This approach is better for GitOps workflows where pod annotations may be managed by a separate team.
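Applying the manifest triggers the backup immediately, because a Backup resource is itself the backup request. The filename here is illustrative:

```shell
kubectl apply -f production-with-hooks.yaml
kubectl -n velero get backup production-with-hooks -w
```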

Step 10: verify hook execution

After a backup completes, check that hooks ran successfully:

velero backup describe first-backup --details

Look for the Hooks section. It lists each hook, the pod it ran on, and whether it succeeded or failed.
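Hook stdout and error output also land in the backup log, which helps when a hook fails without an obvious reason:

```shell
velero backup logs first-backup | grep -i hook
```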

Checkpoint. You now have scheduled, application-consistent backups of database workloads.

Restore into the same or a new cluster

Step 11: restore a deleted namespace

velero restore create --from-backup daily-production-20260409020000 \
  --include-namespaces production

Velero restores resources in dependency order: CRDs first, then Namespaces, StorageClasses, PersistentVolumes, PersistentVolumeClaims, Secrets, ConfigMaps, and finally workloads.

By default, Velero skips resources that already exist (non-destructive). To overwrite existing resources:

velero restore create --from-backup daily-production-20260409020000 \
  --existing-resource-policy update

Step 12: restore into a different namespace

velero restore create --from-backup daily-production-20260409020000 \
  --namespace-mappings production:staging

This maps every resource from the production namespace into staging. Useful for cloning production data into a test environment.
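After the restore finishes, confirm the clone before pointing anything at it:

```shell
# Check that workloads came up and PVCs were rebound in the clone
kubectl get deployments,statefulsets,pvc -n staging
```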

Step 13: cross-cluster restore (disaster recovery)

For a full cluster replacement, point a new cluster at the same backup storage location. Install Velero with the same bucket and credentials, then:

# Set storage to read-only to prevent writes during recovery
kubectl patch backupstoragelocation default \
  --namespace velero \
  --type merge \
  --patch '{"spec":{"accessMode":"ReadOnly"}}'

# Restore
velero restore create --from-backup daily-production-20260409020000

# Verify
velero restore describe <restore-name> --details
kubectl get all -n production

# Return to read-write
kubectl patch backupstoragelocation default \
  --namespace velero \
  --type merge \
  --patch '{"spec":{"accessMode":"ReadWrite"}}'

Setting the storage location to ReadOnly during recovery prevents the new cluster's Velero instance from writing (and potentially corrupting) the backup repository while the restore is in progress.

Cross-cloud note. Native cloud snapshots are region-locked and cannot be restored across providers. For cross-cloud migration, the backups must use CSI snapshot data movement (--snapshot-move-data) or file system backup. Both store volume data in the object storage bucket, which is provider-independent.

Post-restore hooks

Velero supports two types of restore hooks. InitContainer hooks inject an init container into restored pods (runs before the application starts). Exec hooks run commands in containers after the pod reaches Ready.

Example: import a database dump after restore:

metadata:
  annotations:
    post.hook.restore.velero.io/container: postgres
    post.hook.restore.velero.io/command: '["/bin/bash", "-c", "psql -U postgres < /backup/backup.sql"]'
    post.hook.restore.velero.io/exec-timeout: 120s
    post.hook.restore.velero.io/wait-for-ready: "true"
    post.hook.restore.velero.io/on-error: Fail

wait-for-ready: true makes Velero wait until the pod is Ready before executing the command. For databases that need initialization time, this is important.

Checkpoint. You can now restore namespaces into the same cluster, a different namespace, or a different cluster entirely.

Monitor backup health

A backup you never test is not a backup. Monitoring catches silent failures before they matter.

Step 14: Prometheus metrics

Velero exposes metrics on port 8085. If you run the Prometheus Operator, add a scrape config:

- job_name: velero
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    action: keep
    regex: velero
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    action: keep
    regex: "8085"

Key metrics to alert on:

  • velero_backup_failure_total increasing: a backup schedule is failing
  • velero_backup_duration_seconds spiking: storage or network problems
  • No new velero_backup_success_total increment in 25 hours: the schedule stopped running
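Those three conditions translate into alerting rules roughly like the following. A sketch assuming the Prometheus Operator; names, thresholds, and severity labels are illustrative and should match your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-alerts
  namespace: velero
spec:
  groups:
  - name: velero
    rules:
    - alert: VeleroBackupFailure
      expr: increase(velero_backup_failure_total[1h]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "A Velero backup schedule is failing"
    - alert: VeleroNoRecentBackup
      expr: increase(velero_backup_success_total[25h]) == 0
      labels:
        severity: warning
      annotations:
        summary: "No successful Velero backup in the last 25 hours"
```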

Step 15: Grafana dashboards

The community maintains several pre-built Velero dashboards for Grafana. Import one into your instance to get a visual overview of backup health, duration trends, and storage usage.

Regular restore drills

Schedule a quarterly restore drill. Restore a recent backup into a disposable namespace, verify the workloads come up, confirm the database contains expected data, then tear it down. A backup that has never been tested is a liability.
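A drill can be scripted end to end. A sketch, assuming the default velero namespace and the namespace mapping shown earlier; the drill namespace name is illustrative:

```shell
# Pick the most recently created backup
latest=$(kubectl -n velero get backups \
  --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)

# Restore it into a disposable namespace
velero restore create drill-$(date +%Y%m%d) \
  --from-backup "$latest" \
  --namespace-mappings production:restore-drill \
  --wait

# Verify workloads and data, then tear down
kubectl get pods -n restore-drill
kubectl delete namespace restore-drill
```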

Common troubleshooting

PartiallyFailed backup, repository not initialized. Check velero repo get and verify the Node Agent DaemonSet is running on all nodes. Verify S3 credentials and bucket access. Missing node-agent pods are the most common cause.

Velero pod OOMKilled. Large backups with many resources can exhaust memory. Increase --velero-pod-mem-limit during install. Velero v1.18 moved repository operations outside the server process to reduce OOM risk.

Stuck InProgress backup. Velero cannot resume interrupted backups. Delete the stuck backup (kubectl delete backup <name> -n velero) and trigger a new one.

LoadBalancer DNS changes after restore. Cloud load balancers get new UIDs after restore, which means new DNS names. Update CNAME records manually, or set spec.loadBalancerIP on the Service where the provider supports it.

Admission webhooks blocking restore. ValidatingWebhookConfigurations and MutatingWebhookConfigurations can reject or mutate resources during restore. Temporarily disable webhooks if restores fail with admission errors.

For deeper debugging:

velero backup describe <name> --details    # hook results and errors
velero backup logs <name>                  # per-resource processing log
kubectl -n velero get datauploads          # CSI data movement progress

Full troubleshooting reference on velero.io.

What you learned

This tutorial covered the full Velero lifecycle for application-level Kubernetes backup and disaster recovery:

  • Velero backs up API objects and volume data. It is complementary to etcd snapshots, not a replacement.
  • Three volume backup mechanisms exist: native cloud snapshots (fastest, region-locked), CSI snapshot data movement (durable, cross-cloud), and file system backup via Kopia (for unsupported volume types).
  • Database consistency requires explicit pre-backup hooks. Without them, volume backups are crash-consistent at best.
  • Scheduled backups with tiered TTL provide hourly, daily, weekly, and monthly retention.
  • Restores work into the same namespace, a different namespace, or a different cluster.
  • Prometheus metrics and regular restore drills make the difference between a backup strategy and a disaster waiting to happen.

Velero does not handle everything. It does not back up the control plane itself (that requires etcd snapshots), it has no multi-tenancy support (only cluster administrators can manage it), and cross-cloud migration requires CSI data movement or file system backup rather than native snapshots. For commercial alternatives with built-in multi-tenancy and application awareness, Veeam Kasten K10 and TrilioVault are the main options.
