Table of contents
- What you will learn
- Assumed starting point
- How Velero fits into the Kubernetes backup picture
- Install Velero
- Take your first backup
- Schedule backups with retention
- Volume backup mechanisms: which one to pick
- Database consistency with pre-backup hooks
- Restore into the same or a new cluster
- Monitor backup health
- Common troubleshooting
- What you learned
What you will learn
By the end of this tutorial you will be able to install Velero, back up Kubernetes namespaces and persistent volumes to object storage, schedule those backups with tiered retention, add pre-backup hooks so database volumes are application-consistent, restore a namespace into the same cluster or a different one, and set up Prometheus monitoring for backup health.
Assumed starting point
This tutorial assumes:
- A running Kubernetes cluster (1.20+) with kubectl access
- An object storage bucket (AWS S3, GCS, Azure Blob, or an S3-compatible service such as MinIO for on-premises)
- Cloud provider credentials with read/write access to the bucket
- The Velero CLI installed locally (v1.18+)
- Familiarity with PersistentVolumes and PersistentVolumeClaims
- Familiarity with StorageClasses and dynamic provisioning
If you run a managed cluster (EKS, GKE, AKS), you already have an object storage service available. For self-managed clusters, MinIO works as an S3-compatible target inside or outside the cluster.
How Velero fits into the Kubernetes backup picture
Velero (formerly Heptio Ark) is an open-source tool maintained by VMware Tanzu that backs up and restores Kubernetes API objects and persistent volume data. It queries the Kubernetes API server for resource definitions and uploads them as tarballs to object storage. It does not touch etcd directly.
That distinction matters. An etcd snapshot captures the entire control-plane state in one binary blob: every Deployment, Secret, RBAC rule, and CRD. It is the right tool for pre-upgrade rollback or full control-plane recovery. But an etcd snapshot cannot restore a single deleted namespace, cannot filter by label, and cannot move workloads from one cluster to another. On managed clusters (EKS, GKE, AKS), you do not have etcd access at all.
Velero fills that gap. Its backup operates at the API level, which means you can back up individual namespaces, filter by resource type or label, exclude stale status fields, and restore into a completely different cluster. The two tools are complementary, not interchangeable.
Two-part architecture. Velero runs a Deployment (the controller) and an optional DaemonSet (the Node Agent, for file-system-level volume backups) inside the cluster. The CLI on your workstation sends commands through the Kubernetes API.
Install Velero
This section shows the CLI install for AWS. The supported providers page lists plugins for GCP, Azure, vSphere, and community-maintained providers.
Step 1: create a bucket and credentials
Create an S3 bucket and an IAM user with a policy that grants s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket on that bucket. Store the access key in a credentials file:
cat > /tmp/credentials-velero <<EOF
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF
Replace the example values with your actual IAM credentials.
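Scoped to a single bucket, the IAM policy from this step might look like the following sketch (it uses the bucket name from the install command below; substitute your own, and extend it if you also use native EBS snapshots, which need EC2 snapshot permissions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
```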
Step 2: install Velero with CSI support and Node Agent
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket my-velero-backups \
--secret-file /tmp/credentials-velero \
--backup-location-config region=eu-west-1 \
--snapshot-location-config region=eu-west-1 \
--use-node-agent \
--features=EnableCSI
--use-node-agent deploys the DaemonSet needed for file system backup and CSI snapshot data movement. --features=EnableCSI enables the CSI snapshot integration.
Expected output:
Velero is installed! ⛵ Use 'velero client config set ...' to configure the CLI.
Step 3: verify the installation
kubectl get deployments -n velero
kubectl get daemonsets -n velero
You should see the velero Deployment with 1/1 available and the node-agent DaemonSet with pods running on every schedulable node.
NAME     READY   UP-TO-DATE   AVAILABLE
velero   1/1     1            1

NAME         DESIRED   CURRENT   READY
node-agent   3         3         3
Checkpoint. If the Velero pod is not Ready, check the logs: kubectl logs deploy/velero -n velero. Common issues: wrong S3 region, invalid credentials, missing bucket.
Alternative: Helm install
For production environments managed through GitOps, Helm is the better path. Configuration lives in a values.yaml committed alongside your other manifests.
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
-f velero-values.yaml
See the Helm chart documentation for the full values.yaml reference.
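As a starting point, a velero-values.yaml that mirrors the CLI install above might look like this sketch. The key names are taken from recent versions of the vmware-tanzu/velero chart and may differ in the version you install — verify each against the chart's values reference before applying:

```yaml
# velero-values.yaml -- mirrors the `velero install` flags used above
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.0
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  features: EnableCSI
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: my-velero-backups
      config:
        region: eu-west-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: eu-west-1
deployNodeAgent: true          # equivalent of --use-node-agent
credentials:
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=AKIAIOSFODNN7EXAMPLE
      aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

Keeping the credentials inline as above is convenient for testing; for production, prefer referencing an existing Secret so keys stay out of Git.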
Take your first backup
Step 4: back up a namespace
velero backup create first-backup \
--include-namespaces production \
--wait
--wait blocks until the backup completes. Without it, the command returns immediately and you poll with velero backup describe.
Expected output:
Backup request "first-backup" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c...
Backup completed with status: Completed.
Step 5: inspect the backup
velero backup describe first-backup --details
This shows which resources were captured, which volumes were snapshotted, any warnings or errors, and the hook execution results. Pay attention to the Phase (should be Completed) and the Errors count (should be 0).
Step 6: verify backup contents
velero backup logs first-backup | head -30
The log lists every resource processed. You will see lines like backup/production/Deployment/my-app, confirming that API objects were written to the tarball. Volume snapshots appear as separate lines showing the snapshot mechanism used.
Checkpoint. You now have a working backup of one namespace in your S3 bucket. The next section adds automation.
Schedule backups with retention
One-off backups are a start, but production workloads need automated, recurring backups with defined retention windows.
Step 7: create a daily schedule
velero schedule create daily-production \
--schedule="CRON_TZ=Europe/Amsterdam 0 2 * * *" \
--include-namespaces production \
--ttl 168h
This runs a backup at 02:00 Amsterdam time every day and keeps each backup for 168 hours (7 days). Velero's garbage collector deletes expired backups hourly.
Timezone support in the CRON_TZ= prefix is available since v1.18. Older versions use UTC only.
Step 8: verify the schedule
velero schedule get
NAME               STATUS    CREATED                     SCHEDULE                             BACKUP TTL
daily-production   Enabled   2026-04-09 14:30:00 +0200   CRON_TZ=Europe/Amsterdam 0 2 * * *   168h0m0s
To trigger a backup from a schedule immediately (useful for testing):
velero backup create --from-schedule daily-production
Tiered retention strategy
For production, consider multiple schedules with different retention windows:
# Hourly backups, kept for 24 hours
velero schedule create hourly-production \
--schedule="0 * * * *" \
--include-namespaces production \
--ttl 24h
# Weekly backups, kept for 28 days
velero schedule create weekly-production \
--schedule="CRON_TZ=Europe/Amsterdam 0 3 * * 0" \
--include-namespaces production \
--ttl 672h
# Monthly backups, kept for 1 year (compliance)
velero schedule create monthly-production \
--schedule="CRON_TZ=Europe/Amsterdam 0 4 1 * *" \
--include-namespaces production \
--ttl 8760h
The default TTL is 30 days when --ttl is not specified.
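The --ttl flag takes a plain duration, so each tier's retention window is just days multiplied by 24. A quick shell sanity check of the values used above:

```shell
# TTL values from the tiered schedules, derived from their retention windows
echo "$((24 * 1))h"      # hourly tier:   1 day   -> 24h
echo "$((24 * 7))h"      # daily tier:    7 days  -> 168h
echo "$((24 * 28))h"     # weekly tier:   28 days -> 672h
echo "$((24 * 365))h"    # monthly tier:  1 year  -> 8760h
```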
Volume backup mechanisms: which one to pick
Velero supports three mechanisms for backing up persistent volume data. They are mutually exclusive per volume.
Native cloud provider snapshots. Uses cloud APIs (EBS, GCE Persistent Disk, Azure Managed Disks) to create point-in-time snapshots. Fastest, lowest overhead, but snapshots are region-locked. You cannot use them for cross-region or cross-cloud migration.
CSI snapshot data movement. Creates a CSI VolumeSnapshot, then extracts the data and uploads it to your object storage bucket via Kopia. After upload, the local CSI snapshot is deleted. This is the recommended mechanism when you need durability beyond the storage system or cross-cloud portability.
velero backup create my-backup \
--include-namespaces production \
--snapshot-move-data
Monitor upload progress with:
kubectl -n velero get datauploads -l velero.io/backup-name=my-backup
File system backup (FSB). Reads volume data directly from running pods via the Node Agent DaemonSet. Uses Kopia for deduplication, compression, and upload. Use FSB for volumes without snapshot support: NFS, EFS, AzureFile, local volumes, emptyDir. Opt in per pod:
metadata:
annotations:
backup.velero.io/backup-volumes: data-volume
Or install with FSB as the default for all volumes:
velero install --use-node-agent --default-volumes-to-fs-backup
Limitation. FSB reads from live PVs, so data is not captured at a single point in time. For databases, this means crash-consistent at best. Application-consistent backup requires hooks.
Kopia replaced Restic
Velero's file system backup originally used Restic as the data mover. That changed over several releases:
| Version | Change |
|---|---|
| v1.10 | Kopia integrated alongside Restic |
| v1.12 | Default uploader switched to Kopia |
| v1.15 | Restic deprecated (warnings emitted) |
| v1.17 | --uploader-type=restic removed for new backups |
Existing Restic-created backups can still be restored in v1.17+. New backups use Kopia exclusively.
Database consistency with pre-backup hooks
Backing up a database volume without quiescing the database first produces a crash-consistent snapshot. That is roughly equivalent to pulling the power cord. The database will likely recover, but it is not guaranteed, and recovery takes time.
Velero pre-backup and post-backup hooks solve this by running commands inside pod containers before and after backup processing.
Step 9: add PostgreSQL backup hooks
Apply these annotations to your PostgreSQL pod (or the pod template in a StatefulSet):
metadata:
annotations:
pre.hook.backup.velero.io/command: >-
["/bin/bash", "-c",
"psql -U postgres -c \"SELECT pg_backup_start('velero', true);\""]
pre.hook.backup.velero.io/container: postgres
pre.hook.backup.velero.io/timeout: 5m
pre.hook.backup.velero.io/on-error: Fail
post.hook.backup.velero.io/command: >-
["/bin/bash", "-c",
"psql -U postgres -c \"SELECT pg_backup_stop();\""]
post.hook.backup.velero.io/container: postgres
post.hook.backup.velero.io/timeout: 2m
post.hook.backup.velero.io/on-error: Continue
backup.velero.io/backup-volumes: pgdata
on-error: Fail on the pre-hook means the backup aborts if the database cannot enter backup mode. on-error: Continue on the post-hook means the backup data is preserved even if the resume command fails.
PostgreSQL version note. pg_backup_start() / pg_backup_stop() is the current API since PostgreSQL 15. Older versions (14 and below) use pg_start_backup() / pg_stop_backup().
MySQL/MariaDB hooks
metadata:
annotations:
pre.hook.backup.velero.io/command: >-
["/bin/bash", "-c",
"mysql -u root -p$MYSQL_ROOT_PASSWORD -e 'FLUSH TABLES WITH READ LOCK;'"]
pre.hook.backup.velero.io/container: mysql
pre.hook.backup.velero.io/timeout: 3m
pre.hook.backup.velero.io/on-error: Fail
post.hook.backup.velero.io/command: >-
["/bin/bash", "-c",
"mysql -u root -p$MYSQL_ROOT_PASSWORD -e 'UNLOCK TABLES;'"]
post.hook.backup.velero.io/container: mysql
post.hook.backup.velero.io/timeout: 1m
post.hook.backup.velero.io/on-error: Continue
FLUSH TABLES WITH READ LOCK blocks all writes for the duration of the backup. For high-traffic databases, back up a read replica instead of locking the primary.
Alternative: centralized hooks in the Backup spec
Instead of annotating every pod, you can define hooks centrally in the Backup or Schedule resource:
apiVersion: velero.io/v1
kind: Backup
metadata:
name: production-with-hooks
namespace: velero
spec:
includedNamespaces:
- production
hooks:
resources:
- name: postgres-consistency
labelSelector:
matchLabels:
app: postgres
pre:
- exec:
container: postgres
command:
- /bin/bash
- -c
- "psql -U postgres -c \"SELECT pg_backup_start('velero', true);\""
onError: Fail
timeout: 5m
post:
- exec:
container: postgres
command:
- /bin/bash
- -c
- "psql -U postgres -c \"SELECT pg_backup_stop();\""
onError: Continue
timeout: 2m
This approach is better for GitOps workflows where pod annotations may be managed by a separate team.
Step 10: verify hook execution
After a backup completes, check that hooks ran successfully:
velero backup describe first-backup --details
Look for the Hooks section. It lists each hook, the pod it ran on, and whether it succeeded or failed.
Checkpoint. You now have scheduled, application-consistent backups of database workloads.
Restore into the same or a new cluster
Step 11: restore a deleted namespace
velero restore create --from-backup daily-production-20260409020000 \
--include-namespaces production
Velero restores resources in dependency order: CRDs first, then Namespaces, StorageClasses, PersistentVolumes, PersistentVolumeClaims, Secrets, ConfigMaps, and finally workloads.
By default, Velero skips resources that already exist (non-destructive). To overwrite existing resources:
velero restore create --from-backup daily-production-20260409020000 \
--existing-resource-policy update
Step 12: restore into a different namespace
velero restore create --from-backup daily-production-20260409020000 \
--namespace-mappings production:staging
This maps every resource from the production namespace into staging. Useful for cloning production data into a test environment.
Step 13: cross-cluster restore (disaster recovery)
For a full cluster replacement, point a new cluster at the same backup storage location. Install Velero with the same bucket and credentials, then:
# Set storage to read-only to prevent writes during recovery
kubectl patch backupstoragelocation default \
--namespace velero \
--type merge \
--patch '{"spec":{"accessMode":"ReadOnly"}}'
# Restore
velero restore create --from-backup daily-production-20260409020000
# Verify
velero restore describe <restore-name> --details
kubectl get all -n production
# Return to read-write
kubectl patch backupstoragelocation default \
--namespace velero \
--type merge \
--patch '{"spec":{"accessMode":"ReadWrite"}}'
Setting the storage location to ReadOnly during recovery prevents the new cluster's Velero instance from writing (and potentially corrupting) the backup repository while the restore is in progress.
Cross-cloud note. Native cloud snapshots are region-locked and cannot be restored across providers. For cross-cloud migration, the backups must use CSI snapshot data movement (--snapshot-move-data) or file system backup. Both store volume data in the object storage bucket, which is provider-independent.
Post-restore hooks
Velero supports two types of restore hooks. InitContainer hooks inject an init container into restored pods, running before the application containers start. Exec hooks run a command inside a restored pod's container; by default they execute once the container is running, or after the pod reports Ready if wait-for-ready is set.
Example: import a database dump after restore:
metadata:
annotations:
post.hook.restore.velero.io/container: postgres
post.hook.restore.velero.io/command: '["/bin/bash", "-c", "psql -U postgres < /backup/backup.sql"]'
post.hook.restore.velero.io/exec-timeout: 120s
post.hook.restore.velero.io/wait-for-ready: "true"
post.hook.restore.velero.io/on-error: Fail
wait-for-ready: true makes Velero wait until the pod is Ready before executing the command. For databases that need initialization time, this is important.
Checkpoint. You can now restore namespaces into the same cluster, a different namespace, or a different cluster entirely.
Monitor backup health
A backup you never test is not a backup. Monitoring catches silent failures before they matter.
Step 14: Prometheus metrics
Velero exposes metrics on port 8085. If you run the Prometheus Operator, add a scrape config:
- job_name: velero
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
action: keep
regex: velero
- source_labels: [__meta_kubernetes_pod_container_port_number]
action: keep
regex: "8085"
Key metrics to alert on:
- velero_backup_failure_total increasing: a backup schedule is failing
- velero_backup_duration_seconds spiking: storage or network problems
- No new velero_backup_success_total increment in 25 hours: the schedule stopped running
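With the Prometheus Operator, the first and last of these alerts can be sketched as a PrometheusRule. The alert names, windows, and severities below are illustrative choices, not anything defined by Velero itself:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts
  namespace: velero
spec:
  groups:
    - name: velero.rules
      rules:
        # Any backup failure recorded in the last hour
        - alert: VeleroBackupFailure
          expr: increase(velero_backup_failure_total[1h]) > 0
          labels:
            severity: critical
          annotations:
            summary: A Velero backup schedule is failing
        # No successful backup for 25 hours: a daily schedule missed its window
        - alert: VeleroBackupMissing
          expr: increase(velero_backup_success_total[25h]) == 0
          labels:
            severity: warning
          annotations:
            summary: No successful Velero backup in the last 25 hours
```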
Step 15: Grafana dashboards
The community maintains several pre-built Velero dashboards; search the Grafana dashboard library for "Velero" and import one into your Grafana instance to get a visual overview of backup health, duration trends, and storage usage.
Regular restore drills
Schedule a quarterly restore drill. Restore a recent backup into a disposable namespace, verify the workloads come up, confirm the database contains expected data, then tear it down. A backup that has never been tested is a liability.
Common troubleshooting
PartiallyFailed backup, repository not initialized. Check velero repo get and verify the Node Agent DaemonSet is running on all nodes. Verify S3 credentials and bucket access. Missing node-agent pods are the most common cause.
Velero pod OOMKilled. Large backups with many resources can exhaust memory. Increase --velero-pod-mem-limit during install. Velero v1.18 moved repository operations outside the server process to reduce OOM risk.
Stuck InProgress backup. Velero cannot resume interrupted backups. Delete the stuck backup (kubectl delete backup <name> -n velero) and trigger a new one.
LoadBalancer DNS changes after restore. Cloud load balancers get new UIDs after restore, which means new DNS names. Update CNAME records manually, or set spec.loadBalancerIP on the Service where the provider supports it.
Admission webhooks blocking restore. ValidatingWebhookConfigurations and MutatingWebhookConfigurations can reject or mutate resources during restore. Temporarily disable webhooks if restores fail with admission errors.
For deeper debugging:
velero backup describe <name> --details # hook results and errors
velero backup logs <name> # per-resource processing log
kubectl -n velero get datauploads # CSI data movement progress
Full troubleshooting reference on velero.io.
What you learned
This tutorial covered the full Velero lifecycle for application-level Kubernetes backup and disaster recovery:
- Velero backs up API objects and volume data. It is complementary to etcd snapshots, not a replacement.
- Three volume backup mechanisms exist: native cloud snapshots (fastest, region-locked), CSI snapshot data movement (durable, cross-cloud), and file system backup via Kopia (for unsupported volume types).
- Database consistency requires explicit pre-backup hooks. Without them, volume backups are crash-consistent at best.
- Scheduled backups with tiered TTL provide hourly, daily, weekly, and monthly retention.
- Restores work into the same namespace, a different namespace, or a different cluster.
- Prometheus metrics and regular restore drills make the difference between a backup strategy and a disaster waiting to happen.
Velero does not handle everything. It does not back up the control plane itself (that requires etcd snapshots), it has no multi-tenancy support (only cluster administrators can manage it), and cross-cloud migration requires CSI data movement or file system backup rather than native snapshots. For commercial alternatives with built-in multi-tenancy and application awareness, Veeam Kasten K10 and TrilioVault are the main options.