What you will have at the end
A working understanding of both batch/v1 resources (Job and CronJob) and a set of production-ready manifests covering the most common patterns: single-run tasks, parallel batch processing, failure-tolerant indexed work, and timezone-aware scheduled jobs with concurrency control and automatic cleanup.
Prerequisites
- A running Kubernetes cluster (v1.27 or later for timezone support; v1.31+ for stable pod failure policies)
- kubectl configured and able to reach the cluster
- Familiarity with pod specs and YAML manifests
- For resource configuration on batch pods, see Kubernetes resource requests and limits
Job vs. CronJob
A Job creates pods, tracks how many complete successfully, and marks itself done when that count is reached. It runs once; it does not repeat. Use a Job for database migrations, data imports, one-off scripts, and any task triggered by a deployment pipeline or manual action.
A CronJob creates Jobs on a cron schedule. It is the Kubernetes equivalent of a Unix crontab entry. The CronJob controller only creates Job objects. Each Job then manages its own pods. Use a CronJob for nightly backups, periodic report generation, cache warming, and log cleanup.
The chain is always CronJob -> Job -> Pod. Every field available on a Job spec is available inside a CronJob's .spec.jobTemplate.
Minimal Job
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: migrate
        image: myapp/migrate:2.4.0
        command: ["./migrate", "--target=latest"]
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            memory: "256Mi"
      restartPolicy: Never
Apply and watch:
kubectl apply -f job.yaml
kubectl get jobs --watch
# Once the STATUS column shows "Complete", the migration finished.
# Read the pod logs:
kubectl logs -l job-name=db-migrate
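In a deployment pipeline, blocking on completion is usually more useful than watching interactively. A sketch with kubectl wait (the 300-second timeout is an assumption; size it to your slowest expected migration):

```shell
# Block until the Job reports the Complete condition; kubectl wait exits
# non-zero on timeout, which fails the pipeline step:
kubectl wait --for=condition=complete job/db-migrate --timeout=300s
```

Note that this waits only for the Complete condition, so a Job that fails outright will sit here until the timeout expires.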
Parallelism and completions
Jobs support three patterns, controlled by two fields:
| Pattern | .spec.completions | .spec.parallelism | When to use |
|---|---|---|---|
| Single pod | 1 (default) | 1 (default) | One task, one run |
| Fixed completion count | N | M (where M <= N) | N pods must succeed; M run concurrently |
| Work queue | unset (null) | M | M pods pull from an external queue; done when any pod exits 0 |
parallelism is mutable on a running Job, so you can scale batch work up or down mid-flight. For fixed-completion Jobs, the actual concurrency is min(parallelism, remaining completions).
spec:
  completions: 10   # 10 pods must succeed
  parallelism: 3    # up to 3 run at a time
Kubernetes runs batches of up to 3 pods until 10 total succeed.
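Because parallelism is mutable, scaling the batch above mid-run is a one-line patch (the Job name is illustrative):

```shell
# Raise concurrency from 3 to 6 while the Job is running; the controller
# starts additional pods, never exceeding the remaining completions:
kubectl patch job my-batch-job -p '{"spec":{"parallelism":6}}'
```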
Indexed completion mode
Setting .spec.completionMode: Indexed (stable since Kubernetes v1.24) assigns each pod a unique index from 0 to completions - 1. The index is injected as the JOB_COMPLETION_INDEX environment variable and as the pod annotation batch.kubernetes.io/job-completion-index. The Job is complete when one pod has succeeded for every index.
Use indexed Jobs when each pod handles a deterministic partition: frame N of a render, shard N of a dataset, test suite N of a matrix.
apiVersion: batch/v1
kind: Job
metadata:
  name: render-frames
spec:
  completions: 120
  parallelism: 30
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: renderer
        image: studio/renderer:4.1.0
        command: ["./render", "--frame=$(JOB_COMPLETION_INDEX)"]
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            memory: "4Gi"
Verify progress:
kubectl describe job render-frames
# Look for "Completed Indexes:" — it shows which indexes are done.
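How an index maps to work is entirely up to the container. One common scheme, sketched below, strides over a shared item list so pods need no coordination (the function name and item counts are illustrative; only the JOB_COMPLETION_INDEX variable is provided by Kubernetes):

```shell
# Pod with index i processes items i, i+completions, i+2*completions, ...
shard_items() {
  local index=$1 completions=$2 total=$3
  local n
  for ((n = index; n < total; n += completions)); do
    echo "$n"
  done
}

# Inside the container this would be called as:
#   shard_items "$JOB_COMPLETION_INDEX" 120 "$TOTAL_ITEMS"
shard_items 1 3 10   # prints 1, 4, 7
```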
Failure handling
Three fields control what happens when pods fail. They interact, so setting them together is important.
backoffLimit
spec:
  backoffLimit: 4   # default is 6
The maximum number of pod failures before the Job is marked Failed. The controller retries with exponential backoff: 10s, 20s, 40s, capped at 6 minutes. Once reached, all running pods are terminated.
With restartPolicy: Never, each failed pod counts toward the limit. With restartPolicy: OnFailure, container restarts inside the same pod do not count; only outright pod failures do. Use Never when you need to inspect logs from failed pods. Use OnFailure for short-lived tasks where log retention is less important.
activeDeadlineSeconds
spec:
  activeDeadlineSeconds: 3600   # 1-hour hard wall-clock limit
A wall-clock deadline on the entire Job, not per pod. Once exceeded, all running pods are terminated and the Job status becomes Failed with reason DeadlineExceeded. This takes precedence over backoffLimit: the Job fails even if retries remain.
Combine both:
spec:
  backoffLimit: 4
  activeDeadlineSeconds: 3600
The Job fails on whichever condition is hit first: 4 pod failures or 1 hour elapsed.
Pod failure policy
Stable since Kubernetes v1.31, pod failure policies give you fine-grained control beyond the blunt backoffLimit counter. Three actions apply to any Job (a fourth, FailIndex, is specific to indexed Jobs and covered in the next section):
- FailJob: terminates the entire Job immediately. Use for non-retriable errors (known bad exit codes).
- Ignore: does not count the failure toward backoffLimit. Use for infrastructure disruptions like node drains.
- Count: the default behavior; the failure counts toward backoffLimit.
Rules are evaluated in order; the first matching rule wins. Note that podFailurePolicy requires the pod template's restartPolicy to be Never.
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]   # application-level "do not retry" signal
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget   # node drain or preemption
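This rule set only pays off if the application maps error classes to exit codes on purpose. A sketch of the container-side convention (exit code 42 is this manifest's own "do not retry" signal, not a Kubernetes-defined value):

```shell
# Return 42 for permanent errors (podFailurePolicy fails the Job immediately),
# 1 for transient errors (counts toward backoffLimit, so the pod is retried).
run_task() {
  case "$1" in
    malformed-input) return 42 ;;  # permanent: retrying cannot help
    network-blip)    return 1  ;;  # transient: worth a retry
    *)               return 0  ;;
  esac
}
```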
Per-index failure budgets (indexed Jobs)
For indexed Jobs, a single fast-failing index can exhaust the global backoffLimit before other indexes get a chance to run. Since Kubernetes v1.33, backoffLimitPerIndex and maxFailedIndexes solve this:
spec:
  completionMode: Indexed
  completions: 50
  parallelism: 10
  backoffLimitPerIndex: 2   # each index retries independently
  maxFailedIndexes: 5       # fail the whole Job if more than 5 indexes fail
Failed indexes appear in status.failedIndexes. This pairs with the FailIndex pod failure policy action, which terminates one index without killing the entire Job.
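To see which indexes have exhausted their per-index budget, read the status field directly (the Job name is illustrative; the field is populated only when backoffLimitPerIndex is set):

```shell
# Prints the failed index set as a compact string, e.g. "3,7-9":
kubectl get job render-shards -o jsonpath='{.status.failedIndexes}'
```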
CronJob schedule and timezone
The .spec.schedule field uses standard five-field cron syntax:
# ┌───────────── minute (0-59)
# │ ┌───────────── hour (0-23)
# │ │ ┌───────────── day of month (1-31)
# │ │ │ ┌───────────── month (1-12)
# │ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
* * * * *
Common expressions:
| Expression | Meaning |
|---|---|
| 0 2 * * * | Daily at 02:00 |
| */15 * * * * | Every 15 minutes |
| 30 6 * * 1-5 | Weekdays at 06:30 |
| @daily | Shorthand for 0 0 * * * |
| @hourly | Shorthand for 0 * * * * |
Timezone support
Without .spec.timeZone, the schedule runs in the timezone of the kube-controller-manager process (typically UTC). Since Kubernetes v1.27, the timeZone field is GA and accepts any IANA timezone name:
spec:
  schedule: "0 9 * * 1-5"
  timeZone: "Europe/Amsterdam"   # 9 AM CET/CEST, Mon-Fri
Kubernetes handles daylight saving transitions automatically. Specify timeZone: "Etc/UTC" explicitly if you want UTC and do not want to depend on the controller-manager's local clock.
Do not embed TZ= or CRON_TZ= in the schedule string. Kubernetes rejects this with a validation error.
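Before committing a zone name to a manifest, a quick local sanity check catches typos (assumes a tzdata-backed date command, as on most Linux and macOS systems):

```shell
# A valid IANA name resolves to a real zone abbreviation:
TZ="Europe/Amsterdam" date +%Z   # CET in winter, CEST in summer
```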
CronJob name limit: CronJob names must be 52 characters or fewer. The controller appends up to 11 characters to form child Job names, and Job names are capped at 63.
Concurrency policy
.spec.concurrencyPolicy controls what happens when a new scheduled execution fires while a previous Job from the same CronJob is still running.
| Value | Behavior | Typical use case |
|---|---|---|
| Allow (default) | Both run simultaneously | Independent, short tasks |
| Forbid | New execution is skipped (not queued) | Database operations, exclusive locks |
| Replace | Running Job is deleted, new one starts | Freshness-sensitive snapshots |
Forbid is the safe default for most batch work. Running two copies of a database backup or report generator at the same time usually causes trouble.
startingDeadlineSeconds
spec:
  startingDeadlineSeconds: 3600
The maximum number of seconds after a scheduled time that the controller will still try to start a Job. If the window passes, the execution is skipped. Set this to at least 60 seconds; values under 10 risk never firing because the CronJob controller checks at roughly 10-second intervals.
If more than 100 schedules are missed within the startingDeadlineSeconds window (or since the last successful schedule, if the field is unset), the CronJob stops creating Jobs entirely.
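A CronJob wedged this way will not recover on its own; the usual fix is to recreate it (the manifest path is illustrative):

```shell
# Recreating the object resets the missed-schedule bookkeeping:
kubectl delete cronjob nightly-backup
kubectl apply -f cronjob.yaml
```

Setting startingDeadlineSeconds in the new manifest bounds the counting window so the 100-miss limit cannot accumulate again.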
Cleaning up finished Jobs
Without cleanup, completed and failed Jobs accumulate and degrade API server performance over time.
Standalone Jobs: ttlSecondsAfterFinished
Stable since Kubernetes v1.23. Once a Job enters Complete or Failed status, the TTL controller deletes it (and its pods) after the specified number of seconds.
spec:
  ttlSecondsAfterFinished: 86400   # keep for 24 hours, then delete
Setting the value to 0 deletes immediately after completion. Leaving it unset means the Job stays forever.
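The field is mutable, so a TTL can also be added to a Job that has already finished; the TTL controller then picks it up for deletion (the 600-second value is illustrative):

```shell
# Schedule an already-finished Job for deletion 10 minutes from now:
kubectl patch job db-migrate -p '{"spec":{"ttlSecondsAfterFinished":600}}'
```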
CronJob history limits
CronJobs have built-in history management that is usually a better fit than TTL:
spec:
  successfulJobsHistoryLimit: 3   # default: 3
  failedJobsHistoryLimit: 3       # default: 1
These keep the N most recent Jobs of each status and delete older ones. For CronJobs, prefer these fields over setting ttlSecondsAfterFinished in the job template.
Suspending Jobs and CronJobs
Both resources support a suspend field. The behavior differs:
Job suspend: setting suspend: true deletes all active pods. Previously succeeded or failed pod counts are preserved. Setting it back to false resumes the Job from where it left off. Use this to free cluster resources during maintenance windows or to yield to higher-priority work.
# Suspend a running Job:
kubectl patch job render-frames -p '{"spec":{"suspend":true}}'
# Resume:
kubectl patch job render-frames -p '{"spec":{"suspend":false}}'
CronJob suspend: setting suspend: true stops the controller from creating new Jobs. Already-running Jobs continue until they finish. Use this during incidents or planned downtime.
kubectl patch cronjob nightly-backup -p '{"spec":{"suspend":true}}'
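suspend can also be set in the manifest, so a CronJob is created paused and enabled later, which is handy when promoting manifests to a new environment. A minimal sketch (name and image are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  suspend: true              # created paused; patch to false when ready
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: myapp/db-backup:1.8.0
```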
Complete CronJob example
This manifest combines the fields covered in this article into a production-ready nightly backup CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-db-backup              # <= 52 characters
spec:
  schedule: "0 2 * * *"                # daily at 02:00
  timeZone: "Europe/Amsterdam"         # GA since v1.27
  concurrencyPolicy: Forbid            # skip if previous run is still active
  startingDeadlineSeconds: 3600        # allow up to 1 hour late start
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600      # hard 1-hour wall-clock limit
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: myapp/db-backup:1.8.0
            command: ["/bin/sh", "-c", "pg_dump -h db-primary.internal mydb | gzip > /backup/mydb-$(date +%Y%m%d).sql.gz"]
            resources:
              requests:
                cpu: "200m"
                memory: "128Mi"
              limits:
                memory: "256Mi"
            volumeMounts:
            - name: backup-vol
              mountPath: /backup
          volumes:
          - name: backup-vol
            persistentVolumeClaim:
              claimName: backup-pvc
Expected result: kubectl get cronjob nightly-db-backup shows a LAST SCHEDULE timestamp updating daily at 02:00 CET/CEST. kubectl get jobs --selector=batch.kubernetes.io/cronjob-name=nightly-db-backup shows the three most recent Jobs retained.
Useful operational commands
# Manually trigger a CronJob (useful for testing):
kubectl create job --from=cronjob/nightly-db-backup manual-test-$(date +%s)
# List Jobs spawned by a CronJob:
kubectl get jobs --selector=batch.kubernetes.io/cronjob-name=nightly-db-backup
# Get logs from all pods of a specific Job:
kubectl logs -l job-name=nightly-db-backup-28950720
# Delete a CronJob and all its child Jobs:
kubectl delete cronjob nightly-db-backup
Common troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Job stuck in Active state, no pods | Namespace has a ResourceQuota and the pod spec is missing resource requests/limits | Add resources.requests and resources.limits to the container spec |
| CronJob never fires | startingDeadlineSeconds set below 10 | Increase to at least 60 |
| CronJob stopped scheduling after outage | More than 100 missed schedules accumulated | Delete and recreate the CronJob, or set startingDeadlineSeconds to prevent this |
| Pods accumulate, API server slows | No TTL or history limit configured | Add ttlSecondsAfterFinished (standalone Jobs) or tune successfulJobsHistoryLimit / failedJobsHistoryLimit (CronJobs) |
| Duplicate backup runs at the same time | concurrencyPolicy is Allow (the default) | Set concurrencyPolicy: Forbid |
For resource sizing on batch pods (CPU requests, memory limits, QoS class implications), see Kubernetes resource requests and limits.