Kubernetes Jobs and CronJobs: running batch workloads

A Kubernetes Job runs one or more pods to completion and then stops. A CronJob does the same thing on a cron schedule. Together they cover database migrations, nightly backups, report generation, and any other task that should run once or on a timer rather than continuously. This guide walks through both resources from a minimal Job through parallelism, failure handling, timezone-aware CronJobs, concurrency control, and cleanup.

What you will have at the end

A working understanding of both batch/v1 resources (Job and CronJob) and a set of production-ready manifests covering the most common patterns: single-run tasks, parallel batch processing, failure-tolerant indexed work, and timezone-aware scheduled jobs with concurrency control and automatic cleanup.

Prerequisites

  • A running Kubernetes cluster (v1.27 or later for timezone support; v1.31+ for stable pod failure policies)
  • kubectl configured and able to reach the cluster
  • Familiarity with pod specs and YAML manifests
  • For resource configuration on batch pods, see Kubernetes resource requests and limits

Job vs. CronJob

A Job creates pods, tracks how many complete successfully, and marks itself done when that count is reached. It runs once; it does not repeat. Use a Job for database migrations, data imports, one-off scripts, and any task triggered by a deployment pipeline or manual action.

A CronJob creates Jobs on a cron schedule. It is the Kubernetes equivalent of a Unix crontab entry. The CronJob controller only creates Job objects. Each Job then manages its own pods. Use a CronJob for nightly backups, periodic report generation, cache warming, and log cleanup.

The chain is always CronJob -> Job -> Pod. Every field available on a Job spec is available inside a CronJob's .spec.jobTemplate.
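That nesting is easiest to see in a skeleton manifest. All names and values below are placeholders, not a recommended configuration:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob    # placeholder name
spec:
  schedule: "0 * * * *"    # CronJob level: when to create Jobs
  jobTemplate:             # everything below is an ordinary Job spec
    spec:
      backoffLimit: 4      # Job level: retry behavior
      template:            # pod template, as in any Job
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox:1.36
            command: ["echo", "hello"]
```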

Minimal Job

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: migrate
        image: myapp/migrate:2.4.0
        command: ["./migrate", "--target=latest"]
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            memory: "256Mi"
      restartPolicy: Never

Apply and watch:

kubectl apply -f job.yaml
kubectl get jobs --watch
# The Job is finished when COMPLETIONS shows 1/1 (recent kubectl versions also show a STATUS of "Complete").

# Read the pod logs:
kubectl logs -l job-name=db-migrate

Parallelism and completions

Jobs support three patterns, controlled by two fields:

| Pattern | .spec.completions | .spec.parallelism | When to use |
|---|---|---|---|
| Single pod | 1 (default) | 1 (default) | One task, one run |
| Fixed completion count | N | M (where M <= N) | N pods must succeed; M run concurrently |
| Work queue | unset (null) | M | M pods pull from an external queue; done when any pod exits 0 |

parallelism is mutable on a running Job, so you can scale batch work up or down mid-flight. For fixed-completion Jobs, the actual concurrency is min(parallelism, remaining completions).

spec:
  completions: 10   # 10 pods must succeed
  parallelism: 3    # up to 3 run at a time

Kubernetes runs batches of up to 3 pods until 10 total succeed.
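The work-queue pattern relies on each worker exiting 0 once the queue drains. A minimal sketch of that loop in Python, using an in-process queue as a stand-in for the external queue (Redis, RabbitMQ, etc.) that real pods would share — the function name and doubling "work" are illustrative only:

```python
import queue

def run_worker(tasks: queue.Queue) -> list:
    """Pull tasks until the queue is empty, then return (pod exits 0).

    In a real work-queue Job, several pods share one external queue;
    the Job completes as soon as any pod exits successfully after
    the queue drains.
    """
    done = []
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return done  # queue drained: the worker exits successfully
        done.append(item * 2)  # stand-in for real processing

q = queue.Queue()
for i in range(5):
    q.put(i)
print(run_worker(q))  # → [0, 2, 4, 6, 8]
```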

Indexed completion mode

Setting .spec.completionMode: Indexed (stable since Kubernetes v1.24) assigns each pod a unique index from 0 to completions - 1. The index is injected as the JOB_COMPLETION_INDEX environment variable and as the pod annotation batch.kubernetes.io/job-completion-index. The Job is complete when one pod has succeeded for every index.

Use indexed Jobs when each pod handles a deterministic partition: frame N of a render, shard N of a dataset, test suite N of a matrix.

apiVersion: batch/v1
kind: Job
metadata:
  name: render-frames
spec:
  completions: 120
  parallelism: 30
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: renderer
        image: studio/renderer:4.1.0
        command: ["./render", "--frame=$(JOB_COMPLETION_INDEX)"]
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            memory: "4Gi"

Verify progress:

kubectl describe job render-frames
# Look for "Completed Indexes:" — it shows which indexes are done.
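When each index owns a shard of a dataset rather than a single item, the pod has to map its index to a slice of the input. A sketch of that mapping — the function name and slicing scheme are my own, not a Kubernetes API:

```python
import os

def shard_for_index(items, index, total_indexes):
    """Return the contiguous slice of work owned by one completion index."""
    per_shard = -(-len(items) // total_indexes)  # ceiling division
    return items[index * per_shard : (index + 1) * per_shard]

# The Job controller injects JOB_COMPLETION_INDEX into each pod.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
dataset = list(range(100))
print(shard_for_index(dataset, index, total_indexes=10))
```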

Failure handling

Three fields control what happens when pods fail. They interact, so setting them together is important.

backoffLimit

spec:
  backoffLimit: 4   # default is 6

The maximum number of pod failures before the Job is marked Failed. The controller retries with exponential backoff: 10s, 20s, 40s, capped at 6 minutes. Once reached, all running pods are terminated.

With restartPolicy: Never, each failed pod counts toward the limit. With restartPolicy: OnFailure, container restarts inside the same pod do not count; only outright pod failures do. Use Never when you need to inspect logs from failed pods. Use OnFailure for short-lived tasks where log retention is less important.
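For example, a migration Job that should keep failed pods around for log inspection might pair the two settings like this (image and values are illustrative):

```yaml
spec:
  backoffLimit: 4
  template:
    spec:
      # Each failed pod counts toward backoffLimit and remains
      # available for "kubectl logs" after it fails.
      restartPolicy: Never
      containers:
      - name: migrate
        image: myapp/migrate:2.4.0
```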

activeDeadlineSeconds

spec:
  activeDeadlineSeconds: 3600   # 1-hour hard wall-clock limit

A wall-clock deadline on the entire Job, not per pod. Once exceeded, all running pods are terminated and the Job status becomes Failed with reason DeadlineExceeded. This takes precedence over backoffLimit: the Job fails even if retries remain.

Combine both:

spec:
  backoffLimit: 4
  activeDeadlineSeconds: 3600

The Job fails on whichever condition is hit first: 4 pod failures or 1 hour elapsed.

Pod failure policy

Stable since Kubernetes v1.31, pod failure policies give you fine-grained control beyond the blunt backoffLimit. Three actions are available:

  • FailJob: terminates the entire Job immediately. Use for non-retriable errors (known bad exit codes).
  • Ignore: does not count the failure toward backoffLimit. Use for infrastructure disruptions like node drains.
  • Count: counts the failure toward backoffLimit. This is the default behavior.

Rules are evaluated in order; first match wins.

spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]               # application-level "do not retry" signal
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget     # node drain or preemption

Per-index failure budgets (indexed Jobs)

For indexed Jobs, a single fast-failing index can exhaust the global backoffLimit before other indexes get a chance to run. Since Kubernetes v1.33, backoffLimitPerIndex and maxFailedIndexes solve this:

spec:
  completionMode: Indexed
  completions: 50
  parallelism: 10
  backoffLimitPerIndex: 2     # each index retries independently
  maxFailedIndexes: 5         # stop the whole Job if >5 indexes fail

Failed indexes appear in status.failedIndexes. This pairs with the FailIndex pod failure policy action, which terminates one index without killing the entire Job.
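A sketch of FailIndex in use, giving up on a single index when its container returns a known non-retriable exit code (exit code 42 and the container name are arbitrary examples; FailIndex requires backoffLimitPerIndex to be set):

```yaml
spec:
  completionMode: Indexed
  completions: 50
  backoffLimitPerIndex: 2
  podFailurePolicy:
    rules:
    - action: FailIndex          # fail this index only; the Job continues
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]
```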

CronJob schedule and timezone

The .spec.schedule field uses standard five-field cron syntax:

# ┌───────────── minute (0-59)
# │ ┌───────────── hour (0-23)
# │ │ ┌───────────── day of month (1-31)
# │ │ │ ┌───────────── month (1-12)
# │ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
  * * * * *

Common expressions:

| Expression | Meaning |
|---|---|
| 0 2 * * * | Daily at 02:00 |
| */15 * * * * | Every 15 minutes |
| 30 6 * * 1-5 | Weekdays at 06:30 |
| @daily | Shorthand for 0 0 * * * |
| @hourly | Shorthand for 0 * * * * |

Timezone support

Without .spec.timeZone, the schedule runs in the timezone of the kube-controller-manager process (typically UTC). Since Kubernetes v1.27, the timeZone field is GA and accepts any IANA timezone name:

spec:
  schedule: "0 9 * * 1-5"
  timeZone: "Europe/Amsterdam"   # 9 AM CET/CEST, Mon-Fri

Kubernetes handles daylight saving transitions automatically. Specify timeZone: "Etc/UTC" explicitly if you want UTC and do not want to depend on the controller-manager's local clock.

Do not embed TZ= or CRON_TZ= in the schedule string. Kubernetes rejects this with a validation error.

CronJob name limit: CronJob names must be 52 characters or fewer. The controller appends up to 11 characters to form child Job names, and Job names are capped at 63.

Concurrency policy

.spec.concurrencyPolicy controls what happens when a new scheduled execution fires while a previous Job from the same CronJob is still running.

| Value | Behavior | Typical use case |
|---|---|---|
| Allow (default) | Both run simultaneously | Independent, short tasks |
| Forbid | New execution is skipped (not queued) | Database operations, exclusive locks |
| Replace | Running Job is deleted, new one starts | Freshness-sensitive snapshots |

Forbid is the safer choice for most batch work (note that the actual default is Allow). Running two copies of a database backup or report generator at the same time usually causes trouble.
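Setting the policy is a one-line addition at the top level of the CronJob spec (the schedule here is illustrative):

```yaml
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid   # skip the new run if the previous Job is still active
```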

startingDeadlineSeconds

spec:
  startingDeadlineSeconds: 3600

The maximum number of seconds after a scheduled time that the controller will still try to start a Job. If the window passes, the execution is skipped. Set this to at least 60 seconds; values under 10 risk never firing because the CronJob controller checks at roughly 10-second intervals.

If more than 100 schedules are missed within the startingDeadlineSeconds window (or since the last successful schedule, if the field is unset), the CronJob stops creating Jobs entirely.

Cleaning up finished Jobs

Without cleanup, completed and failed Jobs accumulate and degrade API server performance over time.

Standalone Jobs: ttlSecondsAfterFinished

Stable since Kubernetes v1.23. Once a Job enters Complete or Failed status, the TTL controller deletes it (and its pods) after the specified number of seconds.

spec:
  ttlSecondsAfterFinished: 86400   # keep for 24 hours, then delete

Setting the value to 0 deletes immediately after completion. Leaving it unset means the Job stays forever.

CronJob history limits

CronJobs have built-in history management that is usually a better fit than TTL:

spec:
  successfulJobsHistoryLimit: 3   # default: 3
  failedJobsHistoryLimit: 3       # default: 1

These keep the N most recent Jobs of each status and delete older ones. For CronJobs, prefer these fields over setting ttlSecondsAfterFinished in the job template.

Suspending Jobs and CronJobs

Both resources support a suspend field. The behavior differs:

Job suspend: setting suspend: true deletes all active pods. Previously succeeded or failed pod counts are preserved. Setting it back to false resumes the Job from where it left off. Use this to free cluster resources during maintenance windows or to yield to higher-priority work.

# Suspend a running Job:
kubectl patch job render-frames -p '{"spec":{"suspend":true}}'

# Resume:
kubectl patch job render-frames -p '{"spec":{"suspend":false}}'

CronJob suspend: setting suspend: true stops the controller from creating new Jobs. Already-running Jobs continue until they finish. Use this during incidents or planned downtime.

kubectl patch cronjob nightly-backup -p '{"spec":{"suspend":true}}'

Complete CronJob example

This manifest combines the fields covered in this article into a production-ready nightly backup CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-db-backup              # <= 52 characters
spec:
  schedule: "0 2 * * *"                # daily at 02:00
  timeZone: "Europe/Amsterdam"          # GA since v1.27
  concurrencyPolicy: Forbid             # skip if previous run is still active
  startingDeadlineSeconds: 3600         # allow up to 1 hour late start
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600       # hard 1-hour wall-clock limit
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: myapp/db-backup:1.8.0
            command: ["/bin/sh", "-c", "pg_dump -h db-primary.internal mydb | gzip > /backup/mydb-$(date +%Y%m%d).sql.gz"]
            resources:
              requests:
                cpu: "200m"
                memory: "128Mi"
              limits:
                memory: "256Mi"
            volumeMounts:
            - name: backup-vol
              mountPath: /backup
          volumes:
          - name: backup-vol
            persistentVolumeClaim:
              claimName: backup-pvc

Expected result: kubectl get cronjob nightly-db-backup shows a LAST SCHEDULE timestamp updating daily at 02:00 CET/CEST. kubectl get jobs --selector=batch.kubernetes.io/cronjob-name=nightly-db-backup shows the three most recent Jobs retained.

Useful operational commands

# Manually trigger a CronJob (useful for testing):
kubectl create job --from=cronjob/nightly-db-backup manual-test-$(date +%s)

# List Jobs spawned by a CronJob:
kubectl get jobs --selector=batch.kubernetes.io/cronjob-name=nightly-db-backup

# Get logs from all pods of a specific Job:
kubectl logs -l job-name=nightly-db-backup-28950720

# Delete a CronJob and all its child Jobs:
kubectl delete cronjob nightly-db-backup

Common troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| Job stuck in Active state, no pods | Namespace has a ResourceQuota and the pod spec is missing resource requests/limits | Add resources.requests and resources.limits to the container spec |
| CronJob never fires | startingDeadlineSeconds set below 10 | Increase to at least 60 |
| CronJob stopped scheduling after outage | More than 100 missed schedules accumulated | Delete and recreate the CronJob, or set startingDeadlineSeconds to prevent this |
| Pods accumulate, API server slows | No TTL or history limit configured | Add ttlSecondsAfterFinished (standalone Jobs) or tune successfulJobsHistoryLimit / failedJobsHistoryLimit (CronJobs) |
| Duplicate backup runs at the same time | concurrencyPolicy is Allow (the default) | Set concurrencyPolicy: Forbid |

For resource sizing on batch pods (CPU requests, memory limits, QoS class implications), see Kubernetes resource requests and limits.
