Kubernetes blue-green and canary deployment strategies

Rolling updates handle most stateless services well, but they cannot give you instant rollback or expose a new version to 5% of traffic before the full cutover. Blue-green and canary deployments can. This tutorial walks through both patterns natively with kubectl, then layers Argo Rollouts and Gateway API on top so you can see exactly what each tool buys you.

What you will learn

By the end of this tutorial you will be able to pick between rolling updates, blue-green, and canary based on a clear risk profile, then implement either advanced strategy two ways: first with native kubectl (no extra controllers), then with Argo Rollouts v1.9.0 once you have outgrown what native primitives can do. You will also see where Gateway API fits, and which considerations (sessions, caches, database migrations, cost) actually break a cutover in practice.

Assumed starting point

This tutorial assumes you can already run a zero-downtime rolling update and know what maxSurge, maxUnavailable, readiness probes, and a preStop hook do. If those terms feel new, start there first.

You will need:

  • kubectl connected to a Kubernetes 1.29+ cluster; the Gateway API section additionally needs the Gateway API v1.0 CRDs installed, which are distributed separately from Kubernetes itself (the rolling-update and blue-green sections work on 1.26+)
  • An application packaged as a Deployment with at least 2 replicas and a working readiness probe
  • Cluster permissions to create Service, Ingress, and (optionally) HTTPRoute and Rollout resources

When to use rolling, blue-green, or canary

Each strategy trades cost, blast radius, and rollback speed differently. The honest summary:

Criterion by criterion (rolling update / blue-green / canary):

  • Native Kubernetes support — rolling: built in (Deployment.spec.strategy); blue-green: pattern only (label switch); canary: pattern only (replica ratio or ingress weight)
  • Resource overhead during rollout — rolling: low (maxSurge extra pods); blue-green: 2x (full second environment); canary: low (small canary pool)
  • Time to expose new version — rolling: gradual, automatic; blue-green: atomic switch; canary: gradual, controlled
  • Speed of rollback — rolling: slow (re-roll old image); blue-green: instant (flip selector back); canary: fast (set weight to 0)
  • Old and new run concurrently — rolling: yes, briefly; blue-green: no (after switch); canary: yes, by design
  • Ideal for — rolling: most stateless HTTP services; blue-green: high-stakes releases needing instant rollback; canary: releases that need real-traffic validation
  • Hard with — rolling: releases that need instant rollback; blue-green: schema changes that break backward compatibility; canary: anything with sticky sessions you cannot make backend-affine

My default position: start with a rolling update. Move to canary only when you actually need to validate against production traffic before full rollout. Move to blue-green only when instant rollback is more valuable than the doubled cost. Most teams pick canary or blue-green because it sounds professional, then never use the rollback they paid for. That is wasted complexity.

Blue-green with native kubectl

Native Kubernetes does not have a BlueGreen strategy. The pattern is two separate Deployment resources behind one Service, where you flip the Service.spec.selector to switch traffic.

Step 1: Deploy the blue version

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.4.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080

Apply it:

kubectl apply -f blue-deployment.yaml

Step 2: Create the Service that selects blue

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue          # the live version
  ports:
  - port: 80
    targetPort: 8080

Apply it:

kubectl apply -f service.yaml

The Service's selector controls which pods receive traffic. Right now, all four blue pods do.

Step 3: Deploy the green version alongside

The green Deployment is identical except for the version: green label and the new image:

# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.5.0   # the new release
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080

Apply it and wait for green to become ready:

kubectl apply -f green-deployment.yaml
kubectl rollout status deployment/myapp-green --timeout=5m

Green pods are now running but receive zero production traffic, because the Service still selects version: blue. This is the moment to run smoke tests against the green pods directly, either via a separate preview Service or by kubectl port-forward to a green pod.
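A minimal preview Service for that smoke testing could look like the following sketch (the name myapp-preview is a placeholder). It selects the green pods only, so in-cluster tests can reach the candidate while users keep hitting the myapp Service:

```yaml
# preview-service.yaml (hypothetical): reaches green pods only
apiVersion: v1
kind: Service
metadata:
  name: myapp-preview
spec:
  selector:
    app: myapp
    version: green      # never flipped; always points at the candidate
  ports:
  - port: 80
    targetPort: 8080
```

Smoke tests then target http://myapp-preview inside the cluster, and the Service can stay around permanently since it costs nothing when green is scaled down.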

Checkpoint: verify both versions are running

kubectl get pods -l app=myapp

Expected output:

NAME                          READY   STATUS    RESTARTS   AGE
myapp-blue-7d9f4c8b6c-abcde   1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-fghij   1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-klmno   1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-pqrst   1/1     Running   0          12m
myapp-green-5b8c2d1f9d-uvwxy  1/1     Running   0          2m
myapp-green-5b8c2d1f9d-zabcd  1/1     Running   0          2m
myapp-green-5b8c2d1f9d-efghi  1/1     Running   0          2m
myapp-green-5b8c2d1f9d-jklmn  1/1     Running   0          2m

Eight pods running, four blue actively serving, four green idle.

Step 4: Cut over with one selector patch

kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'

This is the cutover. The Service controller updates the EndpointSlices, kube-proxy on every node refreshes its forwarding rules, and within a few seconds new connections route to green pods only. Connections in flight against blue pods continue until they close naturally.

Why this works: the Service selector field is mutable. Changing it is the same as deploying a new Service from the cluster's perspective, except the Service IP and DNS name are preserved. That is the entire mechanism.

Step 5: Roll back instantly if something is wrong

kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'

This is the feature. A bad release becomes a no-op rollback with one command, because the blue pods never went away.

Step 6: Decommission blue once green is stable

After a defined soak window (typically 15 to 30 minutes of healthy production traffic), scale blue down:

kubectl scale deployment myapp-blue --replicas=0

Keep the blue Deployment object around for a release or two so a rollback after the next deploy is still possible. On the next release, blue becomes the new green, and so on.

Canary with NGINX Ingress weight annotations

Canary is different. Instead of an atomic switch, you route a small percentage of traffic to the new version, watch metrics, and increase the weight as confidence grows.

Important: ingress-nginx is retired. The kubernetes/ingress-nginx repository was archived as read-only on March 24, 2026. It still works, but it will not receive bug fixes or CVE patches. The Kubernetes Steering and Security Response Committees recommend migrating to Gateway API or another maintained ingress controller. The example below is included because it remains the most-deployed canary mechanism in 2026 (Datadog put ingress-nginx at roughly 50% of cloud-native environments at the time of retirement). Use it to understand the pattern, but plan a migration.

Step 1: Deploy the stable and canary versions

Use two Deployments with different version labels, the same as the blue-green pattern:

# stable-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.4.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
spec:
  selector:
    app: myapp
    version: stable
  ports:
  - port: 80
    targetPort: 8080

# canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1                # small canary pool
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.5.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
    version: canary
  ports:
  - port: 80
    targetPort: 8080

Each version has its own Service. That is the prerequisite for traffic splitting.

Step 2: Define the stable Ingress

# stable-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-stable
            port:
              number: 80

Step 3: Define the canary Ingress with a weight annotation

The NGINX Ingress canary feature requires a second Ingress with the same host, marked as canary:

# canary-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # 5% of traffic
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80

canary-weight is the percentage of random requests routed to the canary Service. The precedence order is canary-by-header first, then canary-by-cookie, then canary-weight. Combined, you can build "always send my QA team to the canary, plus 5% of everyone else" without code changes.
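That combination can be sketched as extra annotations on the same canary Ingress (the header name X-Canary and value always are arbitrary choices here, not requirements of the controller):

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "always"
    nginx.ingress.kubernetes.io/canary-weight: "5"
```

Requests carrying X-Canary: always route to the canary regardless of the weight, because the header rule takes precedence; everyone else is subject to the 5% split.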

Step 4: Promote in steps

The point of a canary is that you raise the weight only when the metrics agree. A typical schedule:

# The canary Ingress above already starts at 5%. After 10 minutes of
# healthy error rates and latency, raise the weight to 25%:
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=25 --overwrite

# After another 10 minutes
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=50 --overwrite

# Then 100, at which point all traffic is on the new version
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=100 --overwrite

Once the canary is at 100%, update the stable Deployment's image to the new version, wait for its rollout to finish, then set canary-weight back to 0 and scale the canary Deployment back down to one replica. All traffic is now on the updated stable version, and you are ready for the next release.

What this canary cannot do

A failed canary does not roll itself back. If error rates spike at 25% weight, you have to notice and revert manually. There is no metric loop, no analysis, no automation. That is the gap Argo Rollouts fills.

Automating both strategies with Argo Rollouts

Argo Rollouts (current stable v1.9.0, released March 20, 2026) replaces the Deployment resource with a Rollout CRD. It manages the same ReplicaSets and Services, but adds step-based progressive delivery, integrated traffic routing, and analysis-driven automatic rollback.

Why use it: the canary above requires you to watch Grafana while running kubectl annotate by hand. Argo Rollouts watches Prometheus for you, advances the weight automatically when SLOs hold, and rolls back the moment they break.

Blue-green with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.5.0
        ports:
        - containerPort: 8080
  strategy:
    blueGreen:
      activeService: myapp-active        # production Service
      previewService: myapp-preview      # internal Service for testing green
      autoPromotionEnabled: false        # require manual `kubectl argo rollouts promote`
      scaleDownDelaySeconds: 600         # keep blue alive for 10 min after cutover

The activeService is the Service users hit. The previewService points to the green pods until promotion. With autoPromotionEnabled: false, the cutover happens only when you run kubectl argo rollouts promote myapp. The scaleDownDelaySeconds (default 30) keeps blue around long enough for traffic to drain and rollback to remain instant.
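Both Services are plain Kubernetes Services that you create yourself, selecting only the app label; a sketch, with names matching the Rollout above. Argo Rollouts injects a rollouts-pod-template-hash selector into each at runtime so that active tracks the stable ReplicaSet and preview tracks the new one:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-active
spec:
  selector:
    app: myapp            # controller adds rollouts-pod-template-hash
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-preview
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
```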

Canary with Argo Rollouts and analysis

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        nginx:
          stableIngress: myapp           # reuses the Ingress pattern above
      steps:
      - setWeight: 5
      - pause: { duration: 5m }
      - analysis:
          templates:
          - templateName: success-rate   # rollback if the AnalysisTemplate fails
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }
      - setWeight: 100
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: registry.internal/myapp:v1.5.0

Paired with an AnalysisTemplate that queries Prometheus for the request success rate:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
    value: myapp-canary
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.99
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(
            nginx_ingress_controller_requests{
              service="{{args.service-name}}",
              status!~"5.."
            }[2m]
          )) /
          sum(rate(
            nginx_ingress_controller_requests{
              service="{{args.service-name}}"
            }[2m]
          ))

If the success rate drops below 99% for three consecutive samples, the Rollout aborts, traffic returns to the stable version, and the Rollout enters Degraded status. This is the only configuration in this article where a failed canary rolls back automatically. Native Kubernetes and the bare NGINX Ingress canary above both require human intervention.

Gateway API: the future of native traffic splitting

The longer-term answer is Gateway API, whose v1.0 release reached GA on October 31, 2023. It installs as a set of CRDs, so it is not tied to any particular Kubernetes version. Its HTTPRoute resource supports weighted backends out of the box, no annotations and no extra controller required (other than a Gateway implementation):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
spec:
  parentRefs:
  - name: my-gateway
  hostnames:
  - app.example.internal
  rules:
  - backendRefs:
    - name: myapp-stable
      port: 80
      weight: 95
    - name: myapp-canary
      port: 80
      weight: 5

The backendRefs weights are proportional, not percentages. With weights 95 and 5, the canary receives 5% of traffic; with 9 and 1, it receives 10%. Default weight is 1.
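Gateway API can also express the "always send my QA team to the canary" trick from the NGINX section as a separate match rule instead of an annotation. A sketch, where the X-Canary header name is an arbitrary choice:

```yaml
rules:
- matches:                 # header-matched rule: tagged requests go straight to canary
  - headers:
    - name: X-Canary
      value: always
  backendRefs:
  - name: myapp-canary
    port: 80
- backendRefs:             # everyone else gets the weighted split
  - name: myapp-stable
    port: 80
    weight: 95
  - name: myapp-canary
    port: 80
    weight: 5
```

Per the HTTPRoute specification, a rule with a header match takes precedence over one without, so tagged requests bypass the weighted split entirely.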

In 2026 this is where I would build new clusters. It removes the dependency on a controller-specific annotation vocabulary, and Argo Rollouts already supports trafficRouting via the Gateway API plugin so progressive delivery still works.

Traffic shift considerations: sessions, caches, database migrations

This is the part that breaks more cutovers than any controller-level mistake.

Sessions. Blue-green assumes a user can be routed to either environment without breaking. If your application stores session state in pod memory, a cutover logs everyone out the moment the selector flips. Move sessions to Redis, Memcached, or a database table before you adopt blue-green. The same applies to canary: a small fraction of users will ping-pong between stable and canary on every request, so any in-memory session state will misbehave for them.

Caches. Both strategies expose cold caches. After a blue-green cutover, every cache-warming request hits the new pods at once. With a rolling update this happens gradually; with a hard switch it can spike database load enough to look like an outage. Either pre-warm the green environment with a small percentage of traffic (using Argo Rollouts' analysis pause), or ship with cache-aside patterns that degrade gracefully on a cold miss.

Database migrations. This is the real ceiling. Blue-green only works when both versions can read and write the same database safely. Schema changes that drop or rename a column break that contract. The discipline is the expand/contract migration pattern: make all schema changes additive first (expand), deploy the new code, then in a later release remove the old column (contract). If you cannot live without dropping the column for one release, blue-green is the wrong tool. Use a maintenance window or a rolling update with a flag that gates the schema use.

Long-lived connections. WebSockets, gRPC streams, and SSE outlive a typical terminationGracePeriodSeconds. Both strategies need explicit handling here, often a server-side close on rollout combined with a client that auto-reconnects.
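A common server-side mitigation is to pair a longer grace period with a short preStop sleep, so endpoint removal propagates before the process receives SIGTERM and the application then has time to close streams cleanly. A pod-spec sketch; the 120s and 15s values are illustrative, not recommendations:

```yaml
spec:
  terminationGracePeriodSeconds: 120   # room for streams to drain after SIGTERM
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # hold the pod briefly so EndpointSlice removal reaches every node
          command: ["sh", "-c", "sleep 15"]
```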

Cost implications of running two full environments

Blue-green costs roughly 2x compute during the soak window. For 20 replicas at 4 GB each, that is an extra 80 GB of memory the cluster has to host. On managed Kubernetes, that translates directly to node-pool size and bill.

A few honest mitigations:

  • Use Argo Rollouts' previewReplicaCount to run green at lower replica counts during preview, then scale up just before cutover.
  • Rely on cluster autoscaler to evict the spare capacity quickly after blue is decommissioned. The faster you scale blue down post-soak, the less the cost shows up in your monthly bill.
  • For non-critical services, use canary instead. A single canary pod alongside 6 stable pods costs ~17% extra, not 100%.
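The first mitigation is a two-line change to the blue-green strategy shown earlier; a sketch:

```yaml
strategy:
  blueGreen:
    activeService: myapp-active
    previewService: myapp-preview
    previewReplicaCount: 1          # green runs small during preview
    autoPromotionEnabled: false
```

When you promote, the controller scales the new ReplicaSet up to the full spec.replicas before flipping the active Service, so the cutover still lands on a fully sized environment.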

For most internal services the cost of blue-green is not worth the instant rollback. Reserve it for revenue-critical paths and high-blast-radius releases.

What blue-green and canary are NOT

Three misconceptions show up reliably in code reviews:

"Blue-green means two permanent clusters." No. The pattern works at any scope: two Deployments behind one Service inside the same namespace is the most common form. Multi-cluster blue-green exists, but it is an extreme variant for disaster recovery, not the baseline. The cluster boundary is irrelevant; what matters is two parallel sets of pods sharing one Service identity.

"Canary deployments require a service mesh." They do not. A service mesh (Istio, Linkerd) gives you precise per-request routing and rich observability, which makes canaries safer. But replica-ratio canaries (1 canary pod alongside 9 stable pods) work on any cluster, and ingress-weight canaries work on any cluster with a supported ingress controller. The mesh is a nice-to-have, not a prerequisite.

"A failed canary automatically rolls back." Only if a controller is watching for you. The bare NGINX Ingress canary in Step 3 of the canary section above does not. Native Kubernetes does not. Argo Rollouts with an AnalysisTemplate, or Flagger, or a service mesh integration plus an SLO controller, does. If your runbook says "we use canary deployments so failures roll back automatically" and you are not running one of those, your runbook is wrong.

What you learned

The strategy you pick is a function of the rollback speed you need versus the cost you can pay. Rolling updates are the default. Blue-green buys instant rollback for double the compute. Canary buys traffic-validated promotion for the cost of one extra pod plus the operational discipline of watching metrics during each step.

Native Kubernetes can do all three, but for canary specifically the native primitives stop short of automatic rollback. That is the line where Argo Rollouts earns its complexity. And Gateway API is where this whole stack is heading once ingress-nginx is fully behind us.

