What you will learn
By the end of this tutorial you will be able to pick between rolling updates, blue-green, and canary based on a clear risk profile, then implement either advanced strategy two ways: first with native kubectl (no extra controllers), then with Argo Rollouts v1.9.0 once you have outgrown what native primitives can do. You will also see where Gateway API fits, and which considerations (sessions, caches, database migrations, cost) actually break a cutover in practice.
Assumed starting point
This tutorial assumes you can already run a zero-downtime rolling update and know what maxSurge, maxUnavailable, readiness probes, and a preStop hook do. If those terms feel new, start there first.
You will need:
- `kubectl` connected to a Kubernetes 1.29+ cluster (1.29 supports Gateway API v1.0 as a stable extension; the rolling-update sections work on 1.26+)
- An application packaged as a `Deployment` with at least 2 replicas and a working readiness probe
- Cluster permissions to create `Service`, `Ingress`, and (optionally) `HTTPRoute` and `Rollout` resources
When to use rolling, blue-green, or canary
Each strategy trades cost, blast radius, and rollback speed differently. The honest summary:
| Criterion | Rolling update | Blue-green | Canary |
|---|---|---|---|
| Native Kubernetes support | Built-in (Deployment.spec.strategy) | Pattern only (label switch) | Pattern only (replica ratio or ingress weight) |
| Resource overhead during rollout | Low (maxSurge extra pods) | 2x (full second environment) | Low (small canary pool) |
| Time to expose new version | Gradual, automatic | Atomic switch | Gradual, controlled |
| Speed of rollback | Slow (re-roll old image) | Instant (flip selector back) | Fast (set weight to 0) |
| Old and new run concurrently | Yes, briefly | No (after switch) | Yes, by design |
| Ideal for | Most stateless HTTP services | High-stakes releases needing instant rollback | Releases that need real-traffic validation |
| Hard with | Releases that need instant rollback | Schema changes that break backward compatibility | Anything with sticky sessions you cannot make backend-affine |
My default position: start with a rolling update. Move to canary only when you actually need to validate against production traffic before full rollout. Move to blue-green only when instant rollback is more valuable than the doubled cost. Most teams pick canary or blue-green because it sounds professional, then never use the rollback they paid for. That is wasted complexity.
Blue-green with native kubectl
Native Kubernetes does not have a BlueGreen strategy. The pattern is two separate Deployment resources behind one Service, where you flip the Service.spec.selector to switch traffic.
Step 1: Deploy the blue version
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.4.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
Apply it:
kubectl apply -f blue-deployment.yaml
Step 2: Create the Service that selects blue
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # the live version
  ports:
    - port: 80
      targetPort: 8080
kubectl apply -f service.yaml
The Service's selector controls which pods receive traffic. Right now, all four blue pods do.
Step 3: Deploy the green version alongside
The green Deployment is identical except for the version: green label and the new image:
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.5.0   # the new release
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
kubectl apply -f green-deployment.yaml
kubectl rollout status deployment/myapp-green --timeout=5m
Green pods are now running but receive zero production traffic, because the Service still selects version: blue. This is the moment to run smoke tests against the green pods directly, either via a separate preview Service or by kubectl port-forward to a green pod.
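If you do not want to stand up a preview Service yet, a quick port-forward is enough for a first smoke test; the probe path below is the one from the manifest, and the rest of the checks are whatever your release needs:

# Forward a local port to one of the green pods and probe it directly
kubectl port-forward deployment/myapp-green 8080:8080 &
curl -fsS http://localhost:8080/healthz/ready
# ...run any further smoke checks here, then stop the port-forward
kill %1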
Checkpoint: verify both versions are running
kubectl get pods -l app=myapp
Expected output:
NAME                           READY   STATUS    RESTARTS   AGE
myapp-blue-7d9f4c8b6c-abcde    1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-fghij    1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-klmno    1/1     Running   0          12m
myapp-blue-7d9f4c8b6c-pqrst    1/1     Running   0          12m
myapp-green-5b8c2d1f9d-uvwxy   1/1     Running   0          2m
myapp-green-5b8c2d1f9d-zabcd   1/1     Running   0          2m
myapp-green-5b8c2d1f9d-efghi   1/1     Running   0          2m
myapp-green-5b8c2d1f9d-jklmn   1/1     Running   0          2m
Eight pods running, four blue actively serving, four green idle.
Step 4: Cut over with one selector patch
kubectl patch service myapp \
-p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
This is the cutover. The Service controller updates the EndpointSlices, kube-proxy on every node refreshes its forwarding rules, and within a few seconds new connections route to green pods only. Connections in flight against blue pods continue until they close naturally.
Why this works: the Service selector field is mutable. Changing it is the same as deploying a new Service from the cluster's perspective, except the Service IP and DNS name are preserved. That is the entire mechanism.
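You can verify the flip from the cluster's side by listing which endpoints now back the Service; the EndpointSlice label below is the standard one Kubernetes sets on slices it manages:

# After the patch, the Service's EndpointSlices should list only green pod IPs
kubectl get endpointslices -l kubernetes.io/service-name=myapp -o wide
kubectl get pods -l app=myapp,version=green -o wide   # compare against the green pod IPs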
Step 5: Roll back instantly if something is wrong
kubectl patch service myapp \
-p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'
This is the feature. A bad release becomes a no-op rollback with one command, because the blue pods never went away.
Step 6: Decommission blue once green is stable
After a defined soak window (typically 15 to 30 minutes of healthy production traffic), scale blue down:
kubectl scale deployment myapp-blue --replicas=0
Keep the blue Deployment object around for a release or two so a rollback after the next deploy is still possible. On the next release, blue becomes the new green, and so on.
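On that next release the same commands run in the other direction; a minimal sketch, assuming v1.6.0 is the follow-up version:

# Reuse the idle blue Deployment as the next candidate
kubectl set image deployment/myapp-blue app=registry.internal/myapp:v1.6.0
kubectl scale deployment myapp-blue --replicas=4
kubectl rollout status deployment/myapp-blue --timeout=5m
# Smoke test, then flip the selector back to blue
kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'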
Canary with NGINX Ingress weight annotations
Canary is different. Instead of an atomic switch, you route a small percentage of traffic to the new version, watch metrics, and increase the weight as confidence grows.
Important: ingress-nginx is retired. The `kubernetes/ingress-nginx` repository was archived as read-only on March 24, 2026. It still works, but it will not receive bug fixes or CVE patches. The Kubernetes Steering and Security Response Committees recommend migrating to Gateway API or another maintained ingress controller. The example below is included because it remains the most-deployed canary mechanism in 2026 (Datadog put ingress-nginx at roughly 50% of cloud-native environments at the time of retirement). Use it to understand the pattern, but plan a migration.
Step 1: Deploy the stable and canary versions
Use two Deployments with different version labels, the same as the blue-green pattern:
# stable-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.4.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
spec:
  selector:
    app: myapp
    version: stable
  ports:
    - port: 80
      targetPort: 8080
# canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1   # small canary pool
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.5.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
    version: canary
  ports:
    - port: 80
      targetPort: 8080
Each version has its own Service. That is the prerequisite for traffic splitting.
Step 2: Define the stable Ingress
# stable-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-stable
                port:
                  number: 80
Step 3: Define the canary Ingress with a weight annotation
The NGINX Ingress canary feature requires a second Ingress with the same host, marked as canary:
# canary-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # 5% of traffic
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
canary-weight is the percentage of random requests routed to the canary Service. The precedence order is canary-by-header first, then canary-by-cookie, then canary-weight. Combined, you can build "always send my QA team to the canary, plus 5% of everyone else" without code changes.
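As a sketch of that combination, the header rules sit on the same canary Ingress; the `X-Canary` header name and the `qa-team` value are arbitrary choices for illustration:

metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"       # header routing is evaluated before weight
    nginx.ingress.kubernetes.io/canary-by-header-value: "qa-team"  # requests with X-Canary: qa-team always hit the canary
    nginx.ingress.kubernetes.io/canary-weight: "5"                 # everyone else: 5% random split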
Step 4: Promote in steps
The point of a canary is that you raise the weight only when the metrics agree. A typical schedule:
# The canary Ingress from Step 3 starts at 5%. Watch error rate and latency for ~10 minutes, then raise to 25%
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=25 --overwrite
# After another 10 minutes of healthy metrics
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=50 --overwrite
# Then 100, at which point all traffic is on the new version
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=100 --overwrite
Once the canary is at 100%, all production traffic is hitting the one-pod canary pool, so keep that final step short (or scale the canary Deployment up first). Then swap the stable Deployment's image to the new version, reset canary-weight to 0, and the canary goes back to being a one-replica placeholder for the next release.
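A sketch of that hand-off, assuming the names and image tags used above:

# Move stable onto the released image and wait for it to roll
kubectl set image deployment/myapp-stable app=registry.internal/myapp:v1.5.0
kubectl rollout status deployment/myapp-stable --timeout=5m
# Send all traffic back through the stable Service
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight=0 --overwrite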
What this canary cannot do
A failed canary does not roll itself back. If error rates spike at 25% weight, you have to notice and revert manually. There is no metric loop, no analysis, no automation. That is the gap Argo Rollouts fills.
Automating both strategies with Argo Rollouts
Argo Rollouts (current stable v1.9.0, released March 20, 2026) replaces the Deployment resource with a Rollout CRD. It manages the same ReplicaSets and Services, but adds step-based progressive delivery, integrated traffic routing, and analysis-driven automatic rollback.
Why use it: the canary above requires you to watch Grafana while running kubectl annotate by hand. Argo Rollouts watches Prometheus for you, advances the weight automatically when SLOs hold, and rolls back the moment they break.
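If the controller is not in the cluster yet, installation follows the project's standard manifest; a sketch using the upstream release URL (pin the exact version you validated rather than latest):

# Install the Argo Rollouts controller in its own namespace
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl -n argo-rollouts rollout status deployment/argo-rollouts
# The promote/abort/get commands below come from the kubectl plugin, which is a separate binary
kubectl argo rollouts version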
Blue-green with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.5.0
          ports:
            - containerPort: 8080
  strategy:
    blueGreen:
      activeService: myapp-active     # production Service
      previewService: myapp-preview   # internal Service for testing green
      autoPromotionEnabled: false     # require manual `kubectl argo rollouts promote`
      scaleDownDelaySeconds: 600      # keep blue alive for 10 min after cutover
The activeService is the Service users hit. The previewService points to the green pods until promotion. With autoPromotionEnabled: false, the cutover happens only when you run kubectl argo rollouts promote myapp. The scaleDownDelaySeconds (default 30) keeps blue around long enough for traffic to drain and rollback to remain instant.
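The Rollout references those two Services but does not create them. A minimal sketch of what they look like, assuming the names above; the selectors stay coarse because the Rollouts controller injects a `rollouts-pod-template-hash` selector to pin each Service to the right ReplicaSet:

apiVersion: v1
kind: Service
metadata:
  name: myapp-active
spec:
  selector:
    app: myapp   # the controller narrows this with rollouts-pod-template-hash
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-preview
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080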
Canary with Argo Rollouts and analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        nginx:
          stableIngress: myapp   # reuses the Ingress pattern above
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate   # rollback if the AnalysisTemplate fails
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: registry.internal/myapp:v1.5.0
Paired with an AnalysisTemplate that queries Prometheus for the request success rate:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
      value: myapp-canary
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(
              nginx_ingress_controller_requests{
                service="{{args.service-name}}",
                status!~"5.."
              }[2m]
            )) /
            sum(rate(
              nginx_ingress_controller_requests{
                service="{{args.service-name}}"
              }[2m]
            ))
If the success rate drops below 99% for three consecutive samples, the Rollout aborts, traffic returns to the stable version, and the Rollout enters Degraded status. This is the only configuration in this article where a failed canary rolls back automatically. Native Kubernetes and the bare NGINX Ingress canary above both require human intervention.
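Day to day, you watch and drive this loop with the kubectl plugin; these are standard plugin commands:

# Watch steps, weights, and analysis runs as the rollout progresses
kubectl argo rollouts get rollout myapp --watch
# Manually abort (traffic returns to stable) or retry after shipping a fix
kubectl argo rollouts abort myapp
kubectl argo rollouts retry rollout myapp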
Gateway API: the future of native traffic splitting
The longer-term answer is Gateway API, which reached GA on October 31, 2023 with v1.0 alongside Kubernetes 1.29. Its HTTPRoute resource supports weighted backends out of the box, no annotations and no extra controller required (other than a Gateway implementation):
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - app.example.internal
  rules:
    - backendRefs:
        - name: myapp-stable
          port: 80
          weight: 95
        - name: myapp-canary
          port: 80
          weight: 5
The backendRefs weights are proportional, not percentages. With weights 95 and 5, the canary receives 5% of traffic; with 9 and 1, it receives 10%. Default weight is 1.
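Promotion then amounts to rewriting the weights in place; a minimal sketch using a JSON patch against the route above (the array indexes assume stable is listed first, as in the manifest):

# Shift the split from 95/5 to 75/25
kubectl patch httproute myapp --type json -p '[
  {"op": "replace", "path": "/spec/rules/0/backendRefs/0/weight", "value": 75},
  {"op": "replace", "path": "/spec/rules/0/backendRefs/1/weight", "value": 25}
]'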
In 2026 this is where I would build new clusters. It removes the dependency on a controller-specific annotation vocabulary, and Argo Rollouts already supports trafficRouting via the Gateway API plugin so progressive delivery still works.
Traffic shift considerations: sessions, caches, database migrations
This is the part that breaks more cutovers than any controller-level mistake.
Sessions. Blue-green assumes a user can be routed to either environment without breaking. If your application stores session state in pod memory, a cutover logs everyone out the moment the selector flips. Move sessions to Redis, Memcached, or a database table before you adopt blue-green. The same applies to canary: a small fraction of users will ping-pong between stable and canary on every request, so any in-memory session state will misbehave for them.
Caches. Both strategies expose cold caches. After a blue-green cutover, every cache-warming request hits the new pods at once. With a rolling update this happens gradually; with a hard switch it can spike database load enough to look like an outage. Either pre-warm the green environment with a small percentage of traffic (using Argo Rollouts' analysis pause), or ship with cache-aside patterns that degrade gracefully on a cold miss.
Database migrations. This is the real ceiling. Blue-green only works when both versions can read and write the same database safely. Schema changes that drop or rename a column break that contract. The discipline is the expand/contract migration pattern: make all schema changes additive first (expand), deploy the new code, then in a later release remove the old column (contract). If you cannot live without dropping the column for one release, blue-green is the wrong tool. Use a maintenance window or a rolling update with a flag that gates the schema use.
Long-lived connections. WebSockets, gRPC streams, and SSE outlive a typical terminationGracePeriodSeconds. Both strategies need explicit handling here, often a server-side close on rollout combined with a client that auto-reconnects.
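The usual pod-spec knobs for that are a longer grace period plus a preStop delay, so endpoint removal can propagate before the server starts closing streams; the numbers below are assumptions to tune against your own drain behavior:

spec:
  terminationGracePeriodSeconds: 120   # long enough for streams to drain or clients to reconnect
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]   # delay SIGTERM so endpoint removal propagates first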
Cost implications of running two full environments
Blue-green costs roughly 2x compute during the soak window. For 20 replicas at 4 GB each, that is an extra 80 GB of memory the cluster has to host. On managed Kubernetes, that translates directly to node-pool size and bill.
A few honest mitigations:
- Use Argo Rollouts' `previewReplicaCount` to run green at lower replica counts during preview, then scale up just before cutover (see the sketch after this list).
- Rely on the cluster autoscaler to evict the spare capacity quickly after blue is decommissioned. The faster you scale blue down post-soak, the less the cost shows up in your monthly bill.
- For non-critical services, use canary instead. A single canary pod alongside 6 stable pods costs ~17% extra, not 100%.
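The `previewReplicaCount` sketch referenced in the first bullet, applied to the blue-green Rollout from earlier (replica counts here are illustrative):

strategy:
  blueGreen:
    activeService: myapp-active
    previewService: myapp-preview
    autoPromotionEnabled: false
    previewReplicaCount: 1       # run green at one replica until promotion, when it scales to the full count
    scaleDownDelaySeconds: 600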
For most internal services the cost of blue-green is not worth the instant rollback. Reserve it for revenue-critical paths and high-blast-radius releases.
What blue-green and canary are NOT
Three misconceptions show up reliably in code reviews:
"Blue-green means two permanent clusters." No. The pattern works at any scope: two Deployments behind one Service inside the same namespace is the most common form. Multi-cluster blue-green exists, but it is an extreme variant for disaster recovery, not the baseline. The cluster boundary is irrelevant; what matters is two parallel sets of pods sharing one Service identity.
"Canary deployments require a service mesh." They do not. A service mesh (Istio, Linkerd) gives you precise per-request routing and rich observability, which makes canaries safer. But replica-ratio canaries (1 canary pod alongside 9 stable pods) work on any cluster, and ingress-weight canaries work on any cluster with a supported ingress controller. The mesh is a nice-to-have, not a prerequisite.
"A failed canary automatically rolls back." Only if a controller is watching for you. The bare NGINX Ingress canary in Step 3 of the canary section above does not. Native Kubernetes does not. Argo Rollouts with an AnalysisTemplate, or Flagger, or a service mesh integration plus an SLO controller, does. If your runbook says "we use canary deployments so failures roll back automatically" and you are not running one of those, your runbook is wrong.
What you learned
The strategy you pick is a function of the rollback speed you need versus the cost you can pay. Rolling updates are the default. Blue-green buys instant rollback for double the compute. Canary buys traffic-validated promotion for the cost of one extra pod plus the operational discipline of watching metrics during each step.
Native Kubernetes can do all three, but for canary specifically the native primitives stop short of automatic rollback. That is the line where Argo Rollouts earns its complexity. And Gateway API is where this whole stack is heading once ingress-nginx is fully behind us.
Where to go next
- The foundations behind any rollout: Kubernetes rolling updates and zero-downtime deployments.
- The CI/CD side of progressive delivery: Kubernetes CI/CD with GitHub Actions.
- The GitOps controller that pairs naturally with Argo Rollouts: GitOps with Argo CD.