Why kube-proxy fails for gRPC and WebSocket
kube-proxy makes a load-balancing decision once per TCP connection. In iptables mode (still the default) and in the newer nftables mode (beta since Kubernetes 1.31), it installs DNAT rules that fire when a new TCP SYN arrives. The kernel's conntrack table records the destination pod, and every subsequent packet on that connection follows the same conntrack entry. kube-proxy is never consulted again.
For HTTP/1.1 this works fine. Clients open many short-lived connections, each triggering a fresh DNAT decision. Traffic spreads across pods naturally.
gRPC is built on HTTP/2. HTTP/2 establishes one long-lived TCP connection per client-server pair and multiplexes all RPCs as streams over that single connection. The DNAT decision happens once, at connection time. Every subsequent RPC rides the same connection to the same pod. Scale the backend to ten replicas, and nine sit idle.
WebSocket has the same root problem. A WebSocket upgrade creates a persistent TCP connection that stays open for minutes, hours, or days. All messages flow to the one pod that accepted the initial handshake.
IPVS mode does not fix this. Neither does nftables mode. The issue is structural: L4 load balancing distributes connections, not requests.
L4 versus L7 load balancing
The distinction matters because the behavior described above is not a kube-proxy bug: balancing at the wrong layer is the root cause.
| Characteristic | L4 (transport) | L7 (application) |
|---|---|---|
| OSI layer | TCP/UDP | HTTP, gRPC, WebSocket |
| Decision unit | Per TCP connection | Per HTTP request or gRPC stream |
| Protocol awareness | None; forwards raw packets | Parses headers, methods, paths |
| TLS handling | Passthrough or terminate | Must terminate to inspect |
| Cost | Low (kernel-space NAT) | Higher (user-space parsing) |
| gRPC behavior | One pod per channel | Per-RPC distribution across pods |
An L7 load balancer terminates the client's TCP connection, opens separate HTTP/2 connections to each backend pod, and distributes individual gRPC streams (or HTTP requests) across those backend connections. The client sees one connection; the proxy fans out internally. That is why L7 solves the problem and L4 cannot.
Headless Service with client-side load balancing
The lightest solution. No proxy, no mesh, no extra infrastructure. Requires control over the gRPC client code.
A standard ClusterIP Service resolves to one virtual IP. kube-proxy handles distribution behind it. A headless Service (spec.clusterIP: None) has no VIP. DNS returns multiple A records, one per ready pod. The client receives all pod IPs and distributes its own connections.
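A minimal headless Service sketch, with the selector and port assumed to match the client example below:
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service-headless
  namespace: prod
spec:
  clusterIP: None            # headless: no VIP, DNS returns one A record per ready pod
  selector:
    app: my-grpc-service     # assumed pod label
  ports:
  - name: grpc
    port: 50051
    targetPort: 50051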
gRPC clients support a round_robin load-balancing policy that creates a subchannel to each resolved IP and cycles RPCs across them:
// Go gRPC client using DNS-based round-robin (grpc-go 1.62+)
conn, err := grpc.Dial(
	"dns:///my-grpc-service-headless.prod.svc.cluster.local:50051",
	grpc.WithTransportCredentials(insecure.NewCredentials()), // plaintext in-cluster; swap in TLS credentials for production
	grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
if err != nil {
	log.Fatalf("dial failed: %v", err)
}
defer conn.Close()
Datadog runs this pattern at tens of thousands of pods processing trillions of data points daily, without a service mesh.
The MaxConnectionAge trick
Headless + round-robin has a catch: gRPC clients re-resolve DNS only when a connection closes. New pods added during a scale-out event receive no traffic until existing connections cycle. Jamf Engineering solved this with three lines of server-side configuration:
// Force the server to close connections after 30 seconds (grpc-go keepalive)
grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionAge:      30 * time.Second, // hard limit on connection lifetime
	MaxConnectionAgeGrace: 10 * time.Second, // grace for in-flight RPCs
})
Combined with minReadySeconds: 30 on the Deployment (so old DNS entries expire before traffic shifts), this forces periodic DNS re-resolution and even distribution across all pods, including new ones.
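A sketch of where minReadySeconds sits on the Deployment (replica count, labels, and image are assumptions):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-service
spec:
  replicas: 10
  minReadySeconds: 30          # hold the rollout so stale DNS answers expire before traffic shifts
  selector:
    matchLabels:
      app: my-grpc-service
  template:
    metadata:
      labels:
        app: my-grpc-service
    spec:
      containers:
      - name: server
        image: registry.example.com/my-grpc-service:1.0.0   # assumed image
        ports:
        - containerPort: 50051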
Pitfalls to watch for
- `pick_first` default. gRPC clients default to `pick_first`, which connects to the first DNS result and ignores the rest. Always override to `round_robin` when using headless Services.
- Java DNS caching. Java caches DNS for 30 seconds by default (or indefinitely with a security manager). Go has no DNS cache. Verify your language SDK's behavior.
- IP recycling. Kubernetes can reassign a deleted pod's IP to a different Service within seconds. Datadog documented cases where gRPC clients silently connected to the wrong backend after a rollout. Mitigate with TLS server identity verification and `MaxConnectionAge`.
- Conntrack exhaustion. Aggressive reconnection parameters (Datadog had 300 ms intervals across 900 clients) can generate enough SYN traffic to saturate VPC connection tracking tables, dropping legitimate connections.
L7 proxy at the edge
For gRPC services exposed outside the cluster, an L7 Ingress controller or Gateway API implementation handles per-RPC distribution without touching application code.
NGINX Ingress Controller supports gRPC natively. It bypasses the Kubernetes Service VIP entirely, subscribing to the Endpoints API to manage its own upstream pool of pod IPs:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  rules:
  - host: grpc.yoursite.nl
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-grpc-service
            port:
              number: 50051
GRPCRoute (Gateway API) is the forward-looking approach. It graduated to GA in Gateway API v1.1.0 and provides gRPC-native routing: match on service name and method, retry on specific gRPC status codes, and observe gRPC-specific metrics. Implementations that support it include Envoy Gateway, Istio, NGINX Gateway Fabric, Cilium, and AWS Load Balancer Controller.
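As a sketch (Gateway name, hostname, and the gRPC service/method are assumptions), a GRPCRoute looks like this:
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: my-grpc-route
spec:
  parentRefs:
  - name: my-gateway                      # assumed Gateway
  hostnames:
  - grpc.yoursite.nl
  rules:
  - matches:
    - method:
        service: orders.v1.OrderService   # assumed gRPC service
        method: GetOrder                  # assumed method
    backendRefs:
    - name: my-grpc-service
      port: 50051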
These solutions work well for edge/ingress traffic. For service-to-service (east-west) gRPC traffic inside the cluster, a proxy at the edge does not help. You need either client-side balancing or a service mesh.
Service mesh as the production answer
A service mesh injects a proxy that intercepts all pod traffic and operates at L7. No code changes required. The mesh proxy opens separate HTTP/2 connections to each backend pod and distributes individual gRPC streams, exactly the L7 behavior the problem demands.
Linkerd injects an ultralight Rust-based proxy per pod. It automatically performs request-level load balancing for HTTP/2 and gRPC using EWMA (Exponentially Weighted Moving Average) of response latencies, shifting traffic away from slow pods in real time. Overhead: <1ms p99 latency, <10 MiB RSS per pod. A 2024 benchmark measured +33% latency under mTLS, the lowest of the three major meshes.
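Opting workloads in is typically one annotation; for example, at the namespace level (namespace name assumed):
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  annotations:
    linkerd.io/inject: enabled   # new pods in this namespace get the Linkerd proxy injected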
Istio injects an Envoy sidecar per pod. Configure gRPC balancing via a DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-grpc-service
spec:
  host: my-grpc-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST            # recommended for gRPC
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100  # forces connection cycling
The sidecar model adds +166% latency under mTLS. Istio Ambient mode (per-node ztunnel + optional L7 waypoint proxy, no sidecar) cuts that to +8%.
Cilium offers a per-node Envoy proxy (shared across all pods on the node) enabled with a single annotation: service.cilium.io/lb-l7: enabled. gRPC L7 load balancing is documented as beta. For teams already running Cilium as their CNI (GKE Dataplane V2, AKS with Azure CNI Powered by Cilium), this avoids adding a second mesh.
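A sketch of the per-Service opt-in (selector and port assumed):
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
  annotations:
    service.cilium.io/lb-l7: enabled   # route this Service through the node-local Envoy at L7
spec:
  selector:
    app: my-grpc-service
  ports:
  - name: grpc
    port: 50051
    targetPort: 50051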
Choosing a mesh for gRPC
| Mesh | gRPC LB status | mTLS latency overhead | Best for |
|---|---|---|---|
| Linkerd | Stable | +33% | Lowest overhead, simplest operations |
| Istio Ambient | Stable | +8% | Advanced traffic policies, lower overhead than sidecar |
| Cilium L7 | Beta | +99% | eBPF-native shops already on Cilium CNI |
Proxyless gRPC with xDS
For teams that want mesh-like endpoint discovery without sidecar overhead, gRPC natively implements an xDS client (supported in Go, Java, C++, Python since gRPC 1.30). With the xds:/// URI scheme, the gRPC client connects to an xDS control plane, receives real-time pod IP updates via Endpoint Discovery Service (EDS), and performs client-side load balancing directly.
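A sketch of the client side in Go (the target name is an assumption; the xDS bootstrap file is pointed to by the GRPC_XDS_BOOTSTRAP environment variable):
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and load-balancing policies
)

func main() {
	// Endpoint discovery and per-RPC balancing happen inside the client,
	// driven by EDS updates from the control plane named in the bootstrap file.
	conn, err := grpc.Dial(
		"xds:///my-grpc-service.prod:50051", // assumed xDS target name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}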
Databricks built this: a custom EDS server watching Kubernetes EndpointSlice objects, feeding pod IPs and locality-aware weights to gRPC clients. The xDS agent uses <25 MiB memory and <0.1% CPU, compared to 50-100 MiB+ for an Envoy sidecar.
The tradeoff is real: you must instrument every gRPC client with xDS bootstrap configuration, and non-gRPC services still need a traditional proxy. This is a good fit for large-scale gRPC-heavy platforms where sidecar overhead is a material cost.
WebSocket: a different problem
WebSocket and gRPC share the same root cause (persistent connection pins to one pod) but differ in what a good solution looks like.
gRPC multiplexes independent RPCs over one connection. The fix is distributing RPCs across pods. WebSocket carries a single stateful stream: chat sessions, subscribed topics, game state. Reconnecting to a different pod breaks the application unless that pod has access to the same state.
For WebSocket, the goal is not "distribute messages across pods." It is "place new connections evenly and handle reconnects gracefully."
Ingress timeout and affinity configuration
Default NGINX Ingress timeouts (60 seconds) silently drop WebSocket connections. Extend them:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ws-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "WSROUTE"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
Cookie-based session affinity routes reconnects back to the same pod. leastconn balancing (available in HAProxy Ingress) routes new connections to the pod with the fewest active connections, evening out the load as pods scale.
Scaling WebSocket services
Sticky sessions work at moderate scale. At production scale (tens of thousands of concurrent connections), pod failure drops all its clients simultaneously. The resilient pattern is a shared-state backplane: store session state in Redis, NATS, or Kafka. Any pod can serve any client. Pub/sub fans messages to the right clients regardless of which pod they are on.
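A minimal sketch of the backplane idea, assuming Redis pub/sub (github.com/redis/go-redis/v9) and gorilla/websocket; the channel name and connection registration are illustrative:
package main

import (
	"context"
	"log"
	"sync"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

// hub tracks only the WebSocket connections held by this pod.
type hub struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]bool
}

// fanOut subscribes every pod to the same Redis channel and forwards each
// published message to whichever clients happen to be connected locally.
func (h *hub) fanOut(ctx context.Context, rdb *redis.Client) {
	sub := rdb.Subscribe(ctx, "chat.events") // assumed channel name
	for msg := range sub.Channel() {
		h.mu.Lock()
		for c := range h.conns {
			if err := c.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
				log.Printf("dropping client: %v", err)
				c.Close()
				delete(h.conns, c)
			}
		}
		h.mu.Unlock()
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"}) // assumed backplane address
	h := &hub{conns: map[*websocket.Conn]bool{}}
	go h.fanOut(context.Background(), rdb)
	// An HTTP handler that upgrades requests to WebSocket and registers each
	// connection in h.conns would go here.
	select {}
}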
Standard HPA based on CPU does not reflect WebSocket load well (10,000 idle connections use minimal CPU). Use KEDA with a custom metric: active connection count per pod.
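A sketch of a KEDA ScaledObject driven by a Prometheus gauge of active connections (Deployment name, metric name, and Prometheus address are assumptions):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ws-server-scaler
spec:
  scaleTargetRef:
    name: ws-server                       # assumed Deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
      query: sum(websocket_active_connections)               # assumed app-exported gauge
      threshold: "5000"                   # target roughly 5000 connections per replica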
Scale-down needs care. Terminating a pod drops every connection on it. Use a preStop hook to send WebSocket close frames (1001 Going Away) so clients know to reconnect, and set terminationGracePeriodSeconds long enough for clients to drain. I have seen 60-120 seconds work well for most WebSocket services.
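A sketch of the pod-spec side; the drain endpoint is an assumption, and the application itself must send the 1001 close frames when it is hit:
spec:
  terminationGracePeriodSeconds: 120       # long enough for clients to receive close frames and reconnect
  containers:
  - name: ws-server
    image: registry.example.com/ws-server:1.0.0   # assumed image
    lifecycle:
      preStop:
        exec:
          # Assumed drain hook: tells the app to send 1001 Going Away to every
          # open WebSocket, then waits so in-flight reconnects can complete.
          command: ["sh", "-c", "curl -fsS -X POST http://localhost:8080/drain && sleep 60"]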
What this article does not cover
This article focuses on the load-balancing problem and its solutions. It does not cover gRPC health checking (see the health probes article for gRPC liveness and readiness probes), mutual TLS configuration within a service mesh, or graceful shutdown patterns for long-lived connections (see the graceful shutdown article).
When to choose which approach
| Situation | Recommended approach |
|---|---|
| Control gRPC client code, want minimal infra | Headless Service + round_robin + MaxConnectionAge |
| Edge/ingress gRPC routing | NGINX Ingress backend-protocol: GRPC or GRPCRoute |
| Zero code changes, lightest overhead | Linkerd |
| Advanced traffic policies, already on Istio | Istio Ambient mode |
| Already using Cilium CNI | Cilium L7 proxy (beta) |
| gRPC-native, no sidecar budget | Proxyless xDS |
| WebSocket with stateful sessions | Ingress cookie affinity + shared-state backplane |
| WebSocket autoscaling | KEDA with connection-count metric |