Why kube-proxy fails for gRPC and WebSocket
kube-proxy makes a load-balancing decision once per TCP connection. In iptables mode (still the default) and in the newer nftables mode (beta since Kubernetes 1.31), it installs DNAT rules that fire when a new TCP SYN arrives. The kernel's conntrack table records the destination pod, and every subsequent packet on that connection follows the same conntrack entry. kube-proxy is never consulted again.
For HTTP/1.1 this works fine. Clients open many short-lived connections, each triggering a fresh DNAT decision. Traffic spreads across pods naturally.
gRPC is built on HTTP/2. HTTP/2 establishes one long-lived TCP connection per client-server pair and multiplexes all RPCs as streams over that single connection. The DNAT decision happens once, at connection time. Every subsequent RPC rides the same connection to the same pod. Scale the backend to ten replicas, and nine sit idle.
WebSocket has the same root problem. A WebSocket upgrade creates a persistent TCP connection that stays open for minutes, hours, or days. All messages flow to the one pod that accepted the initial handshake.
IPVS mode does not fix this. Neither does nftables mode. The issue is structural: L4 load balancing distributes connections, not requests.
L4 versus L7 load balancing
The distinction matters because the behavior described above is not a kube-proxy bug: balancing at the wrong layer is the root cause.
| Characteristic | L4 (transport) | L7 (application) |
|---|---|---|
| OSI layer | TCP/UDP | HTTP, gRPC, WebSocket |
| Decision unit | Per TCP connection | Per HTTP request or gRPC stream |
| Protocol awareness | None; forwards raw packets | Parses headers, methods, paths |
| TLS handling | Passthrough or terminate | Must terminate to inspect |
| Cost | Low (kernel-space NAT) | Higher (user-space parsing) |
| gRPC behavior | One pod per channel | Per-RPC distribution across pods |
An L7 load balancer terminates the client's TCP connection, opens separate HTTP/2 connections to each backend pod, and distributes individual gRPC streams (or HTTP requests) across those backend connections. The client sees one connection; the proxy fans out internally. That is why L7 solves the problem and L4 cannot.
Headless Service with client-side load balancing
The lightest solution. No proxy, no mesh, no extra infrastructure. Requires control over the gRPC client code.
A standard ClusterIP Service resolves to one virtual IP. kube-proxy handles distribution behind it. A headless Service (spec.clusterIP: None) has no VIP. DNS returns multiple A records, one per ready pod. The client receives all pod IPs and distributes its own connections.
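A minimal headless Service sketch, with the selector and port assumed to match the client example below:
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service-headless
  namespace: prod
spec:
  clusterIP: None            # headless: no VIP, DNS returns one A record per ready pod
  selector:
    app: my-grpc-service     # assumed pod label
  ports:
  - name: grpc
    port: 50051
    targetPort: 50051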
gRPC clients support a round_robin load-balancing policy that creates a subchannel to each resolved IP and cycles RPCs across them:
// Go gRPC client using DNS-based round-robin (grpc-go 1.62+)
conn, err := grpc.Dial(
	"dns:///my-grpc-service-headless.prod.svc.cluster.local:50051",
	grpc.WithTransportCredentials(insecure.NewCredentials()), // plaintext in-cluster; swap in TLS credentials for production
	grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
if err != nil {
	log.Fatalf("dial failed: %v", err)
}
defer conn.Close()
Datadog runs this pattern at tens of thousands of pods processing trillions of data points daily, without a service mesh.
The MaxConnectionAge trick
Headless + round-robin has a catch: gRPC clients re-resolve DNS only when a connection closes. New pods added during a scale-out event receive no traffic until existing connections cycle. Jamf Engineering solved this with three lines of server-side configuration:
// Force the server to close connections after 30 seconds (grpc-go keepalive)
grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionAge:      30 * time.Second, // hard limit on connection lifetime
	MaxConnectionAgeGrace: 10 * time.Second, // grace for in-flight RPCs
})
Combined with minReadySeconds: 30 on the Deployment (so old DNS entries expire before traffic shifts), this forces periodic DNS re-resolution and even distribution across all pods, including new ones.
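A sketch of where minReadySeconds sits on the Deployment (replica count, labels, and image are assumptions):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-service
spec:
  replicas: 10
  minReadySeconds: 30          # hold the rollout so stale DNS answers expire before traffic shifts
  selector:
    matchLabels:
      app: my-grpc-service
  template:
    metadata:
      labels:
        app: my-grpc-service
    spec:
      containers:
      - name: server
        image: registry.example.com/my-grpc-service:1.0.0   # assumed image
        ports:
        - containerPort: 50051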
Pitfalls to watch for
- `pick_first` default. gRPC clients default to `pick_first`, which connects to the first DNS result and ignores the rest. Always override to `round_robin` when using headless Services.
- Java DNS caching. Java caches DNS for 30 seconds by default (or indefinitely with a security manager). Go has no DNS cache. Verify your language SDK's behavior.
- IP recycling. Kubernetes can reassign a deleted pod's IP to a different Service within seconds. Datadog documented cases where gRPC clients silently connected to the wrong backend after a rollout. Mitigate with TLS server identity verification and `MaxConnectionAge`.
- Conntrack exhaustion. Aggressive reconnection parameters (Datadog had 300 ms intervals across 900 clients) can generate enough SYN traffic to saturate VPC connection tracking tables, dropping legitimate connections.
L7 proxy at the edge
For gRPC services exposed outside the cluster, an L7 Ingress controller or Gateway API implementation handles per-RPC distribution without touching application code.
NGINX Ingress Controller supports gRPC natively. It bypasses the Kubernetes Service VIP entirely, subscribing to the Endpoints API to manage its own upstream pool of pod IPs:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  rules:
  - host: grpc.yoursite.nl
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-grpc-service
            port:
              number: 50051
GRPCRoute (Gateway API) is the forward-looking approach. It graduated to GA in Gateway API v1.1.0 and provides gRPC-native routing: match on service name and method, retry on specific gRPC status codes, and observe gRPC-specific metrics. Implementations that support it include Envoy Gateway, Istio, NGINX Gateway Fabric, Cilium, and AWS Load Balancer Controller.
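As a sketch (Gateway name, hostname, and the gRPC service/method are assumptions), a GRPCRoute looks like this:
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: my-grpc-route
spec:
  parentRefs:
  - name: my-gateway                      # assumed Gateway
  hostnames:
  - grpc.yoursite.nl
  rules:
  - matches:
    - method:
        service: orders.v1.OrderService   # assumed gRPC service
        method: GetOrder                  # assumed method
    backendRefs:
    - name: my-grpc-service
      port: 50051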
These solutions work well for edge/ingress traffic. For service-to-service (east-west) gRPC traffic inside the cluster, a proxy at the edge does not help. You need either client-side balancing or a service mesh.
Service mesh as the production answer
A service mesh injects a proxy that intercepts all pod traffic and operates at L7. No code changes required. The mesh proxy opens separate HTTP/2 connections to each backend pod and distributes individual gRPC streams, exactly the L7 behavior the problem demands.
Linkerd injects an ultralight Rust-based proxy per pod. It automatically performs request-level load balancing for HTTP/2 and gRPC using EWMA (Exponentially Weighted Moving Average) of response latencies, shifting traffic away from slow pods in real time. Overhead: <1ms p99 latency, <10 MiB RSS per pod. A 2024 benchmark measured +33% latency under mTLS, the lowest of the three major meshes.
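Opting workloads in is typically one annotation; for example, at the namespace level (namespace name assumed):
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  annotations:
    linkerd.io/inject: enabled   # new pods in this namespace get the Linkerd proxy injected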
Istio injects an Envoy sidecar per pod. Configure gRPC balancing via a DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-grpc-service
spec:
  host: my-grpc-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST            # recommended for gRPC
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100  # forces connection cycling
The sidecar model adds +166% latency under mTLS. Istio Ambient mode (per-node ztunnel + optional L7 waypoint proxy, no sidecar) cuts that to +8%.
Cilium offers a per-node Envoy proxy (shared across all pods on the node) enabled with a single annotation: service.cilium.io/lb-l7: enabled. gRPC L7 load balancing is documented as beta. For teams already running Cilium as their CNI (GKE Dataplane V2, AKS with Azure CNI Powered by Cilium), this avoids adding a second mesh.
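A sketch of the per-Service opt-in (selector and port assumed):
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
  annotations:
    service.cilium.io/lb-l7: enabled   # route this Service through the node-local Envoy at L7
spec:
  selector:
    app: my-grpc-service
  ports:
  - name: grpc
    port: 50051
    targetPort: 50051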
Choosing a mesh for gRPC
| Mesh | gRPC LB status | mTLS latency overhead | Best for |
|---|---|---|---|
| Linkerd | Stable | +33% | Lowest overhead, simplest operations |
| Istio Ambient | Stable | +8% | Advanced traffic policies, lower overhead than sidecar |
| Cilium L7 | Beta | +99% | eBPF-native shops already on Cilium CNI |
Proxyless gRPC with xDS
For teams that want mesh-like endpoint discovery without sidecar overhead, gRPC natively implements an xDS client (supported in Go, Java, C++, Python since gRPC 1.30). With the xds:/// URI scheme, the gRPC client connects to an xDS control plane, receives real-time pod IP updates via Endpoint Discovery Service (EDS), and performs client-side load balancing directly.
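A sketch of the client side in Go (the target name is an assumption; the xDS bootstrap file is pointed to by the GRPC_XDS_BOOTSTRAP environment variable):
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and load-balancing policies
)

func main() {
	// Endpoint discovery and per-RPC balancing happen inside the client,
	// driven by EDS updates from the control plane named in the bootstrap file.
	conn, err := grpc.Dial(
		"xds:///my-grpc-service.prod:50051", // assumed xDS target name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}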
Databricks built this: a custom EDS server watching Kubernetes EndpointSlice objects, feeding pod IPs and locality-aware weights to gRPC clients. The xDS agent uses <25 MiB memory and <0.1% CPU, compared to 50-100 MiB+ for an Envoy sidecar.
The tradeoff is real: you must instrument every gRPC client with xDS bootstrap configuration, and non-gRPC services still need a traditional proxy. This is a good fit for large-scale gRPC-heavy platforms where sidecar overhead is a material cost.
WebSocket: a different problem
WebSocket and gRPC share the same root cause (persistent connection pins to one pod) but differ in what a good solution looks like.
gRPC multiplexes independent RPCs over one connection. The fix is distributing RPCs across pods. WebSocket carries a single stateful stream: chat sessions, subscribed topics, game state. Reconnecting to a different pod breaks the application unless that pod has access to the same state.
For WebSocket, the goal is not "distribute messages across pods." It is "place new connections evenly and handle reconnects gracefully."
Ingress timeout and affinity configuration
Default NGINX Ingress timeouts (60 seconds) silently drop WebSocket connections. Extend them:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ws-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "WSROUTE"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
Cookie-based session affinity routes reconnects back to the same pod. leastconn balancing (available in HAProxy Ingress) routes new connections to the pod with the fewest active connections, evening out the load as pods scale.
Scaling WebSocket services
Sticky sessions work at moderate scale. At production scale (tens of thousands of concurrent connections), pod failure drops all its clients simultaneously. The resilient pattern is a shared-state backplane: store session state in Redis, NATS, or Kafka. Any pod can serve any client. Pub/sub fans messages to the right clients regardless of which pod they are on.
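A minimal sketch of the backplane idea, assuming Redis pub/sub (github.com/redis/go-redis/v9) and gorilla/websocket; the channel name and connection registration are illustrative:
package main

import (
	"context"
	"log"
	"sync"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

// hub tracks only the WebSocket connections held by this pod.
type hub struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]bool
}

// fanOut subscribes every pod to the same Redis channel and forwards each
// published message to whichever clients happen to be connected locally.
func (h *hub) fanOut(ctx context.Context, rdb *redis.Client) {
	sub := rdb.Subscribe(ctx, "chat.events") // assumed channel name
	for msg := range sub.Channel() {
		h.mu.Lock()
		for c := range h.conns {
			if err := c.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
				log.Printf("dropping client: %v", err)
				c.Close()
				delete(h.conns, c)
			}
		}
		h.mu.Unlock()
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"}) // assumed backplane address
	h := &hub{conns: map[*websocket.Conn]bool{}}
	go h.fanOut(context.Background(), rdb)
	// An HTTP handler that upgrades requests to WebSocket and registers each
	// connection in h.conns would go here.
	select {}
}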
Standard HPA based on CPU does not reflect WebSocket load well (10,000 idle connections use minimal CPU). Use KEDA with a custom metric: active connection count per pod.
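A sketch of a KEDA ScaledObject driven by a Prometheus gauge of active connections (Deployment name, metric name, and Prometheus address are assumptions):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ws-server-scaler
spec:
  scaleTargetRef:
    name: ws-server                       # assumed Deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
      query: sum(websocket_active_connections)               # assumed app-exported gauge
      threshold: "5000"                   # target roughly 5000 connections per replica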
Scale-down needs care. Terminating a pod drops every connection on it. Use a preStop hook to send WebSocket close frames (1001 Going Away) so clients know to reconnect, and set terminationGracePeriodSeconds long enough for clients to drain. I have seen 60-120 seconds work well for most WebSocket services.
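A sketch of the pod-spec side; the drain endpoint is an assumption, and the application itself must send the 1001 close frames when it is hit:
spec:
  terminationGracePeriodSeconds: 120       # long enough for clients to receive close frames and reconnect
  containers:
  - name: ws-server
    image: registry.example.com/ws-server:1.0.0   # assumed image
    lifecycle:
      preStop:
        exec:
          # Assumed drain hook: tells the app to send 1001 Going Away to every
          # open WebSocket, then waits so in-flight reconnects can complete.
          command: ["sh", "-c", "curl -fsS -X POST http://localhost:8080/drain && sleep 60"]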
What this article does not cover
This article focuses on the load-balancing problem and its solutions. It does not cover gRPC health checking (see the health probes article for gRPC liveness and readiness probes), mutual TLS configuration within a service mesh, or graceful shutdown patterns for long-lived connections (see the graceful shutdown article).
When to choose which approach
| Situation | Recommended approach |
|---|---|
| Control gRPC client code, want minimal infra | Headless Service + round_robin + MaxConnectionAge |
| Edge/ingress gRPC routing | NGINX Ingress backend-protocol: GRPC or GRPCRoute |
| Zero code changes, lightest overhead | Linkerd |
| Advanced traffic policies, already on Istio | Istio Ambient mode |
| Already using Cilium CNI | Cilium L7 proxy (beta) |
| gRPC-native, no sidecar budget | Proxyless xDS |
| WebSocket with stateful sessions | Ingress cookie affinity + shared-state backplane |
| WebSocket autoscaling | KEDA with connection-count metric |