What you will have at the end
A set of HPA configurations that scale a Deployment on CPU utilization, memory, and custom Prometheus metrics. You will understand the scaling algorithm, know how to tune the stabilization window for each direction, and have a working Prometheus Adapter rule that bridges application metrics into the HPA control loop.
Prerequisites
- `kubectl` connected to a Kubernetes 1.27+ cluster (1.27 promoted `HPAContainerMetrics` to beta, enabling per-container scaling by default)
- metrics-server installed and returning data (`kubectl top pods` shows values, not `error: Metrics API not available`)
- A Deployment with resource requests defined on all containers, including injected sidecars. HPA calculates utilization as `current usage / request`, so without requests it reports `<unknown>`
- For custom metrics sections: Prometheus running in-cluster and the Prometheus Adapter Helm chart installed
The scaling algorithm
HPA runs as a control loop inside kube-controller-manager, syncing every 15 seconds by default. Each cycle it computes:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
If the current average CPU is 200m and the target is 100m, the ratio is 2.0, and HPA doubles the replica count. If usage drops to 50m, the ratio is 0.5, and it halves the count.
A built-in tolerance band of +/-10% prevents flapping. When the ratio falls between 0.9 and 1.1, HPA takes no action. Kubernetes 1.33 introduced an alpha feature gate (HPAConfigurableTolerance) that lets you override this tolerance per-HPA; on older clusters the tolerance can only be changed cluster-wide via the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance, not per-HPA.
When multiple metrics are defined, HPA evaluates each independently and picks the maximum desired replica count across all of them. That ensures every constraint is satisfied simultaneously.
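The per-metric computation and the max-across-metrics rule can be sketched in a few lines of Python. This is an illustration of the arithmetic, not the controller's actual code (the real loop also handles pod readiness, missing metrics, and min/max clamping); the 10% tolerance is the default band described above:

```python
import math

TOLERANCE = 0.10  # default +/-10% band

def desired_replicas(current_replicas, current_value, target_value):
    """One HPA cycle for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas  # inside the tolerance band: no action
    return math.ceil(current_replicas * ratio)

def desired_across_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs; take the max."""
    return max(desired_replicas(current_replicas, c, t) for c, t in metrics)

print(desired_replicas(3, 200, 100))   # ratio 2.0 -> 6 replicas
print(desired_replicas(4, 95, 100))    # ratio 0.95 -> within tolerance, stays 4
print(desired_across_metrics(4, [(50, 100), (150, 100)]))  # max(2, 6) = 6
```

Note how the multi-metric case never scales below what any single metric demands: the maximum wins.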
CPU-based scaling
CPU is a compressible resource: when a container hits its limit, the kernel throttles it instead of killing it. That makes CPU utilization the most reliable primary scaling signal for stateless services.
Create the HPA
# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2 # never fewer than 2 for production
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # scale out when avg CPU exceeds 70% of requests
Apply and verify:
kubectl apply -f hpa-cpu.yaml
kubectl get hpa web-app-hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# web-app-hpa Deployment/web-app 45%/70% 2 20 3 5m
The TARGETS column shows current%/target%. If it reads <unknown>/70%, the most likely cause is a missing resource request on one of the containers in the target Deployment.
Quick alternative with kubectl
For non-production testing, you can create an HPA imperatively:
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=20
In production, always declare HPA in version-controlled YAML. Imperative objects have no audit trail.
Memory-based scaling (and why it is tricky)
Memory is non-compressible. When a container exceeds its memory limit, the kernel OOM-kills it immediately: in-flight requests are lost, caches are destroyed, and the replacement pod starts cold. Adding pods only helps if the memory pressure comes from concurrent users, not from a leak.
HPA polls every 15 seconds. A sudden memory spike that fills a container in 2 seconds causes an OOM kill before HPA even notices. For that reason, memory-based HPA works as a supplementary signal, not a primary one.
Conservative targets
| Workload type | Recommended target |
|---|---|
| Stateless web apps | 70-75% |
| Safety-critical services | 65-70% |
| Batch processing | 75-80% |
The lower targets leave headroom for bursts before the OOM killer intervenes.
Memory HPA manifest
# hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: memory-scaler
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 15
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75 # conservative; leaves ~25% headroom
behavior:
scaleUp:
stabilizationWindowSeconds: 120
policies:
- type: Percent
value: 50
periodSeconds: 120
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 600 # 10 minutes; very cautious
policies:
- type: Percent
value: 10
periodSeconds: 180
selectPolicy: Min
The longer stabilizationWindowSeconds on the scale-down side prevents premature pod removal when memory usage dips temporarily between request batches.
What memory HPA cannot solve
A memory leak grows until OOM regardless of replica count. If you see pods consistently climbing toward their limit and restarting, the fix is in the application code or in the garbage collector configuration, not in HPA. For right-sizing memory requests based on observed usage patterns, look at VPA (updateMode: "Off" for recommendations only).
Custom metrics via Prometheus Adapter
CPU and memory cover many workloads, but business-level signals like request rate, queue depth, or p95 latency often correlate better with user-facing load. The Prometheus Adapter bridges Prometheus metrics into the custom.metrics.k8s.io and external.metrics.k8s.io APIs that HPA can query.
Architecture
Application --> /metrics endpoint --> Prometheus (scrapes) --> Prometheus Adapter
--> custom.metrics.k8s.io API --> HPA
Configure an adapter rule
In the Helm values.yaml for prometheus-community/prometheus-adapter, define a rule that converts a Prometheus counter into a per-pod rate:
rules:
default: false # disable built-in CPU/memory rules; metrics-server handles those
custom:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total"
as: "${1}_per_second" # exposes as http_requests_per_second
metricsQuery: |
sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
The metricsQuery computes a 2-minute rate. The <<.Series>>, <<.LabelMatchers>>, and <<.GroupBy>> placeholders are filled in by the adapter at query time.
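To make the substitution concrete, here is a rough sketch of how the template expands into a runnable PromQL query. The label values below are hypothetical, and the string replacement only mimics the adapter's Go-template rendering for illustration:

```python
QUERY_TEMPLATE = 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

def expand(template, series, label_matchers, group_by):
    # Mimics the adapter's Go-template substitution, for illustration only
    return (template
            .replace('<<.Series>>', series)
            .replace('<<.LabelMatchers>>', label_matchers)
            .replace('<<.GroupBy>>', group_by))

print(expand(QUERY_TEMPLATE,
             'http_requests_total',
             'namespace="production",pod=~"web-app-.*"',
             'pod'))
# -> sum(rate(http_requests_total{namespace="production",pod=~"web-app-.*"}[2m])) by (pod)
```

The expanded query is exactly what you should test directly in Prometheus when debugging a missing metric.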
Verify the metric is reachable
# confirm the API service is registered
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# NAME SERVICE AVAILABLE
# v1beta1.custom.metrics.k8s.io monitoring/prometheus-adapter True
# list all exposed custom metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
# query the specific metric for pods in a namespace
kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" \
| jq '.items[].value'
If the API service shows False under AVAILABLE, check the adapter pod logs: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter.
HPA targeting the custom metric
# hpa-custom.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000" # target: 1000 req/s per pod
You can combine resource metrics and custom metrics in a single HPA. HPA evaluates all of them and uses the highest desired replica count. A common pattern is CPU as a safety net and request rate as the primary signal:
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
Scaling behaviour: stabilization windows and policies
The behavior field (stable since autoscaling/v2) controls how aggressively HPA scales in each direction. The defaults are asymmetric on purpose: scale up immediately, wait 5 minutes before scaling down.
The full schema
behavior:
scaleUp:
stabilizationWindowSeconds: 0 # act on the latest recommendation immediately
selectPolicy: Max # pick the policy allowing the most change
policies:
- type: Percent
value: 100 # double the current count per period
periodSeconds: 15
- type: Pods
value: 4 # or add 4 pods per period, whichever is larger
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 300 # look back 5 min; use the highest recommendation
selectPolicy: Min # pick the policy allowing the least change
policies:
- type: Percent
value: 10 # remove max 10% per period
periodSeconds: 60
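Each policy caps how far HPA may move within its period, and selectPolicy decides which cap wins. A sketch of the arithmetic under the scale-up settings above (illustrative only; the controller's exact rounding and period accounting differ in edge cases):

```python
import math

def scale_up_limit(current, percent, pods, select="Max"):
    """Highest replica count reachable in one period under a Percent
    policy and a Pods policy, combined by selectPolicy."""
    by_percent = math.ceil(current * (1 + percent / 100))  # e.g. 100% -> double
    by_pods = current + pods                               # e.g. add 4 pods
    return max(by_percent, by_pods) if select == "Max" else min(by_percent, by_pods)

print(scale_up_limit(3, 100, 4))   # max(6, 7) = 7: the Pods policy wins at small counts
print(scale_up_limit(10, 100, 4))  # max(20, 14) = 20: the Percent policy wins at scale
```

This is why pairing a Percent policy with a Pods policy is useful: the absolute policy dominates at low replica counts, where doubling would be too slow.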
How the stabilization window works
For scale-down, the window default is 300 seconds. HPA looks at every desired-replica recommendation from the past 300 seconds and picks the highest. That means it will not remove pods if any recent recommendation wanted them. For scale-up, the default is 0 seconds: react to the latest recommendation immediately.
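The scale-down window is effectively a rolling maximum over recent recommendations. A minimal sketch, with timestamps in seconds (illustrative, not controller code):

```python
def stabilized_scale_down(history, now, window=300):
    """history: list of (timestamp, recommended_replicas).
    Returns the highest recommendation from the past `window` seconds."""
    recent = [r for t, r in history if now - t <= window]
    return max(recent)

history = [(0, 10), (60, 8), (120, 6), (240, 5)]
print(stabilized_scale_down(history, now=250))  # 10: the spike at t=0 is still in the window
print(stabilized_scale_down(history, now=400))  # 6: recommendations older than 5 min age out
```

A brief load spike thus keeps its replicas until it falls out of the window, which is exactly the anti-churn behaviour you want on the way down.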
Why asymmetric?
Scaling up too slowly causes latency spikes and timeouts. Scaling down too fast causes cache churn, dropped connections on pods that just came online, and unnecessary pod restarts.
selectPolicy: Max on scaleUp picks the policy that allows the most aggressive response. selectPolicy: Min on scaleDown picks the most conservative. That is the right default for production HTTP services.
Disabling scale-down
During a known traffic event (Black Friday, product launch), you can prevent HPA from removing pods entirely:
behavior:
scaleDown:
selectPolicy: Disabled
Remove this after the event.
HPA + VPA interaction rules
The Vertical Pod Autoscaler (VPA) adjusts resource requests. HPA adjusts replica count based on the ratio of usage to requests. When both target the same resource, they create a feedback loop: VPA raises the CPU request, the utilization ratio drops, HPA scales down, per-pod load rises, VPA raises the request again.
Safe pattern: split by resource type
Let HPA own CPU (replica scaling) and VPA own memory (request sizing). Neither interferes with the other's signal:
# VPA: memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
resourcePolicy:
containerPolicies:
- containerName: app
controlledResources: ["memory"]
# HPA: CPU only
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Alternative: HPA with AverageValue instead of Utilization
If you use type: AverageValue with a raw millicores target (averageValue: "200m"), VPA's changes to the request do not affect HPA's scaling decision because the target is an absolute value, not a ratio. This works but is harder to tune: you need to know the right absolute CPU value per pod, which changes with application updates.
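The difference can be made concrete with a small sketch. All numbers below are hypothetical, and the functions compress HPA's formula to the part that matters here:

```python
import math

def desired_by_utilization(replicas, usage_m, request_m, target_pct):
    # Utilization target: the resource request is part of the ratio
    return math.ceil(replicas * (usage_m / request_m) * 100 / target_pct)

def desired_by_average_value(replicas, usage_m, target_m):
    # AverageValue target: the request never enters the computation
    return math.ceil(replicas * usage_m / target_m)

# Per-pod usage is 140m; targets are 70% utilization or 140m absolute.
print(desired_by_utilization(4, 140, 200, 70))  # request 200m -> ratio 1.0 -> stays at 4
print(desired_by_utilization(4, 140, 400, 70))  # VPA doubles the request -> HPA halves to 2
print(desired_by_average_value(4, 140, 140))    # 4, regardless of what VPA does to requests
```

The middle line is the feedback loop described above; the last line shows why an absolute target breaks it, at the cost of retuning the value as the application changes.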
VPA in recommendation-only mode
Run VPA with updateMode: "Off" so it collects data and produces recommendations without applying them. Use those recommendations to set requests manually, then let HPA control replicas without interference. This is the safest approach for teams who are not ready to give VPA automatic control.
Verify the final result
After applying your HPA, confirm it is active and receiving metrics:
kubectl describe hpa web-app-hpa
Look for three conditions:
| Condition | Healthy value | Meaning |
|---|---|---|
| AbleToScale | True | HPA can read and modify the scale subresource |
| ScalingActive | True | Metrics are flowing; scaling is operational |
| ScalingLimited | False | Not capped at minReplicas or maxReplicas |
If ScalingActive is False, HPA cannot fetch metrics. Check that metrics-server is running (kubectl get deployment metrics-server -n kube-system) and that the target Deployment has resource requests on every container.
Common troubleshooting
<unknown> in TARGETS column. The most frequent cause is a missing resources.requests block on one or more containers, including auto-injected sidecars from Istio or Linkerd. Second most common: metrics-server is not installed, or it fails TLS verification against kubelets. On kubeadm clusters with self-signed certs, adding --kubelet-insecure-tls to the metrics-server Deployment args resolves this.
HPA not scaling down to minReplicas. This is usually correct behaviour. The default stabilization window is 300 seconds: HPA will not scale down until 5 full minutes of consistently lower recommendations. Run kubectl describe hpa and read the Events section to see what HPA is recommending.
Custom metric shows FailedGetPodsMetric. The Prometheus Adapter either has no rule matching the series name, or the PromQL in metricsQuery returns no data. Test the query directly in Prometheus first: sum(rate(http_requests_total{namespace="production"}[2m])) by (pod).
New pods stuck in Pending during scale-up. The cluster does not have enough node capacity. Pair HPA with the Cluster Autoscaler or Karpenter to provision new nodes when pod demand exceeds node capacity.
For queue-driven or scale-to-zero workloads, HPA's minimum replica count is 1. KEDA extends HPA with 50+ native scalers and supports scaling to zero, which eliminates idle-pod costs for event-driven consumers.
Complete multi-metric HPA manifest
For reference, here is a production-ready HPA combining CPU, memory, and a custom metric with tuned scaling behaviour:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 1
periodSeconds: 60
selectPolicy: Min