What you will have at the end
A set of HPA configurations that scale a Deployment on CPU utilization, memory, and custom Prometheus metrics. You will understand the scaling algorithm, know how to tune the stabilization window for each direction, and have a working Prometheus Adapter rule that bridges application metrics into the HPA control loop.
Prerequisites
- `kubectl` connected to a Kubernetes 1.27+ cluster (1.27 promoted `HPAContainerMetrics` to beta, enabling per-container scaling by default)
- metrics-server installed and returning data (`kubectl top pods` shows values, not `error: Metrics API not available`)
- A Deployment with resource requests defined on all containers, including injected sidecars. HPA calculates utilization as `current usage / request`, so without requests it reports `<unknown>`
- For custom metrics sections: Prometheus running in-cluster and the Prometheus Adapter Helm chart installed
The scaling algorithm
HPA runs as a control loop inside kube-controller-manager, syncing every 15 seconds by default. Each cycle it computes:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
If the current average CPU is 200m and the target is 100m, the ratio is 2.0, and HPA doubles the replica count. If usage drops to 50m, the ratio is 0.5, and it halves the count.
A built-in tolerance band of +/-10% prevents flapping. When the ratio falls between 0.9 and 1.1, HPA takes no action. Kubernetes 1.33 introduced an alpha feature gate (HPAConfigurableTolerance) that lets you override this tolerance per-HPA; on older clusters the tolerance can only be changed cluster-wide via the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance, not per-HPA.
When multiple metrics are defined, HPA evaluates each independently and picks the maximum desired replica count across all of them. That ensures every constraint is satisfied simultaneously.
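The per-metric computation and the max-across-metrics rule can be sketched in a few lines of Python. This is an illustration of the arithmetic, not the controller's actual code (the real loop also handles pod readiness, missing metrics, and min/max clamping); the 10% tolerance is the default band described above:

```python
import math

TOLERANCE = 0.10  # default +/-10% band

def desired_replicas(current_replicas, current_value, target_value):
    """One HPA cycle for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas  # inside the tolerance band: no action
    return math.ceil(current_replicas * ratio)

def desired_across_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs; take the max."""
    return max(desired_replicas(current_replicas, c, t) for c, t in metrics)

print(desired_replicas(3, 200, 100))   # ratio 2.0 -> 6 replicas
print(desired_replicas(4, 95, 100))    # ratio 0.95 -> within tolerance, stays 4
print(desired_across_metrics(4, [(50, 100), (150, 100)]))  # max(2, 6) = 6
```

Note how the multi-metric case never scales below what any single metric demands: the maximum wins.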
CPU-based scaling
CPU is a compressible resource: when a container hits its limit, the kernel throttles it instead of killing it. That makes CPU utilization the most reliable primary scaling signal for stateless services.
Create the HPA
# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2 # never fewer than 2 for production
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # scale out when avg CPU exceeds 70% of requests
Apply and verify:
kubectl apply -f hpa-cpu.yaml
kubectl get hpa web-app-hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# web-app-hpa Deployment/web-app 45%/70% 2 20 3 5m
The TARGETS column shows current%/target%. If it reads <unknown>/70%, the most likely cause is a missing resource request on one of the containers in the target Deployment.
Quick alternative with kubectl
For non-production testing, you can create an HPA imperatively:
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=20
In production, always declare HPA in version-controlled YAML. Imperative objects have no audit trail.
Memory-based scaling (and why it is tricky)
Memory is non-compressible. When a container exceeds its memory limit, the kernel OOM-kills it immediately: in-flight requests are lost, caches are destroyed, and the replacement pod starts cold. Adding pods only helps if the memory pressure comes from concurrent users, not from a leak.
HPA polls every 15 seconds. A sudden memory spike that fills a container in 2 seconds causes an OOM kill before HPA even notices. For that reason, memory-based HPA works as a supplementary signal, not a primary one.
Conservative targets
| Workload type | Recommended target |
|---|---|
| Stateless web apps | 70-75% |
| Safety-critical services | 65-70% |
| Batch processing | 75-80% |
The lower targets leave headroom for bursts before the OOM killer intervenes.
Memory HPA manifest
# hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: memory-scaler
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 15
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75 # conservative; leaves ~25% headroom
behavior:
scaleUp:
stabilizationWindowSeconds: 120
policies:
- type: Percent
value: 50
periodSeconds: 120
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 600 # 10 minutes; very cautious
policies:
- type: Percent
value: 10
periodSeconds: 180
selectPolicy: Min
The longer stabilizationWindowSeconds on the scale-down side prevents premature pod removal when memory usage dips temporarily between request batches.
What memory HPA cannot solve
A memory leak grows until OOM regardless of replica count. If you see pods consistently climbing toward their limit and restarting, the fix is in the application code or in the garbage collector configuration, not in HPA. For right-sizing memory requests based on observed usage patterns, look at VPA (updateMode: "Off" for recommendations only).
Custom metrics via Prometheus Adapter
CPU and memory cover many workloads, but business-level signals like request rate, queue depth, or p95 latency often correlate better with user-facing load. The Prometheus Adapter bridges Prometheus metrics into the custom.metrics.k8s.io and external.metrics.k8s.io APIs that HPA can query.
Architecture
Application --> /metrics endpoint --> Prometheus (scrapes) --> Prometheus Adapter
--> custom.metrics.k8s.io API --> HPA
Configure an adapter rule
In the Helm values.yaml for prometheus-community/prometheus-adapter, define a rule that converts a Prometheus counter into a per-pod rate:
rules:
default: false # disable built-in CPU/memory rules; metrics-server handles those
custom:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total"
as: "${1}_per_second" # exposes as http_requests_per_second
metricsQuery: |
sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
The metricsQuery computes a 2-minute rate. The <<.Series>>, <<.LabelMatchers>>, and <<.GroupBy>> placeholders are filled in by the adapter at query time.
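To make the substitution concrete, here is a rough sketch of how the template expands into a runnable PromQL query. The label values below are hypothetical, and the string replacement only mimics the adapter's Go-template rendering for illustration:

```python
QUERY_TEMPLATE = 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

def expand(template, series, label_matchers, group_by):
    # Mimics the adapter's Go-template substitution, for illustration only
    return (template
            .replace('<<.Series>>', series)
            .replace('<<.LabelMatchers>>', label_matchers)
            .replace('<<.GroupBy>>', group_by))

print(expand(QUERY_TEMPLATE,
             'http_requests_total',
             'namespace="production",pod=~"web-app-.*"',
             'pod'))
# -> sum(rate(http_requests_total{namespace="production",pod=~"web-app-.*"}[2m])) by (pod)
```

The expanded query is exactly what you should test directly in Prometheus when debugging a missing metric.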
Verify the metric is reachable
# confirm the API service is registered
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# NAME SERVICE AVAILABLE
# v1beta1.custom.metrics.k8s.io monitoring/prometheus-adapter True
# list all exposed custom metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
# query the specific metric for pods in a namespace
kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" \
| jq '.items[].value'
If the API service shows False under AVAILABLE, check the adapter pod logs: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter.
HPA targeting the custom metric
# hpa-custom.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000" # target: 1000 req/s per pod
You can combine resource metrics and custom metrics in a single HPA. HPA evaluates all of them and uses the highest desired replica count. A common pattern is CPU as a safety net and request rate as the primary signal:
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
Scaling behaviour: stabilization windows and policies
The behavior field (stable since autoscaling/v2) controls how aggressively HPA scales in each direction. The defaults are asymmetric on purpose: scale up immediately, wait 5 minutes before scaling down.
The full schema
behavior:
scaleUp:
stabilizationWindowSeconds: 0 # act on the latest recommendation immediately
selectPolicy: Max # pick the policy allowing the most change
policies:
- type: Percent
value: 100 # double the current count per period
periodSeconds: 15
- type: Pods
value: 4 # or add 4 pods per period, whichever is larger
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 300 # look back 5 min; use the highest recommendation
selectPolicy: Min # pick the policy allowing the least change
policies:
- type: Percent
value: 10 # remove max 10% per period
periodSeconds: 60
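Each policy caps how far HPA may move within its period, and selectPolicy decides which cap wins. A sketch of the arithmetic under the scale-up settings above (illustrative only; the controller's exact rounding and period accounting differ in edge cases):

```python
import math

def scale_up_limit(current, percent, pods, select="Max"):
    """Highest replica count reachable in one period under a Percent
    policy and a Pods policy, combined by selectPolicy."""
    by_percent = math.ceil(current * (1 + percent / 100))  # e.g. 100% -> double
    by_pods = current + pods                               # e.g. add 4 pods
    return max(by_percent, by_pods) if select == "Max" else min(by_percent, by_pods)

print(scale_up_limit(3, 100, 4))   # max(6, 7) = 7: the Pods policy wins at small counts
print(scale_up_limit(10, 100, 4))  # max(20, 14) = 20: the Percent policy wins at scale
```

This is why pairing a Percent policy with a Pods policy is useful: the absolute policy dominates at low replica counts, where doubling would be too slow.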
How the stabilization window works
For scale-down, the window default is 300 seconds. HPA looks at every desired-replica recommendation from the past 300 seconds and picks the highest. That means it will not remove pods if any recent recommendation wanted them. For scale-up, the default is 0 seconds: react to the latest recommendation immediately.
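The scale-down window is effectively a rolling maximum over recent recommendations. A minimal sketch, with timestamps in seconds (illustrative, not controller code):

```python
def stabilized_scale_down(history, now, window=300):
    """history: list of (timestamp, recommended_replicas).
    Returns the highest recommendation from the past `window` seconds."""
    recent = [r for t, r in history if now - t <= window]
    return max(recent)

history = [(0, 10), (60, 8), (120, 6), (240, 5)]
print(stabilized_scale_down(history, now=250))  # 10: the spike at t=0 is still in the window
print(stabilized_scale_down(history, now=400))  # 6: recommendations older than 5 min age out
```

A brief load spike thus keeps its replicas until it falls out of the window, which is exactly the anti-churn behaviour you want on the way down.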
Why asymmetric?
Scaling up too slowly causes latency spikes and timeouts. Scaling down too fast causes cache churn, dropped connections on pods that just came online, and unnecessary pod restarts.
selectPolicy: Max on scaleUp picks the policy that allows the most aggressive response. selectPolicy: Min on scaleDown picks the most conservative. That is the right default for production HTTP services.
Disabling scale-down
During a known traffic event (Black Friday, product launch), you can prevent HPA from removing pods entirely:
behavior:
scaleDown:
selectPolicy: Disabled
Remove this after the event.
HPA + VPA interaction rules
The Vertical Pod Autoscaler (VPA) adjusts resource requests. HPA adjusts replica count based on the ratio of usage to requests. When both target the same resource, they create a feedback loop: VPA raises the CPU request, the utilization ratio drops, HPA scales down, per-pod load rises, VPA raises the request again.
Safe pattern: split by resource type
Let HPA own CPU (replica scaling) and VPA own memory (request sizing). Neither interferes with the other's signal:
# VPA: memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
resourcePolicy:
containerPolicies:
- containerName: app
controlledResources: ["memory"]
# HPA: CPU only
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Alternative: HPA with AverageValue instead of Utilization
If you use type: AverageValue with a raw millicores target (averageValue: "200m"), VPA's changes to the request do not affect HPA's scaling decision because the target is an absolute value, not a ratio. This works but is harder to tune: you need to know the right absolute CPU value per pod, which changes with application updates.
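The difference can be made concrete with a small sketch. All numbers below are hypothetical, and the functions compress HPA's formula to the part that matters here:

```python
import math

def desired_by_utilization(replicas, usage_m, request_m, target_pct):
    # Utilization target: the resource request is part of the ratio
    return math.ceil(replicas * (usage_m / request_m) * 100 / target_pct)

def desired_by_average_value(replicas, usage_m, target_m):
    # AverageValue target: the request never enters the computation
    return math.ceil(replicas * usage_m / target_m)

# Per-pod usage is 140m; targets are 70% utilization or 140m absolute.
print(desired_by_utilization(4, 140, 200, 70))  # request 200m -> ratio 1.0 -> stays at 4
print(desired_by_utilization(4, 140, 400, 70))  # VPA doubles the request -> HPA halves to 2
print(desired_by_average_value(4, 140, 140))    # 4, regardless of what VPA does to requests
```

The middle line is the feedback loop described above; the last line shows why an absolute target breaks it, at the cost of retuning the value as the application changes.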
VPA in recommendation-only mode
Run VPA with updateMode: "Off" so it collects data and produces recommendations without applying them. Use those recommendations to set requests manually, then let HPA control replicas without interference. This is the safest approach for teams who are not ready to give VPA automatic control.
Verify the final result
After applying your HPA, confirm it is active and receiving metrics:
kubectl describe hpa web-app-hpa
Look for three conditions:
| Condition | Healthy value | Meaning |
|---|---|---|
| AbleToScale | True | HPA can read and modify the scale subresource |
| ScalingActive | True | Metrics are flowing; scaling is operational |
| ScalingLimited | False | Not capped at minReplicas or maxReplicas |
If ScalingActive is False, HPA cannot fetch metrics. Check that metrics-server is running (kubectl get deployment metrics-server -n kube-system) and that the target Deployment has resource requests on every container.
Common troubleshooting
<unknown> in TARGETS column. The most frequent cause is a missing resources.requests block on one or more containers, including auto-injected sidecars from Istio or Linkerd. Second most common: metrics-server is not installed, or it fails TLS verification against kubelets. On kubeadm clusters with self-signed certs, adding --kubelet-insecure-tls to the metrics-server Deployment args resolves this.
HPA not scaling down to minReplicas. This is usually correct behaviour. The default stabilization window is 300 seconds: HPA will not scale down until 5 full minutes of consistently lower recommendations. Run kubectl describe hpa and read the Events section to see what HPA is recommending.
Custom metric shows FailedGetPodsMetric. The Prometheus Adapter either has no rule matching the series name, or the PromQL in metricsQuery returns no data. Test the query directly in Prometheus first: sum(rate(http_requests_total{namespace="production"}[2m])) by (pod).
New pods stuck in Pending during scale-up. The cluster does not have enough node capacity. Pair HPA with the Cluster Autoscaler or Karpenter to provision new nodes when pod demand exceeds node capacity.
For queue-driven or scale-to-zero workloads, HPA's minimum replica count is 1. KEDA extends HPA with 50+ native scalers and supports scaling to zero, which eliminates idle-pod costs for event-driven consumers.
Complete multi-metric HPA manifest
For reference, here is a production-ready HPA combining CPU, memory, and a custom metric with tuned scaling behaviour:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 1
periodSeconds: 60
selectPolicy: Min