Kubernetes Monitoring

Reference articles for the Kubernetes observability gaps I see most often: Prometheus scraping that silently stops after a relabel change, Grafana dashboards that show green while the cluster is on fire, log pipelines that lose entries under pressure, and alerting rules that either never fire or fire so often they get ignored.

Each article covers one observability layer at a time: what to measure, what a healthy baseline looks like, and how to verify that the monitoring itself is working before you need it.


Articles

  1. Kubernetes monitoring with Prometheus and kube-prometheus-stack

    A production Kubernetes cluster without observability is a cluster you are guessing about. This tutorial walks through installing kube-prometheus-stack via Helm, understanding what each component does, scraping your own application metrics with ServiceMonitor, writing alerting rules, routing alerts, and knowing when to add remote storage.

    2235 words
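What a ServiceMonitor ultimately points Prometheus at is a `/metrics` endpoint serving the text exposition format. As a minimal, dependency-free sketch, here is one counter rendered by hand; in a real application you would use a client library such as prometheus_client rather than formatting this yourself (the metric name and labels below are illustrative):

```python
def render_counter(name, help_text, value, labels=None):
    """Render one counter in the Prometheus text exposition format,
    the format a scraped /metrics endpoint serves."""
    label_str = ""
    if labels:
        # Label pairs are rendered as {k="v",...}; order is not significant
        # to Prometheus, but sorting keeps the output deterministic.
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} counter\n"
            f"{name}{label_str} {value}\n")

print(render_counter("http_requests_total", "Total HTTP requests served.",
                     1027, {"method": "GET", "code": "200"}))
```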
  2. Kubernetes CPU throttling: why pods stall at low utilisation

    A pod shows 12% average CPU in Grafana yet gets throttled 60% of the time. The cause is not the node being overloaded. It is the Linux CFS scheduler enforcing a per-100 ms time budget that monitoring dashboards smooth into invisibility. This article explains the mechanism, shows how to measure it, and lays out the remediation options with their tradeoffs.

    1185 words
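The 12%-average-but-60%-throttled mismatch comes straight out of cgroup counters: the kernel records how many 100 ms CFS periods elapsed and how many of them hit the quota. A minimal sketch of deriving the throttle ratio from a cgroup v2 `cpu.stat` file (the sample values are illustrative):

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the container was
    throttled, parsed from the cgroup v2 cpu.stat format
    ('key value' per line)."""
    stats = {}
    for line in cpu_stat_text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats["nr_throttled"] / periods

sample = """usage_usec 120000000
user_usec 90000000
system_usec 30000000
nr_periods 1000
nr_throttled 600
throttled_usec 45000000
"""
# 600 of 1000 periods hit the quota, even though average usage looks low.
print(throttle_ratio(sample))  # 0.6
```

Averaged CPU graphs hide this because a container can burn its whole budget in the first few milliseconds of a period, stall, and still show single-digit mean utilisation.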
  3. Kubernetes cluster logging with Fluent Bit and the EFK stack

    Container logs disappear the moment a pod is deleted. kubectl logs shows only the latest 10 MiB rotation file for a single pod. For anything beyond local debugging, you need a centralised logging pipeline. This tutorial walks through deploying Fluent Bit as a DaemonSet, shipping logs to Elasticsearch 8.x via TLS, and querying them in Kibana.

    2331 words
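The files Fluent Bit tails under /var/log/containers are written in the CRI log format: a timestamp, the stream, a full/partial flag, and the log content. A small sketch of parsing one such line into the structured record a pipeline ships downstream (field names are my own):

```python
def parse_cri_line(line: str) -> dict:
    """Parse one line of the CRI container log format:
    '<rfc3339-time> <stdout|stderr> <F|P> <log>'.
    'P' marks a partial line that runtimes split at their buffer size."""
    time, stream, flags, log = line.split(" ", 3)
    return {"time": time, "stream": stream,
            "partial": flags == "P", "log": log}

rec = parse_cri_line("2024-05-01T10:00:00.000000000Z stdout F payment accepted")
print(rec["stream"], rec["log"])  # stdout payment accepted
```

Fluent Bit's tail input does this parsing (and partial-line reassembly) for you; the sketch only shows what the raw on-disk format contains.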
  4. Kubernetes Vertical Pod Autoscaler (VPA): right-sizing resource requests

    The Vertical Pod Autoscaler watches actual CPU and memory consumption per container and adjusts resource requests to match. In Off mode it gives you right-sizing recommendations without touching running pods. In enforcement modes it applies those recommendations automatically, either by restarting pods or (on Kubernetes 1.33+) by resizing them in place. This guide walks through installing VPA, reading its recommendations, bounding them with resource policies, safely progressing to auto-apply, and avoiding the conflict with HPA.

    1421 words
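The "bounding with resource policies" step is, at its core, a clamp: the recommender's target is forced between the policy's minAllowed and maxAllowed. A one-function sketch with CPU in cores (real VPA expresses these as Kubernetes quantities in `containerPolicies`):

```python
def bound_recommendation(target: float, min_allowed: float,
                         max_allowed: float) -> float:
    """Clamp a VPA-style target between the resource policy's
    minAllowed and maxAllowed bounds."""
    return max(min_allowed, min(target, max_allowed))

# Recommender wants 0.9 cores, but the policy caps this container at 0.5:
print(bound_recommendation(0.9, 0.25, 0.5))  # 0.5
```

Bounds like this are the main safety net while progressing from Off mode to auto-apply: a runaway recommendation cannot exceed what the policy allows.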
  5. Karpenter on EKS: faster node autoscaling with NodePool and EC2NodeClass

    Karpenter provisions nodes in 45–60 seconds on EKS by calling EC2 Fleet directly instead of waiting for Auto Scaling Groups. Where Cluster Autoscaler picks from predefined node groups, Karpenter evaluates all available instance types per pending pod batch and launches the tightest fit. This guide covers installing Karpenter v1.x on EKS, writing NodePool and EC2NodeClass manifests, configuring disruption and consolidation, migrating from Cluster Autoscaler with zero downtime, and monitoring everything through Prometheus.

    1984 words
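The "tightest fit" idea reduces to: filter the instance catalogue to types that can hold the pending batch, then take the cheapest. A deliberately simplified sketch — the instance names and hourly prices below are illustrative, and real Karpenter also weighs zones, capacity type, taints, and packing across many pods:

```python
# Hypothetical catalogue: (name, vcpus, memory_gib, hourly_usd).
CATALOGUE = [
    ("m5.4xlarge", 16, 64, 0.768),
    ("m5.xlarge",   4, 16, 0.192),
    ("c5.2xlarge",  8, 16, 0.340),
    ("m5.2xlarge",  8, 32, 0.384),
]

def tightest_fit(pending_cpu: float, pending_mem_gib: float):
    """Cheapest instance type whose capacity covers the whole
    pending pod batch, or None if nothing fits."""
    candidates = [inst for inst in CATALOGUE
                  if inst[1] >= pending_cpu and inst[2] >= pending_mem_gib]
    return min(candidates, key=lambda inst: inst[3], default=None)

# A batch requesting 6 vCPU / 20 GiB rules out m5.xlarge (CPU) and
# c5.2xlarge (memory), leaving m5.2xlarge as the cheapest fit.
print(tightest_fit(6, 20)[0])  # m5.2xlarge
```

This is also why Karpenter beats a fixed node group: with only one predefined instance size, the batch above would get whatever that group offers, fit or not.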
  6. Kubernetes Cluster Autoscaler: automatic node scaling for managed clusters

    Cluster Autoscaler watches for pods stuck in Pending because no node has room, then adds a node from a matching node group. When nodes drop below 50% resource utilization for long enough, it removes them. This guide covers configuring Cluster Autoscaler on EKS, GKE, and AKS, tuning scale-down timing, diagnosing common blockers, and knowing when Karpenter is a better fit.

    2248 words
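The 50% figure is Cluster Autoscaler's default `--scale-down-utilization-threshold`, and it is computed from pod *requests*, not live usage: a node counts as underutilized when the larger of its CPU-request and memory-request ratios falls below the threshold. A sketch of that check:

```python
SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5  # Cluster Autoscaler default

def is_scale_down_candidate(cpu_requested: float, cpu_allocatable: float,
                            mem_requested: float, mem_allocatable: float) -> bool:
    """A node is a scale-down candidate when the higher of its CPU and
    memory request ratios is below the utilization threshold."""
    utilization = max(cpu_requested / cpu_allocatable,
                      mem_requested / mem_allocatable)
    return utilization < SCALE_DOWN_UTILIZATION_THRESHOLD

# 1.2 of 4 cores (30%) and 3 of 16 GiB (~19%) requested -> candidate.
print(is_scale_down_candidate(1.2, 4, 3, 16))  # True
```

Being a candidate is necessary but not sufficient: the node's pods must also be reschedulable elsewhere, which is where the common blockers (local storage, missing PodDisruptionBudgets, un-evictable system pods) come in.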
  7. Kubernetes spot and preemptible instances: cost savings with interruption safety

    Spot instances on AWS and preemptible VMs on GCP cost 60–80% less than on-demand, but the cloud provider can reclaim them with as little as 30 seconds' notice. Running Kubernetes workloads on spot safely requires interruption handlers, PodDisruptionBudgets, proper taints, and diversified instance pools. This guide walks through each layer of the setup on both EKS and GKE.

    1499 words
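From inside a pod, an interruption looks like any other eviction: the node-level interruption handler cordons and drains the node, and the kubelet delivers SIGTERM at the start of the grace period. A sketch of the in-pod side, assuming your application uses a simple shutdown flag (the flag and handler names are my own):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """On SIGTERM: stop accepting new work and let in-flight
    requests finish within the termination grace period."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet delivering SIGTERM during a drain:
signal.raise_signal(signal.SIGTERM)
print(shutting_down)  # True
```

The grace period has to fit inside the reclaim window, so on preemptible VMs with only 30 seconds' notice, `terminationGracePeriodSeconds` plus your shutdown work must comfortably undercut that.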
  8. Kubernetes multi-tenancy: namespace isolation, ResourceQuota, and LimitRange

    Running multiple teams or environments on a single Kubernetes cluster saves infrastructure cost, but without explicit boundaries one namespace can starve every other. This guide walks through provisioning a tenant namespace with ResourceQuota for aggregate caps, LimitRange for per-container defaults, NetworkPolicy for network isolation, RBAC for API-level access control, and Pod Security Standards for runtime restrictions.

    2060 words
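The "aggregate caps" role of ResourceQuota is an admission-time check: a pod is rejected if adding its requests to the namespace's current usage would exceed any hard cap. A simplified sketch (resource names flattened to plain numbers; real quotas use Kubernetes quantity strings like `4` and `8Gi`):

```python
def fits_quota(hard: dict, used: dict, pod_requests: dict) -> bool:
    """Quota-admission sketch: reject the pod if any tracked resource
    would exceed the namespace's hard cap."""
    for resource, cap in hard.items():
        if used.get(resource, 0) + pod_requests.get(resource, 0) > cap:
            return False
    return True

hard = {"requests.cpu": 4.0, "requests.memory_gib": 8.0, "pods": 10}
used = {"requests.cpu": 3.5, "requests.memory_gib": 6.0, "pods": 7}

# 0.25 cores / 1 GiB still fits under the caps:
print(fits_quota(hard, used,
                 {"requests.cpu": 0.25, "requests.memory_gib": 1.0, "pods": 1}))  # True
# A full extra core would push CPU to 4.5 of 4.0:
print(fits_quota(hard, used,
                 {"requests.cpu": 1.0, "requests.memory_gib": 1.0, "pods": 1}))   # False
```

This check is also why LimitRange matters alongside it: quota accounting needs every container to carry requests, and LimitRange supplies defaults for the ones that don't.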

Recurring server or deployment issues?

I help teams make production reliable with CI/CD, Kubernetes, and cloud—so fixes stick and deploys stop being stressful.

Explore DevOps consultancy
