Kubernetes Monitoring
Reference articles for the Kubernetes observability gaps I see most often: Prometheus scraping that silently stops after a relabel change, Grafana dashboards that show green while the cluster is on fire, log pipelines that lose entries under pressure, and alerting rules that either never fire or fire so often they get ignored.
Each article covers one observability layer at a time: what to measure, what a healthy baseline looks like, and how to verify that the monitoring itself is working before you need it.
Articles
- Kubernetes monitoring with Prometheus and kube-prometheus-stack (2235 words)
  A production Kubernetes cluster without observability is a cluster you are guessing about. This tutorial walks through installing kube-prometheus-stack via Helm, understanding what each component does, scraping your own application metrics with ServiceMonitor, writing alerting rules, routing alerts, and knowing when to add remote storage.
- Kubernetes CPU throttling: why pods stall at low utilisation (1185 words)
  A pod shows 12% average CPU in Grafana yet gets throttled 60% of the time. The cause is not the node being overloaded. It is the Linux CFS scheduler enforcing a per-100 ms time budget that monitoring dashboards smooth into invisibility. This article explains the mechanism, shows how to measure it, and lays out the remediation options with their tradeoffs.
- Kubernetes cluster logging with Fluent Bit and the EFK stack (2331 words)
  Container logs disappear the moment a pod is deleted. kubectl logs shows only the latest 10 MiB rotation file for a single pod. For anything beyond local debugging, you need a centralised logging pipeline. This tutorial walks through deploying Fluent Bit as a DaemonSet, shipping logs to Elasticsearch 8.x via TLS, and querying them in Kibana.
- Kubernetes Vertical Pod Autoscaler (VPA): right-sizing resource requests (1421 words)
  The Vertical Pod Autoscaler watches actual CPU and memory consumption per container and adjusts resource requests to match. In Off mode it gives you right-sizing recommendations without touching running pods. In enforcement modes it applies those recommendations automatically, either by restarting pods or (on Kubernetes 1.33+) by resizing them in place. This guide walks through installing VPA, reading its recommendations, bounding them with resource policies, safely progressing to auto-apply, and avoiding the conflict with HPA.
- Karpenter on EKS: faster node autoscaling with NodePool and EC2NodeClass (1984 words)
  Karpenter provisions nodes in 45–60 seconds on EKS by calling EC2 Fleet directly instead of waiting for Auto Scaling Groups. Where Cluster Autoscaler picks from predefined node groups, Karpenter evaluates all available instance types per pending pod batch and launches the tightest fit. This guide covers installing Karpenter v1.x on EKS, writing NodePool and EC2NodeClass manifests, configuring disruption and consolidation, migrating from Cluster Autoscaler with zero downtime, and monitoring everything through Prometheus.
- Kubernetes Cluster Autoscaler: automatic node scaling for managed clusters (2248 words)
  Cluster Autoscaler watches for pods stuck in Pending because no node has room, then adds a node from a matching node group. When nodes drop below 50% resource utilisation for long enough, it removes them. This guide covers configuring Cluster Autoscaler on EKS, GKE, and AKS, tuning scale-down timing, diagnosing common blockers, and knowing when Karpenter is a better fit.
- Kubernetes spot and preemptible instances: cost savings with interruption safety (1499 words)
  Spot instances on AWS and preemptible VMs on GCP cost 60–80% less than on-demand, but the cloud provider can reclaim them with as little as 30 seconds notice. Running Kubernetes workloads on spot safely requires interruption handlers, PodDisruptionBudgets, proper taints, and diversified instance pools. This guide walks through each layer of the setup on both EKS and GKE.
- Kubernetes multi-tenancy: namespace isolation, ResourceQuota, and LimitRange (2060 words)
  Running multiple teams or environments on a single Kubernetes cluster saves infrastructure cost, but without explicit boundaries one namespace can starve every other. This guide walks through provisioning a tenant namespace with ResourceQuota for aggregate caps, LimitRange for per-container defaults, NetworkPolicy for network isolation, RBAC for API-level access control, and Pod Security Standards for runtime restrictions.
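As a taste of the ServiceMonitor approach the kube-prometheus-stack article covers, here is a minimal sketch. The names (`my-app`, the `metrics` port, the `release: kube-prometheus-stack` label) are assumptions that depend on your application and Helm release, not fixed values:

```yaml
# Hypothetical ServiceMonitor: tells the Prometheus Operator to scrape
# every Service labelled app: my-app on its named "metrics" port.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics     # the Service port's name, not its number
      interval: 30s
      path: /metrics
```

If Prometheus never picks up the target, the label selector mismatch between the Operator and the ServiceMonitor is the usual culprit, which is exactly the silent-scrape-failure mode the intro mentions.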
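The CFS throttling described in the CPU throttling article can be surfaced with an alert on the ratio of throttled to total scheduler periods, which averaged CPU graphs hide. A minimal sketch; the 25% threshold and all names here are illustrative assumptions, not recommendations from the article:

```yaml
# Hypothetical PrometheusRule: fires when a container is throttled in more
# than 25% of its CFS periods over 15 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
  namespace: monitoring
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[15m]))
              /
            sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[15m]))
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.container }} throttled in over 25% of CFS periods"
```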
