Kubernetes Monitoring

Reference articles for the Kubernetes observability gaps I see most often: Prometheus scraping that silently stops after a relabel change, Grafana dashboards that show green while the cluster is on fire, log pipelines that lose entries under pressure, and alerting rules that either never fire or fire so often they get ignored.

Each article covers one observability layer at a time: what to measure, what a healthy baseline looks like, and how to verify that the monitoring itself is working before you need it.


Articles

  1. Kubernetes monitoring with Prometheus and kube-prometheus-stack

    A production Kubernetes cluster without observability is a cluster you are guessing about. This tutorial walks through installing kube-prometheus-stack via Helm, understanding what each component does, scraping your own application metrics with ServiceMonitor, writing alerting rules, routing alerts, and knowing when to add remote storage.

    2235 words
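What a ServiceMonitor ultimately points Prometheus at is a `/metrics` endpoint serving the text exposition format. As a minimal, dependency-free sketch, here is one counter rendered by hand; in a real application you would use a client library such as prometheus_client rather than formatting this yourself (the metric name and labels below are illustrative):

```python
def render_counter(name, help_text, value, labels=None):
    """Render one counter in the Prometheus text exposition format,
    the format a scraped /metrics endpoint serves."""
    label_str = ""
    if labels:
        # Label pairs are rendered as {k="v",...}; order is not significant
        # to Prometheus, but sorting keeps the output deterministic.
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} counter\n"
            f"{name}{label_str} {value}\n")

print(render_counter("http_requests_total", "Total HTTP requests served.",
                     1027, {"method": "GET", "code": "200"}))
```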
  2. Kubernetes CPU throttling: why pods stall at low utilisation

    A pod shows 12% average CPU in Grafana yet gets throttled 60% of the time. The cause is not the node being overloaded. It is the Linux CFS scheduler enforcing a per-100 ms time budget that monitoring dashboards smooth into invisibility. This article explains the mechanism, shows how to measure it, and lays out the remediation options with their tradeoffs.

    1185 words
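The 12%-average-but-60%-throttled mismatch comes straight out of cgroup counters: the kernel records how many 100 ms CFS periods elapsed and how many of them hit the quota. A minimal sketch of deriving the throttle ratio from a cgroup v2 `cpu.stat` file (the sample values are illustrative):

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS enforcement periods in which the container was
    throttled, parsed from the cgroup v2 cpu.stat format
    ('key value' per line)."""
    stats = {}
    for line in cpu_stat_text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats["nr_throttled"] / periods

sample = """usage_usec 120000000
user_usec 90000000
system_usec 30000000
nr_periods 1000
nr_throttled 600
throttled_usec 45000000
"""
# 600 of 1000 periods hit the quota, even though average usage looks low.
print(throttle_ratio(sample))  # 0.6
```

Averaged CPU graphs hide this because a container can burn its whole budget in the first few milliseconds of a period, stall, and still show single-digit mean utilisation.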
  3. Kubernetes cluster logging with Fluent Bit and the EFK stack

    Container logs disappear the moment a pod is deleted. kubectl logs shows only the latest 10 MiB rotation file for a single pod. For anything beyond local debugging, you need a centralised logging pipeline. This tutorial walks through deploying Fluent Bit as a DaemonSet, shipping logs to Elasticsearch 8.x via TLS, and querying them in Kibana.

    2331 words
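The files Fluent Bit tails under /var/log/containers are written in the CRI log format: a timestamp, the stream, a full/partial flag, and the log content. A small sketch of parsing one such line into the structured record a pipeline ships downstream (field names are my own):

```python
def parse_cri_line(line: str) -> dict:
    """Parse one line of the CRI container log format:
    '<rfc3339-time> <stdout|stderr> <F|P> <log>'.
    'P' marks a partial line that runtimes split at their buffer size."""
    time, stream, flags, log = line.split(" ", 3)
    return {"time": time, "stream": stream,
            "partial": flags == "P", "log": log}

rec = parse_cri_line("2024-05-01T10:00:00.000000000Z stdout F payment accepted")
print(rec["stream"], rec["log"])  # stdout payment accepted
```

Fluent Bit's tail input does this parsing (and partial-line reassembly) for you; the sketch only shows what the raw on-disk format contains.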
  4. Kubernetes Vertical Pod Autoscaler (VPA): right-sizing resource requests

    The Vertical Pod Autoscaler watches actual CPU and memory consumption per container and adjusts resource requests to match. In Off mode it gives you right-sizing recommendations without touching running pods. In enforcement modes it applies those recommendations automatically, either by restarting pods or (on Kubernetes 1.33+) by resizing them in place. This guide walks through installing VPA, reading its recommendations, bounding them with resource policies, safely progressing to auto-apply, and avoiding the conflict with HPA.

    1421 words
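The "bounding with resource policies" step is, at its core, a clamp: the recommender's target is forced between the policy's minAllowed and maxAllowed. A one-function sketch with CPU in cores (real VPA expresses these as Kubernetes quantities in `containerPolicies`):

```python
def bound_recommendation(target: float, min_allowed: float,
                         max_allowed: float) -> float:
    """Clamp a VPA-style target between the resource policy's
    minAllowed and maxAllowed bounds."""
    return max(min_allowed, min(target, max_allowed))

# Recommender wants 0.9 cores, but the policy caps this container at 0.5:
print(bound_recommendation(0.9, 0.25, 0.5))  # 0.5
```

Bounds like this are the main safety net while progressing from Off mode to auto-apply: a runaway recommendation cannot exceed what the policy allows.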
  5. Karpenter on EKS: faster node autoscaling with NodePool and EC2NodeClass

    Karpenter provisions nodes in 45–60 seconds on EKS by calling EC2 Fleet directly instead of waiting for Auto Scaling Groups. Where Cluster Autoscaler picks from predefined node groups, Karpenter evaluates all available instance types per pending pod batch and launches the tightest fit. This guide covers installing Karpenter v1.x on EKS, writing NodePool and EC2NodeClass manifests, configuring disruption and consolidation, migrating from Cluster Autoscaler with zero downtime, and monitoring everything through Prometheus.

    1984 words
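The "tightest fit" idea reduces to: filter the instance catalogue to types that can hold the pending batch, then take the cheapest. A deliberately simplified sketch — the instance names and hourly prices below are illustrative, and real Karpenter also weighs zones, capacity type, taints, and packing across many pods:

```python
# Hypothetical catalogue: (name, vcpus, memory_gib, hourly_usd).
CATALOGUE = [
    ("m5.4xlarge", 16, 64, 0.768),
    ("m5.xlarge",   4, 16, 0.192),
    ("c5.2xlarge",  8, 16, 0.340),
    ("m5.2xlarge",  8, 32, 0.384),
]

def tightest_fit(pending_cpu: float, pending_mem_gib: float):
    """Cheapest instance type whose capacity covers the whole
    pending pod batch, or None if nothing fits."""
    candidates = [inst for inst in CATALOGUE
                  if inst[1] >= pending_cpu and inst[2] >= pending_mem_gib]
    return min(candidates, key=lambda inst: inst[3], default=None)

# A batch requesting 6 vCPU / 20 GiB rules out m5.xlarge (CPU) and
# c5.2xlarge (memory), leaving m5.2xlarge as the cheapest fit.
print(tightest_fit(6, 20)[0])  # m5.2xlarge
```

This is also why Karpenter beats a fixed node group: with only one predefined instance size, the batch above would get whatever that group offers, fit or not.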
  6. Kubernetes Cluster Autoscaler: automatic node scaling for managed clusters

    Cluster Autoscaler watches for pods stuck in Pending because no node has room, then adds a node from a matching node group. When nodes drop below 50% resource utilization for long enough, it removes them. This guide covers configuring Cluster Autoscaler on EKS, GKE, and AKS, tuning scale-down timing, diagnosing common blockers, and knowing when Karpenter is a better fit.

    2248 words
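The 50% figure is Cluster Autoscaler's default `--scale-down-utilization-threshold`, and it is computed from pod *requests*, not live usage: a node counts as underutilized when the larger of its CPU-request and memory-request ratios falls below the threshold. A sketch of that check:

```python
SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5  # Cluster Autoscaler default

def is_scale_down_candidate(cpu_requested: float, cpu_allocatable: float,
                            mem_requested: float, mem_allocatable: float) -> bool:
    """A node is a scale-down candidate when the higher of its CPU and
    memory request ratios is below the utilization threshold."""
    utilization = max(cpu_requested / cpu_allocatable,
                      mem_requested / mem_allocatable)
    return utilization < SCALE_DOWN_UTILIZATION_THRESHOLD

# 1.2 of 4 cores (30%) and 3 of 16 GiB (~19%) requested -> candidate.
print(is_scale_down_candidate(1.2, 4, 3, 16))  # True
```

Being a candidate is necessary but not sufficient: the node's pods must also be reschedulable elsewhere, which is where the common blockers (local storage, missing PodDisruptionBudgets, un-evictable system pods) come in.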
  7. Kubernetes spot and preemptible instances: cost savings with interruption safety

    Spot instances on AWS and preemptible VMs on GCP cost 60–80% less than on-demand, but the cloud provider can reclaim them with as little as 30 seconds' notice. Running Kubernetes workloads on spot safely requires interruption handlers, PodDisruptionBudgets, proper taints, and diversified instance pools. This guide walks through each layer of the setup on both EKS and GKE.

    1499 words
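From inside a pod, an interruption looks like any other eviction: the node-level interruption handler cordons and drains the node, and the kubelet delivers SIGTERM at the start of the grace period. A sketch of the in-pod side, assuming your application uses a simple shutdown flag (the flag and handler names are my own):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """On SIGTERM: stop accepting new work and let in-flight
    requests finish within the termination grace period."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet delivering SIGTERM during a drain:
signal.raise_signal(signal.SIGTERM)
print(shutting_down)  # True
```

The grace period has to fit inside the reclaim window, so on preemptible VMs with only 30 seconds' notice, `terminationGracePeriodSeconds` plus your shutdown work must comfortably undercut that.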
  8. Kubernetes multi-tenancy: namespace isolation, ResourceQuota, and LimitRange

    Running multiple teams or environments on a single Kubernetes cluster saves infrastructure cost, but without explicit boundaries one namespace can starve every other. This guide walks through provisioning a tenant namespace with ResourceQuota for aggregate caps, LimitRange for per-container defaults, NetworkPolicy for network isolation, RBAC for API-level access control, and Pod Security Standards for runtime restrictions.

    2060 words
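The "aggregate caps" role of ResourceQuota is an admission-time check: a pod is rejected if adding its requests to the namespace's current usage would exceed any hard cap. A simplified sketch (resource names flattened to plain numbers; real quotas use Kubernetes quantity strings like `4` and `8Gi`):

```python
def fits_quota(hard: dict, used: dict, pod_requests: dict) -> bool:
    """Quota-admission sketch: reject the pod if any tracked resource
    would exceed the namespace's hard cap."""
    for resource, cap in hard.items():
        if used.get(resource, 0) + pod_requests.get(resource, 0) > cap:
            return False
    return True

hard = {"requests.cpu": 4.0, "requests.memory_gib": 8.0, "pods": 10}
used = {"requests.cpu": 3.5, "requests.memory_gib": 6.0, "pods": 7}

# 0.25 cores / 1 GiB still fits under the caps:
print(fits_quota(hard, used,
                 {"requests.cpu": 0.25, "requests.memory_gib": 1.0, "pods": 1}))  # True
# A full extra core would push CPU to 4.5 of 4.0:
print(fits_quota(hard, used,
                 {"requests.cpu": 1.0, "requests.memory_gib": 1.0, "pods": 1}))   # False
```

This check is also why LimitRange matters alongside it: quota accounting needs every container to carry requests, and LimitRange supplies defaults for the ones that don't.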

Recurring server or deployment issues?

I help teams make production reliable with CI/CD, Kubernetes, and cloud—so fixes stick and deploys stop being stressful.

Explore DevOps consultancy
