VictoriaMetrics vs Prometheus: my default, and when I still pick Prometheus

VictoriaMetrics is my default for a new monitoring stack: leaner on RAM and disk, simpler to run highly available, and boring in production. Here is the honest reasoning, the independent evidence, and the cases where I still pick Prometheus.

For a new monitoring stack on Kubernetes, VictoriaMetrics is my default and Prometheus is the exception I have to justify. I reach for VictoriaMetrics because making Prometheus highly available means bolting on replica pairs, then Thanos or Mimir, then object storage and compaction workers, while VictoriaMetrics gives me the same high availability and long retention with far less to run. Fair warning: I recommend VictoriaMetrics to most of my clients, so read this as a practitioner's argument with reasons attached, not a neutral scorecard.

TL;DR

  • My default for a greenfield stack is VictoriaMetrics: leaner on RAM and disk, simpler to run highly available, and boring in production once it is in. I keep Prometheus where a healthy stack already runs, or where a hard vendor-neutral requirement rules it in.
  • The efficiency edge is real and independently corroborated (Criteo, Prezi), not just vendor benchmarks. The exact "7x" multipliers are VictoriaMetrics' own; the direction is not in doubt.
  • Prometheus is excellent software and the de facto standard (CNCF-graduated, the default for Kubernetes monitoring). Its weak spot is operations at scale: high availability and long retention are bolt-ons, and cardinality can take it down.
  • VictoriaMetrics is not 100% PromQL, and I no longer care, because I write MetricsQL natively and prefer it. If you need strict PromQL portability across backends, that is a fair reason to stay.
  • Single-vendor risk is real; just ask MinIO users. But VictoriaMetrics' Apache 2.0 core takes no contributor license agreement, so the existing code cannot be clawed back the way the MinIO, HashiCorp and Redis editions were.

At a glance

| | Prometheus | VictoriaMetrics |
|---|---|---|
| Built for | scrape, alert, short-term local storage | scale, long retention, RAM efficiency |
| RAM per 1M active series | several GB (3 to 4 KB/series in the head) | around 1 GB |
| Long-term storage | needs Thanos, Mimir or Cortex | built in |
| High availability | replica pairs plus Alertmanager gossip | -replicationFactor plus query-time dedup |
| Query language | PromQL (the standard) | MetricsQL (74% PromQL-compatible superset) |
| Kubernetes | Prometheus Operator (de facto standard) | VM Operator, auto-converts Prometheus CRDs |
| Governance | CNCF-graduated, multi-vendor | one company, Apache 2.0 core (no CLA) plus Enterprise tier |
| I pick it when | a healthy Prometheus stack already runs, or you need vendor-neutral or portable PromQL | new stacks, multi-million series, multi-year retention, RAM-constrained nodes, high churn |

Why VictoriaMetrics is my default

A single Prometheus on Kubernetes is easy. A resilient Prometheus is not, and that gap is why VictoriaMetrics is my starting point. The moment you need high availability you run two replicas scraping the same targets, and the moment you need a global view or retention past a couple of weeks you add Thanos or Mimir, object storage, and compaction workers. Each piece is one more thing to run, monitor, and get paged for. VictoriaMetrics gives me high availability with -replicationFactor and years of retention with a single flag (-retentionPeriod), on far fewer moving parts, including hard multi-tenant isolation when a client needs it.

I run the cluster with the VictoriaMetrics Operator, and where it fits I add VictoriaLogs and, where it earns its place, VictoriaTraces, so metrics, logs and traces come from one coherent stack instead of three unrelated ones.

When I have moved clients off Prometheus, the outcome has been the same one I want: boring. Dashboards kept working, alerting kept working, RAM dropped, and queries that sweep weeks of data come back instantly. Boring is the highest compliment I can pay a monitoring system, because boring means nobody is awake at 3am rebuilding it.

How the storage engines differ

Prometheus keeps recent data in an in-memory head block, writes every incoming sample to a write-ahead log on disk first, then compacts older data into immutable two-hour blocks on local disk. Each active series costs roughly 3 to 4 KB of RAM in the head, which is the number to remember about Prometheus: a million active series is several GB of RAM before you have run a single query. Default local retention is 15 days, block compaction produces periodic disk I/O spikes you will see in your own node metrics, and clustering or replication is not in the box. (TSDB internals)

VictoriaMetrics uses a storage engine closer in spirit to ClickHouse's MergeTree. Scaling the cluster out is adding nodes; historical data is not rebalanced, only new writes spread to the new nodes. The design avoids the WAL write amplification and the large compaction spikes Prometheus exhibits, and its compression is tighter for typical time series. Replication is one flag, -replicationFactor, on the insert layer.

Two catches up front. Single-node and cluster VictoriaMetrics use different on-disk formats, so you cannot copy data files from one to the other when you outgrow single-node; you migrate with vmctl. And the default -retentionPeriod is one month. A September 2025 issue asked for that default to be made safer because new users silently lose data at 30 days; the maintainers closed it as "not planned". I set retention with the client based on the storage they have, so this has never bitten me, but set it on day one.
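As a minimal sketch of "set it on day one", assuming a single-node instance deployed through the VictoriaMetrics Operator: the manifest below pins the retention explicitly instead of inheriting the default. The resource name, namespace and volume size are hypothetical placeholders, and the field names should be checked against the operator version you run.

```yaml
# Hypothetical single-node instance; name, namespace and sizes are placeholders.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: main
  namespace: monitoring
spec:
  # Passed through to -retentionPeriod. A bare number means months,
  # so "24" keeps two years instead of the one-month default.
  retentionPeriod: "24"
  storage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        # Size this from the disk the client actually has.
        storage: 200Gi
```

Without the operator, the equivalent is the -retentionPeriod flag on the binary itself.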

The resource numbers, and why I distrust most of them

VictoriaMetrics is more efficient on RAM and disk. I believe the direction of that claim, because I see it on real client systems. I do not trust the exact multipliers, and you should not either, because almost every published benchmark comes from VictoriaMetrics or its founders.

The widely-cited "7x less disk, 5x less RAM" figure traces to a 2020 benchmark by VictoriaMetrics' CTO running Prometheus v2.22, years out of date now. A more recent 2024 vendor benchmark reports 2.5x less disk and 1.7x less RAM. Notice what happened: the gap shrank as Prometheus improved. So go to the independent sources.

The cleanest neutral data point is Criteo's engineering blog: at around a billion active series across 1,500 Prometheus instances, VictoriaMetrics hit roughly one byte per datapoint in production and let Criteo cut its backend from 226 compute and 156 storage nodes to 15 compute and 46 storage, while storing 15 times more data. Prezi, reported by InfoQ with a named SRE on record, moved 5 million active series off Prometheus and saw a 70% storage cut, 60% less memory, 30% less CPU, and heavy queries drop from 30-plus seconds to 3 to 7 seconds. Those are real teams with real numbers, and they point the same way I see in my own migrations. My own evidence is the qualitative version: RAM drops and the system gets boring.

The effect is not the same size for everyone: the magnitude depends on your workload and versions, and Prometheus has closed part of the gap with its own improvements since those older benchmarks ran. The honest read: VictoriaMetrics' compression and RAM advantage is real, but a fully independent, reproducible benchmark on current releases of both still does not exist.

Cardinality: the wall Prometheus hits first

Cardinality is the failure that pushes most people off Prometheus, and it is worth being precise about why. Every unique label combination is a separate series, every active series lives in the head block, and the head block lives in RAM. Attach an unbounded label, a user ID, a request ID, a pod name that churns every few minutes, and series count explodes. Prometheus does not degrade gracefully here; it runs out of memory and the process dies. There is no built-in hard cap on series count. (cardinality explained)
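Prometheus does offer per-target guards, even though it has no global cap. A minimal sketch, assuming a hypothetical job called api and hypothetical unbounded labels request_id and user_id:

```yaml
# prometheus.yml fragment; job name, target and label names are hypothetical.
scrape_configs:
  - job_name: api
    # Per-scrape guard, not a global cap: if a target exposes more samples
    # than this after metric relabeling, the whole scrape is treated as failed.
    sample_limit: 20000
    static_configs:
      - targets: ["api.example.internal:9090"]
    metric_relabel_configs:
      # Drop labels with unbounded values before they reach the TSDB.
      # The remaining labels must still identify each series uniquely.
      - action: labeldrop
        regex: "request_id|user_id"
```

The limit is all-or-nothing per scrape, so a target that crosses it goes dark rather than degrading, and nothing here bounds the total series count across targets.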

This is not theoretical, and it bites large, competent teams. Cloudflare runs 916 Prometheus instances over 4.9 billion series and had to write custom patches (a global series limit, graceful sample-limit degradation) because upstream could not cap memory the way they needed. PingCAP ran a Prometheus box on a 96-core, 768 GB instance that still OOM'd, with a WAL replay of over 40 minutes on each crash. The UK Ministry of Justice cloud platform lost monitoring for 3 hours 21 minutes in April 2024 when a WAL replay after a restart blew past the 15-minute startup probe and tripped a restart loop. The common thread in the last two: a Prometheus restart is not free once the WAL is large.

VictoriaMetrics handles the same pressure differently. When its in-memory series cache overflows, new series trigger "slow inserts", background index work, rather than an immediate OOM. The -search.maxUniqueTimeseries flag caps how many series a single query can touch, it ships in the free Apache 2.0 build, and there is a built-in cardinality explorer to find the label hurting you. You watch vm_slow_row_inserts_total: when it climbs past about 5% of inserts for ten minutes, your active series no longer fit in cache and it is time to add RAM or cut cardinality. That is a dial you can read and turn, instead of an OOM kill you discover after the fact. Avoiding that failure from day one, rather than waiting to hit it, is a large part of why I start on VictoriaMetrics.
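The "5% for ten minutes" rule of thumb can be sketched as an alerting rule. This is an illustration, not the official rule set: the group and alert names are hypothetical, and the denominator metric is an assumption to confirm against the metrics your VictoriaMetrics version exposes.

```yaml
# Rule file in the Prometheus rule format, which vmalert reads.
# Group and alert names are hypothetical.
groups:
  - name: victoriametrics-cardinality
    rules:
      - alert: SlowInsertsAboveFivePercent
        # Share of inserted rows that took the slow path.
        # ASSUMPTION: vm_rows_added_to_storage_total as the denominator;
        # verify that this metric exists in your version.
        expr: |
          sum(rate(vm_slow_row_inserts_total[5m]))
            /
          sum(rate(vm_rows_added_to_storage_total[5m]))
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Active series no longer fit in cache: add RAM or cut cardinality"
```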

Long-term storage: Thanos, Cortex, Mimir, or VictoriaMetrics?

The bolt-on tax for long-term storage is a big reason VictoriaMetrics is my default. If you want Prometheus to keep more data for longer with a global view, you do not have to leave the Prometheus world, but you do have to assemble it, and the state of each option matters.

  • Thanos is CNCF Incubating. It runs a sidecar next to each Prometheus to ship two-hour blocks into S3-style object storage, and deduplicates across replicas at query time with --query.replica-label. It is still on 0.x version numbers, which says nothing about its maturity; it has run at large scale for years.
  • Cortex is still maintained, with regular releases, active community calls and a roadmap presented at KubeCon EU 2025. For new deployments, though, Mimir has become the recommended path.
  • Grafana Mimir is Grafana Labs' fork and evolution of Cortex, open-sourced in 2022, with v3.0 in November 2025 adding a Kafka-based write path. It powers Grafana Cloud's metrics backend, so it gets investment proportional to Grafana Labs' revenue. Between Cortex and Mimir for a greenfield setup, I would pick Mimir.

Every one of those is a capable system. They are also more to run than the single thing they are bolted onto. Thanos, Cortex and Mimir keep you on PromQL with Prometheus as the scrape source, at the cost of operating object storage, compaction workers and rule synchronization. VictoriaMetrics gives me long retention with local compression and far fewer parts in one stack. Criteo's numbers above came specifically from replacing a Thanos and Cortex evaluation, and Prezi chose VictoriaMetrics over both because block storage was cheaper and faster than the S3-backed alternatives for their workload. When the question is "how do I make Prometheus durable and long-lived," VictoriaMetrics answers it with less machinery.

PromQL vs MetricsQL: the language I actually prefer

MetricsQL, VictoriaMetrics' query language, is a superset of PromQL. The overwhelming majority of existing PromQL queries and Grafana dashboards run unchanged, which is why migrations do not break dashboards. But I no longer write PromQL out of habit and reach for MetricsQL only when I need an extra; I write MetricsQL by choice. It adds WITH templates for reusable fragments, multiple "or" label filters in one selector, a keep_metric_names modifier, unit suffixes like Ki/Gi in queries, and rollup functions Prometheus lacks such as outlier_iqr_over_time and range_linear_regression. Once you have those, plain PromQL feels cramped.
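To make a few of those extensions concrete, here is a small sketch of a vmalert rule file. The metric names, label values and rule names are hypothetical, and the expressions only parse on VictoriaMetrics and vmalert, not on Prometheus.

```yaml
# vmalert rule file using MetricsQL-only syntax; all names are hypothetical.
groups:
  - name: metricsql-examples
    rules:
      # WITH template: define a reusable label filter once, then reference it.
      - record: api:http_requests:rate5m_by_code
        expr: |
          WITH (apiProd = {job="api", env="prod"})
          sum(rate(http_requests_total{apiProd}[5m])) by (code)

      # "or" filters in one selector, keep_metric_names so the rollup
      # keeps the original metric name, and a Gi unit suffix.
      - alert: HighResidentMemory
        expr: |
          (
            max_over_time(
              process_resident_memory_bytes{job="api",env="prod" or job="worker",env="staging"}[10m]
            ) keep_metric_names
          ) > 2Gi
        for: 15m
```

Rules written this way are exactly what stops being portable, which is the trade discussed next.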

Be clear-eyed about the trade, though. MetricsQL deliberately diverges from PromQL, and the independent PromLabs compliance suite, run by Prometheus co-creator Julius Volz, scored VictoriaMetrics at 74% on its last published round, against 100% for Thanos and Cortex. Most of those failures are intentional: MetricsQL keeps the metric name after functions like min_over_time() where Prometheus strips it, changes rate() and increase() to look at the sample just before the window, removes NaN from output, and aligns timestamps to the resolution step. I think several of these are more correct, and in practice the divergences have never cost me anything. But there is one case where they cost you: if you need strict PromQL portability across multiple backends, or you maintain a large library of recording rules tuned to Prometheus' exact behaviour, MetricsQL's improvements become a liability rather than a feature. That is one of the few reasons I would keep a team on Prometheus.

Running each in production

Benchmarks cover ingest and query speed. The differences that bite in day-two operations are elsewhere: failover, the collector, Kubernetes wiring and alert state.

High availability

Prometheus has no leader election. The standard HA pattern is two identical instances scraping the same targets with a distinguishing replica external label, plus an Alertmanager cluster that gossips silences and notification state and deduplicates alerts by staggering sends (default --cluster.peer-timeout=15s). Neither instance deduplicates the other's data; that is what Thanos Query, Mimir's HA tracker, or VictoriaMetrics solve. VictoriaMetrics does it with -replicationFactor=N on vminsert to write each sample to N storage nodes, and -dedup.minScrapeInterval on vmselect to collapse the duplicates two vmagent instances produce when both scrape the same targets. One operational note: when a vmstorage node dies, vminsert reroutes its load to the survivors, and at scale that surge can cascade into OOM on the remaining nodes. Keep headroom.
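A sketch of that VictoriaMetrics HA setup under the operator, assuming a 30s scrape interval and hypothetical resource names. Storage volumes and resource requests are left out for brevity, so this is a shape to adapt rather than a production manifest.

```yaml
# Hypothetical names and sizes; persistent storage omitted for brevity.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: main
  namespace: monitoring
spec:
  retentionPeriod: "24"
  # vminsert writes each sample to this many vmstorage nodes.
  replicationFactor: 2
  vmstorage:
    # More nodes than the replication factor, to leave headroom
    # for rerouted load when one node dies.
    replicaCount: 3
  vminsert:
    replicaCount: 2
  vmselect:
    replicaCount: 2
    extraArgs:
      # Collapse the duplicates from the two vmagent replicas at query time.
      # Assumed to match the scrape interval configured below.
      dedup.minScrapeInterval: 30s
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: main
  namespace: monitoring
spec:
  # Two identical agents scraping the same targets.
  replicaCount: 2
  scrapeInterval: 30s
  selectAllByDefault: true
  remoteWrite:
    - url: http://vminsert-main.monitoring.svc:8480/insert/0/prometheus/api/v1/write
```

One design point worth checking in your own setup: deduplication only collapses samples that land on the same series, so the two agents should send identical label sets. That is the opposite of the distinguishing replica label the Thanos pattern relies on.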

The collector layer

Prometheus has had an agent mode since v2.32 (behind --enable-feature=agent in 2.x, a dedicated --agent flag since 3.0): scrape and remote-write only, no local TSDB, no queries, with a roughly two-hour WAL buffer if the remote endpoint is down. vmagent does more. It buffers to disk per destination (-remoteWrite.maxDiskUsagePerURL), so an outage longer than two hours does not silently cost you data the way Prometheus' WAL does, and it can pre-aggregate at ingestion with stream aggregation (-streamAggr.config) to crush cardinality before it ever lands in storage. That last feature attacks the cardinality tax directly, and Prometheus has no equivalent.
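A minimal sketch of a stream aggregation file, assuming a hypothetical counter http_requests_total whose pod and instance labels drive the cardinality:

```yaml
# File passed to vmagent via -streamAggr.config.
# Metric and label names are hypothetical.
- match: 'http_requests_total'
  # Emit one aggregated sample per output series every minute.
  interval: 1m
  # Aggregate away the high-churn dimensions; all other labels are kept.
  without: [pod, instance]
  # "total" is the output intended for counters.
  outputs: [total]
```

The aggregated series is written under a derived name, so dashboards and alerts have to point at the new series. What happens to the raw input samples is governed by the -streamAggr.keepInput and -streamAggr.dropInput flags, so check those before relying on the cardinality reduction.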

Kubernetes integration

This is Prometheus' home turf. The Prometheus Operator and its ServiceMonitor, PodMonitor, PrometheusRule and ScrapeConfig CRDs, bundled in the kube-prometheus-stack Helm chart, are the most battle-tested Kubernetes monitoring setup that exists. The good news for a switch is that the VictoriaMetrics Operator can run alongside it and auto-convert those CRDs: a ServiceMonitor becomes a VMServiceScrape, a PrometheusRule becomes a VMRule, non-destructively, so you migrate gradually without rewriting every scrape definition. It is how I move clients across without a flag day.
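To show how close the two object shapes are, here is a hypothetical ServiceMonitor followed by roughly what its VictoriaMetrics counterpart looks like. The second object is illustrative: the operator generates it for you, and the exact output depends on the operator version.

```yaml
# Existing Prometheus Operator object; names and labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: apps
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      interval: 30s
---
# Approximate shape of the converted object (illustrative, not exact output).
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: api
  namespace: apps
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      interval: 30s
```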

Alerting state

vmalert is stateless by design. Alert state lives in memory and resets on restart unless you wire up -remoteWrite.url and -remoteRead.url so it persists and restores the ALERTS and ALERTS_FOR_STATE series. Forget that, and every restart resets your for: timers. Prometheus keeps that state in its local TSDB by default. A small thing that surprises people on day one.
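A sketch of that wiring through the operator's VMAlert resource, assuming a cluster named main in a monitoring namespace. Every URL and name below is a hypothetical placeholder.

```yaml
# Hypothetical names and URLs; adjust to your own services.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: main
  namespace: monitoring
spec:
  replicaCount: 1
  evaluationInterval: 30s
  selectAllByDefault: true
  # Where rule expressions are evaluated.
  datasource:
    url: http://vmselect-main.monitoring.svc:8481/select/0/prometheus
  notifier:
    url: http://alertmanager.monitoring.svc:9093
  # Sets -remoteWrite.url: persist ALERTS and ALERTS_FOR_STATE.
  remoteWrite:
    url: http://vminsert-main.monitoring.svc:8480/insert/0/prometheus
  # Sets -remoteRead.url: restore alert state on startup,
  # so for: timers survive a restart.
  remoteRead:
    url: http://vmselect-main.monitoring.svc:8481/select/0/prometheus
```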

What it actually costs

I self-host VictoriaMetrics for clients, usually in private clouds, so the managed pricing below is context rather than my path. It is still worth knowing, because the billing models show exactly what self-hosting sidesteps, and because the cost trap is the billing model your workload lands in, not the rate card.

  • Amazon Managed Service for Prometheus bills per sample ingested, tiered from $0.90 per 10M samples down to $0.16 at volume. High-frequency scraping is what hurts here, because doubling your scrape rate doubles your sample count.
  • Google Cloud Managed Service for Prometheus also bills per sample but is the most expensive of the hyperscalers at practical scales, roughly 4x AWS for the same load, and it downsamples older data.
  • Azure Monitor managed Prometheus is a flat $0.016 per million samples with 18 months of retention included, which makes it the cheapest cloud-native option at moderate scale.
  • Grafana Cloud bills on active series ($6.50 per 1,000 on Pro above a 10,000 free tier) but with a data-points-per-minute multiplier: scrape at 15s instead of 60s and you pay roughly 4x, because the bill is max(series, DPM/included).
  • VictoriaMetrics Cloud uses fixed compute tiers, single-node from about $202/month for 500K series, so a cardinality spike does not change your bill until you blow the tier ceiling.

Per-sample billing punishes scrape frequency; active-series billing punishes cardinality; Grafana's formula punishes both. In a Kubernetes cluster where labels proliferate on their own, a careless customer_id or unbounded path label does not just risk an OOM, it can multiply a usage-priced bill by ten on the same hardware. Self-hosting sidesteps that entirely: your cost is the box, not the cardinality. It is the same dynamic I wrote about in my guide to FinOps for Kubernetes: the line that blows up the bill is rarely the capacity you planned, it is the usage you did not predict.

On sizing, the rules of thumb tell the story: budget around 1 GB of RAM per million active series for VictoriaMetrics versus several GB for Prometheus, per VictoriaMetrics' own capacity guidance and the well-established Prometheus head-block math. On a private cloud where I am paying for every node, that difference is the bill, repeated across every replica.

Migrating without a big-bang cutover

You do not have to choose on day one. The cleanest path is dual-write: leave Prometheus scraping exactly as it is and add VictoriaMetrics as a remote_write target.

# In your existing prometheus.yml
remote_write:
  - url: http://victoriametrics:8428/api/v1/write

From there:

  • Grafana: VictoriaMetrics speaks the Prometheus HTTP API, so the standard Prometheus datasource works unchanged. An optional VictoriaMetrics datasource plugin unlocks MetricsQL extras.
  • Scraping: vmagent is a drop-in for Prometheus' scrape engine, accepts the same scrape_configs, uses less RAM, and buffers to disk when the backend is unavailable.
  • Kubernetes: run the VictoriaMetrics Operator beside the Prometheus Operator and let it convert your existing ServiceMonitor and PrometheusRule objects automatically.
  • Alerting: vmalert reads the same rule YAML and reuses your existing Alertmanager.
  • Historical data: vmctl imports Prometheus TSDB snapshots, in time-filtered chunks if needed.

The gotchas are documented; read them before you commit. Graphs can show slightly different values during the transition between vmagent and Prometheus; the rate()/increase() divergence means alert thresholds may need a look; set -retentionPeriod explicitly so you do not inherit the one-month default; and pick single-node versus cluster before you migrate history, because the on-disk formats are incompatible. On the cluster question, I run cluster VictoriaMetrics with the Operator and it has stayed boring; there is a documented data-loss window on rolling vmstorage restarts in some versions, and I keep an eye on it, but it has not bitten me. Single-node is still the simplest thing that works, so if a client does not need the cluster, I do not give them one.

Governance: a single vendor

On governance, Prometheus has a structural edge over VictoriaMetrics. It is CNCF-graduated, the second project ever to graduate, after Kubernetes. Its governance is a multi-company structure with maintainers from Grafana Labs, Red Hat, G-Research, Polar Signals and independents, and no single company can relicense or capture it because the CNCF holds the trademark and the contributor base spans thousands of organizations. VictoriaMetrics, by contrast, is a product of one company, VictoriaMetrics, Inc., around 50 people and still bootstrapped with no outside funding as of early 2026.

Every dependency you adopt is a bet on the people behind it, and that bet sometimes goes bad. MinIO is the cautionary tale everyone reaches for: in May 2025 it stripped the admin console (IAM, bucket management, LDAP and OIDC) out of the community edition into the paid tier, in October 2025 it stopped publishing community Docker images and binaries, and by December 2025 the open-source project went into maintenance mode. That is the risk, in the flesh.

VictoriaMetrics is a safer bet than MinIO was, and the reason is structural. MinIO could gut its community edition because it held the copyright through a contributor license agreement. VictoriaMetrics takes no CLA, so contributors keep their own copyright and the company cannot relicense the existing Apache 2.0 code even if it wanted to. That is the structural protection HashiCorp and Redis lacked when they switched to the BSL in 2023 and to a source-available license in 2024, respectively. The realistic pressure is not a rug-pull on the open core, it is new features landing in Enterprise instead, and that is worth watching. I treat VictoriaMetrics the way I treat Traefik or NGINX: a piece of open-source infrastructure I depend on with my eyes open, and a bridge I will cross if I ever reach it. (I covered the governance fallout of that BSL switch in my comparison of OpenTofu and Terraform.)

One more practical note: both projects assume you run them behind a firewall, not exposed. The 336,000 internet-facing Prometheus endpoints Aqua Security found in December 2024, leaking credentials and API keys, are a deployment failure rather than a flaw, but they are a reminder to actually do that.

When I still pick Prometheus

For a greenfield stack, my answer is VictoriaMetrics. Prometheus still has a place, so here are the cases where I leave it be or reach for it on purpose.

The clearest one is the boring one: a Prometheus stack that is already running clean. There is no prize for migrating a healthy, stable system, and a migration you did not need is its own kind of waste. If it runs without pain, I leave it exactly where it is, and I would tell you to do the same.

Beyond that, two reasons hold up. The first is a hard requirement for vendor-neutral, CNCF-governed tooling, the kind some procurement and compliance regimes specify outright; that is a box VictoriaMetrics cannot tick and Prometheus can. The second is strict PromQL portability: a team running multiple metrics backends, or sitting on a deep library of recording rules tuned to Prometheus' exact rate() and metric-name behaviour, will find MetricsQL's deliberate divergences cost more than VictoriaMetrics' efficiency saves. Outside those cases, for something new, it is VictoriaMetrics, and I have yet to regret that default.
