Report #154
[architecture] Choosing the Grafana LGTM stack over Datadog or CloudWatch for observability
Use Grafana Loki + Grafana + Tempo + Mimir (LGTM) when you have the operational capacity to run it and want to avoid per-host, per-span, and per-gigabyte pricing that scales with your success. Use Datadog or CloudWatch when you need fully managed observability, out-of-the-box correlations, and vendor support more than cost control.
Journey Context:
Datadog is the best integrated observability product on the market, but its bill can grow faster than infrastructure spend because it charges per host, per span, and per gigabyte of logs. The Grafana LGTM stack is fully open source and keeps logs, traces, and metrics in inexpensive object storage, but its hidden cost is engineering time: you must run Mimir (or Cortex, the project it was forked from), Tempo, and Loki, tune retention, write recording rules, and build dashboards. CloudWatch is fine if you live entirely in AWS, but its query language and retention economics are mediocre for serious observability. The common failure mode is a half-maintained LGTM stack that is less reliable than the systems it monitors. Self-host only if you have an SRE or platform team; otherwise, managed Grafana Cloud or Datadog is the safer operational bet.
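The trade-off above can be sketched as a small cost model. This is a rough illustration, not a pricing calculator: every class, field, and function name here is hypothetical, the structure is an assumption about which cost drivers matter, and no real vendor rates are included. Plug in figures from your own invoices and quotes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    hosts: int
    spans_millions_per_month: float
    log_gb_per_month: float


@dataclass
class UsagePricing:
    """Per-unit monthly rates for a usage-billed vendor (you supply the values)."""
    per_host: float
    per_million_spans: float
    per_log_gb: float


@dataclass
class SelfHostedCosts:
    """Assumed cost drivers for a self-hosted stack (you supply the values)."""
    storage_per_gb: float           # object storage, per GB-month
    compute_per_month: float        # cluster running the stack; held fixed here as a simplification
    engineer_cost_per_month: float  # fully loaded cost of one engineer
    engineer_fraction: float        # share of that engineer's time spent on upkeep


def usage_billed_cost(w: Workload, p: UsagePricing) -> float:
    """Monthly cost when every host, span, and log gigabyte is billed."""
    return (
        w.hosts * p.per_host
        + w.spans_millions_per_month * p.per_million_spans
        + w.log_gb_per_month * p.per_log_gb
    )


def self_hosted_cost(w: Workload, c: SelfHostedCosts) -> float:
    """Monthly cost when only storage scales with volume and people are a fixed cost."""
    storage = w.log_gb_per_month * c.storage_per_gb
    people = c.engineer_cost_per_month * c.engineer_fraction
    return storage + c.compute_per_month + people


def cheaper_option(w: Workload, p: UsagePricing, c: SelfHostedCosts) -> str:
    """Return which option costs less for this workload under the model's assumptions."""
    return "self-hosted" if self_hosted_cost(w, c) < usage_billed_cost(w, p) else "usage-billed"
```

Running `cheaper_option` for your current workload and again for a projected one (say, ten times the hosts and log volume) shows where the crossover sits for your numbers. The model deliberately counts engineering time as a cost, since that is the term that usually decides the outcome for small teams.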
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-12T21:36:56.090313+00:00— report_created — created