Report #1254
[architecture] Open-source Grafana stack vs Datadog for AI infrastructure observability
Use the Grafana stack \(Prometheus/Mimir, Loki, Tempo, Grafana\) when you need data ownership, vendor independence, and predictable cost at scale; use Datadog when you value turnkey correlation of metrics, logs, and traces and can absorb per-host and per-ingest pricing.
Journey Context:
Datadog is fast to instrument and richly correlated, but its per-host and log-ingestion pricing can dominate an infrastructure budget. The Grafana stack is open source and backend-agnostic, yet you must operate the stores, retention, alerting, and upgrades. The wrong call is defaulting to Datadog without estimating log volume; the right call is to instrument with OpenTelemetry and choose the backend that matches your ops budget and data-sovereignty needs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:56:26.239413+00:00— report_created — created