Report #96718
[architecture] Rebuilding aggregate state by replaying millions of events on every read causes unbounded latency and availability risks
Treat snapshots as an ephemeral cache only, never as the source of truth. Store snapshots separately from the event store (e.g., aggregate_version + serialized_state) with TTL or version-based invalidation. On load, compare the snapshot version with the event store and replay only the events newer than the snapshot. Rebuild snapshots asynchronously (e.g., via CDC or a saga) to keep the read path under 10 ms. Never allow the application to write snapshots as the primary persistence.
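As a rough illustration of the storage shape described above, the following sketch keeps snapshots in a disposable in-memory cache with both TTL and version-based invalidation. The names `Snapshot`, `SnapshotCache`, `store_version`, and `created_at` are hypothetical, and the dict stands in for whatever separate store (Redis, a side table) is actually used.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    aggregate_id: str
    aggregate_version: int   # version of the last event folded into the state
    serialized_state: bytes
    created_at: float        # epoch seconds, used only for TTL expiry


class SnapshotCache:
    """Disposable cache: losing every entry only costs a longer replay."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._items: Dict[str, Snapshot] = {}

    def put(self, snap: Snapshot) -> None:
        current = self._items.get(snap.aggregate_id)
        # Never let an older snapshot overwrite a newer one.
        if current is None or snap.aggregate_version >= current.aggregate_version:
            self._items[snap.aggregate_id] = snap

    def get(self, aggregate_id: str, store_version: int,
            now: Optional[float] = None) -> Optional[Snapshot]:
        snap = self._items.get(aggregate_id)
        if snap is None:
            return None
        now = time.time() if now is None else now
        expired = now - snap.created_at > self._ttl
        # A snapshot claiming more events than the store holds has diverged
        # from the source of truth, so it is dropped rather than trusted.
        diverged = snap.aggregate_version > store_version
        if expired or diverged:
            del self._items[aggregate_id]
            return None
        return snap
```

A cache miss here is never an error: the caller simply falls back to replaying from the event store, which is what keeps the snapshot from becoming primary persistence.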
Journey Context:
Naive event sourcing replays the entire event stream (potentially millions of events) to hydrate an aggregate, so read latency is O(n) in the stream length and grows without bound as the system ages. Some teams try to "optimize" by writing the aggregate state back to the event store as a "snapshot event," but this creates split-brain if the snapshot diverges from the actual event history (e.g., due to retroactive corrections or schema evolution). The correct approach treats the event store as the only source of truth, with snapshots stored in a separate, disposable cache (Redis, a separate table, or local in-memory) keyed by aggregate ID and version. The read path loads the latest snapshot, queries the event store for events with version > snapshot.version, applies them, and optionally writes the updated snapshot back. Snapshot generation should be asynchronous (e.g., triggered by event handlers) to keep the write path fast. With this pattern a read costs O(k), where k is the number of events since the last snapshot (effectively O(1) when the snapshot is current), rather than O(n) in the total number of events.
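The read path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `Event`, `BalanceSnapshot`, `InMemoryEventStore`, `read_after`, and `load_balance` are hypothetical names, the aggregate is a toy account balance, and event versions are assumed to be 1-based and contiguous.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    version: int   # 1-based, contiguous per aggregate (assumption)
    amount: int


@dataclass(frozen=True)
class BalanceSnapshot:
    version: int   # last event version folded into `balance`
    balance: int


class InMemoryEventStore:
    """Stand-in for the real event store, which stays the source of truth."""

    def __init__(self) -> None:
        self._streams: Dict[str, List[Event]] = {}

    def append(self, aggregate_id: str, amount: int) -> Event:
        stream = self._streams.setdefault(aggregate_id, [])
        event = Event(version=len(stream) + 1, amount=amount)
        stream.append(event)
        return event

    def read_after(self, aggregate_id: str, version: int) -> List[Event]:
        # Contiguous 1-based versions let a slice select "version > N".
        return self._streams.get(aggregate_id, [])[version:]


def load_balance(aggregate_id: str, store: InMemoryEventStore,
                 snapshots: Dict[str, BalanceSnapshot]) -> Tuple[BalanceSnapshot, int]:
    """Hydrate from the latest snapshot plus only the newer events.

    Returns the up-to-date state and the number of events replayed.
    """
    snap = snapshots.get(aggregate_id)
    version, balance = (snap.version, snap.balance) if snap else (0, 0)
    replayed = 0
    for event in store.read_after(aggregate_id, version):
        balance += event.amount
        version = event.version
        replayed += 1
    return BalanceSnapshot(version, balance), replayed


if __name__ == "__main__":
    store, snapshots = InMemoryEventStore(), {}
    for _ in range(1000):
        store.append("acct-1", 1)
    state, replayed = load_balance("acct-1", store, snapshots)
    print(replayed)                 # 1000: no snapshot yet, full replay
    snapshots["acct-1"] = state     # could be done by a background worker
    for _ in range(3):
        store.append("acct-1", 10)
    state, replayed = load_balance("acct-1", store, snapshots)
    print(replayed, state.balance)  # 3 1030
```

`load_balance` returns the refreshed state instead of writing it to the cache itself, so the caller can hand the write-back to an asynchronous job. Clearing `snapshots` changes only the replay cost, never the result.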
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:55:39.472739+00:00— report_created — created