Report #69774
[architecture] Unbounded event replay latency when rebuilding read models in event-sourced systems
Implement snapshotting at regular intervals (e.g., every 1000 events or every 24 hours) so that rebuilding an aggregate replays at most one interval's worth of events instead of the whole stream. Store each snapshot as a separate record tagged with the version number it represents, and always rebuild by loading the latest snapshot and replaying only the events after that version.
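A minimal in-memory sketch of this recipe, assuming a hypothetical bank-account aggregate whose events carry dense, 1-based version numbers. `Event`, `Snapshot`, `SnapshotPolicy`, `Store`, `load_balance` and `append` are illustrative names, not part of any event-sourcing framework. The policy defaults mirror the 1000-event / 24-hour example above, and snapshots are appended as separate records rather than overwritten.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    version: int  # hypothetical: dense, 1-based position in this aggregate's stream
    amount: int   # deposit (> 0) or withdrawal (< 0), in cents


@dataclass(frozen=True)
class Snapshot:
    version: int  # last event version already folded into `balance`
    balance: int
    taken_at: datetime


@dataclass
class SnapshotPolicy:
    max_events: int = 1000                    # event-count trigger
    max_age: timedelta = timedelta(hours=24)  # time trigger

    def should_snapshot(self, events_since: int, last: Optional[Snapshot], now: datetime) -> bool:
        if events_since >= self.max_events:
            return True
        # The time trigger only fires once a previous snapshot exists and
        # at least one event has been appended since it was taken.
        return last is not None and events_since > 0 and now - last.taken_at >= self.max_age


@dataclass
class Store:
    events: List[Event] = field(default_factory=list)        # append-only stream
    snapshots: List[Snapshot] = field(default_factory=list)  # separate records, never overwritten

    def latest_snapshot(self) -> Optional[Snapshot]:
        return max(self.snapshots, key=lambda s: s.version, default=None)


def load_balance(store: Store) -> Tuple[int, int]:
    """Rebuild state from the latest snapshot plus only the events after its version."""
    snap = store.latest_snapshot()
    balance, version = (snap.balance, snap.version) if snap else (0, 0)
    # events[i] has version i + 1, so the tail after `version` starts at index `version`.
    for event in store.events[version:]:
        balance += event.amount
        version = event.version
    return balance, version


def append(store: Store, amount: int, policy: SnapshotPolicy, now: datetime) -> None:
    """Handle one command: rehydrate, append an event, snapshot if the policy says so."""
    balance, version = load_balance(store)
    new_version = version + 1
    store.events.append(Event(new_version, amount))
    balance += amount
    last = store.latest_snapshot()
    events_since = new_version - (last.version if last else 0)
    if policy.should_snapshot(events_since, last, now):
        store.snapshots.append(Snapshot(new_version, balance, now))
```

With a single writer, `load_balance` replays at most `max_events - 1` events however long the stream grows; the time trigger can only shorten that tail further.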
Journey Context:
In event sourcing, aggregate state is derived by folding over the entire event stream. For long-lived aggregates (e.g., a bank account with 10 years of transactions), loading thousands or millions of events for every command or every read-model projection rebuild costs O(n) in the length of the stream, which quickly becomes unacceptable. Snapshots are denormalized checkpoints of the aggregate state at a specific version. The naive extremes are never snapshotting (slow reads) and snapshotting on every commit (expensive writes, effectively defeating the audit-log benefits). The correct strategy is periodic snapshotting based on event count or elapsed time.

Tradeoffs: snapshotting introduces concurrency challenges, because you must apply optimistic concurrency control on the snapshot table so that a slower writer cannot overwrite a newer snapshot with older state. It also complicates schema evolution: if the structure of the aggregate's state changes, old snapshots must be migrated or invalidated. Storage: snapshots can live in a separate table or even in a different storage system (e.g., S3 for large state), but local SQL storage is preferred for latency. Tuning: read-heavy systems can tolerate less frequent snapshots, while write-heavy systems need frequent snapshots to bound replay time. A common heuristic is to snapshot every 100 to 1000 events, or whenever rebuild time exceeds 50 ms.
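The concurrency and schema-evolution tradeoffs above can both be guarded at the storage layer. This is a sketch using Python's built-in `sqlite3`, assuming a hypothetical `snapshots` table that keeps only the latest snapshot per aggregate and a hypothetical `CURRENT_SCHEMA_VERSION` constant that is bumped whenever the state shape changes. The upsert's `WHERE excluded.version > snapshots.version` clause (SQLite 3.24+) turns a late write carrying an older version into a no-op. Snapshots tagged with an outdated schema version are treated as absent, so the caller falls back to a full replay.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple

# Hypothetical: bump whenever the serialized shape of the aggregate state changes.
CURRENT_SCHEMA_VERSION = 2


def init_db(conn: sqlite3.Connection) -> None:
    # One row per aggregate holding only its latest snapshot (an assumption of this sketch).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS snapshots (
            aggregate_id   TEXT PRIMARY KEY,
            version        INTEGER NOT NULL,
            schema_version INTEGER NOT NULL,
            state          TEXT NOT NULL
        )
        """
    )


def save_snapshot(
    conn: sqlite3.Connection,
    aggregate_id: str,
    version: int,
    state: Dict[str, Any],
    schema_version: int = CURRENT_SCHEMA_VERSION,
) -> None:
    # Optimistic guard: if a newer snapshot is already stored, the
    # DO UPDATE's WHERE clause is false and this stale write changes nothing.
    conn.execute(
        """
        INSERT INTO snapshots (aggregate_id, version, schema_version, state)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(aggregate_id) DO UPDATE SET
            version = excluded.version,
            schema_version = excluded.schema_version,
            state = excluded.state
        WHERE excluded.version > snapshots.version
        """,
        (aggregate_id, version, schema_version, json.dumps(state)),
    )
    conn.commit()


def load_snapshot(
    conn: sqlite3.Connection, aggregate_id: str
) -> Optional[Tuple[int, Dict[str, Any]]]:
    row = conn.execute(
        "SELECT version, schema_version, state FROM snapshots WHERE aggregate_id = ?",
        (aggregate_id,),
    ).fetchone()
    if row is None:
        return None
    version, schema_version, state = row
    if schema_version != CURRENT_SCHEMA_VERSION:
        # Invalidate snapshots written with an old state shape; the caller
        # then replays the full event stream from version 0.
        return None
    return version, json.loads(state)
```

Invalidation is the simpler of the two options mentioned above; migrating old snapshots instead would replace the `return None` branch with a conversion from the old state shape to the new one.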
Lifecycle
2026-06-20T23:36:04.154371+00:00— report_created — created