Welcome, Developer!
In Part 5, we built a correct local-first system:
- Offline-first writes
- Bidirectional sync
- Deterministic reconciliation
- Explicit conflict resolution
Correctness is necessary — but it is not sufficient.
Local-first systems can fail silently. This part is about making those failures visible.
The Observability Problem
Traditional client-server systems fail loudly:
- Requests error
- APIs time out
- Pages do not load
Local-first systems fail quietly:
- Writes succeed locally
- Sync partially succeeds
- Replicas diverge
- No alarms fire
Weeks later, a user reports:
"My data looks different on my phone and my laptop."
Observability is how you prevent that.
What You Must Be Able to Answer
A production local-first app must answer:
- Are clients syncing?
- Are actions flowing in both directions?
- Is the WAL growing without bound?
- Are replicas converging?
- Is reconciliation deterministic in practice?
If you cannot answer these, you are operating blind.
WAL-Centric Metrics
The WAL is your source of truth — and your primary signal.
Minimum metrics to track:
- wal.entries.created
- wal.entries.pending
- wal.entries.uploaded
- wal.entries.compacted
- wal.size
A growing WAL backlog is an early-warning system.
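Here is a minimal sketch of what tracking these can look like on the client. The `WalStore` interface and `reportGauge` callback are illustrative assumptions, not part of any particular library — adapt them to however your app stores its write-ahead log and emits metrics.

```typescript
// Minimal sketch of a WAL metrics poller. WalStore and GaugeReporter are
// assumptions for illustration, not a real library API.
interface WalStore {
  countByStatus(status: "pending" | "uploaded" | "compacted"): Promise<number>;
  totalEntries(): Promise<number>;
  approximateSizeBytes(): Promise<number>;
}

type GaugeReporter = (name: string, value: number) => void;

async function reportWalMetrics(wal: WalStore, reportGauge: GaugeReporter): Promise<void> {
  // Each gauge maps directly to the metric names listed above.
  reportGauge("wal.entries.created", await wal.totalEntries());
  reportGauge("wal.entries.pending", await wal.countByStatus("pending"));
  reportGauge("wal.entries.uploaded", await wal.countByStatus("uploaded"));
  reportGauge("wal.entries.compacted", await wal.countByStatus("compacted"));
  reportGauge("wal.size", await wal.approximateSizeBytes());
}

// Poll on an interval so a growing backlog shows up as a trend, not a surprise:
// setInterval(() => reportWalMetrics(walStore, reportGauge), 60_000);
```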
Sync Health Metrics
Sync is a pipeline, not a boolean.
Track:
- Upload success / failure
- Download batch sizes
- Sync duration
- Retry counts
Failures must be phase-specific:
- upload
- download
- reconcile
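One way to keep failures phase-specific is to wrap each phase in a helper that times it and tags the outcome with the phase name. The `runPhase` helper and `SyncResultReporter` below are hypothetical names, sketched to show the shape of the idea:

```typescript
// Hypothetical helper that times a sync phase and reports success/failure
// tagged by phase, so dashboards can break sync health down per phase.
type SyncPhase = "upload" | "download" | "reconcile";

type SyncResultReporter = (event: {
  phase: SyncPhase;
  ok: boolean;
  durationMs: number;
  retryCount: number;
}) => void;

async function runPhase<T>(
  phase: SyncPhase,
  report: SyncResultReporter,
  work: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  const start = Date.now();
  let retryCount = 0;
  // Retries are counted so "retries per phase" is observable as well.
  while (true) {
    try {
      const result = await work();
      report({ phase, ok: true, durationMs: Date.now() - start, retryCount });
      return result;
    } catch (err) {
      if (retryCount >= maxRetries) {
        report({ phase, ok: false, durationMs: Date.now() - start, retryCount });
        throw err;
      }
      retryCount += 1;
    }
  }
}

// Usage: await runPhase("upload", report, () => uploadPendingEntries());
```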
Convergence & Drift Detection
Correctness means all replicas converge.
To detect drift:
- Periodically compute a stable hash of materialized state
- Report or compare hashes across devices
If two replicas with the same sync cursor report different hashes, determinism is broken.
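As an illustration, here is one way to compute that stable hash: canonicalize the materialized state (sorted object keys) and digest it with Web Crypto's SHA-256. The function names are illustrative, but the digest call is the standard `crypto.subtle` API available in browsers and modern Node.

```typescript
// Sketch of a drift probe: a stable fingerprint over materialized state.
// Keys are sorted so the same logical state always serializes identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

async function stateFingerprint(state: unknown): Promise<string> {
  const bytes = new TextEncoder().encode(canonicalize(state));
  const digest = await crypto.subtle.digest("SHA-256", bytes); // Web Crypto, browsers and Node 18+
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Report { syncCursor, fingerprint } periodically; two replicas at the same
// cursor with different fingerprints have drifted.
```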
Determinism as a Production Invariant
Reconciliation must always be pure.
The same inputs must produce the same output — every time.
If determinism breaks in production, correctness is lost.
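A cheap way to enforce this in debug or canary builds is to run reconciliation twice on identical inputs and compare fingerprints. The names below are illustrative; `fingerprint` can be the `stateFingerprint` helper sketched earlier.

```typescript
// Debug-mode guard: pure reconciliation run twice on the same inputs must
// produce the same output. If not, surface it loudly.
async function assertDeterministic<S, A>(
  reconcile: (base: S, actions: A[]) => S,
  fingerprint: (state: S) => Promise<string>,
  base: S,
  actions: A[],
  onViolation: (message: string) => void,
): Promise<S> {
  const first = reconcile(base, actions);
  const second = reconcile(base, actions);
  const [h1, h2] = await Promise.all([fingerprint(first), fingerprint(second)]);
  if (h1 !== h2) {
    // Same inputs, different outputs: the invariant is broken.
    onViolation(`non-deterministic reconciliation: ${h1} !== ${h2}`);
  }
  return first;
}
```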
Operational Checklist
A local-first system is production-ready only if:
- WAL growth is observable
- Sync success is measurable
- Drift is detectable
- Determinism is enforced
- Compaction is safe and automatic
Anything less is hope-driven engineering.
Observability Platforms (A Tech Lead’s Take)
The observability concepts in this part are intentionally platform-agnostic. That said, after running real systems in production, I don’t believe tooling is optional — you will need something that can collect, aggregate, and alert on these signals.
As a tech lead, my baseline expectation for any observability stack is simple: it must support metrics, structured logs, and alerts. Everything else is secondary. The vendor matters far less than whether the right signals exist.
Platforms I’ve Seen Work Well
OpenTelemetry (my default starting point)
If I can choose one thing to standardize on, it’s OpenTelemetry. I treat it as infrastructure, not a vendor choice. It keeps instrumentation portable and lets teams evolve their backend without rewriting client code.
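As a rough sketch of what that portability looks like with the JavaScript API package (the metric names are the ones from earlier in this part; wiring up an SDK and exporter is a separate, swappable step that is omitted here):

```typescript
// Instrumentation against the OpenTelemetry API only -- which SDK or backend
// receives the data is configured elsewhere and can change without touching this code.
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("local-first-sync");

const walPending = meter.createUpDownCounter("wal.entries.pending", {
  description: "WAL entries not yet uploaded",
});
const syncFailures = meter.createCounter("sync.failures", {
  description: "Sync failures, tagged by phase",
});

// Call sites stay the same no matter which backend is plugged in.
walPending.add(1);                        // a new local write landed in the WAL
walPending.add(-1);                       // an entry was uploaded
syncFailures.add(1, { phase: "upload" }); // attributes give phase-level breakdowns
```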
Datadog
When teams want speed and minimal operational overhead, Datadog works well. I’ve seen it handle:
- WAL growth and backlog visibility
- Sync latency across client and server
- Cross-system correlation without custom glue
You trade flexibility for convenience, but for many teams that’s the right call.
Grafana + Prometheus
If you want control and cost predictability, this stack is hard to beat. I like it when:
- You care deeply about custom metrics
- You want explicit drift and determinism alerts
- You’re comfortable owning your dashboards
It requires more discipline, but it scales well with the team.
Azure Monitor / Application Insights
If you’re already on Azure, this is often the pragmatic choice. It integrates cleanly with infrastructure and gives you enough visibility to operate a local-first system without stitching multiple tools together.
Sentry (useful, but not enough on its own)
I treat Sentry as a complement, not a foundation. It’s excellent for:
- Client-side crashes
- Reconciliation exceptions
- Detecting non-deterministic failures
But errors alone won’t tell you if replicas are slowly diverging.
What Actually Matters
In my experience, teams get stuck debating tools when they should be debating signals.
Regardless of platform, I make sure we can see:
- WAL backlog growth
- Sync success and failure by phase
- Reconciliation duration
- Any hint of drift or non-determinism
If those signals exist, the tooling can change.
If they don’t, no amount of dashboards will save you.
Final Thoughts
Local-first architecture is distributed systems engineering.
Observability is not optional. It is the difference between assuming correctness and proving it.
This concludes the Local-First Architecture series!
Thank you for following along, Developer 💙