Welcome, Developer!
In Part 5, we built a correct local-first system:
- Offline-first writes
- Bidirectional sync
- Deterministic reconciliation
- Explicit conflict resolution
Correctness is necessary — but it is not sufficient.
Local-first systems can fail silently. This part is about making those failures visible.
The Observability Problem
Traditional client-server systems fail loudly:
- Requests error
- APIs time out
- Pages do not load
Local-first systems fail quietly:
- Writes succeed locally
- Sync partially succeeds
- Replicas diverge
- No alarms fire
Weeks later, a user reports:
"My data looks different on my phone and my laptop."
Observability is how you prevent that.
What You Must Be Able to Answer
A production local-first app must answer:
- Are clients syncing?
- Are actions flowing in both directions?
- Is the WAL growing without bound?
- Are replicas converging?
- Is reconciliation deterministic in practice?
If you cannot answer these, you are operating blind.
WAL-Centric Metrics
The WAL is your source of truth — and your primary signal.
Minimum metrics to track:
- wal.entries.created
- wal.entries.pending
- wal.entries.uploaded
- wal.entries.compacted
- wal.size
A growing WAL backlog is an early-warning system.
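Here is a minimal sketch of what tracking these can look like on the client. The `WalStore` interface and `reportGauge` callback are illustrative assumptions, not part of any particular library — adapt them to however your app stores its write-ahead log and emits metrics.

```typescript
// Minimal sketch of a WAL metrics poller. WalStore and GaugeReporter are
// assumptions for illustration, not a real library API.
interface WalStore {
  countByStatus(status: "pending" | "uploaded" | "compacted"): Promise<number>;
  totalEntries(): Promise<number>;
  approximateSizeBytes(): Promise<number>;
}

type GaugeReporter = (name: string, value: number) => void;

async function reportWalMetrics(wal: WalStore, reportGauge: GaugeReporter): Promise<void> {
  // Each gauge maps directly to the metric names listed above.
  reportGauge("wal.entries.created", await wal.totalEntries());
  reportGauge("wal.entries.pending", await wal.countByStatus("pending"));
  reportGauge("wal.entries.uploaded", await wal.countByStatus("uploaded"));
  reportGauge("wal.entries.compacted", await wal.countByStatus("compacted"));
  reportGauge("wal.size", await wal.approximateSizeBytes());
}

// Poll on an interval so a growing backlog shows up as a trend, not a surprise:
// setInterval(() => reportWalMetrics(walStore, reportGauge), 60_000);
```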
Sync Health Metrics
Sync is a pipeline, not a boolean.
Track:
- Upload success / failure
- Download batch sizes
- Sync duration
- Retry counts
Failures must be phase-specific:
- upload
- download
- reconcile
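One way to keep failures phase-specific is to wrap each phase in a helper that times it and tags the outcome with the phase name. The `runPhase` helper and `SyncResultReporter` below are hypothetical names, sketched to show the shape of the idea:

```typescript
// Hypothetical helper that times a sync phase and reports success/failure
// tagged by phase, so dashboards can break sync health down per phase.
type SyncPhase = "upload" | "download" | "reconcile";

type SyncResultReporter = (event: {
  phase: SyncPhase;
  ok: boolean;
  durationMs: number;
  retryCount: number;
}) => void;

async function runPhase<T>(
  phase: SyncPhase,
  report: SyncResultReporter,
  work: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  const start = Date.now();
  let retryCount = 0;
  // Retries are counted so "retries per phase" is observable as well.
  while (true) {
    try {
      const result = await work();
      report({ phase, ok: true, durationMs: Date.now() - start, retryCount });
      return result;
    } catch (err) {
      if (retryCount >= maxRetries) {
        report({ phase, ok: false, durationMs: Date.now() - start, retryCount });
        throw err;
      }
      retryCount += 1;
    }
  }
}

// Usage: await runPhase("upload", report, () => uploadPendingEntries());
```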
Convergence & Drift Detection
Correctness means all replicas converge.
To detect drift:
- Periodically compute a stable hash of materialized state
- Report or compare hashes across devices
If two replicas with the same sync cursor report different hashes, determinism is broken.
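As an illustration, here is one way to compute that stable hash: canonicalize the materialized state (sorted object keys) and digest it with Web Crypto's SHA-256. The function names are illustrative, but the digest call is the standard `crypto.subtle` API available in browsers and modern Node.

```typescript
// Sketch of a drift probe: a stable fingerprint over materialized state.
// Keys are sorted so the same logical state always serializes identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

async function stateFingerprint(state: unknown): Promise<string> {
  const bytes = new TextEncoder().encode(canonicalize(state));
  const digest = await crypto.subtle.digest("SHA-256", bytes); // Web Crypto, browsers and Node 18+
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Report { syncCursor, fingerprint } periodically; two replicas at the same
// cursor with different fingerprints have drifted.
```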
Determinism as a Production Invariant
Reconciliation must always be pure.
The same inputs must produce the same output — every time.
If determinism breaks in production, correctness is lost.
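A cheap way to enforce this in debug or canary builds is to run reconciliation twice on identical inputs and compare fingerprints. The names below are illustrative; `fingerprint` can be the `stateFingerprint` helper sketched earlier.

```typescript
// Debug-mode guard: pure reconciliation run twice on the same inputs must
// produce the same output. If not, surface it loudly.
async function assertDeterministic<S, A>(
  reconcile: (base: S, actions: A[]) => S,
  fingerprint: (state: S) => Promise<string>,
  base: S,
  actions: A[],
  onViolation: (message: string) => void,
): Promise<S> {
  const first = reconcile(base, actions);
  const second = reconcile(base, actions);
  const [h1, h2] = await Promise.all([fingerprint(first), fingerprint(second)]);
  if (h1 !== h2) {
    // Same inputs, different outputs: the invariant is broken.
    onViolation(`non-deterministic reconciliation: ${h1} !== ${h2}`);
  }
  return first;
}
```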
Operational Checklist
A local-first system is production-ready only if:
- WAL growth is observable
- Sync success is measurable
- Drift is detectable
- Determinism is enforced
- Compaction is safe and automatic
Anything less is hope-driven engineering.
Observability Platforms (A Tech Lead’s Take)
The observability concepts in this part are intentionally platform-agnostic. That said, after running real systems in production, I don’t believe tooling is optional — you will need something that can collect, aggregate, and alert on these signals.
As a tech lead, my baseline expectation for any observability stack is simple: it must support metrics, structured logs, and alerts. Everything else is secondary. The vendor matters far less than whether the right signals exist.
Platforms I’ve Seen Work Well
OpenTelemetry (my default starting point)
If I can choose one thing to standardize on, it’s OpenTelemetry. I treat it as infrastructure, not a vendor choice. It keeps instrumentation portable and lets teams evolve their backend without rewriting client code.
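As a rough sketch of what that portability looks like with the JavaScript API package (the metric names are the ones from earlier in this part; wiring up an SDK and exporter is a separate, swappable step that is omitted here):

```typescript
// Instrumentation against the OpenTelemetry API only -- which SDK or backend
// receives the data is configured elsewhere and can change without touching this code.
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("local-first-sync");

const walPending = meter.createUpDownCounter("wal.entries.pending", {
  description: "WAL entries not yet uploaded",
});
const syncFailures = meter.createCounter("sync.failures", {
  description: "Sync failures, tagged by phase",
});

// Call sites stay the same no matter which backend is plugged in.
walPending.add(1);                        // a new local write landed in the WAL
walPending.add(-1);                       // an entry was uploaded
syncFailures.add(1, { phase: "upload" }); // attributes give phase-level breakdowns
```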
Datadog
When teams want speed and minimal operational overhead, Datadog works well. I’ve seen it handle:
- WAL growth and backlog visibility
- Sync latency across client and server
- Cross-system correlation without custom glue
You trade flexibility for convenience, but for many teams that’s the right call.
Grafana + Prometheus
If you want control and cost predictability, this stack is hard to beat. I like it when:
- You care deeply about custom metrics
- You want explicit drift and determinism alerts
- You’re comfortable owning your dashboards
It requires more discipline, but it scales well with the team.
Azure Monitor / Application Insights
If you’re already on Azure, this is often the pragmatic choice. It integrates cleanly with infrastructure and gives you enough visibility to operate a local-first system without stitching multiple tools together.
Sentry (useful, but not enough on its own)
I treat Sentry as a complement, not a foundation. It’s excellent for:
- Client-side crashes
- Reconciliation exceptions
- Detecting non-deterministic failures
But errors alone won’t tell you if replicas are slowly diverging.
What Actually Matters
In my experience, teams get stuck debating tools when they should be debating signals.
Regardless of platform, I make sure we can see:
- WAL backlog growth
- Sync success and failure by phase
- Reconciliation duration
- Any hint of drift or non-determinism
If those signals exist, the tooling can change.
If they don’t, no amount of dashboards will save you.
Final Thoughts
Local-first architecture is distributed systems engineering.
Observability is not optional. It is the difference between assuming correctness and proving it.
This concludes the Local-First Architecture series!
Thank you for following along, Developer 💙