
Local-First Architecture Series Final: Observability, Metrics & Operational Excellence

Posted on: December 19, 2025

Welcome, Developer!

In Part 5, we built a correct local-first system:

Correctness is necessary — but it is not sufficient.

Local-first systems can fail silently. This part is about making those failures visible.

The Observability Problem

Traditional client-server systems fail loudly:

Local-first systems fail quietly:

Weeks later, a user reports:

My data looks different on my phone and my laptop.

Observability is how you prevent that.

What You Must Be Able to Answer

A production local-first app must answer:

  1. Are clients syncing?
  2. Are actions flowing in both directions?
  3. Is the WAL growing without bound?
  4. Are replicas converging?
  5. Is reconciliation deterministic in practice?

If you cannot answer these, you are operating blind.

WAL-Centric Metrics

The WAL is your source of truth — and your primary signal.

Minimum metrics to track:

A growing WAL backlog is an early-warning system.
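
As a minimal sketch of what client-side WAL instrumentation could look like, the snippet below derives a backlog count, the age of the oldest unsynced entry, and the on-disk log size. The `WalEntry` shape and field names are illustrative assumptions, not the schema used earlier in this series.

```typescript
// Illustrative sketch: WAL backlog metrics collected on the client.
// The entry shape and field names are assumptions for this example.

interface WalEntry {
  id: string;
  createdAt: number; // epoch milliseconds
  synced: boolean;   // acknowledged by the server?
}

interface WalMetrics {
  pendingEntries: number;     // entries not yet acknowledged
  oldestPendingAgeMs: number; // how long the oldest unsynced entry has waited
  walSizeBytes: number;       // on-disk size of the log
}

function collectWalMetrics(
  entries: WalEntry[],
  walSizeBytes: number,
  now: number = Date.now(),
): WalMetrics {
  const pending = entries.filter((e) => !e.synced);
  const oldestCreatedAt = pending.reduce<number | null>(
    (min, e) => (min === null || e.createdAt < min ? e.createdAt : min),
    null,
  );
  return {
    pendingEntries: pending.length,
    oldestPendingAgeMs: oldestCreatedAt === null ? 0 : now - oldestCreatedAt,
    walSizeBytes,
  };
}
```

Reporting these numbers periodically is enough to alert on "WAL growing without bound" long before a user notices anything.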

Sync Health Metrics

Sync is a pipeline, not a boolean.

Track:

Failures must be phase-specific:
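
Here is one hedged sketch of phase-specific failure tracking: the pipeline is split into hypothetical pull, reconcile, apply, and push phases, and every failure is counted against the phase where it happened. The phase names and the `SyncClient` interface are assumptions for illustration.

```typescript
// Illustrative sketch: attributing sync failures to a specific pipeline phase.
type SyncPhase = "pull" | "reconcile" | "apply" | "push";

const failuresByPhase: Record<SyncPhase, number> = {
  pull: 0,
  reconcile: 0,
  apply: 0,
  push: 0,
};
let successfulSyncs = 0;

async function runPhase<T>(phase: SyncPhase, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    failuresByPhase[phase] += 1; // the failure is tied to where it happened
    throw err;
  }
}

// Hypothetical client interface used only for this sketch.
interface SyncClient {
  pull(): Promise<unknown>;
  reconcile(remote: unknown): Promise<unknown>;
  apply(merged: unknown): Promise<void>;
  push(): Promise<void>;
}

async function syncOnce(client: SyncClient): Promise<void> {
  const remote = await runPhase("pull", () => client.pull());
  const merged = await runPhase("reconcile", () => client.reconcile(remote));
  await runPhase("apply", () => client.apply(merged));
  await runPhase("push", () => client.push());
  successfulSyncs += 1;
}
```

A dashboard built on these counters tells you where sync breaks, not just that it broke.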

Convergence & Drift Detection

Correctness means all replicas converge.

To detect drift:

If two replicas with the same sync cursor report different hashes, determinism is broken.
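
One possible drift check, sketched below: each replica canonicalizes its state at a given sync cursor and reports a hash, and the server flags any cursor where the reported hashes disagree. The canonicalization scheme (sorted keys) and SHA-256 are illustrative choices, not requirements.

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch: replicas report (cursor, stateHash); two replicas at the
// same cursor with different hashes indicate drift.

type ReplicaState = Record<string, unknown>;

// Canonicalize state so hashing does not depend on key order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function stateHash(state: ReplicaState): string {
  return createHash("sha256").update(canonicalize(state)).digest("hex");
}

interface DriftReport {
  replicaId: string;
  syncCursor: string; // position in the shared log this replica has applied up to
  hash: string;
}

// Server-side check: group reports by cursor and flag mismatching hashes.
function detectDrift(reports: DriftReport[]): string[] {
  const byCursor = new Map<string, Set<string>>();
  for (const r of reports) {
    const hashes = byCursor.get(r.syncCursor) ?? new Set<string>();
    hashes.add(r.hash);
    byCursor.set(r.syncCursor, hashes);
  }
  return [...byCursor.entries()]
    .filter(([, hashes]) => hashes.size > 1)
    .map(([cursor]) => cursor);
}
```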

Determinism as a Production Invariant

Reconciliation must always be pure.

The same inputs must produce the same output — every time.

If determinism breaks in production, correctness is lost.
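
A cheap guard, assuming reconciliation is exposed as a pure function, is to occasionally re-run it on the same inputs and compare the results; any disagreement is a determinism violation worth alerting on. The `reconcile` signature below is assumed for the sketch.

```typescript
// Illustrative sketch: a determinism self-check around a reconciliation function.
// Assumes reconcile(state, actions) is pure: no clocks, randomness, or I/O.

type Action = { id: string; type: string; payload: unknown };

function assertDeterministic<S>(
  reconcile: (state: S, actions: Action[]) => S,
  state: S,
  actions: Action[],
): S {
  const first = reconcile(state, actions);
  const second = reconcile(state, actions);
  // In production this could be sampled rather than run on every sync.
  if (JSON.stringify(first) !== JSON.stringify(second)) {
    // Surface as a metric or alert rather than crashing the client.
    console.error("determinism violation: same inputs produced different outputs");
  }
  return first;
}
```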

Operational Checklist

A local-first system is production-ready only if:

Anything less is hope-driven engineering.

Observability Platforms (A Tech Lead’s Take)

The observability concepts in this part are intentionally platform-agnostic. That said, after running real systems in production, I don’t believe tooling is optional — you will need something that can collect, aggregate, and alert on these signals.

As a tech lead, my baseline expectation for any observability stack is simple: it must support metrics, structured logs, and alerts. Everything else is secondary. The vendor matters far less than whether the right signals exist.

Platforms I’ve Seen Work Well

OpenTelemetry (my default starting point)
If I can choose one thing to standardize on, it’s OpenTelemetry. I treat it as infrastructure, not a vendor choice. It keeps instrumentation portable and lets teams evolve their backend without rewriting client code.
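
To make that concrete, here is a minimal sketch of emitting the signals from this article through the OpenTelemetry JavaScript API. The instrument names, attributes, and the `getPendingWalCount` helper are my own illustrative choices, and a metrics SDK plus exporter still has to be configured separately.

```typescript
import { metrics } from "@opentelemetry/api";

// Illustrative sketch: portable instrumentation via the OpenTelemetry API.
// Instrument names are assumptions; wire up an SDK and exporter elsewhere.
const meter = metrics.getMeter("local-first-sync");

// Counter: sync pipeline runs, tagged by outcome and failing phase.
const syncAttempts = meter.createCounter("sync.attempts", {
  description: "Sync pipeline runs, by outcome and phase",
});

// Observable gauge: current WAL backlog, sampled at collection time.
const walBacklog = meter.createObservableGauge("wal.pending_entries", {
  description: "WAL entries not yet acknowledged by the server",
});
walBacklog.addCallback((result) => {
  result.observe(getPendingWalCount());
});

function recordSyncResult(ok: boolean, phase?: string): void {
  syncAttempts.add(
    1,
    ok ? { outcome: "success" } : { outcome: "failure", phase: phase ?? "unknown" },
  );
}

// Hypothetical helper standing in for a real WAL query.
function getPendingWalCount(): number {
  return 0;
}
```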

Datadog
When teams want speed and minimal operational overhead, Datadog works well. I’ve seen it handle:

You trade flexibility for convenience, but for many teams that’s the right call.

Grafana + Prometheus
If you want control and cost predictability, this stack is hard to beat. I like it when:

It requires more discipline, but it scales well with the team.

Azure Monitor / Application Insights
If you’re already on Azure, this is often the pragmatic choice. It integrates cleanly with infrastructure and gives you enough visibility to operate a local-first system without stitching multiple tools together.

Sentry (useful, but not enough on its own)
I treat Sentry as a complement, not a foundation. It’s excellent for:

But errors alone won’t tell you if replicas are slowly diverging.

What Actually Matters

In my experience, teams get stuck debating tools when they should be debating signals.

Regardless of platform, I make sure we can see:

If those signals exist, the tooling can change.

If they don’t, no amount of dashboards will save you.

Final Thoughts

Local-first architecture is distributed systems engineering.

Observability is not optional. It is the difference between assuming correctness and proving it.

This concludes the Local-First Architecture series!

Thank you for following along, Developer 💙