Designing A Wearable Data Pipeline That Can Handle Real Time And History
A practical data architecture for wearable telemetry: mobile sync, ingestion, streaming, lakehouse storage, feature windows, and serving surfaces.
Wearable data looks simple from the outside.
A device records heart rate, sleep, steps, location context, workouts, and sensor events. A user opens the app. The app syncs the data. A dashboard shows a trend.
The backend is where the hard part lives.
Wearable telemetry is not just another event stream. It arrives late, arrives duplicated, arrives out of order, and often arrives in bursts after the device reconnects. Some events are useful immediately. Some only become meaningful after aggregation. Some should never be trusted until the pipeline can explain where they came from.
The architecture has to support two truths at the same time:
- the product needs fresh signals
- the data platform needs replayable history
That is the central design problem.
Data Architecture
The high-level shape is a pipeline with two lanes:
- a hot path for recent events, alerts, and user-facing feedback
- a cold path for raw history, backfills, analytics, and model-ready features
The first diagram captures the main system boundaries. The rest of the post breaks those boundaries down with Mermaid diagrams.
Diagram
The split is deliberate.
The hot path keeps the product responsive. The cold path keeps the data trustworthy.
Why Wearable Data Is Awkward
Most product event pipelines assume events arrive close to when they happened.
Wearables break that assumption.
A watch can collect data for hours without a network connection. A phone can batch sync after the app opens. A device SDK can resend data after a failed acknowledgement. A user can change timezone. Firmware can alter payload shape. The pipeline needs to treat those cases as normal, not exceptional.
That changes the design:
- ingestion should be idempotent
- event time and processing time should be stored separately
- raw payloads should be retained before transformation
- stream jobs should tolerate late data
- aggregates should be recomputable
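The first two points above can be sketched together: derive a stable idempotency key from the fields that identify one physical reading, and record processing time separately from event time. This is a minimal illustration with hypothetical names (`idempotency_key`, `ingest`) and a plain dict standing in for the event store, not a production ingestion layer.

```python
import hashlib
import json

def idempotency_key(event: dict) -> str:
    """Derive a stable key from the fields that identify one physical reading."""
    basis = json.dumps(
        {k: event[k] for k in ("user_id", "device_id", "event_type", "event_time")},
        sort_keys=True,
    )
    return hashlib.sha256(basis.encode()).hexdigest()

def ingest(event: dict, store: dict, received_at: str) -> bool:
    """Write an event exactly once; return False if it was a resend."""
    key = idempotency_key(event)
    if key in store:
        return False  # duplicate delivery: acknowledge to the device, do not rewrite
    # event_time (from the device) and received_at (from the server) stay separate
    store[key] = {**event, "received_at": received_at}
    return True
```

Because the key depends only on the event's identity, a device SDK that resends a batch after a failed acknowledgement lands on the same keys and changes nothing.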
Ingestion Contract
The ingestion API should not try to become the analytics layer. Its job is to accept data safely, validate enough to protect the system, and write events in a form that can be replayed.
Diagram
The important detail is the envelope.
Every event should carry enough metadata to explain itself later:
- `user_id`
- `device_id`
- `source_sdk`
- `event_type`
- `event_time`
- `received_at`
- `idempotency_key`
- `schema_version`
- `raw_payload_hash`
That metadata keeps the pipeline debuggable when the data starts disagreeing with the product.
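The envelope fields above can be pinned down as a small type. This is a sketch, not a schema the post prescribes: the `EventEnvelope` and `envelope_for` names are hypothetical, and the payload hash shown is one simple way to tie every derived row back to the exact bytes that arrived.

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class EventEnvelope:
    user_id: str
    device_id: str
    source_sdk: str
    event_type: str
    event_time: str        # when the device recorded the reading
    received_at: str       # when the backend accepted it
    idempotency_key: str
    schema_version: int
    raw_payload_hash: str  # ties every derived row back to the raw payload

def envelope_for(payload: dict, **meta) -> EventEnvelope:
    """Wrap a raw payload, hashing it so later transformations stay traceable."""
    raw = json.dumps(payload, sort_keys=True).encode()
    return EventEnvelope(raw_payload_hash=hashlib.sha256(raw).hexdigest(), **meta)
```

When a firmware release changes the payload shape, `schema_version` and `source_sdk` are what let you slice the damage by origin instead of guessing.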
Stream Processing
The streaming layer handles operational freshness.
It should deduplicate, enrich, window, and publish recent signals. It should not be the only place where business truth exists.
Diagram
This path should be fast enough for product feedback, but conservative enough that a late sync does not corrupt user history.
Lakehouse Layers
The durable pipeline should move through explicit layers.
Diagram
The key is that each layer has a clear job:
- raw stores what arrived
- bronze makes the payload queryable
- silver makes events canonical
- gold creates business-ready summaries
- feature tables shape data for models and product decisions
That separation makes backfills less dangerous.
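The bronze-to-silver step, canonicalization, can be reduced to a few lines: collapse duplicate deliveries to one row per idempotency key, prefer the earliest-received copy, and order by event time. A hypothetical sketch over plain dicts; a real implementation would run this as a table transformation in the lakehouse engine.

```python
def bronze_to_silver(bronze_rows):
    """Canonicalize: one row per idempotency_key, keeping the earliest-received copy,
    ordered by event time rather than arrival time."""
    canonical = {}
    for row in bronze_rows:
        key = row["idempotency_key"]
        if key not in canonical or row["received_at"] < canonical[key]["received_at"]:
            canonical[key] = row
    return sorted(canonical.values(), key=lambda r: r["event_time"])
```

Because the raw layer keeps what actually arrived, this step can be rerun from scratch whenever the canonicalization rules change.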
Handling Late Events
Late data is not a corner case in wearable systems. It is the default failure mode.
The system needs a correction path.
Diagram
This is where many pipelines get brittle.
If the system only appends new summaries, late events create drift. If it recomputes everything on every update, it becomes expensive. A correction window gives the platform a practical middle ground.
Serving Model
A wearable product usually needs more than one serving surface.
Diagram
The serving store is for low-latency reads. The warehouse is for exploration, audit, and reporting. Feature exports are for models and longer-running jobs.
Trying to force all three workloads into one database usually creates operational debt.
Reliability Checks
The pipeline should have checks that reflect product risk, not just infrastructure uptime.
Useful checks include:
- event volume by device type and SDK version
- sync delay distribution
- duplicate rate
- late-event rate
- missing user-day rate
- aggregate correction count
- schema drift by source version
Those checks answer the questions that matter:
- are devices still syncing?
- are users getting stale insights?
- did a firmware or SDK release change the payload?
- are aggregates being corrected too often?
The Design Rule
The architecture should not pretend wearable data is clean.
It should assume the opposite:
- data will arrive late
- batches will be resent
- schemas will drift
- users will move across timezones
- operational and analytical needs will disagree
A good wearable data pipeline is built around those realities. It keeps raw history replayable, makes the hot path useful, and gives every derived number a path back to the source event.