Legacy-to-cloud data modernization

Python · Kafka · Airflow · Spark · Snowflake · dbt · FastAPI · Azure

Motivation

Legacy services and fragmented pipelines slowed analytics delivery and made reliability difficult to scale across teams.

Thinking model

  • Modernize orchestration and platform in parallel to avoid migration deadlock.
  • Improve reliability first (SLA/backfill/dependency control), then optimize throughput.
  • Keep real-time and batch flows interoperable to avoid duplicated logic.

Architecture

Ingest

Source systems + event streams
Kafka + Python ingest services

Storage

ADLS Gen2 / Snowflake staging

Process

Airflow orchestration + Spark/dbt

Serve

Analytics serving

Ops

SLA + dependency monitoring

Flow edges

  • ingestion: Source systems + event streams → Kafka + Python ingest services
  • landing + staging: Kafka + Python ingest services → ADLS Gen2 / Snowflake staging
  • scheduled + event jobs: ADLS Gen2 / Snowflake staging → Airflow orchestration + Spark/dbt
  • modeled outputs: Airflow orchestration + Spark/dbt → Analytics serving
  • runtime signals: Airflow orchestration + Spark/dbt → SLA + dependency monitoring
  • Migration path keeps business reporting available while internal services are modernized.
  • Orchestration controls are explicit for backfills, SLAs, and dependency safety.
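The dependency-safety guarantee above can be sketched with a plain topological ordering: a task never runs before its upstreams have completed. This is a minimal stdlib illustration, not the production Airflow DAGs; the task names are hypothetical stand-ins for the real pipelines.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline dependency map: each task lists its upstream tasks.
# The real workflows run as Airflow DAGs; this only shows the ordering rule.
DEPS = {
    "ingest_events": [],
    "stage_to_snowflake": ["ingest_events"],
    "dbt_models": ["stage_to_snowflake"],
    "serve_analytics": ["dbt_models"],
}

def run_order(deps):
    """Return a safe execution order: no task precedes its upstreams."""
    ts = TopologicalSorter({task: set(ups) for task, ups in deps.items()})
    return list(ts.static_order())

order = run_order(DEPS)
```

The same property is what makes backfills safe to replay: re-running a window in this order never produces a downstream run built on incomplete upstream data.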

Build

Core components

  • Built ingestion/processing pipelines with Python, Kafka, MySQL, and Elasticsearch.
  • Orchestrated production workflows in Airflow with backfills, SLAs, and dependency management.
  • Modernized legacy services from Flask to FastAPI and expanded real-time processing with Kafka, Redis, and Spark.
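One reliability detail behind Kafka-based ingestion is handling replays: with at-least-once delivery, the same message can arrive twice, so the handler deduplicates by partition and offset before writing downstream. The sketch below illustrates that idea with illustrative field names and an in-memory offset set; a real consumer would use a Kafka client library and durable offset storage.

```python
import json

def process_batch(messages, seen_offsets):
    """Deduplicate by (partition, offset) so replays don't double-write downstream."""
    rows = []
    for msg in messages:
        key = (msg["partition"], msg["offset"])
        if key in seen_offsets:
            continue  # already processed in a prior run or replay
        seen_offsets.add(key)
        rows.append(json.loads(msg["value"]))
    return rows
```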

Quality controls

  • Dependency-aware scheduling to prevent incomplete downstream runs.
  • Migration-era validation checks between legacy and modernized outputs.
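A migration-era parity check can be as simple as comparing row counts plus an order-independent content fingerprint of the legacy and modernized outputs. This is a hedged sketch of that pattern, not the actual validation suite; row shapes and helper names are illustrative.

```python
import hashlib

def table_fingerprint(rows):
    """Row count plus an order-independent hash of the table's contents."""
    digests = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode()).hexdigest()
        for r in rows
    )
    return len(rows), hashlib.sha256("".join(digests).encode()).hexdigest()

def outputs_match(legacy_rows, modern_rows):
    """True when both paths produced the same rows, regardless of order."""
    return table_fingerprint(legacy_rows) == table_fingerprint(modern_rows)
```

Running a check like this on every dual-write window gives a concrete signal for when the legacy path can be retired.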

Observability

  • Operational alerts centered on SLA breaches and pipeline dependency failures.
  • Run-level visibility for backfill and replay operations.
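The SLA-breach signal reduces to a simple rule over run metadata: flag any run that finished later than its allowed window, or never finished at all. A minimal sketch, assuming illustrative field names rather than the real run records:

```python
from datetime import datetime, timedelta

def sla_breaches(runs, sla=timedelta(hours=1)):
    """Return run IDs that missed the SLA (finished late or not at all)."""
    breaches = []
    for run in runs:
        finished = run.get("finished_at")
        if finished is None or finished - run["scheduled_at"] > sla:
            breaches.append(run["run_id"])
    return breaches
```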

Outcomes

Modernization progress

Legacy analytics workloads transitioned to cloud-native platform patterns.

Operational resilience

Airflow-based backfill and SLA workflows formalized production operations.

Real-time readiness

Event-driven services integrated with real-time processing and modern API surfaces.

Tradeoffs

  • Ran hybrid legacy + modern paths during migration to protect reporting continuity.
  • Accepted temporary operational complexity to reduce cutover risk.

Confidentiality note

  • Internal system names and exact dataset shapes are generalized for confidentiality.

Work with me

Building a data platform like this?

I work with teams whose data systems need to be reliable, well-governed, and fast to iterate on.

Start a project