Skip to content

Latest commit

 

History

History
79 lines (55 loc) · 4.06 KB

File metadata and controls

79 lines (55 loc) · 4.06 KB

Azure Projects Repository

This collection showcases end-to-end data engineering solutions and advanced analytics implementations. Below, you’ll find highlights of my work with Azure services.


Featured Projects

An end-to-end solution leveraging the Medallion Architecture to ingest, process, and analyze data using Azure Data Factory (ADF) and Azure Databricks.

Key Features:

  • Staged data transformations across Bronze, Silver, and Gold layers.
  • Automated pipelines for data ingestion, transformation, and load.
  • Integration with Power BI for seamless analytics.

A demonstration of serverless analytics with Azure Synapse Analytics, showcasing advanced SQL features and seamless integration with Power BI for insights.

Key Features:

  • Implemented CETAS (Create External Tables As Select) and Incremental Load Design.
  • Utilized Change Data Capture (CDC) for real-time updates.
  • Demonstrated SQL performance monitoring and query optimization.

Multiple pipelines developed to demonstrate below functions

  1. Data Ingestion Pipeline: Automated ingestion of structured and unstructured data into Azure Data Lake.
  2. Transformations Pipeline: ETL workflows built for scalable data processing.
  3. Orchestration Pipeline: Dependencies managed using pipeline chaining, conditional execution, and alerts for monitoring.

ETL

  • Scalable enterprise data platform built with Azure Databricks and Azure Data Factory
  • Automated, end-to-end ETL for car sales data, incrementally loading from GitHub API and Azure SQL Database into ADLS Gen2 using parameterized ADF pipelines
  • Data processed through the Medallion architecture (Bronze, Silver, Gold layers) orchestrated by Databricks Workflows
  • Implements Change Data Capture (CDC) for fact tables and Slowly Changing Dimensions (SCD Type 1) for dimension tables
  • Enforces data governance and security with Unity Catalog
  • Delivers a star schema modeled in Delta tables for efficient analytics and BI use

DLT

Project with steps for a data processing pipeline using Delta Live Tables (DLT) showcasing -

  • Incremental Loading: Streaming Tables automatically process only new data on each pipeline run.

  • Schema Evolution: Adding/modifying columns or renaming tables is handled automatically by DLT.

  • Autoloader Integration: Integrated Autoloader (spark.readStream.format("cloudFiles")) to ingest files from a landing volume. Configured with options for schema hinting, schema location, file format, and path glob filter. DLT managed checkpoint location for Autoloader automatically.

  • Append Flow: Used @dlt.append_flow to combine streaming data from multiple sources into a union Streaming Table.

  • Passing Parameters (Dynamic Tables): Pipeline configurations can be accessed within the DLT notebook using spark.conf.get. Example: dynamically creating separate Gold Materialized Views filtered by order status.

  • Change Data Capture (CDC) with apply_changes: Used @dlt.apply_changes for SCD Type 1 and 2. Tracked historical changes and handled deletes/truncates. Updated downstream logic to read from SCD Type 2 table and filter for active records.

  • Data Quality with Expectations: Defined rules using @dlt.expect and @dlt.expect_all. Actions: Warning (default), Drop, Fail. Data quality metrics shown in UI and event logs.