# germany-grid-pipeline

End-to-end data engineering pipeline for German electricity market data: ingest from the public SMARD API, land raw JSON in Amazon S3, then transform with PySpark and Delta Lake on Databricks using a Bronze → Silver → Gold layout. Job definitions live under jobs/.


## Architecture

```mermaid
flowchart LR
  API[SMARD API] -->|Python ingest| S3[(S3 raw-data)]
  S3 -->|JSON| Bronze[Bronze Delta]
  Bronze -->|PySpark| Silver[Silver Delta]
  Silver -->|Aggregates & features| Gold[Gold Delta]
```
| Layer  | Purpose                                                        |
| ------ | -------------------------------------------------------------- |
| Bronze | Raw API-shaped JSON from S3 → Delta tables                     |
| Silver | Flattened time series (timestamps, values), cleansed columns   |
| Gold   | Aggregated tables ready for business analysts and ML engineers |
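The Bronze → Silver step is essentially reshaping SMARD's `[timestamp, value]` pairs into one row per observation. A minimal pure-Python sketch of that transform (the `series` field name follows the SMARD chart-data shape but is an assumption here; the actual notebooks do this in PySpark):

```python
from datetime import datetime, timezone

def flatten_series(payload: dict) -> list[dict]:
    """Flatten a SMARD-style {"series": [[epoch_ms, value], ...]} payload
    into one record per observation (the Silver-layer shape)."""
    rows = []
    for epoch_ms, value in payload.get("series", []):
        rows.append({
            # SMARD timestamps are epoch milliseconds (UTC)
            "ts": datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc),
            "value": value,  # may be None for not-yet-published intervals
        })
    # Drop missing intervals, as a cleansed Silver table would
    return [r for r in rows if r["value"] is not None]
```

In PySpark the same idea becomes an `explode` over the nested array followed by a null filter.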

## Repository layout

```text
├── src/api/                 # Python SMARD client (local + S3 ingest)
├── databricks_notebooks/    # Bronze / Silver / Gold notebooks (.py source)
├── jobs/                    # Databricks Job JSON (see jobs/README.md)
├── req.txt
└── README.md
```

## Local setup (ingestion)

```shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r req.txt
```

For S3 upload (`smard_clientS3.py`), configure AWS credentials (e.g. `~/.aws/credentials` locally, or an IAM instance profile on Databricks).
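A condensed sketch of what a fetch-and-upload client can look like (the SMARD URL pattern, filter ID, and bucket/key layout below are illustrative assumptions; the real logic lives in `src/api/`):

```python
import json
from datetime import date

# Assumed SMARD chart-data URL pattern; verify against src/api/ before use.
SMARD_BASE = "https://www.smard.de/app/chart_data"

def build_smard_url(filter_id: int, region: str, resolution: str, ts_ms: int) -> str:
    """Build a chart-data URL for one filter/region/resolution/timestamp."""
    return (f"{SMARD_BASE}/{filter_id}/{region}/"
            f"{filter_id}_{region}_{resolution}_{ts_ms}.json")

def build_s3_key(filter_id: int, run_date: date) -> str:
    """Date-partitioned raw-layer key, so Bronze can read one day at a time."""
    return f"raw/smard/{filter_id}/{run_date:%Y/%m/%d}/data.json"

def ingest_to_s3(bucket: str, filter_id: int, region: str,
                 resolution: str, ts_ms: int) -> str:
    """Fetch one SMARD payload and land it in S3; returns the object key."""
    # Lazy imports keep the module importable without requests/boto3 installed.
    import requests
    import boto3
    resp = requests.get(build_smard_url(filter_id, region, resolution, ts_ms),
                        timeout=30)
    resp.raise_for_status()
    key = build_s3_key(filter_id, date.today())
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(resp.json()).encode())
    return key
```

Date-partitioning the raw keys keeps each Bronze batch read bounded to one day's objects.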


## Databricks jobs & notebook paths

After you clone this repo with Databricks Repos, update `notebook_path` in `jobs/*.json` to match your workspace. See `jobs/README.md` for placeholders (`YOUR_DATABRICKS_USER`, repo folder name).

Cross-job "run next job" tasks were removed from the JSON exports because job IDs are workspace-specific. Chain bronze → silver → gold in the Jobs UI, or re-add Run Job tasks after import.
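If you re-add the chaining in JSON instead of the UI, a Run Job task in the Jobs 2.1 format looks roughly like this (`job_id` and the task keys are placeholders for your workspace):

```json
{
  "task_key": "trigger_silver",
  "depends_on": [{ "task_key": "bronze_ingest" }],
  "run_job_task": { "job_id": 123456789 }
}
```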


## Tech stack

Python · requests · boto3 · Databricks · PySpark · Delta Lake · Unity Catalog

