End-to-end data engineering pipeline for German electricity market data: ingest from the public SMARD API, land raw JSON in Amazon S3, then transform with PySpark and Delta Lake on Databricks using a Bronze → Silver → Gold layout. Job definitions live under jobs/.
```mermaid
flowchart LR
    API[SMARD API] -->|Python ingest| S3[(S3 raw-data)]
    S3 -->|JSON| Bronze[Bronze Delta]
    Bronze -->|PySpark| Silver[Silver Delta]
    Silver -->|Aggregates & features| Gold[Gold Delta]
```
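The ingest step is a plain Python client that pulls JSON from the SMARD API and lands it unchanged in S3. Below is a minimal sketch, assuming SMARD's public chart_data URL layout, the filter/region codes, and a placeholder bucket name; the real client lives in src/api/.

```python
import json

import boto3
import requests

# Assumed SMARD chart_data layout (filter 410 = total grid load, region DE,
# hourly resolution); verify the codes against the real client in src/api/.
BASE = "https://www.smard.de/app/chart_data"
FILTER, REGION, RESOLUTION = 410, "DE", "hour"


def fetch_latest_series() -> tuple[int, dict]:
    # The index endpoint lists the available series timestamps; take the newest.
    index = requests.get(f"{BASE}/{FILTER}/{REGION}/index_{RESOLUTION}.json", timeout=30)
    index.raise_for_status()
    ts = max(index.json()["timestamps"])
    data = requests.get(
        f"{BASE}/{FILTER}/{REGION}/{FILTER}_{REGION}_{RESOLUTION}_{ts}.json",
        timeout=30,
    )
    data.raise_for_status()
    return ts, data.json()


def land_in_s3(ts: int, payload: dict, bucket: str = "raw-data") -> None:
    # Land the raw, API-shaped JSON untouched; Bronze reads it from here.
    key = f"smard/{FILTER}/{REGION}/{RESOLUTION}/{ts}.json"
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=json.dumps(payload).encode("utf-8")
    )


if __name__ == "__main__":
    land_in_s3(*fetch_latest_series())
```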
| Layer | Purpose |
|---|---|
| Bronze | Raw API-shaped JSON from S3 → Delta tables |
| Silver | Flattened time series (timestamps, values), cleansed columns |
| Gold | Aggregates and features ready for business analysts and ML engineers |
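A rough sketch of what the three notebooks do. The catalog/schema names, the S3 prefix, and the shape of the SMARD payload (a `series` array of `[timestamp_ms, value]` pairs) are assumptions here; see databricks_notebooks/ for the actual logic. `spark` is the session Databricks provides.

```python
from pyspark.sql import functions as F

# Bronze: land the raw API-shaped JSON from S3 as-is in a Delta table.
raw = spark.read.json("s3://raw-data/smard/")  # assumed bucket/prefix
raw.write.format("delta").mode("append").saveAsTable("bronze.smard_raw")

# Silver: flatten the series ([timestamp_ms, value] pairs) into typed,
# cleansed rows.
silver = (
    spark.table("bronze.smard_raw")
    .select(F.explode("series").alias("point"))
    .select(
        F.timestamp_seconds(F.col("point")[0] / 1000).alias("ts"),  # epoch ms
        F.col("point")[1].cast("double").alias("value"),
    )
    .dropna(subset=["value"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.smard_load")

# Gold: analyst-ready daily aggregates.
gold = (
    silver.groupBy(F.to_date("ts").alias("date"))
    .agg(F.avg("value").alias("avg_mw"), F.max("value").alias("peak_mw"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.smard_load_daily")
```

The `overwrite` mode keeps this sketch idempotent; the real notebooks may well append or merge instead.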
```
.
├── src/api/                 # Python SMARD client (local + S3 ingest)
├── databricks_notebooks/    # Bronze / Silver / Gold notebooks (.py source)
├── jobs/                    # Databricks Job JSON (see jobs/README.md)
├── req.txt
└── README.md
```
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r req.txt
```

For S3 upload (smard_clientS3.py), configure AWS credentials (e.g. ~/.aws/credentials locally, or an IAM instance profile on Databricks).
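To fail fast on missing or misconfigured credentials, you can check that boto3's default chain resolves something before running the client. A minimal sketch using standard STS/S3 calls; the bucket name is a placeholder.

```python
import boto3

# boto3 resolves credentials via its default chain: environment variables,
# ~/.aws/credentials, then the instance/role profile on Databricks.
sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])  # raises if nothing is configured

# Optional: confirm the target bucket is reachable (placeholder name).
boto3.client("s3").head_bucket(Bucket="raw-data")
```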
After you clone this repo with Databricks Repos, update the notebook_path values in jobs/*.json to match your workspace. See jobs/README.md for the placeholders (YOUR_DATABRICKS_USER, repo folder name).
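A throwaway helper along these lines can do the substitution in one pass. This is hypothetical: the placeholder strings and values below are assumptions, so check jobs/README.md for the real ones.

```python
import json
from pathlib import Path

# Hypothetical one-off patcher: swap the placeholders documented in
# jobs/README.md for your workspace user and repo folder name.
REPLACEMENTS = {
    "YOUR_DATABRICKS_USER": "me@example.com",  # your workspace user
    "REPO_FOLDER_NAME": "smard-pipeline",      # assumed placeholder spelling
}

for job_file in Path("jobs").glob("*.json"):
    text = job_file.read_text()
    for placeholder, value in REPLACEMENTS.items():
        text = text.replace(placeholder, value)
    json.loads(text)  # sanity check: the file is still valid JSON
    job_file.write_text(text)
    print(f"patched {job_file}")
```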
Cross-job “run next job” tasks were removed from the JSON exports because job IDs are workspace-specific. Chain bronze → silver → gold in the Jobs UI, or re-add Run Job tasks after import.
Python · requests · boto3 · Databricks · PySpark · Delta Lake · Unity Catalog