This repository is a small project consisting of an ETL pipeline built with Spark (Scala) and a public API:
- Request the following endpoint to download a GZIP file with the daily weather forecast for Mexico by municipality: https://smn.conagua.gob.mx/tools/GUI/webservices/?method=1
- Convert the GZIP into a JSON file
- Read the data with Spark and write it to a Parquet file (see the sketch below)
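For orientation, here is a minimal sketch of those three steps in Scala. The object name `WeatherEtlSketch`, the output file names, and the Spark settings are illustrative assumptions, not the project's actual code (which lives in `etl.Main`).

```scala
import java.io.{BufferedInputStream, FileOutputStream}
import java.net.URL
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.SparkSession

// Minimal sketch of the pipeline; names and paths are illustrative, not the repo's real code.
object WeatherEtlSketch {
  def main(args: Array[String]): Unit = {
    val endpoint = "https://smn.conagua.gob.mx/tools/GUI/webservices/?method=1"

    // 1. Download the GZIP response and decompress it into a local JSON file.
    val in  = new GZIPInputStream(new BufferedInputStream(new URL(endpoint).openStream()))
    val out = new FileOutputStream("forecast.json")
    try {
      val buffer = new Array[Byte](8192)
      Iterator.continually(in.read(buffer)).takeWhile(_ != -1).foreach(n => out.write(buffer, 0, n))
    } finally { in.close(); out.close() }

    // 2. Read the JSON with Spark and write it out as Parquet.
    val spark = SparkSession.builder().appName("etl-conagua").master("local[*]").getOrCreate()
    spark.read.json("forecast.json").write.mode("overwrite").parquet("forecast.parquet")
    spark.stop()
  }
}
```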
It is pretty simple; you just need to check that sbt and Scala are properly installed.
To install dependencies:
sbt compile
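If you need a point of reference for the build definition, a minimal `build.sbt` along these lines is enough for `sbt compile` to pull Spark in; the project name, Scala version, and Spark version shown here are illustrative assumptions, not the repo's actual values.

```scala
// Minimal build.sbt sketch; versions and project name are assumptions, not the repo's actual values.
name := "etl-conagua-scala"
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
```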
If everything compiled successfully, run the app locally on your machine with:
sbt "runMain etl.Main"
To run the tests:
sbt test
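As a point of reference, a unit test in this setup could look like the sketch below; it assumes ScalaTest is on the test classpath, and both the spec name and the behaviour it checks are illustrative, not the project's actual tests.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import org.scalatest.funsuite.AnyFunSuite

// Illustrative spec: round-trips a small JSON payload through GZIP,
// mirroring the pipeline's decompression step.
class GzipRoundTripSpec extends AnyFunSuite {
  test("decompressing a gzipped payload yields the original JSON") {
    val payload    = """{"ok":true}"""
    val compressed = new ByteArrayOutputStream()
    val gzOut      = new GZIPOutputStream(compressed)
    gzOut.write(payload.getBytes("UTF-8")); gzOut.close()

    val gzIn  = new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray))
    val bytes = Iterator.continually(gzIn.read()).takeWhile(_ != -1).map(_.toByte).toArray
    assert(new String(bytes, "UTF-8") == payload)
  }
}
```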
Run docker build -t etl-conagua-scala . to build the Docker image.
After building the image, run docker run -it --rm etl-conagua-scala; you'll see the results in the shell.
Run sbt assembly to build the JAR. This will also run the tests before building the JAR.
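The `sbt assembly` task relies on the sbt-assembly plugin being declared in `project/plugins.sbt`; a minimal sketch is shown below. The plugin version is an illustrative assumption, and wiring the tests to run before assembly depends on the project's own build settings.

```scala
// project/plugins.sbt — sbt-assembly enables the `assembly` task; the version here is an assumption.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
```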
Change the permissions of the shell script: chmod 777 spark-submit-script.sh
Run ./spark-submit-script.sh