
pyspark-ETL

More than once I've had to clean big CSVs, and PySpark was the tool that helped when memory was an issue, since it processes data in parallel. Here are some of the PySpark ETL pipelines I've created that were useful in my work and personal projects; they can serve as templates.
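As a rough idea of the pattern these pipelines follow, here is a minimal sketch of a CSV-cleaning job (the file paths and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build a SparkSession; Spark reads the CSV in partitions,
# so the whole file never has to fit in memory at once.
spark = SparkSession.builder.appName("csv-cleaning-etl").getOrCreate()

# Extract: read the raw CSV with a header row and inferred types.
raw = spark.read.csv("data/raw_big_file.csv", header=True, inferSchema=True)

# Transform: drop duplicates, trim a string column, and remove rows
# with nulls in key columns (column names are placeholders).
clean = (
    raw.dropDuplicates()
       .withColumn("name", F.trim(F.col("name")))
       .dropna(subset=["id", "name"])
)

# Load: write the cleaned data back out as Parquet.
clean.write.mode("overwrite").parquet("data/clean_output")

spark.stop()
```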

My JARs:

  • spark-3.3-bigquery-0.30.0
  • spark-bigquery-with-dependencies_2.12-0.30.0
  • sqlite-jdbc-3.40.0.0

(they can be found in the Maven repositories)
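A minimal sketch of how such JARs can be wired into a session; the paths below are hypothetical, so adjust them to wherever the JARs live on your machine:

```python
from pyspark.sql import SparkSession

# Point Spark at the connector JARs (example paths, not part of the repo).
jars = ",".join([
    "jars/spark-3.3-bigquery-0.30.0.jar",
    "jars/spark-bigquery-with-dependencies_2.12-0.30.0.jar",
    "jars/sqlite-jdbc-3.40.0.0.jar",
])

spark = (
    SparkSession.builder
    .appName("etl-with-connectors")
    .config("spark.jars", jars)
    .getOrCreate()
)

# With the SQLite JDBC driver on the classpath, a local database
# can be read as a DataFrame (database file and table name are illustrative).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlite:data/example.db")
    .option("dbtable", "some_table")
    .option("driver", "org.sqlite.JDBC")
    .load()
)
```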
