More than once I have had to clean large CSV files, and PySpark was the tool that helped when memory was an issue, since it processes data in parallel. Here are some of the PySpark ETL pipelines I've created that were useful in my work and personal projects. They can also serve as templates.
My Jars:
- spark-3.3-bigquery-0.30.0
- spark-bigquery-with-dependencies_2.12-0.30.0
- sqlite-jdbc-3.40.0.0
(They can be found in the Maven repositories.)
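
As a rough sketch of how jars like these might be attached to a Spark session: the snippet below builds the comma-separated value for Spark's `spark.jars` setting and passes it to the session builder. The `jars` folder, the `.jar` file extension, the app name, and the helper names `jars_option` and `build_session` are all assumptions for illustration, not part of the pipelines themselves.

```python
from pathlib import Path

# Jar names as listed above. The ".jar" extension and the "jars" folder
# are assumptions about how the files are stored locally.
JAR_NAMES = [
    "spark-3.3-bigquery-0.30.0",
    "spark-bigquery-with-dependencies_2.12-0.30.0",
    "sqlite-jdbc-3.40.0.0",
]


def jars_option(jar_dir="jars", names=JAR_NAMES):
    """Build the comma-separated path list expected by `spark.jars`."""
    return ",".join((Path(jar_dir) / f"{name}.jar").as_posix() for name in names)


def build_session(app_name="etl-template", jar_dir="jars", names=JAR_NAMES):
    """Create (or reuse) a SparkSession with the given jars attached."""
    # Imported here so jars_option() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jars_option(jar_dir, names))
        .getOrCreate()
    )
```

A pipeline that only touches SQLite could call `build_session(names=["sqlite-jdbc-3.40.0.0"])` so that only the jar it needs is loaded; passing a subset like this may also help avoid conflicts between the two BigQuery connector jars.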