The final project for Big Data. Our group used Spark to profile ~2,000 datasets from NYC's Open Data initiative, deriving column metadata (data types and semantic types) for each.
5-Project-Open Data Profiling, Quality, and Analysis.pdf contains the assignment instructions. Biggest Data Report.pdf contains the final group report.
I wrote the custom reducer in basic_metadata.py that derives all of the datatype metadata for Part 1. I wrote all of the code and much of the report for Part 2, as well as the timing code in timing.py.
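basic_metadata.py contains the actual reducer; the snippet below is only a rough, hypothetical sketch of the general idea (counting per-column datatype occurrences with a reduce step), with the type buckets and all names assumed rather than taken from the project code.

```python
# Hypothetical sketch only; basic_metadata.py holds the real reducer for Part 1.
from pyspark.sql import SparkSession

def classify(value):
    # Assumed datatype buckets for the sketch: INTEGER, REAL, TEXT.
    try:
        int(value)
        return "INTEGER"
    except (TypeError, ValueError):
        pass
    try:
        float(value)
        return "REAL"
    except (TypeError, ValueError):
        return "TEXT"

spark = SparkSession.builder.appName("metadata-sketch").getOrCreate()
df = spark.createDataFrame([("1", "x"), ("2.5", "y")], ["a", "b"])
cols = df.columns  # capture a plain list so the closure below stays serializable

# Map each cell to ((column, datatype), 1) and reduce to per-column type counts.
counts = (
    df.rdd
      .flatMap(lambda row: [((c, classify(row[c])), 1) for c in cols])
      .reduceByKey(lambda a, b: a + b)
      .collect()
)
print(counts)
```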
The notes below were written for internal use, but I'm keeping them in the README for reference.
Use the timing module to automate timing of a function that operates on a series of DataFrames. timing.timed accepts a closure; refer to sample.py for a good example of how to call it with one.
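As a minimal sketch of the closure-based pattern (the real timed() signature lives in timing.py, and sample.py is the authoritative usage example, so the names and call shape below are assumptions):

```python
import time
from pyspark.sql import SparkSession

def timed(work):
    """Run a zero-argument closure and return (result, elapsed seconds)."""
    start = time.time()
    result = work()
    return result, time.time() - start

spark = SparkSession.builder.appName("timing-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Capture the DataFrame in a closure so the timing helper stays generic.
rows, seconds = timed(lambda: df.count())
print(rows, seconds)
```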
Typical usage: ./env.sh "<file.py> <ds dir> <number of ds (limit)> <random order>"
The limit and random-order arguments are optional; pass - to skip one.
Example: ./env.sh "sample.py datasets - True"
Refer to cli.py for an explanation of how the arguments are read from the CLI.
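cli.py is the authoritative version; the snippet below is only a hypothetical sketch of how the positional arguments from the env.sh usage above might be read, with the "-" placeholder treated as "argument not passed" (all names here are assumptions).

```python
import sys

def parse_args(argv):
    # Expected positional args: <ds dir> <number of ds (limit)> <random order>.
    ds_dir = argv[1]
    limit = None if len(argv) < 3 or argv[2] == "-" else int(argv[2])
    rand = len(argv) > 3 and argv[3] not in ("-", "False")
    return ds_dir, limit, rand

if __name__ == "__main__":
    print(parse_args(sys.argv))
```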
Run as ./env.sh "basic_metadata.py /user/hm74/NYCOpenData 30 print"