The final project for Big Data. Our group used Spark to profile ~2,000 datasets from NYC's Open Data initiative, deriving column metadata (data types and semantic types) for each.
5-Project-Open Data Profiling, Quality, and Analysis.pdf contains the assignment instructions. Biggest Data Report.pdf contains the final group report.
I wrote the custom reducer in basic_metadata.py that derives all of the datatype metadata for Part 1. I wrote all of the code and much of the report for Part 2, as well as the timing code in timing.py.
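basic_metadata.py contains the actual reducer; the snippet below is only a rough, hypothetical sketch of the general idea (counting per-column datatype occurrences with a reduce step), with the type buckets and all names assumed rather than taken from the project code.

```python
# Hypothetical sketch only; basic_metadata.py holds the real reducer for Part 1.
from pyspark.sql import SparkSession

def classify(value):
    # Assumed datatype buckets for the sketch: INTEGER, REAL, TEXT.
    try:
        int(value)
        return "INTEGER"
    except (TypeError, ValueError):
        pass
    try:
        float(value)
        return "REAL"
    except (TypeError, ValueError):
        return "TEXT"

spark = SparkSession.builder.appName("metadata-sketch").getOrCreate()
df = spark.createDataFrame([("1", "x"), ("2.5", "y")], ["a", "b"])
cols = df.columns  # capture a plain list so the closure below stays serializable

# Map each cell to ((column, datatype), 1) and reduce to per-column type counts.
counts = (
    df.rdd
      .flatMap(lambda row: [((c, classify(row[c])), 1) for c in cols])
      .reduceByKey(lambda a, b: a + b)
      .collect()
)
print(counts)
```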
The notes below were written for internal use, but I'm keeping them in the README for reference.
Use the timing module to automate timing of a function that operates on a series of DataFrames. timing.timed accepts a closure; refer to sample.py for a good example of how to call it with one.
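As a minimal sketch of the closure-based pattern (the real timed() signature lives in timing.py, and sample.py is the authoritative usage example, so the names and call shape below are assumptions):

```python
import time
from pyspark.sql import SparkSession

def timed(work):
    """Run a zero-argument closure and return (result, elapsed seconds)."""
    start = time.time()
    result = work()
    return result, time.time() - start

spark = SparkSession.builder.appName("timing-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Capture the DataFrame in a closure so the timing helper stays generic.
rows, seconds = timed(lambda: df.count())
print(rows, seconds)
```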
Typical usage: ./env.sh "<file.py> <ds dir> <number of ds (limit)> <random order>"
The limit and random-order arguments are optional; pass - to skip one.
Example: ./env.sh "sample.py datasets - True"
Refer to cli.py for an explanation of how the arguments are read from the CLI.
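cli.py is the authoritative version; the snippet below is only a hypothetical sketch of how the positional arguments from the env.sh usage above might be read, with the "-" placeholder treated as "argument not passed" (all names here are assumptions).

```python
import sys

def parse_args(argv):
    # Expected positional args: <ds dir> <number of ds (limit)> <random order>.
    ds_dir = argv[1]
    limit = None if len(argv) < 3 or argv[2] == "-" else int(argv[2])
    rand = len(argv) > 3 and argv[3] not in ("-", "False")
    return ds_dir, limit, rand

if __name__ == "__main__":
    print(parse_args(sys.argv))
```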
Run as ./env.sh "basic_metadata.py /user/hm74/NYCOpenData 30 print"