Commit 1bcae0d

Vitthal Mirji authored and committed
Work in progress - Added new data pipelines and a few API functions useful for writing data pipelines
1 parent a340dc1 commit 1bcae0d

119 files changed, +4020 -159 lines changed

File renamed without changes.

.gitignore

Lines changed: 8 additions & 112 deletions
@@ -1,62 +1,13 @@
-# Created by .ignore support plugin (hsz.mobi)
-### JupyterNotebooks template
-# gitignore template for Jupyter Notebooks
-# website: http://jupyter.org/
-
-.ipynb_checkpoints
-*/.ipynb_checkpoints/*
-
-# Remove previous ipynb_checkpoints
-# git rm -r .ipynb_checkpoints/
-#
-
-### GitBook template
-# Node rules:
-## Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
-.grunt
-
-## Dependency directory
-## Commenting this out is preferred by some people, see
-## https://docs.npmjs.com/misc/faq#should-i-check-my-node_modules-folder-into-git
-node_modules
-
-# Book build output
-_book
-
-# eBook build output
-*.epub
-*.mobi
-*.pdf
-
-### Vim template
-# Swap
-[._]*.s[a-v][a-z]
-[._]*.sw[a-p]
-[._]s[a-rt-v][a-z]
-[._]ss[a-gi-z]
-[._]sw[a-p]
-
-# Session
-Session.vim
-Sessionx.vim
+### Python template
 
-# Temporary
-.netrwhist
-*~
-# Auto-generated tag files
-tags
-# Persistent undo
-[._]*.un~
+# Bash script logs
+*.log
 
-### Python template
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 
-# C extensions
-*.so
-
 # Distribution / packaging
 .Python
 build/
@@ -77,11 +28,10 @@ share/python-wheels/
 .installed.cfg
 *.egg
 MANIFEST
-test/metastore_db
-test/hive
-test/spark-warehouse
-test/derby.log
-
+src/test/metastore_db
+src/src/main/test/hive
+src/test/spark-warehouse
+src/test/derby.log
 
 # PyInstaller
 # Usually these files are written by a python script from a template
@@ -110,71 +60,17 @@ coverage.xml
 *.mo
 *.pot
 
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
 # PyBuilder
 target/
 
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
 # pyenv
 .python-version
 
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# celery beat schedule file
-celerybeat-schedule
-
-# SageMath parsed files
-*.sage.py
-
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
+venv.bak/

MANIFEST.in

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+recursive-include sbin *.sh
+recursive-include conf *.html *.json *.conf *.properties *.html
+recursive-include resources *.html
+recursive-include logs logs log-sample
+recursive-include docs *.md
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+<html><body><P>Team,<br/><br/>Data Quality check finished successfully for <b>DQ ID = 101</b>, with failures. Check details in below table of metrics.</P><h3 style="font-family:arial">Failed DQ details</h3><table border="3" style="width:100%"><tr style="text-align:left;background-color:#FF6347"><th>Yarn Application Id</th> <th>DQ ID</th> <th>Rule ID</th> <th>Rule Name</th> <th>Rule type</th> <th>Description</th> <th>Columns/Query</th> <th>Pass Count</th> <th>Fail Count</th> <th>Total Count</th></tr><tr><td>local-1681916910001</td> <td>101</td> <td>1011</td> <td>Primary / Natural Keys</td> <td>unique</td> <td>Primary / Natural Keys should not have duplicates</td> <td>['name']</td> <td>1039</td> <td>3</td> <td>1042</td></tr> <tr><td>local-1681916910001</td> <td>101</td> <td>1012</td> <td>NOT NULL fields</td> <td>not null</td> <td>Field should have valid value</td> <td>['name', 'cookTime', 'prepTime']</td> <td>715</td> <td>327</td> <td>1042</td></tr></table><h3 style="font-family:arial">Succeeded DQ details</h3><table border="3" style="width:100%"><tr style="text-align:left;background-color:#33FFBD"><th>Yarn Application Id</th> <th>DQ ID</th> <th>Rule ID</th> <th>Rule Name</th> <th>Rule type</th> <th>Description</th> <th>Columns/Query</th> <th>Pass Count</th> <th>Fail Count</th> <th>Total Count</th></tr><tr><td>local-1681916910001</td> <td>101</td> <td>1013</td> <td>File names check</td> <td>query</td> <td>Check If all input files are read for processing</td> <td>["WITH file_names AS (SELECT 'recipes-000.json' AS file_name UNION SELECT 'recipes-001.json' AS file_name UNION SELECT 'recipes-002.json' AS file_name)\nSELECT f.file_name FROM file_names f\nLEFT JOIN (SELECT DISTINCT reverse(split(input_file_name(), '/'))[0] as file_name FROM temp) t\nON t.file_name = f.file_name\nWHERE t.file_name IS NULL"]</td> <td>1042</td> <td>0</td> <td>1042</td></tr></table><br/><br/>Thanks</body></html><br/>
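The sample report above has one table per outcome, each with the same ten columns. As a rough illustration only, the sketch below shows how such a table could be rendered from per-rule metrics. `RuleResult` and `render_table` are hypothetical names, not functions from this commit, and the sketch assumes that Total Count is the sum of Pass Count and Fail Count, which holds for the three rows shown. Unlike the sample, it HTML-escapes cell values.

```python
from dataclasses import dataclass
from html import escape

# Column titles copied from the sample report.
HEADERS = [
    "Yarn Application Id", "DQ ID", "Rule ID", "Rule Name", "Rule type",
    "Description", "Columns/Query", "Pass Count", "Fail Count", "Total Count",
]


@dataclass
class RuleResult:
    """Hypothetical container for one report row (not part of the commit)."""
    app_id: str
    dq_id: int
    rule_id: int
    name: str
    rule_type: str
    description: str
    target: str  # column list or query text, already stringified
    pass_count: int
    fail_count: int

    def cells(self):
        # Assumption: total = pass + fail, as in the sample rows.
        total = self.pass_count + self.fail_count
        return [self.app_id, self.dq_id, self.rule_id, self.name,
                self.rule_type, self.description, self.target,
                self.pass_count, self.fail_count, total]


def render_table(results, header_color):
    """Render one report table; every cell value is HTML-escaped."""
    head = " ".join(f"<th>{escape(h)}</th>" for h in HEADERS)
    rows = " ".join(
        "<tr>" + " ".join(f"<td>{escape(str(c))}</td>" for c in r.cells()) + "</tr>"
        for r in results
    )
    return (
        '<table border="3" style="width:100%">'
        f'<tr style="text-align:left;background-color:{header_color}">{head}</tr>'
        f"{rows}</table>"
    )
```

Calling `render_table(failed, "#FF6347")` and `render_table(succeeded, "#33FFBD")` would reproduce the two colour-coded sections of the sample.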
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+{
+  "dq_id": 101,
+  "rules": [
+    {
+      "rule_id": 1011,
+      "name": "Primary / Natural Keys",
+      "description": "Primary / Natural Keys should not have duplicates",
+      "rule_type": "unique",
+      "columns": [
+        "name"
+      ]
+    },
+    {
+      "rule_id": 1012,
+      "name": "NOT NULL fields",
+      "description": "Field should have valid value",
+      "rule_type": "not null",
+      "columns": [
+        "name",
+        "cookTime",
+        "prepTime"
+      ]
+    },
+    {
+      "rule_id": 1013,
+      "name": "Input files check",
+      "description": "Check If all input files are read for processing",
+      "rule_type": "query",
+      "query": "WITH file_names AS (SELECT 'recipes-000.json' AS file_name UNION SELECT 'recipes-001.json' AS file_name UNION SELECT 'recipes-002.json' AS file_name)\nSELECT f.file_name FROM file_names f\nLEFT JOIN (SELECT DISTINCT reverse(split(input_file_name(), '/'))[0] as file_name FROM temp) t\nON t.file_name = f.file_name\nWHERE t.file_name IS NULL"
+    },
+    {
+      "rule_id": 1014,
+      "name": "\"Check for invalid cook & prep time",
+      "description": "Check empty or null values",
+      "rule_type": "query",
+      "query": "SELECT * FROM temp WHERE cookTime = '' OR prepTime = ''"
+    }
+  ]
+}
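The `unique` and `not null` rule types in the config above each name a list of columns. The sketch below illustrates one plausible way to turn such a rule into pass and fail counts, using plain Python rows rather than Spark DataFrames so it runs standalone. `check_unique`, `check_not_null` and `run_rule` are hypothetical helpers, not code from this commit. How the project counts duplicates is not visible in the diff, so counting every row whose key occurs more than once as a failure is an assumption.

```python
from collections import Counter


def check_unique(rows, columns):
    """Return (pass_count, fail_count) for a 'unique' rule.

    Assumption: every row whose key occurs more than once fails.
    """
    keys = [tuple(row.get(c) for c in columns) for row in rows]
    counts = Counter(keys)
    fail = sum(1 for key in keys if counts[key] > 1)
    return len(rows) - fail, fail


def check_not_null(rows, columns):
    """Return (pass_count, fail_count); a row fails if any listed column is None."""
    fail = sum(1 for row in rows if any(row.get(c) is None for c in columns))
    return len(rows) - fail, fail


def run_rule(rule, rows):
    """Dispatch a column-based rule from the config to its check function."""
    checks = {"unique": check_unique, "not null": check_not_null}
    rule_type = rule["rule_type"]
    if rule_type not in checks:
        raise ValueError(f"unsupported rule_type: {rule_type!r}")
    passed, failed = checks[rule_type](rows, rule["columns"])
    return {
        "rule_id": rule["rule_id"],
        "pass_count": passed,
        "fail_count": failed,
        "total_count": len(rows),
    }
```

A `query` rule would need a SQL engine and is deliberately left out of this sketch.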
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+{
+  "dq_id": 101,
+  "rules": [
+    {
+      "rule_id": 1015,
+      "name": "Primary / Natural Keys",
+      "description": "Primary / Natural Keys should not have duplicates",
+      "rule_type": "unique",
+      "columns": [
+        "difficulty"
+      ]
+    },
+    {
+      "rule_id": 1016,
+      "name": "NOT NULL fields",
+      "description": "Field should have valid value",
+      "rule_type": "not null",
+      "columns": [
+        "difficulty",
+        "avg_total_cooking_time"
+      ]
+    }
+  ]
+}
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+{
+  "dq_id": 101,
+  "execution_reports_dir": "<project-root-place-holder>/resources/data-quality-reports/recipe-tasks",
+  "email_execution_report_to": "[email protected]",
+  "rules": [
+    {
+      "rule_id": 1011,
+      "name": "Primary / Natural Keys",
+      "description": "Primary / Natural Keys should not have duplicates",
+      "rule_type": "unique",
+      "columns": [
+        "name"
+      ]
+    },
+    {
+      "rule_id": 1012,
+      "name": "NOT NULL fields",
+      "description": "Field should have valid value",
+      "rule_type": "not null",
+      "columns": [
+        "name",
+        "cookTime",
+        "prepTime"
+      ]
+    },
+    {
+      "rule_id": 1013,
+      "name": "Input files check",
+      "description": "Check If all input files are read for processing",
+      "rule_type": "query",
+      "query": "WITH file_names AS (SELECT 'recipes-000.json' AS file_name UNION SELECT 'recipes-001.json' AS file_name UNION SELECT 'recipes-002.json' AS file_name)\nSELECT f.file_name FROM file_names f\nLEFT JOIN (SELECT DISTINCT reverse(split(input_file_name(), '/'))[0] as file_name FROM temp) t\nON t.file_name = f.file_name\nWHERE t.file_name IS NULL"
+    },
+    {
+      "rule_id": 1014,
+      "name": "\"Check for invalid cook & prep time",
+      "description": "Check empty or null values",
+      "rule_type": "query",
+      "query": "SELECT * FROM temp WHERE cookTime = '' OR prepTime = ''"
+    }
+  ]
+}
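Rule 1013 in the config above is an anti-join: it lists the expected file names, left-joins them against the last path segment of `input_file_name()`, and returns the expected names that have no match. The same logic can be sketched in plain Python as a set difference. `missing_input_files` is a hypothetical helper, not part of this commit. The sketch also assumes that a `query` rule treats each returned row as a failure, which is consistent with the Fail Count of 0 in the sample report but is not confirmed by the diff.

```python
def missing_input_files(expected, input_paths):
    """Return expected file names that never appear among the input paths.

    Mirrors reverse(split(input_file_name(), '/'))[0] in the rule's SQL,
    which takes the last '/'-separated segment of each path.
    """
    seen = {path.rsplit("/", 1)[-1] for path in input_paths}
    return sorted(set(expected) - seen)


# Example data: the expected names come from the rule's query; the paths are made up.
expected = ["recipes-000.json", "recipes-001.json", "recipes-002.json"]
paths = [
    "file:///data/input/recipes-000.json",
    "file:///data/input/recipes-001.json",
]
missing = missing_input_files(expected, paths)
```

Here `missing` contains only `recipes-002.json`, the one expected file with no matching input path.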
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+{
+  "dq_id": 101,
+  "execution_reports_dir": "<project-root-place-holder>/resources/data-quality-reports/recipe-tasks",
+  "email_execution_report_to": "[email protected]",
+  "rules": [
+    {
+      "rule_id": 1015,
+      "name": "Primary / Natural Keys",
+      "description": "Primary / Natural Keys should not have duplicates",
+      "rule_type": "unique",
+      "columns": [
+        "difficulty"
+      ]
+    },
+    {
+      "rule_id": 1016,
+      "name": "NOT NULL fields",
+      "description": "Field should have valid value",
+      "rule_type": "not null",
+      "columns": [
+        "difficulty",
+        "avg_total_cooking_time"
+      ]
+    }
+  ]
+}
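The config above carries a `<project-root-place-holder>` token in `execution_reports_dir`, and each rule needs either `columns` or `query` depending on its `rule_type`. The following is a minimal sketch of a loader that resolves the token and checks that shape. `load_dq_config` is a hypothetical function, and the idea that the token is replaced with the project root at load time is an assumption drawn from its name, not from code visible in this diff.

```python
import json

PLACEHOLDER = "<project-root-place-holder>"

# Field each rule type needs, as seen in the configs in this commit.
REQUIRED_FIELD = {"unique": "columns", "not null": "columns", "query": "query"}


def load_dq_config(text, project_root):
    """Parse a DQ config, resolve the root placeholder, and validate rules."""
    config = json.loads(text)

    reports_dir = config.get("execution_reports_dir")
    if reports_dir is not None:
        # Assumption: the placeholder stands for the project root directory.
        config["execution_reports_dir"] = reports_dir.replace(PLACEHOLDER, project_root)

    for rule in config.get("rules", []):
        rule_type = rule.get("rule_type")
        required = REQUIRED_FIELD.get(rule_type)
        if required is None:
            raise ValueError(f"rule {rule.get('rule_id')}: unknown rule_type {rule_type!r}")
        if required not in rule:
            raise ValueError(f"rule {rule.get('rule_id')}: missing {required!r}")
    return config
```

Validating up front means a malformed rule fails before any data is read, rather than partway through a run.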

conf/python/logging-properties.json

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+{
+  "version": 1,
+  "objects": {
+    "queue": {
+      "class": "queue.Queue",
+      "maxsize": 1000
+    }
+  },
+  "formatters": {
+    "simple": {
+      "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+    },
+    "detailed": {
+      "format": "%(asctime)s %(name)-15s %(levelname)-8s %(process)-10d %(funcName)-30s %(message)s"
+    }
+  },
+  "handlers": {
+    "console": {
+      "class": "logging.StreamHandler",
+      "level": "DEBUG",
+      "formatter": "detailed",
+      "stream": "ext://sys.stdout"
+    },
+    "file": {
+      "class": "logging.FileHandler",
+      "level": "DEBUG",
+      "encoding": "utf-8",
+      "formatter": "detailed",
+      "filename": "logs/log-{job_name_placeholder}_{timestamp_placeholder}.log",
+      "mode": "a"
+    }
+  },
+  "loggers": {
+    "simple": {
+      "level": "INFO",
+      "handlers": [
+        "console"
+      ],
+      "propagate": "no"
+    },
+    "unit-tests": {
+      "level": "DEBUG",
+      "handlers": [
+        "console"
+      ],
+      "propagate": "no"
+    }
+  },
+  "root": {
+    "level": "DEBUG",
+    "handlers": [
+      "console",
+      "file"
+    ]
+  }
+}
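The logging config above has two details that need handling before it can be passed to `logging.config.dictConfig`: the `{job_name_placeholder}` and `{timestamp_placeholder}` tokens in the file handler's `filename`, and the string `"no"` for `propagate`. As far as I can tell, `dictConfig` assigns `propagate` as given, so a non-empty string such as `"no"` is truthy and leaves propagation on. The sketch below is one possible preparation step; `build_logging_config` is a hypothetical helper and the timestamp format is an assumption.

```python
import copy
import logging
import logging.config
from datetime import datetime


def build_logging_config(template, job_name, now=None):
    """Return a copy of the template that is ready for dictConfig.

    - fills the job-name and timestamp placeholders in handler filenames
    - converts string 'propagate' values ("no", "yes", ...) to booleans
    """
    config = copy.deepcopy(template)
    # Assumed timestamp layout; the commit does not show the real one.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")

    for handler in config.get("handlers", {}).values():
        filename = handler.get("filename")
        if filename is not None:
            handler["filename"] = (
                filename.replace("{job_name_placeholder}", job_name)
                        .replace("{timestamp_placeholder}", timestamp)
            )

    for logger_cfg in config.get("loggers", {}).values():
        value = logger_cfg.get("propagate")
        if isinstance(value, str):
            logger_cfg["propagate"] = value.strip().lower() in ("yes", "true", "1")
    return config
```

A typical use would be to load the JSON with `json.load`, pass the result through `build_logging_config(template, "recipes")`, and hand that to `logging.config.dictConfig`. `logging.FileHandler` does not create missing directories, so the `logs/` directory has to exist first.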
