Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
===

This repository contains the code for implementing and running the pipeline described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".

We model a scenario where the user wants to apply an ML predictor to a base table, after augmenting it with features found in a data lake.

This repository includes code necessary for:
- Preparing and indexing an arbitrary data lake so that the tables it contains can be used by the pipeline
- Indexing the data lake using different retrieval methods, and querying the indices given a specific base table
- Running the evaluation pipeline on a given base table and data lake
- Tracking all the experimental runs
- Studying the results and preparing the plots and tables used in the main paper

The join candidates are merged with the base table under study before training an ML model (either CatBoost or a linear model) to evaluate the performance before and after the merge.

We use YADL as our data lake, a synthetic data lake based on the YAGO3 knowledge base. The YADL variants used in the paper are available [on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047). The code for preparing the YADL variants can be found in [this repo](https://github.com/rcap107/YADL).

The base tables used for the experiments are provided in `data/source_tables/`.

More detail on the functioning of the code is available on the [repository website](https://rcap107.github.io/retrieve-merge-predict/).

**NOTE:** The repository relies heavily on the `parquet` format [ref](https://parquet.apache.org/docs/file-format/), and will expect all tables (both source tables and data lake tables) to be stored in `parquet` format. Please convert your data to parquet before working on the pipeline.
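
If your tables are currently in CSV form, the conversion can be done with a few lines of pandas (or any other parquet-capable library). The snippet below is a minimal sketch, not a utility shipped with this repository, and the folder paths are placeholders.

```python
# Hypothetical helper (not part of the repository): convert every CSV file in a
# folder to parquet so that the pipeline can read it. Assumes pandas + pyarrow.
from pathlib import Path

import pandas as pd


def convert_csv_folder(src_dir: str, dst_dir: str) -> None:
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for csv_path in Path(src_dir).glob("*.csv"):
        df = pd.read_csv(csv_path)
        # One parquet file per CSV, keeping the original file name.
        df.to_parquet(dst / f"{csv_path.stem}.parquet", index=False)


if __name__ == "__main__":
    convert_csv_folder("data/my_data_lake_csv", "data/my_data_lake")
```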

**NOTE:** We recommend using the smaller `binary_update` data lake and its corresponding configurations to set up the data structures and debug potential issues, as all preparation steps are significantly faster than with larger data lakes.

The *Schools* dataset is an internal dataset found in the Open Data US data lake. The *US County Population* dataset is an internal dataset found in YADL.

YADL is derived from YAGO3 [source](https://yago-knowledge.org/getting-started) and shares its CC BY 4.0 license.

Datasets were pre-processed before they were used in our experiments. Pre-processing steps are reported in the [preparation repository](https://github.com/rcap107/YADL) and in this repository.

**Important**: in the current version of the code, all base tables are expected to include a column named `target` that contains the variable that should be predicted by the ML model. Please process any new input table so that the prediction column is named `target`.
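
As an illustration, renaming the prediction column and saving the table back to parquet can be done along these lines. This is a hedged sketch using pandas rather than repository code, and the file and column names are placeholders.

```python
# Hypothetical example (not part of the repository): rename the prediction
# column of a base table to `target` and store the result as parquet.
import pandas as pd

df = pd.read_parquet("data/source_tables/my_table.parquet")
df = df.rename(columns={"label_column": "target"})  # placeholder column name
df.to_parquet("data/source_tables/my_table-prepared.parquet", index=False)
```
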
### Starmie
To integrate Starmie into our pipeline, we implemented modifications that are tracked in a [fork](https://github.com/rcap107/starmie) of the [original repository](https://github.com/megagonlabs/starmie). The fork includes some additional bindings that we added to produce results in the format required by the pipeline.

# Installing the requirements
We recommend using conda environments to fetch the required packages. File `environment.yaml` contains the required dependencies.

Additional files may be downloaded from zenodo using the same `wget` command:
```
wget -O destination_file_name path_to_file
```

# Preparing the environment
Once the required python environment has been prepared, it is necessary to prepare the files required for the execution of the pipeline.

For efficiency reasons, and to avoid running unnecessary operations when testing different components, the pipeline has been split into different modules that have to be run in sequence.

## Preparing the metadata
Given a data lake version to evaluate, the first step is preparing a metadata file for each table in the data lake. This metadata is used in all steps of the pipeline.

The script `prepare_metadata.py` is used to generate the files for a given data lake case.

**NOTE:** This script assumes that all tables are saved in `.parquet` format, and will raise an error if it finds no `.parquet` files in the given path. Please convert your files to parquet before running this script.

Use the command:
```
python prepare_metadata.py PATH_DATA_FOLDER
```
where `PATH_DATA_FOLDER` is the root path of the data lake. The stem of `PATH_DATA_FOLDER` will be used as the identifier for the data lake throughout the program (e.g., for `data/binary_update`, the data lake will be stored under the name `binary_update`).

The script will recursively scan all folders found in `PATH_DATA_FOLDER` and generate a json file for each parquet file encountered. By providing the `--flat` parameter, it is possible to scan only the files in the root directory rather than working on all folders and files.
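
Before running the script, it can be useful to check what it will find. The snippet below is a quick sanity check, not part of the repository, and the mapping of the second pattern to `--flat` is an approximation.

```python
# Hypothetical sanity check (not part of the repository): count the parquet
# files under a data lake root before running prepare_metadata.py.
from pathlib import Path

data_lake_root = Path("data/binary_update")  # example path from above

recursive = sorted(data_lake_root.rglob("*.parquet"))  # default: all subfolders
flat_only = sorted(data_lake_root.glob("*.parquet"))   # roughly what --flat scans

print(f"{len(recursive)} parquet files found recursively")
print(f"{len(flat_only)} parquet files found in the root directory only")
```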

Metadata will be saved in `data/metadata/DATA_LAKE_NAME`, with an auxiliary file stored in `data/metadata/_mdi/md_index_DATA_LAKE_NAME.pickle`.
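
To verify that the metadata was generated, one can open one of the resulting files. The snippet below is a hedged example that assumes only that each metadata file is a JSON object stored directly in that folder; the data lake name is a placeholder.

```python
# Hypothetical inspection (not part of the repository): print the top-level
# keys of one generated metadata file, without assuming its exact schema.
import json
from pathlib import Path

metadata_dir = Path("data/metadata/binary_update")
first_file = next(metadata_dir.glob("*.json"))
with first_file.open() as fh:
    metadata = json.load(fh)
print(first_file.name, "->", list(metadata))
```
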
## Preparing the Retrieval methods
This step is an offline operation during which the retrieval methods are prepared by building the data structures they rely on to function. This operation can require a long time and a large amount of disk space (depending on the method); it is independent of the querying step and thus needs to be executed only once for each data lake (and retrieval method).

Different retrieval methods require different data structures and different starting configurations, which should be stored in `config/retrieval/prepare`. In all configurations, `n_jobs` is the number of parallel jobs that will be executed; if it is set to -1, all available cores will be used.
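
To give a sense of what this offline preparation involves, the sketch below builds a toy MinHash-based index over the value sets of a few candidate join columns using the `datasketch` library. It is a conceptual illustration only, not the repository's actual implementation, and the table and column names are made up.

```python
# Conceptual sketch (not repository code): build an LSH Ensemble index over the
# value sets of candidate join columns, the kind of structure a MinHash-based
# retriever prepares offline. Assumes the `datasketch` package is installed.
from datasketch import MinHash, MinHashLSHEnsemble

# Toy "data lake": column name -> set of values (normally read from parquet).
columns = {
    "movies.title": {"alien", "heat", "seven"},
    "directors.name": {"scott", "mann", "fincher"},
    "cities.city": {"paris", "rome", "berlin", "madrid"},
}


def minhash_of(values, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(v.encode("utf8"))
    return m


index = MinHashLSHEnsemble(threshold=0.5, num_perm=128)
entries = [(name, minhash_of(vals), len(vals)) for name, vals in columns.items()]
index.index(entries)

# Query with a column from a base table to find join candidates.
query_values = {"paris", "berlin", "london"}
candidates = list(index.query(minhash_of(query_values), len(query_values)))
print(candidates)
```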

To use the Hybrid MinHash variant, the `query` configuration file should include the parameter `hybrid=true`: the re-ranking operation is done at query time.

# Executing the pipeline
The configurations used to run the experiments in the paper are available in directory `config/evaluation`.

The experiment configurations that tested default parameters are stored in `config/evaluation/general`; experiment configurations testing aggregation are in `config/evaluation/aggregation`; additional experiments that test specific parameters and scenarios are in `config/evaluation/other`.