
Commit 1b078ac

Moving scripts around, updating readme a bit
1 parent b36ef94 commit 1b078ac

20 files changed: +117 −395 lines changed

README.md

Lines changed: 113 additions & 53 deletions
@@ -1,48 +1,69 @@
Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
===
This repository contains the code for implementing and running the pipeline
described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".

We model a scenario where the user wants to apply an ML predictor to a base table,
after augmenting it by adding features found in a data lake.

![](doc/images/pipeline.png)

This repository includes the code necessary for:
- Preparing and indexing an arbitrary data lake so that the tables it contains
  can be used by the pipeline
- Indexing the data lake using different retrieval methods, and querying the
  indices given a specific base table
- Running the evaluation pipeline on a given base table and data lake
- Tracking all the experimental runs
- Studying the results and preparing the plots and tables used in the main paper

We use YADL as our data lake, a synthetic data lake based on the YAGO3 knowledge
base. The YADL variants used in the paper are available
[on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047).
The code for preparing the YADL variants can be found in
[this repo](https://github.com/rcap107/YADL).

The base tables used for the experiments are provided in `data/source_tables/`.

**NOTE:** The repository relies heavily on the `parquet` format
([ref](https://parquet.apache.org/docs/file-format/)) and expects all tables
(both source tables and data lake tables) to be stored in `parquet` format.
Please convert your data to parquet before working on the pipeline.
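If your tables are not yet in `parquet`, a minimal conversion sketch (not part of
the repository; it assumes pandas with a parquet engine such as pyarrow, and the
file name is a placeholder) could look like this:

```python
# Sketch: convert a CSV table to parquet so the pipeline can read it.
# "my_table.csv" is a hypothetical file name; adapt it to your data.
import pandas as pd

df = pd.read_csv("my_table.csv")
df.to_parquet("my_table.parquet", index=False)
```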

**NOTE:** We recommend using the smaller `binary_update` data lake and its
corresponding configurations to set up the data structures and debug potential
issues, as all preparation steps are significantly faster than with larger data lakes.

# Dataset info
We used the following sources for our dataset:
- *Company Employees* [source](https://www.kaggle.com/datasets/iqmansingh/company-employee-dataset) - CC0
- *Housing Prices* [source](https://www.zillow.com/research/data/)
- *US Accidents* [source](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) - CC BY-NC-SA 4.0
- *US Elections* [source](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ) - CC0

The *Schools* dataset is an internal dataset found in the Open Data US data lake.
The *US County Population* dataset is an internal dataset found in YADL.

YADL is derived from YAGO3 [source](https://yago-knowledge.org/getting-started)
and shares its CC BY 4.0 license.

Datasets were pre-processed before they were used in our experiments. Pre-processing
steps are reported in the [preparation repository](https://github.com/rcap107/YADL)
and in this repository.

**Important**: in the current version of the code, all base tables are expected
to include a column named `target` that contains the variable to be predicted by
the ML model. Please process any new input table so that the prediction
column is named `target`.
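A minimal sketch of this renaming step (an illustration rather than repository code;
the original column name `price` and the file name are hypothetical, and pandas with
a parquet engine is assumed):

```python
# Sketch: rename the prediction column of a base table to "target" and save it back.
# "price" and the file path are placeholder names; adapt them to your table.
import pandas as pd

df = pd.read_parquet("data/source_tables/my_base_table.parquet")
df = df.rename(columns={"price": "target"})
df.to_parquet("data/source_tables/my_base_table.parquet", index=False)
```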

### Starmie
To implement Starmie in our pipeline, we made modifications that are tracked in a
[fork](https://github.com/rcap107/starmie) of the
[original repository](https://github.com/megagonlabs/starmie).

The fork includes some additional bindings that we added to produce results in the
format required by the pipeline.

# Installing the requirements
We recommend using conda environments to fetch the required packages. File `environment.yaml` contains the
@@ -67,47 +88,68 @@ Additional files may be downloaded from zenodo using the same command:
```
wget -O destination_file_name path_to_file
```
# Preparing the environment
Once the required Python environment has been prepared, it is necessary to prepare
the files required for the execution of the pipeline.

For efficiency reasons, and to avoid running unnecessary operations when testing
different components, the pipeline has been split into different modules that have
to be run in sequence.

## Preparing the metadata
Given a data lake version to evaluate, the first step is preparing a metadata file
for each table in the data lake. This metadata is used in all steps of the pipeline.

The script `prepare_metadata.py` is used to generate the files for a given data
lake case.

**NOTE:** This script assumes that all tables are saved in `.parquet` format, and
will raise an error if it finds no `.parquet` files in the given path. Please
convert your files to parquet before running this script.

Use the command:
```
python prepare_metadata.py PATH_DATA_FOLDER
```
where `PATH_DATA_FOLDER` is the root path of the data lake. The stem of
`PATH_DATA_FOLDER` will be used as the identifier for the data lake throughout the
program (e.g., for `data/binary_update`, the data lake will be stored under the
name `binary_update`).

The script will recursively scan all folders found in `PATH_DATA_FOLDER` and
generate a json file for each parquet file encountered. By providing the `--flat`
parameter, it is possible to scan only the files in the root directory rather than
working on all folders and files.

Metadata will be saved in `data/metadata/DATA_LAKE_NAME`, with an auxiliary file
stored in `data/metadata/_mdi/md_index_DATA_LAKE_NAME.pickle`.
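As a quick sanity check after running the script, one might confirm that the metadata
files and the auxiliary index exist. This is a sketch under assumptions (the
`binary_update` data lake and the default output paths described above), not part of
the repository:

```python
# Sketch: verify the output of prepare_metadata.py for the `binary_update` data lake.
# Paths follow the conventions described above; adjust the data lake name as needed.
from pathlib import Path

data_lake_name = "binary_update"
metadata_dir = Path("data/metadata") / data_lake_name
index_path = Path("data/metadata/_mdi") / f"md_index_{data_lake_name}.pickle"

# One JSON metadata file is expected for each parquet table found in the data lake.
json_files = sorted(metadata_dir.rglob("*.json"))
print(f"{len(json_files)} metadata files found in {metadata_dir}")

# The auxiliary index should have been written alongside the per-table metadata.
print(f"Auxiliary index present: {index_path.exists()}")
```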

## Preparing the Retrieval methods
This step is an offline operation during which the retrieval methods are prepared
by building the data structures they rely on to function. This operation can require
a long time and a large amount of disk space (depending on the method); it is not
required for the querying step, and thus it can be executed only once for each data
lake (and retrieval method).

Different retrieval methods require different data structures and different starting
configurations, which should be stored in `config/retrieval/prepare`. In all
configurations, `n_jobs` is the number of parallel jobs that will be executed; if it
is set to -1, all available CPU cores will be used.

```sh
python prepare_retrieval_methods.py [--repeats REPEATS] config_file
```
`config_file` is the path to the configuration file. `repeats` is a parameter that
can be added to re-run the current configuration `repeats` times (this should be
used only for measuring the time required for running the indexing operation).

### Config files
@@ -136,28 +178,37 @@ query_column="County"
n_jobs=-1
```

The configuration parser will prepare the data structures (specifically, the counts)
for each case provided in the configuration file.

Configuration files whose names start with `prepare` in `config/retrieval/prepare`
are example configuration files for the index preparation step.

To prepare the retrieval methods for data lake `binary_update`:
```sh
python prepare_retrieval_methods.py config/retrieval/prepare/prepare-exact_matching-binary_update.toml
python prepare_retrieval_methods.py config/retrieval/prepare/prepare-minhash-binary_update.toml
```
This will create the index structures for the different retrieval methods in
`data/metadata/_indices/binary_update`.

Data lake preparation should be repeated for any new data lake, and each data lake
will have its own directory in `data/metadata/_indices/`.

## Querying the retrieval methods
The querying operation is decoupled from the indexing step for practical reasons
(querying is much faster than indexing). Moreover, methods such as MinHash attempt
to optimize the query operation by building the data structures offline in the
indexing step.

For these reasons, querying is done using the `query_indices.py` script and is based
on the configurations in `config/retrieval/query`.

In principle, queries could be done at runtime during the pipeline execution. For
efficiency and simplicity, they are executed offline and stored in
`results/query_results`. The pipeline then loads the appropriate query at runtime.

To build the queries for `binary_update`:
```sh
@@ -167,16 +218,25 @@ python query_indices.py config/retrieval/query/query-exact_matching-binary_updat
```

### Hybrid MinHash
To use the Hybrid MinHash variant, the `query` configuration file should include the
parameter `hybrid=true`: the re-ranking operation is done at query time.

# Executing the pipeline
The configurations used to run the experiments in the paper are available in the
directory `config/evaluation`.

The experiment configurations that tested default parameters are stored in
`config/evaluation/general`; experiment configurations testing aggregation are in
`config/evaluation/aggregation`; additional experiments that test specific parameters
and scenarios are in `config/evaluation/other`.

To run experiments with `binary_update`:
```sh
python main.py config/evaluation/general/config-binary.toml
```

# Evaluation of the results and plotting
Due to the large scale of the experimental campaign, a number of scripts in the
repository are used to check the correctness of the results and to prepare the
final result datasets.
