This repository contains the experiment code for the example model benchmarks and data processing that accompany the paper *MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems*. It is only an example demonstration of how the MerRec dataset can be used for recommendation tasks, and does not depict or reflect the production implementation at Mercari.
In the SBR (session-based recommendation) tasks, the raw data is converted into processed sequences directly in memory, so no separate pre-processing step is required. Below are the commands to run the various SBR models on the benchmark data.
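As a rough illustration of the in-memory conversion described above, the sketch below groups raw interaction events into ordered per-session item sequences. This is a hypothetical example: the function name `build_sequences`, the `(session_id, timestamp, item_id)` tuple layout, and the `min_len` filter are assumptions for illustration, not the repository's actual schema or code.

```python
# Hypothetical sketch: turning raw event logs into per-session item-ID
# sequences in memory (field layout is an assumption, not MerRec's schema).
from collections import defaultdict

def build_sequences(events, min_len=2):
    """Group (session_id, timestamp, item_id) events into ordered item sequences."""
    sessions = defaultdict(list)
    for session_id, timestamp, item_id in events:
        sessions[session_id].append((timestamp, item_id))
    sequences = {}
    for session_id, pairs in sessions.items():
        pairs.sort()  # order interactions by timestamp within each session
        seq = [item for _, item in pairs]
        if len(seq) >= min_len:  # drop sessions too short to train on
            sequences[session_id] = seq
    return sequences

events = [
    ("s1", 2, "item_b"), ("s1", 1, "item_a"), ("s1", 3, "item_c"),
    ("s2", 1, "item_d"),  # single-event session, filtered out
]
print(build_sequences(events))  # {'s1': ['item_a', 'item_b', 'item_c']}
```

Sequences built this way can then be windowed or padded into fixed-length model inputs, which is what the sequence models below consume.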
NextItNet:

```shell
python main.py --task_name=sequence --seed=100 --model_name=nextitnet --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --kernel_size=3 --is_pretrain=1
```

Bert4Rec:

```shell
python main.py --task_name=sequence --seed=100 --model_name=bert4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=16 --embedding_size=128 --num_heads=4 --mask_prob=0.3 --is_pretrain=1
```

GRU4Rec:

```shell
python main.py --task_name=sequence --seed=100 --model_name=gru4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0005 --hidden_size=64 --block_num=8 --embedding_size=64 --is_pretrain=1
```

SASRec:

```shell
python main.py --task_name=sequence --seed=100 --model_name=sasrec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1
```

In both the CTR and MTL tasks below, the raw dataset first needs to be transformed.
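To give a rough idea of the kind of transformation the pre-processing script performs, the sketch below flattens raw interaction events into one row per (user, product) pair with binary `item_view` and `item_like` targets, the two signals the MTL commands below select between. This is a hypothetical illustration: the field names, event-type strings, and function `to_mtl_rows` are assumptions, not the actual schema of `preprocess_mtl.py`.

```python
# Hypothetical sketch of a CTR/MTL table build: one row per (user, product)
# with binary item_view / item_like labels. Field names are assumptions.
import csv
import io

def to_mtl_rows(events):
    """events: iterable of (user_id, product_id, event_type) tuples."""
    rows = {}
    for user_id, product_id, event_type in events:
        key = (user_id, product_id)
        row = rows.setdefault(key, {"user_id": user_id, "product_id": product_id,
                                    "item_view": 0, "item_like": 0})
        if event_type == "item_view":
            row["item_view"] = 1
        elif event_type == "item_like":
            row["item_like"] = 1
    return list(rows.values())

events = [("u1", "p1", "item_view"), ("u1", "p1", "item_like"), ("u2", "p2", "item_view")]
rows = to_mtl_rows(events)

# Serialize to a flat CSV, the shape the downstream CTR/MTL scripts consume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "product_id", "item_view", "item_like"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The actual script additionally handles the dataset's categorical features; this sketch only shows the label-flattening step.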
Based on product_id:

```shell
python preprocess_mtl.py --out_path='data/mtl_product.csv' --local_dir_path='data/20230501'
```

Attentional FM (AFM):

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=afm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

DeepFM:

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=deepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

xDeepFM:

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=xdeepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

DCN:

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcn --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

DCNv2 (DCNMIX):

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcnmix --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

NeuralFM (NFM):

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=nfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

Wide & Deep:

```shell
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=wdl --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
```

Only item_view with MMOE:
```shell
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=1
```

Only item_like with MMOE:

```shell
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=0
```

2-task ESMM:

```shell
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=esmm --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2
```

2-task MMOE:

```shell
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2
```

Skip-SASRec:

```shell
python main.py --task_name=inference_acc --seed=5 --model_name=sas4infacc --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1
```

Skip-NextItNet:
```shell
python main.py --task_name=inference_acc --seed=5 --model_name=skiprec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --dilation=1,4 --kernel_size=3 --is_pretrain=1
```

If you use the MerRec dataset or this code, please cite:

```bibtex
@misc{li2024merrec,
  title={MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems},
  author={Lichi Li and Zainul Abi Din and Zhen Tan and Sam London and Tianlong Chen and Ajay Daptardar},
  year={2024},
  eprint={2402.14230},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```

- Codebase: This codebase is licensed under the MIT license.
- Dataset: The MerRec dataset is licensed under CC BY-NC 4.0 International.
Contributions are welcome. Please read the CLA carefully before submitting your contribution to Mercari. In all circumstances, by submitting your contribution you are deemed to accept and agree to be bound by the terms and conditions of the CLA.
We would like to thank Guanghu Yuan et al. for their work Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems, for making its code publicly available, and for the extensive documentation. Many of our experiment implementations centered on product_id in the CTR, MTL, and SBR tasks are derived from this work.
