
F2LLM

F2LLMs (Foundation-to-Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs, striking a strong balance between model size, training cost, and embedding performance:

On the MTEB leaderboard, F2LLM-4B ranks 2nd among models of ~4B size, and 7th overall, while F2LLM-1.7B ranks 1st among models of 1B-2B size.

F2LLMs are fully open. Model checkpoints are available at:

Training data is available at F2LLM data.

Train

In this repo we provide a streamlined and efficient script for training embedding models. To reproduce the training of F2LLMs, please:

1. Set up the environment following requirements.txt. Note that transformers>=4.51.0 is required for training Qwen3 models.
2. Download data and backbone models from Hugging Face (we use Qwen3 models).
3. Run tokenize_data_qwen.py to tokenize the downloaded data.
4. Modify the model path, data path, and other arguments in configs/config.json.
5. Start training with accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json.
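
The transformers>=4.51.0 requirement from the environment step can be checked before launching. The following is a minimal sketch, not part of this repo: the helper names (version_tuple, meets_minimum, check_transformers) are hypothetical, and the comparison only looks at the leading numeric components of the version string, so pre-release suffixes such as dev0 or rc1 are ignored.

```python
import re
from importlib import metadata

# Minimum version stated in the README for training Qwen3 models.
MIN_TRANSFORMERS = "4.51.0"


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components of a version string.

    "4.51.3" -> (4, 51, 3); "4.52.0.dev0" -> (4, 52, 0).
    Short versions are padded with zeros so "4.51" compares equal to "4.51.0".
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Describe whether the installed transformers package is new enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} satisfies >={MIN_TRANSFORMERS}"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}"


if __name__ == "__main__":
    print(check_transformers())
```

Running this before tokenization surfaces a version mismatch early instead of partway through model loading.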

Note: we recommend setting num_processes to 1 in configs/accelerate_config.yaml and launching the training code once to generate the cache for the training data before starting the actual training.
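
The two-phase launch described in the note can be sketched as a pair of commands. This sketch assumes a variation on the note: instead of editing configs/accelerate_config.yaml, it passes --num_processes 1 on the command line for the warm-up run, relying on accelerate launch letting command-line values take precedence over the config file. The function names are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex

# Shared pieces of the launch command, taken from the README.
BASE = ["accelerate", "launch", "--config_file", "configs/accelerate_config.yaml"]
SCRIPT = ["run.py", "--config", "configs/config.json"]


def warmup_command() -> list:
    """Single-process run, intended only to build the training-data cache."""
    return BASE + ["--num_processes", "1"] + SCRIPT


def training_command() -> list:
    """Full run, using the process count from the accelerate config file."""
    return BASE + SCRIPT


if __name__ == "__main__":
    # Print both commands in the order they would be run.
    print(shlex.join(warmup_command()))
    print(shlex.join(training_command()))
```

The warm-up run can be stopped once the cache has been written; the README does not say how to detect that, so this sketch leaves that step to the user.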

For multi-node training, run on the main node:

accelerate launch --config_file configs/accelerate_config.yaml --num_machines N_NODE --num_processes N_PROCESSES --machine_rank 0 --main_process_ip MASTER_IP --main_process_port MASTER_PORT run.py --config configs/config.json

where N_NODE is the number of machines, N_PROCESSES is N_NODE*8 (eight processes per machine), MASTER_IP is the IP address of your master node, and MASTER_PORT is a free port on the master node (e.g. 6379).

On worker nodes, run the same command but change machine_rank to that node's rank.
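
To make the substitution of N_NODE, N_PROCESSES, and machine_rank concrete, here is a small sketch that builds the command for each node. The helper build_launch_command and the constant PROCESSES_PER_NODE are hypothetical names, not part of this repo, and the IP address in the usage example is a placeholder. The factor of 8 comes from the N_NODE*8 rule above.

```python
import shlex

# From the README: N_PROCESSES is N_NODE*8.
PROCESSES_PER_NODE = 8


def build_launch_command(n_node, machine_rank, master_ip, master_port,
                         processes_per_node=PROCESSES_PER_NODE):
    """Return the accelerate launch argument list for one machine."""
    if not 0 <= machine_rank < n_node:
        raise ValueError("machine_rank must be in the range [0, n_node)")
    return [
        "accelerate", "launch",
        "--config_file", "configs/accelerate_config.yaml",
        "--num_machines", str(n_node),
        "--num_processes", str(n_node * processes_per_node),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", master_ip,
        "--main_process_port", str(master_port),
        "run.py", "--config", "configs/config.json",
    ]


if __name__ == "__main__":
    # Hypothetical two-node cluster; 10.0.0.1 is a placeholder master address.
    for rank in range(2):
        command = build_launch_command(2, rank, "10.0.0.1", 6379)
        print(f"node {rank}: {shlex.join(command)}")
```

Only --machine_rank differs between the printed commands; the process count, master address, and port are identical on every node.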

Citation

If you use the F2LLM models, data, or code, please cite the following technical report.

@article{2025F2LLM,
  title      = {F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
  author     = {Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2510.02294},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2510.02294},
  doi        = {10.48550/ARXIV.2510.02294},
  eprinttype = {arXiv},
  eprint     = {2510.02294}
}
