Diffusion Language Models are Super Data Learners


Highlights

  • We pre-trained DLMs and AR models from scratch at scales up to 8B parameters and 480B tokens. DLMs demonstrate over 3× greater data potential than autoregressive (AR) models. Notably, a 1B-parameter masked diffusion model achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. More repetition could improve its performance further, as we observed no signs of diminishing returns.
  • DLMs are super-dense models that consume more FLOPs than dense AR models. Training a DLM to fully exploit the data typically demands at least two orders of magnitude more FLOPs. During inference, generating sequences of 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs over AR baselines. In addition, the diffusion objective enables more expressive bidirectional attention, which lets the model exploit the non-causal structure of language data and squeeze more value from it (see the sketch after this list).
  • The concurrent work “Diffusion Beats Autoregressive in Data-Constrained Settings” contains critical methodological issues, including a problematic diffusion loss formulation, invalid comparison metrics, unfair settings for the AR models, and a problematic scaling-law formulation, all of which may lead to questionable results and conclusions.
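
For context on what the masked diffusion objective referenced above typically looks like in practice, the sketch below illustrates one training step of a masked (absorbing-state) diffusion language model in PyTorch. This is a minimal illustration, not the authors' implementation: the bidirectional `model`, the `MASK_ID` constant, and the per-sequence uniform corruption schedule are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

MASK_ID = 50257  # hypothetical id of the [MASK] / absorbing token

def masked_diffusion_loss(model, tokens):
    """One illustrative training step of a masked (absorbing-state) diffusion LM.

    `model` is assumed to be a bidirectional Transformer mapping (batch, seq_len)
    token ids to (batch, seq_len, vocab_size) logits; `tokens` holds clean ids.
    """
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1] per sequence; mask each position
    # independently with probability t.
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    is_masked = torch.rand(b, n, device=tokens.device) < t
    noisy = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    # Bidirectional attention: every position attends to the whole sequence.
    logits = model(noisy)

    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens.reshape(-1), reduction="none"
    ).reshape(b, n)

    # Only masked positions contribute; the 1/t reweighting gives the standard
    # diffusion-style bound on the data negative log-likelihood.
    loss = (ce * is_masked / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```

On the inference-cost side, under a simple decoding scheme that unmasks one token per step with a full bidirectional forward pass and no KV cache, generating a length-L sequence takes on the order of L full passes over L tokens, which is broadly consistent with the roughly length-proportional FLOP overhead (16× at 16 tokens, up to ~4700× at 4096 tokens) quoted in the highlights.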

The Crossover

Figure A of the blog: performance comparison of autoregressive (AR) and masked diffusion models when repeating a limited portion of data. All models are trained on 96B total tokens (including repetition), with the number of unique tokens varying from 0.5B to 96B. Diffusion models exploit the data better through more repetition of limited unique data. More unique tokens require more repetition before the crossover appears; the runs with the most unique tokens postpone the crossover beyond our 96B-token observation window.


Citation

@misc{ni2025difflm,
  title={Diffusion Language Models are Super Data Learners},
  author={Jinjie Ni and the team},
  year={2025},
  howpublished={\url{https://jinjieni.notion.site/Diffusion-Language-Models-are-Super-Data-Learners-239d8f03a866800ab196e49928c019ac}},
  note={Notion Blog},
}

About

The official GitHub repository for "Diffusion Language Models are Super Data Learners".
