[Feature] support DFlash: Block Diffusion for Flash Speculative Decoding #4530

@hicofeng

Description

Motivation

https://github.com/z-lab/dflash

DFlash is a lightweight block diffusion model for speculative decoding. It has been demonstrated on Qwen3.5-27B, and vLLM, SGLang, and MLX already support the required model modifications.

The following is taken from the project's announcement:

🚀 Core Breakthrough
DFlash is a lightweight Block Diffusion Model purpose-built for speculative decoding. It predicts an entire token block in a single forward pass, delivering unprecedented inference acceleration.

Limitations of Traditional Methods

Conventional speculative decoding approaches (e.g., EAGLE-3) still generate drafts in an autoregressive manner, where each token must wait for the completion of the previous one. This caps the practical speedup at only 2–3×.
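To make the bottleneck concrete, the standard draft-then-verify loop can be sketched in a few lines (a toy greedy variant; `draft_next` and `target_next` are hypothetical single-token predictors, not DFlash or EAGLE-3 APIs):

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step.

    draft_next / target_next: functions mapping a token sequence to the
    next token (hypothetical stand-ins for the draft and target models).
    Returns the tokens accepted in this step.
    """
    # Autoregressive drafting: each of the k draft tokens depends on the
    # previous one, so this loop cannot be parallelized -- the serial
    # bottleneck that block-parallel drafting removes.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verification: the target model checks the k draft tokens (a single
    # batched forward pass in a real implementation).
    accepted = []
    for tok in draft:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # Always emit one token from the target model, so progress is
    # guaranteed even when no draft token is accepted.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy usage: the drafter predicts +1; the target agrees except after 3.
drafter = lambda seq: seq[-1] + 1
target = lambda seq: seq[-1] + 1 if seq[-1] != 3 else 0
print(speculative_step([0], drafter, target, k=4))  # -> [1, 2, 3, 0]
```

Because verification is batched but drafting is serial, the draft loop dominates latency as `k` grows, which is why autoregressive drafters plateau around 2-3x.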

DFlash Innovations

DFlash adopts a fundamentally different strategy:

  • Parallel Draft Generation: Generates a full token block in one forward pass
  • KV Injection Mechanism: Injects hidden layer features from the target model as contextual conditions into every layer of the draft model
  • Feature Fusion: Fuses multi-layer hidden states via FC + RMSNorm to provide highly consistent contextual information
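The fusion step in the last bullet can be sketched with NumPy (the shapes, the number of tapped layers, and the single-matrix FC are illustrative assumptions, not the actual DFlash architecture):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: scale each position by its root-mean-square over features.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def fuse_hidden_states(layer_states, w):
    """Fuse target-model hidden states from several layers into one
    conditioning vector per position.

    layer_states: list of [seq_len, d_model] arrays, one per tapped layer.
    w: [n_layers * d_model, d_model] weight of the fusing FC layer.
    """
    # Concatenate the tapped layers along the feature dimension ...
    stacked = np.concatenate(layer_states, axis=-1)
    # ... project back to d_model with a fully connected layer, then
    # normalize so the draft model sees conditioning at a consistent scale.
    return rms_norm(stacked @ w)

# Toy usage: 3 tapped layers, seq_len 4, d_model 8.
rng = np.random.default_rng(0)
states = [rng.standard_normal((4, 8)) for _ in range(3)]
w = rng.standard_normal((3 * 8, 8)) / np.sqrt(24)
cond = fuse_hidden_states(states, w)
assert cond.shape == (4, 8)
```

The RMSNorm at the end means every position hands the draft model a unit-scale conditioning vector regardless of which layers were tapped.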

📊 Remarkable Acceleration Results

Qwen3-8B: 6× Lossless Acceleration

On the Qwen3-8B model, DFlash achieves:

  • 6× lossless speedup
  • 2.5× faster than EAGLE-3
  • Acceptance rate above 89%

Qwen3.5-9B: 4.1× Acceleration

On Apple Silicon platforms, the Qwen3.5-9B model delivers:

  • 4.1× speedup
  • 16 tokens generated and validated in a single batch
  • Optimized with custom Metal kernels

🔥 Qwen3.5-27B: 5× Inference Speed Surge

Performance Comparison

| Setup              | 1024 tokens | 2048 tokens | Speedup |
|--------------------|-------------|-------------|---------|
| Baseline           | 14 tok/s    | 11 tok/s    | 1×      |
| 8-bit quantization | 35 tok/s    | 26 tok/s    | 2.5×    |
| 4-bit quantization | 28 tok/s    | 20 tok/s    | 2.0×    |

Key Findings

1. 8-bit quantization outperforms 4-bit: Delivers superior speedup while maintaining higher precision
2. Strong long-sequence performance: Retains 2.3× speedup when generating 2048 tokens
3. Lossless decoding: Fully preserves model output quality with zero accuracy degradation
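The speedup column is simply the throughput ratio against the unquantized baseline; a quick sanity check of the table's numbers:

```python
# Throughput numbers copied from the table above (tok/s).
baseline = {"1024": 14, "2048": 11}
int8 = {"1024": 35, "2048": 26}
int4 = {"1024": 28, "2048": 20}

def speedup(cfg, n):
    return cfg[n] / baseline[n]

assert round(speedup(int8, "1024"), 1) == 2.5   # the reported 2.5x
assert round(speedup(int4, "1024"), 1) == 2.0   # the reported 2.0x
assert 2.3 <= speedup(int8, "2048") < 2.4       # the "2.3x at 2048 tokens" finding
```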

Related resources

https://github.com/z-lab/dflash
