[Feature] support DFlash: Block Diffusion for Flash Speculative Decoding #4530

@hicofeng

Description

Motivation

https://github.com/z-lab/dflash

DFlash is a lightweight block diffusion model for speculative decoding. It has been demonstrated on Qwen3.5-27B, and vLLM, SGLang, and MLX already support the required model modifications.

The following is taken from the project's announcement:

🚀 Core Breakthrough
DFlash is a lightweight Block Diffusion Model purpose-built for speculative decoding. It predicts an entire token block in a single forward pass, delivering unprecedented inference acceleration.

Limitations of Traditional Methods

Conventional speculative decoding approaches (e.g., EAGLE-3) still generate drafts in an autoregressive manner, where each token must wait for the completion of the previous one. This caps the practical speedup at only 2–3×.
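To make the bottleneck concrete, the standard draft-then-verify loop can be sketched in a few lines (a toy greedy variant; `draft_next` and `target_next` are hypothetical single-token predictors, not DFlash or EAGLE-3 APIs):

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step.

    draft_next / target_next: functions mapping a token sequence to the
    next token (hypothetical stand-ins for the draft and target models).
    Returns the tokens accepted in this step.
    """
    # Autoregressive drafting: each of the k draft tokens depends on the
    # previous one, so this loop cannot be parallelized -- the serial
    # bottleneck that block-parallel drafting removes.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verification: the target model checks the k draft tokens (a single
    # batched forward pass in a real implementation).
    accepted = []
    for tok in draft:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # Always emit one token from the target model, so progress is
    # guaranteed even when no draft token is accepted.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy usage: the drafter predicts +1; the target agrees except after 3.
drafter = lambda seq: seq[-1] + 1
target = lambda seq: seq[-1] + 1 if seq[-1] != 3 else 0
print(speculative_step([0], drafter, target, k=4))  # -> [1, 2, 3, 0]
```

Because verification is batched but drafting is serial, the draft loop dominates latency as `k` grows, which is why autoregressive drafters plateau around 2-3x.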

DFlash Innovations

DFlash adopts a fundamentally different strategy:

  • Parallel Draft Generation: Generates a full token block in one forward pass
  • KV Injection Mechanism: Injects hidden layer features from the target model as contextual conditions into every layer of the draft model
  • Feature Fusion: Fuses multi-layer hidden states via FC + RMSNorm to provide highly consistent contextual information
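The fusion step in the last bullet can be sketched with NumPy (the shapes, the number of tapped layers, and the single-matrix FC are illustrative assumptions, not the actual DFlash architecture):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: scale each position by its root-mean-square over features.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def fuse_hidden_states(layer_states, w):
    """Fuse target-model hidden states from several layers into one
    conditioning vector per position.

    layer_states: list of [seq_len, d_model] arrays, one per tapped layer.
    w: [n_layers * d_model, d_model] weight of the fusing FC layer.
    """
    # Concatenate the tapped layers along the feature dimension ...
    stacked = np.concatenate(layer_states, axis=-1)
    # ... project back to d_model with a fully connected layer, then
    # normalize so the draft model sees conditioning at a consistent scale.
    return rms_norm(stacked @ w)

# Toy usage: 3 tapped layers, seq_len 4, d_model 8.
rng = np.random.default_rng(0)
states = [rng.standard_normal((4, 8)) for _ in range(3)]
w = rng.standard_normal((3 * 8, 8)) / np.sqrt(24)
cond = fuse_hidden_states(states, w)
assert cond.shape == (4, 8)
```

The RMSNorm at the end means every position hands the draft model a unit-scale conditioning vector regardless of which layers were tapped.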

📊 Remarkable Acceleration Results

Qwen3-8B: 6× Lossless Acceleration

On the Qwen3-8B model, DFlash achieves:

  • 6× lossless speedup
  • 2.5× faster than EAGLE-3
  • Acceptance rate above 89%

Qwen3.5-9B: 4.1× Acceleration

On Apple Silicon platforms, the Qwen3.5-9B model delivers:

  • 4.1× speedup
  • 16 tokens generated and validated in a single batch
  • Optimized with custom Metal kernels

🔥 Qwen3.5-27B: 5× Inference Speed Surge

Performance Comparison

| Setup              | 1024 tokens | 2048 tokens | Speedup |
|--------------------|-------------|-------------|---------|
| Baseline           | 14 tok/s    | 11 tok/s    | 1×      |
| 8-bit quantization | 35 tok/s    | 26 tok/s    | 2.5×    |
| 4-bit quantization | 28 tok/s    | 20 tok/s    | 2.0×    |

Key Findings

1. 8-bit quantization outperforms 4-bit: Delivers superior speedup while maintaining higher precision
2. Strong long-sequence performance: Retains 2.3× speedup when generating 2048 tokens
3. Lossless decoding: Fully preserves model output quality with zero accuracy degradation
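The speedup column is simply the throughput ratio against the unquantized baseline; a quick sanity check of the table's numbers:

```python
# Throughput numbers copied from the table above (tok/s).
baseline = {"1024": 14, "2048": 11}
int8 = {"1024": 35, "2048": 26}
int4 = {"1024": 28, "2048": 20}

def speedup(cfg, n):
    return cfg[n] / baseline[n]

assert round(speedup(int8, "1024"), 1) == 2.5   # the reported 2.5x
assert round(speedup(int4, "1024"), 1) == 2.0   # the reported 2.0x
assert 2.3 <= speedup(int8, "2048") < 2.4       # the "2.3x at 2048 tokens" finding
```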

Related resources

https://github.com/z-lab/dflash
