Motivation
https://github.com/z-lab/dflash
DFlash is a lightweight block diffusion model.
Could you add support for Qwen3.5-27B? vLLM, SGLang, and MLX already support the necessary model modifications.
The following is taken from the project's promotional introduction:
🚀 Core Breakthrough
DFlash is a lightweight Block Diffusion Model purpose-built for speculative decoding. It predicts an entire token block in a single forward pass, delivering unprecedented inference acceleration.
Limitations of Traditional Methods
Conventional speculative decoding approaches (e.g., EAGLE-3) still generate drafts in an autoregressive manner, where each token must wait for the completion of the previous one. This caps the practical speedup at only 2–3×.
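As a rough illustration of why autoregressive drafting caps out, the standard speculative-decoding expectation can be computed directly (this is textbook analysis under an independent per-token acceptance assumption, not numbers from DFlash itself; `alpha` and `k` below are illustrative):

```python
def expected_accepted(alpha: float, k: int) -> float:
    # Expected tokens committed per target-model forward pass when each
    # of k drafted tokens is accepted independently with probability
    # alpha; the +1 accounts for the bonus token the target model
    # produces itself when verifying.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Even with a strong draft model (alpha = 0.8) drafting k = 4 tokens
# one-by-one, only ~3.4 tokens are committed per verification step:
print(round(expected_accepted(0.8, 4), 2))  # 3.36
```

Since each of those k draft tokens also costs a sequential draft-model forward pass, the end-to-end speedup lands in the 2–3× range the text describes.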
DFlash Innovations
DFlash adopts a fundamentally different strategy:
- Parallel Draft Generation: Generates a full token block in one forward pass
- KV Injection Mechanism: Injects hidden-state features from the target model as contextual conditioning into every layer of the draft model
- Feature Fusion: Fuses multi-layer hidden states via FC + RMSNorm to provide highly consistent contextual information
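The fusion step described above can be sketched in plain NumPy. The layer count, dimensions, and single FC weight matrix here are illustrative assumptions, not DFlash's actual architecture:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMSNorm: rescale each feature vector by its root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def fuse_hidden_states(layers: list, W: np.ndarray) -> np.ndarray:
    # Concatenate hidden states from several target-model layers,
    # project them with one fully connected layer, then RMS-normalize
    # to produce the conditioning signal for the draft model.
    h = np.concatenate(layers, axis=-1)  # (seq, n_layers * d_target)
    return rms_norm(h @ W)               # (seq, d_draft)

# Toy shapes: 3 target layers, d_target=64, d_draft=32, 8 positions.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 64)) for _ in range(3)]
W = rng.standard_normal((3 * 64, 32)) * 0.05
cond = fuse_hidden_states(layers, W)
print(cond.shape)  # (8, 32)
```

The point of the FC + RMSNorm combination is that the draft model sees a fixed-scale, fixed-width summary of the target model's internal state regardless of how many layers are tapped.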
📊 Remarkable Acceleration Results
Qwen3-8B: 6× Lossless Acceleration
On the Qwen3-8B model, DFlash achieves:
- 6× lossless speedup
- 2.5× faster than EAGLE-3
- Acceptance rate as high as 89%+
Qwen3.5-9B: 4.1× Acceleration
On Apple Silicon platforms, the Qwen3.5-9B model delivers:
- 4.1× speedup
- Verifies 16 generated tokens in a single batch
- Optimized with custom Metal kernels
🔥 Qwen3.5-27B: 5× Inference Speed Surge
Performance Comparison

| Setup | 1024 tokens | 2048 tokens | Speedup |
| --- | --- | --- | --- |
| Baseline | 14 tok/s | 11 tok/s | 1× |
| 8-bit quantization | 35 tok/s | 26 tok/s | 2.5× |
| 4-bit quantization | 28 tok/s | 20 tok/s | 2.0× |
Key Findings
1. 8-bit quantization outperforms 4-bit: Delivers superior speedup while maintaining higher precision
2. Strong long-sequence performance: Retains 2.3× speedup when generating 2048 tokens
3. Lossless decoding: Fully preserves model output quality with zero accuracy degradation
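A quick sanity check of the quoted speedups, using the throughput values copied from the table above (the quoted 2.3× at 2048 tokens is 26/11 truncated):

```python
# Throughput (tok/s) from the performance table above.
baseline = {"1024": 14, "2048": 11}
quant8 = {"1024": 35, "2048": 26}
quant4 = {"1024": 28, "2048": 20}

for name, cfg in [("8-bit", quant8), ("4-bit", quant4)]:
    for n in ("1024", "2048"):
        print(f"{name} @ {n} tokens: {cfg[n] / baseline[n]:.2f}x")
```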
Related resources
https://github.com/z-lab/dflash
Additional context
