
Commit be42b03

docs: flash-attn usage and install (#1706)
* docs: flash-attn usage and install

* fix link

Signed-off-by: Michele Dolfi <[email protected]>
1 parent 96c54db commit be42b03

File tree

1 file changed: +35 -0 lines changed


docs/faq/index.md

Lines changed: 35 additions & 0 deletions
@@ -194,3 +194,38 @@ This is a collection of FAQ collected from the user questions on <https://github

Also see [docling#725](https://github.com/docling-project/docling/issues/725).

Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)

??? question "How to use flash attention?"

### How to use flash attention?
When running models in Docling on CUDA devices, you can enable the use of the Flash Attention 2 library.

Using environment variables:

```
DOCLING_CUDA_USE_FLASH_ATTENTION2=1
```
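As an illustration of the environment-variable route, here is a minimal sketch; it assumes the variable is picked up from the environment when `AcceleratorOptions` is instantiated, so it must be set before that point:

```python
import os

from docling.datamodel.accelerator_options import AcceleratorOptions

# Set the flag before the accelerator options are created, so it is read
# from the environment (assumption: DOCLING_-prefixed variables are
# evaluated when AcceleratorOptions is instantiated).
os.environ["DOCLING_CUDA_USE_FLASH_ATTENTION2"] = "1"

accelerator_options = AcceleratorOptions()
print(accelerator_options.cuda_use_flash_attention2)  # expected: True
```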
Using code:

```python
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.pipeline_options import VlmPipelineOptions

pipeline_options = VlmPipelineOptions(
    accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)
)
```
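To show where these pipeline options plug in, here is a minimal end-to-end sketch, assuming the VLM pipeline is selected through `PdfFormatOption(pipeline_cls=VlmPipeline, ...)` as in the Docling examples (the input URL is only a placeholder document):

```python
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Enable Flash Attention 2 for the VLM pipeline on CUDA devices.
pipeline_options = VlmPipelineOptions(
    accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2408.09869")  # placeholder input
print(result.document.export_to_markdown())
```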
This requires having the [flash-attn](https://pypi.org/project/flash-attn/) package installed. Below are two alternative ways to install it:

```shell
# Building from sources (requires the CUDA dev environment)
pip install flash-attn

# Using pre-built wheels (not available in all possible setups)
FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn
```
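After either install path, a quick way to confirm that the package is visible to your environment is a small standard-library check (a hypothetical helper, not part of Docling):

```python
import importlib.util

# The flash-attn distribution installs the `flash_attn` module;
# find_spec returns None if it is not importable in this environment.
if importlib.util.find_spec("flash_attn") is not None:
    print("flash-attn is available")
else:
    print("flash-attn is not installed")
```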
