Also see [docling#725](https://github.com/docling-project/docling/issues/725).

Source: Issue [docling-core#119](https://github.com/docling-project/docling-core/issues/119)

??? question "How to use flash attention?"

    ### How to use flash attention?

    When running models in Docling on CUDA devices, you can enable the use of the Flash Attention 2 library.

    Using environment variables:

    ```
    DOCLING_CUDA_USE_FLASH_ATTENTION2=1
    ```
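    If you prefer staying in Python (for example in a notebook), a minimal sketch of the same setting is below; it assumes the variable is exported before Docling builds its accelerator settings, which is equivalent to setting it in the shell.

    ```python
    import os

    # Equivalent to exporting DOCLING_CUDA_USE_FLASH_ATTENTION2=1 in the shell;
    # set it before the Docling options/converter objects are created.
    os.environ["DOCLING_CUDA_USE_FLASH_ATTENTION2"] = "1"
    ```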
    Using code:

    ```python
    from docling.datamodel.accelerator_options import AcceleratorOptions
    from docling.datamodel.pipeline_options import VlmPipelineOptions

    pipeline_options = VlmPipelineOptions(
        accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)
    )
    ```
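    For context, a minimal end-to-end sketch of plugging these options into a converter is shown below. It follows the standard VLM-pipeline wiring from the Docling examples (`DocumentConverter`, `PdfFormatOption`, `VlmPipeline`); the input file name is a placeholder.

    ```python
    from docling.datamodel.accelerator_options import AcceleratorOptions
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import VlmPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    pipeline_options = VlmPipelineOptions(
        accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)
    )

    # Route PDF inputs through the VLM pipeline configured with the options above.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=VlmPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    result = converter.convert("document.pdf")  # placeholder input file
    print(result.document.export_to_markdown())
    ```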
    This requires the [flash-attn](https://pypi.org/project/flash-attn/) package to be installed. Below are two alternative ways to install it:

    ```shell
    # Build from source (requires the CUDA dev environment)
    pip install flash-attn

    # Use pre-built wheels (not available for all possible setups)
    FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn
    ```
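    As a quick sanity check after installation, you can confirm that the package imports and that a CUDA device is visible. A minimal sketch, assuming a working PyTorch CUDA setup:

    ```python
    from importlib.metadata import version

    import torch
    import flash_attn  # raises ImportError if the package is not installed

    print("flash-attn version:", version("flash-attn"))
    print("CUDA available:", torch.cuda.is_available())
    ```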