
Commit 6c3b79d

BetterTransformer tutorial (pytorch#1976)
* Adding better transformer tutorial
1 parent 8cb5c3a commit 6c3b79d

File tree

3 files changed: +258 -0 lines changed
[binary file added: 34.9 KB (preview not shown)]

index.rst

Lines changed: 10 additions & 0 deletions
@@ -3,6 +3,7 @@ Welcome to PyTorch Tutorials
 
 What's new in PyTorch tutorials?
 
+* `Fast Transformer Inference with Better Transformer <https://pytorch.org/tutorials/intermediate/bettertransformer_tutorial.html?utm_source=whats_new_tutorials&utm_medium=bettertransformer>`__
 * `Introduction to TorchRec <https://pytorch.org/tutorials/intermediate/torchrec_tutorial.html?utm_source=whats_new_tutorials&utm_medium=torchrec>`__
 * `Getting Started with Fully Sharded Data Parallel (FSDP) <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=whats_new_tutorials&utm_medium=FSDP>`__
 * `Advanced model training with Fully Sharded Data Parallel (FSDP) <https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=whats_new_tutorials&utm_medium=FSDP_advanced>`__
@@ -275,6 +276,14 @@ What's new in PyTorch tutorials?
 
 .. Deploying PyTorch Models in Production
 
+.. customcarditem::
+   :header: Fast Transformer Inference with Better Transformer
+   :card_description: Deploy a PyTorch Transformer model using Better Transformer with high performance for inference
+   :image: _static/img/thumbnails/cropped/pytorch-logo.png
+   :link: intermediate/bettertransformer_tutorial.html
+   :tags: Production,Text
+
+
 .. customcarditem::
    :header: Deploying PyTorch in Python via a REST API with Flask
    :card_description: Deploy a PyTorch model using Flask and expose a REST API for model inference using the example of a pretrained DenseNet 121 model which detects the image.
@@ -795,6 +804,7 @@ Additional Resources
    :hidden:
    :caption: Deploying PyTorch Models in Production
 
+   intermediate/bettertransformer_tutorial
    intermediate/flask_rest_api_tutorial
    beginner/Intro_to_TorchScript_tutorial
    advanced/cpp_export
bettertransformer_tutorial.rst (new file)

Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
Fast Transformer Inference with Better Transformer
===============================================================

**Author**: `Michael Gschwind <https://github.com/mikekgfb>`__

This tutorial introduces Better Transformer (BT) as part of the PyTorch 1.12 release.
In this tutorial, we show how to use Better Transformer for production
inference with torchtext. Better Transformer is a production-ready fastpath to
accelerate deployment of Transformer models with high performance on CPU and GPU.
The fastpath feature works transparently for models based either directly on
PyTorch core `nn.Module` or on torchtext.

Models which can be accelerated by Better Transformer fastpath execution are those
using the following PyTorch core `torch.nn` classes: `TransformerEncoder`,
`TransformerEncoderLayer`, and `MultiheadAttention`. In addition, torchtext has
been updated to use the core library modules to benefit from fastpath acceleration.
(Additional modules may be enabled with fastpath execution in the future.)

Better Transformer offers two types of acceleration:

* Native multihead attention implementation for CPU and GPU to improve overall execution efficiency.
* Exploiting sparsity in NLP inference. Because of variable input lengths, input
  tokens may contain a large number of padding tokens for which processing may be
  skipped, delivering significant speedups.

Fastpath execution is subject to some criteria. Most importantly, the model
must be executed in inference mode and operate on input tensors that do not collect
gradient tape information (e.g., running with torch.no_grad).
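
Before diving into the torchtext example, here is a minimal, self-contained sketch
(not part of the original tutorial code) illustrating these two preconditions on a
plain PyTorch `nn.TransformerEncoder`: the model is put in inference mode with
`model.eval()` and run without gradient collection under `torch.no_grad()`:

.. code-block:: python

    import torch
    import torch.nn as nn

    # Build a small Transformer encoder; batch_first=True is among the
    # conditions the fastpath expects.
    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    encoder.eval()                      # inference mode
    src = torch.rand(2, 16, 512)        # (batch, sequence, embedding)
    with torch.no_grad():               # no gradient tape is recorded
        out = encoder(src)
    print(out.shape)                    # torch.Size([2, 16, 512])
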
To follow this example in Google Colab, `click here
<https://colab.research.google.com/drive/1LTCo7HqnmTuDMJhDCPgYfRHff1RBzPtI?usp=sharing>`__.

Better Transformer Features in This Tutorial
--------------------------------------------
* Load pre-trained models (created before PyTorch version 1.12, without Better Transformer)
* Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
* Enable sparsity support
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

Additional Information
-----------------------
Additional information about Better Transformer may be found in the PyTorch.org blog
`A Better Transformer for Fast Transformer Inference
<https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference//>`__.

1. Setup

1.1 Load pre-trained models

We download the XLM-R model from the pre-defined torchtext models by following the instructions in
`torchtext.models <https://pytorch.org/text/main/models.html>`__. We also set the DEVICE to execute
on-accelerator tests. (Enable GPU execution for your environment as appropriate.)

.. code-block:: python

    import torch
    import torch.nn as nn

    print(f"torch version: {torch.__version__}")

    DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    print(f"torch cuda available: {torch.cuda.is_available()}")

    import torch, torchtext
    from torchtext.models import RobertaClassificationHead
    from torchtext.functional import to_tensor
    xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
    classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim=1024)
    model = xlmr_large.get_model(head=classifier_head)
    transform = xlmr_large.transform()

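If the XLM-R Large encoder (1024-dimensional embeddings) is too large for your
environment, a smaller pre-trained encoder can be substituted. The following is a
hedged sketch, not part of the original tutorial; it assumes torchtext's
`XLMR_BASE_ENCODER` bundle, whose embedding dimension is 768:

.. code-block:: python

    # Optional, hedged alternative (not in the original tutorial): use the smaller
    # XLM-R Base encoder when memory is limited. The classification head's
    # input_dim must match the encoder's 768-dimensional embeddings.
    xlmr_base = torchtext.models.XLMR_BASE_ENCODER
    classifier_head_base = RobertaClassificationHead(num_classes=2, input_dim=768)
    small_model = xlmr_base.get_model(head=classifier_head_base)
    small_transform = xlmr_base.transform()
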
1.2 Dataset Setup

We set up two types of inputs: a small input batch and a big input batch with sparsity.

.. code-block:: python

    small_input_batch = [
        "Hello world",
        "How are you!"
    ]
    big_input_batch = [
        "Hello world",
        "How are you!",
        """`Well, Prince, so Genoa and Lucca are now just family estates of the
    Buonapartes. But I warn you, if you don't tell me that this means war,
    if you still try to defend the infamies and horrors perpetrated by
    that Antichrist- I really believe he is Antichrist- I will have
    nothing more to do with you and you are no longer my friend, no longer
    my 'faithful slave,' as you call yourself! But how do you do? I see
    I have frightened you- sit down and tell me all the news.`

    It was in July, 1805, and the speaker was the well-known Anna
    Pavlovna Scherer, maid of honor and favorite of the Empress Marya
    Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
    of high rank and importance, who was the first to arrive at her
    reception. Anna Pavlovna had had a cough for some days. She was, as
    she said, suffering from la grippe; grippe being then a new word in
    St. Petersburg, used only by the elite."""
    ]

Next, we select either the small or large input batch, preprocess the inputs and test the model.

.. code-block:: python

    input_batch = big_input_batch

    model_input = to_tensor(transform(input_batch), padding_value=1)
    output = model(model_input)
    output.shape

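To make the sparsity argument concrete, the fraction of padding in the preprocessed
batch can be inspected directly. This is a small, hedged addition (not in the original
tutorial) that relies only on the `padding_value=1` used above:

.. code-block:: python

    # Hedged helper (not part of the original tutorial): with the heavily padded
    # big_input_batch, many token positions are padding (id 1) that Better
    # Transformer's sparsity support can skip.
    padding_fraction = (model_input == 1).sum().item() / model_input.numel()
    print(f"padding tokens in batch: {padding_fraction:.1%}")
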
Finally, we set the benchmark iteration count:

.. code-block:: python

    ITERATIONS = 10

2. Execution

2.1 Run and benchmark inference on CPU with and without BT fastpath (native MHA only)

We run the model on CPU, and collect profile information:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

You can see a small improvement when the model is executing on CPU. Notice that the fastpath profile shows most of the execution time
in the native `TransformerEncoderLayer` implementation `aten::_transformer_encoder_layer_fwd`.

.. code-block:: python

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)

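The profiler tables above report per-operator time. For a simple end-to-end wall-clock
comparison, a small timing helper can be used as well. This is a hedged sketch, not part
of the original tutorial, and it measures CPU wall-clock time only (on GPU,
`torch.cuda.synchronize()` would be needed for accurate timing):

.. code-block:: python

    import time

    def benchmark(model, model_input, iterations=ITERATIONS):
        # Hypothetical helper (not in the original tutorial): average wall-clock
        # latency per forward pass over `iterations` runs.
        start = time.perf_counter()
        for _ in range(iterations):
            model(model_input)
        return (time.perf_counter() - start) / iterations

    model.eval()
    with torch.no_grad():
        print(f"fastpath avg latency: {benchmark(model, model_input):.4f} s/iteration")
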
2.2 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)

We check the BT sparsity setting:

.. code-block:: python

    model.encoder.transformer.layers.enable_nested_tensor

We disable BT sparsity:

.. code-block:: python

    model.encoder.transformer.layers.enable_nested_tensor = False

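For models built directly from `torch.nn` rather than torchtext, the corresponding
switch is the `enable_nested_tensor` argument of `nn.TransformerEncoder`. The following
is a hedged sketch (not part of the original tutorial) showing where the flag lives on a
plain PyTorch encoder:

.. code-block:: python

    import torch.nn as nn

    # enable_nested_tensor controls whether padded inputs are converted to nested
    # tensors so that padding tokens can be skipped (BT sparsity support).
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=True)

    # As with the torchtext model above, the attribute can also be toggled later:
    encoder.enable_nested_tensor = False
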
We run the model on DEVICE, and collect profile information for native MHA execution on DEVICE:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the small input batch setting:

.. code-block:: python

    model.to(DEVICE)
    model_input = model_input.to(DEVICE)

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)

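The raw `print(prof)` listing can be long. As a hedged convenience (not part of the
original tutorial), the profiler's standard `key_averages()` API can summarize the
heaviest operators:

.. code-block:: python

    # Summarize the last profile, sorted by total CUDA time, top 10 operators.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
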
2.3 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

We enable sparsity support:

.. code-block:: python

    model.encoder.transformer.layers.enable_nested_tensor = True

We run the model on DEVICE, and collect profile information for native MHA and sparsity support execution on DEVICE:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the large input batch setting which includes sparsity:

.. code-block:: python

    model.to(DEVICE)
    model_input = model_input.to(DEVICE)

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)

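To see the effect of sparsity acceleration in isolation, the fastpath benchmark can be
repeated for both the densely packed small batch and the heavily padded big batch. This
is a hedged sketch, not part of the original tutorial; sparsity support only helps when
there is padding to skip:

.. code-block:: python

    # Hedged comparison (not in the original tutorial): fastpath profile for the
    # dense small batch vs. the padded big batch, with sparsity support enabled.
    model.eval()
    for name, batch in [("small", small_input_batch), ("big", big_input_batch)]:
        batch_input = to_tensor(transform(batch), padding_value=1).to(DEVICE)
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            with torch.no_grad():
                for i in range(ITERATIONS):
                    model(batch_input)
        print(f"{name} batch:")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
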
Summary
-------

In this tutorial, we have introduced fast transformer inference with
Better Transformer fastpath execution in torchtext, using PyTorch core
Better Transformer support for Transformer Encoder models. We have
demonstrated the use of Better Transformer with models trained prior to
the availability of BT fastpath execution. We have also demonstrated and
benchmarked both BT fastpath execution modes: native MHA execution
and BT sparsity acceleration.