Fast Transformer Inference with Better Transformer
===============================================================

**Author**: `Michael Gschwind <https://github.com/mikekgfb>`__

This tutorial introduces Better Transformer (BT) as part of the PyTorch 1.12 release.
In this tutorial, we show how to use Better Transformer for production
inference with torchtext. Better Transformer is a production-ready fastpath to
accelerate deployment of Transformer models with high performance on CPU and GPU.
The fastpath feature works transparently for models based either directly on
PyTorch core `nn.Module` classes or on torchtext.

Models which can be accelerated by Better Transformer fastpath execution are those
using the following PyTorch core `torch.nn` classes: `TransformerEncoder`,
`TransformerEncoderLayer`, and `MultiheadAttention`. In addition, torchtext has
been updated to use the core library modules to benefit from fastpath acceleration.
(Additional modules may be enabled with fastpath execution in the future.)

Better Transformer offers two types of acceleration:

* Native multihead attention (MHA) implementation for CPU and GPU to improve overall execution efficiency.
* Exploiting sparsity in NLP inference. Because of variable input lengths, input
  tokens may contain a large number of padding tokens for which processing may be
  skipped, delivering significant speedups.

Fastpath execution is subject to some criteria. Most importantly, the model
must be executed in inference mode and operate on input tensors that do not collect
gradient tape information (e.g., running with `torch.no_grad()`).
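
As a minimal sketch of that pattern (using the `model` and `model_input` objects
that are defined in the setup steps below), fastpath-eligible inference looks like this:

.. code-block:: python

    # Sketch of the fastpath prerequisites; `model` and `model_input` are
    # created later in this tutorial.
    model.eval()                     # put the model in inference mode
    with torch.no_grad():            # do not record gradient tape information
        output = model(model_input)  # now eligible for BT fastpath execution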

To follow this example in Google Colab, `click here
<https://colab.research.google.com/drive/1LTCo7HqnmTuDMJhDCPgYfRHff1RBzPtI?usp=sharing>`__.

Better Transformer Features in This Tutorial
--------------------------------------------

* Load pre-trained models (pre-1.12 models created without Better Transformer)
* Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
* Enable sparsity support
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

Additional Information
----------------------

Additional information about Better Transformer may be found in the PyTorch.org blog post
`A Better Transformer for Fast Transformer Inference
<https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/>`__.


1. Setup

1.1 Load pre-trained models

We download the XLM-R model from the pre-defined torchtext models by following the instructions in
`torchtext.models <https://pytorch.org/text/main/models.html>`__. We also set the DEVICE to execute
on-accelerator tests. (Enable GPU execution for your environment as appropriate.)

.. code-block:: python

    import torch
    import torch.nn as nn

    print(f"torch version: {torch.__version__}")

    DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    print(f"torch cuda available: {torch.cuda.is_available()}")

    import torchtext
    from torchtext.models import RobertaClassificationHead
    from torchtext.functional import to_tensor

    xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
    classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim=1024)
    model = xlmr_large.get_model(head=classifier_head)
    transform = xlmr_large.transform()

1.2 Dataset Setup

We set up two types of inputs: a small input batch and a big input batch with sparsity.

.. code-block:: python

    small_input_batch = [
        "Hello world",
        "How are you!"
    ]
    big_input_batch = [
        "Hello world",
        "How are you!",
        """`Well, Prince, so Genoa and Lucca are now just family estates of the
    Buonapartes. But I warn you, if you don't tell me that this means war,
    if you still try to defend the infamies and horrors perpetrated by
    that Antichrist- I really believe he is Antichrist- I will have
    nothing more to do with you and you are no longer my friend, no longer
    my 'faithful slave,' as you call yourself! But how do you do? I see
    I have frightened you- sit down and tell me all the news.`

    It was in July, 1805, and the speaker was the well-known Anna
    Pavlovna Scherer, maid of honor and favorite of the Empress Marya
    Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
    of high rank and importance, who was the first to arrive at her
    reception. Anna Pavlovna had had a cough for some days. She was, as
    she said, suffering from la grippe; grippe being then a new word in
    St. Petersburg, used only by the elite."""
    ]

Next, we select either the small or large input batch, preprocess the inputs, and test the model.

.. code-block:: python

    input_batch = big_input_batch

    model_input = to_tensor(transform(input_batch), padding_value=1)
    output = model(model_input)
    output.shape
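
Because the sparsity acceleration described above exploits padding tokens, it can be
useful to check how much of the tokenized batch actually is padding. The following is
a minimal optional sketch; the padding index `1` matches the `padding_value` passed to
`to_tensor` above:

.. code-block:: python

    # Fraction of positions in the padded batch that are padding tokens (index 1).
    padding_fraction = (model_input == 1).float().mean().item()
    print(f"padding fraction: {padding_fraction:.2%}")

The big input batch mixes two short sentences with a long passage, so most positions in
the short rows are padding, which is exactly what BT sparsity support skips.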

Finally, we set the benchmark iteration count:

.. code-block:: python

    ITERATIONS = 10
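
The sections below rely on the PyTorch profiler to compare runs. If you also want a single
wall-clock number per configuration, a small helper along these lines can reuse
`ITERATIONS` (a hypothetical `benchmark_wall_clock` sketch; the explicit CUDA
synchronization is there so queued GPU kernels are included in the measurement):

.. code-block:: python

    import time

    def benchmark_wall_clock(model, model_input, iterations=ITERATIONS):
        # Average seconds per forward pass; the caller decides whether to wrap
        # the call in model.eval() / torch.no_grad() for fastpath runs.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iterations):
            model(model_input)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iterations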

2. Execution

2.1 Run and benchmark inference on CPU with and without BT fastpath (native MHA only)

We run the model on CPU, and collect profile information:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

You can see a small improvement when the model is executing on CPU. Notice that the fastpath
profile shows most of the execution time in the native `TransformerEncoderLayer` implementation
`aten::_transformer_encoder_layer_fwd`.

.. code-block:: python

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)
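
The full profiler dump can be long. If you prefer a compact view, the profiler's aggregated
table (a standard `torch.autograd.profiler` facility; the sort key below is one reasonable
choice, not the only one) makes it easy to confirm that the fastpath run spends most of its
time in `aten::_transformer_encoder_layer_fwd`:

.. code-block:: python

    # Aggregate recorded events by operator name and show the top entries
    # by self CPU time for the most recent profiling run.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))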

2.2 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)

We check the BT sparsity setting:

.. code-block:: python

    # Inspect whether nested tensor (sparsity) support is currently enabled.
    model.encoder.transformer.layers.enable_nested_tensor

We disable the BT sparsity:

.. code-block:: python

    # Turn off nested tensor support so that only native MHA acceleration is measured.
    model.encoder.transformer.layers.enable_nested_tensor = False

We run the model on DEVICE, and collect profile information for native MHA execution on DEVICE:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the small input batch setting:

.. code-block:: python

    model.to(DEVICE)
    model_input = model_input.to(DEVICE)

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)

2.3 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

We enable sparsity support:

.. code-block:: python

    # Re-enable nested tensor support so that processing of padding tokens can be skipped.
    model.encoder.transformer.layers.enable_nested_tensor = True

We run the model on DEVICE, and collect profile information for native MHA and sparsity support execution on DEVICE:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode
  with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the large input batch setting which includes sparsity:

.. code-block:: python

    model.to(DEVICE)
    model_input = model_input.to(DEVICE)

    print("slow path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()

    print("fast path:")
    print("==========")
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)

Summary
-------

In this tutorial, we have introduced fast transformer inference with
Better Transformer fastpath execution in torchtext, using PyTorch core
Better Transformer support for Transformer Encoder models. We have
demonstrated the use of Better Transformer with models trained prior to
the availability of BT fastpath execution. We have also demonstrated and
benchmarked both BT fastpath execution modes: native MHA execution
and BT sparsity acceleration.