
Commit 9cf7cc0

intermediate/realtime_rpi: updated with benchmarks + feedback
1 parent 5bc5b94 commit 9cf7cc0

intermediate_source/realtime_rpi.rst

Lines changed: 94 additions & 8 deletions
@@ -9,7 +9,7 @@ classification model in real time (30 fps+) on the CPU.
This was all tested with Raspberry Pi 4 Model B 4GB but should work with the 2GB
variant as well as on the 3B with reduced performance.

-.. image:: https://user-images.githubusercontent.com/909104/152895495-7e9910c1-2b9f-4299-a788-d7ec43a93424.jpg
+.. image:: https://user-images.githubusercontent.com/909104/153093710-bc736b6f-69d9-4a50-a3e8-9f2b2c9e04fd.gif

Prerequisites
~~~~~~~~~~~~~~~~
@@ -78,8 +78,7 @@ We can now check that everything installed correctly:

.. code:: shell

-    $ python3 -c "import torch; print(torch.__version__)"
-    1.10.0+cpu
+    $ python -c "import torch; print(torch.__version__)"

.. image:: https://user-images.githubusercontent.com/909104/152874271-d7057c2d-80fd-4761-aed4-df6c8b7aa99f.png

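
Since the quantized models used later in the tutorial rely on the ``qnnpack`` backend, it can also be worth confirming that the engine is available. A minimal check, assuming the same environment as the install check above:

.. code:: python

    import torch

    # qnnpack is the quantized-inference backend used on ARM CPUs such as the
    # Raspberry Pi; it should appear in the list of supported engines
    print(torch.backends.quantized.supported_engines)
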
@@ -116,7 +115,7 @@ shuffling to get it into the expected RGB format.
    # convert opencv output from BGR to RGB
    image = image[:, :, [2, 1, 0]]

-NOTE: You can get even more performance by training the model directly with OpenCV's BGR data format to remove the conversion step.
+This data reading and processing takes about ``3.5 ms``.

Image Preprocessing
~~~~~~~~~~~~~~~~~~~~
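
An aside on the ``3.5 ms`` figure in the hunk above: here is a minimal sketch of how that per-frame read-and-convert time could be measured. The ``cv2.VideoCapture`` setup is an assumption based on the capture code earlier in the tutorial, and the first read after opening the camera may be slower than steady state:

.. code:: python

    import time

    import cv2

    cap = cv2.VideoCapture(0)  # assumed camera index

    start = time.time()
    ret, image = cap.read()           # grab one frame from the camera
    image = image[:, :, [2, 1, 0]]    # BGR -> RGB channel shuffle
    print(f"read + convert: {(time.time() - start) * 1000:.1f} ms")
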
@@ -128,11 +127,55 @@ We need to take the frames and transform them into the format the model expects.
    from torchvision import transforms

    preprocess = transforms.Compose([
+        # convert the frame to a CHW torch tensor for training
        transforms.ToTensor(),
+        # normalize the colors to the range that mobilenet_v2/3 expect
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    input_tensor = preprocess(image)
-    input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model
+    # The model can handle multiple images simultaneously so we need to add an
+    # empty dimension for the batch.
+    # [3, 224, 224] -> [1, 3, 224, 224]
+    input_batch = input_tensor.unsqueeze(0)
+
+Model Choices
+~~~~~~~~~~~~~~~
+
+There are a number of models to choose from, with different performance
+characteristics. Not all models provide a ``qnnpack`` pretrained variant, so for
+testing purposes you should choose one that does, but if you train and quantize
+your own model you can use any of them.
+
+We're using ``mobilenet_v2`` for this tutorial since it has good performance and
+accuracy.
+
+Raspberry Pi 4 Benchmark Results:
+
++--------------------+------+-----------------------+-----------------------+--------------------+
+| Model              | FPS  | Total Time (ms/frame) | Model Time (ms/frame) | qnnpack Pretrained |
++====================+======+=======================+=======================+====================+
+| mobilenet_v2       | 33.7 | 29.7                  | 26.4                  | True               |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| mobilenet_v3_large | 29.3 | 34.1                  | 30.7                  | True               |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| resnet18           | 9.2  | 109.0                 | 100.3                 | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| resnet50           | 4.3  | 233.9                 | 225.2                 | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| resnext101_32x8d   | 1.1  | 892.5                 | 885.3                 | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| inception_v3       | 4.9  | 204.1                 | 195.5                 | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| googlenet          | 7.4  | 135.3                 | 132.0                 | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| shufflenet_v2_x0_5 | 46.7 | 21.4                  | 18.2                  | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| shufflenet_v2_x1_0 | 24.4 | 41.0                  | 37.7                  | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| shufflenet_v2_x1_5 | 16.8 | 59.6                  | 56.3                  | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+
+| shufflenet_v2_x2_0 | 11.6 | 86.3                  | 82.7                  | False              |
++--------------------+------+-----------------------+-----------------------+--------------------+

MobileNetV2: Quantization and JIT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
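
A note on the benchmark table above: the two models marked ``qnnpack Pretrained = True`` can be loaded with ready-made quantized weights using the same pattern the tutorial applies to ``mobilenet_v2`` in the hunks below. A minimal sketch, with ``mobilenet_v3_large`` chosen purely as an example:

.. code:: python

    import torch
    from torchvision import models

    # use the qnnpack quantized engine (the ARM backend used on the Raspberry Pi)
    torch.backends.quantized.engine = 'qnnpack'

    # mobilenet_v3_large also ships a qnnpack pretrained quantized variant
    net = models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
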
@@ -163,7 +206,6 @@ We then want to jit the model to reduce Python overhead and fuse any ops. Jit gi
.. code:: python

    net = torch.jit.script(net)
-    net.eval()

Putting It Together
~~~~~~~~~~~~~~~~~~~~~~~~~
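
To see what the jit step in the hunk above buys in model time, here is a rough sketch of timing one forward pass before and after ``torch.jit.script``. This is not part of the tutorial; the quantized ``mobilenet_v2`` setup mirrors the code below, and the exact numbers will vary by device:

.. code:: python

    import time

    import torch
    from torchvision import models

    torch.backends.quantized.engine = 'qnnpack'
    net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)
    x = torch.rand(1, 3, 224, 224)

    for label, model in [("eager", net), ("scripted", torch.jit.script(net))]:
        with torch.no_grad():
            for _ in range(5):   # warm up so one-off compilation work isn't measured
                model(x)
            start = time.time()
            model(x)
            print(f"{label}: {(time.time() - start) * 1000:.1f} ms/frame")
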
@@ -196,7 +238,6 @@ We can now put all the pieces together and run it:
    net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)
    # jit model to take it from ~20fps to ~30fps
    net = torch.jit.script(net)
-    net.eval()

    started = time.time()
    last_logged = time.time()
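
For context, the two timers above typically feed a frames-per-second log inside the main loop. A minimal sketch of that accounting; in the tutorial's actual loop, a frame is read, preprocessed, and run through the model where the comment sits:

.. code:: python

    import time

    last_logged = time.time()
    frame_count = 0

    while True:
        # read a frame, preprocess it and run the model here, as in the
        # sections above
        frame_count += 1

        # report throughput roughly once per second
        now = time.time()
        if now - last_logged > 1:
            print(f"{frame_count / (now - last_logged):.2f} fps")
            last_logged = now
            frame_count = 0
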
@@ -243,6 +284,50 @@ If we check ``htop`` we see that we have almost 100% utilization.

.. image:: https://user-images.githubusercontent.com/909104/152892630-f094b84b-19ba-48f6-8632-1b954abc59c7.png

+To verify that it's working end to end, we can compute the probabilities of the
+classes and
+`use the ImageNet class labels <https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a>`_
+to print the detections.
+
+.. code:: python
+
+    top = list(enumerate(output[0].softmax(dim=0)))
+    top.sort(key=lambda x: x[1], reverse=True)
+    for idx, val in top[:10]:
+        print(f"{val.item()*100:.2f}% {classes[idx]}")
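
The ``classes`` list used in the snippet above comes from those ImageNet labels. A minimal sketch of building it, assuming a local copy of the linked gist saved as ``imagenet1000_clsidx_to_labels.txt`` (the file name and dict-literal format are assumptions about that gist):

.. code:: python

    import ast

    # the labels file stores a Python dict literal mapping class index -> label
    with open("imagenet1000_clsidx_to_labels.txt") as f:
        idx_to_label = ast.literal_eval(f.read())

    classes = [idx_to_label[i] for i in range(len(idx_to_label))]
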
+
+``mobilenet_v3_large`` running in real time:
+
+.. image:: https://user-images.githubusercontent.com/909104/153093710-bc736b6f-69d9-4a50-a3e8-9f2b2c9e04fd.gif
+
+
+Detecting an orange:
+
+.. image:: https://user-images.githubusercontent.com/909104/153092153-d9c08dfe-105b-408a-8e1e-295da8a78c19.jpg
+
+
+Detecting a mug:
+
+.. image:: https://user-images.githubusercontent.com/909104/153092155-4b90002f-a0f3-4267-8d70-e713e7b4d5a0.jpg
+
+
+Troubleshooting: Performance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PyTorch by default will use all of the cores available. If you have anything
+running in the background on the Raspberry Pi it may cause contention with the
+model inference, causing latency spikes. To alleviate this you can reduce the
+number of threads, which will reduce the peak latency at a small performance
+penalty.
+
+.. code:: python
+
+    torch.set_num_threads(2)
+
+For ``shufflenet_v2_x1_5``, using ``2 threads`` instead of ``4 threads``
+increases best-case latency to ``72 ms`` from ``60 ms`` but eliminates the
+latency spikes of ``128 ms``.
+

Next Steps
~~~~~~~~~~~~~

@@ -256,4 +341,5 @@ directly deploy with good performance on a Raspberry Pi.
See more:

* `Quantization <https://pytorch.org/docs/stable/quantization.html>`_ for more information on how to quantize and fuse your model.
-* :ref:`beginner/transfer_learning_tutorial` for how to use transfer learning to fine tune a pre-existing model to your dataset.
+* `Transfer Learning Tutorial <https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html>`_
+  for how to use transfer learning to fine tune a pre-existing model to your dataset.
