Introduction
------------
- PyTorch 1.8 includes an updated profiler API capable of
+ PyTorch 1.8 includes an updated profiler API capable of
recording the CPU side operations as well as the CUDA kernel launches on the GPU side.
The profiler can visualize this information
in TensorBoard Plugin and provide analysis of the performance bottlenecks.
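To make the flow concrete before the diff hunks below, here is a minimal sketch of the API in use; the ``./log/minimal`` directory and the three-iteration loop are illustrative assumptions, not part of this tutorial::

    import torch
    import torchvision.models as models

    # record CPU activity always, and CUDA kernel launches when a GPU is present
    activities = [torch.profiler.ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(torch.profiler.ProfilerActivity.CUDA)

    model = models.resnet18().eval()
    inputs = torch.randn(1, 3, 224, 224)

    # the trace handler writes files that the TensorBoard plugin can load
    with torch.profiler.profile(
        activities=activities,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/minimal'),
    ) as prof:
        for _ in range(3):
            model(inputs)
            prof.step()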
@@ -113,7 +113,8 @@ def train(data):
# After profiling, result files will be saved into the ``./log/resnet18`` directory.
# Specify this directory as a ``logdir`` parameter to analyze the profile in TensorBoard.
# - ``record_shapes`` - whether to record shapes of the operator inputs.
- # - ``profile_memory`` - Track tensor memory allocation/deallocation.
+ # - ``profile_memory`` - Track tensor memory allocation/deallocation. Note: with PyTorch versions
+ #   earlier than 1.10, this option can make profiling take much longer; disable it or upgrade if profiling is slow.
# - ``with_stack`` - Record source information (file and line number) for the ops.
# If TensorBoard is launched in VSCode (`reference <https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration>`_),
# clicking a stack frame will navigate to the specific code line.
@@ -122,6 +123,7 @@ def train(data):
schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
record_shapes=True,
+ profile_memory=True,
with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
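The hunk stops at the loop header; for context, a sketch of how the loop body continues in this tutorial so that the schedule above is actually exercised (it assumes the ``train`` helper defined earlier in the tutorial)::

        # (1 wait + 1 warmup + 3 active) * 2 repeats = 10 profiled steps in total
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # tell the profiler that the next training step has started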
@@ -287,28 +289,54 @@ def train(data):
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# - Memory view
- # To profile memory, please add ``profile_memory=True`` in arguments of ``torch.profiler.profile``.
+ # To profile memory, ``profile_memory`` must be set to ``True`` in the arguments of ``torch.profiler.profile``.
#
- # Note: Because of the current non-optimized implementation of PyTorch profiler,
- # enabling ``profile_memory=True`` will take about several minutes to finish.
- # To save time, you can try our existing examples first by running:
+ # You can try it with the existing example hosted on Azure:
#
# ::
#
- #    tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/memory_demo
+ #    pip install azure-storage-blob
+ #    tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/memory_demo_1_10
#
- # The profiler records all memory allocation/release events during profiling.
- # For every specific operator, the plugin aggregates all these memory events inside its life span.
+ # The profiler records all memory allocation/release events and the allocator's internal state during profiling.
+ # The memory view consists of three components, as shown in the following image.
#
# .. image:: ../../_static/img/profiler_memory_view.png
#    :scale: 25 %
#
+ # The components are the memory curve graph, the memory events table and the memory statistics table, from top to bottom respectively.
+ #
# The memory type can be selected in the "Device" selection box.
- # For example, "GPU0" means the following table only shows each operator’s memory usage on GPU 0, not including CPU or other GPUs.
+ # For example, "GPU0" means the following table only shows each operator's memory usage on GPU 0, not including CPU or other GPUs.
+ #
+ # The memory curve shows the trends of memory consumption. The "Allocated" curve shows the total memory that is actually
+ # in use, e.g., tensors. In PyTorch, a caching mechanism is employed in the CUDA allocator and some other allocators. The
+ # "Reserved" curve shows the total memory that is reserved by the allocator. You can left-click and drag on the graph
+ # to select events in the desired range:
+ #
+ # .. image:: ../../_static/img/profiler_memory_curve_selecting.png
+ #    :scale: 25 %
+ #
+ # After selection, the three components will be updated for the restricted time range, so that you can gain more
+ # information about it. By repeating this process, you can zoom in to very fine-grained detail. Right-clicking on the graph
+ # will reset it to the initial state.
+ #
+ # .. image:: ../../_static/img/profiler_memory_curve_single.png
+ #    :scale: 25 %
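The "Allocated" vs. "Reserved" distinction mirrors what PyTorch's CUDA memory APIs report; a small sketch to see the same numbers outside the profiler (assumes a CUDA device is available)::

    import torch

    x = torch.randn(1024, 1024, device='cuda')
    print(torch.cuda.memory_allocated())  # bytes held by live tensors ("Allocated")
    print(torch.cuda.memory_reserved())   # bytes held by the caching allocator ("Reserved")

    del x
    print(torch.cuda.memory_allocated())  # drops, the tensor is gone
    print(torch.cuda.memory_reserved())   # typically unchanged, the freed block stays cached for reuse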
#
- # The "Size Increase" sums up all allocation bytes and minus all the memory release bytes.
+ # In the memory events table, the allocation and release events are paired into one entry. The "operator" column shows
+ # the immediate ATen operator that causes the allocation. Notice that in PyTorch, ATen operators commonly use
+ # ``aten::empty`` to allocate memory. For example, ``aten::ones`` is implemented as ``aten::empty`` followed by an
+ # ``aten::fill_``. Displaying the operator name solely as ``aten::empty`` would be of little help, so it is shown as
+ # ``aten::ones (aten::empty)`` in this special case. The "Allocation Time", "Release Time" and "Duration"
+ # columns' data might be missing if the event occurs outside of the time range.
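A quick way to see this attribution outside of TensorBoard, using only the profiler's own table output (the sort key and row limit here are illustrative choices; available columns vary slightly across PyTorch versions)::

    import torch

    with torch.profiler.profile(profile_memory=True, record_shapes=True) as prof:
        x = torch.ones(1024, 1024)  # dispatched as aten::empty followed by aten::fill_

    # per-operator CPU memory columns, largest self-allocation first
    print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))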
#
- # The "Allocation Size" sums up all allocation bytes without considering the memory release.
+ # In the memory statistics table, the "Size Increase" column sums up all allocation sizes and subtracts all the memory
+ # release sizes, that is, the net increase of memory usage after this operator. The "Self Size Increase" column is
+ # similar to "Size Increase", but it does not count children operators' allocations. As an implementation detail,
+ # some ATen operators might call other operators, so memory allocations can happen at any level of the
+ # call stack. That is, "Self Size Increase" only counts the memory usage increase at the current level of the call stack.
+ # Finally, the "Allocation Size" column sums up all allocations without considering the memory releases.
#
# - Distributed view
# The plugin now supports the distributed view when profiling DDP with NCCL/GLOO as the backend.
@@ -317,6 +345,7 @@ def train(data):
#
# ::
#
+ #    pip install azure-storage-blob
#    tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/distributed_bert
#
# .. image:: ../../_static/img/profiler_distributed_view.png
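For readers who want to produce such a trace themselves instead of loading the hosted demo, a minimal two-process sketch is below; the GLOO backend on CPU, the toy model, the port and the ``./log/ddp_demo`` directory are all illustrative assumptions, not prescribed by the tutorial::

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        model = DDP(nn.Linear(32, 32))
        data = torch.randn(16, 32)
        with torch.profiler.profile(
            on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/ddp_demo'),
        ) as prof:
            for _ in range(5):
                model(data).sum().backward()  # DDP all-reduces gradients in backward
                prof.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)  # each worker writes its own trace file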