Releases: tracel-ai/burn
v0.20.1
v0.20.0
Summary
This release marks a major turning point for the ecosystem with the introduction of CubeK. Our goal was to solve a classic challenge in deep learning: achieving peak performance on diverse hardware without maintaining fragmented codebases.
By unifying CPU and GPU kernels through CubeCL, we've managed to squeeze maximum efficiency out of everything from NVIDIA Blackwell GPUs to standard consumer CPUs.
Beyond performance, this release makes the library more robust, flexible, and significantly easier to debug.
This release also features a complete overhaul of the ONNX import system, providing broader support for a wide range of ONNX models. In addition, various bug fixes and new tensor operations enhance stability and usability.
For more details, check out the release post on our website.
Changelog
Breaking
We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.
Training
We refactored burn-train to better support different abstractions and custom training strategies. As part of this,
the LearnerBuilder has been replaced by the LearningParadigm flow:
- let learner = LearnerBuilder::new(ARTIFACT_DIR)
+ let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
.metrics((AccuracyMetric::new(), LossMetric::new()))
.num_epochs(config.num_epochs)
- .learning_strategy(burn::train::LearningStrategy::SingleDevice(device))
- .build(model, config.optimizer.init(), lr_scheduler.init().unwrap());
+ .summary();
- let result = learner.fit(dataloader_train, dataloader_valid);
+ let result = training.launch(Learner::new(
+ model,
+ config.optimizer.init(),
+ lr_scheduler.init().unwrap(),
+ ));
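For reference, the + lines above assemble into the following flow (a sketch only; dataloader_train, dataloader_valid, model, config, and lr_scheduler are assumed to be set up as in a typical Burn training example):
// Configure the supervised training loop, then launch it with a Learner
// that owns the model, optimizer, and LR scheduler.
let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .num_epochs(config.num_epochs)
    .summary();
let result = training.launch(Learner::new(
    model,
    config.optimizer.init(),
    lr_scheduler.init().unwrap(),
));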
Interface Changes
The scatter and select_assign operations now require an IndexingUpdateOp to specify the update behavior.
- let output = tensor.scatter(0, indices, values);
+ let output = tensor.scatter(0, indices, values, IndexingUpdateOp::Add);
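As a minimal sketch, an additive scatter on a rank-1 tensor could look like this (assuming IndexingUpdateOp is exported alongside the tensor types and that Add accumulates values at the given indices, as in the diff above):
// Hypothetical usage; B is any Backend and device is its device.
// Assumed imports: burn::tensor::{Tensor, Int, IndexingUpdateOp}
let base = Tensor::<B, 1>::zeros([4], &device);
let indices = Tensor::<B, 1, Int>::from_ints([0, 2], &device);
let values = Tensor::<B, 1>::from_floats([1.0, 3.0], &device);
let output = base.scatter(0, indices, values, IndexingUpdateOp::Add);
// output: [1.0, 0.0, 3.0, 0.0]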
API calls for slice, slice_assign, and slice_fill no longer require const generics for dimensions, which cleans up the syntax quite a bit:
- let prev_slice = tensor.slice::<[Range<usize>; D]>(slices.try_into().unwrap());
+ let prev_slice = tensor.slice(slices.as_slice());
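A short sketch of the new call style (assuming the ranges are collected in a Vec, as the diff above suggests):
// Range comes from core::ops; the slice length now determines the sliced dims at runtime.
let tensor = Tensor::<B, 2>::zeros([4, 6], &device);
let ranges: Vec<Range<usize>> = vec![1..3, 0..4];
let view = tensor.slice(ranges.as_slice()); // shape [2, 4]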
The grid_sample_2d operation now supports different options.
To preserve the previous behavior, make sure to specify the matching options:
- let output = tensor.grid_sample_2d(grid, InterpolateMode::Bilinear);
+ let options = GridSampleOptions::new(InterpolateMode::Bilinear)
+ .with_padding_mode(GridSamplePaddingMode::Border)
+ .with_align_corners(true);
+ let output = tensor.grid_sample_2d(grid, options);
The QuantStore variants used in QuantScheme have been updated to support a packing dimension.
pub enum QuantStore {
/// Native quantization doesn't require packing and unpacking.
Native,
+ /// Store packed quantized values in a natively supported packing format (i.e. e2m1x2).
+ PackedNative(usize),
/// Store packed quantized values in a 4-byte unsigned integer.
- U32,
+ PackedU32(usize),
}
Finally, Shape no longer implements IntoIterator. If you need to iterate by-value over dimensions, access the dims field directly.
- for s in shape {
+ for s in shape.dims {
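For example, the usual patterns remain available through the field (a sketch, assuming dims is the Vec of dimension sizes):
// Iterate by reference, or consume shape.dims directly as in the diff above.
let shape = tensor.shape();
let num_elements: usize = shape.dims.iter().product();
for (i, size) in shape.dims.iter().enumerate() {
    println!("dim {i}: {size}");
}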
Module & Tensor
- Generalize linalg::outer semantics; add linalg::outer_dim (#3923) @crutcher
- Use square() where appropriate. (#3900) @crutcher
- Add linalg matvec (#3967) @huy209vn
- Add GaussianNoise layer (#4022) @kul-sudo
- Make TransformerEncoderLayer fields public (#4053) @Mnwa
- Workaround MPS embedding allocation error in LibTorch (#4073) @antimora
- Fix Slice operation to handle empty ranges (#4083) @antimora
- Handle empty tensors in cat and slice_assign ops (#4095) @antimora
- [Breaking] Add IndexingUpdateOp to scatter and select_assign (#4070) @laggui
- Add CrossAttention module to burn-nn (#4101) @huy209vn
- Add reflect and edge padding modes to tensor.pad (#4105) @antimora
- Fix GLU and quiet softmax activations (#4121) @laggui
- Add ceil_mode support to pooling operations (MaxPool, AvgPool) (#4112) @antimora
- [Breaking] Remove D2 const generic from slice / SliceArg (#4127) @crutcher
- Add backend supports_dtype (#4155) @laggui
- Fix repeat 0 times (#4216) @laggui
- feat: add hardswish activation (#4209) @mertalev
- Add more trig ops (#4282) @laggui
- Add empty/zeros/ones/full TensorCreationOptions (#4285) @laggui
- feat: nms op (#4246) @mertalev
Datasets & Training
- Refactor metric loggers (#3895 #4017) @Charles23R
- Add support for custom learning strategy (#3921) @Charles23R
- Feat/optim/distributed (#4018) @nathanielsimard
- Refactor MetricEntry (#4031) @Charles23R
- Feature muon (#3925) @NewBornRustacean
- Add warmup epochs to MetricEarlyStoppingStrategy (#4041) @crutcher
- Log running values (#4199) @Charles23R
- Fix checkpoint and summary log level (#4201) @J-F-Liu
- [Breaking] Burn train api refactor (#4223 #4283) @Charles23R
- Fix checkpointer interrupt (#4268) @Charles23R
Backends
- Add candle device seeding (#3959) @laggui
- feat: Enable tuning for MMA matmul (#3961) @wingertge
- feat: TMA autotuning (#3986) @wingertge
- feat: Enable tuning specialized matmul (#4026) @wingertge
- Add CubeCL Flash Attention module (#4089 #4192) @louisfd
- Zero-copy tensor loading for NdArray backend (#4178) @antimora
- feat: Implicit GEMM weight gradients for convolution (#4182) @wingertge
- Perf/reduce cpu + Fix OOB (#4197 #4204) @nathanielsimard
- feat: Accelerated convolution data gradient (#4220) @wingertge
- Remove linux-only constraint for cpu (#4233) @louisfd
- Perf/into contiguous (#4257) @nathanielsimard
- fix: grid sample using excessive memory (#4236 #4242) @mertalev
- Add fast-path for batched vector–matrix matmul (#4300) @louisfd
Bug Fixes
- Fix async barrier & TMA checks (#4007) @nathanielsimard
- Fix fusion reduce local already registered as output (#4014) @laggui
- Fix remainder int (#4015) @laggui
- Fix cuda mem error (#4020) @nathanielsimard
- Cleanup autodiff unused roots (#4039) @laggui
- Fix autotuner (#4049) @nathanielsimard
- Fix scatter values backward (#4064) @khoek
- More correctness fixes in autodiff ops (#4069) @khoek
- Fix transaction read (#4074) @laggui
- Fix tch bf16 kind (#4088 #4142 #4203) @laggui
- Fix cubecl cuda compilation error/typo (#4092) @BjornTheProgrammer
- Fix output dtype for argmin / argmax (#4195) @tzemanovic
- Return slice for each dimension in shape (#4152) @laggui
Documentation & Examples
- Update raspberry pi pico example (#4034 #4132) @BjornTheProgrammer
- Contributor Book: Update the "ONNX to Burn" Page (#4229) @softmaximalist
- docs: add examples for bool tensor operations (#4248) @qburke
- Update the "Adding New Operation" guide in the contributor book (#4284) @softmaximalist
- Refactor dop_timer for multiple trials (for warmup). (#4288) @crutcher
- Added documentation examples for more boolean tensor operations in burn-tensor (#4289) @qburke
Fixes
- Fix book (#3942) @laggui
- remove repetitive words in comment (#4029) @black5box
- Include katex header as symlink (#4118) @laggui
- Fix quantization docs (make it clear that only PTQ is currently supported) (#4316) @laggui
ONNX Support
- ONNX IR and import refactor to better support complex graphs (#3872 #4019 #4033 #4094) @antimora
- Add ONNX control flow operators: If, Loop, and Scan (#3936) @antimora
- Silero VAD ONNX model verification (#3999) @antimora
- Add support for yolo12x model variant (#4048) @antimora
- Remove burn-import abstraction layer and use onnx-ir types directly (#4033) @antimora
- Fix ConstantOfShape output size determination (#4085) @antimora
- Specify output rank in squeeze_dims for type inference (#4086) @antimora
- Fix Expand operation to use ONNX max-semantics (#4082) @antimora
- [Breaking] Add ONNX GridSample op support and tests (#4084) @antimora
- Add RF-DETR model check for burn-import (#4087) @antimora
- Add LSTM operator support with configurable activations (#4106) @antimora
- Add memory-mapped ONNX loading with tensor data ref (#4097) @antimora
- Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) (#4119) @antimora
- Add Reshape scalar optimization and Gather scalar input support (#4146) @antimora
- Update GELU ONNX test to use native op and fix expected values (#4161) @antimora
- Add ONNX CumSum operator support (#4162) @antimora
- Remove global ONNX opset version restriction, recommend opset 16 (#4168) @antimora
- Handle 1D slope when importing prelu from onnx (#4205) @mertalev
- Fix handling scalar scan outputs in ONNX loop nodes (#4210) @antimora
- Add ONNX external data support for models >2GB (#4158) @antimora
- fix: handle negative indices in onnx gather op (#4207) @mertalev
- Split backend tensor ops tests (#4232) @laggui
- Do not use alloc import in burn-import codegen (#4286) @laggui
- Fix ONNX where broadcasted dims (#4315) @laggui
Enhancements
- Feat/pinned memory staging (#4016) @nathanielsimard
- burn-store enhancements for troubleshooting and new enum skip flag (#4051) @antimora
- Feat/runtime error (#4079 #4110) @nathanielsimard
- Perf/improve reduce autotuning + plane non uniform control flow check (#4208) @nathanielsimard
- Packed quantized matmul with QuantStore changes (#4310 #4323) @wingertge
Refactoring
- chore: Update to batch caching PR for cubecl (#3948) @wingertge
- Refactor IR to define outputs as a function of the operation (#3877) ...
v0.20.0-pre.6
What's Changed
- doc warning fix by @crutcher in #4130
- Fix tch bf16 into_data by @laggui in #4142
- Update raspberry-pi-pico example to use the Pico 2, and burnpack by @BjornTheProgrammer in #4132
- Unify all_reduce LocalCollectiveClient operation handling. by @crutcher in #4125
- Add direct tensor snapshot retrieval API to ModuleStore by @antimora in #4131
- Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) by @antimora in #4119
- Add removed docs for tensor equal_elem by @laggui in #4145
- Add ceil_mode support to pooling operations (MaxPool, AvgPool) by @antimora in #4112
- chore: Update cubecl by @wingertge in #4134
- Implement Slice iterator and utility methods. by @crutcher in #4042
- Bump peter-evans/create-pull-request from 7 to 8 by @dependabot[bot] in #4148
- Add slice_dyn, slice_assign_dyn, and slice_fill_dyn variants. by @crutcher in #4127
- Add Reshape scalar optimization and Gather scalar input support by @antimora in #4146
- Shape FromStr/ToString by @crutcher in #4143
- Add contiguous reindexing for non-contiguous layer indices by @antimora in #4150
- Add warmup epochs to MetricEarlyStoppingStrategy. (#3970) by @crutcher in #4041
- fix(onnx): Use activation function for GELU codegen instead of non-existent tensor method by @antimora in #4161
- Refactor more basic ops by @laggui in #4156
- Refactor LocalCollectiveServer for improved clarity and error handling by @crutcher in #4126
- Fix typo in comment for logger_task function by @crutcher in #4159
- Refactor configurable backend tests (no more testgen macros) by @laggui in #4129
- Zero-copy loading for embedded burnpack weights by @antimora in #4154
- Fix candle cuda imports by @laggui in #4171
- Backends no longer depend on burn-tensor, but strictly burn-backend by @laggui in #4169
- Chore/update cubek cubecl by @nathanielsimard in #4172
- Add ONNX CumSum operator support by @antimora in #4162
- Add backend supports_dtype by @laggui in #4155
- Fix attention shapes and out rank by @laggui in #4192
- Fix matmul & reduce execute fuse no autotune by @laggui in #4193
- Fix output dtype for argmin / argmax by @laggui in #4195
- Add flatten_dims method to Shape and refactor tensor flattening API by @crutcher in #4189
- Return slice for each dimension in shape by @laggui in #4152
- Make xtask validate run no-std checks first. by @crutcher in #4198
- Fix: CubeCL Reduce by @nathanielsimard in #4197
- Reorganize and tracing::instrument collective operations. by @crutcher in #4157
- Log running values by @Charles23R in #4199
- Remove global ONNX opset version restriction, recommend opset 16 by @antimora in #4168
- Fix dtype preservation when loading tensors in burn-store by @antimora in #4194
- Fix TchTensor::from_data bf16 by @laggui in #4203
- Perf/reduce cpu + Fix OOB by @nathanielsimard in #4204
- feat: Implicit GEMM weight gradients for convolution by @wingertge in #4182
- Fix checkpoint and summary log level by @J-F-Liu in #4201
- fix: handle 1D slope when importing prelu from onnx by @mertalev in #4205
- Zero-copy tensor loading for NdArray backend by @antimora in #4178
- Fix quantized tensor storage data length calculation by @antimora in #4180
- Fix handling scalar scan outputs in ONNX loop nodes by @antimora in #4210
- Perf/improve reduce autotuning + plane non uniform control flow check by @nathanielsimard in #4208
- Add ONNX external data support for models >2GB by @antimora in #4158
- Update/cubek by @louisfd in #4214
- Refactor: Replace canonicalize_dim with expect_dim by @crutcher in #4196
- fix: handle negative indices in onnx gather op by @mertalev in #4207
- Refactor/cube dim by @nathanielsimard in #4217
- Refactor: Consolidate shape and slice error handling into ExpressionError by @crutcher in #4218
- Update: CubeK by @louisfd in #4222
- feat: Accelerated convolution data gradient by @wingertge in #4220
- Fix repeat 0 times by @laggui in #4216
- Burn train api refactor by @Charles23R in #4223
- Chore/pre release 6 by @nathanielsimard in #4224
v0.20.0-pre.5
What's Changed
- Bump version by @nathanielsimard in #4102
- Handle empty tensors in cat and slice_assign ops by @antimora in #4095
- Add network utilities to burn-std by @laggui in #4104
- Remove RefCell from onnx-ir Arguments by @antimora in #4094
- Fix raspberry pi pico example not compiling by @BjornTheProgrammer in #4034
- Flash Attention module by @louisfd in #4089
- [Breaking] Add IndexingUpdateOp to scatter and select_assign by @laggui in #4070
- Feat/improve errors by @nathanielsimard in #4110
- Add 256-byte tensor alignment to burnpack format for mmap zero-copy support by @antimora in #4100
- Add CrossAttention module to burn-nn by @huy209vn in #4101
- Add reflect and edge padding modes to tensor.pad by @antimora in #4105
- Add LSTM operator support with configurable activations by @antimora in #4106
- Add memory-mapped ONNX loading with lazy tensor data by @antimora in #4097
- Refactor RemoteDevice to use a thread-safe global address registry. by @crutcher in #4113
- Partial cleanup of RemoteSender api. by @crutcher in #4108
- Move backend traits and types to burn-backend by @laggui in #4111
- Fix remote sync error by @laggui in #4117
- Small LSTM clean up of unused variable by @antimora in #4116
- Fix/autotune checks by @nathanielsimard in #4114
- Include katex header as symlink by @laggui in #4118
- chore: Update cubecl by @wingertge in #4120
- Fix GLU and quiet softmax activations by @laggui in #4121
- Migrate ONNX import to burnpack format (removing Record type) by @antimora in #4122
- Combined PRs by @github-actions[bot] in #4140
- Chore/pre release 5 by @nathanielsimard in #4141
v0.20.0-pre.4
What's Changed
- Make TransformerEncoderLayer fields public by @Mnwa in #4053
- Feature muon by @NewBornRustacean in #3925
- Implement FromStr for Slice with parsing and error handling by @crutcher in #3983
- chore: Update to cubecl scalar refactor by @wingertge in #4062
- refactor: cubecl Runtime trait by @wingertge in #4065
- Fix scatter values backward by @khoek in #4064
- Refactor/autotuner by @nathanielsimard in #4068
- Fix MPS "Placeholder storage has not been allocated" error for embedding operations by @antimora in #4073
- Remove burn-import abstraction layer and use onnx-ir types directly by @antimora in #4033
- More correctness fixes in autodiff ops by @khoek in #4069
- Fix transaction read by @laggui in #4074
- Feat/error handling cubecl by @nathanielsimard in #4076
- Move types from burn-tensor by @laggui in #4050
- burn-store enhancements for troubleshooting and new enum skip flag by @antimora in #4051
- Re-enabled no-std support for safetensors store by @antimora in #4071
- Fix tch bf16 kind by @laggui in #4088
- Feat/runtime error by @nathanielsimard in #4079
- Fix ConstantOfShape output size determination by @antimora in #4085
- Fix reduce codegen to use turbofish for squeeze_dims by @antimora in #4086
- Fix Expand operation to use ONNX max-semantics by @antimora in #4082
- Add ONNX GridSample op support and tests by @antimora in #4084
- Fix Slice operation to handle empty ranges by @antimora in #4083
- Add RF-DETR model check for burn-import by @antimora in #4087
- Fix cubecl by @BjornTheProgrammer in #4092
v0.20.0-pre.3
What's Changed
- Node to Enum-based design for type-safe IR by @antimora in #4019
- Ignore number_prefix advisory from tokenizers by @laggui in #4037
- BUG: Fixed burn version by @Marc-AnthonyG in #4035
- Refactor/dtype cubecl by @nathanielsimard in #4032
- Fix parallel spelling error. by @crutcher in #4046
- Refactor MetricEntry by @Charles23R in #4031
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in #4047
- Refactor of burn fusion and burn cubecl fusion by @nathanielsimard in #4044
- update cubecl by @louisfd in #4045
- Cleanup autodiff unused roots by @laggui in #4039
- Fix autotuner by @nathanielsimard in #4049
- Combined PRs by @github-actions[bot] in #4059
- Fix floating point norm test tolerance by @laggui in #4061
- Add support for yolo12x model variant check by @antimora in #4048
- Chore: Prepare pre-release 3 by @nathanielsimard in #4060
v0.20.0-pre.2
What's Changed
- Add ONNX control flow operators: If, Loop, and Scan by @antimora in #3936
- Fix fusion reduce local already registered as output by @laggui in #4014
- Silero VAD ONNX model verification by @antimora in #3999
- Feat/pinned memory staging by @nathanielsimard in #4016
- Refactor metric logger : epoch summary and multiple entries at once by @Charles23R in #4017
- Fix cuda mem error by @nathanielsimard in #4020
- Add GaussianNoise layer by @kul-sudo in #4022
- Fix remainder int by @laggui in #4015
- Feat/optim/distributed by @nathanielsimard in #4018
- Cleanup quantization strategy (CPU ref, ndarray only) by @laggui in #4023
- chore: remove repetitive words in comment by @black5box in #4029
- feat: Enable tuning specialized matmul by @wingertge in #4026
v0.20.0-pre.1
Summary
This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:
- TMA autotuning and MMA matmul tuning enabled for better performance
- ONNX-IR refactored to an op/node-centric architecture
- IR refactored to define outputs as a function of the operation
Bug Fixes:
- Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
- Fixed Linear layer panic when output size is one
- Fixed PyTorch pickle reader regression with integer dict keys
- Fixed RoPE sum_dim calculation
- Fixed tensor *_like dtype preservation
- Fixed squeeze check for D2 > 0
- Fixed QLinear implementation
- Fixed async barrier & TMA checks
New Features:
- Added matvec operation
- Added support for custom learning strategies
- Added Candle device seeding
- Added Shape::ravel_index for row-major raveling
- Generalized linalg::outer semantics with new linalg::outer_dim
- Implemented error handling for DataError
- Added square() optimization where appropriate
v0.19.1
Bug Fixes & Improvements
- Autodiff: fixed RAM memory leak with correct graph cleanup (#3957 #3982) @laggui
- Better memory reuse: improved sliced memory pool implementation (#3941) @nathanielsimard
- Cuda: update cudarc, auto-detect CUDA version and fix some 12.8 features (CubeCL #1008) @wingertge
- Quantized Linear: fixed fusion configuration to fuse more precisions (#3941) @nathanielsimard
- PyTorch import: fixed pickle reader regression with integer dictionary keys (#3978) @laggui
- Docs: switched to doc_cfg to fix docs.rs builds (#3979) @laggui
- Tensor API fixes:
v0.19.0
Summary
This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.
To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.
Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.
Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.
As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.
For more details, check out the release post on our website.
Changelog
Breaking
We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.
Learning Strategy
We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.
let learner = LearnerBuilder::new(artifact_dir)
.metric_train_numeric(AccuracyMetric::new())
.metric_valid_numeric(AccuracyMetric::new())
.metric_train_numeric(LossMetric::new())
.metric_valid_numeric(LossMetric::new())
.with_file_checkpointer(CompactRecorder::new())
- .devices(vec![device.clone()])
+ .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
.num_epochs(config.num_epochs)
.summary()
.build(
config.model.init::<B>(&device),
config.optimizer.init(),
config.learning_rate,
);
Learner Training Result
The Learner previously lacked an evaluation loop. We extended its return type to include all training states in a TrainingResult, which includes the trained model and a metrics renderer.
- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);
- model_trained
+ result
+ .model
.save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
.expect("Trained model should be saved successfully");This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:
let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
.renderer(renderer)
.metrics((AccuracyMetric::new(), LossMetric::new()))
.build(result.model.clone());
evaluator.eval(name, dataloader_test);
Interface Changes
Config
The Config trait now requires Debug:
- #[derive(Config)]
+ #[derive(Config, Debug)]
pub struct TrainingConfig {
// ...
}
BatchNorm
BatchNorm no longer requires the spatial dimension generic:
#[derive(Module, Debug)]
pub struct ConvBlock<B: Backend> {
conv: nn::conv::Conv2d<B>,
- norm: BatchNorm<B, 2>,
+ norm: BatchNorm<B>,
pool: Option<MaxPool2d>,
activation: nn::Relu,
}
Backend::seed
Seeding is now device-specific:
- B::seed(seed);
+ B::seed(&device, seed);
Tensor
For consistency with other methods like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed:
- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)
We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
Finally, we removed tensor ^ T syntax, which was clunky.
- use burn::tensor::T;
- tensor ^ T
+ tensor.t()
tensor.t() is also a simple alias for tensor.transpose().
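For instance (a sketch; B and device assumed in scope):
// t() swaps the last two dimensions, exactly like transpose().
let x = Tensor::<B, 2>::zeros([2, 3], &device);
let y = x.t(); // shape [3, 2]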
Module & Tensor
- Fix unsqueeze rank check (#3429) @laggui
- Feat/quant block (#3442) @laggui
- Kill tensor ^ T magic transpose marker in favor of tensor.t(). (#3452) @crutcher
- ADD GLU activation function (#3444) @bn-c
- Add quantization params precision (#3453) @laggui
- Improve select_assign check (#3483) @laggui
- Add grid_sample function (#3495 #3523 #3522) @Cielbird
- save_tensor_as_image utility (#3520) @Cielbird
- Add affine_grid_2d (#3526) @Cielbird
- ADD missing Debug derive for embedding (#3547) @bn-c
- Dot Product Op (#3537) @kikefdezl
- Lift .full()/.full_like() into base Tensor - support Tensor<B, D, Bool>::full()/full_like(). (#3562) @crutcher
- Make Distribution::Default the Default::default(). (#3582) @crutcher
- Implement int matmul (#3575) @wingertge
- Feat/quant formats (#3613) @laggui
- Switch Tensor::swap_dims/permute to AsIndex dim support. (#3619) @crutcher
- Tensor::flatten() => AsIndex dims support. (#3620) @crutcher
- Remove D param from BatchNorm<B, D>. (#3625) @crutcher
- nn.activation; Activation (#3603 #3693) @crutcher
- Add q4 q2 quantization (#3617) @laggui
- Introduce NormLayer abstraction for unified normalization layers. (#3630) @crutcher
- Add dtype to trait creation ops (#3670) @laggui
- Make Config require Debug (#3689) @crutcher
- Add NormalizationConfig::with_num_features() and related (#3688) @crutcher
- Module quantization w/ tests (#3637) @nathanielsimard
- Add NumPy-like take operation with multi-dimensional index support (#3681) @antimora
- Added trace and diag with batch support for linalg crate (#3703) @niklund
- Add step support to tensor slice operations (#3748) @antimora
- Tensor::unfold(dim, size, step) (#3751 #3782 #3783) @crutcher
- Slice assign with steps (#3776) @antimora
- Add bool_xor operation for boolean tensors (#3785) @crutcher
- [Breaking] Make squeeze/squeeze_dim consistent with other APIs (#3790) @laggui
- Add cross product (#3743) @SinanGncgl
- Enable stepped slicing for slice_fill and complete slice API cleanup (#3784) @antimora
- Tensor::rank() (#3797) @crutcher
- AsIndex dim handling for Numeric ops (#3795) @crutcher
- Add outer and outer_batch ops in linalg (#3786) @huy209vn
- Tensor::_dims() (#3811) @crutcher
- Add tensor.cumsum(dim) first implementation (#3806) @antimora
- slice_fill() should pick a compatible dtype (#3826) @crutcher
- Default LU decomposition implementation (#3816) @DimitriTimoz
- Add tensor.square and fast-path int-power exponents. (#3847) @crutcher
- Add cumulative operations: cumprod, cummin, and cummax (#3819) @antimora
- Add Tensor::sum_dims_squeeze(dims) (#3817) @crutcher
- Allow linear to use quantized matmul (#3913) @wingertge
Datasets & Training
- Pre-Shuffle Multithread DataLoaders on Shuffle (#3390) @crutcher
- PixelDepth + Copy (#3419) @crutcher
- Add Dice-Sorenson Coefficient Metric (#3407) @MathijsdeBoer
- Add SelectionDataset, refactor ShuffledDataset, and add transform tests. (#3406) @crutcher
- Evenly distribute complete chunks/batches across partial dataset splits (#3476) @laggui
- Distributed Data Parallel (#3456) @Cielbird
- Use tensor ops for clip_by_norm (#3485) @laggui
- SamplerDataset distribution fix; constructors and builder. (#3490) @crutcher
- Unify transform usage of RngOptions. (#3577) @crutcher
- Fix bugs with ddp learning (#3581) @Cielbird
- Add support for CIFAR-10 and CIFAR-100 datasets (#3579) @buttfa
- Add with_interrupter for LearnerBuilder (#3611) @amfaber
- Improved Burn Train (#3614 #3935) @nathanielsimard @laggui
- Add 'TextFolderDataset' struct and AgNewsDataset (#3698) @buttfa
- Add PerplexityMetric for language model evaluation (#3707) @TheDarkchip
- Adding CER/WER metrics (#3418) @yazanmashal03
- Fix/autodiff/multi threads (#3793) @nathanielsimard
- Add cautious_weight_decay to AdamW optimizer. (#3869) @crutcher
- Fix evaluator dataloader device (#3893) @laggui
Backends
- Migrate to new cubecl multi tensor handle changes (#3136) @wingertge
- More memory control with scoped static memory management (#3410) @nathanielsimard
- Feat/fusion quant (#3454) @nathanielsimard
- Expose client utilities (#3559) @allenqm
- New CPU backend based on MLIR (#3411) @marcantoinem
- feat: ndarray dynamic tensor types and int tensor cast (#3647) @wingertge
- Implement optimized bool_select for primary backends (#3710) @TheDarkchip
- Add backend level is_nan / is_inf implementations (#3809) @laggui
- Feat/persistent memory (#3842) @nathanielsimard
- feat: add backend implementations for Trunc op (#3860) @mooori
Bug Fixes
- Fix ndarray interpolate coord precision at boundaries (#3481) @laggui
- Fix ndarray conv2d groups channels (#3415) @laggui
- Fix candle mask broadcasting (#3489) @laggui
- Update cubecl: fix wgpu vec to scalar cast (#3496) @Cielbird
- Fix/conv2d groups backward (#3521) @laggui
- Fix/conv3d backward groups (#3533) @laggui
- [Fix] Add some missing handling for flex32 (#3551) @wingertge
- Fix backward scatter dim (#3555) @laggui
- fix: Use correct datatype when filling boolean tensors (#3593) @wingertge
- fix: Ensure output layout is the same for non-inplace SIMD ops in ndarray (#3604) @wingertge
- Fix scalar binop not contiguous (#3636) @laggui
- Fix dtype dispatch in cubecl module ops (#3658) @laggui
- Fix wgpu bool and/or (#3664) @laggui
- Fix tch bool ones and rand int (#3684) @laggui
- fix: Select assign + bool cast (#3730) @wingertge
- Fix register_float_tensor to use the correct dtype (#3774) @A2va
- Fix: autotune errors with fu...