Releases: tracel-ai/burn
v0.20.1
v0.20.0
Summary
This release marks a major turning point for the ecosystem with the introduction of CubeK. Our goal was to solve a classic challenge in deep learning: achieving peak performance on diverse hardware without maintaining fragmented codebases.
By unifying CPU and GPU kernels through CubeCL, we've managed to squeeze maximum efficiency out of everything from NVIDIA Blackwell GPUs to standard consumer CPUs.
Beyond performance, this release makes the library more robust, flexible, and significantly easier to debug.
This release also features a complete overhaul of the ONNX import system, providing broader support for a wide range of ONNX models. In addition, various bug fixes and new tensor operations enhance stability and usability.
For more details, check out the release post on our website.
Changelog
Breaking
We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.
Training
We refactored burn-train to better support different abstractions and custom training strategies. As part of this,
the LearnerBuilder has been replaced by the LearningParadigm flow:
- let learner = LearnerBuilder::new(ARTIFACT_DIR)
+ let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
.metrics((AccuracyMetric::new(), LossMetric::new()))
.num_epochs(config.num_epochs)
- .learning_strategy(burn::train::LearningStrategy::SingleDevice(device))
- .build(model, config.optimizer.init(), lr_scheduler.init().unwrap());
+ .summary();
- let result = learner.fit(dataloader_train, dataloader_valid);
+ let result = training.launch(Learner::new(
+ model,
+ config.optimizer.init(),
+ lr_scheduler.init().unwrap(),
+ ));
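For reference, the + lines above assemble into the following flow (a sketch only; dataloader_train, dataloader_valid, model, config, and lr_scheduler are assumed to be set up as in a typical Burn training example):
// Configure the supervised training loop, then launch it with a Learner
// that owns the model, optimizer, and LR scheduler.
let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .num_epochs(config.num_epochs)
    .summary();
let result = training.launch(Learner::new(
    model,
    config.optimizer.init(),
    lr_scheduler.init().unwrap(),
));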
Interface Changes
The scatter and select_assign operations now require an IndexingUpdateOp to specify the update behavior.
- let output = tensor.scatter(0, indices, values);
+ let output = tensor.scatter(0, indices, values, IndexingUpdateOp::Add);
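As a minimal sketch, an additive scatter on a rank-1 tensor could look like this (assuming IndexingUpdateOp is exported alongside the tensor types and that Add accumulates values at the given indices, as in the diff above):
// Hypothetical usage; B is any Backend and device is its device.
// Assumed imports: burn::tensor::{Tensor, Int, IndexingUpdateOp}
let base = Tensor::<B, 1>::zeros([4], &device);
let indices = Tensor::<B, 1, Int>::from_ints([0, 2], &device);
let values = Tensor::<B, 1>::from_floats([1.0, 3.0], &device);
let output = base.scatter(0, indices, values, IndexingUpdateOp::Add);
// output: [1.0, 0.0, 3.0, 0.0]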
API calls for slice, slice_assign, and slice_fill no longer require const generics for dimensions, which cleans up the syntax quite a bit:
- let prev_slice = tensor.slice::<[Range<usize>; D]>(slices.try_into().unwrap());
+ let prev_slice = tensor.slice(slices.as_slice());
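A short sketch of the new call style (assuming the ranges are collected in a Vec, as the diff above suggests):
// Range comes from core::ops; the slice length now determines the sliced dims at runtime.
let tensor = Tensor::<B, 2>::zeros([4, 6], &device);
let ranges: Vec<Range<usize>> = vec![1..3, 0..4];
let view = tensor.slice(ranges.as_slice()); // shape [2, 4]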
The grid_sample_2d operation now supports different options.
To preserve the previous behavior, make sure to specify the matching options:
- let output = tensor.grid_sample_2d(grid, InterpolateMode::Bilinear);
+ let options = GridSampleOptions::new(InterpolateMode::Bilinear)
+ .with_padding_mode(GridSamplePaddingMode::Border)
+ .with_align_corners(true);
+ let output = tensor.grid_sample_2d(grid, options);
The QuantStore variants used in QuantScheme have been updated to support a packing dimension.
pub enum QuantStore {
/// Native quantization doesn't require packing and unpacking.
Native,
+ /// Store packed quantized values in a natively supported packing format (i.e. e2m1x2).
+ PackedNative(usize),
/// Store packed quantized values in a 4-byte unsigned integer.
- U32,
+ PackedU32(usize),
}
Finally, Shape no longer implements IntoIterator. If you need to iterate by-value over dimensions, access the dims field directly.
- for s in shape {
+ for s in shape.dims {
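For example, the usual patterns remain available through the field (a sketch, assuming dims is the Vec of dimension sizes):
// Iterate by reference, or consume shape.dims directly as in the diff above.
let shape = tensor.shape();
let num_elements: usize = shape.dims.iter().product();
for (i, size) in shape.dims.iter().enumerate() {
    println!("dim {i}: {size}");
}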
Module & Tensor
- Generalize linalg::outer semantics; add linalg::outer_dim (#3923) @crutcher
- Use square() where appropriate. (#3900) @crutcher
- Add linalg matvec (#3967) @huy209vn
- Add GaussianNoise layer (#4022) @kul-sudo
- Make TransformerEncoderLayer fields public (#4053) @Mnwa
- Workaround MPS embedding allocation error in LibTorch (#4073) @antimora
- Fix Slice operation to handle empty ranges (#4083) @antimora
- Handle empty tensors in cat and slice_assign ops (#4095) @antimora
- [Breaking] Add IndexingUpdateOp to scatter and select_assign (#4070) @laggui
- Add CrossAttention module to burn-nn (#4101) @huy209vn
- Add reflect and edge padding modes to tensor.pad (#4105) @antimora
- Fix GLU and quiet softmax activations (#4121) @laggui
- Add ceil_mode support to pooling operations (MaxPool, AvgPool) (#4112) @antimora
- [Breaking] Remove D2 const generic from slice / SliceArg (#4127) @crutcher
- Add backend supports_dtype (#4155) @laggui
- Fix repeat 0 times (#4216) @laggui
- feat: add hardswish activation (#4209) @mertalev
- Add more trig ops (#4282) @laggui
- Add empty/zeros/ones/full TensorCreationOptions (#4285) @laggui
- feat: nms op (#4246) @mertalev
Datasets & Training
- Refactor metric loggers (#3895 #4017) @Charles23R
- Add support for custom learning strategy (#3921) @Charles23R
- Feat/optim/distributed (#4018) @nathanielsimard
- Refactor MetricEntry (#4031) @Charles23R
- Feature muon (#3925) @NewBornRustacean
- Add warmup epochs to MetricEarlyStoppingStrategy (#4041) @crutcher
- Log running values (#4199) @Charles23R
- Fix checkpoint and summary log level (#4201) @J-F-Liu
- [Breaking] Burn train api refactor (#4223 #4283) @Charles23R
- Fix checkpointer interrupt (#4268) @Charles23R
Backends
- Add candle device seeding (#3959) @laggui
- feat: Enable tuning for MMA matmul (#3961) @wingertge
- feat: TMA autotuning (#3986) @wingertge
- feat: Enable tuning specialized matmul (#4026) @wingertge
- Add CubeCL Flash Attention module (#4089 #4192) @louisfd
- Zero-copy tensor loading for NdArray backend (#4178) @antimora
- feat: Implicit GEMM weight gradients for convolution (#4182) @wingertge
- Perf/reduce cpu + Fix OOB (#4197 #4204) @nathanielsimard
- feat: Accelerated convolution data gradient (#4220) @wingertge
- Remove linux-only constraint for cpu (#4233) @louisfd
- Perf/into contiguous (#4257) @nathanielsimard
- fix: grid sample using excessive memory (#4236 #4242) @mertalev
- Add fast-path for batched vector–matrix matmul (#4300) @louisfd
Bug Fixes
- Fix async barrier & TMA checks (#4007) @nathanielsimard
- Fix fusion reduce local already registered as output (#4014) @laggui
- Fix remainder int (#4015) @laggui
- Fix cuda mem error (#4020) @nathanielsimard
- Cleanup autodiff unused roots (#4039) @laggui
- Fix autotuner (#4049) @nathanielsimard
- Fix scatter values backward (#4064) @khoek
- More correctness fixes in autodiff ops (#4069) @khoek
- Fix transaction read (#4074) @laggui
- Fix tch bf16 kind (#4088 #4142 #4203) @laggui
- Fix cubecl cuda compilation error/typo (#4092) @BjornTheProgrammer
- Fix output dtype for argmin / argmax (#4195) @tzemanovic
- Return slice for each dimension in shape (#4152) @laggui
Documentation & Examples
- Update raspberry pi pico example (#4034 #4132) @BjornTheProgrammer
- Contributor Book: Update the "ONNX to Burn" Page (#4229) @softmaximalist
- docs: add examples for bool tensor operations (#4248) @qburke
- Update the "Adding New Operation" guide in the contributor book (#4284) @softmaximalist
- Refactor dop_timer for multiple trials (for warmup). (#4288) @crutcher
- Added documentation examples for more boolean tensor operations in burn-tensor (#4289) @qburke
Fixes
- Fix book (#3942) @laggui
- remove repetitive words in comment (#4029) @black5box
- Include katex header as symlink (#4118) @laggui
- Fix quantization docs (make it clear that only PTQ is currently supported) (#4316) @laggui
ONNX Support
- ONNX IR and import refactor to better support complex graphs (#3872 #4019 #4033 #4094) @antimora
- Add ONNX control flow operators: If, Loop, and Scan (#3936) @antimora
- Silero VAD ONNX model verification (#3999) @antimora
- Add support for yolo12x model variant (#4048) @antimora
- Remove burn-import abstraction layer and use onnx-ir types directly (#4033) @antimora
- Fix ConstantOfShape output size determination (#4085) @antimora
- Specify output rank in squeeze_dims for type inference (#4086) @antimora
- Fix Expand operation to use ONNX max-semantics (#4082) @antimora
- [Breaking] Add ONNX GridSample op support and tests (#4084) @antimora
- Add RF-DETR model check for burn-import (#4087) @antimora
- Add LSTM operator support with configurable activations (#4106) @antimora
- Add memory-mapped ONNX loading with tensor data ref (#4097) @antimora
- Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) (#4119) @antimora
- Add Reshape scalar optimization and Gather scalar input support (#4146) @antimora
- Update GELU ONNX test to use native op and fix expected values (#4161) @antimora
- Add ONNX CumSum operator support (#4162) @antimora
- Remove global ONNX opset version restriction, recommend opset 16 (#4168) @antimora
- Handle 1D slope when importing prelu from onnx (#4205) @mertalev
- Fix handling scalar scan outputs in ONNX loop nodes (#4210) @antimora
- Add ONNX external data support for models >2GB (#4158) @antimora
- fix: handle negative indices in onnx gather op (#4207) @mertalev
- Split backend tensor ops tests (#4232) @laggui
- Do not use alloc import in burn-import codegen (#4286) @laggui
- Fix ONNX where broadcasted dims (#4315) @laggui
Enhancements
- Feat/pinned memory staging (#4016) @nathanielsimard
- burn-store enhancements for troubleshooting and new enum skip flag (#4051) @antimora
- Feat/runtime error (#4079 #4110) @nathanielsimard
- Perf/improve reduce autotuning + plane non uniform control flow check (#4208) @nathanielsimard
- Packed quantized matmul with QuantStore changes (#4310 #4323) @wingertge
Refactoring
- chore: Update to batch caching PR for cubecl (#3948) @wingertge
- Refactor IR to define outputs as a function of the operation (#3877) ...
v0.20.0-pre.6
What's Changed
- doc warning fix by @crutcher in #4130
- Fix tch bf16 into_data by @laggui in #4142
- Update raspberry-pi-pico example to use the Pico 2, and burnpack by @BjornTheProgrammer in #4132
- Unify all_reduce LocalCollectiveClient operation handling. by @crutcher in #4125
- Add direct tensor snapshot retrieval API to ModuleStore by @antimora in #4131
- Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) by @antimora in #4119
- Add removed docs for tensor equal_elem by @laggui in #4145
- Add ceil_mode support to pooling operations (MaxPool, AvgPool) by @antimora in #4112
- chore: Update cubecl by @wingertge in #4134
- Implement Slice iterator and utility methods. by @crutcher in #4042
- Bump peter-evans/create-pull-request from 7 to 8 by @dependabot[bot] in #4148
- Add slice_dyn, slice_assign_dyn, and slice_fill_dyn variants. by @crutcher in #4127
- Add Reshape scalar optimization and Gather scalar input support by @antimora in #4146
- Shape FromStr/ToString by @crutcher in #4143
- Add contiguous reindexing for non-contiguous layer indices by @antimora in #4150
- Add warmup epochs to MetricEarlyStoppingStrategy. (#3970) by @crutcher in #4041
- fix(onnx): Use activation function for GELU codegen instead of non-existent tensor method by @antimora in #4161
- Refactor more basic ops by @laggui in #4156
- Refactor LocalCollectiveServer for improved clarity and error handling by @crutcher in #4126
- Fix typo in comment for logger_task function by @crutcher in #4159
- Refactor configurable backend tests (no more testgen macros) by @laggui in #4129
- Zero-copy loading for embedded burnpack weights by @antimora in #4154
- Fix candle cuda imports by @laggui in #4171
- Backends no longer depend on burn-tensor, but strictly burn-backend by @laggui in #4169
- Chore/update cubek cubecl by @nathanielsimard in #4172
- Add ONNX CumSum operator support by @antimora in #4162
- Add backend supports_dtype by @laggui in #4155
- Fix attention shapes and out rank by @laggui in #4192
- Fix matmul & reduce execute fuse no autotune by @laggui in #4193
- Fix output dtype for argmin / argmax by @laggui in #4195
- Add flatten_dims method to Shape and refactor tensor flattening API by @crutcher in #4189
- Return slice for each dimension in shape by @laggui in #4152
- Make xtask validate run no-std checks first. by @crutcher in #4198
- Fix: CubeCL Reduce by @nathanielsimard in #4197
- Reorganize and tracing::instrument collective operations. by @crutcher in #4157
- Log running values by @Charles23R in #4199
- Remove global ONNX opset version restriction, recommend opset 16 by @antimora in #4168
- Fix dtype preservation when loading tensors in burn-store by @antimora in #4194
- Fix TchTensor::from_data bf16 by @laggui in #4203
- Perf/reduce cpu + Fix OOB by @nathanielsimard in #4204
- feat: Implicit GEMM weight gradients for convolution by @wingertge in #4182
- Fix checkpoint and summary log level by @J-F-Liu in #4201
- fix: handle 1D slope when importing prelu from onnx by @mertalev in #4205
- Zero-copy tensor loading for NdArray backend by @antimora in #4178
- Fix quantized tensor storage data length calculation by @antimora in #4180
- Fix handling scalar scan outputs in ONNX loop nodes by @antimora in #4210
- Perf/improve reduce autotuning + plane non uniform control flow check by @nathanielsimard in #4208
- Add ONNX external data support for models >2GB by @antimora in #4158
- Update/cubek by @louisfd in #4214
- Refactor: Replace canonicalize_dim with expect_dim by @crutcher in #4196
- fix: handle negative indices in onnx gather op by @mertalev in #4207
- Refactor/cube dim by @nathanielsimard in #4217
- Refactor: Consolidate shape and slice error handling into ExpressionError by @crutcher in #4218
- Update: CubeK by @louisfd in #4222
- feat: Accelerated convolution data gradient by @wingertge in #4220
- Fix repeat 0 times by @laggui in #4216
- Burn train api refactor by @Charles23R in #4223
- Chore/pre release 6 by @nathanielsimard in #4224
v0.20.0-pre.5
What's Changed
- Bump version by @nathanielsimard in #4102
- Handle empty tensors in cat and slice_assign ops by @antimora in #4095
- Add network utilities to burn-std by @laggui in #4104
- Remove RefCell from onnx-ir Arguments by @antimora in #4094
- Fix raspberry pi pico example not compiling by @BjornTheProgrammer in #4034
- Flash Attention module by @louisfd in #4089
- [Breaking] Add IndexingUpdateOp to scatter and select_assign by @laggui in #4070
- Feat/improve errors by @nathanielsimard in #4110
- Add 256-byte tensor alignment to burnpack format for mmap zero-copy support by @antimora in #4100
- Add CrossAttention module to burn-nn by @huy209vn in #4101
- Add reflect and edge padding modes to tensor.pad by @antimora in #4105
- Add LSTM operator support with configurable activations by @antimora in #4106
- Add memory-mapped ONNX loading with lazy tensor data by @antimora in #4097
- Refactor RemoteDevice to use a thread-safe global address registry. by @crutcher in #4113
- Partial cleanup of RemoteSender api. by @crutcher in #4108
- Move backend traits and types to burn-backend by @laggui in #4111
- Fix remote sync error by @laggui in #4117
- Small LSTM clean up of unused variable by @antimora in #4116
- Fix/autotune checks by @nathanielsimard in #4114
- Include katex header as symlink by @laggui in #4118
- chore: Update cubecl by @wingertge in #4120
- Fix GLU and quiet softmax activations by @laggui in #4121
- Migrate ONNX import to burnpack format (removing Record type) by @antimora in #4122
- Combined PRs by @github-actions[bot] in #4140
- Chore/pre release 5 by @nathanielsimard in #4141
v0.20.0-pre.4
What's Changed
- Make TransformerEncoderLayer fields public by @Mnwa in #4053
- Feature muon by @NewBornRustacean in #3925
- Implement FromStr for Slice with parsing and error handling by @crutcher in #3983
- chore: Update to cubecl scalar refactor by @wingertge in #4062
- refactor: cubecl Runtime trait by @wingertge in #4065
- Fix scatter values backward by @khoek in #4064
- Refactor/autotuner by @nathanielsimard in #4068
- Fix MPS "Placeholder storage has not been allocated" error for embedding operations by @antimora in #4073
- Remove burn-import abstraction layer and use onnx-ir types directly by @antimora in #4033
- More correctness fixes in autodiff ops by @khoek in #4069
- Fix transaction read by @laggui in #4074
- Feat/error handling cubecl by @nathanielsimard in #4076
- Move types from burn-tensor by @laggui in #4050
- burn-store enhancements for troubleshooting and new enum skip flag by @antimora in #4051
- Re-enabled no-std support for safetensors store by @antimora in #4071
- Fix tch bf16 kind by @laggui in #4088
- Feat/runtime error by @nathanielsimard in #4079
- Fix ConstantOfShape output size determination by @antimora in #4085
- Fix reduce codegen to use turbofish for squeeze_dims by @antimora in #4086
- Fix Expand operation to use ONNX max-semantics by @antimora in #4082
- Add ONNX GridSample op support and tests by @antimora in #4084
- Fix Slice operation to handle empty ranges by @antimora in #4083
- Add RF-DETR model check for burn-import by @antimora in #4087
- Fix cubecl by @BjornTheProgrammer in #4092
v0.20.0-pre.3
What's Changed
- Node to Enum-based design for type-safe IR by @antimora in #4019
- Ignore number_prefix advisory from tokenizers by @laggui in #4037
- BUG: Fixed burn version by @Marc-AnthonyG in #4035
- Refactor/dtype cubecl by @nathanielsimard in #4032
- Fix parallel spelling error. by @crutcher in #4046
- Refactor MetricEntry by @Charles23R in #4031
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in #4047
- Refactor of burn fusion and burn cubecl fusion by @nathanielsimard in #4044
- update cubecl by @louisfd in #4045
- Cleanup autodiff unused roots by @laggui in #4039
- Fix autotuner by @nathanielsimard in #4049
- Combined PRs by @github-actions[bot] in #4059
- Fix floating point norm test tolerance by @laggui in #4061
- Add support for yolo12x model variant check by @antimora in #4048
- Chore: Prepare pre-release 3 by @nathanielsimard in #4060
v0.20.0-pre.2
What's Changed
- Add ONNX control flow operators: If, Loop, and Scan by @antimora in #3936
- Fix fusion reduce local already registered as output by @laggui in #4014
- Silero VAD ONNX model verification by @antimora in #3999
- Feat/pinned memory staging by @nathanielsimard in #4016
- Refactor metric logger : epoch summary and multiple entries at once by @Charles23R in #4017
- Fix cuda mem error by @nathanielsimard in #4020
- Add GaussianNoise layer by @kul-sudo in #4022
- Fix remainder int by @laggui in #4015
- Feat/optim/distributed by @nathanielsimard in #4018
- Cleanup quantization strategy (CPU ref, ndarray only) by @laggui in #4023
- chore: remove repetitive words in comment by @black5box in #4029
- feat: Enable tuning specialized matmul by @wingertge in #4026
v0.20.0-pre.1
Summary
This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:
- TMA autotuning and MMA matmul tuning enabled for better performance
- ONNX-IR refactored to an op/node-centric architecture
- IR refactored to define outputs as a function of the operation
Bug Fixes:
- Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
- Fixed Linear layer panic when output size is one
- Fixed PyTorch pickle reader regression with integer dict keys
- Fixed RoPE sum_dim calculation
- Fixed tensor *_like dtype preservation
- Fixed squeeze check for D2 > 0
- Fixed QLinear implementation
- Fixed async barrier & TMA checks
New Features:
- Added matvec operation
- Added support for custom learning strategies
- Added Candle device seeding
- Added Shape::ravel_index for row-major raveling
- Generalized linalg::outer semantics with new linalg::outer_dim
- Implemented error handling for DataError
- Added square() optimization where appropriate
v0.19.1
Bug Fixes & Improvements
- Autodiff: fixed RAM memory leak with correct graph cleanup (#3957 #3982) @laggui
- Better memory reuse: improved sliced memory pool implementation (#3941) @nathanielsimard
- Cuda: update cudarc, auto-detect CUDA version and fix some 12.8 features (CubeCL #1008) @wingertge
- Quantized Linear: fixed fusion configuration to fuse more precisions (#3941) @nathanielsimard
- PyTorch import: fixed pickle reader regression with integer dictionary keys (#3978) @laggui
- Docs: switched to doc_cfg to fix docs.rs builds (#3979) @laggui
- Tensor API fixes:
v0.19.0
Summary
This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.
To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.
Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.
Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.
As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.
For more details, check out the release post on our website.
Changelog
Breaking
We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.
Learning Strategy
We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.
let learner = LearnerBuilder::new(artifact_dir)
.metric_train_numeric(AccuracyMetric::new())
.metric_valid_numeric(AccuracyMetric::new())
.metric_train_numeric(LossMetric::new())
.metric_valid_numeric(LossMetric::new())
.with_file_checkpointer(CompactRecorder::new())
- .devices(vec![device.clone()])
+ .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
.num_epochs(config.num_epochs)
.summary()
.build(
config.model.init::<B>(&device),
config.optimizer.init(),
config.learning_rate,
);
Learner Training Result
The Learner previously lacked an evaluation loop. We extended its return type to include all training states in a TrainingResult, which includes the trained model and a metrics renderer.
- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);
- model_trained
+ result
+ .model
.save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
.expect("Trained model should be saved successfully");This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:
let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
.renderer(renderer)
.metrics((AccuracyMetric::new(), LossMetric::new()))
.build(result.model.clone());
evaluator.eval(name, dataloader_test);
Interface Changes
Config
The Config trait now requires Debug:
- #[derive(Config)]
+ #[derive(Config, Debug)]
pub struct TrainingConfig {
// ...
}
BatchNorm
BatchNorm no longer requires the spatial dimension generic:
#[derive(Module, Debug)]
pub struct ConvBlock<B: Backend> {
conv: nn::conv::Conv2d<B>,
- norm: BatchNorm<B, 2>,
+ norm: BatchNorm<B>,
pool: Option<MaxPool2d>,
activation: nn::Relu,
}
Backend::seed
Seeding is now device-specific:
- B::seed(seed);
+ B::seed(&device, seed);
Tensor
For consistency with other methods like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed:
- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)
We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
Finally, we removed tensor ^ T syntax, which was clunky.
- use burn::tensor::T;
- tensor ^ T
+ tensor.t()
tensor.t() is also a simple alias for tensor.transpose().
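For instance (a sketch; B and device assumed in scope):
// t() swaps the last two dimensions, exactly like transpose().
let x = Tensor::<B, 2>::zeros([2, 3], &device);
let y = x.t(); // shape [3, 2]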
Module & Tensor
- Fix unsqueeze rank check (#3429) @laggui
- Feat/quant block (#3442) @laggui
- Kill tensor ^ T magic transpose marker in favor of tensor.t(). (#3452) @crutcher
- ADD GLU activation function (#3444) @bn-c
- Add quantization params precision (#3453) @laggui
- Improve select_assign check (#3483) @laggui
- Add grid_sample function (#3495 #3523 #3522) @Cielbird
- save_tensor_as_image utility (#3520) @Cielbird
- Add affine_grid_2d (#3526) @Cielbird
- ADD missing Debug derive for embedding (#3547) @bn-c
- Dot Product Op (#3537) @kikefdezl
- Lift .full()/.full_like() into base Tensor - support Tensor<B, D, Bool>::full()/full_like(). (#3562) @crutcher
- Make Distribution::Default the Default::default(). (#3582) @crutcher
- Implement int matmul (#3575) @wingertge
- Feat/quant formats (#3613) @laggui
- Switch Tensor::swap_dims/permute to AsIndex dim support. (#3619) @crutcher
- Tensor::flatten() => AsIndex dims support. (#3620) @crutcher
- Remove D param from BatchNorm<B, D>. (#3625) @crutcher
- nn.activation; Activation (#3603 #3693) @crutcher
- Add q4 q2 quantization (#3617) @laggui
- Introduce NormLayer abstraction for unified normalization layers. (#3630) @crutcher
- Add dtype to trait creation ops (#3670) @laggui
- Make Config require Debug (#3689) @crutcher
- Add NormalizationConfig::with_num_features() and related (#3688) @crutcher
- Module quantization w/ tests (#3637) @nathanielsimard
- Add NumPy-like take operation with multi-dimensional index support (#3681) @antimora
- Added trace and diag with batch support for linalg crate (#3703) @niklund
- Add step support to tensor slice operations (#3748) @antimora
- Tensor::unfold(dim, size, step) (#3751 #3782 #3783) @crutcher
- Slice assign with steps (#3776) @antimora
- Add bool_xor operation for boolean tensors (#3785) @crutcher
- [Breaking] Make squeeze/squeeze_dim consistent with other APIs (#3790) @laggui
- Add cross product (#3743) @SinanGncgl
- Enable stepped slicing for slice_fill and complete slice API cleanup (#3784) @antimora
- Tensor::rank() (#3797) @crutcher
- AsIndex dim handling for Numeric ops (#3795) @crutcher
- Add outer and outer_batch ops in linalg (#3786) @huy209vn
- Tensor::_dims() (#3811) @crutcher
- Add tensor.cumsum(dim) first implementation (#3806) @antimora
- slice_fill() should pick a compatible dtype (#3826) @crutcher
- Default LU decomposition implementation (#3816) @DimitriTimoz
- Add tensor.square and fast-path int-power exponents. (#3847) @crutcher
- Add cumulative operations: cumprod, cummin, and cummax (#3819) @antimora
- Add Tensor::sum_dims_squeeze(dims) (#3817) @crutcher
- Allow linear to use quantized matmul (#3913) @wingertge
Datasets & Training
- Pre-Shuffle Multithread DataLoaders on Shuffle (#3390) @crutcher
- PixelDepth + Copy (#3419) @crutcher
- Add Dice-Sorenson Coefficient Metric (#3407) @MathijsdeBoer
- Add SelectionDataset, refactor ShuffledDataset, and add transform tests. (#3406) @crutcher
- Evenly distribute complete chunks/batches across partial dataset splits (#3476) @laggui
- Distributed Data Parallel (#3456) @Cielbird
- Use tensor ops for clip_by_norm (#3485) @laggui
- SamplerDataset distribution fix; constructors and builder. (#3490) @crutcher
- Unify transform usage of RngOptions. (#3577) @crutcher
- Fix bugs with ddp learning (#3581) @Cielbird
- Add support for CIFAR-10 and CIFAR-100 datasets (#3579) @buttfa
- Add with_interrupter for LearnerBuilder (#3611) @amfaber
- Improved Burn Train (#3614 #3935) @nathanielsimard @laggui
- Add 'TextFolderDataset' struct and AgNewsDataset (#3698) @buttfa
- Add PerplexityMetric for language model evaluation (#3707) @TheDarkchip
- Adding CER/WER metrics (#3418) @yazanmashal03
- Fix/autodiff/multi threads (#3793) @nathanielsimard
- Add cautious_weight_decay to AdamW optimizer. (#3869) @crutcher
- Fix evaluator dataloader device (#3893) @laggui
Backends
- Migrate to new cubecl multi tensor handle changes (#3136) @wingertge
- More memory control with scoped static memory management (#3410) @nathanielsimard
- Feat/fusion quant (#3454) @nathanielsimard
- Expose client utilities (#3559) @allenqm
- New CPU backend based on MLIR (#3411) @marcantoinem
- feat: ndarray dynamic tensor types and int tensor cast (#3647) @wingertge
- Implement optimized bool_select for primary backends (#3710) @TheDarkchip
- Add backend level is_nan / is_inf implementations (#3809) @laggui
- Feat/persistent memory (#3842) @nathanielsimard
- feat: add backend implementations for Trunc op (#3860) @mooori
Bug Fixes
- Fix ndarray interpolate coord precision at boundaries (#3481) @laggui
- Fix ndarray conv2d groups channels (#3415) @laggui
- Fix candle mask broadcasting (#3489) @laggui
- Update cubecl: fix wgpu vec to scalar cast (#3496) @Cielbird
- Fix/conv2d groups backward (#3521) @laggui
- Fix/conv3d backward groups (#3533) @laggui
- [Fix] Add some missing handling for flex32 (#3551) @wingertge
- Fix backward scatter dim (#3555) @laggui
- fix: Use correct datatype when filling boolean tensors (#3593) @wingertge
- fix: Ensure output layout is the same for non-inplace SIMD ops in ndarray (#3604) @wingertge
- Fix scalar binop not contiguous (#3636) @laggui
- Fix dtype dispatch in cubecl module ops (#3658) @laggui
- Fix wgpu bool and/or (#3664) @laggui
- Fix tch bool ones and rand int (#3684) @laggui
- fix: Select assign + bool cast (#3730) @wingertge
- Fix register_float_tensor to use the correct dtype (#3774) @A2va
- Fix: autotune errors with fu...