Nabla - High-Performance Scientific Computing

Nabla is a high-performance scientific computing and machine learning framework combining imperative and functional APIs. You can seamlessly drop custom Mojo kernels into the autodiff engine and automatically shard distributed workloads, built on Mojo & MAX.

Active Development: This is the main development branch with distributed SPMD execution and a refined lazy, MAX-native execution model. Read the docs: https://nablaml.com.

Installation

Nabla requires Modular nightly.

python -m venv venv
source venv/bin/activate
pip install --pre --extra-index-url https://whl.modular.com/nightly/simple/ modular nabla-ml

GPU Support:

Linux (AMD/NVIDIA): Supported natively via Modular MAX.
macOS (Apple Silicon): Requires Xcode Metal toolchain (xcode-select --install).

Development Setup

Installation of all dependencies (torch/jax for testing, mypy/black for linting, etc.)

git clone https://github.com/nabla-ml/nabla.git
cd nabla
python -m venv venv
source venv/bin/activate
pip install -r requirements-dev.txt
pip install -e ".[dev]"

Feature Showcase

1. Tensors & Autodiff

Define Python functions and compute gradients using trace-based automatic differentiation. Read more

import nabla

# Use Accelerator (GPU) or CPU for execution
with nabla.default_device(nabla.Accelerator()):
    x = nabla.uniform((4, 8))
    w = nabla.uniform((8, 16))

    # Define loss function
    def compute_loss(x, w):
        return nabla.mean(nabla.relu(x @ w))

    # Compute loss (implicit .realize() on print)
    loss = compute_loss(x, w)
    print("Loss:", loss)

    # Compute gradients via backward replay
    grad_x, grad_w = nabla.grad(compute_loss, argnums=(0, 1))(x, w)
    print("Gradients:", grad_x.shape, grad_w.shape)

2. SPMD Sharding

Shard tensors on a logical mesh; operations automatically propagate sharding constraints. Read more

# Define 2×4 device mesh (Logical DP × TP)
mesh = nabla.DeviceMesh("my_mini_pod", (2, 4), ("dp", "tp"))

# Shard x on 'dp' (rows), w on 'tp' (columns)
x = nabla.shard(nabla.uniform((32, 128)), mesh, nabla.P("dp", None))
w = nabla.shard(nabla.uniform((128, 256)), mesh, nabla.P(None, "tp"))

def compute_loss(x, w):
    return nabla.mean(nabla.relu(x @ w))

# Automatic AllReduce is inserted for 'tp' sum
loss = compute_loss(x, w)
print("Loss (Sharded):", loss)

3. Mojo Integration

Nabla's core strength is its ability to drop down to Mojo for high-performance custom kernels, bridging the gap between high-level Python and bare-metal execution. Read more

Mojo Kernel (kernels/custom_kernel.mojo)

@compiler.register("my_kernel")
struct MyKernel:
    @staticmethod
    def execute[target: StaticString](
        output: OutputTensor,
        x: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ):
        @parameter
        fn add_one[W: Int](idx: IndexList[x.rank]) -> SIMD[x.dtype, W]:
            return x.load[W](idx) + 1

        foreach[add_one, target=target](output, ctx)

Python Usage

class AddOneOp(nabla.UnaryOperation):
    name = "my_kernel"

    def kernel(self, x, **kwargs):
        # Concise invocation: (func_name, path, inputs, out_types)
        return nabla.call_custom_kernel("my_kernel", "./kernels", x, x.type)

x = nabla.Tensor.constant([1., 2., 3.])
y = AddOneOp()(x)

4. Distributed Pipeline Parallelism (GPipe)

Define complex distributed schedules like GPipe using vmap for parallel execution and ppermute for explicit data movement. Read more

# Parallel execution across 'num_stages'
@nabla.vmap(in_axes=(0, 0), spmd_axis_name="stage")
def stage_compute(x, w): 
    return nabla.relu(x @ w)

def pipeline_step(current_state, fresh_input, weights, mask_0):
    # 1. Compute: Run all stages in parallel
    computed = stage_compute(current_state, weights)

    # 2. Communicate: Shift activations to the next stage (i -> i+1)
    shifted = nabla.ppermute(computed, perm=[(i, (i + 1) % stages) for i in range(stages)])

    # 3. Control: Stage 0 takes fresh input; others take shifted data
    return nabla.where(mask_0, fresh_input, shifted)

5. Dynamic Shape Compilation

Compile functions once with symbolic dimensions to handle varying input sizes without recompilation.

# Compile once for ANY batch size (dim 0)
@nabla.compile(dynamic_dims={0: {0: "batch"}})
def square(x):
    return x * x

x_small = nabla.uniform((2, 10))
x_large = nabla.uniform((128, 10))

res1 = square(x_small) # Triggers compilation
res2 = square(x_large) # Reuses compiled graph!

Architecture Overview

Nabla relies on three core principles:

Lazy Execution: Shapes are computed eagerly, but the computation graph is built and compiled only when .realize() is called.
- Read more: Operation Pipeline
Trace-Based Autodiff: Gradients are computed by tracing the forward pass and replaying operations in reverse.
- Read more: Autograd Engine
Factor-Based SPMD: Sharding is propagated using "semantic factors" (e.g., batch, heads) rather than physical mesh axes.
- Read more: Sharding & Solver

Contributing

Bugs/Docs: Submit PR directly.
Features: Open an Issue first.
New Ops: See nabla/ops/README.md.

License

Apache-2.0 — see LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 930 Commits
.github		.github
assets		assets
docs		docs
examples		examples
nabla		nabla
scripts		scripts
tests		tests
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
CITATION.cff		CITATION.cff
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
package-lock.json		package-lock.json
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Nabla - High-Performance Scientific Computing

Installation

Development Setup

Feature Showcase

1. Tensors & Autodiff

2. SPMD Sharding

3. Mojo Integration

4. Distributed Pipeline Parallelism (GPipe)

5. Dynamic Shape Compilation

Architecture Overview

Contributing

License

About

Uh oh!

Releases

Sponsor this project

Uh oh!

Uh oh!

Contributors 1

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Nabla - High-Performance Scientific Computing

Installation

Development Setup

Feature Showcase

1. Tensors & Autodiff

2. SPMD Sharding

3. Mojo Integration

4. Distributed Pipeline Parallelism (GPipe)

5. Dynamic Shape Compilation

Architecture Overview

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Uh oh!

Contributors 1

Languages