
Add activation sparsity (24 + fp8 dynamic quant) subclass #2213

Open

jcaip wants to merge 27 commits into main

Conversation

@jcaip (Contributor) commented May 15, 2025

Summary:

This PR adds the following:

  • A new subclass + config for activation sparsity, Float8DynamicSemiSparseActivationFloat8WeightConfig, that works with LinearActivationQuantizedTensor (see the usage sketch after this list).
  • Modifies CutlassSemiSparseLayout to support both weight and activation sparsity, and adds a new impl + check.
  • A kernel to accelerate $$xW^T$$ when $x$ is sparse and we are memory bound, adapted from https://github.com/FasterDecoding/TEAL. The idea is that we can avoid loading the columns of $W$ that correspond to the zero elements of $x$, which lets us accelerate activation sparsity for bs=1 decode use cases (see the numerical sketch after this list).
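
As a usage sketch only: this assumes the new config is exposed from torchao.quantization and applied through the existing quantize_ API with default constructor arguments; neither the import path nor the defaults are taken from this PR.

```python
# Sketch only: the import path and default constructor arguments for the new
# config are assumptions, not taken from this PR.
import torch
import torch.nn as nn

from torchao.quantization import quantize_
from torchao.quantization import Float8DynamicSemiSparseActivationFloat8WeightConfig

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# Dynamically quantize activations to fp8 with 2:4 activation sparsity and
# quantize weights to fp8 for every nn.Linear in the model.
quantize_(model, Float8DynamicSemiSparseActivationFloat8WeightConfig())

with torch.no_grad():
    # bs=1 decode-style input, which is where the sparse decode kernel is targeted
    out = model(torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda"))
```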

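To make the selective-loading idea concrete, here is a small plain-PyTorch sketch (float64, no fp8 or 2:4 packing, illustrative shapes and sparsity only) of the identity the kernel exploits: the zero entries of $x$ contribute nothing to $xW^T$, so only the columns of $W$ selected by the nonzero positions of $x$ need to be read.

```python
# Illustrative only: demonstrates the algebraic identity behind the decode kernel,
# not the fp8 / 2:4-packed implementation in this PR.
import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.float64)      # bs=1 activation
W = torch.randn(11008, 4096, dtype=torch.float64)  # weight, (out_features, in_features)

# Zero out ~90% of the activation entries by magnitude (TEAL-style thresholding).
x = x.masked_fill(x.abs() < x.abs().quantile(0.9), 0.0)

dense_out = x @ W.T                     # reference: reads every column of W

nz = x[0].nonzero(as_tuple=True)[0]     # indices of the nonzero activation elements
sparse_out = x[:, nz] @ W[:, nz].T      # reads only the needed columns of W

torch.testing.assert_close(dense_out, sparse_out)
```

Presumably the fused fp8 kernel performs this column selection while loading $W$, rather than materializing the gathered tensors, which is what makes it a memory-bandwidth win at bs=1.
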
Test Plan:
pytest test/sparsity/test_activation.py


pytorch-bot bot commented May 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2213

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit 61aedfd with merge base 1017c7e:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 15, 2025
@jcaip added the topic: new feature label May 22, 2025
@jcaip changed the title from "Add selective weight loading decode kernel for activation sparsity" to "Add activation sparsity (24 + dynamic quant) subclass" May 30, 2025
@jcaip changed the title from "Add activation sparsity (24 + dynamic quant) subclass" to "Add activation sparsity (24 + fp8 dynamic quant) subclass" May 30, 2025
)

aten = torch.ops.aten


def _pad_dense_input(dense_input: torch.Tensor) -> torch.Tensor:
@jerryzh168 (Contributor) commented May 30, 2025


I think ideally we don't do padding at this level, since it interferes with aliasing semantics and makes some op implementations impossible, like slice. Is it possible to move this into the kernel itself, or to improve the kernel so it can handle the inputs without padding?
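
For context, a rough sketch of what a padding helper like this typically does; the multiples-of-8 requirement below is an assumption for illustration, not taken from this PR. Because it allocates and copies into a new tensor, the result no longer aliases the caller's storage, which is the aliasing concern raised above.

```python
# Hypothetical sketch of a padding helper; the row/column multiples are assumed,
# not taken from this PR.
import torch
import torch.nn.functional as F


def _pad_dense_input_sketch(dense_input: torch.Tensor,
                            row_multiple: int = 8,
                            col_multiple: int = 8) -> torch.Tensor:
    """Zero-pad a 2D activation so both dims are multiples the sparse kernel expects."""
    rows, cols = dense_input.shape
    pad_rows = (-rows) % row_multiple
    pad_cols = (-cols) % col_multiple
    # F.pad pads the last dim first: (left, right, top, bottom) for a 2D tensor.
    return F.pad(dense_input, (0, pad_cols, 0, pad_rows))


# e.g. _pad_dense_input_sketch(torch.randn(10, 100)).shape == (16, 104)
```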

@@ -66,11 +97,12 @@ def __new__(
)
kwargs["dtype"] = sparse.dtype
kwargs["requires_grad"] = False
shape = (sparse.shape[0], 2 * sparse.shape[-1])
# shape = (sparse.shape[0], 2 * sparse.shape[-1])

remove?

@@ -80,6 +112,7 @@ def __init__(
self.meta = meta
self.scale = scale
self._layout = _layout
self._shape = shape

Why do we need a separate shape, instead of deriving it from the sparse tensor?


nit: also, the sparse naming might be a bit vague; renaming it to sparse_data or something similar might be better.

from torchao.dtypes.floatx import Float8Layout

res = (
isinstance(input_tensor, AffineQuantizedTensor)

Just FYI, I feel we could simplify this list of checks if we don't need to support a variety of kernels that all use the same subclassed tensor; it only needs to be specific enough for us to dispatch to the correct kernel. Maybe this can be simplified if we split the AQT in the future.
