We've come up with a training recipe for 2:4 activation sparsity, which is outlined in this paper: https://openreview.net/pdf?id=O5feVk7p6Y
The gist of this approach is that:
- We find high levels of activation sparsity (>85%) when training Squared-ReLU based FFNs instead of SwiGLU FFNs. These Squared-ReLU FFNs show minimal to no accuracy loss.
- We accelerate the sparse activation x dense weight matmul with 2:4 sparsity. For the forward pass we can naively sparsify, dropping values that do not fit the 2:4 constraint. For the backward pass we need some special sauce to maintain accuracy. (A minimal sketch of both pieces follows below.)
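To make the two bullets above concrete, here is a minimal sketch (not the paper's training recipe): a Squared-ReLU FFN and a naive "keep the top-2 of every 4" pruning of its activations. The module and function names are illustrative, not from the paper or an existing library.

```python
import torch
import torch.nn as nn


class SquaredReLUFFN(nn.Module):
    """FFN block using Squared-ReLU (relu(x) ** 2) instead of SwiGLU."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared-ReLU activations are highly sparse (>85% zeros per the paper).
        h = torch.relu(self.w_in(x)) ** 2
        return self.w_out(h)


def naive_24_sparsify(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4,
    zeroing the rest so the tensor satisfies the 2:4 structured constraint."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)
    # Indices of the top-2 magnitudes within each group of 4.
    topk = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(orig_shape)


if __name__ == "__main__":
    ffn = SquaredReLUFFN(d_model=8, d_ff=16)
    x = torch.randn(2, 8)
    h = torch.relu(ffn.w_in(x)) ** 2
    h_24 = naive_24_sparsify(h)
    print("nonzeros per group of 4:", h_24.reshape(-1, 4).ne(0).sum(-1))
```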
However, @janeyx99 pointed out to me that instead of accelerating the model with 2:4 sparsity, we could exploit the high activation sparsity from the first point via activation compression: use something like nvcomp to compress the sparse Squared-ReLU activations.
We should run some tests to see what compression ratio, and thus what memory savings, we can achieve, as well as whether the compression adds overhead we need to account for.
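As a starting point, here is a rough sketch of the kind of measurement we'd want. It uses zlib as a CPU stand-in purely to get a ballpark ratio; the real experiment would compress on-GPU with nvcomp. The ~90% sparsity level and tensor shape are assumptions for illustration.

```python
import time
import zlib

import torch


def fake_squared_relu_activations(shape, sparsity=0.9):
    """Generate activations with roughly the target fraction of exact zeros."""
    x = torch.relu(torch.randn(shape)) ** 2
    # Force the target sparsity by zeroing the smallest values.
    k = int(sparsity * x.numel())
    threshold = x.flatten().kthvalue(k).values
    return torch.where(x <= threshold, torch.zeros_like(x), x)


def compression_stats(x: torch.Tensor, level: int = 1):
    """Return (compression ratio, compression time in seconds) for the raw fp16 bytes."""
    raw = x.to(torch.float16).numpy().tobytes()
    start = time.perf_counter()
    compressed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    return len(raw) / len(compressed), elapsed


if __name__ == "__main__":
    acts = fake_squared_relu_activations((4096, 4096), sparsity=0.9)
    ratio, secs = compression_stats(acts)
    print(f"sparsity: {acts.eq(0).float().mean():.2%}")
    print(f"compression ratio: {ratio:.2f}x, compress time: {secs * 1000:.1f} ms")
```

This only estimates the ratio on real-valued but synthetic activations; the overhead question really needs an on-GPU nvcomp benchmark on activations dumped from an actual Squared-ReLU run, ideally overlapped with compute.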