1 change: 1 addition & 0 deletions README.md
@@ -213,6 +213,7 @@ We're also fortunate to be integrated into some of the leading open-source libraries including:
4. [TorchTune](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html?highlight=qlora) for our QLoRA and QAT recipes
5. vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html)
6. SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
7. Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)

## Videos
* [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)
15 changes: 12 additions & 3 deletions torchao/quantization/qat/README.md
@@ -115,11 +115,20 @@ To fake quantize embeddings in addition to linear layers, you can additionally call
the following with a filter function during the prepare step:

Contributor comment: There's a section below that says "torchtune integration". Can you add a similar section for "Axolotl Integration" and add a small code snippet there?

```
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    IntXQuantizationAwareTrainingConfig,
)

# first apply the linear transformation to the model as above
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
)

# then apply the weight-only transformation to embedding layers
# (activation fake quantization is not supported for embedding layers)
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
)
```
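As a quick sanity check, the sketch below runs both prepare steps end-to-end on a toy model and prints the resulting module types. The toy model and the `FakeQuantizedEmbedding`/`FakeQuantizedLinear` names in the final comment are illustrative assumptions about the swapped-in modules, not part of the diff above:

```
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    IntXQuantizationAwareTrainingConfig,
)

# toy model with one embedding layer and one linear layer
model = torch.nn.Sequential(
    torch.nn.Embedding(1024, 64),
    torch.nn.Linear(64, 64),
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)

# linear layers: activation + weight fake quantization (default filter targets linear)
quantize_(model, IntXQuantizationAwareTrainingConfig(activation_config, weight_config))

# embedding layers: weight-only fake quantization
quantize_(
    model,
    IntXQuantizationAwareTrainingConfig(weight_config=weight_config),
    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
)

# both layers should now be fake-quantized variants,
# e.g. FakeQuantizedEmbedding and FakeQuantizedLinear
print(model)
```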
