Support for Activated LoRA (https://github.com/ggml-org/llama.cpp/issues/15212) #15213
kgreenewald started this conversation in Ideas
Apologies if this is slightly out of order - I have created issue #15212 requesting support for Activated LoRA (aLoRA) adapters (see the issue for details and motivation). These adapters are triggered by including an invocation sequence in the prompt, and they modify the weights only for tokens that appear after that sequence. As a result, the adapter can reuse the base model's KV cache for everything before the invocation, leading to large improvements in time-to-first-token (TTFT) compared to hot-swapping standard LoRA adapters, especially when the adapter is applied deep into a multi-turn interaction. Appreciate any feedback or thoughts on this!
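To make the activation mechanism concrete, here is a minimal sketch of the per-token gating logic under my understanding of the design. The function names and the per-token scale representation are purely illustrative and are not part of llama.cpp's existing LoRA code:

```python
# Illustrative sketch only: find where the invocation sequence ends and
# gate the LoRA delta per token. Everything before that point uses the
# base weights, so the base model's KV cache entries stay valid.

from typing import List, Optional

def find_activation_index(tokens: List[int], invocation: List[int]) -> Optional[int]:
    """Return the index of the first token *after* the last occurrence
    of the invocation sequence, or None if it never appears."""
    n, m = len(tokens), len(invocation)
    for start in range(n - m, -1, -1):  # scan from the end of the prompt
        if tokens[start:start + m] == invocation:
            return start + m            # adapter active from here onward
    return None

def lora_scale_per_token(tokens: List[int], invocation: List[int]) -> List[float]:
    """0.0 = base weights (KV cache reusable), 1.0 = adapter applied."""
    idx = find_activation_index(tokens, invocation)
    if idx is None:
        return [0.0] * len(tokens)
    return [0.0] * idx + [1.0] * (len(tokens) - idx)

if __name__ == "__main__":
    prompt = [1, 5, 9, 42, 7, 7, 3]  # toy token ids
    invocation = [42, 7]             # toy invocation sequence
    print(lora_scale_per_token(prompt, invocation))
    # -> [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
    # Only the last two tokens see the adapted weights; the KV cache
    # computed by the base model for the earlier tokens can be reused.
```

The key point is that the gate depends only on token position relative to the invocation sequence, which is why prior turns never need to be re-prefilled when the adapter activates.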
Our plan would be to start this integration work ourselves and submit a PR for this feature in the near future, building on the existing support for hot-swapping LoRA adapters.
This complements existing PRs to both Hugging Face PEFT (huggingface/peft#2609) and vLLM (vllm-project/vllm#19710).
cc @gabe-l-hart