Description
Introduction
This proposal requires a lot of work and will introduce many breaking changes. It should be discussed in detail before it's merged into the master branch. Any suggestions will be appreciated! FYI @martindevans @SignalRT
This proposal is inspired by vllm and already has a prototype implementation in #683. Though it's far from complete, the main ideas are there. If you want to learn more about this proposal, please follow the example in that PR and give it a try. The example does not have a good UI to show the process of parallel inference, but it does execute multiple sequences at the same time. You could set breakpoints in `LLM.RunEngine` to confirm that.
Motivations
At the very early stage of LLamaSharp, the `LLamaModel` class was used to deal with everything about the model, including loading, state, inference and high-level features. After v0.4.1, it was split into `LLamaWeights`, `LLamaExecutor` and `ChatSession`, in which `LLamaExecutor` is the mid-level API to run the model and `ChatSession` is the high-level API.
Though this design once worked well for both developers and users, as time has passed its issues have become increasingly evident. The main problems are described as follows.
- Batched inference is not user-friendly: As you can see in Parallel Inferencing? #623, it requires users to understand how it works and write a lot of code to use it. What's more, even though the low-level API was added nearly half a year ago, few users have actually made use of it! Obviously, we need to provide easy-to-use APIs for it, because batched inference is a huge performance improvement.
- Mid-level and high-level APIs need to be improved: We're currently providing executors as the mid-level APIs. However, though this design works well for chatting (batched inference aside), it does not support text completion very well. As for high-level APIs, I believe we should follow the style of the OpenAI APIs and [semantic-kernel](https://github.com/microsoft/semantic-kernel). However, the current design of the mid-level APIs seems to make this difficult to implement. Related issues: Garbled output from model in Unity #178, SemanticKernel ChatCompletion is Stateless #614, Create HTTP API server and provide API like OAI #269.
- The current abstractions bring some unnecessary difficulties for developers: In my experience of PR review, a few core developers understand the whole design and most of the details of LLamaSharp, while others don't. However, developing mid-level and high-level APIs often requires the developer to understand how LLamaSharp works with llama.cpp, even though some of the logic is not actually related to the llama.cpp backend, for example how to schedule the batched inference, how to sample the logits and how to decide whether inference should stop. We should hide things related to the llama.cpp backend as much as possible in the mid-level APIs, so that it's easier for new contributors to add features or fix bugs in the mid- and high-level APIs. Besides, it will make it easier for us to borrow ideas from other good LLM projects, such as transformers, vllm and ollama.
Design
The full design is shown below.
The llama.cpp backend part is shown below (see #670 for the auto-downloading proposal).
The design is still separated into low-level, mid-level and high-level APIs. However, the low-level part now allows for multiple backends.
Don't get me wrong, I am not going to introduce other backends now (though it's possible). The purpose of this design is to better abstract the llama.cpp-related parts. Thus, mid-level implementations will only need to use a handful of APIs from the llama.cpp model runner, the llama.cpp tokenizer and the llama.cpp kv-cache manager. Some logic, such as scheduling, sampling and stopping, can be independent of the backend part.
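To make that boundary more concrete, here is a minimal sketch of what the backend-facing abstractions could look like. All names and signatures below (`ITokenizer`, `IModelRunner`, `IKvCacheManager`, `BatchedRequest`) are hypothetical illustrations, not part of the existing code base or a final design.

```cs
using System;

// Hypothetical sketch of the backend boundary; every name here is illustrative only.
public interface ITokenizer
{
    int[] Tokenize(string text, bool addBos = true);
    string Detokenize(ReadOnlySpan<int> tokens);
}

public interface IModelRunner
{
    // Runs one decoding step for a whole batch and returns the logits per sequence.
    float[][] Decode(BatchedRequest batch);
}

public interface IKvCacheManager
{
    void RemoveSequence(int sequenceId);
    void CopySequence(int srcSequenceId, int dstSequenceId);
}

// Placeholder for whatever the scheduler hands to the runner.
public class BatchedRequest { }
```

Everything above the dashed line between mid-level and low-level would only talk to these few members, so the scheduling, sampling and stopping logic never touches llama.cpp directly.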
Here is an explanation of the newly introduced components.
- Model Runner: the low-level class. Things related to llama.cpp inference should be put here. It's also possible to make a different runner with another library as the backend, though at least in the near future I don't think I will do that.
- LLM Engine: a mid-level class. It defines how to process the request and generate the response, which is the core of mid-level APIs.
- KvCache Engine: a mid-level class. It defines how to manage the model state (kv-cache).
- Scheduler: a mid-level class. It defines how to schedule the requests and create the batch for inference (a rough sketch of these mid-level abstractions follows this list).
- Sequence: a mid-level class, which is the abstraction of the text-completion request in mid-level APIs.
- Sampling methods: mid-level APIs, responsible for sampling tokens from logits.
- Stopping criteria: mid-level APIs, which define when the sequence generation should stop.
- Server Engine: a high-level class to provide efficient APIs for users to build their LLM server. The key feature of it is continuous batching.
- Text completion / chat session: simple high-level classes based on the LLM engine, which provide easy-to-use APIs and support parallel inference.
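As mentioned above, here is a rough sketch of how the `Sequence`, sampling, stopping and scheduler abstractions could look. All of these names and signatures are hypothetical and only meant to illustrate that these pieces don't need to know anything about llama.cpp itself.

```cs
using System;
using System.Collections.Generic;

// Hypothetical sketch of the mid-level abstractions; names are illustrative, not final APIs.
public class Sequence
{
    public int Id { get; init; }
    public List<int> PromptTokens { get; } = new();
    public List<int> OutputTokens { get; } = new();
    public bool IsFinished { get; set; }
}

public abstract class StoppingCriteria
{
    // Decides whether generation of the given sequence should stop.
    public abstract bool ShouldStop(Sequence sequence);
}

public abstract class SamplingMethod
{
    // Samples the next token for a sequence from the logits produced by the model runner.
    public abstract int Sample(Sequence sequence, ReadOnlySpan<float> logits);
}

public interface IScheduler
{
    // Picks the sequences to run next and packs them into a single batch for the runner.
    // BatchedRequest is the placeholder type sketched in the backend section above.
    BatchedRequest CreateBatch(IEnumerable<Sequence> waitingSequences);
}
```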
Text completion APIs
Here is what the text completion APIs would look like (only the key elements are shown).
```cs
class LLM;

static LLM LLM.WithParams(TextCompletionParams param);
static LLM LLM.FromBackend(LLamaModelParams param);

RequestOutput[] LLM.Generate(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);
AsyncRequestOutput[] LLM.GenerateAsync(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);
```
Using it would look like the following.
```cs
var llm = LLM.WithParams(...).FromBackend(...);
string[] inputs = { "text1...", "text2...", "text3..." };
// The outputs are generated with batched inference.
var outputs = llm.Generate(inputs);
foreach (var output in outputs)
{
    // Deal with the output.
}
```
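For instance, a custom stopping criterion could be plugged into `Generate`. The snippet below is only a sketch and assumes the hypothetical `StoppingCriteria` / `Sequence` shapes sketched earlier in this proposal.

```cs
// Hypothetical example: stop a sequence once it has produced 128 tokens.
class MaxTokensCriteria : StoppingCriteria
{
    private readonly int _maxTokens;
    public MaxTokensCriteria(int maxTokens) => _maxTokens = maxTokens;

    public override bool ShouldStop(Sequence sequence)
        => sequence.OutputTokens.Count >= _maxTokens;
}

var outputs = llm.Generate(inputs, stoppingCriteria: new MaxTokensCriteria(128));
```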
For the server-related APIs, I'll update this proposal after further investigation.
Conclusion
The proposal refactors most of the current mid-level and high-level designs, and breaking changes are its major risk. However, it seems that the current executors could be implemented with the mid-level APIs provided in this proposal. `LLM` is essentially a `StatelessExecutor` with a scheduler and better abstractions. As for `InteractiveExecutor`, it could be implemented with `LLMEngine` + `KvCacheManager`, because LLM chatting can be regarded as text completion with roles plus kv-cache management. In this way, it should be possible for us to make the changes smoothly.
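To make that migration path more concrete, here is a very rough sketch of how an `InteractiveExecutor`-style chat could sit on top of `LLMEngine` and `KvCacheManager`. The `CreateSequence`, `GenerateAsync` and `RemoveSequence` members used below are purely hypothetical; the real engine API is still to be designed.

```cs
using System.Collections.Generic;

// Hypothetical sketch only; the LLMEngine / KvCacheManager APIs are not finalized.
public class InteractiveChat
{
    private readonly LLMEngine _engine;
    private readonly KvCacheManager _kvCache;
    private readonly int _sequenceId;

    public InteractiveChat(LLMEngine engine, KvCacheManager kvCache)
    {
        _engine = engine;
        _kvCache = kvCache;
        // One sequence per session; its kv-cache is kept between turns by the KvCacheManager.
        _sequenceId = _engine.CreateSequence();
    }

    public async IAsyncEnumerable<string> ChatAsync(string userMessage)
    {
        // Chatting is treated as text completion with roles; only the new turn is decoded,
        // the previous context lives in the kv-cache.
        var prompt = $"User: {userMessage}\nAssistant: ";
        await foreach (var token in _engine.GenerateAsync(_sequenceId, prompt))
            yield return token;
    }

    // Start a new conversation by dropping the cached context of this sequence.
    public void Reset() => _kvCache.RemoveSequence(_sequenceId);
}
```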
This will have such a large impact that I won't rush it. I'll leave enough time for the community to discuss it and correct the unreasonable parts. It's completely okay to drop it if most users & developers don't like it.
I would prefer to aim at making LLamaSharp a library for running LLMs efficiently with easy-to-use APIs, instead of a simple wrapper around llama.cpp. That's also why we spent lots of time on performance improvements and dynamic native library selection. If we can agree on this, I believe we'll work it out soon. :)
Again, any suggestions and discussions about this proposal will be appreciated. 🤗