Description
Introduction
This proposal requires a lot of work and will introduce many breaking changes. It should be discussed in detail before it's merged into the master branch. Any suggestions will be appreciated! FYI @martindevans @SignalRT
This proposal is inspired by vllm and already has a prototype implementation in #683. Though it's far from complete, the main ideas are there. If you want to learn more about this proposal, please follow the example in that PR and give it a try. The example does not have a good UI to show the process of parallel inference, but it does execute multiple sequences at the same time. You could set breakpoints in `LLM.RunEngine` to confirm that.
Motivations
At the very early stage of LLamaSharp, the `LLamaModel` class was used to deal with everything about the model, including loading, state, inference and high-level features. After v0.4.1, it was split into `LLamaWeights`, `LLamaExecutor` and `ChatSession`, in which `LLamaExecutor` is the mid-level API to run the model and `ChatSession` is the high-level API.
Though this design once worked well for both developers and users, as time has passed its issues have become increasingly evident. The main problems are described as follows.
- Batched inference is not user-friendly: As you can see in Parallel Inferencing? #623, it requires users to understand how it works and write a lot of code to use it. What's more, even though the low-level API was added nearly half a year ago, few users have actually made use of it! Obviously, we need to provide easy-to-use APIs for it, because batched inference is a huge performance improvement.
- Mid-level and high-level APIs need to be improved: We're currently providing executors as the mid-level APIs. However, though this design works well for chatting (batched inference aside), it does not support text completion very well. As for high-level APIs, I believe we should follow the style of the OpenAI APIs and [semantic-kernel](https://github.com/microsoft/semantic-kernel). However, the current design of the mid-level APIs seems to make this difficult to implement. Related issues: Garbled output from model in Unity #178, SemanticKernel ChatCompletion is Stateless #614, Create HTTP API server and provide API like OAI #269.
- The current abstractions bring some unnecessary difficulties for developers: In my experience of PR review, a few core developers understand the whole design and most of the details of LLamaSharp, while others don't. However, developing mid-level and high-level APIs often requires the developer to understand how LLamaSharp works with llama.cpp, even though some of the logic is not actually related to the llama.cpp backend, for example how to schedule the batched inference, how to sample the logits and how to decide whether inference should stop. We should hide things related to the llama.cpp backend as much as possible in the mid-level APIs, so that it's easier for new contributors to add features or fix bugs in the mid- and high-level APIs. Besides, it will make it easier for us to borrow ideas from other good LLM projects, such as transformers, vllm and ollama.
Design
The full design is shown below.
The llama.cpp backend part is shown below (see #670 for the auto-downloading proposal).
The design is still separated into low-level, mid-level and high-level APIs. However, the low-level part now allows for multiple backends.
Don't get me wrong, I am not going to introduce other backends now (though it's possible). The purpose of this design is to better abstract the llama.cpp-related parts. Thus, mid-level implementations will only need to use a handful of APIs from the llama.cpp model runner, the llama.cpp tokenizer and the llama.cpp kv-cache manager. Some logic, such as scheduling, sampling and stopping, can be independent of the backend part.
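To make that boundary more concrete, here is a minimal sketch of what the backend-facing abstractions could look like. All names and signatures below (`ITokenizer`, `IModelRunner`, `IKvCacheManager`, `BatchedRequest`) are hypothetical illustrations, not part of the existing code base or a final design.

```cs
using System;

// Hypothetical sketch of the backend boundary; every name here is illustrative only.
public interface ITokenizer
{
    int[] Tokenize(string text, bool addBos = true);
    string Detokenize(ReadOnlySpan<int> tokens);
}

public interface IModelRunner
{
    // Runs one decoding step for a whole batch and returns the logits per sequence.
    float[][] Decode(BatchedRequest batch);
}

public interface IKvCacheManager
{
    void RemoveSequence(int sequenceId);
    void CopySequence(int srcSequenceId, int dstSequenceId);
}

// Placeholder for whatever the scheduler hands to the runner.
public class BatchedRequest { }
```

Everything above the dashed line between mid-level and low-level would only talk to these few members, so the scheduling, sampling and stopping logic never touches llama.cpp directly.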
Here is an explanation of the newly introduced components.
- Model Runner: the low-level class. Things related to llama.cpp inference should be put here. It's also possible to make a different runner with another library as the backend, though at least in the near future I don't think I will do that.
- LLM Engine: a mid-level class. It defines how to process the request and generate the response, which is the core of mid-level APIs.
- KvCache Engine: a mid-level class. It defines how to manage the model state (kv-cache).
- Scheduler: a mid-level class. It defines how to schedule the requests and create the batch for inference (a rough sketch of these mid-level abstractions follows this list).
- Sequence: a mid-level class, which is the abstraction of the text-completion request in mid-level APIs.
- Sampling methods: mid-level APIs, responsible for sampling tokens from logits.
- Stopping criteria: mid-level APIs, which define when the sequence generation should stop.
- Server Engine: a high-level class to provide efficient APIs for users to build their LLM server. The key feature of it is continuous batching.
- Text completion / chat session: simple high-level classes based on the LLM engine, which provide easy-to-use APIs and support parallel inference.
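As mentioned above, here is a rough sketch of how the `Sequence`, sampling, stopping and scheduler abstractions could look. All of these names and signatures are hypothetical and only meant to illustrate that these pieces don't need to know anything about llama.cpp itself.

```cs
using System;
using System.Collections.Generic;

// Hypothetical sketch of the mid-level abstractions; names are illustrative, not final APIs.
public class Sequence
{
    public int Id { get; init; }
    public List<int> PromptTokens { get; } = new();
    public List<int> OutputTokens { get; } = new();
    public bool IsFinished { get; set; }
}

public abstract class StoppingCriteria
{
    // Decides whether generation of the given sequence should stop.
    public abstract bool ShouldStop(Sequence sequence);
}

public abstract class SamplingMethod
{
    // Samples the next token for a sequence from the logits produced by the model runner.
    public abstract int Sample(Sequence sequence, ReadOnlySpan<float> logits);
}

public interface IScheduler
{
    // Picks the sequences to run next and packs them into a single batch for the runner.
    // BatchedRequest is the placeholder type sketched in the backend section above.
    BatchedRequest CreateBatch(IEnumerable<Sequence> waitingSequences);
}
```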
Text completion APIs
Here is what the text completion APIs would look like (only the key elements are shown).
```cs
class LLM;

static LLM LLM.WithParams(TextCompletionParams param);
static LLM LLM.FromBackend(LLamaModelParams param);

RequestOutput[] LLM.Generate(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);
AsyncRequestOutput[] LLM.GenerateAsync(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);
```
Using it would look like the following.
```cs
var llm = LLM.WithParams(...).FromBackend(...);
string[] inputs = { "text1...", "text2...", "text3..." };
// The outputs are generated with batched inference.
var outputs = llm.Generate(inputs);
foreach (var output in outputs)
{
    // Deal with the output.
}
```
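For instance, a custom stopping criterion could be plugged into `Generate`. The snippet below is only a sketch and assumes the hypothetical `StoppingCriteria` / `Sequence` shapes sketched earlier in this proposal.

```cs
// Hypothetical example: stop a sequence once it has produced 128 tokens.
class MaxTokensCriteria : StoppingCriteria
{
    private readonly int _maxTokens;
    public MaxTokensCriteria(int maxTokens) => _maxTokens = maxTokens;

    public override bool ShouldStop(Sequence sequence)
        => sequence.OutputTokens.Count >= _maxTokens;
}

var outputs = llm.Generate(inputs, stoppingCriteria: new MaxTokensCriteria(128));
```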
For the server-related APIs, I'll update this proposal after further investigation.
Conclusion
The proposal refactors most of the current mid-level and high-level designs, and breaking changes are its major risk. However, it seems that the current executors could be implemented with the mid-level APIs provided in this proposal. `LLM` is essentially a `StatelessExecutor` with a scheduler and better abstractions. As for `InteractiveExecutor`, it could be implemented with `LLMEngine` + `KvCacheManager`, because LLM chatting can be regarded as text completion with roles plus kv-cache management. In this way, it should be possible for us to make the changes smoothly.
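To make that migration path more concrete, here is a very rough sketch of how an `InteractiveExecutor`-style chat could sit on top of `LLMEngine` and `KvCacheManager`. The `CreateSequence`, `GenerateAsync` and `RemoveSequence` members used below are purely hypothetical; the real engine API is still to be designed.

```cs
using System.Collections.Generic;

// Hypothetical sketch only; the LLMEngine / KvCacheManager APIs are not finalized.
public class InteractiveChat
{
    private readonly LLMEngine _engine;
    private readonly KvCacheManager _kvCache;
    private readonly int _sequenceId;

    public InteractiveChat(LLMEngine engine, KvCacheManager kvCache)
    {
        _engine = engine;
        _kvCache = kvCache;
        // One sequence per session; its kv-cache is kept between turns by the KvCacheManager.
        _sequenceId = _engine.CreateSequence();
    }

    public async IAsyncEnumerable<string> ChatAsync(string userMessage)
    {
        // Chatting is treated as text completion with roles; only the new turn is decoded,
        // the previous context lives in the kv-cache.
        var prompt = $"User: {userMessage}\nAssistant: ";
        await foreach (var token in _engine.GenerateAsync(_sequenceId, prompt))
            yield return token;
    }

    // Start a new conversation by dropping the cached context of this sequence.
    public void Reset() => _kvCache.RemoveSequence(_sequenceId);
}
```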
This will have such a large impact that I won't rush it. I'll leave enough time for the community to discuss it and correct the unreasonable parts. It's completely okay to drop it if most users & developers don't like it.
I would prefer to aim at making LLamaSharp a library for running LLMs efficiently with easy-to-use APIs, instead of a simple wrapper around llama.cpp. That's also why we spent lots of time on performance improvements and dynamic native library selection. If we can agree on this, I believe we'll work it out soon. :)
Again, any suggestions and discussions about this proposal will be appreciated. 🤗