Chinese version: README_CN
ncnn_llm provides Large Language Model (LLM) support for the ncnn framework.
ncnn is a high-performance neural network inference framework specifically optimized for mobile and embedded devices. By integrating LLMs into ncnn, this project enables the execution of complex natural language processing tasks in resource-constrained environments (edge devices, mobile phones, IoT).
This project originated from nihui's implementation of the kvcache feature for ncnn, which opened the door for running LLMs on the framework. Motivated by the spirit of open-source contribution, this repository organizes and expands upon that functionality into an independent project.
The goal is to provide a complete pipeline, making it easier for developers to use LLMs on ncnn and contribute to the ecosystem.
⚠️ Important Note: ncnn's support for `kvcache` is currently in an experimental stage. You must compile ncnn from the `master` branch to ensure you have the latest features required for this project to run.
The project is currently in active development. Below is the current compatibility status of various models.
These models run smoothly with the implemented tokenizer and inference pipeline.
- MiniCPM4-0.5B
- Qwen3 (0.6B)
- Qwen2.5-VL
- NLLB (No Language Left Behind)
These models can be loaded and run, but may experience bugs or suboptimal performance.
- Hunyuan 0.5B
These models should work in principle but currently fail or remain unverified.
- Qwen3-VL-2B-Instruct
- TinyLlama-1.1B-Chat-v1.0
- Qwen2.5-0.5B
- Llama-3.2-1B-Instruct
- DeepSeek-R1-Distill-Qwen-1.5B
- Hunyuan OCR
- PaddleOCR-VL
This project uses xmake for building.
git clone https://github.com/futz12/ncnn_llm.git
cd ncnn_llm
xmake build
Ensure you have downloaded the model weights (see below) before running.
xmake run minicpm4_main
Chat with MiniCPM4-0.5B! Type 'exit' or 'quit' to end the conversation.
User: Hello
Assistant:
Hello, I am your intelligent assistant. I can help you check the weather, news, music, translation, etc. Is there anything you need help with?
User: Do you know what OpenCV is?
Assistant: OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. It contains many algorithms and tools for image and video processing...
`llm_ncnn_run` is a unified example that supports:
- Two modes: CLI chat (`--mode cli`) and OpenAI-style HTTP server (`--mode openai`)
- Built-in tools (random/add) plus external MCP tools
- Automatic model download from https://mirrors.sdu.edu.cn/ncnn_modelzoo/ (by parsing `model.json`)
xmake build llm_ncnn_run
xmake run llm_ncnn_run --mode cli --model qwen2.5_vl_3b
Notes:
- If `--model` is a bare name (no path separators), the model is downloaded to `./assets/<name>`.
- You can also pass an explicit path: `--model ./assets/qwen3_0.6b`.
xmake run llm_ncnn_run --mode openai --port 8080 --model qwen3_0.6b
Endpoints:
- http://localhost:8080/ (web chat)
- http://localhost:8080/v1/chat/completions (OpenAI-style API)
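The `/v1/chat/completions` endpoint follows the OpenAI chat-completions request shape. A minimal sketch of building such a request in Python (the model name and server URL here are assumptions taken from the commands above; adjust to your setup):

```python
import json
import urllib.request

# OpenAI-style chat-completions request body. The "model" value is
# assumed to match the name passed to --model when starting the server.
payload = {
    "model": "qwen3_0.6b",
    "messages": [
        {"role": "user", "content": "Hello, what can you do?"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, the response follows the usual OpenAI schema:
#   reply = json.load(urllib.request.urlopen(req))
#   print(reply["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client library should also work by pointing its base URL at `http://localhost:8080/v1`.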
CLI mode:
xmake run llm_ncnn_run --mode cli --mcp-server "./my_mcp_server --flag"
OpenAI mode:
xmake run llm_ncnn_run --mode openai --port 8080 --mcp-server "./my_mcp_server --flag"
Common MCP flags: `--mcp-transport lsp|jsonl`, `--mcp-debug`, `--mcp-timeout-ms <n>`
If the model download fails due to TLS certificate errors, set the CA file path:
NCNN_LLM_CA_FILE=/etc/ssl/certs/ca-certificates.crt \
xmake run llm_ncnn_run --mode openai --model qwen2.5_vl_3b
You can download the converted ncnn-compatible model weights from the mirror: https://mirrors.sdu.edu.cn/ncnn_modelzoo/
We are committed to improving ncnn_llm. Our future plans include:
- Upstream Optimization: Submitting optimization patches directly to the upstream ncnn repository to improve core LLM support.
- Expanded Support: Adding support for more model architectures and tokenizers.
- Performance: Optimizing inference speed and reducing memory footprint.
- INT8 Quantization: Implementing INT8 quantization support.
- Documentation: Improving the export pipeline docs and adding more C++ usage examples.
Note: While we provide a complete export pipeline, older pipelines may become obsolete as the library evolves. Please refer to the latest example code for adjustments.
We welcome everyone to follow and contribute to this project, helping to advance ncnn's support for Large Language Models!
- QQ Group: 767178345
Apache 2.0