An Unreal plugin for llama.cpp to support embedding local LLMs in your projects.
This fork is a modern re-write of the upstream plugin to support the latest llama.cpp API, including: GPU layers, advanced sampling (MinP, Mirostat, etc.), Jinja templates, chat history, partial rollback & context reset, regeneration, and more. It defaults to a Vulkan build on Windows for wider hardware support, at roughly a 10% loss in token generation speed compared to the CUDA backend.
- Download the latest release. Make sure to use the `Llama-Unreal-UEx.x-vx.x.x.7z` link, which contains compiled binaries, not the Source Code (zip) link.
- Create a new Unreal project or choose an existing one.
- Browse to your project folder (project root).
- Copy the `Plugins` folder from the .7z release into your project root.
- The plugin should now be ready to use.
Everything is wrapped inside a `ULlamaComponent` or `ULlamaSubsystem`, which interface with the llama.cpp code internally in a thread-safe manner via `FLlamaNative`. All core functionality is available both in C++ and in Blueprint.
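As a point of reference, here is a minimal C++ sketch of attaching the component to an actor. Only `ULlamaComponent` comes from the plugin; the actor class `AMyNPC`, the property name, and the include path are illustrative assumptions:

```cpp
// MyNPC.h - hypothetical actor owning a ULlamaComponent; everything except
// ULlamaComponent itself (class name, property name, include path) is an assumption.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "LlamaComponent.h"
#include "MyNPC.generated.h"

UCLASS()
class AMyNPC : public AActor
{
    GENERATED_BODY()

public:
    AMyNPC()
    {
        // Created as a default subobject so it shows up in the details panel
        // and is usable from Blueprint as well as C++.
        LlamaComponent = CreateDefaultSubobject<ULlamaComponent>(TEXT("LlamaComponent"));
    }

    UPROPERTY(VisibleAnywhere, BlueprintReadOnly, Category = "LLM")
    ULlamaComponent* LlamaComponent;
};
```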
- In your component or subsystem, adjust your `ModelParams` of type `FLLMModelParams`. The most important settings are:
  - `PathToModel` - where your *.gguf is placed. If the path begins with a `.` it is considered relative to the Saved/Models path; otherwise it is treated as an absolute path.
  - `SystemPrompt` - this will be auto-inserted on load by default.
  - `MaxContextLength` - this should match your model; the default is 4096.
  - `GPULayers` - how many layers to offload to the GPU. Specifying more layers than the model has works fine, e.g. use 99 if you want all of them offloaded for most practical model sizes. NB: an 8B model typically has about 33 layers. Loading more layers uses more VRAM; fitting the entire model inside your target GPU greatly increases generation speed.
- Call `LoadModel`. Consider listening to the `OnModelLoaded` callback to deal with post-loading operations.
- Call `InsertTemplatedPrompt` with your message and role (typically User), along with whether you want the prompt to generate a response or not. Optionally use `InsertRawPrompt` if you want raw input without chat formatting. Note that you can safely chain requests; they will queue up one after another and responses will return in order.
- You should receive replies via `OnResponseGenerated` once the full response has been generated. If you need streaming information, listen to `OnNewTokenGenerated` and optionally `OnPartialGenerated`, which provide token-level and sentence-level streams respectively. See the sketch after this list for an end-to-end example.
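Below is a minimal end-to-end sketch of that flow, building on the hypothetical actor above. It assumes the delegates are dynamic multicast delegates taking an `FString` and that `InsertTemplatedPrompt` can be called with just the message text; none of those signature details are confirmed by this README, so verify them against LlamaComponent.h:

```cpp
// MyNPC.cpp - hypothetical flow: configure, load, prompt, receive.
// Identifiers documented in this README: ModelParams, PathToModel, SystemPrompt,
// MaxContextLength, GPULayers, LoadModel, InsertTemplatedPrompt, OnModelLoaded,
// OnResponseGenerated. Delegate signatures, handler names, and the single-argument
// InsertTemplatedPrompt call are assumptions; handlers are assumed to be declared
// as UFUNCTION()s in MyNPC.h.
#include "MyNPC.h"

void AMyNPC::BeginPlay()
{
    Super::BeginPlay();

    // Configure FLLMModelParams before loading.
    LlamaComponent->ModelParams.PathToModel = TEXT("./model.gguf"); // leading '.' => relative to Saved/Models
    LlamaComponent->ModelParams.SystemPrompt = TEXT("You are a terse, helpful NPC.");
    LlamaComponent->ModelParams.MaxContextLength = 4096;            // should match the model
    LlamaComponent->ModelParams.GPULayers = 99;                     // offload all layers that fit in VRAM

    // Bind callbacks, then kick off the asynchronous load.
    LlamaComponent->OnModelLoaded.AddDynamic(this, &AMyNPC::HandleModelLoaded);
    LlamaComponent->OnResponseGenerated.AddDynamic(this, &AMyNPC::HandleResponse);

    LlamaComponent->LoadModel();
}

void AMyNPC::HandleModelLoaded(const FString& ModelPath)
{
    // Queue the first templated prompt; further calls simply queue up and
    // responses come back in order. Role/generation flags are left at their
    // (assumed) defaults here.
    LlamaComponent->InsertTemplatedPrompt(TEXT("Hello! Who are you?"));
}

void AMyNPC::HandleResponse(const FString& Response)
{
    UE_LOG(LogTemp, Log, TEXT("LLM response: %s"), *Response);
}
```

For streaming UI, bind `OnNewTokenGenerated` (per token) or `OnPartialGenerated` (per sentence) in the same way.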
Explore LlamaComponent.h for the detailed API. If you need to modify sampling properties, you will find them in `FLLMModelAdvancedParams`.
If you're running inference in a high-spec game on the same GPU that renders the game, expect roughly 1/3 to 1/2 of the standalone performance due to resource contention; e.g. an 8B model running at ~90 TPS standalone might drop to ~40 TPS in game. You may want to use a smaller model or apply load-easing strategies if you need perfectly stable framerates.
To build custom backends or support platforms not currently covered, you can follow these build instructions. Note that they should be run from the cloned llama.cpp root directory, not the plugin root.
- Clone llama.cpp
- Build using the commands given below, e.g. for Vulkan:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
- Include: after the build,
  - Copy `{llama.cpp root}/include` and `{llama.cpp root}/ggml/include` into `{plugin root}/ThirdParty/LlamaCpp/Include`
  - Copy `common.h` and `sampling.h` from `{llama.cpp root}/common/` into `{plugin root}/ThirdParty/LlamaCpp/Include/common`
- Libs: assuming `{llama.cpp root}/build` is `{build root}`, copy the following into `{plugin root}/ThirdParty/LlamaCpp/Lib/Win64`:
  - `{build root}/src/Release/llama.lib`
  - `{build root}/common/Release/common.lib`
  - `{build root}/ggml/src/Release/ggml.lib`, `ggml-base.lib` & `ggml-cpu.lib`
  - `{build root}/ggml/src/Release/ggml-vulkan/Release/ggml-vulkan.lib`
- Dlls: copy `ggml.dll`, `ggml-base.dll`, `ggml-cpu.dll`, `ggml-vulkan.dll`, & `llama.dll` from `{build root}/bin/Release/` into `{plugin root}/ThirdParty/LlamaCpp/Binaries/Win64`
- Build the plugin.
The current plugin's llama.cpp was built from git hash/tag: b5215
NB: use `-DGGML_NATIVE=OFF` to ensure wider portability.
The following build commands were used for Windows:
mkdir build
cd build/
cmake .. -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
See https://github.com/ggml-org/llama.cpp/blob/b4762/docs/build.md#git-bash-mingw64; e.g. once the Vulkan SDK has been installed, run:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
At the moment the CUDA 12.4 runtime is recommended.
- Ensure `bTryToUseCuda = true;` is set in LlamaCore.build.cs to add the CUDA libs to the build (untested in the v0.9 update).
mkdir build
cd build
cmake .. -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
mkdir build
cd build/
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j --verbose
For Android build see: https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md#cross-compile-using-android-ndk
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
make
Then the resulting .so or .lib files were copied into the appropriate platform directory, e.g. `ThirdParty/LlamaCpp/Win64/cpu`, and all the .h files were copied to the `Includes` directory.