An Unreal plugin for llama.cpp to support embedding local LLMs in your projects.
This fork is a modern re-write of the upstream plugin to support the latest llama.cpp API, including: GPU layers, advanced sampling (MinP, Mirostat, etc.), Jinja templates, chat history, partial rollback & context reset, regeneration, and more. It defaults to a Vulkan build on Windows for wider hardware support, at roughly a 10% loss in token generation speed compared to the CUDA backend.
- Download the latest release. Make sure to use the `Llama-Unreal-UEx.x-vx.x.x.7z` link, which contains compiled binaries, not the Source Code (zip) link.
- Create a new Unreal project or choose an existing one.
- Browse to your project folder (project root).
- Copy the `Plugins` folder from the .7z release into your project root.
- The plugin should now be ready to use.
Everything is wrapped inside a `ULlamaComponent` or `ULlamaSubsystem`, which interface with the llama.cpp code internally in a thread-safe manner via `FLlamaNative`. All core functionality is available both in C++ and in Blueprint.
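As a point of reference, here is a minimal C++ sketch of attaching the component to an actor. Only `ULlamaComponent` comes from the plugin; the actor class `AMyNPC`, the property name, and the include path are illustrative assumptions:

```cpp
// MyNPC.h - hypothetical actor owning a ULlamaComponent; everything except
// ULlamaComponent itself (class name, property name, include path) is an assumption.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "LlamaComponent.h"
#include "MyNPC.generated.h"

UCLASS()
class AMyNPC : public AActor
{
    GENERATED_BODY()

public:
    AMyNPC()
    {
        // Created as a default subobject so it shows up in the details panel
        // and is usable from Blueprint as well as C++.
        LlamaComponent = CreateDefaultSubobject<ULlamaComponent>(TEXT("LlamaComponent"));
    }

    UPROPERTY(VisibleAnywhere, BlueprintReadOnly, Category = "LLM")
    ULlamaComponent* LlamaComponent;
};
```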
- In your component or subsystem, adjust your `ModelParams` of type `FLLMModelParams`. The most important settings are:
  - `PathToModel` - where your *.gguf is placed. If the path begins with a `.` it is considered relative to the Saved/Models path; otherwise it is treated as an absolute path.
  - `SystemPrompt` - this will be auto-inserted on load by default.
  - `MaxContextLength` - this should match your model; the default is 4096.
  - `GPULayers` - how many layers to offload to the GPU. Specifying more layers than the model has works fine, e.g. use 99 if you want all of them offloaded for most practical model sizes. NB: an 8B model typically has about 33 layers. Loading more layers uses more VRAM; fitting the entire model inside your target GPU greatly increases generation speed.
- Call `LoadModel`. Consider listening to the `OnModelLoaded` callback to deal with post-loading operations.
- Call `InsertTemplatedPrompt` with your message and role (typically User), along with whether you want the prompt to generate a response or not. Optionally use `InsertRawPrompt` if you want raw input without chat formatting. Note that you can safely chain requests; they will queue up one after another and responses will return in order.
- You should receive replies via `OnResponseGenerated` once the full response has been generated. If you need streaming information, listen to `OnNewTokenGenerated` and optionally `OnPartialGenerated`, which provide token-level and sentence-level streams respectively. See the sketch after this list for an end-to-end example.
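Below is a minimal end-to-end sketch of that flow, building on the hypothetical actor above. It assumes the delegates are dynamic multicast delegates taking an `FString` and that `InsertTemplatedPrompt` can be called with just the message text; none of those signature details are confirmed by this README, so verify them against LlamaComponent.h:

```cpp
// MyNPC.cpp - hypothetical flow: configure, load, prompt, receive.
// Identifiers documented in this README: ModelParams, PathToModel, SystemPrompt,
// MaxContextLength, GPULayers, LoadModel, InsertTemplatedPrompt, OnModelLoaded,
// OnResponseGenerated. Delegate signatures, handler names, and the single-argument
// InsertTemplatedPrompt call are assumptions; handlers are assumed to be declared
// as UFUNCTION()s in MyNPC.h.
#include "MyNPC.h"

void AMyNPC::BeginPlay()
{
    Super::BeginPlay();

    // Configure FLLMModelParams before loading.
    LlamaComponent->ModelParams.PathToModel = TEXT("./model.gguf"); // leading '.' => relative to Saved/Models
    LlamaComponent->ModelParams.SystemPrompt = TEXT("You are a terse, helpful NPC.");
    LlamaComponent->ModelParams.MaxContextLength = 4096;            // should match the model
    LlamaComponent->ModelParams.GPULayers = 99;                     // offload all layers that fit in VRAM

    // Bind callbacks, then kick off the asynchronous load.
    LlamaComponent->OnModelLoaded.AddDynamic(this, &AMyNPC::HandleModelLoaded);
    LlamaComponent->OnResponseGenerated.AddDynamic(this, &AMyNPC::HandleResponse);

    LlamaComponent->LoadModel();
}

void AMyNPC::HandleModelLoaded(const FString& ModelPath)
{
    // Queue the first templated prompt; further calls simply queue up and
    // responses come back in order. Role/generation flags are left at their
    // (assumed) defaults here.
    LlamaComponent->InsertTemplatedPrompt(TEXT("Hello! Who are you?"));
}

void AMyNPC::HandleResponse(const FString& Response)
{
    UE_LOG(LogTemp, Log, TEXT("LLM response: %s"), *Response);
}
```

For streaming UI, bind `OnNewTokenGenerated` (per token) or `OnPartialGenerated` (per sentence) in the same way.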
Explore LlamaComponent.h for the detailed API. If you need to modify sampling properties, you will find them in `FLLMModelAdvancedParams`.
If you're running inference in a high-spec game on the same GPU that renders the game, expect roughly 1/3 to 1/2 of the standalone performance due to resource contention; e.g. an 8B model running at ~90 TPS standalone might drop to ~40 TPS in game. You may want to use a smaller model or apply load-easing strategies if you need perfectly stable framerates.
To build custom backends or support platforms not currently covered, you can follow these build instructions. Note that they should be run from the cloned llama.cpp root directory, not the plugin root.
- Clone llama.cpp
- Build using the commands given below, e.g. for Vulkan:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
- Include: after the build,
  - Copy `{llama.cpp root}/include` and `{llama.cpp root}/ggml/include` into `{plugin root}/ThirdParty/LlamaCpp/Include`
  - Copy `common.h` and `sampling.h` from `{llama.cpp root}/common/` into `{plugin root}/ThirdParty/LlamaCpp/Include/common`
- Libs: assuming `{llama.cpp root}/build` is `{build root}`, copy the following into `{plugin root}/ThirdParty/LlamaCpp/Lib/Win64`:
  - `{build root}/src/Release/llama.lib`
  - `{build root}/common/Release/common.lib`
  - `{build root}/ggml/src/Release/ggml.lib`, `ggml-base.lib` & `ggml-cpu.lib`
  - `{build root}/ggml/src/Release/ggml-vulkan/Release/ggml-vulkan.lib`
- Dlls: copy `ggml.dll`, `ggml-base.dll`, `ggml-cpu.dll`, `ggml-vulkan.dll`, & `llama.dll` from `{build root}/bin/Release/` into `{plugin root}/ThirdParty/LlamaCpp/Binaries/Win64`
- Build the plugin.
The current plugin's llama.cpp was built from git hash/tag: b5215
NB: use `-DGGML_NATIVE=OFF` to ensure wider portability.
The following build commands were used for Windows:
mkdir build
cd build/
cmake .. -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
See https://github.com/ggml-org/llama.cpp/blob/b4762/docs/build.md#git-bash-mingw64; e.g. once the Vulkan SDK has been installed, run:
mkdir build
cd build/
cmake .. -DGGML_VULKAN=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
At the moment the CUDA 12.4 runtime is recommended.
- Ensure `bTryToUseCuda = true;` is set in LlamaCore.build.cs to add the CUDA libs to the build (untested in the v0.9 update).
mkdir build
cd build
cmake .. -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build . --config Release -j --verbose
mkdir build
cd build/
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j --verbose
For Android build see: https://github.com/ggerganov/llama.cpp/blob/master/docs/android.md#cross-compile-using-android-ndk
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
make
Then the resulting .so or .lib files were copied into the appropriate platform directory, e.g. `ThirdParty/LlamaCpp/Win64/cpu`, and all the .h files were copied to the `Includes` directory.