yzma lets you use Go for local inference with Vision Language Models (VLMs), Large Language Models (LLMs), Small Language Models (SLMs), and Tiny Language Models (TLMs) using llama.cpp libraries all running on your own hardware.
You can use VLMs and other language models with full hardware acceleration on Linux, on macOS, and on Windows. It uses the purego and ffi packages so CGo is not needed. This means that yzma works with the very latest llama.cpp releases.
This example uses the SmolLM-135M model:
package main
import (
	"fmt"
	"os"
	"github.com/hybridgroup/yzma/pkg/llama"
)
var (
	modelFile            = "./models/SmolLM-135M.Q2_K.gguf"
	prompt               = "Are you ready to go?"
	libPath              = os.Getenv("YZMA_LIB")
	responseLength int32 = 12
)
func main() {
	llama.Load(libPath)
	llama.Init()
	model := llama.ModelLoadFromFile(modelFile, llama.ModelDefaultParams())
	lctx := llama.InitFromModel(model, llama.ContextDefaultParams())
	vocab := llama.ModelGetVocab(model)
	// call once to get the size of the tokens from the prompt
	count := llama.Tokenize(vocab, prompt, nil, true, false)
	// now get the actual tokens
	tokens := make([]llama.Token, count)
	llama.Tokenize(vocab, prompt, tokens, true, false)
	batch := llama.BatchGetOne(tokens)
	sampler := llama.SamplerChainInit(llama.SamplerChainDefaultParams())
	llama.SamplerChainAdd(sampler, llama.SamplerInitGreedy())
	for pos := int32(0); pos+batch.NTokens < count+responseLength; pos += batch.NTokens {
		llama.Decode(lctx, batch)
		token := llama.SamplerSample(sampler, lctx, -1)
		if llama.VocabIsEOG(vocab, token) {
			fmt.Println()
			break
		}
		buf := make([]byte, 36)
		len := llama.TokenToPiece(vocab, token, buf, 0, true)
		fmt.Print(string(buf[:len]))
		batch = llama.BatchGetOne([]llama.Token{token})
	}
	fmt.Println()
}Produces the following output:
$ go run ./examples/hello/ 2>/dev/null
The first thing you need to do is to get your hands on a computer.What's with the 2>/dev/null at the end? That is the "easy way" to suppress the logging from llama.cpp.
Didn't get any output? Run it again without the 2>/dev/null to see any errors.
You will need to download the llama.cpp prebuilt libraries for your platform.
See INSTALL.md for detailed information on installation on Linux, macOS, and Windows.
This example uses the Qwen2.5-VL-3B-Instruct-Q8_0 VLM model to process both a text prompt and an image, then displays the result.
$ go run ./examples/vlm/ -model ./models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf -mmproj ./models/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?" 2>/dev/null
Loading model ./models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf
encoding image slice...
image slice encoded in 966 ms
decoding image batch 1/1, n_tokens_batch = 910
image decoded (batch 1/1) in 208 ms
The picture shows a white llama standing in a fenced-in area, possibly a zoo or a wildlife park. The llama is the main focus of the image, and it appears to be looking to the right. The background features a grassy area with trees and a fence, and there are some vehicles visible in the distance.You can use yzma to do inference on text language models. This example uses the qwen2.5-0.5b-instruct-fp16.gguf  model for an interactive chat session.
$ go run ./examples/chat/ -model ./models/qwen2.5-0.5b-instruct-fp16.gguf
Enter prompt: Are you ready to go?
Yes, I'm ready to go! What would you like to do?
Enter prompt: Let's go to the zoo
Great! Let's go to the zoo. What would you like to see?
Enter prompt: I want to feed the llama 
Sure! Let's go to the zoo and feed the llama. What kind of llama are you interested in feeding?See the examples directory for more examples of how to use yzma.
yzma currently has support for nearly 80% of llama.cpp functionality. See ROADMAP.md for a complete list.
You can already use VLMs and other language models with full hardware acceleration on Linux, on macOS, and on Windows.
Here are some advantages of yzma over other Go packages for llama.cpp:
- Compile Go programs that use yzmawith the normalgo buildandgo runcommands. No C compiler needed!
- Use the llama.cpplibraries with whatever hardware acceleration is available for your configuration. CUDA, Vulkan, etc.
- Download llama.cppprecompiled libraries directly from Github, or include them with your application.
- Update the llama.cpplibraries without recompiling your Go program, as long asllama.cppdoes not make any breaking changes.
The idea is to make it easier for Go developers to use language models as part of "normal" applications without having to use containers or do anything other than the normal GOOS and GOARCH env variables for cross-complication.
yzma borrows definitions from the https://github.com/dianlight/gollama.cpp package then modifies them rather heavily. Thank you!
