Releases · ggml-org/llama.cpp

29 Jun 08:36

bd9c981

b5775 Latest

vulkan: Add fusion support for RMS_NORM+MUL (#14366)

* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <[email protected]>
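The later commits in this PR keep use counts on the graph, indexed by hash-table slot, relax the single-use requirement for the last node, and accept the MUL operands in either order. A minimal sketch of such a fusion check on a toy graph, in C: the structs, the linear-probing hash set and can_fuse_rms_norm_mul are all hypothetical stand-ins, not ggml's ggml_cgraph or ggml_can_fuse, and counting uses in a single pass before any worker threads run is an assumption about how the data race mentioned above is avoided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy graph types (hypothetical, not ggml's structs). */
enum op { OP_NONE, OP_RMS_NORM, OP_MUL };

struct node {
    enum op op;
    struct node *src[2];
};

#define HASH_SIZE 64 /* power of two; must exceed the number of distinct nodes */

struct graph {
    struct node *nodes[32];
    int n_nodes;
    struct node *keys[HASH_SIZE]; /* open-addressing set of node pointers */
    int use_counts[HASH_SIZE];    /* indexed by hash slot, not node index */
};

/* Return the slot holding n, or the empty slot where n would be inserted. */
static size_t hash_slot(const struct graph *g, const struct node *n) {
    size_t i = ((uintptr_t)n >> 4) & (HASH_SIZE - 1);
    while (g->keys[i] != NULL && g->keys[i] != n) i = (i + 1) & (HASH_SIZE - 1);
    return i;
}

static size_t hash_insert(struct graph *g, struct node *n) {
    size_t i = hash_slot(g, n);
    g->keys[i] = n;
    return i;
}

/* Count how often each tensor is consumed as a source, in one pass. */
static void count_uses(struct graph *g) {
    for (int i = 0; i < g->n_nodes; i++) {
        struct node *n = g->nodes[i];
        hash_insert(g, n);
        for (int s = 0; s < 2; s++) {
            if (n->src[s]) g->use_counts[hash_insert(g, n->src[s])]++;
        }
    }
}

/* Can nodes[i] (RMS_NORM) and nodes[i + 1] (MUL) be fused? The RMS_NORM
 * output must feed only the MUL, since the fused kernel never stores it;
 * the MUL, as the last node, may have any number of consumers. */
static bool can_fuse_rms_norm_mul(const struct graph *g, int i) {
    if (i + 1 >= g->n_nodes) return false;
    const struct node *norm = g->nodes[i];
    const struct node *mul = g->nodes[i + 1];
    if (norm->op != OP_RMS_NORM || mul->op != OP_MUL) return false;
    /* either operand order is accepted */
    if (mul->src[0] != norm && mul->src[1] != norm) return false;
    size_t slot = hash_slot(g, norm);
    assert(g->keys[slot] == norm);
    return g->use_counts[slot] == 1;
}
```

One plausible benefit of indexing counts by slot is that the lookup which locates a node also yields its counter, so no separate node-to-index map is needed.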

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip

sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-06-29T08:36:46Z
llama-b5775-bin-macos-arm64.zip

sha256:9fd2fc38eea5d07efc81f81d25b41d0d4781eb7cc761a1627d71dcbad83dc008
10.5 MB 2025-06-29T08:36:59Z
llama-b5775-bin-macos-x64.zip

sha256:bb83180999a3d9cdbe5609b97c443a57dcf5888747193c22d62eb583b041f27c
26.2 MB 2025-06-29T08:37:00Z
llama-b5775-bin-ubuntu-vulkan-x64.zip

sha256:222d9f1e3ea1b79cfaf0d91eb4f77fa3a42920c823a98ed7629fe79e71ac9978
19.9 MB 2025-06-29T08:37:02Z
llama-b5775-bin-ubuntu-x64.zip

sha256:d4c2a15cb0c41b6cbd05f94eb6bef3605e1bd809c997142c3296b999f2181a1d
12.3 MB 2025-06-29T08:37:03Z
llama-b5775-bin-win-cpu-arm64.zip

sha256:c780dc43fe11942cead022fdaed47cb3b22f84d841a20f7c0ec5e12af43fa9e8
10.8 MB 2025-06-29T08:37:04Z
llama-b5775-bin-win-cpu-x64.zip

sha256:7f0e7f79dc052edb1a33bafe0792b573f4030423025d82b6331cfea767119d49
13.5 MB 2025-06-29T08:37:06Z
llama-b5775-bin-win-cuda-12.4-x64.zip

sha256:26875fd4273111a42b8d4b63ac338f8b4bd8847f04266f3653bd049ad7ef28a2
128 MB 2025-06-29T08:37:07Z
llama-b5775-bin-win-hip-radeon-x64.zip

sha256:c81bf3d923d4bff0b6c7334aecdb67f58d73e5b24deb82799f8c92a235371f4f
298 MB 2025-06-29T08:37:14Z
llama-b5775-bin-win-opencl-adreno-arm64.zip

sha256:6192a4b9f4e70296b4764876a7d97bb78e2d418b563a4dd78476f2e4cc9809d2
11.1 MB 2025-06-29T08:37:25Z
Source code (zip)

2025-06-29T07:43:36Z
Source code (tar.gz)

2025-06-29T07:43:36Z

28 Jun 18:00

github-actions

27208bf

b5774

CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)

* CUDA: add bf16 and f32 support to cublas_mul_mat_batched

* Review: add type traits and make function more generic

* Review: make check more explicit, add back comments, and fix formatting

* Review: fix formatting, remove useless type conversion, fix naming for bools

Assets 15

28 Jun 16:05

github-actions

63a7bb3

b5773

vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipelin…

Assets 15

28 Jun 15:40

github-actions

00d5282

b5772

vulkan: lock accesses of pinned_memory vector (#14333)

Assets 15

28 Jun 14:25

github-actions

566c16f

b5771

model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <[email protected]>

Assets 15

28 Jun 09:55

github-actions

b25e927

b5770

fix async_mode bug (#14432)

Assets 15

28 Jun 08:57

github-actions

6609507

b5769

ci : fix windows build and release (#14431)

Assets 15

26 Jun 14:18

github-actions

e8215db

b5760

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

Assets 15

26 Jun 13:56

github-actions

5783ae4

b5759

metal : batch rows copy in a single threadgroup (#14384)

* metal : batch rows copy in a single threadgroup

ggml-ci

* metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

Assets 15

26 Jun 05:21

github-actions

716301d

b5757

musa: enable fp16 mma (all) and cublas on qy2 (#13842)

* musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye <[email protected]>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>

Assets 15
