<img width="567" alt="Image" src="https://github.com/user-attachments/assets/09d641ef-ed28-4c1a-a8b4-f8efb69fce89" />

The matmul result is obviously wrong. I then ran the next cell with shape (512, 512), which surprisingly produced a correct result.

<img width="703" alt="Image" src="https://github.com/user-attachments/assets/00731cc2-eb8a-4f60-8eae-5f6b2c86c769" />

I suspect there is a bug in the indexing or a missing guard clause. Another example with shape (16, 16), which matches the batch size, also passed.

<img width="1011" alt="Image" src="https://github.com/user-attachments/assets/38787c1f-9771-4431-bd95-c2b3890a51eb" />

P.S. I do not have a GPU and am running this on a Mac, so everything runs in CPU mode with `TRITON_INTERPRET=1`.
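
To illustrate what I mean by a guard clause, here is a minimal sketch in plain Python rather than Triton. It is not the failing kernel: `load_tile`, `tiled_matmul`, and `check` are hypothetical names made up for this illustration, and the block size of 16 is an assumption. The bounds checks stand in for the masking a Triton kernel would typically do with `tl.load(..., mask=..., other=0.0)` and a masked `tl.store`.

```python
import random


def reference_matmul(a, b):
    """Naive triple-loop matmul used as ground truth."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def load_tile(mat, row0, col0, block):
    """Load a block x block tile; out-of-range entries become 0 (the guard)."""
    rows, cols = len(mat), len(mat[0])
    return [[mat[r][c] if r < rows and c < cols else 0
             for c in range(col0, col0 + block)]
            for r in range(row0, row0 + block)]


def tiled_matmul(a, b, block=16):
    """Blocked matmul with guarded loads and a guarded store."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            acc = [[0] * block for _ in range(block)]
            for p0 in range(0, k, block):
                ta = load_tile(a, i0, p0, block)
                tb = load_tile(b, p0, j0, block)
                for i in range(block):
                    for j in range(block):
                        acc[i][j] += sum(ta[i][p] * tb[p][j]
                                         for p in range(block))
            # Guarded store: only write elements that exist in the output.
            for i in range(block):
                for j in range(block):
                    if i0 + i < m and j0 + j < n:
                        c[i0 + i][j0 + j] = acc[i][j]
    return c


def check(m, k, n, block=16, seed=0):
    """Compare the tiled result with the reference on random integer inputs."""
    rng = random.Random(seed)
    a = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(m)]
    b = [[rng.randint(-3, 3) for _ in range(n)] for _ in range(k)]
    return tiled_matmul(a, b, block) == reference_matmul(a, b)
```

Running `check` on a shape that is a multiple of the block size, such as (16, 16), and on one that is not, such as (20, 24) by (24, 36), is one way to test whether a missing guard explains the failure. If the bounds checks in `load_tile` or the store are removed, only the non-multiple shapes should break.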