[FA performance] Improve the Q matrix load strategy #3966

mfrancepillois · 2025-04-18T11:23:15Z

In the Cutlass implementation, the Q matrix is prefetched outside the loop but loaded inside the loop.
This shortens the live range of Q, which avoids register spilling and improves FA performance.
A similar strategy should be evaluated for FA in Triton.
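
To make the two load placements concrete, here is a minimal pure-Python sketch (not Triton or Cutlass code). It computes one attention row with an online softmax over K/V blocks, once with Q loaded before the loop and once with Q prefetched before the loop and reloaded in each iteration. The helpers `load_row` and `prefetch_row` and both `attend_*` functions are hypothetical names used only for illustration. The sketch shows where the loads sit and that both schedules give the same result; it does not model registers, caches, or spilling.

```python
import math

def load_row(matrix, i):
    # Hypothetical stand-in for a global-memory load: returns a copy of row i.
    return list(matrix[i])

def prefetch_row(matrix, i):
    # Hypothetical stand-in for a cache prefetch hint. It returns nothing,
    # so no value stays live after the call.
    return None

def _update(q, k_blk, v_blk, m, l, acc):
    # One online-softmax step over a block of keys and values.
    for k, v in zip(k_blk, v_blk):
        s = sum(a * b for a, b in zip(q, k))       # score q . k
        m_new = max(m, s)                          # running maximum
        scale = math.exp(m - m_new)                # rescale old partial sums
        p = math.exp(s - m_new)
        l = l * scale + p                          # running softmax denominator
        acc = [a * scale + p * x for a, x in zip(acc, v)]
        m = m_new
    return m, l, acc

def attend_q_hoisted(Q, K, V, i, block=2):
    # Q is loaded once before the loop, so q is live across every iteration.
    q = load_row(Q, i)
    m, l, acc = -math.inf, 0.0, [0.0] * len(V[0])
    for start in range(0, len(K), block):
        m, l, acc = _update(q, K[start:start + block], V[start:start + block], m, l, acc)
    return [a / l for a in acc]

def attend_q_in_loop(Q, K, V, i, block=2):
    # Q is only prefetched before the loop; the actual load happens inside
    # the loop body, so q is live only within a single iteration.
    prefetch_row(Q, i)
    m, l, acc = -math.inf, 0.0, [0.0] * len(V[0])
    for start in range(0, len(K), block):
        q = load_row(Q, i)
        m, l, acc = _update(q, K[start:start + block], V[start:start + block], m, l, acc)
    return [a / l for a in acc]

if __name__ == "__main__":
    Q = [[1.0, 0.5]]
    K = [[0.1, 0.2], [0.3, 0.4], [-0.5, 0.6]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(attend_q_hoisted(Q, K, V, 0))
    print(attend_q_in_loop(Q, K, V, 0))
```

In this toy form the in-loop variant only repeats a cheap copy. On real hardware the trade-off is an extra (ideally cache-hit) load per iteration against fewer values held live across the loop, which is why it needs to be measured.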

mfrancepillois · 2025-05-19T11:53:57Z

The performance assessment of the new pass is blocked by the performance regression on FA: #4239

mfrancepillois self-assigned this Apr 18, 2025

mfrancepillois linked a pull request Apr 18, 2025 that will close this issue

New pass Reduce variable liveness #3965

Open

mfrancepillois added the performance and codegen: attention labels Apr 18, 2025

vlad-penkin added this to the 4. [Performance] Core milestone Apr 28, 2025
