@whitneywhtsang whitneywhtsang commented Jan 9, 2026

This PR changes the Triton base from 28c538a to 7abb0be (Dec 23).

Pass rate: 98.28%->98.26%

Jokeren and others added 12 commits December 22, 2025 13:50
…ined operands (#8732)

Consider the following example IR:

```
%y_16 = tt.descriptor_load %y_desc[%c0_i32, %y] {loop.cluster = 1 : i32, loop.stage = 0 : i32} : !tt.tensordesc<tensor<64x64xbf16, #shared>> -> tensor<64x64xbf16, #blocked>
%y_17 = ttg.local_alloc %y_16 {loop.cluster = 1 : i32, loop.stage = 0 : i32} : (tensor<64x64xbf16, #blocked>) -> !ttg.memdesc<64x64xbf16, #shared, #smem>
%acc_18 = ttng.tc_gen5_mma %x_12, %y_17, %acc_13[%acc_15], %acc, %true {loop.cluster = 1 : i32, loop.stage = 0 : i32, tt.self_latency = 1 : i32} : !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable>
%acc_19, %acc_20 = ttng.tmem_load %acc_13[%acc_18] {loop.cluster = 0 : i32, loop.stage = 1 : i32} : !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable> -> tensor<64x64xf32, #blocked1>
```
The loop lowering step attempts to determine the barrier location that marks the MMA as "done" based on the earliest of the TMEM load or a non-pipelined operand. However, the current implementation relies on `schedule.isOpBefore`, which is inaccurate here because it reports which operand happens first, not which operation happens first in the body of the loop. For example, it would indicate that `tt.descriptor_load` comes before `ttng.tmem_load`.

We need to update this check so that it accounts for the fact that the operands may occur before the MMA; the location comparison should therefore be invoked "after" the first MMA operation.
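As a rough sketch of the intended fix (hypothetical helper and variable names; it only assumes MLIR's `Operation::isBeforeInBlock`, and the real logic lives in Triton's loop lowering):

```
#include "mlir/IR/Operation.h"
using namespace mlir;

// Hypothetical helper: pick the op after which the MMA-completion barrier
// should be signaled, comparing positions in the loop body rather than
// relying on operand order.
static Operation *pickBarrierLocation(Operation *mmaOp, Operation *tmemLoadOp,
                                      Operation *operandDefOp) {
  // An operand defined before the MMA in the loop body feeds the *next*
  // pipeline iteration, so it cannot mark this MMA as done; only consider
  // candidates located after the first MMA operation.
  if (!mmaOp->isBeforeInBlock(operandDefOp))
    return tmemLoadOp;
  // Both candidates follow the MMA: take whichever comes first.
  return operandDefOp->isBeforeInBlock(tmemLoadOp) ? operandDefOp : tmemLoadOp;
}
```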
* Use the same link.cpp source except for hipStream/CUstream etc.
* Add a link.h prelude for AMD/Nvidia to adapt for the differences (a sketch follows after this list).
* Enable test_aot.py for AMD.
* Also rename AMD's compile.cpp to compile.c.
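A minimal sketch of what such a prelude could look like (the guard macro and type aliases are illustrative, not the actual header contents):

```
// Hypothetical link.h prelude: alias the vendor-specific stream/result
// types so the shared link source compiles unchanged for both backends.
#if defined(USE_HIP) // assumed guard for the AMD build
#include <hip/hip_runtime.h>
typedef hipStream_t gpuStream_t;
typedef hipError_t gpuResult_t;
#else
#include <cuda.h>
typedef CUstream gpuStream_t;
typedef CUresult gpuResult_t;
#endif
```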
Currently the behavior of fp4_padded is different between
`triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
the data is int8, while the latter is indexed by individual fp4
elements, which is what the TMA hardware expects.

This difference now leaks into gluon, which isn't ideal, so this PR moves the translation into the lowerings. Along the way, this probably fixes quite a few bugs, as there were several places where the translation was missing.
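A toy sketch of the unit conversion involved, assuming fp4_padded packs two fp4 elements per int8 byte (function names are illustrative, not Triton's):

```
#include <cstdint>

// Descriptor-style ops index the data as int8 (one index per byte), while
// the TMA hardware expects indices in individual fp4 elements, two of
// which fit in each byte.
static int64_t toFp4ElementIndex(int64_t byteIndex) {
  return byteIndex * 2; // one byte holds two fp4 elements
}

static int64_t toByteIndex(int64_t fp4ElementIndex) {
  return fp4ElementIndex / 2; // round down to the containing byte
}
```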
`std::make_tuple` here copies its arguments into the tuple, so it creates a copy of the `SmallVector` `subsliceOffsets` and then passes back a tuple holding an `ArrayRef`. That `SmallVector` copy then goes out of scope, leaving the `ArrayRef` dangling. Bypassing `make_tuple` means the tuple uses the underlying `AllocationSlice`'s reference to `subsliceOffsets` rather than the temporary copy created by `make_tuple`.
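A self-contained illustration of the lifetime bug and the fix, using `std::vector` plus a minimal non-owning view in place of `SmallVector`/`ArrayRef` (all names here are stand-ins, not Triton's actual code):

```
#include <cstddef>
#include <cstdio>
#include <tuple>
#include <vector>

// Minimal stand-in for llvm::ArrayRef: a non-owning view of a vector.
struct IntView {
  const int *data = nullptr;
  size_t size = 0;
  IntView() = default;
  IntView(const std::vector<int> &v) : data(v.data()), size(v.size()) {}
};

struct AllocationSlice {
  std::vector<int> subsliceOffsets{10, 20, 30};

  // BUG: make_tuple deduces std::tuple<int, std::vector<int>>, copying
  // subsliceOffsets into a temporary tuple; converting that temporary to
  // the return type builds an IntView into the copy, which is destroyed
  // when the statement ends, so the returned view dangles.
  std::tuple<int, IntView> getBuggy() const {
    return std::make_tuple(42, subsliceOffsets);
  }

  // FIX: construct the return tuple directly so the IntView references
  // the long-lived member rather than a temporary copy.
  std::tuple<int, IntView> getFixed() const {
    return {42, subsliceOffsets};
  }
};

int main() {
  AllocationSlice slice;
  auto [n, view] = slice.getFixed();
  std::printf("%d %d\n", n, view.data[0]); // safe: views slice's member
  return 0;
}
```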
…090)

Reverts triton-lang/triton#9080, as it causes some tmem allocation regressions due to its simplistic hoisting logic.
Enables `ttg.async_copy_global_to_local` for pipelined loads by default
on `gfx950` and `gfx1250`.

This increases LDS consumption because we replace one register buffer
with an additional LDS buffer. After this change, the number of LDS
buffers is equal to `num_stages` (previously it was `num_stages - 1`).
Therefore, some test configs need to be skipped because we run out of
shared memory capacity on `gfx950`.
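As an illustrative calculation (tile size chosen for this example, not taken from the PR): a 64x64 bf16 operand tile occupies 64 * 64 * 2 bytes = 8 KiB of LDS, so with `num_stages = 3` the per-operand LDS footprint grows from 2 * 8 KiB = 16 KiB to 3 * 8 KiB = 24 KiB.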
---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Enables an approach similar to #8752 in the AMD backend for buffer ops. This helps preserve vectorization based on kernel annotations when converting to buffer_load/store.
@whitneywhtsang whitneywhtsang marked this pull request as ready for review January 9, 2026 20:20
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 63b387c Merge OpenAI Triton commit a0e769f Jan 10, 2026
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit a0e769f Merge OpenAI Triton commit 7abb0be Jan 10, 2026
@whitneywhtsang whitneywhtsang merged commit 2b2bf12 into main Jan 10, 2026
27 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch January 10, 2026 21:07