Merge OpenAI Triton commit 7abb0be
#5794
Merged
Conversation
…ined operands (#8732)
Consider the following example IR:
```
%y_16 = tt.descriptor_load %y_desc[%c0_i32, %y] {loop.cluster = 1 : i32, loop.stage = 0 : i32} : !tt.tensordesc<tensor<64x64xbf16, #shared>> -> tensor<64x64xbf16, #blocked>
%y_17 = ttg.local_alloc %y_16 {loop.cluster = 1 : i32, loop.stage = 0 : i32} : (tensor<64x64xbf16, #blocked>) -> !ttg.memdesc<64x64xbf16, #shared, #smem>
%acc_18 = ttng.tc_gen5_mma %x_12, %y_17, %acc_13[%acc_15], %acc, %true {loop.cluster = 1 : i32, loop.stage = 0 : i32, tt.self_latency = 1 : i32} : !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable>
%acc_19, %acc_20 = ttng.tmem_load %acc_13[%acc_18] {loop.cluster = 0 : i32, loop.stage = 1 : i32} : !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable> -> tensor<64x64xf32, #blocked1>
```
The loop lowering step attempts to determine the barrier location used to mark the MMA as "done" based on whichever comes first: the TMEM load or a non-pipelined operand. However, the current implementation leverages `schedule.isOpBefore`, which is inaccurate here because it reports which operation appears first in the body of the loop, not which one is reached first relative to the MMA. For example, it would indicate that `tt.descriptor_load` comes before `ttng.tmem_load`.
We need to update this check so it accounts for the fact that the operands may occur before the MMA in the loop body; the location comparison should therefore be evaluated as if starting "after" the first MMA operation.
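For intuition, here is a minimal standalone C++ sketch, not the actual pipeliner code: the op names are taken from the IR above, and the helpers `indexOf`, `isBeforeNaive`, and `isBeforeAfterMMA` are hypothetical. It shows why a plain "is op A before op B" check gives the wrong answer here, and how comparing positions relative to the first MMA fixes it.
```cpp
#include <cassert>
#include <string>
#include <vector>

// Loop body in textual order, mirroring the IR example above.
static const std::vector<std::string> body = {
    "tt.descriptor_load", // non-pipelined operand, textually before the MMA
    "ttg.local_alloc",
    "ttng.tc_gen5_mma",   // the MMA whose completion barrier we place
    "ttng.tmem_load",     // consumer of the MMA result
};

static int indexOf(const std::string &op) {
  for (int i = 0; i < (int)body.size(); ++i)
    if (body[i] == op)
      return i;
  return -1;
}

// Naive check: textual order in the loop body. It reports the operand load as
// "earlier", even though the occurrence that matters is the next iteration's.
static bool isBeforeNaive(const std::string &a, const std::string &b) {
  return indexOf(a) < indexOf(b);
}

// Corrected comparison: measure the distance when walking forward from the
// first MMA and wrapping around the loop, i.e. "which op is reached first
// *after* the MMA". Only an illustration of the idea, not the real pass.
static bool isBeforeAfterMMA(const std::string &a, const std::string &b) {
  const int mma = indexOf("ttng.tc_gen5_mma");
  const int n = (int)body.size();
  auto dist = [&](const std::string &op) { return (indexOf(op) - mma + n) % n; };
  return dist(a) < dist(b);
}

int main() {
  // Textually, the descriptor load precedes the tmem load ...
  assert(isBeforeNaive("tt.descriptor_load", "ttng.tmem_load"));
  // ... but walking forward from the MMA, the tmem load is reached first, so
  // that is where the completion barrier belongs.
  assert(isBeforeAfterMMA("ttng.tmem_load", "tt.descriptor_load"));
  return 0;
}
```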
* Use the same link cpp source except for hipStream/CUstream etc.
* Add a link.h prelude for AMD/Nvidia to adapt for the difference.
* Enable test_aot.py for AMD.
* Also rename AMD's compile.cpp to compile.c.
Currently the behavior of fp4_padded differs between `triton::Descriptor` ops and `AsyncTMA` ops: the former are indexed as if the data were int8, while the latter are indexed by individual fp4 elements, which is what the TMA hardware expects. This difference now leaks into Gluon, which isn't ideal, so this PR moves the translation into the lowerings. Along the way, this likely fixes quite a few bugs, as there were several places where the translation was missing.
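For intuition, the two conventions differ by a factor of two because two fp4 values pack into one int8 byte. A hedged sketch of the index translation (hypothetical helper names, not the actual lowering code):
```cpp
#include <cassert>
#include <cstdint>

// Two fp4 elements are packed per int8 byte.
constexpr int64_t kFp4PerByte = 2;

// Convention used by the descriptor ops in this description: offsets count
// int8 bytes of packed data.
int64_t byteOffsetFromElement(int64_t fp4ElementIndex) {
  return fp4ElementIndex / kFp4PerByte;
}

// Convention the TMA hardware expects for fp4_padded: offsets count
// individual fp4 elements.
int64_t elementOffsetFromByte(int64_t byteIndex) {
  return byteIndex * kFp4PerByte;
}

int main() {
  // A tile starting at byte 64 of packed data starts at fp4 element 128.
  assert(elementOffsetFromByte(64) == 128);
  assert(byteOffsetFromElement(128) == 64);
  return 0;
}
```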
`std::make_tuple` here copies its arguments into the tuple, so it creates a copy of the SmallVector `subsliceOffsets` and then passes back a tuple holding an ArrayRef into that copy. The copied SmallVector then goes out of scope. Bypassing `make_tuple` means the ArrayRef refers to the underlying AllocationSlice's `subsliceOffsets` rather than the temporary copy created by `make_tuple`.
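A minimal standalone illustration of that lifetime hazard, assuming C++20 and using `std::vector`/`std::span` as stand-ins for `llvm::SmallVector`/`llvm::ArrayRef`; the names and signatures are illustrative, not the real Triton code:
```cpp
#include <span>
#include <tuple>
#include <vector>

// std::vector / std::span stand in for llvm::SmallVector / llvm::ArrayRef.
struct AllocationSlice {
  std::vector<int> subsliceOffsets; // long-lived storage owned by the slice
};

// Hazard analogous to the commit: the declared return type holds a view, but
// std::make_tuple materializes a temporary tuple that owns a *copy* of the
// vector. Converting to the return type leaves the span pointing into that
// temporary copy, which is destroyed when the return statement finishes.
std::tuple<std::span<const int>, int> dangling(const AllocationSlice &slice) {
  return std::make_tuple(slice.subsliceOffsets, 42);
}

// Fix described above: bypass make_tuple so the view is built directly over
// the AllocationSlice's own storage, which outlives the call.
std::tuple<std::span<const int>, int> stable(const AllocationSlice &slice) {
  return {slice.subsliceOffsets, 42};
}

int main() {
  AllocationSlice s{{1, 2, 3}};
  auto [view, tag] = stable(s); // safe: view refers to s.subsliceOffsets
  return (view.size() == 3 && tag == 42) ? 0 : 1;
}
```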
…090) Reverts triton-lang/triton#9080 as it caused a tmem allocation regression due to its simplistic hoisting logic.
Enables `ttg.async_copy_global_to_local` for pipelined loads by default on `gfx950` and `gfx1250`. This increases LDS consumption because we replace one register buffer with an additional LDS buffer. After this change, the number of LDS buffers is equal to `num_stages` (previously it was `num_stages - 1`). Therefore, some test configs need to be skipped because we run out of shared memory capacity on `gfx950`.

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
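A back-of-the-envelope sketch of what the buffer-count change means for LDS usage; the tile shape and element size below are hypothetical, and only the `num_stages - 1` to `num_stages` buffer-count change comes from this description:
```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical 64x64 bf16 tile staged through LDS (8 KiB per buffer).
  const int64_t tileBytes = 64 * 64 * 2;
  for (int numStages : {2, 3, 4}) {
    int64_t before = (numStages - 1) * tileBytes; // old: num_stages - 1 buffers
    int64_t after = numStages * tileBytes;        // new: num_stages buffers
    std::printf("num_stages=%d  LDS before=%lld B  after=%lld B\n", numStages,
                static_cast<long long>(before), static_cast<long long>(after));
  }
  return 0;
}
```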
Enables an approach similar to #8752 in the AMD backend for buffer ops. This helps preserve vectorization based on kernel annotations when converting to buffer_load/store.
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
anmyachev approved these changes on Jan 10, 2026.
This PR changes the Triton base from 28c538a to 7abb0be (Dec 23).
Pass rate: 98.28%->98.26%