Merge OpenAI Triton commit 7abb0be
#5794
Merged
Conversation
…ined operands (#8732)
Consider the following example IR:
```
%y_16 = tt.descriptor_load %y_desc[%c0_i32, %y] {loop.cluster = 1 : i32, loop.stage = 0 : i32} : !tt.tensordesc<tensor<64x64xbf16, #shared>> -> tensor<64x64xbf16, #blocked>
%y_17 = ttg.local_alloc %y_16 {loop.cluster = 1 : i32, loop.stage = 0 : i32} : (tensor<64x64xbf16, #blocked>) -> !ttg.memdesc<64x64xbf16, #shared, #smem>
%acc_18 = ttng.tc_gen5_mma %x_12, %y_17, %acc_13[%acc_15], %acc, %true {loop.cluster = 1 : i32, loop.stage = 0 : i32, tt.self_latency = 1 : i32} : !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable>
%acc_19, %acc_20 = ttng.tmem_load %acc_13[%acc_18] {loop.cluster = 0 : i32, loop.stage = 1 : i32} : !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable> -> tensor<64x64xf32, #blocked1>
```
The loop lowering step attempts to determine the barrier location used to mark the MMA as "done" based on whichever comes first: the TMEM load or a non-pipelined operand. However, the current implementation leverages `schedule.isOpBefore`, which is inaccurate here because it reports which operation appears first in the body of the loop, not which one is reached first relative to the MMA. For example, it would indicate that `tt.descriptor_load` comes before `ttng.tmem_load`.
We need to update this check so it accounts for the fact that the operands may occur before the MMA in the loop body; the location comparison should therefore be evaluated as if starting "after" the first MMA operation.
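For intuition, here is a minimal standalone C++ sketch, not the actual pipeliner code: the op names are taken from the IR above, and the helpers `indexOf`, `isBeforeNaive`, and `isBeforeAfterMMA` are hypothetical. It shows why a plain "is op A before op B" check gives the wrong answer here, and how comparing positions relative to the first MMA fixes it.
```cpp
#include <cassert>
#include <string>
#include <vector>

// Loop body in textual order, mirroring the IR example above.
static const std::vector<std::string> body = {
    "tt.descriptor_load", // non-pipelined operand, textually before the MMA
    "ttg.local_alloc",
    "ttng.tc_gen5_mma",   // the MMA whose completion barrier we place
    "ttng.tmem_load",     // consumer of the MMA result
};

static int indexOf(const std::string &op) {
  for (int i = 0; i < (int)body.size(); ++i)
    if (body[i] == op)
      return i;
  return -1;
}

// Naive check: textual order in the loop body. It reports the operand load as
// "earlier", even though the occurrence that matters is the next iteration's.
static bool isBeforeNaive(const std::string &a, const std::string &b) {
  return indexOf(a) < indexOf(b);
}

// Corrected comparison: measure the distance when walking forward from the
// first MMA and wrapping around the loop, i.e. "which op is reached first
// *after* the MMA". Only an illustration of the idea, not the real pass.
static bool isBeforeAfterMMA(const std::string &a, const std::string &b) {
  const int mma = indexOf("ttng.tc_gen5_mma");
  const int n = (int)body.size();
  auto dist = [&](const std::string &op) { return (indexOf(op) - mma + n) % n; };
  return dist(a) < dist(b);
}

int main() {
  // Textually, the descriptor load precedes the tmem load ...
  assert(isBeforeNaive("tt.descriptor_load", "ttng.tmem_load"));
  // ... but walking forward from the MMA, the tmem load is reached first, so
  // that is where the completion barrier belongs.
  assert(isBeforeAfterMMA("ttng.tmem_load", "tt.descriptor_load"));
  return 0;
}
```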
* Use the same link cpp source except for hipStream/CUstream etc.
* Add a link.h prelude for AMD/Nvidia to adapt for the difference.
* Enable test_aot.py for AMD.
* Also rename AMD's compile.cpp to compile.c.
Currently the behavior of fp4_padded differs between `triton::Descriptor` ops and `AsyncTMA` ops: the former are indexed as if the data were int8, while the latter are indexed by individual fp4 elements, which is what the TMA hardware expects. This difference now leaks into Gluon, which isn't ideal, so this PR moves the translation into the lowerings. Along the way, this likely fixes quite a few bugs, as there were several places where the translation was missing.
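For intuition, the two conventions differ by a factor of two because two fp4 values pack into one int8 byte. A hedged sketch of the index translation (hypothetical helper names, not the actual lowering code):
```cpp
#include <cassert>
#include <cstdint>

// Two fp4 elements are packed per int8 byte.
constexpr int64_t kFp4PerByte = 2;

// Convention used by the descriptor ops in this description: offsets count
// int8 bytes of packed data.
int64_t byteOffsetFromElement(int64_t fp4ElementIndex) {
  return fp4ElementIndex / kFp4PerByte;
}

// Convention the TMA hardware expects for fp4_padded: offsets count
// individual fp4 elements.
int64_t elementOffsetFromByte(int64_t byteIndex) {
  return byteIndex * kFp4PerByte;
}

int main() {
  // A tile starting at byte 64 of packed data starts at fp4 element 128.
  assert(elementOffsetFromByte(64) == 128);
  assert(byteOffsetFromElement(128) == 64);
  return 0;
}
```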
`std::make_tuple` here copies its arguments into the tuple, so it creates a copy of the SmallVector `subsliceOffsets` and then passes back a tuple holding an ArrayRef into that copy. The copied SmallVector then goes out of scope. Bypassing `make_tuple` means the ArrayRef refers to the underlying AllocationSlice's `subsliceOffsets` rather than the temporary copy created by `make_tuple`.
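A minimal standalone illustration of that lifetime hazard, assuming C++20 and using `std::vector`/`std::span` as stand-ins for `llvm::SmallVector`/`llvm::ArrayRef`; the names and signatures are illustrative, not the real Triton code:
```cpp
#include <span>
#include <tuple>
#include <vector>

// std::vector / std::span stand in for llvm::SmallVector / llvm::ArrayRef.
struct AllocationSlice {
  std::vector<int> subsliceOffsets; // long-lived storage owned by the slice
};

// Hazard analogous to the commit: the declared return type holds a view, but
// std::make_tuple materializes a temporary tuple that owns a *copy* of the
// vector. Converting to the return type leaves the span pointing into that
// temporary copy, which is destroyed when the return statement finishes.
std::tuple<std::span<const int>, int> dangling(const AllocationSlice &slice) {
  return std::make_tuple(slice.subsliceOffsets, 42);
}

// Fix described above: bypass make_tuple so the view is built directly over
// the AllocationSlice's own storage, which outlives the call.
std::tuple<std::span<const int>, int> stable(const AllocationSlice &slice) {
  return {slice.subsliceOffsets, 42};
}

int main() {
  AllocationSlice s{{1, 2, 3}};
  auto [view, tag] = stable(s); // safe: view refers to s.subsliceOffsets
  return (view.size() == 3 && tag == 42) ? 0 : 1;
}
```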
…090) Reverts triton-lang/triton#9080 as it caused a tmem allocation regression due to its simplistic hoisting logic.
Enables `ttg.async_copy_global_to_local` for pipelined loads by default on `gfx950` and `gfx1250`. This increases LDS consumption because we replace one register buffer with an additional LDS buffer. After this change, the number of LDS buffers is equal to `num_stages` (previously it was `num_stages - 1`). Therefore, some test configs need to be skipped because we run out of shared memory capacity on `gfx950`.

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
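A back-of-the-envelope sketch of what the buffer-count change means for LDS usage; the tile shape and element size below are hypothetical, and only the `num_stages - 1` to `num_stages` buffer-count change comes from this description:
```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical 64x64 bf16 tile staged through LDS (8 KiB per buffer).
  const int64_t tileBytes = 64 * 64 * 2;
  for (int numStages : {2, 3, 4}) {
    int64_t before = (numStages - 1) * tileBytes; // old: num_stages - 1 buffers
    int64_t after = numStages * tileBytes;        // new: num_stages buffers
    std::printf("num_stages=%d  LDS before=%lld B  after=%lld B\n", numStages,
                static_cast<long long>(before), static_cast<long long>(after));
  }
  return 0;
}
```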
Enables an approach similar to #8752 in the AMD backend for buffer ops. This helps preserve vectorization based on kernel annotations when converting to buffer_load/store.
Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
anmyachev approved these changes on Jan 10, 2026.
This PR changes the Triton base from 28c538a to 7abb0be (Dec 23).
Pass rate: 98.28%->98.26%