Implement a PipelineBarrierBatchBuilder for batching calls to vkCmdPipelineBarrier #568

@IAmNotHanni

Is your feature request related to a problem?

In Vulkan, batching is very important: instead of calling a Vulkan function several times in a row, we can sometimes batch all arguments into one function call, if the Vulkan function takes a pointer to an array of arguments. This is the case for vkCmdPipelineBarrier:

void vkCmdPipelineBarrier(
    VkCommandBuffer                             commandBuffer,
    VkPipelineStageFlags                        srcStageMask,
    VkPipelineStageFlags                        dstStageMask,
    VkDependencyFlags                           dependencyFlags,
    uint32_t                                    memoryBarrierCount,
    const VkMemoryBarrier*                      pMemoryBarriers,
    uint32_t                                    bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier*                pBufferMemoryBarriers,
    uint32_t                                    imageMemoryBarrierCount,
    const VkImageMemoryBarrier*                 pImageMemoryBarriers);

Barrier placement is very hard to get right in Vulkan. In general, you want to keep the number of barriers as small as possible, but you also need a minimum set of barriers to ensure correctness. For optimal performance, the parameters of each barrier (stage masks, access masks, and the affected resource range) should also be as narrow as possible. If you can, you should batch barriers into one call of vkCmdPipelineBarrier, which is the core idea behind this issue.

Description

We could implement a builder pattern which abstracts collecting barriers and placing the barrier. The build method is simply one call to vkCmdPipelineBarrier with all barriers batched:

vkCmdPipelineBarrier(
    cmd,
    srcStageMask,
    dstStageMask,
    0,
    static_cast<uint32_t>(memoryBarriers.size()), memoryBarriers.data(),
    static_cast<uint32_t>(bufferBarriers.size()), bufferBarriers.data(),
    static_cast<uint32_t>(imageBarriers.size()), imageBarriers.data()
);
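
To make the collecting side of the builder more concrete, here is a minimal sketch of how it might look. Only the class name comes from this issue; the method names, the stand-in barrier structs, and the `SubmitFn` callback (which takes the place of vkCmdPipelineBarrier so the sketch compiles without the Vulkan SDK) are assumptions for illustration, not a proposed final interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in types so the sketch compiles without the Vulkan SDK. Real code
// would use VkMemoryBarrier, VkBufferMemoryBarrier and VkImageMemoryBarrier.
struct MemoryBarrier { std::uint32_t src_access; std::uint32_t dst_access; };
struct BufferBarrier { std::uint32_t src_access; std::uint32_t dst_access; std::uint64_t buffer; };
struct ImageBarrier { std::uint32_t src_access; std::uint32_t dst_access; std::uint64_t image; };

// Stand-in for vkCmdPipelineBarrier: stage masks, then count/pointer pairs.
using SubmitFn = std::function<void(std::uint32_t src_stage, std::uint32_t dst_stage,
                                    std::uint32_t memory_count, const MemoryBarrier *memory,
                                    std::uint32_t buffer_count, const BufferBarrier *buffers,
                                    std::uint32_t image_count, const ImageBarrier *images)>;

// Hypothetical interface: only the class name is taken from the issue title.
class PipelineBarrierBatchBuilder {
public:
    PipelineBarrierBatchBuilder(std::uint32_t src_stage, std::uint32_t dst_stage)
        : m_src_stage(src_stage), m_dst_stage(dst_stage) {}

    PipelineBarrierBatchBuilder &add_memory_barrier(const MemoryBarrier &barrier) {
        m_memory.push_back(barrier);
        return *this;
    }
    PipelineBarrierBatchBuilder &add_buffer_barrier(const BufferBarrier &barrier) {
        m_buffers.push_back(barrier);
        return *this;
    }
    PipelineBarrierBatchBuilder &add_image_barrier(const ImageBarrier &barrier) {
        m_images.push_back(barrier);
        return *this;
    }

    // Number of barriers collected since the last build().
    std::size_t pending() const { return m_memory.size() + m_buffers.size() + m_images.size(); }

    // Issues exactly one batched call if anything was collected, then resets.
    // Returns false (and issues no call) when the batch is empty.
    bool build(const SubmitFn &submit) {
        assert(submit && "a submit callback is required");
        if (pending() == 0) {
            return false;
        }
        submit(m_src_stage, m_dst_stage,
               static_cast<std::uint32_t>(m_memory.size()), m_memory.data(),
               static_cast<std::uint32_t>(m_buffers.size()), m_buffers.data(),
               static_cast<std::uint32_t>(m_images.size()), m_images.data());
        m_memory.clear();
        m_buffers.clear();
        m_images.clear();
        return true;
    }

private:
    std::uint32_t m_src_stage;
    std::uint32_t m_dst_stage;
    std::vector<MemoryBarrier> m_memory;
    std::vector<BufferBarrier> m_buffers;
    std::vector<ImageBarrier> m_images;
};
```

In the engine, the callback would be replaced by the direct vkCmdPipelineBarrier call shown above, with the command buffer passed to build. Returning a bool is one possible way to let the caller know whether a barrier was actually recorded.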

There are a few things we need to look out for here:

  • We would need to reorganize the rendergraph code for updates of buffers or barriers so that the barriers can be batched by type effectively. This will be discussed in another issue.
  • We will mainly need the buffer memory barriers and image memory barriers, as raw memory barriers should be avoided (depending on the exact use case).
  • Without VK_KHR_synchronization2, srcStageMask and dstStageMask apply to all barriers in the call. This is not optimal because individual barriers can need different stage masks, which means we would have to call vkCmdPipelineBarrier once for every combination of stage masks, even in cases where we could otherwise batch more tightly. With sync2, which is part of Vulkan 1.3 core, we can take a different approach:
// With VK_KHR_synchronization2, the stage masks are part of the barrier itself
typedef struct VkBufferMemoryBarrier2 {
    VkStructureType            sType;
    const void*                pNext;
    VkPipelineStageFlags2      srcStageMask;
    VkAccessFlags2             srcAccessMask;
    VkPipelineStageFlags2      dstStageMask;
    VkAccessFlags2             dstAccessMask;
    uint32_t                   srcQueueFamilyIndex;
    uint32_t                   dstQueueFamilyIndex;
    VkBuffer                   buffer;
    VkDeviceSize               offset;
    VkDeviceSize               size;
} VkBufferMemoryBarrier2;

void build(VkCommandBuffer cmd) {
    if (memoryBarriers.empty() && bufferBarriers.empty() && imageBarriers.empty())
        return;

    VkDependencyInfo depInfo{};
    depInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    depInfo.memoryBarrierCount = static_cast<uint32_t>(memoryBarriers.size());
    depInfo.pMemoryBarriers = memoryBarriers.data();
    depInfo.bufferMemoryBarrierCount = static_cast<uint32_t>(bufferBarriers.size());
    depInfo.pBufferMemoryBarriers = bufferBarriers.data();
    depInfo.imageMemoryBarrierCount = static_cast<uint32_t>(imageBarriers.size());
    depInfo.pImageMemoryBarriers = imageBarriers.data();

    // In summary, this allows for more fine-grained synchronization
    vkCmdPipelineBarrier2(cmd, &depInfo);

    // Clear after use
    memoryBarriers.clear();
    bufferBarriers.clear();
    imageBarriers.clear();
}
  • Initially, I thought we could record all pipeline barriers into one batched call to vkCmdPipelineBarrier and maybe cache this as a secondary command buffer, which could be reused. The problem is that this is almost impossible, because buffer memory barriers require the size of the buffer to be specified. This is not easy to expose as a parameter of a recorded command buffer, because a command buffer is immutable after recording.
  • There is vkCmdUpdateBuffer, but it is limited in size. The Vulkan specification says:

"The additional cost of this functionality compared to buffer to buffer copies means it should only be used for very small amounts of data, and is why it is limited to at most 65536 bytes."

Alternatives

If we don't use a PipelineBarrierBatchBuilder, and don't batch pipeline barriers at all, we might run into serious performance problems at some point. This might not be important for a small renderer, but since we want to have a scalable engine, this will be important for the future.

Affected Code

The rendergraph and wrapper code for command buffers

Operating System

All

Additional Context

Initially, I thought about introducing this in rendergraph2, but this would be too much for that pull request, which is already very big.
