Description
Is your feature request related to a problem?
In Vulkan, batching is very important: instead of calling a Vulkan function repeatedly, we can sometimes batch all arguments into a single call if the function accepts pointers to arrays of arguments. This is the case for vkCmdPipelineBarrier:
```cpp
void vkCmdPipelineBarrier(
    VkCommandBuffer                 commandBuffer,
    VkPipelineStageFlags            srcStageMask,
    VkPipelineStageFlags            dstStageMask,
    VkDependencyFlags               dependencyFlags,
    uint32_t                        memoryBarrierCount,
    const VkMemoryBarrier*          pMemoryBarriers,
    uint32_t                        bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier*    pBufferMemoryBarriers,
    uint32_t                        imageMemoryBarrierCount,
    const VkImageMemoryBarrier*     pImageMemoryBarriers);
```
Barrier placement is very hard to get right in Vulkan. In general, you want to keep the number of barriers as small as possible, but you also need a minimum number of barriers to ensure correctness. You must also keep the barrier parameters as tight as possible for optimal performance. If you can, you should batch barriers into one call to vkCmdPipelineBarrier, which is the core idea behind this issue.
Description
We could implement a builder pattern which abstracts collecting barriers and placing them. The build method is simply one call to vkCmdPipelineBarrier with all collected barriers batched:
```cpp
vkCmdPipelineBarrier(
    cmd,
    srcStageMask,
    dstStageMask,
    0,
    static_cast<uint32_t>(memoryBarriers.size()), memoryBarriers.data(),
    static_cast<uint32_t>(bufferBarriers.size()), bufferBarriers.data(),
    static_cast<uint32_t>(imageBarriers.size()), imageBarriers.data()
);
```
There are a few things we need to look out for here:
- We would need to reorganize the rendergraph code for updates of buffers or barriers so that the barriers can be batched by type effectively. This will be discussed in another issue.
- We will mainly need the buffer memory barriers and image memory barriers, as raw memory barriers should be avoided (depending on the exact use case).
- Without VK_KHR_synchronization2, the srcStageMask and dstStageMask apply to all barriers in the batch. This is not optimal because individual barriers could require different stage masks, which means we might have to call vkCmdPipelineBarrier repeatedly for every combination of stage masks, even in cases where we could batch more tightly. With sync2, which is part of Vulkan 1.3 core, we can take a different approach:
```cpp
// With VK_KHR_synchronization2, the stage masks are part of the barrier itself
typedef struct VkBufferMemoryBarrier2 {
    VkStructureType          sType;
    const void*              pNext;
    VkPipelineStageFlags2    srcStageMask;
    VkAccessFlags2           srcAccessMask;
    VkPipelineStageFlags2    dstStageMask;
    VkAccessFlags2           dstAccessMask;
    uint32_t                 srcQueueFamilyIndex;
    uint32_t                 dstQueueFamilyIndex;
    VkBuffer                 buffer;
    VkDeviceSize             offset;
    VkDeviceSize             size;
} VkBufferMemoryBarrier2;

void build(VkCommandBuffer cmd) {
    if (memoryBarriers.empty() && bufferBarriers.empty() && imageBarriers.empty())
        return;

    VkDependencyInfo depInfo{};
    depInfo.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    depInfo.memoryBarrierCount = static_cast<uint32_t>(memoryBarriers.size());
    depInfo.pMemoryBarriers = memoryBarriers.data();
    depInfo.bufferMemoryBarrierCount = static_cast<uint32_t>(bufferBarriers.size());
    depInfo.pBufferMemoryBarriers = bufferBarriers.data();
    depInfo.imageMemoryBarrierCount = static_cast<uint32_t>(imageBarriers.size());
    depInfo.pImageMemoryBarriers = imageBarriers.data();

    // In summary, this allows for more fine-grained synchronization
    vkCmdPipelineBarrier2(cmd, &depInfo);

    // Clear after use
    memoryBarriers.clear();
    bufferBarriers.clear();
    imageBarriers.clear();
}
```
- Initially, I thought we could record all pipeline barriers into one batched call to vkCmdPipelineBarrier and maybe cache this as a secondary command buffer, which could be reused. The problem is that this is almost impossible, because the buffer memory barriers require the size of the buffer to be specified, and that is not easy to expose as a parameter in a recorded command buffer: after recording, command buffers are immutable.
- There is vkCmdUpdateBuffer, but it is limited in size. From the Vulkan specification: "The additional cost of this functionality compared to buffer to buffer copies means it should only be used for very small amounts of data, and is why it is limited to at most 65536 bytes."
Alternatives
If we don't use a PipelineBarrierBatchBuilder, and if we don't batch any pipeline barriers at all, we might run into serious performance problems at some point. This might not matter for a small renderer, but since we want a scalable engine, it will be important for the future.
Affected Code
The rendergraph and wrapper code for command buffers
Operating System
All
Additional Context
Initially, I thought about introducing this in rendergraph2, but this would be too much for this pull request, which is already very big.