Hi,
I am currently migrating my project from OpenGL to Vulkan to take advantage of Vulkan's multi-threading capability, but I have a performance issue: frames cost 6~10 ms more than under OpenGL. I used Snapdragon Profiler to look at GPU usage and found that each draw call takes more time. I am stuck here.
Here is my design: I create only one queue to receive command buffers, and use 4 threads to record them. Each command buffer is recorded as below and then flushed to the queue.
1. start command
2. begin render pass
3. set viewport and scissor
4. bind pipeline and descriptorsets
5. bind vertex and index buffer
6. draw
7. end render pass
8. end command
I suspect the "begin render pass" in each draw call's command buffer is what costs extra GPU time, but I do not know how to improve it, or why performance is worse than OpenGL.
I hope someone can give me tips or suggestions.
Hi Andy...
In general, your Vulkan rendering steps look appropriate. Performance optimization generally requires figuring out where the bottlenecks are in your rendering process. Snapdragon Profiler allows you to use overrides to modify the way the GPU functions so that potential bottlenecks (vertex geometry, complex shaders, texture memory, render target resolution, etc.) can be identified.
Are you able to do any captures in Snapdragon Profiler with GPU clocks enabled to spot which draw calls are the most expensive?
Are you able to capture a trace in Snapdragon Profiler with the OpenGL ES per-draw-call and rendering-stages metrics enabled to see if the GPU is indeed always busy?
Are you rendering the same amount of geometry as in OpenGL, with similar shaders, resolution, and number of draw calls?
-mark
Hi mhfeldma,
Thanks for your reply. I spent more time exploring the possible problems and found that vkCmdBeginRenderPass costs extra rendering time;
I should group as many draw calls as possible inside one render pass block to reduce that overhead. Another problem is that although I have 4 threads and expected them to run concurrently, most of the time they execute sequentially due to CPU scheduling; CPU frequency is also a factor.
For now Snapdragon Profiler can only trace at a coarse granularity and cannot show the cost of each draw call in detail, unlike the OpenGL profiler. I hope a future version can dump per-draw-call times.
We understand your comment about coarse profiling. Vulkan and its opaque command buffers pose new challenges that we didn't have with OpenGL apps. We're investigating this shortcoming and hope to address it in a future version of Snapdragon Profiler.
Hi mhfeldma,
Thanks for your reply. I have some questions, and I am not sure if what I see is normal behavior.
A. The problem is the draw call sequence, as follows.
For example, I have 10 draw calls; each one uses the sequence below, so in this case vkCmdBeginRenderPass is called 10 times:
1. start command // it is primary command buffer
vkCmdBeginRenderPass
// use dynamic state
vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);
vkCmdSetScissor(cmdBuffer, 0, 1, &scissor);
if(hasStencil) {
vkCmdSetStencilReference(cmdBuffer, VK_STENCIL_FRONT_AND_BACK, reference);
}
vkCmdBindPipeline
vkCmdBindVertexBuffers
vkCmdBindIndexBuffer
vkCmdBindDescriptorSets
vkCmdDrawIndexed or vkCmdDraw
vkCmdEndRenderPass
end command
then finally flush the commands.
This case seems very slow compared with the sequence below:
2. start command // this is primary command buffer
vkCmdBeginRenderPass
// then draw call commands use secondary command buffer
// start command // secondary command buffer
vkCmdBindPipeline ...
vkCmdDrawIndexed or vkCmdDraw ..
// end command // secondary command buffer
vkCmdExecuteCommands // execute secondary command buffer
vkCmdEndRenderPass
end command // primary command buffer
Why is the second case faster than the first?
B. I found that with multiple threads, command buffer and descriptor set creation must be kept thread-local; otherwise it sometimes crashes (it sometimes works, but crashes unpredictably).
This design guideline is shown in an NVIDIA presentation, but the Vulkan spec does not seem to spell it out, and it is very difficult to develop without knowing it: https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/m...
C. I use a pipeline cache and save it to a file via the vkGetPipelineCacheData API, but it seems the cache file can only hold 1 MB; otherwise it crashes.
D. When I use Snapdragon Profiler to trace the GPU side (i.e., surface render time), it costs 1~2 ms more than OpenGL. I am very confused by this symptom.
The device is 1080*1920 with 18 bins of 320*384 each; I can see 18 memory loads for tile rendering.
Is Vulkan different from OpenGL when drawing on the GPU side, or do both use the same flow?
E. When the device keeps rotating, a new surface is provided for creating the swap chain, but sometimes vkAcquireNextImageKHR fails with
ret == VK_ERROR_OUT_OF_DATE_KHR || ret == VK_SUBOPTIMAL_KHR, and I need to recreate the swap chain on this error.
F. For my swap chain I use VK_PRESENT_MODE_MAILBOX_KHR, but the Vulkan samples use VK_PRESENT_MODE_FIFO_KHR.
Which one is recommended?
G. vkAcquireNextImageKHR is sometimes very slow, even though the buffer has already been dequeued in SurfaceFlinger.
But if I create the swap chain with swapchainCreateInfo.minImageCount = surfaceCapabilities.minImageCount + 1; // plus one
it returns quickly. Why is it so slow without the plus one?
Thanks for reading, and thanks in advance for your answers.
Hi Andy - It might be good to know the device, build, and Adreno driver version (a log file would tell us this) that you're running Vulkan on. If possible, make sure you're using the latest drivers/build available for your device.
We would like to understand the specific performance differences you are seeing between using secondary command buffers and not using them. Generally the performance savings from secondary command buffers come from being able to build them in parallel.
Not sure about your multi-threading question - a crash log would be helpful to see.
Also, there isn't a limit on the pipeline cache size (we've used caches much larger than 1 MB).
You should see similar binning/tiling behavior in the profiler for the same screen size, geometry load, and render buffer sizes/formats. Make sure you clear the render buffer when you begin the render pass.
Hi mhfeldma,
Thanks for your reply
I am not sure if it is a new version; my GPU is an Adreno 540, and the driver version is a79691d (dumped via the Vulkan API).
Because I do not have the Qualcomm symbols, or the source code, the crash log cannot tell me what happened.
For the multi-threading question, command buffer generation and descriptor set generation should follow the guideline in the NVIDIA presentation for multiple threads;
even when I generate command buffers under a mutex lock, it still crashes inside the driver .so.
For the pipeline cache size, perhaps my driver is not the newest?
I am also curious about one scenario: with only 10~20 very short, very simple draw calls and no lighting,
is Vulkan better than OpenGL for this simple scenario?
And some questions still seem unanswered:
A. When the device keeps rotating, a new surface is provided for creating the swap chain, but sometimes vkAcquireNextImageKHR fails with
ret == VK_ERROR_OUT_OF_DATE_KHR || ret == VK_SUBOPTIMAL_KHR, and I need to recreate the swap chain on this error.
B. For my swap chain I use VK_PRESENT_MODE_MAILBOX_KHR, but the Vulkan samples use VK_PRESENT_MODE_FIFO_KHR.
Which one is recommended?
C. vkAcquireNextImageKHR is sometimes very slow, even though the buffer has already been dequeued in SurfaceFlinger.
But if I create the swap chain with swapchainCreateInfo.minImageCount = surfaceCapabilities.minImageCount + 1; // plus one
it returns quickly. Why is it so slow without the plus one?