Forums - Migrate to Vulkan, the fps is worse than OpenGL

andy_chen
Join Date: 17 Feb 13
Posts: 5
Posted: Wed, 2017-04-12 21:42

Hi,

I am currently migrating my project from OpenGL to Vulkan to take advantage of Vulkan's multi-threading capability, but I have run into a performance issue: it costs about 6~10 ms more than OpenGL. I used Snapdragon Profiler to look at GPU usage and found that each draw call takes more time. I am stuck at this point.

Here is my design: I create only one queue to receive command buffers and use 4 threads to record them. Each command buffer is recorded as shown below and then flushed to the queue.

1. start command
2. begin render pass
3. set viewport and set scissor
4. bind pipeline and descriptor sets
5. bind vertex and index buffer
6. draw
7. end render pass
8. end command

I suspect that calling "begin render pass" for every draw call is what costs more GPU time, but I do not know how to improve this or why performance is worse than with OpenGL.

I hope someone can give me tips or suggestions.
mhfeldma Moderator
Join Date: 29 Nov 12
Posts: 310
Posted: Thu, 2017-04-13 06:52

Hi Andy...

In general, your Vulkan rendering steps seem appropriate. Performance optimization generally requires figuring out where the bottlenecks in your rendering process are. Snapdragon Profiler allows you to use overrides to modify the way the GPU functions so that potential bottlenecks (vertex geometry, complex shaders, texture memory, render target resolution, etc.) can be identified.

Are you able to do any captures in Snapdragon Profiler with GPU clocks enabled to spot which draw calls are the most expensive?

Are you able to capture a trace in Snapdragon Profiler with OpenGL ES per-draw-call and rendering stages enabled to see if the GPU is indeed always busy?

Are you rendering the same amount of geometry as in OpenGL, with similar shaders, resolution, and number of draw calls?

 

-mark
andy_chen
Join Date: 17 Feb 13
Posts: 5
Posted: Fri, 2017-04-21 02:47

Hi mhfeldma,

Thanks for your reply. I spent more time exploring the problem and found that beginRenderPass adds rendering time, so draw calls should be grouped into as few render pass blocks as possible; that reduces the per-draw-call overhead. Another problem is that I have 4 threads in total and expected them to run concurrently, but most of the time they execute sequentially because of CPU scheduling. CPU frequency is also a problem.
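The grouping idea can be sketched without any Vulkan dependency. The snippet below is only a model: `Cmd`, `recordPerDraw`, `recordGrouped`, and `countOf` are hypothetical names invented for illustration, and the recorded entries merely stand in for vkCmdBeginRenderPass, a draw, and vkCmdEndRenderPass. It shows how the number of render pass begins changes when all draws share one render pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the Vulkan commands named in the thread (not real API calls).
enum class Cmd { BeginRenderPass, Draw, EndRenderPass };

// One render pass per draw call, as in the original design.
std::vector<Cmd> recordPerDraw(std::size_t drawCount) {
    std::vector<Cmd> stream;
    for (std::size_t i = 0; i < drawCount; ++i) {
        stream.push_back(Cmd::BeginRenderPass);
        stream.push_back(Cmd::Draw);
        stream.push_back(Cmd::EndRenderPass);
    }
    return stream;
}

// All draw calls grouped inside a single render pass.
std::vector<Cmd> recordGrouped(std::size_t drawCount) {
    std::vector<Cmd> stream;
    if (drawCount == 0) return stream;  // nothing to draw, no render pass needed
    stream.push_back(Cmd::BeginRenderPass);
    for (std::size_t i = 0; i < drawCount; ++i) {
        stream.push_back(Cmd::Draw);
    }
    stream.push_back(Cmd::EndRenderPass);
    assert(stream.size() == drawCount + 2);  // one begin, one end, drawCount draws
    return stream;
}

// Counts how often one kind of command appears in a recorded stream.
std::size_t countOf(const std::vector<Cmd>& stream, Cmd wanted) {
    std::size_t n = 0;
    for (Cmd c : stream) {
        if (c == wanted) ++n;
    }
    return n;
}
```

For 10 draws, `countOf(recordPerDraw(10), Cmd::BeginRenderPass)` is 10 while the grouped version begins a render pass once, with the same 10 draws in both streams.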

As for Snapdragon Profiler, it can currently only use a trace to profile at coarse granularity; it cannot show the time each draw call costs in detail, unlike when profiling OpenGL. I hope a future version can dump per-draw-call times.
mhfeldma Moderator
Join Date: 29 Nov 12
Posts: 310
Posted: Mon, 2017-04-24 14:29

We understand your comment about coarse profiling. Vulkan and its opaque command buffers provide some new challenges that we didn't have with OpenGL apps. We're investigating this shortcoming and hope to improve Snapdragon Profiler in the future to address it.
andy_chen
Join Date: 17 Feb 13
Posts: 5
Posted: Tue, 2017-04-25 20:09

Hi mhfeldma,

Thanks for your reply. I have some problems and am not sure whether they are normal behavior.

A. The problematic draw call sequence is as follows.

For example, I have 10 draw calls, and each draw call is recorded with the sequence below, so vkCmdBeginRenderPass is called 10 times.

1. start command // this is a primary command buffer
   vkCmdBeginRenderPass
   // use dynamic state
   vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);
   vkCmdSetScissor(cmdBuffer, 0, 1, &scissor);
   if (hasStencil) {
       vkCmdSetStencilReference(cmdBuffer, VK_STENCIL_FRONT_AND_BACK, reference);
   }
   vkCmdBindPipeline
   vkCmdBindVertexBuffers
   vkCmdBindIndexBuffer
   vkCmdBindDescriptorSets
   vkCmdDrawIndexed or vkCmdDraw
   vkCmdEndRenderPass
   end command

Then, finally, flush the commands.

This case seems very slow compared with the draw call sequence below.

2. start command // this is the primary command buffer
   vkCmdBeginRenderPass
   // the draw call commands then use a secondary command buffer
   // start command // secondary command buffer
   vkCmdBindPipeline ...
   vkCmdDrawIndexed or vkCmdDraw ...
   // end command // secondary command buffer
   vkCmdExecuteCommands // execute the secondary command buffer
   vkCmdEndRenderPass
   end command // primary command buffer

Why is the second case better than the first?

 

B. I found that when using multiple threads, command buffer and descriptor set generation must use thread-local storage; otherwise it sometimes crashes (sometimes it works, but it crashes unpredictably).

This design guideline is shown in an NVIDIA presentation, but the Vulkan spec does not seem to mention it, and it is very difficult to develop without knowing it: https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/m...
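For context, the Vulkan specification lists VkCommandPool and VkDescriptorPool among the objects that require external synchronization, which is why one pool per recording thread is the usual arrangement. The following is a minimal sketch of that ownership pattern only, not of the real API: `WorkerPool`, `recordOnWorker`, and `totalRecorded` are hypothetical names, and strings stand in for recorded commands.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a VkCommandPool plus the command buffers allocated from it.
struct WorkerPool {
    std::vector<std::string> recorded;  // commands recorded by the owning thread
};

// Each worker touches only pools[workerIndex], so no lock is needed as long as
// every thread gets a distinct index and `pools` is sized before threads start.
// Draws are split round-robin across the workers.
void recordOnWorker(std::vector<WorkerPool>& pools, std::size_t workerIndex,
                    std::size_t drawCount) {
    assert(workerIndex < pools.size());
    WorkerPool& mine = pools[workerIndex];
    for (std::size_t draw = workerIndex; draw < drawCount; draw += pools.size()) {
        mine.recorded.push_back("draw " + std::to_string(draw));
    }
}

// Sums the commands recorded across all per-thread pools.
std::size_t totalRecorded(const std::vector<WorkerPool>& pools) {
    std::size_t total = 0;
    for (const WorkerPool& pool : pools) {
        total += pool.recorded.size();
    }
    return total;
}
```

In a real renderer each recording thread would call `recordOnWorker` with its own index; because no two threads share a pool, a mutex around recording is not required in this model.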

 

C. I use a pipeline cache and save it to a file, but it seems the cache file can only hold 1 MB; anything larger crashes. I retrieve the data with the vkGetPipelineCacheData API.

 

D. When I use Snapdragon Profiler to trace the GPU side (that is, the Surface render time), Vulkan costs 1~2 ms more than OpenGL. I am very confused by this symptom.

The device is 1080*1920 with 18 bins, each bin 320*384, and I can see 18 memory loads for tile rendering.

Is Vulkan different from OpenGL when drawing on the GPU side, or do both use the same flow?

 

Thanks for reading, and thanks for your answers.
andy_chen
Join Date: 17 Feb 13
Posts: 5
Posted: Tue, 2017-04-25 20:25

Hi mhfeldma,

E. When the device is rotated continuously, a new surface is provided to create the swap chain, but vkAcquireNextImageKHR sometimes returns an error:

ret == VK_ERROR_OUT_OF_DATE_KHR || ret == VK_SUBOPTIMAL_KHR

and I then need to recreate the swap chain.
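The two return codes are not equivalent in the Vulkan specification: VK_SUBOPTIMAL_KHR is a success code (an image was acquired and can still be presented), while VK_ERROR_OUT_OF_DATE_KHR means the swap chain can no longer be used and must be recreated before rendering. A small sketch of that distinction follows; `FrameAction` and `actionFor` are hypothetical names, and the numeric values are copied from the Vulkan headers so the snippet compiles without them.

```cpp
#include <cassert>
#include <cstdint>

// Numeric values copied from the Vulkan headers (VkResult).
enum AcquireResult : std::int32_t {
    kSuccess = 0,              // VK_SUCCESS
    kSuboptimal = 1000001003,  // VK_SUBOPTIMAL_KHR
    kOutOfDate = -1000001004   // VK_ERROR_OUT_OF_DATE_KHR
};

// What the frame loop could do next (hypothetical policy for illustration).
enum class FrameAction { Render, RenderThenRecreate, RecreateAndSkipFrame, Fail };

FrameAction actionFor(std::int32_t result) {
    switch (result) {
        case kSuccess:
            return FrameAction::Render;
        case kSuboptimal:
            // Image was acquired; present it, then rebuild the swap chain.
            return FrameAction::RenderThenRecreate;
        case kOutOfDate:
            // No usable image; rebuild the swap chain and try again next frame.
            return FrameAction::RecreateAndSkipFrame;
        default:
            // Other codes (timeouts, device loss, ...) are lumped together here for brevity.
            return FrameAction::Fail;
    }
}
```

Under this policy, needing to recreate the swap chain during rotation is expected behavior rather than a failure.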

 

F. For the swap chain I use VK_PRESENT_MODE_MAILBOX_KHR, but the Vulkan sample uses VK_PRESENT_MODE_FIFO_KHR. Which is recommended?
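The thread does not settle which mode to prefer, but one fact from the Vulkan specification is relevant: VK_PRESENT_MODE_FIFO_KHR is the only present mode every implementation must support, so any other choice needs a fallback. The sketch below shows such a fallback under that assumption; `PresentMode` and `choosePresentMode` are hypothetical names, with enum values copied from the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Numeric values copied from the Vulkan headers (VkPresentModeKHR).
enum PresentMode : std::int32_t {
    kImmediate = 0,   // VK_PRESENT_MODE_IMMEDIATE_KHR
    kMailbox = 1,     // VK_PRESENT_MODE_MAILBOX_KHR
    kFifo = 2,        // VK_PRESENT_MODE_FIFO_KHR
    kFifoRelaxed = 3  // VK_PRESENT_MODE_FIFO_RELAXED_KHR
};

// Returns `preferred` if the surface reports it, otherwise falls back to FIFO,
// which is the one mode guaranteed to be available.
PresentMode choosePresentMode(const std::vector<PresentMode>& supported,
                              PresentMode preferred) {
    assert(preferred >= kImmediate && preferred <= kFifoRelaxed);
    for (PresentMode mode : supported) {
        if (mode == preferred) return preferred;
    }
    return kFifo;
}
```

In a real application `supported` would come from vkGetPhysicalDeviceSurfacePresentModesKHR. Roughly, MAILBOX replaces a queued image with the newest one, while FIFO queues images and presents them in order at each vertical blank.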

 

G. Calling vkAcquireNextImageKHR is sometimes very slow, even though the buffer has already been dequeued in SurfaceFlinger. But when I create the swap chain with

swapchainCreateInfo.minImageCount = surfaceCapabilities.minImageCount + 1; // plus one

it returns quickly. Why is it so slow without the plus one?
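A plausible explanation, not confirmed in this thread, is that with only minImageCount images the application has to wait until the presentation engine releases one, and the extra image gives it a spare to acquire. When adding one, the result still has to respect surfaceCapabilities.maxImageCount, where a value of 0 means there is no upper limit. A minimal sketch, with `chooseImageCount` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Requests one image more than the minimum, clamped to the surface maximum.
// maxImageCount == 0 means the surface reports no upper limit.
std::uint32_t chooseImageCount(std::uint32_t minImageCount,
                               std::uint32_t maxImageCount) {
    assert(maxImageCount == 0 || maxImageCount >= minImageCount);
    std::uint32_t wanted = minImageCount + 1;
    if (maxImageCount != 0 && wanted > maxImageCount) {
        wanted = maxImageCount;
    }
    return wanted;
}
```

The returned value would be assigned to swapchainCreateInfo.minImageCount in place of the unclamped expression.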

 

 

Thanks for reading, and thanks for your answers.
mhfeldma Moderator
Join Date: 29 Nov 12
Posts: 310
Posted: Wed, 2017-04-26 14:56

Hi Andy - It would be good to know the device, build, and Adreno driver version (the log file would show this) that you're running Vulkan on. If possible, make sure you're using the latest drivers/build available for your device.

We would like to understand the specific performance differences you are seeing between using secondary command buffers and not using them. Generally, the performance savings from secondary command buffers come from being able to build them in parallel.

Not sure about your multi-thread question - a crash log would be helpful to see.

Also, there isn't a limit on the pipeline cache size (we've used caches much larger than 1 MB).
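This thread does not establish what caused the crash, but one generic way a size-dependent crash can arise is a fixed-size buffer. vkGetPipelineCacheData supports a two-call pattern that avoids it: call once with a null data pointer to learn the size, allocate, then call again. The sketch below models only that pattern; `BlobQuery` and `fetchBlob` are hypothetical names, and the callback mimics the size/data parameters of the real function rather than calling it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Same shape as the size/data parameters of vkGetPipelineCacheData: when `data`
// is null the callee writes the required size to `*size`; otherwise it fills
// `data` and writes the number of bytes actually written to `*size`.
using BlobQuery = std::function<bool(std::size_t* size, void* data)>;

std::vector<unsigned char> fetchBlob(const BlobQuery& query) {
    assert(query && "query must be callable");
    std::size_t size = 0;
    if (!query(&size, nullptr)) return {};   // first call: ask for the size only
    std::vector<unsigned char> blob(size);   // heap storage sized by the answer
    if (!query(&size, blob.data())) return {};  // second call: fetch the bytes
    blob.resize(size);  // the second call may report fewer bytes written
    return blob;
}
```

In real code the callback would wrap vkGetPipelineCacheData and map its VkResult to the boolean, and the returned vector would then be written to the cache file in binary mode.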

You should see similar binning/tiling behavior in the profiler for the same screen size, geometry load, and render buffer sizes/formats. Make sure you clear the render buffer when you begin the render pass.
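The advice to clear at the start of the render pass can be illustrated with a deliberately simplified model of a tile-based GPU, in which every render pass visits each bin once and only a "load" operation forces the existing attachment contents to be read from memory. Real drivers are more complex; `LoadOp` and `tileLoads` are hypothetical names, and the 18 bins come from the numbers quoted earlier in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the idea of VK_ATTACHMENT_LOAD_OP_LOAD / _CLEAR / _DONT_CARE.
enum class LoadOp { Load, Clear, DontCare };

// Simplified model: each render pass visits every bin once, and only
// LoadOp::Load requires reading the attachment back from memory per bin.
std::uint32_t tileLoads(std::uint32_t bins, std::uint32_t renderPasses, LoadOp op) {
    assert(bins > 0);
    return op == LoadOp::Load ? bins * renderPasses : 0;
}
```

In this model one render pass over 18 bins with a load operation costs 18 memory loads, ten such render passes cost 180, and clearing instead of loading removes them.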

 

andy_chen
Join Date: 17 Feb 13
Posts: 5
Posted: Wed, 2017-04-26 20:15

Hi mhfeldma,

Thanks for your reply.

I am not sure whether this is a new version: my Adreno is a 540 and the driver version is a79691d (dumped through the Vulkan API).

Because I do not have the Qualcomm symbol .so files, let alone the source code, I cannot tell from the crash log what happened.

 

For the multi-thread question: command buffer generation and descriptor set generation have to follow the guideline in the NVIDIA presentation for multiple threads. Even when I generate command buffers under a mutex lock, it still crashes in the driver .so.

 

As for the pipeline cache size, perhaps my driver is not the newest?

 

I am curious about one scenario: for example, only 10~20 very short draw calls, each very simple and with no lighting. Is Vulkan better than OpenGL for such a simple scenario?

 

Some of my questions also seem to be unanswered:

A. When the device is rotated continuously, a new surface is provided to create the swap chain, but vkAcquireNextImageKHR sometimes returns an error:

ret == VK_ERROR_OUT_OF_DATE_KHR || ret == VK_SUBOPTIMAL_KHR

and I then need to recreate the swap chain.

B. For the swap chain I use VK_PRESENT_MODE_MAILBOX_KHR, but the Vulkan sample uses VK_PRESENT_MODE_FIFO_KHR. Which is recommended?

C. Calling vkAcquireNextImageKHR is sometimes very slow, even though the buffer has already been dequeued in SurfaceFlinger. But when I create the swap chain with

swapchainCreateInfo.minImageCount = surfaceCapabilities.minImageCount + 1; // plus one

it returns quickly. Why is it so slow without the plus one?
