Forums - Poor depth test performance on Adreno 330

AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Mon, 2015-06-01 06:58

I'm working on a front-to-back renderer for Construct 2, a major 2D HTML5 game engine with a WebGL renderer (basically OpenGL ES 2). Until now we have rendered everything back-to-front, but this causes performance problems in games with a lot of overdraw, since they run into fillrate limits. To try to solve this, our front-to-back renderer works like this:

  1. Switch to an orthographic projection
  2. Create a depth buffer and enable depth test
  3. Start early Z pass: enable depth writes, disable color writes
  4. Render the scene front-to-back with a special shader that discards non-opaque fragments, thereby only filling the depth buffer where textures are opaque
  5. Switch back to back-to-front rendering: disable depth writes, enable color writes, but leave depth test enabled
  6. Render the scene back-to-front normally, so all background blending works as expected, but opaque areas are not filled in redundantly.
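
The steps above can be sketched in WebGL roughly as follows. This is only a minimal outline, not the engine's actual code: `renderTwoPassFrame`, `drawFrontToBack` and `drawBackToFront` are hypothetical names, and the use of `LEQUAL` as the depth function is an assumption (with the default `LESS`, the color pass would reject fragments at exactly the depth written in the first pass).

```javascript
// Minimal sketch of the two-pass frame. `gl` is a WebGLRenderingContext.
// drawFrontToBack / drawBackToFront are hypothetical callbacks that issue
// the actual draw calls for each pass.
function renderTwoPassFrame(gl, drawFrontToBack, drawBackToFront) {
    gl.enable(gl.DEPTH_TEST);
    // Assumption: LEQUAL, so a sprite still passes against its own depth
    // value when it is drawn again in the color pass.
    gl.depthFunc(gl.LEQUAL);

    // Clears honor the write masks, so enable both before clearing.
    gl.depthMask(true);
    gl.colorMask(true, true, true, true);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);

    // Early Z pass: depth writes on, color writes off, front-to-back.
    gl.colorMask(false, false, false, false);
    drawFrontToBack();

    // Color pass: depth writes off, color writes on, depth test still on.
    gl.depthMask(false);
    gl.colorMask(true, true, true, true);
    drawBackToFront();
}
```

The orthographic projection and the depth values assigned to each sprite are left to the two callbacks.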

 

Some details of our implementation which may affect performance: we render everything to a texture first, and then at the end of the scene copy that to the backbuffer. There are two reasons for this: 1) background-sampling fragment shaders can only sample from a texture, not the backbuffer, so this is essential to be able to sample the background, and 2) sometimes during color rendering we need to render some layers to their own texture, and we want those render-to-textures to share the same depth buffer (which the default depth buffer attached to the backbuffer can't do).
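
For illustration, sharing one depth buffer between several render-to-texture targets could look something like the sketch below. The function names are hypothetical and the `DEPTH_COMPONENT16` format is an assumption; the post does not say which depth format is used. Note that in WebGL 1 all attachments of a framebuffer must have the same dimensions, so every color texture using the shared depth buffer has to match its size.

```javascript
// Creates one depth renderbuffer that several framebuffers can attach.
// Assumption: a 16-bit depth format is sufficient for the scene.
function createSharedDepthBuffer(gl, width, height) {
    const depth = gl.createRenderbuffer();
    gl.bindRenderbuffer(gl.RENDERBUFFER, depth);
    gl.renderbufferStorage(gl.RENDERBUFFER, gl.DEPTH_COMPONENT16, width, height);
    gl.bindRenderbuffer(gl.RENDERBUFFER, null);
    return depth;
}

// Builds a framebuffer that renders color into `colorTexture` and depth into
// the shared renderbuffer. Returns null if the framebuffer is incomplete.
function createRenderTarget(gl, colorTexture, sharedDepth) {
    const fbo = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, colorTexture, 0);
    gl.framebufferRenderbuffer(gl.FRAMEBUFFER, gl.DEPTH_ATTACHMENT,
                               gl.RENDERBUFFER, sharedDepth);
    const complete =
        gl.checkFramebufferStatus(gl.FRAMEBUFFER) === gl.FRAMEBUFFER_COMPLETE;
    gl.bindFramebuffer(gl.FRAMEBUFFER, null);
    return complete ? fbo : null;
}
```

Each layer texture would then get its own framebuffer via `createRenderTarget`, all passing the same `sharedDepth` object.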

Also our early Z pass shader is simply this:

varying mediump vec2 vTex;
uniform lowp sampler2D samplerFront;
 
void main(void) {
    if (texture2D(samplerFront, vTex).a < 1.0)
        discard; // discarding non-opaque fragments
    
    // color writes are disabled but write an opaque value to prevent undefined behavior
    gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);
}
 
I made a performance test to compare the performance difference between the early Z pass and just the normal back-to-front overdrawing renderer. Since these run in WebGL you can test them here:
 
https://www.scirra.com/labs/fillperf/ (runs back-to-front with overdraw)
https://www.scirra.com/labs/fillperf-earlyz/ (runs front-to-back and depth tests to prevent overdraw)
 
This test simply draws an opaque texture at a very large size, and each time you touch the screen it creates more sprites, all overlapping each other, so it stresses the fillrate as much as possible. In theory the early Z version should be significantly faster, since it can avoid almost all the overdraw with the depth test. To compare results, I count how many sprites I can create while still getting 30 FPS.
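
The post does not say how the frame rate is measured, so the following is only one possible way to check the 30 FPS threshold: average the frame rate over a window of frame timestamps (for example those passed to `requestAnimationFrame` callbacks). `averageFps` and `sustainsFps` are hypothetical helper names.

```javascript
// Average frames per second over a list of frame timestamps in milliseconds.
// N timestamps span N - 1 frame intervals.
function averageFps(timestampsMs) {
    if (timestampsMs.length < 2) return 0;
    const elapsedMs = timestampsMs[timestampsMs.length - 1] - timestampsMs[0];
    if (elapsedMs <= 0) return 0;
    return ((timestampsMs.length - 1) * 1000) / elapsedMs;
}

// True if the sampled window meets or exceeds the target frame rate.
function sustainsFps(timestampsMs, targetFps) {
    return averageFps(timestampsMs) >= targetFps;
}
```

The sprite count for a device would then be the largest count for which `sustainsFps(samples, 30)` still holds.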
 
I get significant performance benefits on every other platform, but a performance DROP on the Adreno 330! I'm testing with a 2nd gen Moto X. My results look like this:
 
Windows 8.1 / nVidia GeForce GTX 660 / Chrome
fillperf: 274
fillperf-earlyz: 3772 (13.7x better!)
 
Nexus 9 / nVidia Tegra K1 / Chrome
fillperf: 44
fillperf-earlyz: 260 (5.9x better!)
 
iPad Air 2 / A8X / Safari
fillperf: 108
fillperf-earlyz: 207 (1.9x better!)
 
iPad 2 / A5 / Safari
fillperf: 39
fillperf-earlyz: 138 (3.5x better!)
 
Moto X (2nd gen) / Adreno 330 / Chrome
fillperf: 124
fillperf-earlyz: 71 (43% worse!!)
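
For reference, the comparisons in brackets are just the ratio of the two sprite counts. A small helper along these lines reproduces them from the figures above; `relativeChange` and `fillperfResults` are names made up for this sketch.

```javascript
// Ratio and percentage change of the early Z sprite count versus the baseline.
function relativeChange(baselineSprites, earlyZSprites) {
    const ratio = earlyZSprites / baselineSprites;
    return { ratio: ratio, percentChange: (ratio - 1) * 100 };
}

// Sprite counts at 30 FPS, copied from the results listed above.
const fillperfResults = [
    { device: 'GeForce GTX 660', baseline: 274, earlyZ: 3772 },
    { device: 'Tegra K1', baseline: 44, earlyZ: 260 },
    { device: 'A8X', baseline: 108, earlyZ: 207 },
    { device: 'A5', baseline: 39, earlyZ: 138 },
    { device: 'Adreno 330', baseline: 124, earlyZ: 71 },
];

// Returns one summary line per device, e.g. for logging.
function summarizeResults(results) {
    return results.map((r) => {
        const change = relativeChange(r.baseline, r.earlyZ);
        return r.device + ': ' + change.ratio.toFixed(2) + 'x ('
            + change.percentChange.toFixed(1) + '%)';
    });
}
```

A ratio below 1 (a negative percentage) means the early Z version handled fewer sprites than the baseline, which is the Adreno 330 case.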
 
I have tried all sorts of permutations of the code and read through the performance guides, trying techniques like adding clears after every BindFramebuffer. We already have a well-optimised batching engine that should submit buffer data only once per frame (even with both passes) and batches multiple draws of the same texture into a single drawElements call. Nothing seems to make any difference at all: every time I test it, the Adreno is still significantly slower. I have two theories:
 
1. My depth test code is incorrect, so the Adreno is still doing lots of overdraw, and performance drops further because of the additional burden of filling the depth buffer and running depth tests. However it works on every other device with big performance gains, so I think the code is correct at least in principle.
 
2. I'm somehow accidentally hitting a slow path in the driver, and some horrible pattern of shadow copies or something like that is hammering performance. I guess this is happening in the early Z pass, since the back-to-front color filling pass is basically identical except depth test is enabled.
 
However I am at a loss as to how to figure out what is really happening in the driver. Does anyone have any ideas? Can anyone investigate or is there anything I can do to find the performance problem?
 
The performance gains are really nice on other devices so we really want to release this feature, but it would be a shame if it tanks Adreno performance and we don't know why.

 

AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Mon, 2015-06-01 07:36
So I followed my suspicion about the early Z pass shader and I think I was right:
 
Normal back-to-front with overdraw: 124 sprites
Early Z, conditional discard (the one I posted): 71 sprites
Early Z, conditional only: 127 sprites
Early Z, always discard: 130 sprites
Early Z, empty shader: 128 sprites
 
Even an empty early Z shader does not appear to improve performance at all. This ought to just fill entire quads in the depth buffer ignoring any alpha and consequently save a lot of overdraw, but it is still no faster than a normal overdrawing back-to-front render. Clearly a conditional discard has a performance impact, but I don't know any better way to fill only opaque texture areas to the depth buffer. Is there a better way? Finally the depth test seems to fail completely for some reason - even the empty shader with no discard does not appear to prevent overdraw - or the performance bottleneck is simply somewhere else. Can anyone help?
AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Mon, 2015-06-01 07:43

Oops, I posted the wrong numbers. It doesn't change much, but they should have been:

Normal back-to-front with overdraw: 124 sprites
Early Z, conditional discard (the one I posted): 71 sprites
Early Z, conditional only: 105 sprites
Early Z, always discard: 56 sprites
Early Z, empty shader: 106 sprites
 
So everything is just slower to varying degrees; nothing speeds it up...
AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Mon, 2015-06-01 08:06

Hey, does the Adreno do equivalent overdraw detection in hardware? I just realised that would explain why I'm not getting any performance boost! The early Z pass would just be extra baggage, especially since it does conditional discards. The Nexus 9 only gets 44 sprites with back-to-front, so maybe the Moto X is already detecting this.

If that is the case, is there any way to detect GPUs which do this so I know to disable this pass?
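
As far as I know, WebGL has no query that reports whether the hardware removes overdraw by itself. One workaround is to identify the GPU by name through the `WEBGL_debug_renderer_info` extension and skip the pass on GPUs where your own measurements showed it is slower. The sketch below assumes that approach; `getUnmaskedRenderer`, `shouldSkipEarlyZPass` and the pattern list are hypothetical, and the extension may be unavailable, in which case the code falls back to keeping the pass enabled.

```javascript
// Returns the unmasked renderer string (e.g. a GPU model name), or null if
// the WEBGL_debug_renderer_info extension is not exposed by the browser.
function getUnmaskedRenderer(gl) {
    const ext = gl.getExtension('WEBGL_debug_renderer_info');
    if (!ext) return null;
    return gl.getParameter(ext.UNMASKED_RENDERER_WEBGL);
}

// `slowPatterns` is a hypothetical list of regular expressions built from
// your own benchmark results; it is not a property of the hardware.
function shouldSkipEarlyZPass(gl, slowPatterns) {
    const renderer = getUnmaskedRenderer(gl);
    if (!renderer) return false; // unknown GPU: keep the early Z pass
    return slowPatterns.some((pattern) => pattern.test(renderer));
}
```

For example, `shouldSkipEarlyZPass(gl, [/Adreno.*330/i])` would disable the pass only on renderers whose name matches that pattern.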

mhfeldma Moderator
Join Date: 29 Nov 12
Posts: 310
Posted: Mon, 2015-06-01 08:33

Ashley - Adreno hardware does implement early Z rejection, so as long as rendering is performed in front-to-back order, this optimization will occur. Also, the discard in your shader has performance implications and should be avoided in any case.

 

 

AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Mon, 2015-06-01 11:40

@mhfeldma - I believe I am already doing exactly that, the first pass is front-to-back and the second is back-to-front (but still depth testing) for correct background blending. Do you know why the example I posted is still not faster though?

Also, are there any alternatives that avoid the performance implications of discard? I can't think of any other way to fill just the opaque parts of textures into a depth buffer other than alpha testing, which is not supported in OpenGL ES 2. Also, even if it has a cost, ideally the overdraw savings would outweigh it so it would still be beneficial...

mhfeldma Moderator
Join Date: 29 Nov 12
Posts: 310
Posted: Mon, 2015-06-01 13:18

Many games organize their draw calls by material, so it is possible to render just the opaque draws first and then render the transparent ones afterwards. Perhaps using depth testing and setting gl_FragDepth accordingly could avoid the discard.
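
As a rough illustration of that ordering, the sketch below splits sprites into fully opaque and blended groups, draws the opaque ones nearest-first and the blended ones farthest-first. The `opaque` flag and the convention that a smaller `z` is nearer the camera are assumptions made for this example; it only helps when whole sprites can be classified as opaque.

```javascript
// Returns sprites in draw order: opaque sprites front-to-back (so the depth
// test can reject hidden fragments), then blended sprites back-to-front
// (so blending composites correctly).
// Assumptions: each sprite has a boolean `opaque` and a numeric `z`,
// where a smaller z is nearer the camera.
function buildDrawOrder(sprites) {
    const opaque = sprites
        .filter((s) => s.opaque)
        .sort((a, b) => a.z - b.z); // nearest first
    const blended = sprites
        .filter((s) => !s.opaque)
        .sort((a, b) => b.z - a.z); // farthest first
    return opaque.concat(blended);
}
```

The opaque group would be drawn with depth writes enabled and the blended group with depth writes disabled but the depth test still on.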

You might check our Adreno developer guide for more details on our early Z culling feature.

 https://developer.qualcomm.com/mobile-development/maximize-hardware/mobile-gaming-graphics-adreno/tools-and-resources

 

AshleyScirra
Join Date: 1 Jun 15
Posts: 6
Posted: Tue, 2015-06-02 10:18

@mhfeldma - you're just describing what I've already done. I've read the guide too, and although the guide says "Adreno 3xx can reject occluded pixels at up to four times the drawn pixel fill rate", I see no performance improvement at all even when other devices are much faster. Can you help identify why early Z rejection is not improving performance in my given examples?
