Frequently Asked Questions

General

What is the optimal way to sort objects? Is front-to-back object submission needed or, given the tiling architecture, is that not necessary?

With LRZ and other binning optimizations included in A5X and newer Adreno GPUs, sorting has less impact on performance. That said, sorting front-to-back when possible is still recommended to maintain the best performance on all Adreno platforms.
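As an illustration of front-to-back submission, the minimal sketch below orders opaque draws nearest-first before they are issued. The Draw type, its view_depth field, and the sample values are hypothetical and not part of any Adreno or graphics API.

```python
from dataclasses import dataclass

@dataclass
class Draw:
    name: str
    view_depth: float  # hypothetical: distance from the camera along the view direction

def sort_front_to_back(draws):
    """Return opaque draws ordered nearest-first so early depth rejection can help."""
    return sorted(draws, key=lambda d: d.view_depth)

draws = [Draw("terrain", 40.0), Draw("player", 2.5), Draw("building", 12.0)]
ordered = [d.name for d in sort_front_to_back(draws)]
```

Translucent geometry usually has its own ordering requirements, so this sketch only applies to opaque draws.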

Are there “no-copy” paths available for other Snapdragon hardware blocks?

Paths are available for the CPU <-> GPU <-> DSP through Android Native Hardware Buffers (https://developer.android.com/ndk/reference/group/a-hardware-buffer).

What is a good TEX to ALU ratio?

On A5X GPUs, one (1) texture per fragment at full rate is equivalent to 16 full precision ALUs.
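One way to apply this ratio is to compare a shader's ALU-to-texture ratio against 16. The sketch below is a simplified model under that assumption: the function name and the instruction counts are hypothetical, and it assumes texture fetches and ALU work overlap perfectly, which real shaders will not do.

```python
ALU_PER_TEX = 16  # A5X figure quoted above: one full-rate texture fetch ~ 16 full-precision ALU ops

def likely_bottleneck(alu_ops, tex_fetches):
    """Classify a fragment shader by its ALU:TEX ratio (simplified model)."""
    if tex_fetches == 0:
        return "alu"
    ratio = alu_ops / tex_fetches
    if ratio < ALU_PER_TEX:
        return "tex"       # fewer than 16 ALU ops per fetch: texture pipe dominates
    if ratio > ALU_PER_TEX:
        return "alu"       # more than 16 ALU ops per fetch: ALU dominates
    return "balanced"
```

For example, a shader with 32 ALU operations and 4 texture fetches has a ratio of 8 and would be classified as texture-bound by this model.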

What is the Triangle setup rate?

One (1) prim/clock.

Which is better: vertex stream or attribute fetching in VS?

Vertex stream is preferred. Adreno GPUs have specialized hardware for fetching and decoding vertex streams.

Is dedicated video memory possible?

There is no dedicated video memory which the application developer can control. The graphics driver will allocate sections of system memory that it owns and manages.

How many Occlusion Queries should I use?

No more than 512 should be active. There is usually a three (3) frame delay for results.
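The two numbers above can be modeled on the application side. The sketch below is a hypothetical bookkeeping class, not a driver or graphics API: it caps the number of in-flight queries at 512 and only hands back results once three frames have passed. It assumes frame numbers only increase.

```python
from collections import deque

MAX_ACTIVE_QUERIES = 512   # limit stated above
RESULT_DELAY_FRAMES = 3    # typical delay stated above

class QueryScheduler:
    """Hypothetical application-side tracker for in-flight occlusion queries."""

    def __init__(self):
        self.pending = deque()  # (frame_issued, object_id), oldest first

    def active_count(self):
        return len(self.pending)

    def issue(self, frame, object_id):
        """Record a new query; refuse it if the active limit is reached."""
        if len(self.pending) >= MAX_ACTIVE_QUERIES:
            return False
        self.pending.append((frame, object_id))
        return True

    def collect(self, frame):
        """Return object ids whose results are assumed ready at this frame."""
        ready = []
        while self.pending and frame - self.pending[0][0] >= RESULT_DELAY_FRAMES:
            ready.append(self.pending.popleft()[1])
        return ready
```

An application would still need to read the real query result from the graphics API; this only decides when it is reasonable to ask for it.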

What is the Occlusion Query performance?

The performance of queries is coupled with the number of bins. The higher the bin count (from resolution, MSAA, etc.) the more expensive the queries will be.

The recommended usage of Occlusion queries in Adreno GPUs is to run them in direct mode whenever possible. One way to ensure this occurs is to issue all the queries for a frame in one batch after a flush, e.g., Render Opaque -> Render Translucent -> Flush -> Render Queries -> Switch FBO.

The driver has a heuristic which understands that only queries have been issued to the surface and switches into Direct mode. The overhead of queries will show up as a higher “% CP Busy” metric in Snapdragon Profiler.

In some test cases that have issued many queries to a binned surface, the CP overhead might jump to 20-40%, and drop to 4-6% in Direct mode.

How are timer queries calculated in Adreno GPUs?

Timer queries are calculated over the entire set of tiles and binning. For example, let’s assume that we have 50 draw calls and a render target with resolution that requires 8 tiles to render. Let’s also assume we want to measure draw call 10 and instrument it with timer queries.

The entire command stream of 50 draws will be captured and run through the binning process to generate the visibility streams (see Tile-based rendering). During the rendering pass, the draw calls will be rendered according to the visibility stream of each tile. Even if the geometry for draw call 10 only contributes to one tile, it will incur a small overhead on each tile (while processing the visibility stream). This overhead and the actual rendering time will be accumulated and presented in the resulting timer query.

Note

The overhead mentioned above is small (2-5µs) but can add up if the draw call count is high and draws are present in many tiles. Starting with A5X, GPU optimizations to the visibility stream have been added to reduce this overhead by “trimming” the end of the stream of draw calls that do not contribute to the tile. This optimization can be nullified if something like a full screen pass is issued as the last draw call to a render target.
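The example above can be turned into a quick estimate. The sketch below assumes the 2-5µs figure applies once per tile whose visibility stream still contains the draw. The function names and the 30µs render time used in the usage note are hypothetical.

```python
OVERHEAD_PER_TILE_US = (2.0, 5.0)  # range quoted in the note above

def timer_query_overhead_us(tiles):
    """Lower and upper bound of visibility-stream overhead across all tiles."""
    low, high = OVERHEAD_PER_TILE_US
    return (low * tiles, high * tiles)

def reported_time_us(render_us, tiles):
    """Range a timer query might report: actual render time plus per-tile overhead."""
    low, high = timer_query_overhead_us(tiles)
    return (render_us + low, render_us + high)
```

With the 8 tiles from the example, the overhead alone is 16-40µs, so a draw that actually takes 30µs to render could be reported as 46-70µs.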

What is the performance of user clip planes?

Poor. Triangle setup drops from one (1) prim/clock to 50 cycles for any primitive that must be clipped, and clipping also stalls the entire pipe.
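To get a feel for the numbers, the sketch below estimates setup cycles for a batch of primitives using only the two figures quoted above. The function name and primitive counts are hypothetical, and the pipe stall is not modeled, so the real cost is likely higher.

```python
UNCLIPPED_PRIM_CYCLES = 1   # one prim/clock setup rate quoted above
CLIPPED_PRIM_CYCLES = 50    # cost quoted above for a primitive that must be clipped

def setup_cycles(total_prims, clipped_prims):
    """Estimate triangle setup cycles, ignoring the pipeline stall."""
    unclipped = total_prims - clipped_prims
    return unclipped * UNCLIPPED_PRIM_CYCLES + clipped_prims * CLIPPED_PRIM_CYCLES
```

Under this model, clipping just 500 of 10,000 primitives raises setup cost from 10,000 to 34,500 cycles.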

Which has better performance, alpha test or alpha blend?

From a throughput point of view, they are the same. For a single sample they are both conservatively rejected. However, due to bandwidth limitations, once MSAA is enabled they are no longer rejected early and only get a late Z test.

Best single pass stereo? Texture array? Need GS?

GL_OVR_multiview. The driver will capture the commands for you and replay them. There is no GPU benefit; this saves CPU time.

What is the behavior on dynamic branching in fragment shaders?

The wave will stall until every thread in it is ready to progress. Once every thread is ready, both branches are taken. The result for each thread is then selected using the mask generated from the branch condition.
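The behavior described above can be modeled on the CPU. The sketch below is an illustration only, with a hypothetical function name: it evaluates both sides of a branch for every thread in a wave and then picks each thread's result using the condition mask.

```python
def branch_as_select(values, cond, then_fn, else_fn):
    """Model a divergent branch: run both sides for all threads, then select by mask."""
    mask = [cond(v) for v in values]
    then_results = [then_fn(v) for v in values]  # executed for every thread
    else_results = [else_fn(v) for v in values]  # also executed for every thread
    return [t if m else e for m, t, e in zip(mask, then_results, else_results)]

wave = [1, 2, 3, 4]
result = branch_as_select(wave, lambda v: v % 2 == 0, lambda v: v * 10, lambda v: -v)
```

The practical consequence is that the cost of a divergent branch is closer to the sum of both sides than to the cost of the side a given thread needed.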

LRZ

For more information about LRZ, refer to Low Resolution Z pass. LRZ is available on A5X and above.

What causes LRZ to be disabled?

Writing depth in fragment shader
Any condition where direct rendering is required
Use of secondary command buffers (Vulkan) (Snapdragon 865 and newer will not disable LRZ based on this criterion)

Can LRZ buffers and visibility streams be stored and reused?

LRZ buffers cannot be explicitly created or exported through Vulkan, DirectX, or OpenGL ES.

What is the effect on LRZ if discard/clip is used?

There will not be any effect in LRZ. It will go to full-resolution Z in all situations.

Textures and formats

What depth buffer format provides the best performance?

When possible, use D16. If more precision is needed and a stencil is not used, it is recommended to use D32, because D24 and D24_S8 take the same space as D32 in GMEM with less precision. Otherwise, D24_S8 is recommended as well. All these formats are supported by UBWC.
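The guidance above can be read as a small decision rule. The sketch below is one interpretation of it, with a hypothetical function name and plain strings standing in for the API's format enums. It assumes D24_S8 is the choice whenever a stencil is needed.

```python
def choose_depth_format(needs_stencil, needs_more_than_16_bits):
    """Pick a depth format following one reading of the guidance above."""
    if needs_stencil:
        return "D24_S8"   # stencil required
    if needs_more_than_16_bits:
        return "D32"      # same GMEM footprint as D24_S8, more precision
    return "D16"          # preferred whenever precision allows
```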

At what cache levels do textures remain compressed or decompressed?

ASTC is compressed in L2 and decompressed in L1. ETC formats stay compressed in L1.

Is float or half float more performant for 8-bit texture lookup?

After filtering, the sample instruction returns a 16-bit per-component value for an 8-bit texture. If you assign the result of your texture lookup to a highp vector, it will be at full precision. However, what comes out of the texture pipe after filtering is still 16-bit.

Note that it does not cost anything to convert from half to float.

What is the performance of 1010102 vs. 111110 formats?

Both formats will perform better than FP16.

There is a hardware “fast path” for 1010102 which will allow it to perform slightly better than 111110.

Tiling architecture

Do you have any details of how the tiling and binning process works?

For a high-level overview of tiling in Adreno, please refer to Tile-based rendering.

What conditions trigger direct rendering with FlexRender?

There are some minor variations across GPU hardware revisions and driver versions, but the following are common:

Use of tessellation or geometry shaders
Small number of vertices and/or draws
High ratio of texture samples in vertex shaders to vertices

Note

Snapdragon Profiler ‘Rendering Stages’ metric in Trace capture can display per-surface information, which includes the rendering mode used for the given surface.

Is the full Vertex Shader used when performing binning?

In the binning process, a specialized shader is used. The compiler generates it from the original shader, and it uses only the portions that touch position-related data. The full vertex shader is executed later in the render pass.

Is binning affected by fragment Z occlusion?

Binning will not be affected by full-resolution Z occlusion. Starting with A5X GPUs, an LRZ pass happens in binning and can discard tile-wide contributions.

What is the CPU cost of binning?

The CPU cost can be considered negligible. The binning process is performed on the GPU, and the visibility streams it generates (which dictate which draw calls affect which bins) are placed in system memory to be consumed by the GPU in the render pass.

If a primitive spans multiple tiles, will the GPU insert synthetic vertices at tile boundaries?

The full primitive will be rasterized per tile (no added verts at tile bounds).

If lots of these cases exist (perhaps in ground level rendering, etc.), it is recommended to decimate the object.
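To see why large primitives are costly, the sketch below lists the bins a primitive's screen-space bounding box overlaps, since the primitive is processed once in each of them. This is a conservative bounding-box model with a hypothetical function name and bin size. The actual bin dimensions and the hardware's overlap test are not described here.

```python
def bins_touched(bbox, bin_w, bin_h):
    """Return (bin_x, bin_y) for every bin overlapped by an inclusive pixel bounding box."""
    x0, y0, x1, y1 = bbox
    return [
        (bx, by)
        for by in range(y0 // bin_h, y1 // bin_h + 1)
        for bx in range(x0 // bin_w, x1 // bin_w + 1)
    ]

# A wide, thin primitive crossing one vertical bin boundary (256x256 bins assumed).
touched = bins_touched((100, 50, 300, 60), 256, 256)
```

A ground plane whose bounding box covers the whole render target would land in every bin, which is the case the recommendation above is aimed at.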

Vulkan

Is there a performance impact of using Vulkan Secondary Command Buffers?

Yes. On A5X and later GPUs, secondary command buffers cause LRZ to be disabled (Snapdragon 865 and newer do not disable LRZ for this reason).

Is there a performance benefit or cost to using push constants?

On A5X GPUs, push constants are not recommended.

Hardware changes on A6X GPUs resolved many of the issues, and push constants will perform better when used with frequently changing data.

Static vs. dynamic state?

There is not a significant difference in performance when using dynamic state, so either kind of state can be used as needed.

What is the recommended usage of SSBO vs. UBO vs. texture fetch?

It depends on the usage and size of the buffers.

Never use the descriptor types STORAGE_BUFFER or STORAGE_IMAGE if the shader usage is read-only. Reading data from SSBOs or image buffers effectively becomes a texture fetch, so performance and latency would be similar.

There is 8k of constant memory per SP, so reading from UBOs that fit within that 8k performs better than reading from SSBOs or images. However, the situation shifts if the UBO data is larger than 8k, because each wave then reads the UBO data directly from system memory without the benefit of the texture cache that SSBOs and images would have.
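The guidance above can be summarized as a rough decision helper. The sketch below is an interpretation, not a rule from the driver: the function name and return strings are hypothetical, "8k" is assumed to mean 8 KiB, and the whole 8k is treated as available to a single UBO, which is optimistic.

```python
CONSTANT_MEMORY_BYTES = 8 * 1024  # assumes "8k" per SP means 8 KiB

def suggest_resource(size_bytes, shader_writes):
    """Suggest a resource kind from buffer size and whether the shader writes to it."""
    if shader_writes:
        return "storage"   # SSBO or storage image: only when writes are needed
    if size_bytes <= CONSTANT_MEMORY_BYTES:
        return "ubo"       # fits in constant memory
    return "texture"       # large read-only data: goes through the texture cache
```

For example, a 4 KiB read-only parameter block would map to a UBO, while a 64 KiB read-only table would map to a texture fetch under this model.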

What is the recommended sampler type?

It is recommended to use VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER because of how the Adreno GPU works with Bindless mode.

When using a combined image sampler, the GPU can use Bindless mode which is more performant. When using separate samplers, it will fall back to a slower mode.

Measurements have shown a 2-5% decrease in fill rate for separate samplers.

How can I ensure that Vulkan subpasses are merged properly?

Specific setup and access flags must be used so that subpasses can merge properly. Refer to Vulkan subpasses to learn more about these flags.

To validate that subpasses merged properly, use the Snapdragon Profiler ‘Rendering Stages’ metric and/or enable the Vulkan Adreno Layer, which will flag subpasses that could not be merged.