Shaders
This section presents various tips and tricks to help optimize OpenGL ES and Vulkan applications on Adreno architectures.
General
Use built-ins
Built-in functions are an important part of the OpenGL ES Shading Language specification and should always be used in preference to writing custom implementations. These functions are often optimized for specific shader profiles and for the capabilities of the hardware for which the shader was compiled. As a result, they will usually be faster than any custom implementation.
Note
gl_VertexID and gl_InstanceID are removed as per the GL_KHR_vulkan_glsl extension; gl_VertexIndex and gl_InstanceIndex are available instead.
Note
gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance are available in non-fragment stages.
Refer to the GL_KHR_vulkan_glsl extension for details on changes to GLSL built-ins in Vulkan.
Use the appropriate data type
Using the most appropriate data type in code can enable the compiler and driver to optimize code, including the pairing of shader instructions.
Using a vec4 data type where a float would suffice could prevent the compiler from performing optimizations. Small mistakes can have a significant impact on performance.
For example, the following code should take a single instruction slot:
ivec4 ResultOfA(ivec4 a) {
    return a + 1;
}
Now suppose a slight error is introduced into the code: the floating-point constant 1.0 is used instead of the integer constant 1, which is not the appropriate data type.
ivec4 ResultOfA(ivec4 a) {
    return a + 1.0;
}
The code could now consume eight instruction slots. The variable a is converted to vec4, then the addition is done in floating point, and finally the result is converted back to the return type ivec4.
Reduce type casting
It is also recommended to reduce the number of type cast operations performed. The following code might be suboptimal:
uniform sampler2D ColorTexture;
in vec2 TexC;
vec3 light(in vec3 amb, in vec3 diff)
{
vec3 Color = texture(ColorTexture, TexC).rgb;
Color *= diff + amb;
return Color;
}
Here, the call to the texture function returns a vec4, and extracting a vec3 from it requires an extra instruction. Changing the code as follows might reduce the instruction count by one:
uniform sampler2D ColorTexture;
in vec2 TexC;
vec4 light(in vec4 amb, in vec4 diff)
{
vec4 Color = texture(ColorTexture, TexC);
Color *= diff + amb;
return Color;
}
Pack scalar constants
Packing scalar constants into vectors consisting of four channels substantially improves the hardware fetch effectiveness. In the case of an animation system, this increases the number of available bones for skinning.
Consider the following code:
float scale, bias;
vec4 a = Pos * scale + bias;
By changing the code as follows, it might take one less instruction, because the compiler can optimize the line into a single, more efficient instruction (mad):
vec2 scaleNbias;
vec4 a = Pos * scaleNbias.x + scaleNbias.y;
Keep shader length reasonable
Excessively long shaders can be inefficient. If a shader needs many instruction slots relative to its number of texture fetches, consider splitting the algorithm into several parts. Values that are generated by one part of the algorithm for later reuse by another part can be stored into a texture and later retrieved via a texture fetch. However, this approach could be expensive in terms of memory bandwidth. Use of trilinear or anisotropic filtering, wide texture formats, 3D and cube map textures, texture projection, texture lookups with gradients of different LODs, or gradients across a pixel quad may also increase texture sampling time and reduce the overall benefit.
Sample textures in an efficient way
To avoid texture stalls, follow these rules:
Avoid random access – Hardware operates on blocks of 2x2 fragments, so the shaders are more efficient if they access neighboring texels within a single block.
Avoid 3D textures – Fetching data from volume textures is expensive owing to the complex filtering that needs to be performed to compute the result value.
Limit the number of textures sampled from shaders – Usage of four samplers in a single shader is acceptable, but accessing more textures in a single shader stage could lead to performance bottlenecks.
Compress all textures – This allows better memory usage, translating to a lower number of texture stalls in the rendering pipeline.
Consider using mipmaps – Mipmaps help to coalesce texture fetches and can help improve performance at the cost of increased memory usage.
Texture filtering can influence the speed of texture sampling. Filter performance is architecture/chip dependent, and a developer might see a benefit in using bilinear or nearest filtering over trilinear or anisotropic filtering on certain architectures. Mipmap clamping may reduce the cost of trilinear filtering, so the average cost might be lower in real-world cases. The cost of anisotropic filtering multiplies with the degree of anisotropy; that means a 16x anisotropic lookup can be 16 times slower than a regular isotropic lookup. However, because anisotropic filtering is adaptive, this hit is taken only on fragments that require it, which may be only a few fragments in all. A rule of thumb for real-world cases is that anisotropic filtering is, on average, less than double the cost of isotropic filtering.
Cube map texture and projected texture lookups do not incur any extra cost, while shader-specific gradients, based on the dFdx and dFdy functions, cost extra. These shader-specific gradients cannot be stored across lookups. If a texture lookup is done again with the same gradients in the same sampler, it will incur the cost again.
Threads in flight/dynamic branching
Branching is crucial for shader performance. Whenever a branch diverges, that is, some elements of a wave take one path while others take another, both paths are executed, with predication nulling out the operations for the elements that do not take a given path. Divergence is avoided only when all elements take the same path, which is rarely the case for fragment shaders. There are three types of branches, listed in order from best performance to worst on Adreno GPUs:
Branching on a constant, known at compile time
Branching on a uniform variable
Branching on a variable modified inside the shader
Branching on a constant may yield acceptable performance.
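As an illustrative sketch of the three branch types, with hypothetical uniform, constant, and texture names:

```glsl
#version 300 es
precision mediump float;

const bool kDebugTint = false;  // compile-time constant: the compiler can
                                // eliminate the dead path entirely

uniform bool uUseFog;           // uniform: no divergence within a draw call
uniform sampler2D uTex;

in vec2 vTexC;
out vec4 fragColor;

void main()
{
    vec4 color = texture(uTex, vTexC);

    if (kDebugTint)             // best: resolved at compile time
        color.rgb = vec3(1.0, 0.0, 0.0);

    if (uUseFog)                // acceptable: all fragments take the same path
        color.rgb = mix(color.rgb, vec3(0.5), 0.25);

    if (color.a < 0.1)          // worst: data-dependent, may diverge per fragment
        color.rgb *= 0.5;

    fragColor = color;
}
```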
Pack shader interpolators
Shader-interpolated values or varyings require a GPR (general purpose register) to hold data being fed into a fragment shader. Therefore, minimize their use.
Use constants where a value is uniform. Pack values together as all varyings have four components, whether they are used or not. Putting two vec2 texture coordinates into a single vec4 value is a common practice, but other strategies employ more creative packing and on-the-fly data compression.
Note
OpenGL ES 3.0 and ES 3.1 introduce various built-in functions (e.g., packHalf2x16) to carry out packing operations.
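A minimal sketch of packing two vec2 texture coordinate sets into one vec4 varying; the attribute and uniform names are illustrative:

```glsl
#version 300 es
// Vertex shader: one vec4 varying instead of two vec2 varyings.
uniform mat4 uMVP;
in vec4 aPos;
in vec2 aTexC0;
in vec2 aTexC1;
out vec4 vTexC01;   // .xy = first set, .zw = second set

void main()
{
    gl_Position = uMVP * aPos;
    vTexC01 = vec4(aTexC0, aTexC1);
}
```

In the fragment shader, the two sets are recovered with swizzles: `vec2 uv0 = vTexC01.xy;` and `vec2 uv1 = vTexC01.zw;`.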
Minimize usage of shader GPRs
Minimizing the usage of GPRs can be an important means of optimizing performance. Feeding simpler shaders to the compiler helps guarantee optimal results. Modifying GLSL to save even a single instruction can sometimes save a GPR. Not unrolling loops can also save GPRs, although that is up to the shader compiler: unrolled loops tend to move texture fetches toward the top of the shader, which requires more GPRs to hold the multiple texture coordinates and fetched results simultaneously. Always profile shaders to make sure the final solution chosen is the most efficient one for the target platform.
For example, consider unrolling the following loop:
for (i = 0; i < 4; ++i) {
diffuse += ComputeDiffuseContribution(normal, light[i]);
}
Unrolled, the code snippet becomes:
diffuse += ComputeDiffuseContribution(normal, light[0]);
diffuse += ComputeDiffuseContribution(normal, light[1]);
diffuse += ComputeDiffuseContribution(normal, light[2]);
diffuse += ComputeDiffuseContribution(normal, light[3]);
Minimize shader instruction count
The compiler optimizes specific instructions, but it is not automatically efficient. Analyze shaders to save instructions wherever possible. Saving even a single instruction is worth the effort.
Avoid uber-shaders
Uber-shaders combine multiple shaders into a single shader that uses static branching. Using them makes sense if trying to reduce state changes and batch draw calls. However, this often increases GPR count, which has an impact on performance.
Avoid math on shader constants
Almost every shipped game since the advent of shaders has spent instructions performing unnecessary math on shader constants. Identify these instructions in shaders and move those calculations off to the CPU. It may be easier to identify math on shader constants in the postcompiled microcode.
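A common instance of this is sketched below; the matrix names are hypothetical. The product of two uniform matrices is constant across a draw call, so it should be computed once on the CPU rather than once per vertex:

```glsl
#version 300 es
// Suboptimal: uProj * uView is recomputed for every vertex.
uniform mat4 uProj;
uniform mat4 uView;
in vec4 aPos;

void main()
{
    gl_Position = (uProj * uView) * aPos;
}
```

Instead, upload a precomputed view-projection matrix as a single uniform and write `gl_Position = uViewProj * aPos;`.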
Avoid discarding pixels in the fragment shader
Some developers believe that manually discarding, also known as killing, pixels in the fragment shader boosts performance. The rules are not that simple for two reasons:
If some pixels in a thread are killed and others are not, the shader still executes.
It depends on how the shader compiler generates microcode.
In theory, if all pixels in a thread are killed, the GPU will stop processing that thread as soon as possible. In practice, discard operations can disable hardware optimizations.
If a shader cannot avoid discard operations, attempt to render the geometry that depends on them after the opaque draw calls.
Avoid modifying depth in fragment shaders
Similar to discarding fragments, modifying depth in the fragment shader can disable hardware optimizations.
Avoid texture fetches in vertex shaders
Adreno is based on a unified shader architecture, which means the vertex processing performance is similar to the fragment processing performance. However, for optimal performance, it is important to ensure that texture fetches in vertex shaders are localized and always operate on compressed texture data.
Break up draw calls
If a shader is heavy on GPRs and/or heavy on texture cache demands, increased performance can result from breaking up the draw calls into multiple passes. Whether the results will be positive is hard to predict, so using real-world measurements both ways is the best method to decide. Ideally, a two-pass draw call would combine its results with simple alpha blending, which is not heavy on Adreno GPUs because of the graphics memory (GMEM).
Some developers may consider using a true deferred rendering algorithm, but that approach has many drawbacks, e.g., the GMEM must be resolved for a previous pass to be used as input to a successive pass. Because resolves are not free, it is a performance cost that must be recouped elsewhere in the algorithm.
Note
Vulkan: Ideally the use of Vulkan’s RenderPass will help minimize GMEM resolves, so restructuring the rendering algorithm to use as many subpasses as possible will be the optimal approach.
Use medium precision where possible
On Adreno, 16-bit operations tend to be faster and more power-efficient than 32-bit operations. QTI recommends setting the default precision to mediump and promoting only those values that require higher precision.
However, there may be situations when highp must be used for certain varyings, e.g., texture coordinates, in shaders. These situations can be handled with a conditional statement and a preprocessor-based macro definition, as follows:
precision mediump float;
#ifdef GL_FRAGMENT_PRECISION_HIGH
#define NEED_HIGHP highp
#else
#define NEED_HIGHP mediump
#endif
varying vec2 vSmallTexCoord;
varying NEED_HIGHP vec2 vLargeTexCoord;
Favor vertex shader calculations over fragment shader calculations
Typically, vertex count is significantly less than fragment count. It is possible to reduce GPU workload by moving calculations from the fragment shader to the vertex shader. This helps to eliminate redundant computations.
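As a sketch, a per-vertex diffuse term can be computed in the vertex shader and interpolated, provided the value interpolates acceptably across the triangle; the names here are illustrative:

```glsl
#version 300 es
uniform mat4 uMVP;
uniform vec3 uLightDir;    // assumed normalized, in model space

in vec4 aPos;
in vec3 aNormal;
out float vDiffuse;        // interpolated by the hardware, read per fragment

void main()
{
    gl_Position = uMVP * aPos;
    // One dot product per vertex instead of one per fragment.
    vDiffuse = max(dot(normalize(aNormal), uLightDir), 0.0);
}
```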
Measure, test, and verify results
Finding bottlenecks is necessary for optimization, whether the application is vertex bound, fragment bound, or texture fetch bound. Measure performance before attempting to make the code faster. Use tools to take these measurements, e.g., the Snapdragon Profiler or even software timers.
Do not assume something runs faster based solely on intuition. When code is modified to perform better, it can disable compiler/hardware optimizations that are more beneficial. Always measure timing before and after changes to assess the impact of modifications performed for the sake of optimization.
Prefer uniform buffers over shader storage buffers
As long as read-only access is sufficient and the space Uniform Buffers offer is enough, prefer them over Shader Storage Buffers; they are likely to perform better on the Adreno architecture. This is especially true if the Uniform Buffer Objects are statically indexed in GLSL and are small enough that the driver or compiler can map them into the same hardware constant RAM that is used for the default uniform block uniforms.
Also prefer uniform buffers over push constants on Adreno hardware for performance reasons.
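A small, statically indexed uniform block along these lines (the block contents are illustrative) gives the driver the best chance of mapping it into constant RAM:

```glsl
#version 300 es
layout(std140) uniform PerDraw
{
    mat4 uMVP;
    vec4 uTint;
} perDraw;

in vec4 aPos;

void main()
{
    gl_Position = perDraw.uMVP * aPos;  // static access, no dynamic indexing
}
```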
Eliminate subpixel triangles during tessellation
Tessellation allows for increased levels of detail and can reduce memory bandwidth and CPU cycles by allowing other game subsystems to operate on low-resolution representations of meshes. However, high levels of tessellation can generate subpixel triangles, which cause poor rasterizer utilization. It is important to utilize distance, screen space size, or other adaptive metrics for computing tessellation factors that avoid subpixel triangles.
Do back-face culling during tessellation
Hardware back-face culling occurs after the tessellation stage, which potentially wastes GPU resources tessellating back-facing primitives. These can be identified in the tessellation control shader stage and culled by setting their edge tessellation factors to 0.
Note
Include a slight “fudge” factor in this calculation if displacement mapping will be used in the tessellation evaluation shader stage, as this technique may change the visibility of primitives.
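A sketch of this culling in a tessellation control shader; the facing test, camera uniform, and tessellation factors are all illustrative assumptions:

```glsl
#version 320 es
layout(vertices = 3) out;

uniform vec3 uCameraPos;   // hypothetical: camera position in model space
in vec3 vNormal[];

void main()
{
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;

    if (gl_InvocationID == 0)
    {
        // Cull a patch whose vertices all face away from the camera by
        // zeroing its tessellation factors. The -0.2 bias is the "fudge"
        // factor mentioned above, left in for displacement mapping.
        bool backFacing = true;
        for (int i = 0; i < 3; ++i)
        {
            vec3 view = normalize(uCameraPos - gl_in[i].gl_Position.xyz);
            backFacing = backFacing && (dot(vNormal[i], view) < -0.2);
        }

        float level = backFacing ? 0.0 : 4.0;  // 4.0 is an arbitrary example
        gl_TessLevelOuter[0] = level;
        gl_TessLevelOuter[1] = level;
        gl_TessLevelOuter[2] = level;
        gl_TessLevelInner[0] = level;
    }
}
```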
Disable tessellation whenever it is not needed
Whenever possible, disable the tessellation control shader and tessellation evaluation shader stages if the tessellation factor for the mesh would be ~1. This eliminates the use of unnecessary GPU stages during the rendering process.
Keep UBOs as small as possible
UBOs that fit in the 8KB of constant memory will perform better than larger UBOs, where each wave has to fetch data from main memory.
OpenGL ES Specific
Compile and link during initialization
The compilation and linking of shaders is a time-consuming process. It is expensive compared to other calls in OpenGL ES. It is recommended that shaders are loaded and compiled during initialization, and that glUseProgram is then invoked to switch between shaders as necessary during the rendering phase.
For OpenGL ES 2.0, ES 3.0, and ES 3.1 contexts, the use of blob binaries is recommended. After compiling and linking a program object, it is possible to retrieve the binary representation, or blob, using one of the following functions:
glGetProgramBinaryOES – If using an OpenGL ES 2.0 context with the GL_OES_get_program_binary extension available
glGetProgramBinary – If using an OpenGL ES 3.0 or 3.1 context (core functionality)
The blob can then be saved to persistent storage. The next time the application is launched, it is not necessary to recompile and relink the shader. Instead, read the blob from persistent storage and load it directly into the program object using glProgramBinaryOES or glProgramBinary. This can significantly speed up application launch times.
Warning
Many OpenGL ES implementations have a habit of incorporating build-time GL state into program objects when they are linked. If that program is then used for a draw call issued in the context of a different GL state configuration, the OpenGL ES implementation is obliged to transparently rebuild the program on the fly. This behavior is legal in the OpenGL ES specification. However, it can cause serious problems for developers. It often takes a significant amount of time to rebuild the program object, which can lead to severe frame drops. The rebuild was not requested by the application, so the delay is unexpected and the reason for it may not be apparent.
On Adreno platforms, this is not an issue. The Adreno drivers never recompile shaders. It is safe to assume that program objects will never be rebuilt other than at a specific request.
Invalidate frame buffer contents as early as possible
An application should use glInvalidateFramebuffer and glInvalidateSubFramebuffer API calls to inform the driver that it is free to drop the contents (or regions thereof) of the current draw frame buffer. This is important for tiled rendering modes, because these hints can be used by the driver to reduce the amount of work the hardware has to perform.
Optimize vertex buffer object updates
When modifying Vertex Buffer Object (VBO) contents on the fly while rendering a frame, be sure to batch all the VBO updates (glBufferSubData calls) before issuing any draw calls that use the modified VBO region. If using multiple VBOs, batch the updates for all the VBOs first, and then issue all the draw calls.
Failure to follow these recommendations can cause the driver to maintain multiple copies of an entire VBO, which results in reduced performance.