Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.
One of the main uses of a GPU is image (texture) sampling and processing. GPUs have specialized, built-in hardware to perform nearest-neighbor, bilinear, and (with the VK_EXT_filter_cubic extension) bicubic filtering. However, some use cases require sampling with even larger kernels or with customized kernel weights. These cases can be implemented manually in a fragment or compute shader using the existing sampling instructions, but that requires many round trips between the texture and shader units, which is not ideal from a power or performance perspective.
In this post, we’ll describe how the latest Qualcomm Adreno GPUs have dedicated hardware for sampling with large kernels and customized kernel weights. Our Vulkan driver team has released a new VK_QCOM_image_processing extension to expose that hardware. In our testing, we’ve seen run time drop by more than 75 percent and energy usage drop by almost 90 percent compared to manually coded implementations.
Refer to the manual page and spec proposal for additional details before trying the extension in your own applications.
Four new GLSL functions
The VK_QCOM_image_processing extension exposes four new OpenGL Shading Language (GLSL) functions.
Function 1 – textureWeightedQCOM
vec4 textureWeightedQCOM(sampler2D tex, vec2 uv, sampler2DArray weights)
This function multiplies a 2D kernel of weights with a region of the texture centered at uv and sums the result into the returned vec4 value. Note that the weight texture here is an array of 2D values. Each layer of the array is called a phase. Phases allow for using different kernel discretizations when the target sampling is offset from the exact center of a pixel. This feature is important for non-linear kernel functions (like sinc) whose details are not well captured in a kernel of limited resolution.
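As a rough CPU-side illustration of the arithmetic described above, the following Python sketch multiplies a kernel with the region around a center texel and sums the products. It is not the hardware or driver implementation. It assumes a single phase, a single-channel image stored as nested lists, integer texel coordinates instead of normalized uv, and clamp-to-edge addressing. The name weighted_sample is hypothetical.

```python
def weighted_sample(image, center_x, center_y, kernel):
    """Sum of kernel[ky][kx] * image texel over the region centered on a texel.

    Assumptions (not from the extension spec): one phase, one channel,
    integer texel coordinates, clamp-to-edge addressing.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    height, width = len(image), len(image[0])
    total = 0.0
    for ky in range(kernel_h):
        for kx in range(kernel_w):
            # Shift so the kernel is centered on (center_x, center_y).
            x = center_x + kx - kernel_w // 2
            y = center_y + ky - kernel_h // 2
            # Clamp coordinates that fall outside the image.
            x = min(max(x, 0), width - 1)
            y = min(max(y, 0), height - 1)
            total += kernel[ky][kx] * image[y][x]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
print(weighted_sample(image, 1, 1, identity))
```

On the GPU the same sum is produced for all four components of the returned vec4, and the phase layers choose which set of weights applies for a given sub-texel offset.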
Many common 2D image filtering kernels can be expressed as two 1D filtering kernels. These are known as separable filters and can offer significant performance savings. When creating a separable filter, the horizontal weights are placed in layer 0 and the vertical weights are placed in layer 1 of the texture array. 1D weights must be arranged in groups of 4. The example below shows how to pack a two-phase separable filter of size 3x3. H means horizontal, V means vertical, P# is the phase number, and [n] is the nth element of the separated filter.
Layer 0:
HP0[0] | HP0[1] | HP0[2] | empty | HP1[0] | HP1[1] | HP1[2] | empty |
Layer 1:
VP0[0] | VP0[1] | VP0[2] | empty | VP1[0] | VP1[1] | VP1[2] | empty |
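The packing layout above can be sketched on the CPU as follows. This is only an illustration of the group-of-4 arrangement: the function name pack_separable_weights is hypothetical, and filling the "empty" slots with 0.0 is an assumption rather than something the layout above specifies.

```python
def pack_separable_weights(horizontal_phases, vertical_phases, group=4, pad=0.0):
    """Return [layer0, layer1] with each phase padded to a multiple of `group`.

    horizontal_phases / vertical_phases: one list of 1D weights per phase.
    The pad value used for the empty slots is an assumption.
    """
    def pack(phases):
        row = []
        for weights in phases:
            row.extend(weights)
            # Pad so the next phase starts on a group-of-4 boundary.
            row.extend([pad] * (-len(weights) % group))
        return row

    return [pack(horizontal_phases), pack(vertical_phases)]


# Two phases of a 3-tap separable filter (placeholder weight values).
layers = pack_separable_weights(
    [[0.25, 0.5, 0.25], [0.1, 0.6, 0.3]],
    [[0.25, 0.5, 0.25], [0.3, 0.6, 0.1]],
)
for index, layer in enumerate(layers):
    print(f"Layer {index}: {layer}")
```

Each layer here holds eight values: three weights plus one padding slot for each of the two phases, matching the rows shown above.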
Function 2 – textureBoxFilterQCOM
vec4 textureBoxFilterQCOM(sampler2D tex, vec2 uv, vec2 boxSize)
This function takes an average of texels within a box. The average is weighted by coverage so that off-center sampling will be accurate. The center of the box is given by uv, the width is given by boxSize.x, and the height is given by boxSize.y.
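To make "weighted by coverage" concrete, here is a small CPU-side sketch of a coverage-weighted box average. It is an illustration only, not the hardware algorithm. It assumes a single-channel image, coordinates in texel units rather than normalized uv, and that texel (i, j) covers the square from (i, j) to (i + 1, j + 1). The name box_filter is hypothetical.

```python
def box_filter(image, center_x, center_y, box_w, box_h):
    """Average of texels under a box, each weighted by the area the box covers.

    Assumptions: texel units, one channel, texel (i, j) spans [i, i+1) x [j, j+1).
    """
    height, width = len(image), len(image[0])
    x0, x1 = center_x - box_w / 2, center_x + box_w / 2
    y0, y1 = center_y - box_h / 2, center_y + box_h / 2
    total = 0.0
    area = 0.0
    for j in range(height):
        # Vertical overlap between the box and this texel row.
        cover_y = max(0.0, min(y1, j + 1) - max(y0, j))
        if cover_y == 0.0:
            continue
        for i in range(width):
            # Horizontal overlap between the box and this texel column.
            cover_x = max(0.0, min(x1, i + 1) - max(x0, i))
            total += cover_x * cover_y * image[j][i]
            area += cover_x * cover_y
    return total / area if area > 0.0 else 0.0


image = [[0, 10],
         [20, 30]]
# A 1x1 box straddling the two texels of the top row.
print(box_filter(image, 1.0, 0.5, 1.0, 1.0))
```

A box that sits halfway between two texels contributes half of each, which is why off-center sampling stays accurate.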
Function 3 – textureBlockMatchSADQCOM
vec4 textureBlockMatchSADQCOM(sampler2D target, uvec2 targetCoord, sampler2D reference, uvec2 refCoord, uvec2 blockSize)
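This function measures how closely a block of the target matches a block of the reference. targetCoord and refCoord specify the bottom-left corner of each block, and blockSize gives its width and height. The return value is the Sum of Absolute Differences (SAD) for each component: for every texel offset (i, j) inside the block, the absolute difference |target(targetCoord + (i, j)) - reference(refCoord + (i, j))| is accumulated, independently per component. Lower values indicate a closer match.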
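The Sum of Absolute Differences can be sketched on the CPU as shown below. This is a reference for the arithmetic only and makes several assumptions: images are nested lists of per-texel component tuples, the block extends in the +x and +y directions from the given corner, and both blocks lie fully inside their images. The name block_match_sad is hypothetical.

```python
def block_match_sad(target, target_xy, reference, ref_xy, block_wh):
    """Per-component sum of |target - reference| over a block.

    Assumptions: image[y][x] is a tuple of components, and the block
    extends in +x and +y from the given corner.
    """
    tx, ty = target_xy
    rx, ry = ref_xy
    block_w, block_h = block_wh
    channels = len(target[ty][tx])
    sad = [0.0] * channels
    for j in range(block_h):
        for i in range(block_w):
            t = target[ty + j][tx + i]
            r = reference[ry + j][rx + i]
            for c in range(channels):
                sad[c] += abs(t[c] - r[c])
    return sad


target = [[(1, 0), (2, 0)],
          [(3, 0), (4, 0)]]
reference = [[(0, 1), (0, 1)],
             [(0, 1), (0, 1)]]
print(block_match_sad(target, (0, 0), reference, (0, 0), (2, 2)))
```

Comparing a block with itself gives zero in every component, the best possible match.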
Function 4 – textureBlockMatchSSDQCOM
vec4 textureBlockMatchSSDQCOM(sampler2D target, uvec2 targetCoord, sampler2D reference, uvec2 refCoord, uvec2 blockSize)
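This function works the same as textureBlockMatchSADQCOM, but instead of returning the Sum of Absolute Differences, it returns the Sum of Square Differences (SSD) for each component: for every texel offset (i, j) inside the block, the squared difference (target(targetCoord + (i, j)) - reference(refCoord + (i, j)))^2 is accumulated, independently per component.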
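A matching CPU-side sketch for the Sum of Square Differences follows. As before, it illustrates the arithmetic only and assumes images stored as nested lists of per-texel component tuples, a block that extends in +x and +y from the given corner, and blocks that lie fully inside their images. The name block_match_ssd is hypothetical.

```python
def block_match_ssd(target, target_xy, reference, ref_xy, block_wh):
    """Per-component sum of (target - reference)^2 over a block.

    Assumptions: image[y][x] is a tuple of components, and the block
    extends in +x and +y from the given corner.
    """
    tx, ty = target_xy
    rx, ry = ref_xy
    block_w, block_h = block_wh
    channels = len(target[ty][tx])
    ssd = [0.0] * channels
    for j in range(block_h):
        for i in range(block_w):
            t = target[ty + j][tx + i]
            r = reference[ry + j][rx + i]
            for c in range(channels):
                diff = t[c] - r[c]
                ssd[c] += diff * diff
    return ssd


target = [[(1, 0), (2, 0)],
          [(3, 0), (4, 0)]]
reference = [[(0, 1), (0, 1)],
             [(0, 1), (0, 1)]]
print(block_match_ssd(target, (0, 0), reference, (0, 0), (2, 2)))
```

Because each difference is squared, SSD penalizes a few large mismatches more heavily than SAD does for the same block.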
All four of the foregoing functions operate only on 2D images. They do not currently support mipmaps, multi-layer, multi-sampled, or depth/stencil textures.
Setting up the code
This section describes the new parameters necessary for creating the textures and samplers for use with this extension. It also includes an example fragment program in GLSL. Please see the spec proposal for examples and more detailed information.
Parameters
Box Filtering (Function 2)
- Target sampler
- Target texture
Block Matching (Functions 3 and 4)
- Target sampler
- Target and Reference textures
- Target and Reference texture descriptors
Weighted Image Sampling (Function 1)
- Target sampler
- Weight texture
- Weight descriptor
- Weight image view
- Target texture
Sample fragment program using separate texture and sampler:
#version 450
#extension GL_QCOM_image_processing : require
#extension GL_EXT_samplerless_texture_functions : require
// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;
layout (location = 0) out vec4 fragColor;
// Texture and sampler inputs
layout(set = 0, binding = 0) uniform highp texture2D inputTex;
layout(set = 0, binding = 1) uniform highp texture2DArray kernelTex;
layout(set = 0, binding = 3) uniform highp sampler samplerLinear;
layout(set = 0, binding = 4) uniform highp sampler samplerKernel;
void main()
{
    fragColor = textureWeightedQCOM(sampler2D(inputTex, samplerLinear), uv, sampler2DArray(kernelTex, samplerKernel));
}
Sample fragment program using combined texture and sampler:
#version 450
#extension GL_QCOM_image_processing : require
#extension GL_EXT_samplerless_texture_functions : require
// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;
layout (location = 0) out vec4 fragColor;
// Texture and sampler inputs
layout(set = 0, binding = 0) uniform highp sampler2D inputTexSampler;
layout(set = 0, binding = 1) uniform highp sampler2DArray kernelTexSampler;
void main()
{
    fragColor = textureWeightedQCOM(inputTexSampler, uv, kernelTexSampler);
}
Results: Shorter run time and lower power consumption
The chart below shows the large benefits in both performance and energy usage from using the VK_QCOM_image_processing extension instead of manually implementing the same operations in a fragment program.
Our tests on an SM8550 Android device showed that the extension resulted in 75 percent shorter run time and 90 percent lower power consumption than manual implementations.
The workload we measured used an 8x8 weighted kernel to perform 3 iterations of a 4x downscaling algorithm for thumbnail generation. The values in the chart are normalized to the manual implementation in a fragment or compute shader.
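To read the normalized figures, it can help to convert the quoted reductions into ratios against the manual baseline of 1.0. The short Python sketch below does only that arithmetic, using the 75 and 90 percent figures and the 8x8 kernel size quoted above. The helper name normalized is hypothetical. The tap count is an upper bound for a straightforward manual loop, not a measured value.

```python
def normalized(reduction):
    """Value relative to a manual-implementation baseline of 1.0."""
    return 1.0 - reduction


runtime = normalized(0.75)  # 75 percent shorter run time
energy = normalized(0.90)   # 90 percent lower energy usage

# A straightforward manual 8x8 weighted kernel reads up to 64 texels per output pixel.
taps_per_output = 8 * 8

print(f"normalized run time: {runtime:.2f} (about {1.0 / runtime:.0f}x faster)")
print(f"normalized energy:   {energy:.2f} (about {1.0 / energy:.0f}x less)")
print(f"manual taps per output pixel: up to {taps_per_output}")
```

In other words, the extension path finishes in about a quarter of the time and uses about a tenth of the energy of the manual path for this workload.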
How you can use the functions
Here are some example applications where these functions can come in handy:
Use textureWeightedQCOM (Function 1) in image blurring, sharpening, edge detection, feature detection, application of high-order filter kernels, upsampling, downsampling, and mipmap generation. Support for different reduction modes like min and max can be useful for filters such as dilation or erosion.
Use textureBoxFilterQCOM (Function 2) to blur images and to detect features like average exposure levels, bright areas, and dark areas.
Use textureBlockMatchSADQCOM (Function 3) and textureBlockMatchSSDQCOM (Function 4) in feature detection, motion tracking, and image alignment applications. However, for motion estimation or optical flow applications, we recommend using the higher-level, well-optimized extensions developed specifically for this purpose: GL_QCOM_motion_estimation and VK_NV_optical_flow.
Your turn – Try the extension
As you can see, using this extension wherever possible can deliver big benefits in both performance and battery life.
It’s available to you here:
- SDK – Vulkan SDK version 1.3.222 and newer
- Hardware – Support in SM8550 and newer
- Adreno GPU driver – Version 512.649 and newer
You can find additional details in the public manual page and spec proposal as well. We're keen to see what other applications you come up with for this extension. Try it out and let us know what you discover!
We have also prepared a simple example program that performs a bloom operation. It includes shaders that use the extension and shaders with a manual implementation for comparison. It can be downloaded here.