High-Order Filtering and Block Matching: New Image Processing Extension for Vulkan Optimizes Performance and Power Usage

Sunday 2/11/24 01:00am

Posted By Wade Lutgen

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

One of the main uses of a GPU is image (texture) sampling and processing. GPUs have specialized, built-in hardware to perform nearest-neighbor, bilinear, and bicubic filtering (with the VK_EXT_filter_cubic extension). However, some use cases require sampling with even larger kernels or with customized kernel weights. These cases can be manually implemented in a fragment or compute shader using the existing sampling instructions. However, that requires many round trips between the texture and shader units, which is not ideal from a power or performance perspective.

In this post, we’ll describe how the latest Qualcomm Adreno GPUs have dedicated hardware for sampling with large kernels and customized kernel weights. Our Vulkan driver team has released a new VK_QCOM_image_processing extension to expose that hardware. In our testing, we’ve seen run time drop by more than 75 percent and energy usage drop by almost 90 percent compared to manually coded implementations.

Refer to the manual page and spec proposal for additional details, and try the extension in your own applications.

Four new GLSL functions

The VK_QCOM_image_processing extension exposes four new OpenGL Shading Language (GLSL) functions.

Function 1 – textureWeightedQCOM

vec4 textureWeightedQCOM(sampler2D tex, vec2 uv, sampler2DArray weights)

This function multiplies a 2D kernel of weights with a region of the texture centered at uv and sums the result into the returned vec4 value. Note that the weight texture here is an array of 2D values. Each layer of the array is called a phase. Phases allow for using different kernel discretizations when the target sampling is offset from the exact center of a pixel. This feature is important for non-linear kernel functions (like sinc) whose details are not well captured in a kernel of limited resolution.
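For comparison, the manual approach that this function replaces can be sketched in plain GLSL. This is only an illustration under assumptions that do not come from the extension spec: a single-phase, non-separable kernel stored in layer 0 of the weight array, weights held in the red channel, a footprint centered on the texel that contains uv, and clamp-to-edge addressing. The binding numbers and variable names are hypothetical.

```glsl
#version 450

// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;          // normalized coordinates of the filter center
layout (location = 0) out vec4 fragColor;

// Hypothetical bindings: image to filter and its kernel weights
layout(set = 0, binding = 0) uniform sampler2D inputTexSampler;
layout(set = 0, binding = 1) uniform sampler2DArray kernelTexSampler;

void main()
{
  ivec2 inputSize  = textureSize(inputTexSampler, 0);
  ivec2 kernelSize = textureSize(kernelTexSampler, 0).xy;

  // Assumption: center the footprint on the texel that contains uv
  ivec2 center = ivec2(floor(uv * vec2(inputSize)));
  ivec2 origin = center - kernelSize / 2;

  vec4 sum = vec4(0.0);
  for (int y = 0; y < kernelSize.y; ++y) {
    for (int x = 0; x < kernelSize.x; ++x) {
      // Emulate clamp-to-edge addressing for taps that fall outside the image
      ivec2 p = clamp(origin + ivec2(x, y), ivec2(0), inputSize - 1);
      // Two fetches per tap: one weight (layer 0 only) and one input texel
      float w = texelFetch(kernelTexSampler, ivec3(x, y, 0), 0).r;
      sum += w * texelFetch(inputTexSampler, p, 0);
    }
  }
  fragColor = sum;
}
```

For an 8x8 kernel this loop issues 128 fetches per output pixel, which is the kind of texture/shader traffic a single textureWeightedQCOM call is meant to avoid.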

Many common 2D image filtering kernels can be expressed as two 1D filtering kernels. These are known as separable filters and can offer significant performance savings. When creating a separable filter, the horizontal weights are placed in layer 0 and the vertical weights are packed into layer 1 of the texture array. 1D weights must be arranged in groups of 4, so a 3-element filter is padded with one empty slot per phase. Here’s an example of how you would pack a two-phase separable filter of size 3x3. H means horizontal, V means vertical, P# is the phase number, and [n] is the nth element of the separated filter.

Layer 0:  HP0[0]  HP0[1]  HP0[2]  empty  HP1[0]  HP1[1]  HP1[2]  empty
Layer 1:  VP0[0]  VP0[1]  VP0[2]  empty  VP1[0]  VP1[1]  VP1[2]  empty

Function 2 – textureBoxFilterQCOM

vec4 textureBoxFilterQCOM(sampler2D tex, vec2 uv, vec2 boxSize)

This function takes an average of texels within a box. The average is weighted by coverage so that off-center sampling will be accurate. The center of the box is given by uv, the width is given by boxSize.x, and the height is given by boxSize.y.
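A minimal usage sketch, modeled on the combined texture/sampler style used elsewhere in this post, might look like the following. It assumes that boxSize is expressed in texels and that the sampler bound to inputTexSampler was created with the image-processing flag described in the parameters section; the binding number and the 4x4 box are illustrative choices, not requirements.

```glsl
#version 450

#extension GL_QCOM_image_processing : require

// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;
layout (location = 0) out vec4 fragColor;

// Hypothetical binding: combined texture and sampler for the source image
layout(set = 0, binding = 0) uniform highp sampler2D inputTexSampler;

void main()
{
  // Assumption: box dimensions are given in texels.
  // A 4x4 box centered at uv suits a 4x downscale pass.
  const vec2 boxSize = vec2(4.0, 4.0);
  fragColor = textureBoxFilterQCOM(inputTexSampler, uv, boxSize);
}
```

Rendering this shader into a target one quarter the width and height of the source gives each output pixel the average of the source texels it covers.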

Function 3 – textureBlockMatchSADQCOM

vec4 textureBlockMatchSADQCOM(sampler2D target, uvec2 targetCoord, sampler2D reference, uvec2 refCoord, uvec2 blockSize)

This function measures the correlation (similarity) between a subsection of the target and a subsection of the reference. targetCoord and refCoord specify the bottom-left corner of the block and the return value is the Sum of Absolute Differences (SAD) for each component:

$$\sum_{i,j=\text{uv}}^{\text{uv}+\text{blockSize.xy}} \left| T_{i,j} - R_{i,j} \right|$$

where T is the Target texture and R is the Reference texture. The width and height of the box are given by blockSize.x and blockSize.y respectively. In the formula, uv stands for the bottom-left corner of the block (targetCoord in the target, refCoord in the reference), not the center as in the weighted and box filter functions above.
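As an illustration, a fragment shader could compare co-located blocks of two frames, producing one SAD value per block. This is a sketch under assumptions: the output render target has one pixel per 8x8 block, both samplers use unnormalized coordinates as listed in the parameters section, and the binding numbers and names are hypothetical. Supported block sizes are implementation-dependent, so check the spec for the limits on your device.

```glsl
#version 450

#extension GL_QCOM_image_processing : require

layout (location = 0) out vec4 fragColor;

// Hypothetical bindings: the two images to compare
layout(set = 0, binding = 0) uniform highp sampler2D targetTexSampler;
layout(set = 0, binding = 1) uniform highp sampler2D referenceTexSampler;

void main()
{
  // Illustrative block size
  const uvec2 blockSize = uvec2(8u, 8u);

  // Assumption: one output pixel per block, so the block corner is the
  // output pixel position scaled by the block size
  uvec2 blockCoord = uvec2(gl_FragCoord.xy) * blockSize;

  // Per-component SAD between the same block in both images
  fragColor = textureBlockMatchSADQCOM(targetTexSampler, blockCoord,
                                       referenceTexSampler, blockCoord,
                                       blockSize);
}
```

A result near zero means the block is unchanged between the two images; larger values indicate a larger difference.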

Function 4 – textureBlockMatchSSDQCOM

vec4 textureBlockMatchSSDQCOM(sampler2D target, uvec2 targetCoord, sampler2D reference, uvec2 refCoord, uvec2 blockSize)

This function works the same as textureBlockMatchSADQCOM, but instead of returning the Sum of Absolute Differences, it returns the Sum of Square Differences (SSD) for each component:

$$\sum_{i,j=\text{uv}}^{\text{uv}+\text{blockSize.xy}} \left( T_{i,j} - R_{i,j} \right)^2$$
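One way the SSD result might be used for small image-alignment searches is sketched below: for each block of the target, the shader tries a few candidate offsets in the reference and keeps the one with the lowest cost. Everything beyond the function call itself is an assumption made for illustration: the search radius, the cost (sum of the R, G, and B components), the one-output-pixel-per-block layout, a signed floating-point render target, images at least as large as one block, and the binding numbers and names.

```glsl
#version 450

#extension GL_QCOM_image_processing : require

layout (location = 0) out vec4 fragColor;

// Hypothetical bindings: the two images to align
layout(set = 0, binding = 0) uniform highp sampler2D targetTexSampler;
layout(set = 0, binding = 1) uniform highp sampler2D referenceTexSampler;

void main()
{
  const uvec2 blockSize = uvec2(8u, 8u);  // illustrative block size
  const int radius = 2;                   // illustrative search radius in texels

  // Assumption: one output pixel per block of the target image
  ivec2 targetCoord = ivec2(gl_FragCoord.xy) * ivec2(blockSize);

  // Keep candidate blocks fully inside the reference image
  ivec2 maxCoord = textureSize(referenceTexSampler, 0) - ivec2(blockSize);

  float bestCost = 1.0e30;
  ivec2 bestOffset = ivec2(0);
  for (int dy = -radius; dy <= radius; ++dy) {
    for (int dx = -radius; dx <= radius; ++dx) {
      ivec2 refCoord = clamp(targetCoord + ivec2(dx, dy), ivec2(0), maxCoord);
      vec4 ssd = textureBlockMatchSSDQCOM(targetTexSampler, uvec2(targetCoord),
                                          referenceTexSampler, uvec2(refCoord),
                                          blockSize);
      // Assumption: combine the color components into a single cost
      float cost = ssd.r + ssd.g + ssd.b;
      if (cost < bestCost) {
        bestCost = cost;
        bestOffset = refCoord - targetCoord;
      }
    }
  }

  // Best offset in xy, its cost in z
  fragColor = vec4(vec2(bestOffset), bestCost, 1.0);
}
```

Each candidate costs a single block-match call rather than 64 pairs of texel fetches, which is where the extension is expected to save time and energy.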

All four of the foregoing functions operate only on 2D images. They do not currently support mipmaps, multi-layer, multi-sampled, or depth/stencil textures.

Setting up the code

Parameters

This section describes the new parameters necessary for creating the textures and samplers for use with this extension. It also includes an example fragment program in GLSL. Please see the spec proposal for examples and more detailed information.

Box Filtering (Function 2)

Target sampler:
  Create with VK_SAMPLER_CREATE_IMAGE_PROCESSING_BIT_QCOM
  unnormalizedCoordinates can be VK_TRUE or VK_FALSE
  addressModes can be CLAMP_TO_EDGE or CLAMP_TO_BORDER (border color must be VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK)
  Reduction modes can be MIN, MAX, or AVERAGE
Target texture:
  Format must support VK_FORMAT_FEATURE_2_BOX_FILTER_SAMPLED_BIT_QCOM

Block Matching (Functions 3 and 4)

Target sampler:
  Create with VK_SAMPLER_CREATE_IMAGE_PROCESSING_BIT_QCOM
  unnormalizedCoordinates must be VK_TRUE
  addressModes can be CLAMP_TO_EDGE or CLAMP_TO_BORDER (border color must be VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK)
  Reduction modes can be MIN, MAX, or AVERAGE
Target and Reference textures:
  Must be created with VK_IMAGE_USAGE_SAMPLE_BLOCK_MATCH_BIT_QCOM
  Format must support VK_FORMAT_FEATURE_2_BLOCK_MATCHING_BIT_QCOM
  Layout must be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL or VK_IMAGE_LAYOUT_GENERAL
Target and Reference texture descriptors:
  Must be created with VK_DESCRIPTOR_TYPE_BLOCK_MATCH_IMAGE_QCOM

Weighted Image Sampling (Function 1)

Target sampler:
  Create with VK_SAMPLER_CREATE_IMAGE_PROCESSING_BIT_QCOM
  unnormalizedCoordinates can be VK_TRUE or VK_FALSE
  addressModes can be CLAMP_TO_EDGE or CLAMP_TO_BORDER (border color must be VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK)
  Reduction modes can be MIN, MAX, or AVERAGE
Weight texture:
  Must be created with VK_IMAGE_USAGE_SAMPLE_WEIGHT_BIT_QCOM
  Format must support VK_FORMAT_FEATURE_2_WEIGHT_IMAGE_BIT_QCOM
  Layout must be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL or VK_IMAGE_LAYOUT_GENERAL
  Image type must be either VK_IMAGE_TYPE_1D or VK_IMAGE_TYPE_2D
Weight descriptor:
  Must be created with VK_DESCRIPTOR_TYPE_SAMPLE_WEIGHT_IMAGE_QCOM
Weight image view:
  Must extend the create struct with VkImageViewSampleWeightCreateInfoQCOM
  ViewType must be VK_IMAGE_VIEW_TYPE_1D_ARRAY or VK_IMAGE_VIEW_TYPE_2D_ARRAY
Target texture:
  Format must support VK_FORMAT_FEATURE_2_WEIGHT_SAMPLED_IMAGE_BIT_QCOM

Sample fragment program using separate texture and sampler:

#version 450

#extension GL_QCOM_image_processing : require
#extension GL_EXT_samplerless_texture_functions : require

// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;
layout (location = 0) out vec4 fragColor;

// Texture and sampler inputs
layout(set = 0, binding = 0) uniform highp texture2D inputTex;
layout(set = 0, binding = 1) uniform highp texture2DArray kernelTex;
layout(set = 0, binding = 3) uniform highp sampler samplerLinear;
layout(set = 0, binding = 4) uniform highp sampler samplerKernel;

void main()
{
  fragColor = textureWeightedQCOM(sampler2D(inputTex, samplerLinear), uv, sampler2DArray(kernelTex, samplerKernel));
}

Sample fragment program using combined texture and sampler:

#version 450

#extension GL_QCOM_image_processing : require
#extension GL_EXT_samplerless_texture_functions : require

// Inputs and outputs for fragment shader
layout (location = 0) in vec2 uv;
layout (location = 0) out vec4 fragColor;

// Texture and sampler inputs
layout(set = 0, binding = 0) uniform highp sampler2D inputTexSampler;
layout(set = 0, binding = 1) uniform highp sampler2DArray kernelTexSampler;

void main()
{
  fragColor = textureWeightedQCOM(inputTexSampler, uv, kernelTexSampler);
}

Results: Shorter run time and lower power consumption

The chart below shows the large benefits in both performance and energy usage from using the VK_QCOM_image_processing extension instead of manually implementing the same operations in a fragment program.

Our tests on an SM8550 Android device showed that the extension resulted in 75 percent shorter run time and 90 percent lower power consumption than manual implementations.

The workload we measured used an 8x8 weighted kernel to perform 3 iterations of a 4x downscaling algorithm for thumbnail generation. The values in the chart are normalized to the manual implementation in a fragment or compute shader.

How you can use the functions

Here are some example applications where these functions can come in handy:

Use textureWeightedQCOM (Function 1) in image blurring, sharpening, edge detection, feature detection, application of high-order filter kernels, upsampling, downsampling, and mipmap generation. The support of different reduction modes like min and max can be useful for filters such as dilation or erosion.

Use textureBoxFilterQCOM (Function 2) to blur images and to detect features like average exposure levels, bright areas, and dark areas.

Use textureBlockMatchSADQCOM (Function 3) and textureBlockMatchSSDQCOM (Function 4) in feature detection, motion tracking, and image alignment applications. However, for motion estimation or optical flow applications, we recommend using the higher-level, well-optimized extensions developed specifically for this purpose: GL_QCOM_motion_estimation and VK_NV_optical_flow.

Your turn – Try the extension

As you can see, using this extension wherever possible can deliver big benefits in both performance and battery life.

It’s available to you here:

SDK – Vulkan SDK version 1.3.222 and newer
Hardware – Support in SM8550 and newer
Adreno GPU driver – Version 512.649 and newer

You can find additional details in the public manual page and spec proposal as well. We're keen to see what other applications you come up with for this extension. Try it out and let us know what you discover!

We have also prepared a simple example program that performs a bloom operation. It includes shaders that use the extension and shaders with a manual implementation for comparison. It can be downloaded here.
