Qualcomm products mentioned within this post are offered by
Qualcomm Technologies, Inc. and/or its subsidiaries.
Co-written with Hongqiang Wang and Alex Bourd
Continuing our series on OpenCL optimization on the Qualcomm® Adreno™ GPU, we describe a multi-step optimization for apps that use the Epsilon filter. We demonstrate that OpenCL optimizations of the Epsilon filter on the GPU are device-specific.
You can apply the optimization techniques we describe below to boost performance in other image processing applications and, indeed, in any general, memory-bound application.
Introduction to the Epsilon filter
The Epsilon filter is used in many image processing and computer vision algorithms for the reduction of mosquito noise, a video compression artifact observed in high-frequency regions such as edges in images. The filtering operation itself consists of two steps as shown in Figure 1:
- Given a pixel to be filtered, calculate the absolute difference between the central pixel and each pixel in its neighboring 9x9 area.
- If the absolute difference is below a threshold, the neighboring pixel value is used for averaging. The threshold is usually a constant.
Often, the filter is applied only to the intensity (Y) component of YUV images, because the noise is mainly visible there. We do the same in our implementation.
Figure 1: Epsilon filter algorithm
As in our previous post about accelerating the Sobel filter on Adreno GPU, we will show how the Adreno GPU can efficiently perform this highly parallel filtering operation using OpenCL. Moreover, we will examine a slew of optimization techniques that can be used to extract the best performance from the hardware.
Much as in our naïve Sobel filter version, we compute a single, filtered output per work-item in this naïve implementation. Here are some details:
- Use image object instead of buffer object. This makes the out-of-boundary check unnecessary, and more importantly takes advantage of the level 1 (L1) cache in the Adreno GPU.
- Use CL_R image format, since we care about the Y component only.
- Use CL_UNORM_INT8 image data type. This means that the pixels read into the signal processor (SP) are normalized to [0, 1] by the built-in texture pipe in the Adreno GPU.
With the settings given above, processing a WxH sized image requires WxH work-items, each of which loads 81 floating point pixels to compute just a single output pixel.
The problem is that this naïve implementation makes the algorithm heavily memory-bound and results in poor performance. For example, processing a 3264x2448-sized image takes 39.5 milliseconds (ms) on Adreno 630 (Snapdragon 845) and 78.3 ms on Adreno 530 (Snapdragon 820). Both are clocked at their highest respective frequencies of 720 MHz and 600 MHz.
OpenCL optimizations to accelerate the Epsilon filter
In this section, we will introduce several optimization steps in an incremental manner to see how the performance improves over the baseline (naïve) approach. The performance of the naïve implementation on the Adreno 630 is normalized to 1.
1. Data type and data pack optimization
First, we improve our data-load efficiency by taking two simple steps:
- Use 16-bit short data type instead of 32-bit floating (FP32) data type. Using FP32 to represent image data is a waste of memory, and Adreno GPUs have native hardware support for 16-bit operations. On the host side, the input image object adopts CL_UNSIGNED_INT8 image data type, while inside the OpenCL kernel, the pixel load is using convert_short4(read_imageui()) instead of read_imagef. OpenCL does not have a function that directly loads 16-bit short data type from image object, but using convert_short4(read_imageui()) allows Adreno to directly load 16-bit short data type vector through texture processor (TP).
- Use CL_RGBA to pack four pixels into the four channels and thus better utilize the bandwidth of the TP, though the pixels are Y component.
As shown in Figure 2, each work-item loads 9 rows, and three short4 vectors per row, while outputs one processed pixel. The number of load operations per output pixel is reduced from 9x9 = 81 (FP32/load) to 9x3 = 27 (short4/load).
After this optimization step, the algorithm runs 1.6x faster on Adreno 630. (See Summary below for actual test results on all optimization steps.)
Figure 2: Data pack using 16-bit short
2. Vectorized load/store optimization and increased data reuse
Next, we explore the possibility of reusing the neighboring data that is already loaded. Specifically, with the help of vectorized load, one work-item can compute four output pixels without increasing the number of load operations, as illustrated in Figure 3.
Figure 3: Filter 4 pixels per work-item
To put this more concretely:
- Each work-item loads three short4 vectors and outputs four pixels. The number of loads per output pixel then falls by 4x, from 27 to 6.75.
- The global work size is appropriately decreased to (W/4)xH.
- We unroll the loop over rows.
This is what the pseudocode looks like:
Read center pixel c; for row = 1 to 9, do: read data p1; Perform 1 computation with pixel c; read data p2; Perform 4 computations with pixel c; read data p3; Perform 4 computations with pixel c; end for write results back to pixel c;
With the above changes in place, the performance improves by 2.7x over the baseline.
3. Further increase in workload per work-item
As seen in Figure 4, by loading just one more tile of 4x9 pixels, we can compute as twice the number of output pixels as in step 2. It therefore stands to reason that we may expect another performance boost by increasing the workload per work-item. Overall:
- Each work-item loads 9 rows of four short4 vectors and generates 8 output pixels. The number of memory accesses per output pixel drops from 3x9/4 = 6.5 (short4) to 4x9/8 = 4.5 (short4).
- The global work size is set to (W/8, H).
On Adreno 530, this optimization yields a slight performance improvement, from 3.2x to 3.4x of Adreno 530’s baseline performance. However, the performance on Adreno 630 drops off slightly, from 2.7x to 2.6x of the Adreno 630’s baseline performance. It is worth noting that due to various factors, such as hardware micro-architecture changes and software/compiler updates, the effectiveness of optimization techniques may vary for the same kernels.
Figure 4: Process 8 pixels per work-item
4. Use of local memory
In Adreno GPU, local memory (LM) physically resides inside the chip and has much lower memory latency than global memory (GM), which is separate from the chip. Therefore, it would save a lot of memory traffic between GM and GPU, if we can load the pixels into LM that are repeatedly used. This is a standard practice that could not only improve performance, but also significantly reduce power consumption. Here are the implementation details for this use case:
- Divide the global size into workgroups with a local size of WLxHL. The pixels needed to compute the output corresponding to each workgroup are cooperatively loaded into LM by the work-items that constitute the workgroup.
- For local size of WLxHL, each workgroup requires (HL + 8) rows of (WL + 2) short4 pixels. Take a look at Figure 5 to convince yourself.
- Each work-item of a workgroup loads a single short4 pixel to LM. The work-items near the boundary, however, must load extra pixels corresponding to the padding pixels. That said, no work-item would load more than 4 short4 pixels. Figure 6 illustrates this.
- Use a local memory barrier to ensure that all work-items in the workgroup proceed to the computation part only after all the work-items of a workgroup are done with the data load.
Figure 5: Data to be loaded into LM for 8x16 workgroup
Figure 6: Extra pixels to be loaded by work-items at workgroup boundary. The arrows point towards the extra pixels that any group of work-items will have to load.
We try two different local size configurations — 8x16 and 8x32 — and show the results in Table 1. Both sizes yield a substantial improvement of 2x compared to the baseline, but it falls short of beating our implementation in step 2. This is likely due to two reasons:
- Use of barrier synchronization
- An already high cache-hit ratio with GM in step 2
Also, increasing the local size does not improve performance much because, although the padding overhead decreases, the overhead of synchronizing across a larger workgroup increases. These competing forces balance each other out. In fact, we saw similar performance for 16x16, 16x32, and 32x32 configurations.
|LM size (bytes)||10x24x16 = 3840||10x32x16 = 5120|
Table 1: Performance improvement using LM
5. Branch operations optimization
The Epsilon filter needs to compare pixels as follows:
cond = fabs(c - p) <= (half4)(T); sum += cond ? p : consth0; cnt += cond ? consth1 : consth0;
This causes some control flow divergence because not all fibers in a wave go through the same execution branch. The branching operation can be replaced by ALU operations as follows:
cond = convert_half4(-(fabs(c - p) <= (half4)(T)) ); sum += cond * p; cnt += cond;
When we apply this optimization on top of the ones in step 3, performance improves from 2.7x to 3x of the baseline. The main difference is that the old code uses some costly hardware logic to handle the divergence. In the new code, all fibers in a wave execute the same ALU instructions, even though the data (the variable cond) may be different.
Summary: Tune your OpenCL kernels to the target hardware
The optimization steps we’ve followed in this post and their respective performance improvements on Adreno 630 are summarized in Table 2.
In the naïve implementation, the algorithm is memory-bound. By data-packing, vectorized load and data reuse, it becomes more ALU-bound. The techniques introduced in this post can be applied to other image processing applications and, indeed, to any general, memory-bound application.
|2||CL_RGBA, short4, vectorized loads-stores||4||2.7|
|3a||Step 2 with more work per item||8||2.6|
|3b||Step 2 with even more work per work-item||16||1.5|
|4a||LM, LSIZE = (8,16)||4||2.0|
|4b||LM, LSIZE = (8,32)||4||2.0|
|5a||Step 2 with branch operation optimization||4||3.0|
|5b||Step 3a with branch operation optimization||8||2.8|
Table 2: Summary of optimization steps and performance on Adreno 630
Figure 7 shows the comparative performance of all these optimizations across two generations of Adreno GPUs — Adreno 530 and Adreno 630 — with the performance of the naïve implementation on Adreno 530 normalized to 1.
Figure 7: Comparison across Adreno 530 and Adreno 630. Performance of naïve implementation on Adreno 530 is normalized to 1.
Note that on the Adreno 630, the greatest performance boost came from step 5a, whereas on the Adreno 530, the greatest boost came from step 3a.
Thus, the main takeaway is that when the goal is to squeeze the maximal performance out of an OpenCL kernel, you need to tune the kernel according to the target hardware. This means not only tuning for a specific vendor, but also for specific devices provided by the vendor.
If GPU programming is on your radar, then keep an eye out for upcoming blog posts as we continue exploring OpenCL optimization.
Meanwhile, have a look through the Qualcomm® Snapdragon™ Mobile Platform OpenCL General Programming and Optimization Guide for more ideas, and download the Adreno GPU SDK to see how you can accelerate your own algorithms.
Do you have an OpenCL optimization you’d like to see? Send me questions in the comments below or through the Adreno GPU SDK support forum.