OpenCL Optimization: Stop Leaving Compute Cycles on the Table

Friday 9/7/18 01:59pm
Posted By Hongqiang Wang

Co-written by Hongqiang Wang, Raga Ramachandra, and Alex Bourd

Have you started programming on the Qualcomm® Adreno™ GPU yet? For compute-intensive operations, you’ll find cycles in the GPU that you can’t afford to leave on the table.

Your apps can get higher performance with lower power consumption when you optimize your code for specific GPUs. With heterogeneous computing you can offload tasks like image/video processing and machine learning inference from CPU to GPU.

Besides rendering graphics, a GPU that supports general-purpose computing (GPGPU), like the Adreno GPU, is ideal for performing data-parallel computations. It has long supported OpenCL, an open standard designed to help developers exploit heterogeneous computing on desktop and mobile. In this post, we’ll describe the benefits and use cases of OpenCL general programming and optimization on Qualcomm® Snapdragon™ mobile platforms and the Adreno GPU. Then, in upcoming posts, we’ll walk you through code you can use in your own applications.

Some Basics on OpenCL, GPU Programming and Optimization

OpenCL is one of the most commonly used APIs for compute-intensive tasks in linear algebra, image/video processing, searching, physics/biology simulations, data mining, bioinformatics, machine learning and finance. Many software vendors and device manufacturers write their own OpenCL layers to solve problems on the Adreno GPU.

Heterogeneous System using OpenCL

To reap the power savings and performance boost of heterogeneous computing, look for tasks that can benefit from parallel programming and parallel execution. Parallelism is a great strength of OpenCL when, in image processing for example, every pixel in a 4MB image can undergo the same, relatively simple calculation independently of the others.
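To make the per-pixel independence concrete, here is a minimal CPU-only sketch in C. The operation (a saturating brightness offset) and the names `brighten` and `brighten_image` are hypothetical, chosen only for illustration. In an OpenCL C kernel, the loop would disappear and the index `i` would come from `get_global_id(0)`, with one work-item per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel operation: add an offset and saturate to [0, 255].
 * In OpenCL C, this would be the body of the kernel. */
uint8_t brighten(uint8_t value, int offset)
{
    int v = (int)value + offset;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

/* CPU stand-in for the GPU dispatch. Each iteration reads only in[i] and
 * writes only out[i], so the iterations have no dependencies on one another
 * and could run in any order, or all at once as separate work-items. */
void brighten_image(const uint8_t *in, uint8_t *out, size_t n, int offset)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = brighten(in[i], offset);
    }
}
```

Because no iteration depends on the result of another, the same work maps directly onto thousands of GPU work-items. A loop in which `out[i]` depended on `out[i - 1]` would not have this property.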

OpenCL is a mature, open standard accessible to developers with C language programming experience. You use the OpenCL C language to write the kernels containing the tasks you want to run on the GPU. The OpenCL runtime API defines functions that run on the CPU to manage resources and dispatch the kernels.

As for program portability, unless an OpenCL application uses vendor-specific extensions or features, it should run correctly on other platforms. A certification program run by Khronos, the open-standards consortium behind OpenCL, helps ensure that portability.

And, to preserve your investment in last year’s code, OpenCL offers backward compatibility, even providing macros for deprecated APIs.

Performance portability, however, is a different matter.

That’s why we talk about GPU programming and optimization.

Why do I need to optimize? Why can’t I write once and get the same performance on every GPU?

Because OpenCL is a high-level computing standard, an application will run on different hardware, but its performance is unlikely to be the same.

Different hardware vendors have different device architectures, so don’t expect that the OpenCL application you write and optimize for the Adreno 500-series will deliver the same performance on other GPUs. For that matter, the same OpenCL application running on different generations of GPU hardware from the same vendor may vary in performance and require some optimization.

GPU programming offers better performance because it takes you down to the hardware level. The trade-off of working that close to the silicon is that you have to optimize your application again to get similar performance on other vendors’ GPUs. The same holds for any code tuned to specific hardware.

Fortunately, we give you plenty to work with, including our Snapdragon Mobile Platform OpenCL General Programming and Optimization Guide, and we’ll walk you through multiple samples in this blog series.

Is my application a good candidate for OpenCL and GPU optimization?

Like most developers, you’ll probably want to answer that question for yourself through trial and error. Here, though, are several characteristics of a good candidate for OpenCL acceleration on GPU:

  • Appropriate size of data set: In general, the larger the input data set, the better. Other factors being equal, the overhead of shuttling small input data sets between the CPU and GPU tends to offset any performance gains from using OpenCL.
  • High computational complexity: Even an application with a small input data set is a good candidate if it does enough computation per data element to keep the GPU’s many arithmetic logic units (ALUs) busy and approach its peak computing power.
  • Data-parallelism: Suitable applications have a workload that can be subdivided into independent, block-based calculations, such as the max pooling layers in deep neural networks (DNNs).
  • Limited divergent control flow: If the application depends heavily on conditional checks and branching, then running it on the CPU may be more appropriate.
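The first two criteria can be put into rough numbers before you write any OpenCL. The sketch below is a deliberately simple linear cost model in C, not a description of how any particular GPU or driver behaves. The struct, its field names and every value you feed it are assumptions; you would replace them with timings measured on your own device.

```c
#include <stdbool.h>

/* Hypothetical cost parameters, all in microseconds. Measure them yourself. */
typedef struct {
    double fixed_overhead_us;   /* per-dispatch cost, e.g. launch and sync    */
    double transfer_us_per_kb;  /* moving 1 KB between CPU and GPU            */
    double gpu_us_per_kb;       /* GPU compute time per KB of input           */
    double cpu_us_per_kb;       /* CPU compute time per KB of input           */
} offload_costs;

/* Estimated time to process `kb` kilobytes entirely on the CPU. */
double cpu_time_us(const offload_costs *c, double kb)
{
    return c->cpu_us_per_kb * kb;
}

/* Estimated time on the GPU path, including overhead and data movement. */
double gpu_time_us(const offload_costs *c, double kb)
{
    return c->fixed_overhead_us
         + (c->transfer_us_per_kb + c->gpu_us_per_kb) * kb;
}

/* True when the model predicts the GPU path is faster for this input size. */
bool offload_pays_off(const offload_costs *c, double kb)
{
    return gpu_time_us(c, kb) < cpu_time_us(c, kb);
}

/* Input size in KB above which the GPU path wins, or -1.0 if it never does. */
double break_even_kb(const offload_costs *c)
{
    double gain_per_kb = c->cpu_us_per_kb
                       - (c->transfer_us_per_kb + c->gpu_us_per_kb);
    if (gain_per_kb <= 0.0) {
        return -1.0;
    }
    return c->fixed_overhead_us / gain_per_kb;
}
```

With made-up values of 200 µs fixed overhead, 0.5 µs/KB transfer, 0.5 µs/KB GPU compute and 3 µs/KB CPU compute, the model puts the break-even point at 100 KB. Raising the CPU cost per KB, which is what high computational complexity amounts to, pushes the break-even point down, which is why a small but compute-heavy workload can still be worth offloading.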

An OpenCL optimization problem is essentially a problem of how best to use memory and computing power. You’ll tune how your app uses global memory, local memory, registers and caches, and how it takes advantage of computing resources like the ALUs and texture operations.

Next Steps

Whether you’re an old hand at OpenCL optimization or you’re just getting started, you’ll find the use cases and code samples in our upcoming blog posts useful. Look for posts on topics like these:

  • Workgroup size tuning
  • Memory optimization
  • Kernel optimization
  • Optimization examples (e.g., Epsilon filter, Sobel filter)

GPU programming can be a lot of work, so look for upcoming posts designed to help you succeed at it. Meanwhile, have a look at our Adreno GPU SDK, the OpenCL General Programming and Optimization Guide and our white paper Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420.