Optimizing Image Processing Algorithms With Halide

Wednesday 12/9/20 09:00am
Posted By Hsin-I Hsu
  • Up0
  • Down0

Qualcomm products mentioned within this post are offered by
Qualcomm Technologies, Inc. and/or its subsidiaries.

Think about what it takes to process a 2D image. For example, performing a basic blur might involve iterating over an image’s pixels one row at a time, and then computing averages between contiguous pixel values. An algorithm hand coded in C++ might look something like this:

  void box_filter_3x3(const Image &in, Image &blury)
    Image blurx(in.width(), in.height()); //temporary array
    for(in y=0; y<in.height; y++)
      for(int x=0; x<in.width’ ++x)
        blurx(x,y)=(in(x-1, y) + in(x,y) + in(x+1, y))/3;
    for(int y=0; y<in.height(); y++)
      for(int x=0; x<in.width(); x++)
        blury(x,y)=(blurx(x,y-1) + blurx(x,y) + blurx(x, y+1))/3;

source: https://youtu.be/3uiEyEKji0M

The function above takes in a reference to a source image object to blur, and a target image object to write the blurred result to. The code first allocates a temporary image then proceeds to iterate over each row of pixels. It then computes the resulting blurred pixel values by averaging the current pixel with the neighboring pixel to the left and right. The function then uses this result to do a similar blurring pass amongst concurrent vertical pixels, and then writes the output to the target image.

This algorithm has a number of problems. First of all, the algorithm itself (i.e., how blurring is achieved) is interleaved with schedules (i.e., when the hardware will perform the various reads and writes), and as a result, relies on the compiler and underlying hardware to carry out the operations.

The code is also not portable and may execute differently across platforms and hardware. Furthermore, the code can be hard to understand and debug, and changing it may affect the algorithm, the schedules, or both. Given these issues, it may take a long time for the developer to iterate and find the optimal solution, if they find it at all. Optimization also requires inherent knowledge of how the underlying hardware works (e.g., number of available threads, cache sizes, etc.).

Solving these issues was a goal of Halide, an open-source programming language for writing high performance image processing algorithms.

In this blog we’ll take a look at how an image processing operation like this blurring example, can be rewritten using Halide, more easily processed and optimized with Halide, and how developers can take advantage of Halide on devices powered by Qualcomm® Snapdragon™ mobile platforms.

Intro to Halide

Halide is an open source, domain specific language (DSL) embedded in C++ through which developers can write C++ algorithms to perform image processing. The Halide team achieved this by making heavy use of C++ operator overloading to create a rich set of expressions and functions. The result is a functional DSL syntax for programmatically setting up in-memory image processing pipelines.

Perhaps the most significant aspect of Halide’s pipeline-oriented architecture is its inherent separation of algorithms from schedules. Through this design, developers can focus on how to implement their algorithm in isolation from how it should be best scheduled for execution, to optimize for locality, parallelism, and ultimately performance.

An image processing C++ pipeline using Halide must first be compiled. This is done by the Halide compiler that uses an LLVM compiler to generate a platform-specific runtime executable. Halide offers two modes to perform this compilation:

  • Ahead of Time (AOT): Halide is first compiled to a header and object file and then statically linked by the developer into their app.
  • Just in Time (JIT): performs compilation at runtime (i.e., while the app is running) but requires that the Halide compiler be hosted on the target platform.

AOT is generally preferred because it avoids the overhead of runtime compilation, uses a smaller binary (because developers don’t need to host the whole Halide compiler), and includes its own C runtime. AOT is therefore commonly used for mobile platforms.

The main benefit of the Halide compiler for developers is that it generates a platform-specific runtime, capable of optimizing image processing pipelines for specific platforms. This also reduces the need for developers to have specialized knowledge and skills for that platform’s hardware to make optimizations. Instead the developer can focus on high-level algorithmic design and schedules.

Blurring with Halide

Using Halide, the blurring algorithm example presented above can be easily rewritten and optimized in many different ways. For example, the following code shows Halide processing groups of pixels in parallel threads:

  Func box_filter3x3(Func in)
    Func blurx, blurry;
    Var x, y, xi, yi;
    // The algorithm
    blurx(x,y)=(in(x-1, y) + in(x, y) + in(x+1, y))/3;
    blury(x,y)=(blurx(x,y-1) + blurx(x,y) + blurx(x, y+1))/3;
    //The schedule
    blury.tile(x, y, xi, yi, 256, 32).
      vectorize(xi, 8).parallel(y);
    blurx.compute_at(blur_y, x).vectorize(x, 8);
    return blury;

source: https://youtu.be/3uiEyEKji0M

The function takes in an input function capable of returning input pixel values. The code then defines new functions for blurring contiguous horizontal and vertical pixels. These functions define the algorithm for performing blurs.

The code then defines how to schedule the algorithm for execution. Here, tiles of 256x32 pixels are processed horizontally, as vectors of 8 pixels at a time, using parallel threads across the rows of pixels. This is followed by a blurring of the intermediate output for vertical pixels, also using 8-pixel vectors.

Optimization is a trade-off between how much storage (e.g., cache levels and RAM) is used and how granular the computation should be (i.e., how many pixels are processed at a time), with the optimal solution striking just the right balance. Of course the blurring example is a relatively simple one, with just two stages. In practice, image processing algorithms can involve multiple, interdependent stages, which should provide an indication as to just how significant, complex, and helpful the Halide solution is for developers.

In the Halide example above, it’s easy to see how the algorithm can be easily changed independently from schedules, and more generally, how much easier it is to figure out what the code does. Also, using Halide can result in a performance increase of an order of magnitude or higher than with hand-optimized code.

A complete list of Halide’s operations for building a pipeline can found in the Halide API documentation.

Halide on Snapdragon Platforms

For developers implementing image processing on devices built using mobile platforms, Qualcomm Technologies, Inc. (QTI) provides a customized version of Halide that can take advantage of the Qualcomm® Hexagon™ DSP. This version of Halide is available as part of the Hexagon SDK or as a standalone binary installation found on our CreatePoint site.

It’s important to note the QTI version differs from the standard version available on Halide’s GitHub site and is required for hardware-accelerated image processing on Hexagon. The QTI Halide distribution also only supports AOT compilation as this is the most suitable for mobile and embedded systems.

After you install the Hexagon SDK, the Halide files can be found in:


Check check out the Examples subdirectory that includes a number of examples for processing images on Hexagon using Halide. Also be sure to check out the docs subdirectory that includes information on how to set up and develop with our Halide distribution on Hexagon. This information is also available on CreatePoint.


Halide gives developers a rich DSL for processing images that separates the algorithm from schedules. The code is modular, portable, and often rivals or exceeds the performance of hand-coded C++ code. Perhaps most importantly, developers can easily iterate by trying out different algorithms and schedulers to arrive at an optimal solution much quicker than with hand-coded C++ code.

For more information about Halide, we recommend checking out the resources listed below. And stay tuned to QDN for exciting future advancements planned from our team to help improve efficiency!

Qualcomm Snapdragon and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.