NOTICE: Qualcomm Snapdragon Heterogeneous Compute is in the process of being sunsetted from Qualcomm Developer Network. The SDK currently works with Snapdragon 855 and previous chipsets, but there is no planned support for future chip releases. If you have development needs around heterogeneous computing, we encourage you to explore using our Qualcomm Hexagon SDK. The Heterogeneous Compute pages will still be available until Friday, December 4, but please note our Heterogeneous Compute forums will not be actively monitored during this sunsetting process.

What are the benefits of the Qualcomm® Snapdragon™ Heterogeneous Compute SDK?

Snapdragon Heterogeneous Compute SDK (HetCompute) is engineered to support a parallel programming model that allows programmers to express the concurrency in their applications. HetCompute’s powerful abstractions are designed to ease the burden of parallel programming through a design that builds on dynamic concurrency from the ground up. At a high level, HetCompute provides a set of parallel programming patterns that capture many of the existing parallel building blocks, and adds dataflow and work cancellation as first-class primitives that are optimized to improve programmer productivity.

It integrates heterogeneous execution into a concurrent task graph, and removes the burden of managing data transfers and explicit data copies between kernels executing on different devices. At a low level, HetCompute provides cutting-edge algorithms for work stealing and power optimization that hide hardware idiosyncrasies, enabling the development of portable applications. In addition, HetCompute is designed to support dynamic mapping to heterogeneous execution units. Moreover, expert programmers can take charge of the execution through a carefully designed system of attributes and directives that provide the runtime system with additional semantic information about the patterns, tasks, and buffers that HetCompute uses as building blocks.

HetCompute embeds the programming model in C++ and provides a C++ library API. C++ is a familiar language for a large number of performance-oriented programmers, thus helping developers pick up the abstractions quickly. C++ embedding also allows incremental development of existing applications because HetCompute interoperates with existing libraries, such as pthreads and OpenGL.

How should I write a HetCompute application?

Integrating HetCompute into your application is quite easy, as long as you understand the principles of parallel programming and have a good idea of where the concurrency is in your application. The fundamental design goal of HetCompute is to easily express a parallel algorithm and incrementally build a parallel application.

The designed workflow of a HetCompute application from start to completion is as follows:

  • Identify the algorithm to be parallelized and design a parallel version of the algorithm.
  • Encode the algorithm using HetCompute abstractions: If the algorithm matches one of the HetCompute patterns, use the pattern directly to leverage the speedups. More complex applications will require multiple patterns, or may exhibit parallelism that does not match one of the existing patterns. In that case, use HetCompute’s task and group building blocks to partition the algorithm into tasks, set dependencies between the tasks (building the execution task graph), and launch the tasks for execution. Also consider partitioning the data for concurrent access.
  • Patterns and tasks are interoperable, as the HetCompute library maps patterns to tasks. Thus, a HetCompute application consists of a forest of DAGs. The runtime system schedules the tasks once their dependencies are satisfied.
  • HetCompute task graphs execute across different devices when the programmer provides device kernels. To execute on the GPU, write kernels in OpenCL; to run on the DSP, write kernels in C99. These kernels are integrated into the task graph just like other tasks that are designed for the CPU.
(Figure: HetCompute Workflow)

Which parallel algorithms exist in HetCompute?

A fast way to help you build a HetCompute application is by using the HetCompute patterns. These patterns include:

  • pfor_each: Parallel iteration where the same function is applied to different pieces of data.
  • preduce: Parallel reduction processes a list of elements using a join function object and computes a return value.
  • ptransform: Parallel data transformation where a given function object is applied to an input range and the result is stored in another range, or in-place.
  • pscan: HetCompute implements a Sklansky-style, in-place parallel prefix operation.
  • psort: Parallel sorting of data using a given input range.
  • pdivide_and_conquer: A parallel application of divide and conquer for use with algorithms such as quicksort and tree building/traversal.

What mechanism exists for tuning parallel execution?

The default pattern implementations should cover the majority of use cases. However, no single implementation is the best fit for all workload types. For that reason, HetCompute offers programmers a collection of commonly used algorithm parameters to serve as the performance tuning knobs (tuner).

  • set_max_doc: Maximum degree of concurrency.
  • set_chunk_size: Work stealing granularity.
  • set_static/set_dynamic: Simple static chunking or dynamic work-stealing algorithm for parallelization.
  • set_serial: Serial execution, convenient for performance comparisons and calculating speedup.

How can I chain parallel algorithms together?

The HetCompute Pipeline pattern supports the pipeline parallel programming model, which is often used in streaming applications.

The HetCompute Pipeline API allows the programmer to describe a linear chain of processing stages such that the output of each stage is the input of the next. The programmer associates a C++ stage function with each stage, and can specify a basic C++ type or a user-defined data type for handing data between stages. Once launched, each pipeline stage repeatedly executes its stage function over a data stream. A successor stage starts executing on one data unit after its predecessor stage finishes processing the same unit. While the stages in the pipeline process each data unit sequentially (from the first stage to the last), they can process different data units at the same time.

What if I want fine grained control of my parallel execution?

HetCompute programmers can partition their applications into independent units of work that can be executed asynchronously in the CPU, the GPU or the Qualcomm® Hexagon™ DSP. These units of work are called tasks.

In HetCompute, tasks can have predecessors and successors, forming directed acyclic task graphs. The predecessors of a task t are the tasks that must complete before t can execute. Conversely, the successors of a task t are the set of tasks that will execute only after t has completed its execution. Programmers can specify a predecessor/successor relationship between two tasks by creating a dependency between them.

How can I decide where my task will run?

A HetCompute task contains work that can be executed on any device in a system: the CPU, the GPU, or the Hexagon DSP. HetCompute tasks use kernels to achieve this computational heterogeneity. A kernel contains the computation (the actual device code) that a task executes. This could be CPU code, GPU code, or Hexagon DSP code, resulting in three different types of kernels.

The affinity APIs allow the programmer to change execution properties of program statements (arbitrary functions), HetCompute tasks, and device threads. These properties include:

  • location: the CPUs where the program constructs should run.
  • pinning: whether HetCompute device threads are bound to specific cores or may migrate freely among them (also known as thread binding).
  • mode: override local affinity settings.

How do I share data between tasks?

Tasks on the CPU, GPU, and DSP can share data using a HetCompute buffer. A HetCompute buffer is a contiguous array of a user-defined data-type T. The buffer is ref-counted: the HetCompute runtime will deallocate the buffer when there are no more buffer pointers pointing to it.

What is the difference between the Qualcomm Symphony SDK and the Snapdragon Heterogeneous Compute SDK?

The Symphony SDK is being deprecated as a power management API and is being replaced with the Snapdragon Heterogeneous Compute SDK. Migrating from Symphony SDK 1.x to the Heterogeneous Compute SDK is a fairly straightforward task, as the interface has largely remained the same.

What are the system requirements?

  • Development OS: Windows 7 or later, Mac OS X 10.10 (Yosemite) or later, or Ubuntu 14.04 or later.
  • Android: Android 6.0 (Marshmallow) and Android NDK r13b or later.
  • Processor:
    • Snapdragon 425/430/435
    • Snapdragon 630/650/652/653/660
    • Snapdragon 808/810/820/821/835/845

Qualcomm Snapdragon and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.