What are the benefits of the Qualcomm® Snapdragon™ Heterogeneous Compute SDK?
Snapdragon Heterogeneous Compute SDK (HetCompute) is engineered to support a parallel programming model that allows programmers to express the concurrency in their applications. HetCompute’s powerful abstractions are designed to ease the burden of parallel programming through a design that builds on dynamic concurrency from the ground up. At a high level, HetCompute provides a set of parallel programming patterns that capture many of the existing parallel building blocks, and adds dataflow and work cancellation as first-class primitives that are optimized to improve programmer productivity.
It integrates heterogeneous execution into a concurrent task graph, and removes the burden of managing data transfers and explicit data copies between kernels executing on different devices. At a low level, HetCompute provides cutting-edge algorithms for work stealing and power optimization that allow it to hide hardware idiosyncrasies, enabling the development of portable applications. In addition, HetCompute is designed to support dynamic mapping to heterogeneous execution units. Moreover, expert programmers can take charge of execution through a carefully designed system of attributes and directives that provide the runtime system with additional semantic information about the patterns, tasks, and buffers that HetCompute uses as building blocks.
HetCompute embeds the programming model in C++ and provides a C++ library API. C++ is a familiar language for a large number of performance-oriented programmers, thus helping developers pick up the abstractions quickly. C++ embedding also allows incremental development of existing applications because HetCompute interoperates with existing libraries, such as pthreads and OpenGL.
How should I write a HetCompute application?
Integrating HetCompute into your application is quite easy, as long as you understand the principles of parallel programming and have a good idea of where the concurrency in your application lies. The fundamental design goal of HetCompute is to make it easy to express a parallel algorithm and to incrementally build a parallel application.
The designed workflow of a HetCompute application from start to completion is as follows:
- Identify the algorithm to be parallelized and design a parallel version of the algorithm.
- Encode the algorithm using HetCompute abstractions: If the algorithm matches one of the HetCompute patterns, use the pattern directly to leverage the speedups. More complex applications may require multiple patterns, or they may exhibit parallelism that does not match any existing pattern. In that case, use HetCompute’s task and group building blocks to partition the algorithm into tasks, set dependencies between the tasks (building the execution task graph), and launch the tasks for execution. Also consider partitioning the data for concurrent access.
- Patterns and tasks are interoperable, as the HetCompute library maps patterns to tasks. Thus, a HetCompute application consists of a forest of DAGs. The runtime system schedules the tasks once their dependencies are satisfied.
- HetCompute task graphs execute across different devices when the programmer provides device kernels. To execute on the GPU, kernels are written in OpenCL; to run on the DSP, kernels are written in C99. These kernels are integrated into the task graph just like tasks designed for the CPU.
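As a rough illustration of the task-and-dependency step above, the following sketch mimics a tiny task graph in standard C++ using std::async. This is a conceptual analogue only, not the HetCompute API: two "tasks" sum the halves of an array concurrently, and a join step runs only after both predecessors complete.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Two independent tasks sum halves of the input; the join step
// (lo.get() + hi.get()) depends on both predecessors completing.
long sum_with_tasks(const std::vector<int>& data) {
    auto mid = data.begin() + data.size() / 2;
    auto lo = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), mid, 0L);
    });
    auto hi = std::async(std::launch::async, [&] {
        return std::accumulate(mid, data.end(), 0L);
    });
    // Blocking on both futures enforces the dependency ordering.
    return lo.get() + hi.get();
}
```

In HetCompute the same structure would be expressed with tasks and explicit dependencies, and the runtime scheduler (rather than blocking calls in the caller) would enforce the ordering.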
Which parallel algorithms exist in HetCompute?
A fast way to help you build a HetCompute application is by using the HetCompute patterns. These patterns include:
- pfor_each: Parallel iteration where the same function is applied to different pieces of data.
- preduce: Parallel reduction processes a list of elements using a join function object and computes a return value.
- ptransform: Parallel data transformation, where a given function object is applied over an input range and the results are stored in another range or in place.
- pscan: HetCompute implements a Sklansky-style, in-place parallel prefix operation.
- psort: Parallel sorting of data using a given input range.
- pdivide_and_conquer: A parallel application of divide and conquer for use with algorithms such as quicksort and tree building/traversal.
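To make the pattern idea concrete, here is a minimal standard-C++ analogue of a pfor_each-style pattern. The helper parallel_for_each below is hypothetical and illustrative only (it is not the HetCompute API): it applies a function to every index of a range, dividing the range into contiguous slices across a fixed number of threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative pfor_each analogue: apply fn to each index in [first, last),
// with each worker thread handling one contiguous slice of the range.
template <typename Fn>
void parallel_for_each(std::size_t first, std::size_t last, Fn fn,
                       unsigned nthreads = 4) {
    std::size_t n = last - first;
    std::size_t chunk = (n + nthreads - 1) / nthreads;  // ceil division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = first + t * chunk;
        std::size_t hi = std::min(last, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([=] {
            for (std::size_t i = lo; i < hi; ++i) fn(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

The real pattern additionally handles load balancing via work stealing, which a fixed static split like this cannot do.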
What mechanism exists for tuning parallel execution?
The default pattern implementations should cover the majority of use cases. However, no single implementation is the best fit for all workload types. For that reason, HetCompute offers programmers a collection of commonly used algorithm parameters to serve as the performance tuning knobs (tuner).
- set_max_doc: Maximum degree of concurrency.
- set_chunk_size: Work stealing granularity.
- set_static/set_dynamic: Simple static chunking or dynamic work-stealing algorithm for parallelization.
- set_serial: Serial execution, convenient for performance comparisons and calculating speedup.
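The chunk-size knob is easiest to picture as a range splitter. The sketch below (illustrative only; make_chunks is a hypothetical helper, not part of the HetCompute tuner API) shows what a chunk size controls: a larger chunk size yields fewer, coarser work units for the scheduler to distribute or steal, while a smaller one gives finer load balancing at higher scheduling overhead.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split the iteration range [0, n) into chunks of at most chunk_size
// iterations each. chunk_size must be > 0.
std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t n, std::size_t chunk_size) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t lo = 0; lo < n; lo += chunk_size)
        chunks.emplace_back(lo, std::min(n, lo + chunk_size));
    return chunks;
}
```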
How can I chain parallel algorithms together?
The HetCompute Pipeline pattern supports the pipeline parallel programming model, which is often used in streaming applications.
The HetCompute Pipeline API allows the programmer to describe a linear chain of processing stages such that the output of each stage is the input of the next. The programmer associates a C++ stage function with each stage, and can specify a basic C++ type or a user-defined data type for handing over data between stages. Once launched, the pipeline repeatedly executes the stage functions over a data stream. A successor stage starts executing on a data unit after its predecessor stage finishes processing the same unit. While the stages of the pipeline process each data unit sequentially (from the first stage to the last), they can execute on different data units at the same time.
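The following is a minimal two-stage pipeline sketch in standard C++ (a conceptual analogue, not the HetCompute Pipeline API). Stage 1 squares each input and hands it to stage 2 over a shared queue; stage 2 adds one. Each data unit flows through the stages in order, while the two stages run concurrently on different units.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Two-stage pipeline: stage 1 computes x*x and pushes to a queue;
// stage 2 pops, computes +1, and collects results in order.
std::vector<int> run_pipeline(const std::vector<int>& input) {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::vector<int> output;

    std::thread stage1([&] {
        for (int x : input) {
            { std::lock_guard<std::mutex> lk(m); q.push(x * x); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });
    std::thread stage2([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty() && done) break;  // stream exhausted
            int v = q.front(); q.pop();
            lk.unlock();
            output.push_back(v + 1);  // second stage function
        }
    });
    stage1.join();
    stage2.join();
    return output;
}
```

The HetCompute Pipeline generalizes this to an arbitrary linear chain of stages with typed hand-off between them, without the programmer writing any queues or locks.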
What if I want fine-grained control of my parallel execution?
HetCompute programmers can partition their applications into independent units of work that can be executed asynchronously on the CPU, the GPU, or the Hexagon DSP. These units of work are called tasks.
In HetCompute, tasks can have predecessors and successors, forming directed acyclic task graphs. The predecessors of a task t are the tasks that must complete before t can execute. Conversely, the successors of a task t are the set of tasks that will execute only after t has completed its execution. Programmers can specify a predecessor/successor relationship between two tasks by creating a dependency between them.
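The predecessor/successor ordering can be sketched with a small diamond-shaped graph. This is a conceptual analogue in standard C++, using futures as the dependency mechanism (HetCompute instead expresses this with explicit task dependencies and its own scheduler): task a runs first, b and c depend on a, and d depends on both b and c.

```cpp
#include <future>
#include <mutex>
#include <vector>

// Diamond DAG: a -> {b, c} -> d. Each task records its id when it runs,
// and each dependency is enforced by waiting on predecessor futures.
std::vector<int> run_diamond() {
    std::vector<int> order;
    std::mutex m;
    auto note = [&](int id) {
        std::lock_guard<std::mutex> lk(m);
        order.push_back(id);
    };

    auto a = std::async(std::launch::async, [&] { note(0); });
    a.wait();  // b and c are successors of a
    auto b = std::async(std::launch::async, [&] { note(1); });
    auto c = std::async(std::launch::async, [&] { note(2); });
    b.wait(); c.wait();  // d is a successor of both b and c
    note(3);
    return order;
}
```

Note that here the caller enforces ordering by blocking; in a task-graph runtime such as HetCompute, the scheduler launches each task automatically once its predecessors complete.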
How can I decide where my task will run?
A HetCompute task contains work that can be executed on any device in a system: the CPU, the GPU, or the Hexagon DSP. HetCompute tasks use kernels to achieve this computational heterogeneity. A kernel contains the computation (that is, the actual device code) that a task executes. This could be CPU code, GPU code, or Hexagon DSP code, resulting in three different types of kernels.
The affinity APIs allow the programmer to change execution properties of program statements (arbitrary functions), HetCompute tasks, and device threads. These properties include:
- location: the CPUs where the program constructs should run.
- pinning: whether HetCompute device threads are bound to specific cores or may migrate freely among them (also known as thread binding).
- mode: override local affinity settings.
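At the OS level, pinning comes down to restricting a thread's CPU mask. The Linux-only sketch below shows the underlying mechanism that a location/pinning setting controls (this is raw pthreads, not the HetCompute affinity API, which abstracts it away):

```cpp
#include <pthread.h>
#include <sched.h>

// Bind the calling thread to a single CPU core so the OS scheduler
// will not migrate it. Returns true on success. Linux/glibc only.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```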
How do I share data between tasks?
Tasks on the CPU, GPU, and DSP can share data using a HetCompute buffer. A HetCompute buffer is a contiguous array of a user-defined data type T. The buffer is reference-counted: the HetCompute runtime deallocates the buffer when there are no more buffer pointers pointing to it.
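The reference-counting behavior is analogous to std::shared_ptr in standard C++, as the sketch below shows (a conceptual analogue only; the names buffer and make_buffer are hypothetical, not the HetCompute buffer API): the underlying array is freed only when the last pointer to it goes away.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Ref-counted "buffer" analogue: a shared, contiguous array of floats.
// The storage is deallocated when the last buffer handle is destroyed.
using buffer = std::shared_ptr<std::vector<float>>;

buffer make_buffer(std::size_t n) {
    return std::make_shared<std::vector<float>>(n, 0.0f);
}
```

Unlike this sketch, a real HetCompute buffer also coordinates data movement between host memory and device memory when tasks on different devices access it.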
What is the difference between the Qualcomm Symphony SDK and the Snapdragon Heterogeneous Compute SDK?
The Symphony SDK is being deprecated as a power management API and is being replaced with the Snapdragon Heterogeneous Compute SDK. Migrating from Symphony SDK 1.x to the Heterogeneous Compute SDK is fairly straightforward, as the interface has largely remained the same.
What are the system requirements?
- Development OS: Windows 7 or later, Mac OS X 10.10 (Yosemite) or later, or Ubuntu 14.04 or later.
- Android: Android 6.0 (Marshmallow) and Android NDK r13b or later.
- Snapdragon 425/430/435
- Snapdragon 630/650/652/653/660
- Snapdragon 808/810/820/821/835/845