Snapdragon Neural Processing Engine SDK
Reference Guide
Overview of UDO

Introduction

SNPE provides the ability for users to plug in custom neural network operations that may not be inherently supported by the runtime engine, in the form of User-defined Operations (hereafter referred to as UDO). These could be operations defined in popular training frameworks such as TensorFlow, or custom operations built as framework extensions that are not available in the SNPE SDK. They can be executed natively on any of the supported hardware accelerators for which they are implemented. SNPE provides the infrastructure to execute these operations seamlessly, with little to no overhead compared to executing internally supported operations.

Comparing UDO with UDL

SNPE has existing support for users to include custom operations executed on the CPU at runtime with User-defined Layers, abbreviated as UDL (see User-Defined Layers (UDL) Tutorial). User-defined operations can be considered an enhancement over UDLs for the following reasons:

(i) Integration with improved visibility
UDLs are implemented in ways that make the attributes of the layers completely opaque to SNPE. This requires them to be handled in total isolation from the rest of the layers in the network, in their own subnets. UDOs, on the other hand, are designed to express their attributes to SNPE components while keeping the actual implementation kernels opaque. This allows SNPE to understand properties such as input and output tensor dimensions and to connect UDOs with their neighboring operations in the network, resulting in optimal network partitions.

(ii) Extended targets for execution
The UDL mechanism allows users to register a callback that switches the execution context at runtime from SNPE to their proprietary implementation of the custom layer on the CPU. SNPE partitions the network model into subnets that separate native layers from user-defined ones. When other subnets are scheduled to execute on hardware accelerators such as the DSP or the GPU, this mechanism incurs significant overhead in switching the execution context back to the ARM CPU, the only target on which a UDL can run. This is illustrated in the figure below with the example of a network scheduled to run on the GPU:

UDL


The UDO mechanism improves on this approach by allowing users to integrate their custom operations on any supported hardware accelerator, compiling for specific targets such as the GPU or DSP in addition to the ARM CPU. This allows SNPE to construct subnets that encompass UDOs without falling back to the CPU by default, which reduces context-switching overhead and lets users run their operations on the desired accelerators for superior performance. This is illustrated in the figure below with the example of a network scheduled to run on the GPU, with the UDO also compiled for the GPU:

UDO


(iii) Additional Training Frameworks supported
UDL is supported only with Caffe-based networks. UDOs are supported for TensorFlow and ONNX models, with support for Caffe to be added soon.

Anatomy of a UDO package

SNPE allows users to provide UDO implementations in the form of dynamic libraries that can be queried, loaded, and exercised to execute inference using kernels defined within them. SNPE promotes the notion of a 'UDO package', with which a user can easily express the association between the different components of a UDO. This notion is central to all the tools that enable users to create UDO packages for use in network inference. Note, however, that at runtime SNPE still interfaces directly with the individual UDO libraries, not with the UDO package construct. Users are therefore free to build standalone libraries without being strictly bound to the notion of a package.
The figure below illustrates the concept of a UDO package:

UDO


As seen in the figure, a UDO package consists of a registration component and an implementation component. These are usually expressed separately, with one registration library and a set of implementation libraries, one for each hardware accelerator for which an implementation kernel is available. Users can optionally build both components into a single library if they so wish.

The registration library consists of methods that specify all user-defined operations and the hardware cores they are designed for. It also contains methods that allow operations to be validated for sanity at network creation time. The registration library is loaded and executed on the ARM CPU.
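As a sketch, the registration library's two roles (declaring the package's operations and validating op instances) can be pictured as follows. All type and function names below are illustrative stand-ins, not the actual SNPE UDO API; the real C signatures live under $SDK_ROOT/share/SnpeUdo/include/SnpeUdo.

```c
#include <string.h>

/* Illustrative stand-ins for the registration-side information. */
typedef enum { CORE_CPU, CORE_GPU, CORE_DSP } CoreType;

typedef struct {
    const char *op_type;         /* e.g. "MyCustomSoftmax" */
    unsigned    supported_cores; /* bitmask of cores with a kernel */
} OpRegInfo;

typedef struct {
    const char      *package_name;
    const OpRegInfo *ops;
    int              num_ops;
} RegLibraryInfo;

static const OpRegInfo kOps[] = {
    { "MyCustomSoftmax", (1u << CORE_CPU) | (1u << CORE_GPU) },
};

/* What a registration entry point conceptually returns: which ops the
 * package provides and which accelerators each can run on. */
const RegLibraryInfo *get_reg_info(void) {
    static const RegLibraryInfo info = { "MyUdoPackage", kOps, 1 };
    return &info;
}

/* A validation hook invoked at network creation time: sanity-check an
 * op instance before SNPE builds the network around it. */
int validate_op(const char *op_type, int num_inputs, int num_outputs) {
    if (strcmp(op_type, "MyCustomSoftmax") != 0) return -1;
    return (num_inputs == 1 && num_outputs == 1) ? 0 : -1;
}
```

Because registration only describes operations and never executes kernels, it runs entirely on the ARM CPU regardless of which cores the kernels target.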

The hardware-specific implementation libraries expose several other methods that implement operation instance creation, execution, profiling, and destruction. These are implemented with programming constructs supported by the corresponding software platforms, such as OpenCL for the GPU and the Hexagon-NN SDK for the DSP. While core-specific implementation files may differ entirely in source, they are all required to interface with SNPE through a set of C APIs defined in $SDK_ROOT/share/SnpeUdo/include/SnpeUdo. The complete details on these APIs can be obtained from C++ API.
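The create/execute/destroy lifecycle an implementation library exposes can be sketched as below. This is a minimal CPU stand-in with hypothetical names, not the SNPE C API; a real GPU or DSP library would dispatch to OpenCL or Hexagon-NN inside the execute step.

```c
#include <stdlib.h>

/* Per-instance state: the tensor bindings this op operates on. */
typedef struct {
    const float *in;
    float       *out;
    int          len;
} OpInstance;

/* Create: allocate and populate per-instance state. */
static OpInstance *op_create(const float *in, float *out, int len) {
    OpInstance *op = malloc(sizeof *op);
    op->in = in; op->out = out; op->len = len;
    return op;
}

/* Execute: run the kernel. This toy kernel doubles each element;
 * a real implementation would launch work on the target core here. */
static void op_execute(OpInstance *op) {
    for (int i = 0; i < op->len; ++i)
        op->out[i] = 2.0f * op->in[i];
}

/* Destroy: release per-instance state. */
static void op_destroy(OpInstance *op) { free(op); }
```

Keeping creation separate from execution lets SNPE build an op instance once at network initialization and then execute it repeatedly per inference without re-allocating state.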

UDO workflow

SNPE recommends the following workflow in developing and integrating a UDO into the runtime:

UDO


The first step in the workflow is to identify the operations in the model that need to be expressed as user-defined operations and to describe their attributes through a configuration file. The format and contents of this file are described in Defining a UDO.
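The fragment below only illustrates the kind of information such a configuration captures: the package name, the op type, its tensor shapes, and the cores it targets. The key names and layout here are indicative, not the normative schema; refer to Defining a UDO for the actual format.

```json
{
  "UdoPackage": {
    "name": "MyUdoPackage",
    "operators": [
      {
        "type": "MyCustomSoftmax",
        "inputs":  [ { "name": "input",  "dims": [1, 1, 1, 100] } ],
        "outputs": [ { "name": "output", "dims": [1, 1, 1, 100] } ],
        "cores": [ "CPU", "GPU" ]
      }
    ]
  }
}
```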

The next set of steps produces the components of a UDO package by creating source files for the UDO kernels and compiling them with the appropriate toolchains to generate dynamic libraries specific to hardware cores such as the GPU and DSP. SNPE provides a tool called snpe-udo-package-generator that creates common skeleton code for interfacing with the SNPE UDO APIs and leaves placeholders for users to fill in the kernel implementation. It also generates makefiles for common targets such as x86 and Android, and for the runtimes per target specified in the config file.
For more details on the package generation refer to Creating a UDO Package. For details on compiling the UDO package for a specific runtime refer to Compiling a UDO package.

The config file created in the first step is also used by the SNPE model conversion tools, along with the actual trained model, to interpret the user-defined operations using the definitions in the file. The resulting DLC files can then be inspected with tools like snpe-dlc-info to probe the attributes of the UDOs in the model, which is not possible with opaque representations such as UDLs. For details on creating (and optionally quantizing) DLCs with UDOs refer to Preparing a model with UDO. Optionally, models with UDOs can be quantized using the SNPE quantization tools for use with fixed-point runtimes such as the DSP. The quantizer tool estimates quantization ranges for activations from all layers in the network, including UDOs. Since the tool runs offline on an x86 host machine, a CPU implementation of the UDO is required in order to perform inference through the entire network. This path is illustrated with dotted lines in the workflow diagram. Refer to Quantizing a DLC with UDO for details on the quantization process.
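Conceptually, the quantizer runs the whole network in floating point and records each layer's activation range, which is why a CPU kernel must exist for every UDO. A minimal sketch of that range estimation (illustrative only, not the SNPE quantizer's actual code):

```c
#include <float.h>

/* The observed dynamic range of one layer's activations. */
typedef struct { float min, max; } QuantRange;

/* Record the min/max of an activation buffer, as a quantizer would
 * while running floating-point inference (UDOs included) on the host. */
static QuantRange estimate_range(const float *act, int len) {
    QuantRange r = { FLT_MAX, -FLT_MAX };
    for (int i = 0; i < len; ++i) {
        if (act[i] < r.min) r.min = act[i];
        if (act[i] > r.max) r.max = act[i];
    }
    return r;
}

/* An 8-bit step size derived from the observed range. */
static float scale_for(QuantRange r) {
    return (r.max - r.min) / 255.0f;
}
```

Without a CPU kernel for the UDO, the host inference pass cannot produce the activations downstream of it, so no ranges can be estimated for those layers.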

The final step in this workflow is to actually execute network models with UDOs. SNPE applications use the UDO package to register UDO implementations within the process that runs inference on the selected network models. It should be noted that these UDOs can be exercised by multiple instances of SNPE simultaneously without race conditions, which increases the overall throughput of network inference. For more details on the UDO package registration process refer to Running a model with UDO.
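One way a kernel stays safe under such concurrent use is to keep it re-entrant: all state lives in the per-call arguments, so two SNPE instances invoking it at once cannot interfere. A minimal sketch of this property (illustrative, using POSIX threads to stand in for concurrent SNPE instances):

```c
#include <pthread.h>

/* Per-call work description; the kernel touches nothing shared. */
typedef struct { const float *in; float *out; int len; } Work;

/* A re-entrant kernel: no globals, no static mutable state. */
static void *kernel(void *arg) {
    Work *w = arg;
    for (int i = 0; i < w->len; ++i)
        w->out[i] = w->in[i] + 1.0f;
    return NULL;
}
```

Two threads can run `kernel` on separate `Work` descriptors at the same time, each writing only to its own output buffer.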

If the DSP implementation library of the UDO is not signed for execution on a signed process domain (the default for an SNPE application), the application must request the use of an unsigned process domain. Unsigned process domains apply only to the DSP target and allow SNPE to use unsigned UDO implementation libraries. To see how to utilize an unsigned process domain with an SNPE application, refer to Running a model with UDO.