Running Inference Using a Pre-trained Neural Network

Tuesday 6/25/19 09:00am
Posted By Felix Baum

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

In a previous artificial intelligence blog, AI Machine Learning Algorithms – How a Neural Network Works, we looked at the basics of how a neural network is structured with layers and how such a network is trained. And while there are many steps and a lot to think about, thankfully there are many AI frameworks and pre-built networks out there that can do a lot of the heavy lifting for you.

So now that we’ve run through how a neural network functions, let’s explore how a neural network that was generated from an AI framework can be run on a Snapdragon® mobile platform.

The Snapdragon platform is well suited for running inference on neural networks at the edge because it offers developers the flexibility to run inference on the core that best meets their performance and power requirements: the Qualcomm® Hexagon™ DSP, Qualcomm® Adreno™ GPU, or Qualcomm® Kryo™ CPU. Moreover, it provides support for neural networks created in two popular frameworks: Caffe2 and TensorFlow. It also supports the open ONNX format, which means it can run neural networks trained by any other framework that exports to ONNX.

This is possible thanks to the tools and APIs provided by the Qualcomm® Neural Processing SDK. The overall workflow involves:

  • incorporating our Neural Processing SDK into an app
  • converting the neural network to the Deep Learning Container (DLC) format
  • loading and running the neural network using our Neural Processing SDK API

For more information on our Neural Processing SDK, we welcome you to review our documentation.

The following diagram illustrates the entire workflow including the generation of a trained neural network:

Let’s take a look at each phase in more detail.

Incorporating our Neural Processing SDK

The steps to set up and use the SDK development environment on Ubuntu[1] are documented in detail in our Neural Processing SDK Reference Guide. The general steps are:

  • Set up the AI framework (Caffe2, TensorFlow, or ONNX)
  • Set up Python
  • Install the Android NDK (if building native CPP applications)
  • Install the Android SDK (required to build the Android APK included in the SDK)
  • Get the SDK and unzip it
  • Run the dependencies check for Python packages
  • Run the set-up script(s)
  • Set up the appropriate environment variables

Once this has been set up, developers can then load and run the neural network from code to perform inference.

Converting the Output to DLC

AI frameworks output their trained neural networks as a data file or collection of files, depending on the framework and the type of underlying neural network. For example, TensorFlow can output a frozen TensorFlow model (.pb file) or a pair of checkpoint and graph meta files. Neural network files such as these contain a variety of information about the trained neural network, including the network structure and definition, the weights and biases established during training, and other metadata.

Once the files from the framework have been created, the next step is to set up a pipeline to convert them into a format suitable for execution on a Snapdragon platform. To support this, we developed our Deep Learning Container (DLC) format, which is the file format our Neural Processing SDK’s API loads and runs on the Snapdragon platform. Our Neural Processing SDK also includes command line tools for Caffe2, TensorFlow, and ONNX to perform the conversion of the neural network files from the respective framework into DLC.
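As an illustration, a frozen TensorFlow model might be converted with the SDK's `snpe-tensorflow-to-dlc` tool roughly as follows. The model file, input layer name, dimensions, and output node below are hypothetical, and exact flag names vary between SDK releases (older releases use `--graph`/`--dlc` where newer ones use `--input_network`/`--output_path`):

```shell
# Convert a frozen TensorFlow graph to DLC (names and flags are illustrative)
snpe-tensorflow-to-dlc \
    --input_network frozen_model.pb \
    --input_dim input "1,224,224,3" \
    --out_node softmax_output \
    --output_path model.dlc
```

The SDK provides analogous converters for the other supported frameworks (e.g., `snpe-caffe2-to-dlc` and `snpe-onnx-to-dlc`).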

During the conversion, developers have the option to perform quantization to reduce the file size and to potentially reduce the processing power and power consumption required to run inference. Quantizing achieves this by transforming the representations of network parameters from 32-bit floats to 8-bit fixed point. Developers also have the option at this point to add HTA sections to take advantage of the Hexagon Tensor Accelerator found on devices such as the Snapdragon® 855 Mobile Platform.
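The quantization itself is handled by the SDK's tooling, but the underlying idea is a simple affine min/max transform. The following standalone Python sketch (not SDK code) illustrates why 8-bit fixed point cuts storage by 4x while bounding the reconstruction error by the quantization step size:

```python
import numpy as np

def quantize_8bit(weights):
    """Affine min/max quantization of float32 weights to uint8,
    similar in spirit to the transform applied during DLC quantization."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0  # avoid divide-by-zero for constant weights
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    """Recover approximate float32 weights from the quantized form."""
    return q.astype(np.float32) * scale + w_min

weights = np.random.randn(64).astype(np.float32)
q, scale, w_min = quantize_8bit(weights)
restored = dequantize(q, scale, w_min)

# uint8 storage is 4x smaller than float32, and the per-weight
# reconstruction error is bounded by the quantization step size
assert q.dtype == np.uint8
assert np.max(np.abs(weights - restored)) <= scale
```

In practice the trade-off is between this small reconstruction error and the gains in file size, memory bandwidth, and power consumption.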

Loading and Running the Trained Neural Network

Our Neural Processing SDK’s runtime library provides a number of components (see diagram below) for taking advantage of the Snapdragon hardware to run inference of a trained model:

The main components of our Neural Processing SDK’s runtime library are:

  • DL Container Loader: Loads a DLC file created by one of the Neural Processing SDK’s conversion tools.
  • Model Validation: Validates that the loaded DLC file is supported[2] by the required runtime.
  • Runtime Engine: Executes a loaded model on the requested runtime(s), including gathering profiling information and supporting user-defined layers (UDLs).
  • Partitioning Logic: Processes the model, including UDLs and validation information, and partitions it into sub-models if needed. For UDLs, this logic breaks execution before and after each UDL so that the UDL can run separately. If CPU fallback is enabled, the model is partitioned between layers supported by the target runtime and layers that fall back to the CPU runtime.
  • CPU Runtime: Runs the model on the Kryo CPU, supporting 32-bit floating point or 8-bit quantized execution.
  • GPU Runtime: Runs the model on the Adreno GPU, supporting hybrid or full 16-bit floating point modes.
  • DSP Runtime: Runs the model on the Hexagon DSP using Hexagon Vector Extensions (HVX[3]) which are well suited for the vector operations commonly used by machine learning algorithms.
  • AI Processor (AIP) Runtime: Runs the model on the Hexagon DSP using Q6, Hexagon NN, and HTA.
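To make the CPU-fallback partitioning concrete, here is a toy Python sketch (not SDK code) of the idea: walk the ordered layer list and split it into contiguous sub-models according to whether the target runtime supports each layer. The layer names and supported set are hypothetical:

```python
def partition(layers, supported):
    """Split an ordered list of layer names into contiguous sub-models,
    alternating between the target runtime and CPU fallback."""
    parts = []
    for layer in layers:
        runtime = "target" if layer in supported else "cpu"
        if parts and parts[-1][0] == runtime:
            parts[-1][1].append(layer)  # extend the current sub-model
        else:
            parts.append((runtime, [layer]))  # start a new sub-model
    return parts

model = ["conv1", "relu1", "custom_op", "conv2", "softmax"]
dsp_supported = {"conv1", "relu1", "conv2", "softmax"}
print(partition(model, dsp_supported))
# [('target', ['conv1', 'relu1']), ('cpu', ['custom_op']), ('target', ['conv2', 'softmax'])]
```

Each sub-model boundary implies a hand-off between runtimes, which is why keeping as many consecutive layers as possible on one core matters for performance.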

Of particular note is the AIP runtime, a software abstraction of the Q6, HVX, and HTA into a single entity that controls execution of a neural network model across all three hardware features, as shown in the diagram below:

Developers who load a model using our Neural Processing SDK API and select the AIP runtime as the target will have parts of the model running on the HTA and parts running on HVX via the Hexagon NN acceleration library.

The general steps for making this all work at runtime using our Neural Processing SDK API are as follows:

  • Query the available runtime cores and select one to execute the network on (Hexagon DSP, Adreno GPU, or Kryo CPU)
  • Load the DLC file
  • Execute the network
  • Read the network’s output and process it according to the application’s business logic
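The steps above map onto the SDK's C++ API roughly as sketched below. This is an outline based on the API used in the SDK's tutorials, not a complete program: the DLC filename is a placeholder, the input/output tensor handling is elided, and exact headers and method names vary between releases (newer releases, for example, replace `setRuntimeProcessor` with a runtime-list builder method):

```cpp
#include <iostream>
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEFactory.hpp"
#include "SNPE/SNPEBuilder.hpp"
#include "DlContainer/IDlContainer.hpp"

int main() {
    // 1. Select a runtime core, falling back to the CPU if the DSP is unavailable
    zdl::DlSystem::Runtime_t runtime = zdl::DlSystem::Runtime_t::DSP;
    if (!zdl::SNPE::SNPEFactory::isRuntimeAvailable(runtime))
        runtime = zdl::DlSystem::Runtime_t::CPU;

    // 2. Load the DLC file produced by the conversion tools
    auto container = zdl::DlContainer::IDlContainer::open(
        zdl::DlSystem::String("model.dlc"));
    if (!container) { std::cerr << "Failed to load DLC file\n"; return 1; }

    // 3. Build the network instance for the chosen runtime
    zdl::SNPE::SNPEBuilder builder(container.get());
    auto snpe = builder.setRuntimeProcessor(runtime).build();
    if (!snpe) { std::cerr << "Failed to build network\n"; return 1; }

    // 4. Execute: populate an input tensor, call snpe->execute(...) with it,
    //    then read the output tensor map and apply the app's business logic.
    return 0;
}
```

The C++ and Android tutorials in the SDK package show the complete versions of these steps, including tensor creation and output handling.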

Our Neural Processing SDK package provides a number of tutorials demonstrating this. The most notable are the C++ and Android tutorials which show how to use our Neural Processing SDK API to load a network from a DLC file and execute that network to run inference.


While a good understanding of neural networks is important, a toolchain such as that provided by our Neural Processing SDK reduces the process of bringing a trained neural network onto the Snapdragon mobile platform to a few simple steps: importing a trained neural network into the DLC format that runs on Snapdragon processors, selecting a runtime core, executing the network to run inference, and using the output. With these basic tasks taken care of, developers can then focus on the business logic of their app.

We hope you find it straightforward to bring your neural networks onto mobile devices powered by Snapdragon. We’d love to hear how you used these tools to build your mobile AI projects.


[1] Ubuntu 14.04 is currently the only supported development environment for the SDK.

[2] See Supported Network Layers for a list of supported layer types and which Snapdragon processors can run them.

[3] HVX is a co-processor on the Hexagon DSP that allows compute workloads such as those for imaging and computer vision to be processed on the DSP instead of the CPU.