Snapdragon Neural Processing Engine SDK
Reference Guide
Performance Tips

Performance Tips for Using Tensors

UserBuffer

By default, SNPE creates networks that accept tensors; for each call to SNPE::execute(), an additional copy is made to move data into and out of SNPE. In addition, depending on the data format required by the underlying target runtime, SNPE may perform a format conversion such as quantization or float expansion.

An alternative is to create networks that accept user buffers by calling the setUseUserSuppliedBuffers() setter before SNPEBuilder::build(). Networks built this way use UserBuffers for execute(). With a UserBuffer, the user specifies the format (encoding) of the buffer and its dimensionality. If the dimensions and strides of the buffer match the network's, SNPE can potentially read from and write to the buffer directly, saving the data copies into and out of tensors on each execute.
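
The following is a minimal sketch of building a network for user buffers and wrapping an application-owned float array in a UserBuffer. The helper names, the strides, and the container handling are placeholders; the builder and factory calls are based on the C++ API described above and should be verified against the headers shipped with your SNPE release.

#include <memory>
#include <vector>

#include "DlContainer/IDlContainer.hpp"
#include "DlSystem/IUserBuffer.hpp"
#include "DlSystem/TensorShape.hpp"
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEBuilder.hpp"
#include "SNPE/SNPEFactory.hpp"

// Build a network whose execute() takes UserBufferMaps instead of ITensors.
std::unique_ptr<zdl::SNPE::SNPE> buildForUserBuffers(zdl::DlContainer::IDlContainer* container)
{
    return zdl::SNPE::SNPEBuilder(container)
        .setUseUserSuppliedBuffers(true)
        .build();
}

// Wrap an application-owned float buffer so SNPE can read/write it in place.
// 'strides' are in bytes, ordered from the outermost to the innermost dimension.
std::unique_ptr<zdl::DlSystem::IUserBuffer> wrapFloatBuffer(
    std::vector<float>& appBuffer,
    const zdl::DlSystem::TensorShape& strides)
{
    // Kept alive for the lifetime of the buffer.
    static zdl::DlSystem::UserBufferEncodingFloat encoding;
    return zdl::SNPE::SNPEFactory::getUserBufferFactory().createUserBuffer(
        appBuffer.data(),
        appBuffer.size() * sizeof(float),
        strides,
        &encoding);
}

At execute time, the wrapped input and output buffers are collected into zdl::DlSystem::UserBufferMap objects keyed by tensor name and passed to execute(inputMap, outputMap); because SNPE accesses the application memory directly, no per-execute tensor copies are needed.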

Copy Tensors

SNPE supports an STL-compatible tensor class that is used to send data into the network and return the output. While this provides a great deal of flexibility and the ability to leverage STL functions to manipulate the data, it does come at a cost. For tensors that contain relatively little data, exactly how the user manipulates the data inside a tensor or gets data into the tensor does not really matter. However, for tensors that need to contain a large amount of data (e.g. a 1080p input image or very large outputs), the user should be aware of the following guideline when moving data into a tensor: std::copy() is far more efficient for moving data into or out of a tensor than direct use of the iterators (by at least an order of magnitude). So rather than doing something like the following:

// Assume we have access to the following two variables
// std::shared_ptr<zdl::DlSystem::ITensor> tensor;
// std::vector<float>& vec;
vec.resize(tensor->getSize());
size_t idx = 0;
for (auto it = tensor->begin(); it != tensor->end(); ++it)
{
    vec[idx++] = *it;
}

The user should do this instead:

std::copy(tensor->begin(), tensor->end(), vec.begin());

This is true whether getting data from a tensor (as in the example above) or putting data into a tensor.

In addition, if the tensor data needs to be modified (e.g. pre-processed before going into the network or post-processed after), it is better to do that manipulation in a user-owned buffer than directly in the tensor through its iterators, and then just use std::copy() to move the modified data into or out of the tensor, as in the sketch below.
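
For illustration, the following hypothetical preprocessing step applies mean subtraction in an ordinary std::vector and then bulk-copies the result into the input tensor; the mean value and buffer handling are placeholders.

#include <algorithm>
#include <vector>

#include "DlSystem/ITensor.hpp"

// Do the per-element work in plain memory, then move it into the tensor
// with a single std::copy() rather than writing through tensor iterators.
void loadPreprocessedInput(const std::vector<float>& rawPixels,
                           float mean,
                           zdl::DlSystem::ITensor* inputTensor)
{
    std::vector<float> scratch(rawPixels.size());
    std::transform(rawPixels.begin(), rawPixels.end(), scratch.begin(),
                   [mean](float v) { return v - mean; });
    std::copy(scratch.begin(), scratch.end(), inputTensor->begin());
}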

Performance Tips for Executing Networks

  • Optimizing TensorFlow Graphs for Inference

    • This applies only to TensorFlow, not Caffe.
    • TensorFlow provides a tool that can be used to convert a model into one that is optimized for inference.
    • It is strongly recommended to optimize TensorFlow graphs prior to converting them to a DLC file.
    • For an example of optimizing for inference, see $SNPE_ROOT/models/inception_v3/scripts/setup_inceptionv3.py.

  • Balancing Performance and Power

    • SNPE supports five performance profiles: "DEFAULT", "BALANCED", "HIGH_PERFORMANCE", "POWER_SAVER" and "SYSTEM_SETTINGS". (See the setPerformanceProfile API description.)
    • The DEFAULT performance profile is less power intensive, at the expense of performance.
    • The BALANCED performance profile is the same as DEFAULT. (DEFAULT is going to be deprecated.)
    • The POWER_SAVER performance profile attempts to provide more power saving than the BALANCED performance profile, which may result in lower performance.
    • For optimal performance, use the setPerformanceProfile API to select HIGH_PERFORMANCE (a usage sketch follows this list).
      • When HIGH_PERFORMANCE is selected, SNPE will attempt to maximize performance at the expense of increased power consumption.
    • The SYSTEM_SETTINGS profile causes SNPE to leave all power and performance settings alone. No calls to any power or performance related APIs will be invoked by SNPE.
      • Users of this profile can use other APIs (out of the scope of SNPE) if they want to control performance or power.
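
A minimal sketch of selecting the HIGH_PERFORMANCE profile at build time follows; the helper name and container handling are placeholders, and the enum value assumes the zdl::DlSystem::PerformanceProfile_t type referenced by the setPerformanceProfile API description.

#include <memory>

#include "DlContainer/IDlContainer.hpp"
#include "DlSystem/DlEnums.hpp"
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEBuilder.hpp"

// Build with the HIGH_PERFORMANCE profile; substitute POWER_SAVER, BALANCED,
// or SYSTEM_SETTINGS where power consumption matters more than speed.
std::unique_ptr<zdl::SNPE::SNPE> buildHighPerformance(zdl::DlContainer::IDlContainer* container)
{
    return zdl::SNPE::SNPEBuilder(container)
        .setPerformanceProfile(zdl::DlSystem::PerformanceProfile_t::HIGH_PERFORMANCE)
        .build();
}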

  • Minimizing Profiling in Production Environments

    • SNPE supports the SNPEBuilder::setProfilingLevel() API to configure the level of profiling information.
    • While the overhead of collecting profiling information is small, it will still add to the inference time.
    • Disabling profiling in production environments avoids this overhead and reduces inference time (a sketch follows this list).
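
As a minimal sketch, profiling can be disabled when the network is built; the OFF value assumes the zdl::DlSystem::ProfilingLevel_t enum referenced by the setProfilingLevel API description, so verify the exact names against your SDK headers.

#include <memory>

#include "DlContainer/IDlContainer.hpp"
#include "DlSystem/DlEnums.hpp"
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEBuilder.hpp"

// Build with profiling turned off for production deployments.
std::unique_ptr<zdl::SNPE::SNPE> buildWithoutProfiling(zdl::DlContainer::IDlContainer* container)
{
    return zdl::SNPE::SNPEBuilder(container)
        .setProfilingLevel(zdl::DlSystem::ProfilingLevel_t::OFF)
        .build();
}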

  • Running on the GPU

    • Typically, running a network on the GPU yields a 6x-10x increase in inference speed compared to running the same network on the CPU, at lower power consumption, so the GPU runtime is usually the obvious choice for network execution unless the GPU is heavily utilized by some other application (e.g. gaming).
    • However, there is roughly 4-6 ms of overhead for network execution on the GPU that does not exist on the CPU, so very small networks might execute faster on the CPU. For example, if a network runs in less than 10 ms on the GPU, it may run faster on the CPU because the GPU overhead can eliminate any speed advantage the GPU provides for the actual network execution.
    • By default, the GPU runtime runs in GPU_FLOAT32_16_HYBRID mode (see the C++ Runtime_t enum description). The GPU_FLOAT16 mode may run some networks faster but may also incur accuracy loss. (See the GPU Limitations section for more information.) A sketch of selecting between these modes follows this list.
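
The following sketch checks GPU availability and prefers GPU_FLOAT16 when present, falling back to the default hybrid GPU mode and then the CPU; the selection policy and helper names are illustrative only, and the setRuntimeProcessor() setter should be confirmed against your SNPE release (newer releases favor a runtime-order variant).

#include <memory>

#include "DlContainer/IDlContainer.hpp"
#include "DlSystem/DlEnums.hpp"
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEBuilder.hpp"
#include "SNPE/SNPEFactory.hpp"

// Illustrative policy: prefer 16-bit float GPU, then the hybrid GPU default,
// then fall back to the CPU runtime.
zdl::DlSystem::Runtime_t chooseRuntime()
{
    if (zdl::SNPE::SNPEFactory::isRuntimeAvailable(zdl::DlSystem::Runtime_t::GPU_FLOAT16))
        return zdl::DlSystem::Runtime_t::GPU_FLOAT16;
    if (zdl::SNPE::SNPEFactory::isRuntimeAvailable(zdl::DlSystem::Runtime_t::GPU))
        return zdl::DlSystem::Runtime_t::GPU;   // GPU_FLOAT32_16_HYBRID
    return zdl::DlSystem::Runtime_t::CPU;
}

std::unique_ptr<zdl::SNPE::SNPE> buildOnChosenRuntime(zdl::DlContainer::IDlContainer* container)
{
    return zdl::SNPE::SNPEBuilder(container)
        .setRuntimeProcessor(chooseRuntime())
        .build();
}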

  • Running on the DSP
    • The DSP offers an optimized execution environment for supported layers; however, some layer operations are not optimal on the DSP and may slow down execution of the model.
    • The performance of input preprocessing layers is currently not optimized in the DSP runtime. When using the DSP runtime, it is recommended to do input preprocessing (colour space conversion, scaling, crop and mean subtract) before passing the image to SNPE.
    • The DSP runs 8-bit quantized math for most operations. Some networks may be sensitive to this and may not be suitable for the DSP runtime.
    • The default DSP runtime availability check performs platform validation on the DSP itself to confirm DSP runtime support. The basic runtime availability check performs less validation than the default check; it only verifies that the SoC platform is expected to have DSP support.
    • Accelerator init times are significantly longer for DSP V68 and above compared to previous-generation platforms. The longer initialization times are due to graph analysis and optimization.
    • For DSP V68 and above, enabling init cache mode is recommended; subsequent initialization times are greatly reduced and execution times also improve due to data locality. A sketch combining the basic availability check and init caching follows.
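
As a sketch, the snippet below combines a basic DSP availability check with init caching; the RuntimeCheckOption_t::BASIC_CHECK value and the setInitCacheMode() setter are assumptions based on the descriptions above, so verify both against the headers in your SNPE release.

#include <memory>

#include "DlContainer/IDlContainer.hpp"
#include "DlSystem/DlEnums.hpp"
#include "SNPE/SNPE.hpp"
#include "SNPE/SNPEBuilder.hpp"
#include "SNPE/SNPEFactory.hpp"

// Fast (basic) check that the SoC is expected to have DSP support, then build
// for the DSP with init caching enabled so subsequent loads are faster.
std::unique_ptr<zdl::SNPE::SNPE> buildForDsp(zdl::DlContainer::IDlContainer* container)
{
    if (!zdl::SNPE::SNPEFactory::isRuntimeAvailable(
            zdl::DlSystem::Runtime_t::DSP,
            zdl::DlSystem::RuntimeCheckOption_t::BASIC_CHECK))
    {
        return nullptr;   // no DSP support expected on this platform
    }

    return zdl::SNPE::SNPEBuilder(container)
        .setRuntimeProcessor(zdl::DlSystem::Runtime_t::DSP)
        .setInitCacheMode(true)   // cache graph analysis/optimization results
        .build();
}

Note that for the cached records to benefit later runs, the updated container typically has to be saved back to storage after the first build; see the init caching documentation for the exact workflow in your release.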