Snapdragon Neural Processing Engine SDK Reference Guide

This chapter describes limitations discovered in this release during testing. Future releases will provide fixes for discovered issues.

# General SNPE Limitations

• SNPE currently supports 4D input data, where the first dimension is batch.
• Only batch of 1 is supported for RCNN networks like Faster-RCNN. See Layer Limitations below.

# General Java API Limitations

• Confine INeuralNetwork instance usage to a single thread
In the current SDK, each INeuralNetwork instance is meant to be accessed from a single thread. Developers must enforce this within the application; otherwise unexpected errors may occur.

# General CPU Runtime Limitations

• Not all layers have been optimized for the CPU runtime. For example, deconvolution is dramatically slower than convolution.

# General GPU Runtime Limitations

• In GPU_FLOAT32_16_HYBRID mode, the GPU kernels use HALF_FLOAT precision for all intermediate data handling and FULL_FLOAT precision for all computations. This typically does not affect mAP for classification networks, but the half-precision intermediate data can overflow or underflow, which may affect uses of the engine other than classification. If an impact is observed, run the network with the CPU runtime, which always uses FULL_FLOAT, to check whether overflow or underflow is the cause.
• In GPU_FLOAT16 mode, the GPU kernels use HALF_FLOAT precision for all intermediate data handling and all computations. Because computation precision is lower than in GPU_FLOAT32_16_HYBRID mode, a negative impact on network accuracy (e.g. mAP score) is more likely. Users are encouraged to test the accuracy of their network in this mode to ensure it meets the requirements of their use case.
• For absolute size restrictions, the concept of “packed” channels refers to the number of channels divided by 4, and rounded up to the nearest integer:
packed_channels = ceil(channels / 4.0)
• Whenever a layer has a 4-dimensional (i.e. batch x width x height x channels) component, such as input, output, or weight tensor, that component will have the following size restrictions:
• Number of packed channels * width < MaxPerGPUSize
• For all layers that have weights/biases, restrictions are:
• Filter size * filter size * 4 <= MaxPerGPUSize
• Number of output channels / 4 <= MaxPerGPUSize
• MaxPerGPUSize depends on the Qualcomm Adreno™ GPU type. The values are:
• A330: 8192
• A430, A530: 16384
• While loading a network, the GPU runtime may merge (squash) some layers into the preceding layers, depending on layer compatibility. As a result, no performance information is reported for the squashed layers.
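The size restrictions above reduce to a few integer comparisons. The following is a minimal sketch of a pre-flight check, not part of the SNPE SDK: the helper names (`packed_channels`, `tensor_fits`, `weights_fit`) and the lookup table are hypothetical, and only the formulas and MaxPerGPUSize values are taken from the list above.

```python
import math

# MaxPerGPUSize values quoted in the list above, keyed by Adreno GPU type.
MAX_PER_GPU_SIZE = {"A330": 8192, "A430": 16384, "A530": 16384}


def packed_channels(channels: int) -> int:
    """Number of channels divided by 4, rounded up to the nearest integer."""
    return math.ceil(channels / 4.0)


def tensor_fits(width: int, channels: int, gpu: str) -> bool:
    """4D component rule: packed channels * width < MaxPerGPUSize."""
    return packed_channels(channels) * width < MAX_PER_GPU_SIZE[gpu]


def weights_fit(filter_size: int, output_channels: int, gpu: str) -> bool:
    """Weight/bias rules: filter size * filter size * 4 <= MaxPerGPUSize
    and number of output channels / 4 <= MaxPerGPUSize."""
    limit = MAX_PER_GPU_SIZE[gpu]
    return (filter_size * filter_size * 4 <= limit
            and output_channels / 4 <= limit)
```

For example, a tensor of width 4096 with 8 channels has 2 packed channels, so the product is 8192. Under these rules that fails the strict inequality on an A330 but passes on an A530.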

# General DSP Runtime Limitations

• When using non-quantized models, the first network execution after network initialization may be significantly slower than subsequent executions. To avoid this, use a DLC file that has been quantized by snpe-dlc-quantize.

# General AIP Runtime Limitations

• If the input layer of a network needs to be processed by the HTA, the input must be a 4D tensor in NHWC format, where the batch dimension N must be 1 and the number of channels C cannot exceed 16.
• This limitation can be bypassed by manually partitioning the network so that the input layer is processed on the HVX instead. See Adding HTA sections for details on partitioning.
• The AIP runtime supports batched input for models that run entirely on the HTA, and for models in which every layer runs on the HTA except Softmax, which is partitioned to the HVX.
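The HTA input-layer rule in the first bullet can be expressed as a small shape check. This is a sketch only: `hta_input_ok` is a hypothetical helper, not an SNPE API, and it does not model the batched-input case described in the last bullet.

```python
def hta_input_ok(shape) -> bool:
    """Check an NHWC input shape against the HTA input-layer rule:
    4D tensor, batch N equal to 1, channels C not exceeding 16."""
    if len(shape) != 4:
        return False
    n, _height, _width, c = shape
    return n == 1 and c <= 16
```

A shape such as `(1, 224, 224, 3)` passes this check, while `(1, 224, 224, 32)` does not and would need the input layer to be partitioned to the HVX.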

# Layer Limitations

• ArgMax
• For DSP runtime, ArgMax only outputs float; for accuracy reasons, its output cannot be a quantized data type.
• Batch normalization (+ Scaling)
• Caffe: Scaling (scale_layer) is optional. If present, it extends the functionality of Batch normalization (batch_norm_layer). If not present, batch_norm_layer is still converted as per the Caffe specification.
A scale_layer used anywhere other than immediately after a batch_norm_layer is not supported.
• 1D (i.e. per-channel) batch normalization: supported only for Caffe models. Not supported in the DSP runtime.
• Starting in 1.15.0, the Caffe converter distinguishes between a batch_norm_layer and an instance_norm_layer using the value of the batch_norm_param field use_global_stats. If use_global_stats is set to True, the converter consumes the layer as a batch_norm_layer. If it is set to False, the converter consumes the layer as an instance_norm_layer. When converting a Caffe model that contains batch_norm_layers, make sure the prototxt does not define use_global_stats as "False" (i.e. do not use a training prototxt for conversion).
• Color space conversion
• For NV21 input image encoding type, width or height must be a multiple of 2, because four Y samples (a 2w x 2h block) share one UV pair.
• Concatenation
• For GPU runtime, the number of input channels in each of the inputs can assume arbitrary values. However, if one or more of these are not a multiple of 4, performance of the layer will be diminished.
• Convolution
• For GPU runtime, when the number of groups is greater than 1, the number of output channels must be a multiple of 4 * the number of groups. For example, with 2 groups, the number of output channels must be a multiple of 8 (4*2=8).
• Crop
• For GPU runtime, the number of input channels in each of the inputs must be a multiple of 4.
• Crop on the DSP is not optimized in all cases. Spatial cropping (cropping height and/or width while leaving the other dimensions unchanged) is optimized.
• Deconvolution
• For GPU and CPU runtime, the number of output channels (i.e. number of filters) can be any value (not necessarily a multiple of 4).
• For GPU runtime the following limitations apply:
• Number of packed input channels * number of output channels <= MaxPerGPUSize
• Filter size-X * Filter size-Y <= MaxPerGPUSize
• Stride <= filter size
• For DSP runtime, deconvolutions with stride > 4 are not fully optimized.
• Caffe parameter limitation: dilation and rectangular filters are not supported
• Depthwise Convolution
• Depthwise Convolution on the DSP is not optimized for all cases. The following case is optimized:
• Horizontal stride is <= 2.
• Filter is 3x3.
• Depth is a multiple of 32.
• Detection Output
• keepTopK must be provided.
• Output buffer must be of sufficient size.
• For DSP runtime, batch > 1 and DLC caching are not supported.
• Fully connected
• For GPU runtime, the following limitations apply:
• Input width * input height * number of input channels <= MaxPerGPUSize
• Number of output channels <= MaxPerGPUSize
• For DSP Runtime, batch > 1 is optimized only when input height * width * channel is a multiple of 16.
• Input Image Scaling
• The DSP runtime image scaling performs well under the conditions listed below. Other configurations are not optimized.
• Scale factor is an upscale by 2x AND
• Depth is a power of 2 AND either
• Depth is less than 128 with width equal to a power of 2 OR
• Depth is greater than 128.
• Instance Normalization
• For certain models containing InstanceNorm layers, the default value for the “epsilon” parameter could overwhelm the standard deviation of the input tensor. In such cases a numerical discrepancy between the source framework and SNPE can happen. For such cases it helps to override the value of epsilon in the source model to a much smaller value.
For instance, for Caffe:
batch_norm_param {
  use_global_stats: false
  eps: 1e-9
}
• For the DSP runtime, padding has the following limitations:
• Non-4D padding inputs are not supported.
• Padding along the batch dimension is not supported.
• Pooling
• Caffe parameter limitation: Average and Maximum pooling methods are supported, but not Stochastic.
• Power
• Power layer is only supported on DSP and has a Caffe parameter limitation: only shift = 0 and power = 1 are supported.
• Proposal
• Proposal layer is not supported on the GPU.
• Only batch of 1 is supported.
• ROI Pooling
• ROI Pooling is not supported on the GPU.
• For DSP runtime, the input to the ROI Pooling layer must be a Proposal layer or an OPAQUE Input layer.
• Only batch of 1 is supported.
• Scale
• Scale is only supported on the DSP.
• For DSP runtime, only channel scaling is supported.
• Slice
• Currently does not support creation of a slice layer without slice points defined.
• Tile
• The Tile layer will currently be displayed as a "Concatenation" layer when the topology of a network containing it is viewed using snpe-dlc-info.
• UDO
• DSP runtime
• SNPE DSP requires a quantized model if the UDO has at least one quantized output.
• The data types supported in DSP UDO layers are FLOAT_32 and UINT_8 (quantized with TF schema).
• GPU runtime
• Only 16-bit floating point (OpenCL half) activations are supported in the network.
• The only data type supported for activation tensors in GPU UDO layers is FLOAT_16.
• CPU runtime
• The CPU runtime always operates on full-precision (FP32) tensors.
• The only data type supported for activation tensors in CPU UDO layers is FLOAT_32.
• Model Conversion
• Model conversion with UDO is not available for models trained with Caffe.
• Model conversion with UDO is supported only for models trained with Caffe2 that are represented in the ONNX format.
• Package Generation
• Multiple UDOs cannot be defined in a single config file if they are intended to be used with core type = DSP.
In this case users are required to create one config file per UDO and generate separate packages with each op. This restriction does not apply to core types CPU or GPU.
• A tensor parameter in a UDO definition can be expressed with only one data type (e.g. either FLOAT_32 or FLOAT_16, but not both).
Users who want to use their UDOs on multiple runtimes with different data types may need to create a separate config file per data type and generate the corresponding packages.
• Application
• UDO integration is supported only with native C APIs. Java extensions are not available in this release.
Users who want to integrate UDOs into Android applications will have to interface with SNPE APIs at the JNI level in order to take advantage of this functionality.
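The GPU runtime restrictions listed under Convolution and Deconvolution earlier in this section are arithmetic checks on layer parameters. The sketch below shows one way to evaluate them before conversion. The function names are hypothetical and not part of SNPE, MaxPerGPUSize is passed in by the caller, and comparing a single stride value against both filter dimensions is an assumption, since the list states only "Stride <= filter size".

```python
import math


def conv_groups_ok(output_channels: int, groups: int) -> bool:
    """GPU convolution rule: when groups > 1, the number of output
    channels must be a multiple of 4 * the number of groups."""
    if groups <= 1:
        return True
    return output_channels % (4 * groups) == 0


def deconv_gpu_ok(input_channels: int, output_channels: int,
                  filter_x: int, filter_y: int, stride: int,
                  max_per_gpu_size: int) -> bool:
    """GPU deconvolution rules from the Deconvolution entry."""
    packed_in = math.ceil(input_channels / 4.0)
    return (packed_in * output_channels <= max_per_gpu_size
            and filter_x * filter_y <= max_per_gpu_size
            # Assumption: one stride value, checked against both dimensions.
            and stride <= min(filter_x, filter_y))
```

With 2 groups, 16 output channels satisfy the convolution rule (a multiple of 8) and 12 do not.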

# Tool Limitations

• snpe-net-run
• Default profiling level is detailed.
• snpe_bench.py
• Default profiling level is basic.
• snpe-caffe2-to-dlc
• It is expected that initialization weights are input through the same NetDef format that the caffe translator Python script generated.
• No database formats are yet supported.
• snpe-dlc-info
• For CMRN layers, the alpha value shown is actually alpha/window_size for models converted from Caffe.
• Example: snpe-dlc-info shows an alpha of 2e-05 for a CMRN layer from Caffe with a window_size of 5 and alpha of 0.0001.
• For deconvolution layers, the num filters value shown is actually num filters / group.
• Example: snpe-dlc-info shows num filters as 1 for a deconvolution layer with num_output of 11 and group of 11.
• snpe-dlc-quantize
• The snpe-dlc-quantize tool cannot quantize models that contain UDL layers.
• Note: these models can still be run on DSP runtime. The DSP runtime will quantize the network during network initialization.
• snpe-tensorflow-to-dlc
• The TensorFlow converter does not support conversion of TensorFlow graphs that have been quantized using TensorFlow tools. In order to quantize a TensorFlow model, run the TensorFlow converter (snpe-tensorflow-to-dlc) first, then run snpe-dlc-quantize on the DLC file generated by the TensorFlow converter.
• Convolution
• BiasAdd node is optional and when missing a bias of zeros will be added.
• Concat
• Concat node must have at least 2 non Const inputs.
• ElementWise Sum/Mul/Max
• Must be the only operation within its scope.
• Does not support scalar operands.
• Fully Connected
• Inputs to MatMul operation must be 1D.
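The two snpe-dlc-info display conventions noted earlier in this section (CMRN alpha and deconvolution num filters) can be reproduced with simple arithmetic. The helpers below are a hypothetical sketch for cross-checking tool output against the original Caffe parameters; they are not part of any SNPE tool.

```python
def displayed_cmrn_alpha(caffe_alpha: float, window_size: int) -> float:
    """Alpha as shown by snpe-dlc-info for a CMRN layer converted
    from Caffe: alpha / window_size."""
    return caffe_alpha / window_size


def displayed_deconv_num_filters(num_output: int, group: int) -> int:
    """Num filters as shown by snpe-dlc-info for a deconvolution
    layer: num filters / group."""
    return num_output // group
```

Using the values from the examples above, a Caffe alpha of 0.0001 with a window_size of 5 gives a displayed alpha of approximately 2e-05, and a num_output of 11 with a group of 11 gives a displayed num filters of 1.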