Snapdragon Neural Processing Engine SDK
Reference Guide
Quantized vs Non-Quantized Models

Overview

  • Non-quantized DLC files use 32 bit floating point representations of network parameters.
  • Quantized DLC files use 8 bit fixed point representations of network parameters. The fixed point representation is the same as that used in TensorFlow quantized models.

Caffe and Caffe2

The default output of snpe-caffe-to-dlc and snpe-caffe2-to-dlc is a non-quantized model. This means that all the network parameters are left in the 32 bit floating point representation as present in the original Caffe model. To quantize the model to 8 bit fixed point, see snpe-dlc-quantize. Note that models that are intended to be quantized using snpe-dlc-quantize must have their batch dimension set to 1. A different batch dimension can be used during inference by resizing the network during initialization.

ONNX

The default output of snpe-onnx-to-dlc is a non-quantized model. This means that all the network parameters are left in the 32 bit floating point representation as present in the original ONNX model. To quantize the model to 8 bit fixed point, see snpe-dlc-quantize. Note that models that are intended to be quantized using snpe-dlc-quantize must have their batch dimension set to 1. A different batch dimension can be used during inference, by resizing the network during initialization.

TensorFlow

The default output of snpe-tensorflow-to-dlc is a non-quantized model. This means that all the network parameters are left in the 32 bit floating point representation as present in the original TensorFlow model. To quantize the model to 8 bit fixed point, see snpe-dlc-quantize. Note that models that are intended to be quantized using snpe-dlc-quantize must have their batch dimension set to 1. A different batch dimension can be used during inference, by resizing the network during initialization.

The TensorFlow converter does not support conversion of TensorFlow graphs that have been quantized using TensorFlow tools.

Choosing Between a Quantized and a Non-Quantized Model

Summary

| Runtime | Quantized DLC | Non-Quantized DLC |
|---|---|---|
| CPU or GPU | Compatible. The model is dequantized by the runtime, increasing network initialization time. Accuracy may be impacted. | Compatible. The model is native format for these runtimes. Model can be passed directly to the runtime. May be more accurate than a quantized model. |
| DSP | Compatible. The model is native format for DSP runtime. Model can be passed directly to the runtime. Accuracy may be different than a non-quantized model. | Compatible. The model is quantized by the runtime, increasing network initialization time. Accuracy may be different than a quantized model. |
| AIP | Compatible. The model is in supported format for AIP runtime. Model can be passed directly to the runtime. | Incompatible. Non-quantized models are not supported by the AIP runtime. |

Details

  • GPU and CPU
    • The GPU and CPU always use floating point (non-quantized) network parameters.
    • Using quantized DLC files with GPU and CPU runtimes is supported. Network initialization time will dramatically increase, as SNPE will automatically dequantize the network parameters in order to run on GPU and CPU.
    • If network initialization time is a concern, it is recommended to use non-quantized DLC files (default) for both GPU and CPU.
    • Quantization of the DLC file does introduce noise, as quantization is lossy.
    • Network performance during execution is not impacted by the choice of quantized vs non-quantized DLC files.
  • DSP
    • The DSP always uses quantized network parameters.
    • Using a non-quantized DLC file on the DSP is supported. Network initialization time will dramatically increase as SNPE will automatically quantize the network parameters in order to run on the DSP.
    • It is generally recommended to use quantized DLC files for running on the DSP. In addition to faster network initialization time, using quantized models also reduces peak memory usage during initialization, and decreases DLC file size.
  • AIP
    • The AIP runtime always uses quantized network parameters.
    • Passing the model through snpe-dlc-quantize is mandatory for generating the binaries for HTA subnets.
    • Using a non-quantized DLC file with the AIP runtime is not supported.
    • HTA subnets use the quantized parameters in the DLC.
    • HNN (Hexagon NN) subnets use the quantization parameters in the same way the DSP runtime does.
  • Balancing DLC file size, network initialization time and accuracy
    • If the network will mainly run on the GPU and CPU it is recommended to try both quantized and non-quantized models during development. If a quantized model provides enough accuracy, then it may be used at the expense of increased network initialization time. The benefit is a much smaller DLC file. The tradeoff between accuracy, network initialization time, and DLC file size is application specific.
    • If the network will mainly run on the DSP, there is no benefit to using a non-quantized model. As previously stated it will dramatically increase network initialization time and DLC file size, but provide no accuracy benefit.
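The file size difference comes from the storage width of each parameter. The following is a rough sketch only: the parameter count is hypothetical, the assumption that each quantized tensor carries its two encoding parameters as 32 bit floats is ours, and a real DLC file contains additional metadata that is not modeled here.

```python
from array import array

# Hypothetical layer with one million parameters (illustrative only).
num_params = 1_000_000

float_params = array("f", [0.0]) * num_params  # 32 bit floating point
fixed_params = array("B", [0]) * num_params    # 8 bit fixed point

float_bytes = float_params.itemsize * len(float_params)
fixed_bytes = fixed_params.itemsize * len(fixed_params)

# Assumption: encoding-min and encoding-max are stored once per tensor
# as two 32 bit floats.
encoding_bytes = 2 * 4

print(float_bytes)                   # 4000000
print(fixed_bytes + encoding_bytes)  # 1000008
```

Under these assumptions the parameter storage shrinks by roughly a factor of four.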

Quantization Algorithm

This section describes the concepts behind the quantization algorithm used in SNPE. These concepts are used by snpe-dlc-quantize and also by SNPE for input quantization when using the DSP runtime.

Overview

Note: SNPE supports multiple quantization modes. The basics of the quantization, regardless of mode, are described here. See Quantization Modes for more information.

  • Quantization converts floating point data to TensorFlow-style 8-bit fixed point format.
  • The following requirements are satisfied:
    • Full range of input values is covered.
    • Minimum range of 0.01 is enforced.
    • Floating point zero is exactly representable.
  • Quantization algorithm inputs:
    • Set of floating point values to be quantized.
  • Quantization algorithm outputs:
    • Set of 8-bit fixed point values.
    • Encoding parameters:
      • encoding-min - minimum floating point value representable (by fixed point value 0)
      • encoding-max - maximum floating point value representable (by fixed point value 255)
  • Algorithm
    1. Compute the true range (min, max) of input data.
    2. Compute the encoding-min and encoding-max.
    3. Quantize the input floating point values.
    4. Output:
      • fixed point values
      • encoding-min and encoding-max parameters

Details

  1. Compute the true range of the input floating point data.
    • finds the smallest and largest values in the input data
    • represents the true range of the input data
  2. Compute the encoding-min and encoding-max.
    • These parameters are used in the quantization step.
    • These parameters define the range and floating point values that will be representable by the fixed point format.
      • encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0
      • encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255
      • floating point values at every step size, where step size = (encoding-max - encoding-min) / 255, will be representable
    1. encoding-min and encoding-max are first set to the true min and true max computed in the previous step
    2. First requirement: encoding range must be at least 0.01
      • encoding-max is adjusted to max(true max, true min + 0.01)
    3. Second requirement: floating point value of 0 must be exactly representable
      • encoding-min or encoding-max may be further adjusted
  3. Handling 0.
    1. Case 1: Inputs are strictly positive
      • the encoding-min is set to 0.0
      • zero floating point value is exactly representable by smallest fixed point value 0
      • e.g. input range = [5.0, 10.0]
        • encoding-min = 0.0, encoding-max = 10.0
    2. Case 2: Inputs are strictly negative
      • encoding-max is set to 0.0
      • zero floating point value is exactly representable by the largest fixed point value 255
      • e.g. input range = [-20.0, -6.0]
        • encoding-min = -20.0, encoding-max = 0.0
    3. Case 3: Inputs are both negative and positive
      • encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable
      • e.g. input range = [-5.1, 5.1]
        • encoding-min and encoding-max are first set to -5.1 and 5.1, respectively
        • encoding range is 10.2 and the step size is 10.2/255 = 0.04
        • zero value is currently not representable. The closest values representable are -0.02 and +0.02 by fixed point values 127 and 128, respectively
        • encoding-min and encoding-max are shifted by -0.02. The new encoding-min is -5.12 and the new encoding-max is 5.08
        • floating point zero is now exactly representable by the fixed point value of 128
  4. Quantize the input floating point values.
    • The encoding-min and encoding-max parameters determined in the previous steps are used to quantize all the input floating point values to their fixed point representation
    • Quantization formula is:
      • quantized value = round(255 * (floating point value - encoding-min) / (encoding-max - encoding-min))
    • The quantized value is also clamped to be within 0 and 255
  5. Outputs
    • the fixed point values
    • encoding-min and encoding-max parameters
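The steps above can be sketched in Python as follows. This is a minimal illustration of the description, not SNPE's implementation: the function names are hypothetical, the choice of the nearest fixed point value for zero in Case 3 is inferred from the worked example, and Python's round() resolves exact ties to the even value, which may differ from the rounding used by SNPE.

```python
def compute_encoding(values, min_range=0.01):
    """Return (encoding_min, encoding_max) for a set of float values."""
    # Step 1: true range of the input data.
    true_min, true_max = min(values), max(values)

    # Step 2, first requirement: enforce the minimum range.
    enc_min = true_min
    enc_max = max(true_max, true_min + min_range)

    # Step 3, second requirement: 0.0 must be exactly representable.
    if enc_min > 0.0:      # Case 1: strictly positive inputs
        enc_min = 0.0
    elif enc_max < 0.0:    # Case 2: strictly negative inputs
        enc_max = 0.0
    else:                  # Case 3: range includes zero
        step = (enc_max - enc_min) / 255
        # Assumption: zero is mapped to the nearest fixed point value.
        zero_point = round(-enc_min / step)
        enc_min = -zero_point * step
        enc_max = enc_min + 255 * step
    return enc_min, enc_max


def quantize(values, enc_min, enc_max):
    """Step 4: apply the quantization formula, then clamp to [0, 255]."""
    scale = 255 / (enc_max - enc_min)
    return [min(255, max(0, round(scale * (v - enc_min)))) for v in values]


print(compute_encoding([5.0, 10.0]))    # (0.0, 10.0)
print(compute_encoding([-20.0, -6.0]))  # (-20.0, 0.0)
```

The two calls at the end reproduce the Case 1 and Case 2 examples. Case 3 inputs go through the same function and yield a range shifted by at most half a step.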

Quantization Example

  • Inputs:
    • input values = [-1.8, -1.0, 0, 0.5]
  • encoding-min is set to -1.8 and encoding-max to 0.5
  • encoding range is 2.3, which is larger than the required 0.01
  • encoding-min is adjusted to −1.803922 and encoding-max to 0.496078 to make zero exactly representable
  • step size is 0.009020
  • Outputs:
    • quantized values are [0, 89, 200, 255]
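The arithmetic of this example can be retraced step by step. The sketch below uses hypothetical variable names and assumes zero is mapped to the nearest fixed point value. It also prints the unrounded position of each input on the fixed point scale, which shows that 0.5 lands slightly above 255 after the range is shifted and is therefore clamped.

```python
values = [-1.8, -1.0, 0.0, 0.5]

true_min, true_max = min(values), max(values)
step = (true_max - true_min) / 255    # about 0.009020

# Fixed point value nearest to floating point zero.
zero_point = round(-true_min / step)  # 200

# Shift the range so that fixed point value 200 maps exactly to 0.0.
enc_min = -zero_point * step          # about -1.803922
enc_max = enc_min + 255 * step        # about 0.496078

# Unrounded position of each input on the 0..255 scale.
positions = [255 * (v - enc_min) / (enc_max - enc_min) for v in values]

# Round, then clamp to the 8-bit range.
quantized = [min(255, max(0, round(p))) for p in positions]

print([round(p, 2) for p in positions])
print(quantized)  # [0, 89, 200, 255]
```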

Dequantization Example

  • Inputs:
    • quantized values = [0, 89, 200, 255]
    • encoding-min = −1.803922, encoding-max = 0.496078
  • step size is 0.009020
  • Outputs:
    • dequantized values = [−1.8039, −1.0011, 0.0000, 0.4961]
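The text does not spell out a dequantization formula. The sketch below assumes it is the inverse of the quantization formula, that is encoding-min plus the fixed point value times the step size. The function name is hypothetical. It also compares the result with the original inputs of the quantization example to show the size of the round-trip error.

```python
def dequantize(quantized, enc_min, enc_max):
    """Map 8-bit fixed point values back to floating point values."""
    step = (enc_max - enc_min) / 255
    return [enc_min + q * step for q in quantized]


enc_min, enc_max = -1.803922, 0.496078
step = (enc_max - enc_min) / 255

dequantized = dequantize([0, 89, 200, 255], enc_min, enc_max)

# Round-trip error against the inputs of the quantization example.
originals = [-1.8, -1.0, 0.0, 0.5]
errors = [abs(d - o) for d, o in zip(dequantized, originals)]

print(dequantized)
print(max(errors) <= step / 2)  # True
```

For this example every input is recovered to within half a step, and zero is recovered up to the rounding of the printed encoding parameters.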

Quantization Modes

SNPE supports multiple quantization modes. All modes quantize parameters to 8-bit fixed point; they differ in how the quantization parameters are chosen.

Default Quantization Mode

The default mode has been described above, and uses the true min/max of the data being quantized, followed by an adjustment of the range to ensure a minimum range and to ensure 0.0 is exactly quantizable.

Enhanced Quantization Mode

Enhanced quantization mode (invoked by using the "use_enhanced_quantizer" parameter to snpe-dlc-quantize) uses an algorithm to try to determine a better set of quantization parameters to improve accuracy. The algorithm may pick a different min/max value than the default quantizer, and in some cases it may set the range such that some of the original weights and/or activations cannot fall into that range. However, this range does produce better accuracy than simply using the true min/max.

This is useful for some models where the weights and/or activations may have "long tails". (Imagine a range with most values between -100 and 1000, but a few values much greater than 1000 or much less than -100.) In some cases these long tails can be ignored and the range -100, 1000 can be used more effectively than the full range.

The enhanced quantizer still enforces a minimum range and ensures 0.0 is exactly quantizable.
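The effect of a long tail can be illustrated with a small experiment. The sketch below is not the enhanced quantizer, whose algorithm is not described here. It only compares the round-trip error for the true min/max range against a narrower range chosen by hand. The data set, the function names, and the narrower range are all hypothetical.

```python
def encoding_for(lo, hi):
    """Shift [lo, hi] (lo <= 0 <= hi) so that 0.0 is exactly representable."""
    step = (hi - lo) / 255
    zero_point = round(-lo / step)
    enc_min = -zero_point * step
    return enc_min, enc_min + 255 * step


def round_trip(v, enc_min, enc_max):
    """Quantize one value to 8 bits, then dequantize it again."""
    step = (enc_max - enc_min) / 255
    q = min(255, max(0, round((v - enc_min) / step)))
    return enc_min + q * step


def mean_abs_error(values, enc_min, enc_max):
    errs = [abs(v - round_trip(v, enc_min, enc_max)) for v in values]
    return sum(errs) / len(errs)


bulk = [i / 100 for i in range(-100, 101)]  # 201 values in [-1, 1]
outlier = 100.0                             # a single long-tail value
data = bulk + [outlier]

full = encoding_for(min(data), max(data))   # true min/max range
clipped = encoding_for(-1.0, 1.0)           # hand-picked narrower range

bulk_err_full = mean_abs_error(bulk, *full)
bulk_err_clipped = mean_abs_error(bulk, *clipped)
outlier_err_clipped = abs(outlier - round_trip(outlier, *clipped))

print(bulk_err_full, bulk_err_clipped, outlier_err_clipped)
```

With the narrower range, the bulk of the values is represented far more precisely, while the outlier is clamped and loses almost all of its magnitude. Whether that trade improves accuracy depends on the model.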

Quantization Impacts

Quantizing a model and/or running it in a quantized runtime (like the DSP) can affect accuracy. Some models may not work well when quantized, and may yield incorrect results. The metrics for measuring impact of quantization on a model that does classification are typically "Mean Average Precision", "Top-1 Error" and "Top-5 Error". These metrics are published in SNPE release notes for various models.
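A top-k error can be computed by checking whether the true label is among the k highest-scoring classes of each output. The sketch below is a generic illustration, not an SNPE tool. The function name is hypothetical and the scores are made-up values for three samples and three classes, used only to show how a non-quantized and a quantized model could be compared.

```python
def top_k_error(scores_per_sample, labels, k):
    """Fraction of samples whose true label is not in the k best scores."""
    misses = 0
    for scores, label in zip(scores_per_sample, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i],
                        reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Made-up class scores (illustrative only).
labels = [0, 2, 1]
float_scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
quant_scores = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.2, 0.5, 0.3]]

print(top_k_error(float_scores, labels, k=1))  # 0.0
print(top_k_error(quant_scores, labels, k=1))  # one of three samples missed
print(top_k_error(quant_scores, labels, k=2))  # 0.0
```

In practice the same evaluation set would be run through both DLC variants, and the two error rates compared against the accuracy the application requires.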