TensorFlow Lite for Inference at the Edge

Thursday 11/12/20 09:00am | Posted By Hsin-I Hsu

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.

In our introductory blog on TensorFlow, we discussed the framework's graph-based architecture for representing machine learning (ML) models, learned about its output formats, and saw how the Qualcomm® Neural Processing SDK for artificial intelligence (AI) can convert and optimize trained TensorFlow models for inference on devices built with Snapdragon® mobile platforms.

To support inference at the device edge, TensorFlow also provides developers with TensorFlow Lite, a special version of the framework designed for mobile devices. TensorFlow Lite builds, optimizes, and runs ML models with the goal of addressing the unique requirements of mobile devices, including low latency, privacy, connectivity, and power efficiency.

Let's take a closer look at TensorFlow Lite to see how it compares to TensorFlow and discuss which one to use for your mobile ML apps.

TensorFlow Lite Main Components

TensorFlow Lite has two main components. Its interpreter runs models (i.e., performs inference) at the device edge on a variety of hardware types, including mobile phones, embedded devices, and microcontrollers. But before inference can happen, developers use TensorFlow Lite's converter to optimize the performance and size of TensorFlow models, which the interpreter can then execute efficiently at the device edge.

Developers can invoke TensorFlow Lite's converter either from the command line or through its Python API. The converter accepts Keras models, SavedModels, and concrete functions generated from TensorFlow's different API levels, and produces a .tflite model file.
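
For example, converting an in-memory Keras model with the Python API takes only a few lines. The following is a minimal sketch that assumes a trivial Keras model; the output file name is a placeholder, and from_saved_model() or from_concrete_functions() would be used in the same way for the other formats.

import tensorflow as tf

# A tiny Keras model, used here only to illustrate the conversion call.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2)
])

# Convert the in-memory Keras model to a TensorFlow Lite flatbuffer;
# from_saved_model() and from_concrete_functions() follow the same pattern.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Write the .tflite file that the interpreter will load on the device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)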

Figure 1 - Summary of the TensorFlow pipeline. Left: the "server" encapsulates the TensorFlow APIs, converter component, and resulting .tflite file, which are all involved in generating the trained model. Right: the "client" represents the application at the edge that uses the interpreter component to run inference, along with the various delegates that send work to the backend (the underlying device edge hardware).

Perhaps the biggest benefit of TensorFlow Lite and its converter is the set of optimizations that can be made without additional, third-party tools. These include smaller binary files, which require less storage and less bandwidth to download; lower latency (i.e., fewer computations and less power required for inference); and faster execution through hardware accelerator-specific optimizations.

These benefits can be realized through three ML model optimization techniques provided by the converter:

  • Quantization: reduces the precision of the numbers used for the model's parameters at the cost of some accuracy, as illustrated in the sketch after this list.
  • Pruning: removes parameters from the model which have only a minor impact on predictions.
  • Clustering: groups the weights of each layer into clusters which share centroid values.
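
To illustrate the first of these techniques, here is a minimal sketch of post-training quantization applied during conversion via the converter's default optimization setting; the SavedModel directory and output file name are placeholders.

import tensorflow as tf

# Load a trained model for conversion (the directory path is a placeholder).
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Optimize.DEFAULT enables post-training quantization of the weights,
# trading a small amount of accuracy for a smaller, faster model.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

quantized_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(quantized_model)

A full-integer quantization flow would additionally supply a representative dataset so that activations can be quantized as well; the default shown here quantizes only the weights.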

Delegating the Work

To achieve optimal speed at the edge, developers can use a delegate to run inference on hardware accelerators. A delegate does this by modifying the model's execution graph so that operations run more efficiently on the underlying hardware the delegate targets. TensorFlow Lite includes delegate implementations for GPUs (e.g., the Qualcomm® Adreno™ GPU), the Qualcomm® Hexagon™ Digital Signal Processor (DSP), the Android Neural Networks API (NNAPI), and others. It also allows developers to implement their own custom delegates.

The following code snippet, taken from TensorFlow Lite's GPU delegate page, shows the logic for dynamically determining whether the underlying edge hardware supports a GPU delegate for inference. If it is supported, a GPU delegate is added to the interpreter's options; otherwise, the code falls back to running on CPU threads. The code then makes the in-memory ML model available to the interpreter, tells the interpreter to run (i.e., perform inference), and reads the predictions the interpreter produces. (Note: the TensorFlow package is distributed under the Apache 2.0 license.)

import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.CompatibilityList;
import org.tensorflow.lite.gpu.GpuDelegate;

// Initialize interpreter with GPU delegate
Interpreter.Options options = new Interpreter.Options();
CompatibilityList compatList = new CompatibilityList();

if(compatList.isDelegateSupportedOnThisDevice()){
    // if the device has a supported GPU, add the GPU delegate
    GpuDelegate.Options delegateOptions = compatList.getBestOptionsForThisDevice();
    GpuDelegate gpuDelegate = new GpuDelegate(delegateOptions);
    options.addDelegate(gpuDelegate);
} else {
    // if the GPU is not supported, run on 4 threads
    options.setNumThreads(4);
}

// 'model' is the in-memory .tflite flatbuffer loaded by the app
// (e.g., a MappedByteBuffer)
Interpreter interpreter = new Interpreter(model, options);

// Run inference; writeToInput() and readFromOutput() are app-defined
// helpers that populate the input buffer and read back the predictions
writeToInput(input);
interpreter.run(input, output);
readFromOutput(output);

A similar flow is used for the Hexagon delegate, which accelerates inference on devices equipped with the Hexagon DSP. The Hexagon delegate complements NNAPI acceleration for devices that either don't yet support NNAPI or lack an NNAPI driver for the DSP. TensorFlow Lite's NNAPI delegate, on the other hand, provides hardware acceleration for newer devices such as the latest Snapdragon mobile platforms and the Qualcomm® Robotics RB5 Platform, and should therefore be used when possible.

Note that the Qualcomm Robotics RB5 platform runs Ubuntu. Since NNAPI usually targets Android, we have ported NNAPI to the Qualcomm Robotics RB5 so that developers can take advantage of the NNAPI delegate on this platform as well.
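
On a Linux-based target such as the Qualcomm Robotics RB5, one way to attach a delegate is through TensorFlow Lite's Python API, which can load a delegate distributed as a shared library. The sketch below is illustrative only: the model path and the delegate library name are placeholder assumptions rather than files shipped with any particular platform.

import numpy as np
import tensorflow as tf

# Load a delegate shipped as a shared library ("libexample_delegate.so"
# is a placeholder name, not an actual library).
delegate = tf.lite.experimental.load_delegate("libexample_delegate.so")

# Create the interpreter with the delegate attached; any ops the
# delegate cannot handle fall back to the CPU.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate])
interpreter.allocate_tensors()

# Run inference on zero-filled input matching the model's expected shape.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
dummy_input = np.zeros(input_details[0]["shape"],
                       dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])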

TensorFlow or TensorFlow Lite — Which Should You Choose?

With the ability to generate and optimize models and run hardware-accelerated inference on Snapdragon mobile platforms using either TensorFlow or TensorFlow Lite, you might be wondering which framework to choose. The short answer is that it depends on what hardware you're working with and whether TensorFlow Lite is supported on it.

TensorFlow Lite's delegates provide direct access to specific integrated processors and give developers everything they need to build and run TensorFlow Lite models on many of our mobile platforms without third-party tools. However, TensorFlow Lite's Hexagon and GPU delegates may not be supported on newer devices, although NNAPI, which is used via the NNAPI delegate, is becoming increasingly capable of supporting newer devices.

On the other hand, our Qualcomm Neural Processing SDK converts and optimizes TensorFlow models while providing more dynamic runtime support across all devices built with Snapdragon than TensorFlow Lite does. So, if developers can't find the necessary hardware acceleration support in TensorFlow Lite, a good option is to use TensorFlow with our Qualcomm Neural Processing SDK. Since trained models can easily be converted to the TensorFlow Lite format later, developers can migrate to TensorFlow Lite once that framework supports the necessary hardware acceleration.

Conclusion

Mobile platforms like our Snapdragon series provide a range of options for hardware-accelerated inference at the device edge. And when it comes to building, optimizing, and running ML models, developers have the choice of using TensorFlow in conjunction with the Qualcomm Neural Processing SDK, or using TensorFlow Lite as a standalone product via its delegates on newer hardware that supports it.

If you have a cool project that uses TensorFlow or TensorFlow Lite on Snapdragon hardware, tell us about your project for consideration in our project showcase!

Snapdragon, Qualcomm Adreno, Qualcomm Hexagon, Qualcomm Robotics, and Qualcomm Neural Processing SDK are products of Qualcomm Technologies, Inc. and/or its subsidiaries.