ML training at the edge: Training on mobile devices

Monday 12/13/21 08:56am
Posted By Balaji Calidas

Qualcomm products mentioned within this post are offered by
Qualcomm Technologies, Inc. and/or its subsidiaries.

You can get a lot of innovation out of running machine learning inference on mobile devices, but what if you could also train your models on mobile devices? What would you invent if you could fine-tune your models at the network edge?

With our latest version of the Qualcomm® Adreno™ OpenCL ML SDK, get ready to start working on new mobile applications that use both inference and training. The SDK now lets you fine-tune and improve the accuracy of your existing models at the network edge, avoiding the round trip to the cloud and back. In addition to using a model that’s already trained, you now have the option to fine-tune by updating the weights through training passes.

It’s a new way to take the compute to the data instead of shuttling the data to and from the compute.

What can you do with training on mobile?

When we released OpenCL ML SDK 1.1, we took advantage of acceleration built into the OpenCL driver in our Android image for faster inference on our Adreno GPU. Now, the SDK leverages accelerated operations in the OpenCL driver for the inference, back-propagation and weight-update phases of training.

Naturally, this is different from the resource-intensive training you’d perform on a high-performance workstation or in the cloud. It assumes that you start with a trained model and that the on-device training updates the model, as opposed to training it from scratch.

If you’re already using OpenCL for inference on mobile, then training on mobile is the logical next step on your development path. You can implement it in several use cases at the network edge:

  • Federated learning: Suppose your mobile app collects sensitive data such as photos, and for security, you want to keep the data off the network as much as possible. How can you keep your model updated without constantly sending all the new data to the cloud for training? With OpenCL ML SDK and the Adreno GPU, you can instead use federated learning to run the training loops locally and collect the updates to the model parameters. When convenient, you can send only those updates to the cloud.
  • Transfer learning: In this use case, you start out with a model like MobileNet that was trained in the cloud. Then you update the model by adjusting the weights of specific layers through edge training. This transfers model knowledge from one data domain to another.
  • Personalization: This is similar to transfer learning and could apply to use cases such as video conferencing. Imagine, for example, that you want a model that can identify the participant who is speaking at any given time so you can blur out their background. You can achieve that through live updates to the weights of specific layers during the video conference.
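
To make the federated learning case concrete, here is a minimal sketch in plain Python. It is not the OpenCL ML API. The function names, the tiny linear model and the simple averaging of updates on the server are all assumptions chosen for illustration; the point is only that each device sends a weight delta, never its raw data.

```python
def local_update(global_weights, local_samples, lr=0.1):
    """On the device: train on local data, return only the weight delta."""
    w = list(global_weights)
    for x, y in local_samples:
        # Forward pass and error for a hypothetical linear model.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of the squared error, followed by a weight update.
        w = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
    return [wl - wg for wl, wg in zip(w, global_weights)]


def aggregate(global_weights, deltas):
    """In the cloud: average the deltas from all devices and apply them."""
    n = len(deltas)
    avg = [sum(d[j] for d in deltas) / n for j in range(len(global_weights))]
    return [wg + a for wg, a in zip(global_weights, avg)]


global_weights = [0.0, 0.0]
device_a = [([1.0, 0.0], 1.0)]   # stays on device A
device_b = [([0.0, 1.0], 2.0)]   # stays on device B

deltas = [local_update(global_weights, device_a),
          local_update(global_weights, device_b)]
new_weights = aggregate(global_weights, deltas)
print(new_weights)
```

Only `deltas` crosses the network in this sketch; the sample lists never leave the functions that run on each device.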

Under the hood – Tensor Batch 1 approach and back-propagation

Training consists of a loop that takes an input, performs a forward pass, calculates the loss, performs a backward pass and updates the weights. The loop continues until the loss decreases to an acceptable level. OpenCL ML SDK 1.1 enabled the forward pass, which is the inference portion of machine learning. Now the SDK enables the backward pass, with accelerated operations for the back-propagation and weight-update phases of training.
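
The loop described above can be sketched in a few lines of plain Python. This is an illustration only, not SDK code: the one-weight linear model, the squared-error loss, the learning rate and the stopping threshold are all assumptions picked to keep the example small.

```python
def train(samples, w=0.0, b=0.0, lr=0.05, target_loss=1e-4, max_epochs=1000):
    """Forward pass, loss, backward pass and weight update, repeated."""
    mean_loss = float("inf")
    for _ in range(max_epochs):
        total_loss = 0.0
        for x, y in samples:
            pred = w * x + b            # forward pass (inference)
            err = pred - y
            total_loss += err * err     # squared-error loss
            grad_w = 2.0 * err * x      # backward pass: dLoss/dw
            grad_b = 2.0 * err          # backward pass: dLoss/db
            w -= lr * grad_w            # weight update
            b -= lr * grad_b
        mean_loss = total_loss / len(samples)
        if mean_loss < target_loss:     # stop once the loss is acceptable
            break
    return w, b, mean_loss


# Hypothetical data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, loss = train(data)
print(f"w={w:.3f} b={b:.3f} loss={loss:.6f}")
```

In a fine-tuning scenario the starting values of `w` and `b` would come from the already trained model rather than from zero.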

Of course, if you tried to train on a mobile device with typical batch sizes of 32 or more tensors, you’d stress the memory on the device. That’s why the operations in the latest version of our SDK have a tensor batch dimension size of 1. The smaller memory footprint of the Tensor Batch 1 approach is well suited to mobile.
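
A quick back-of-the-envelope calculation shows why the batch dimension matters. The activation shape below is hypothetical, and real memory use also depends on how many intermediate tensors are kept for the backward pass, so treat this as a rough sketch of the scaling rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4


def activation_bytes(batch, height, width, channels):
    """Bytes needed to hold one float32 activation tensor."""
    return batch * height * width * channels * BYTES_PER_FLOAT32


shape = (112, 112, 32)                     # hypothetical layer output
batch_1 = activation_bytes(1, *shape)
batch_32 = activation_bytes(32, *shape)

print(f"batch 1:  {batch_1 / 2**20:.2f} MiB")
print(f"batch 32: {batch_32 / 2**20:.2f} MiB")
```

The footprint of each stored activation grows linearly with the batch dimension, so a batch of 32 needs 32 times the memory of a batch of 1 for the same layer.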

The diagram below illustrates the flow of training with this approach.

Gradients are accumulated in each iteration of the mini-batch and the weights are updated at the end of the mini-batch. That is mathematically equivalent to training with a tensor batch size equal to the mini-batch size.
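
The equivalence can be checked with a small plain-Python sketch. It assumes a linear model with squared-error loss averaged over the mini-batch, and it assumes the weights stay fixed while gradients are accumulated. The function names are illustrative and are not part of the SDK.

```python
def sample_grad(w, x, y):
    """Gradient of (w.x - y)^2 with respect to w for ONE sample (batch 1)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * err * xi for xi in x]


def accumulated_step(w, batch, lr):
    """Batch-1 style: accumulate per-sample gradients, then update once."""
    acc = [0.0] * len(w)
    for x, y in batch:                      # one sample per iteration
        g = sample_grad(w, x, y)            # weights are not changed here
        acc = [a + gi for a, gi in zip(acc, g)]
    n = len(batch)
    return [wi - lr * a / n for wi, a in zip(w, acc)]


def batched_step(w, batch, lr):
    """Reference: gradient of the mean loss over the whole mini-batch."""
    n = len(batch)
    errs = [sum(wi * xi for wi, xi in zip(w, x)) - y for x, y in batch]
    grad = [sum(2.0 * e * x[j] for e, (x, _) in zip(errs, batch)) / n
            for j in range(len(w))]
    return [wi - lr * gj for wi, gj in zip(w, grad)]


w0 = [0.5, -1.0]
mini_batch = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 1.0)]
w_acc = accumulated_step(w0, mini_batch, lr=0.1)
w_ref = batched_step(w0, mini_batch, lr=0.1)
print(w_acc, w_ref)
```

Both functions produce the same weights up to floating-point rounding, while `accumulated_step` only ever holds one sample's gradient plus the running sum.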

Training uses the 32-bit floating-point data type; all tensor values and arithmetic are floating-point. For models with batch normalization layers, you need to preserve the statistics (the mean and variance) from the original training of the model. However, the trainable parameters for batch normalization, such as scale and bias, can be updated.
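
As a sketch of what this means for a batch normalization layer, the plain-Python example below normalizes with stored statistics and updates only the scale and bias. The single-channel setup, the numeric values and the function names are assumptions for illustration; this is not the SDK's implementation.

```python
import math


def bn_forward(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize with the STORED statistics, then apply scale and bias."""
    x_hat = (x - mean) / math.sqrt(var + eps)
    return gamma * x_hat + beta, x_hat


def bn_backward(grad_out, x_hat, var, gamma, eps=1e-5):
    """Gradients for scale, bias and the layer input."""
    d_gamma = grad_out * x_hat
    d_beta = grad_out
    d_x = grad_out * gamma / math.sqrt(var + eps)   # passed to earlier layers
    return d_gamma, d_beta, d_x


# Hypothetical values for one channel.
running_mean, running_var = 0.5, 4.0    # preserved from the original training
gamma, beta = 1.0, 0.0                  # trainable scale and bias
lr = 0.1

x, target = 2.5, 0.0
y, x_hat = bn_forward(x, running_mean, running_var, gamma, beta)
grad_out = 2.0 * (y - target)           # derivative of (y - target)^2
d_gamma, d_beta, d_x = bn_backward(grad_out, x_hat, running_var, gamma)

gamma -= lr * d_gamma                   # scale is updated
beta -= lr * d_beta                     # bias is updated
# running_mean and running_var are intentionally left untouched.
print(gamma, beta, running_mean, running_var)
```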

In future versions of OpenCL ML, we anticipate that training options will be expanded beyond tensor batch dimension size 1.

Download OpenCL ML SDK now!

Remember how impressed you were when machine learning inference at the network edge was finally possible? This is your chance to get in early on training at the edge.

Here’s what you’ll find in the SDK:

  • Spec for the OpenCL extension, cl_qcom_ml_ops, introducing a new set of CL API calls, data structures and tokens for specifying and enqueuing machine learning operations
  • OpenCL ML developer guide on model training
  • Sample applications to show how you can use the SDK
  • The Generate Graph Model Tool to convert TensorFlow protobuf frozen models (.pb) or TensorFlow Lite (.tflite) models into a TensorFlow Graph Model representation
  • The Graph Model to QFP16/32 Tool to extract the weight tensor as .qfp16 and .qfp32 file types

We think many of you will find our updated SDK useful for fine-tuning at the edge. Download OpenCL ML SDK now and let us know about your use cases and suggestions for improvement.

Qualcomm Adreno is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.