Today, AI is everywhere, and generative AI is being touted as its killer app. We are seeing large language models (LLMs) like ChatGPT, built on the generative pre-trained transformer (GPT) architecture, being applied in new and creative ways.
LLMs have largely been relegated to powerful cloud servers, but round-trip latency can lead to poor user experiences, and the cost of processing such large models in the cloud is increasing. For example, the cost of a search query, a common use case for generative AI, is estimated to increase tenfold compared to traditional search methods as models become more complex. Foundational general-purpose LLMs like GPT-4 and LaMDA, which power such searches, have achieved unprecedented levels of language understanding, generation capability, and world knowledge while pushing 100 billion parameters.
Issues like privacy, personalization, latency, and increasing costs have led to the introduction of Hybrid AI, in which developers distribute inference between devices and the cloud. A key ingredient for successful Hybrid AI is a modern dedicated AI engine that can handle large models at the edge. For example, the Snapdragon® 8cx Gen 3 Compute Platform powers many devices running Windows on Snapdragon and features the Qualcomm® Hexagon™ NPU for running AI at the edge. Combined with tools and SDKs that provide advanced quantization and compression techniques, this lets developers deliver hardware-accelerated inference in their Windows apps for models with billions of parameters. At the same time, the platform's always-connected capability via 5G and Wi-Fi 6 provides access to cloud-based inference virtually anywhere.
With such tools at our disposal, let’s take a closer look at Hybrid AI, and how you can take advantage of it.
Hybrid AI leverages the best of local and cloud-based inference, letting the two work together to deliver more powerful, efficient, and highly optimized AI. Simple (aka light) models run locally on the device, while more complex (aka full) models can run locally and/or be offloaded to the cloud.
Developers select different offload options based on model or query complexity (e.g., model size, prompt, and generation length) and acceptable accuracy. Other considerations include privacy or personalization (e.g., keeping data on the device), latency and available bandwidth for obtaining results, and balancing energy consumption versus heat generation.
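To make these trade-offs concrete, here is a minimal sketch of what such an offload policy might look like in C++. The criteria mirror the considerations above, but the field names, thresholds, and limits are illustrative assumptions, not values from Qualcomm documentation:

```cpp
#include <cstddef>

// Where should this query run? (Illustrative policy; not an SDK API.)
enum class InferenceTarget { OnDevice, Cloud };

struct QueryContext {
    std::size_t modelParams;      // size of the model, in parameters
    std::size_t generationTokens; // expected prompt + generation length
    bool        containsUserData; // data that should stay on the device
    bool        networkAvailable; // is cloud offload even possible?
    double      batteryLevel;     // 0.0 (empty) to 1.0 (full)
};

InferenceTarget chooseTarget(const QueryContext& q) {
    // Privacy and connectivity come first: personal data stays local,
    // and without a network there is nothing to offload to.
    if (q.containsUserData || !q.networkAvailable)
        return InferenceTarget::OnDevice;

    // Assumed capacity limit: models beyond a few billion parameters
    // are offloaded. Real limits depend on device memory and the NPU.
    constexpr std::size_t kOnDeviceParamLimit = 3'000'000'000ULL;
    if (q.modelParams > kOnDeviceParamLimit)
        return InferenceTarget::Cloud;

    // Long generations on a low battery are better done in the cloud
    // to balance energy consumption against heat generation.
    if (q.generationTokens > 2048 && q.batteryLevel < 0.2)
        return InferenceTarget::Cloud;

    return InferenceTarget::OnDevice;
}
```

A production arbiter could weigh these signals with a learned model rather than fixed thresholds, as in the device-centric approach described below.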
Hybrid AI offers flexibility through three general distribution approaches:
- Device-centric: Models that provide sufficient inference performance on data collected at the device run locally. If the performance is insufficient (e.g., the end user is not satisfied with the inference results), an on-device neural network or arbiter may decide to offload inference to the cloud instead.
- Device-sensing: Light models run on the device to handle simpler inference cases (e.g., automatic speech recognition (ASR)). The output predictions from those on-device models are then sent to the cloud as input to full models, which perform additional, complex inference (e.g., generative AI from the detected speech) and transmit the results back to the device (see the sketch after this list).
- Joint-processing: Since LLMs are memory-bound, hardware often sits idle while model data is loaded. When several LLMs work together to generate tokens, it can be beneficial to run LLMs speculatively in parallel on the device and offload accuracy checks to an LLM in the cloud.
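As a concrete illustration of the device-sensing approach, the sketch below handles a voice query in two stages. runLocalAsr and callCloudLlm are hypothetical placeholders (stubbed here so the example compiles), standing in for an on-device light model and a cloud inference endpoint; neither is an SDK API:

```cpp
#include <string>
#include <vector>

// Hypothetical placeholder: in a real app this would invoke an
// on-device light ASR model via your chosen runtime.
std::string runLocalAsr(const std::vector<float>& audio) {
    return "what's the weather like today";  // stubbed transcript
}

// Hypothetical placeholder: in a real app this would call a cloud
// inference endpoint hosting the full generative model.
std::string callCloudLlm(const std::string& prompt) {
    return "Forecast: sunny, 22 C";          // stubbed cloud response
}

// Device-sensing: cheap inference (ASR) stays on the device; only the
// compact text prediction crosses the network, not the raw audio.
std::string handleVoiceQuery(const std::vector<float>& audio) {
    const std::string transcript = runLocalAsr(audio);
    return callCloudLlm(transcript);
}
```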
For more information about Hybrid AI, check out The Future of AI is Hybrid.
A Stack for the Edge
A powerful engine and development stack are required to take advantage of the NPU in an edge platform. This is where the Qualcomm® AI Stack, shown in Figure 1, comes in.
The Qualcomm AI Stack is supported across a range of platforms from Qualcomm Technologies, Inc., including Windows on Snapdragon and the Snapdragon Mobile Platforms that power many of today's smartphones.
At the stack’s highest level are popular AI frameworks (e.g., TensorFlow) for generating models. Developers can then choose from a couple of options to integrate those models into their Windows on Snapdragon apps. Note: TFLite and TFLite Micro are not supported for Windows on Snapdragon.
The Qualcomm® Neural Processing SDK for AI provides a high-level, end-to-end solution comprising a pipeline that converts models into a Hexagon-specific Deep Learning Container (DLC) format and a runtime that executes them, as shown in Figure 2.
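For instance, converting and quantizing an ONNX model might look like the following. The tool and flag names follow the Qualcomm Neural Processing SDK documentation as we understand it and may differ between SDK versions, so treat this as a sketch:

```shell
# Convert a trained ONNX model into the Hexagon-specific DLC format.
snpe-onnx-to-dlc --input_network model.onnx --output_path model.dlc

# Optionally quantize it, using a list of sample inputs for calibration.
snpe-dlc-quantize --input_dlc model.dlc --input_list input_list.txt \
                  --output_dlc model_quantized.dlc
```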
The Qualcomm Neural Processing SDK for AI is built on the Qualcomm AI Engine Direct SDK, which provides lower-level APIs to run inference on specific accelerators via individual backend libraries and is becoming the preferred method for working with the Qualcomm AI Engine.
The diagram below (Figure 3) shows how models from different frameworks can be used with the AI Engine Direct SDK.
The backend libraries abstract away the different hardware cores, providing the option to run models on the most appropriate core available across the different variations of the Hexagon NPU (the HTP in the case of the Snapdragon 8cx Gen 3).
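Because each backend is a separate library, an app can choose a backend at runtime and fall back gracefully when an accelerator is unavailable. The sketch below shows the general idea on Windows; the library file names (QnnHtp.dll, QnnCpu.dll) are assumptions based on the SDK's backend naming scheme, so verify them against your SDK package:

```cpp
#include <windows.h>
#include <cstdio>

// Try the HTP (Hexagon NPU) backend first, then fall back to the CPU
// backend. Library names are assumptions; check your SDK package.
HMODULE loadPreferredBackend() {
    const char* candidates[] = { "QnnHtp.dll", "QnnCpu.dll" };
    for (const char* name : candidates) {
        if (HMODULE lib = LoadLibraryA(name)) {
            std::printf("Loaded backend: %s\n", name);
            // Next (not shown): resolve the SDK's interface entry point
            // from this library and build/execute the graph against it.
            return lib;
        }
    }
    return nullptr;  // no backend available
}
```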
Figure 4 shows how developers work with the Qualcomm AI Engine Direct SDK.
A model is trained and then passed to a framework-specific model conversion tool, along with any optional Op Packages containing definitions of custom operations. The conversion tool generates two components:
- model.cpp containing Qualcomm AI Engine Direct API calls to construct the network graph
- model.bin (binary) file containing the network weights and biases (32-bit floats by default)
Note: Developers have the option to generate these as quantized data.
The SDK's generator tool then builds a runtime model library, while any Op Packages are generated as code to run on the target.
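Putting the pieces together, a conversion-and-build pass with the Qualcomm AI Engine Direct SDK tooling might look like this. The tool and flag names reflect our reading of the SDK documentation and may vary by version, so consult the Qualcomm AI Engine Direct documentation for the exact invocation:

```shell
# 1. Convert a trained ONNX model into model.cpp (graph-construction API
#    calls) and model.bin (weights and biases). Supplying a calibration
#    input list here is the typical route to quantized output.
qnn-onnx-converter --input_network model.onnx --output_path model.cpp

# 2. Build the runtime model library from the generated sources
#    (a target flag selecting the toolchain may also be required).
qnn-model-lib-generator -c model.cpp -b model.bin -o model_libs
```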
This document provides more information on how these components are integrated into a Windows app.
We have implementations of the Qualcomm AI Stack for several other Snapdragon platforms spanning verticals such as mobile, IoT, and automotive. This means you can develop your AI models once and run them across those different platforms.
Try it out!
When you’re ready to build AI at the edge for Windows on Snapdragon, there are a couple of sample apps to check out:
- Qualcomm AI Stack Stable Diffusion Demo: Optimize a Stable Diffusion model using the AI Model Efficiency Toolkit (AIMET), our toolkit for optimizing models, and then prepare and run it using the Qualcomm AI Engine Direct SDK.
- Sample App Tutorial: Integrate the Qualcomm AI Engine Direct SDK in a C++ app. Although based around a typical Linux or Android workflow, it includes callouts to help you build for Windows.
Whether your Windows on Snapdragon app runs AI solely at the edge or as part of a hybrid setup, we’ve got several learning resources on how to work with our AI solutions:
- The Future of AI is Hybrid: Provides a more in-depth look at this emerging field.
- Windows on Snapdragon Developer Portal: Information and documentation on how to get started building and porting your apps to Windows on Snapdragon.
- Snapdragon Neural Processing Engine SDK Documentation: Everything you need to work with the Qualcomm Neural Processing SDK.
- Qualcomm AI Engine Direct Documentation: Everything you need to work with the Qualcomm AI Engine Direct SDK.
- AI for Qualcomm Compute: A course that covers Windows on Snapdragon development and using the Qualcomm Neural Processing SDK for AI.
And be sure to sign up for our Windows on Snapdragon newsletter to stay up to date with the latest news and announcements.
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. AIMET is a product of Qualcomm Innovation Center, Inc.