Power-efficient acceleration for large language models – Qualcomm Cloud AI SDK

Wednesday 11/1/23 04:05am
Posted By Morris Novello

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

Want to accelerate your large language model (LLM) inference workloads without blowing your power budget? Or your cooling budget?

The Qualcomm Cloud AI 100 performs AI inference on the edge cloud faster and more efficiently than on CPUs and GPUs. Plus, it allows you to retain control of sensitive data by keeping it on premises. To take advantage of the high performance, low power consumption and privacy protection, we’re launching the Qualcomm Cloud AI SDK. This software-hardware combination is designed to accelerate a variety of AI tasks, including LLM inference for natural-language use cases like text-to-code, transcription, Q&A and language translation.

LLMs have surged into prominence in applications ranging from powering chatbots to writing application code. Plenty of companies and their engineering teams are scrambling to deploy LLMs not only profitably but also sustainably, in ways that won’t tax the resources in their data centers and edge devices. The Qualcomm Cloud AI 100 combination of hardware and software represents more than a decade of our R&D in low-power acceleration technology for deep learning.

This post describes how developers can take advantage of Qualcomm Cloud AI 100 systems and the new Qualcomm Cloud AI SDK in their own applications, especially in natural language processing (NLP) with LLMs.

The hardware: Qualcomm Cloud AI 100 systems

Qualcomm Cloud AI 100 hardware is available in a low-profile PCIe form factor for servers.

The cards run in commercial servers from platform partners like HPE and Lenovo, with more on the way.

Qualcomm Cloud AI 100 offers high performance with low power consumption. PCIe cards are available in Standard and Pro models.

  • Power (Thermal Design Power): 75 W
  • Machine learning capacity, INT8: up to 400 tera operations per second (TOPS)
  • Machine learning capacity, FP16: up to 200 tera floating point operations per second (TFLOPS)
  • On-die SRAM: up to 144 MB
  • On-card DDR: 16 or 32 GB LPDDR4x at 137 GB/s
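A quick back-of-envelope pass over those spec-sheet numbers shows why the card is interesting for inference. These are peak, datasheet figures; real workloads achieve some fraction of them:

```python
# Efficiency figures derived from the datasheet numbers above.
# Peak values only; sustained throughput depends on the model.

TDP_W = 75                    # Thermal Design Power, watts
INT8_TOPS = 400               # peak INT8 tera-operations per second
FP16_TFLOPS = 200             # peak FP16 tera-FLOPs per second
DDR_BANDWIDTH_GBS = 137       # on-card LPDDR4x bandwidth, GB/s

int8_tops_per_watt = INT8_TOPS / TDP_W          # ~5.3 TOPS/W
fp16_tflops_per_watt = FP16_TFLOPS / TDP_W      # ~2.7 TFLOPS/W

# Roofline-style break-even: how many INT8 ops must be done per byte
# fetched from DDR before compute, not bandwidth, is the bottleneck.
ops_per_byte_breakeven = (INT8_TOPS * 1e12) / (DDR_BANDWIDTH_GBS * 1e9)

print(f"{int8_tops_per_watt:.1f} TOPS/W (INT8)")
print(f"{fp16_tflops_per_watt:.1f} TFLOPS/W (FP16)")
print(f"{ops_per_byte_breakeven:.0f} ops/byte break-even intensity")
```

Token-by-token LLM decoding has low arithmetic intensity (each weight is read roughly once per generated token), which is why the large on-die SRAM (up to 144 MB) matters: the more weights and activations that stay on-chip, the less the DDR bandwidth term dominates.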

The hardware is designed for applying AI processing and analytics to real-time or offline multimedia streams.

The software: Qualcomm Cloud AI SDKs

Qualcomm Cloud AI offers two SDKs: Apps and Platform. Used together, they enable you to compile, optimize and run models from frameworks like ONNX, PyTorch, TensorFlow, Caffe and Caffe2 on Qualcomm Cloud AI 100 hardware. Here is the high-level workflow you’ll follow:

  1. Export an inference-friendly network from a trained neural network you’ve created using common ML frameworks.
  2. Load that inference-friendly network into the Apps SDK either directly, with a model loader, or indirectly, with runtimes like ONNX Runtime. You then compile the network.
  3. The compiler produces a binary image of the network.
  4. On Qualcomm Cloud AI 100 hardware, use the runtime library from the Platform SDK to execute the network binary.
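The four steps above can be sketched as a small driver script. Everything here is illustrative: the function names are hypothetical placeholders, not the actual SDK API (the real flow uses your framework’s ONNX exporter, the Apps SDK compiler, and the Platform SDK runtime); the point is to make the hand-off between stages concrete.

```python
# Illustrative sketch of the export -> compile -> run flow.
# All function names are hypothetical stand-ins, NOT the real SDK API.

def export_inference_network(trained_model: str) -> str:
    # Step 1: export an inference-friendly network (e.g. ONNX)
    # from the framework you trained in.
    return trained_model + ".onnx"

def compile_network(onnx_path: str) -> str:
    # Steps 2-3: the Apps SDK loads the network, directly or via a
    # runtime such as ONNX Runtime, and its compiler emits a binary
    # image targeting Qualcomm Cloud AI 100.
    return onnx_path.replace(".onnx", ".bin")

def run_on_device(binary_path: str, prompt: str) -> str:
    # Step 4: the Platform SDK runtime library loads the binary onto
    # the card and executes inference.
    return f"<output of {binary_path} for {prompt!r}>"

binary = compile_network(export_inference_network("my_llm"))
print(run_on_device(binary, "Translate to French: hello"))
```

Note the separation of concerns: compilation happens once, offline, on a development host; only the compiled binary and the lightweight runtime need to be present on the inference server.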

We’ve rolled out a Qualcomm Cloud AI 100 software stack with tools you can use to create, optimize and deploy a variety of ML inference applications.

Documentation on the SDKs and tools abounds. Have a look at the Qualcomm Cloud AI 100 User Guide for an overview.

Accelerating your LLMs

Following the workflow and using the developer tools above, you can run your existing LLMs on Qualcomm Cloud AI 100.

Typical use cases for LLMs include:

  • Text-to-code, greatly accelerating application development and site building
  • Customer service and chatbots for online retail shopping
  • Document summarization and copilot-like usage to summarize meetings or emails
  • Language translation, improving business access to markets across geographies

Qualcomm Cloud AI 100 supports dozens of NLP models like GPT2 and its variants, and Bidirectional Encoder Representations from Transformers (BERT) and its variants. If you want to enable one of your own models and try to optimize it, you’ll find recipes on our Cloud AI GitHub.

Beyond NLP, Qualcomm Cloud AI 100 supports models in domains from computer vision (image classification, object detection, semantic segmentation, pose estimation, face detection) to autonomous driving. See whether it offers neural network support for the models you’ve created.

Next steps

With the combination of Qualcomm Cloud AI 100 hardware and the Qualcomm Cloud AI SDKs, you can meet the growing inference needs in your data center. The platform builds on Qualcomm Technologies' pedigree in low power consumption, scale, process node leadership and signal processing. And the SDKs give you a versatile, flexible toolchain, with optimized libraries for a variety of models and a wide range of APIs, especially for the inference workloads of LLMs up to 175 billion parameters.
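For a rough sense of scale on that 175-billion-parameter figure, a weights-only back-of-envelope calculation (ignoring KV cache and activations; a sketch, not a sizing guide) shows why serving frontier-scale LLMs is a multi-card proposition:

```python
import math

# Weights-only footprint of a 175B-parameter model, back-of-envelope.
PARAMS = 175e9
BYTES_PER_PARAM_FP16 = 2
CARD_DDR_GB = 32                      # largest on-card DDR option

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9   # 350 GB at FP16
cards_for_weights = math.ceil(weights_gb / CARD_DDR_GB)

print(f"{weights_gb:.0f} GB of FP16 weights -> "
      f"at least {cards_for_weights} cards for the weights alone")
```

Halving the footprint via INT8 quantization, which the hardware accelerates natively, roughly halves the card count, which is one reason quantization figures so prominently in LLM deployment.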

Our next steps with Qualcomm Cloud AI are all about continuous improvement. We’re continuing to push LLM performance, delivering more inferences at lower latency and lower energy consumption. Plus, keep an eye out for announcements about accessing Qualcomm Cloud AI as a cloud instance.

Your next steps are to visit our Qualcomm Cloud AI portal, find out more about Qualcomm Technologies’ approach to accelerating AI and LLMs, and gauge the fit with your organization. Sign up for access to the SDK. See how you can run inference on the edge cloud faster and more efficiently.
