How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100

Thursday 3/7/24 03:40am | Posted By Natarajan Vaidhyanathan

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

Co-written with Apoorva Gokhale


Figure 1: Generation on Qualcomm Cloud AI 100 Ultra (1) Baseline with FP16 weights (2) Acceleration with MX6 (3) Acceleration with MX6 and SpD

Speculative Sampling (SpS), also known as Speculative Decoding (SpD), and weight compression through the MXFP6 microscaling format are two advanced techniques that significantly accelerate large language model (LLM) decoding for inference workloads. Both techniques are available for LLM acceleration on Qualcomm Technologies' data center AI accelerators. To achieve a significant inference performance speedup, start exploring Qualcomm Cloud AI instances available on Amazon AWS EC2 and Cirrascale Cloud Services.

This post explores the application of these advanced techniques to two large language models, CodeGen 1-7B and Llama 2-7B-Chat-FT, showcasing the potential for accelerated AI processing and efficiency. Join us as we unravel the details of this advancement, and be sure to try out the speculative decoding feature documented in the Qualcomm Cloud AI GitHub repository.

Language Models (LMs) and Autoregressive Generation

This section introduces the basics of LLM decoding with traditional autoregressive decoding and points out the inherently sequential nature of multi-token generation.

Text completion is a common task for LMs: given a text prompt, the LM generates a plausible completion.

LMs process text at the granularity of atomic units of text called tokens. Informally, tokens can be thought of as subwords of a word. The set of tokens that an LM can process is referred to as its vocabulary and denoted as Σ.

A Language Model takes a sequence of tokens as input and produces a probability distribution function (pdf) over its vocabulary. Sampling from this pdf gives one plausible token following the input. In implementation, a pdf over Σ is represented as a vector of length |Σ| with elements in [0,1] that sum to one.

Autoregressive (AR) Generation

Typically, we want to generate multiple following tokens, not just one. Given a prompt of m tokens u1,…,um, generating n tokens v1,…,vn requires n invocations of the LM (implemented as a decoder-only transformer model), as shown below:

Figure 2: Outline of Autoregressive (AR) Generation

Observe the following about the above completion generation process in Figure 2:

  1. Logits and Warping: The LM produces a vocabulary-sized vector of quantities called logits. Normalizing it via softmax gives a pdf over Σ. Sampling from this pdf is generally referred to as multinomial sampling.
    • Such a pdf can also be warped to redistribute the probability mass over a small subset of Σ.
    • A commonly used special case of multinomial sampling is greedy sampling, which arises from top-1 warping: all of the probability mass is placed on the token with the highest logit value and 0 on the others, so sampling from such a distribution always returns the token with the highest probability/logit value.
  2. Multiple Invocations: The LM is invoked n times, once for each generated token. Each invocation makes use of all the parameters of the model (a minimal sketch of this loop appears after this list).
  3. Autoregressive: The output of the LM is concatenated with the input to the same LM in the following iteration.
  4. Causality: It is not possible to generate vi without generating all the previous tokens, making generation a serial process -- the generation of vi and vj (i≠j) cannot be parallelized.
  5. Repeated calculation: The LM invocations are independent, and there is no internal state maintained within the LM across invocations. This results in the same input being processed multiple times, generating the same set of Key-Value vectors. For example, the prompt itself is processed n times and the generated token vi is processed n-i times.
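
To make Figure 2 and the observations above concrete, here is a minimal sketch of the autoregressive loop in Python. It assumes a hypothetical `lm_logits` callable that maps a token sequence to a logits vector over the vocabulary (an illustrative assumption, not any particular framework or the Qualcomm Cloud AI 100 API) and deliberately omits KV caching, which is introduced next.

```python
import numpy as np

def softmax(logits):
    # Normalize a logits vector into a pdf over the vocabulary.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def generate_autoregressive(lm_logits, prompt_tokens, n, greedy=True, seed=0):
    """Generate n tokens after the prompt using n sequential LM invocations."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n):
        logits = np.asarray(lm_logits(tokens), dtype=np.float64)  # one full forward pass
        pdf = softmax(logits)
        if greedy:
            next_token = int(np.argmax(pdf))                      # top-1 warping
        else:
            next_token = int(rng.choice(len(pdf), p=pdf))         # multinomial sampling
        tokens.append(next_token)                                 # autoregressive feedback
    return tokens[len(prompt_tokens):]
```

With a toy `lm_logits` (for example, one returning random logits over a small vocabulary), the loop reproduces the n sequential invocations and the repeated reprocessing of the prompt described in point 5.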

Accelerating Autoregressive Generation by reducing the computation with KV Cache (KV$)

The common approach to avoid re-computation is to maintain internal state in the LM, referred to as the Key-Value Cache (KV$), between invocations. Such a model, LM_with_KV$, functions the same as the LM described above, with the following operational differences:

  • when invoked for the first time on the prompt u1,…,um, it creates the KV$ and writes to it the Key-Value vectors of the m prompt tokens.
  • when invoked for subsequent token generations:
    • It takes as input only the single token generated in the previous iteration instead of the growing sequence of tokens.
    • It calculates the Key-Query-Value vectors of the single input token and appends the Key-Values to the KV$.
    • It processes only the single token through all layers of the LM but calculates the causal attention of the single token with all the Key-Value vectors in the KV$.

The text completion with LM_with_KV$() can be rewritten as follows:
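
Below is a minimal sketch of this rewritten loop, assuming a hypothetical `lm` object that exposes `prefill()` and `decode_step()` calls (illustrative only, not the Qualcomm Cloud AI 100 interface) and a `sample()` function such as the greedy/multinomial sampler sketched above.

```python
def generate_with_kv_cache(lm, prompt_tokens, n, sample):
    """Prefill once on the prompt, then decode one token per invocation.

    lm.prefill(tokens)    -> pdf over the vocabulary; writes the prompt's Key-Value
                             vectors into the KV$.
    lm.decode_step(token) -> pdf over the vocabulary; appends the token's Key-Value
                             vectors to the KV$ and attends over the whole cache.
    sample(pdf)           -> a token id (greedy or multinomial).
    """
    generated = []
    pdf = lm.prefill(prompt_tokens)        # Prefill Phase: all m prompt tokens, one pass
    v = sample(pdf)
    generated.append(v)
    for _ in range(n - 1):                 # Decode Phase: one new token per invocation
        pdf = lm.decode_step(v)            # input is only the previously generated token
        v = sample(pdf)
        generated.append(v)
    return generated
```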

It is worth emphasizing that LM_with_KV$() with appropriate KV$ contents and LM() are the same function -- both calculate the same conditional probability distribution over the vocabulary that can be used to predict the next token following vi.

LM(u1,…,um,v1,…,vi) ≡ LM_with_KV$(vi) ≡ p(·|u1,u2,…,um,v1,…,vi)

Performance Profiles of LM Completion Generation

The execution phase of LM_with_KV$() that takes a prompt as input and generates the first token is called the Prefill Phase. The execution phase that generates the subsequent tokens autoregressively, one token at a time, is called the Decode Phase.

These phases have quite different performance profiles. The Prefill Phase requires just one invocation of the LM, fetching all the parameters of the model once from DRAM and reusing them m times to process all m tokens in the prompt. With a sufficiently large value of m, the Prefill Phase is more compute constrained than memory-bandwidth constrained.

On the other hand, the Decode Phase is more DRAM-bandwidth constrained than computation constrained, because all the parameters of the model need to be fetched to process just one input token. Further, utilization of computational resources in the accelerator is quite low, as only one token is processed at a time. In other words, the decode phase can afford to do additional computation without impacting the latency.

To a first-order approximation, the number of tokens per second that an LM can generate during the decode phase can be calculated as the DRAM read bandwidth (bytes/sec) divided by the size of the model in bytes.
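
As a back-of-the-envelope illustration of this approximation (the bandwidth figure below is an assumption chosen for easy arithmetic, not a Qualcomm Cloud AI 100 specification):

```python
def approx_decode_rate(dram_read_bw_gb_per_s, model_size_gb):
    """First-order estimate: decode tokens/sec ≈ DRAM read bandwidth / model size."""
    return dram_read_bw_gb_per_s / model_size_gb

# A ~7B-parameter model stored in FP16 occupies roughly 14 GB of weights.
print(approx_decode_rate(100.0, 14.0))   # ~7 tokens/s at an assumed 100 GB/s
print(approx_decode_rate(100.0, 5.5))    # ~18 tokens/s if the same weights take ~6.25 bits each
```

This is also why weight compression (discussed later) translates almost directly into decode-phase speedup.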

Speculative Decoding

It appears that AR decoding cannot be parallelized because of its causality-induced serial nature: the token vi+1 can only be generated after the token vi has been generated. Let’s observe how SpD overcomes this hurdle.

SpD leverages the insight that it’s faster to check whether a sequence of K (>1) tokens is a plausible continuation for a prompt than generating such a K-token continuation – as the former just requires one invocation of the LM.

In SpD, the plausible completion is generated by another, lower-capacity LM called the Draft Language Model (DLM). In this context, the original LM is referred to as the Target LM (TLM).

To check whether v1,…,vK is a plausible completion of u1,…,um, we need to check the following K conditions:

v1,…,vi is a plausible completion of u1,…,um, for i = 1,…,K

Due to the parallelism inherent in checking the above K conditions, all of them can be checked with one invocation of the LM, while generating K tokens requires K sequential invocations.

For the overall speculative decoding scheme to be faster, the following parameters need to be chosen appropriately:

  • The capacity of the DLM relative to the TLM needs to be small enough to generate speculations quickly (autoregressively) yet large enough to generate plausible completions for the TLM.
  • The length of speculation (K) needs to be small enough that neither the single invocation of the TLM to check the completion nor the DLM’s autoregressive generation of the speculation becomes too expensive computationally.

More formally, given a prompt of u1…um and a potential completion v1…vK, it is possible to efficiently evaluate the following conditional pdfs:

pLM(·|u1,…,um)

pLM(·|u1,…,um,v1)

…

pLM(·|u1,…,um,v1,…,vK)

i.e., the K+1 conditional pdfs pLM(·|u1,…,um,v1,…,vi) for i = 0,1,…,K, where i = 0 denotes conditioning on the prompt alone.

All of the above can be obtained from one invocation of LM_with_KV$() with the K tokens v1,…,vK as input and the KV$ holding the Key-Value vectors of u1,…,um (the pdf conditioned on the prompt alone is already available from the invocation that processed the prompt). This invocation is also referred to as scoring the sequence of K tokens v1,…,vK.

In summary, SpD relies on the fact that for small values of K, the latency to score K tokens is no different, or minimally higher, compared to autoregressively generating a single token because the computational units are underutilized during the AR decode stage.
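
A minimal sketch of such a scoring call follows, assuming a hypothetical `tlm_logits_all_positions` function that returns one logits row per input position (an assumption for illustration; KV caching is omitted here for brevity).

```python
import numpy as np

def score_speculation(tlm_logits_all_positions, prompt_tokens, draft_tokens):
    """Score K draft tokens with a single TLM invocation.

    tlm_logits_all_positions(tokens) -> array of shape (len(tokens), vocab_size),
    where row j holds the logits for the token following tokens[:j + 1].
    Returns K + 1 pdfs: p(.|prompt), p(.|prompt, v1), ..., p(.|prompt, v1..vK).
    """
    tokens = list(prompt_tokens) + list(draft_tokens)
    logits = np.asarray(tlm_logits_all_positions(tokens), dtype=np.float64)
    rows = logits[len(prompt_tokens) - 1:]                 # K + 1 rows of logits
    z = np.exp(rows - rows.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)               # softmax each row into a pdf
```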

Components of Speculative Decoding

There are three functional components in SpD to produce n subsequent tokens given a prompt of m tokens u1,…um.

  1. TLM: The LM from which we need to generate completion called Target or True LM (TLM).
  2. DLM: A smaller LM, called the Draft LM (DLM), used to guess a K-token completion by generating those tokens autoregressively, one at a time.
    • K is a smaller number compared to n. Since the DLM is small, the resulting increase in latency is expected to be minimal.
    • The sequence generated by the DLM is the speculated completion for TLM.
  3. MRS: A probabilistic acceptance scheme, called Modified Rejection Sampling (MRS), that decides whether the TLM should accept the speculated completion or only a prefix of it.
    • This is the heart of the decoding process that guarantees the correctness of Speculative Sampling. Note that the speculated completion is likely to be a plausible completion from the perspective of the DLM but may not be so for the TLM. So, the TLM may accept only a prefix of the speculation or even none of it. The acceptance decision is probabilistic in nature. The probabilistic reasoning underpinning MRS ensures that the probability of accepting the continuation t1,…,ti of the prompt u1,…,um indeed matches the probability the TLM assigns to it, pTLM(t1,…,ti|u1,…,um).

Process of Speculative Decoding

We begin with the simpler case where the DLM and TLM use greedy sampling instead of general multinomial sampling. This is simpler because the pdfs become one-hot, so all the probabilistic reasoning reduces to deterministic reasoning.

Figure 3: AR generation of K tokens from DLM

Greedy Sampling

  1. For the given prompt of m tokens, the DLM will autoregressively generate K tokens t1,..,tK, one at a time, with K invocations (as shown in Figure 3 above).
  2. The TLM will process the sequence u1,u2,…,um,t1,…,tK with one invocation and calculate v1,…,vK+1 (as shown in Figure 4 below).
    • v1 is the next token predicted by TLM on u1…um.
    Figure 4: Single invocation of TLM on speculation

    • vi is the next token predicted by TLM on u1,u2,..um,t1,..,ti-1 for 1 < i ≤ K + 1.
  3. Modified Rejection Sampling (MRS) Algorithm: This component takes the K speculated tokens t1,…,tK from the DLM and the K+1 tokens v1,…,vK+1 from the TLM via the scoring process. The output is a desired plausible completion for the prompt u1,…,um. It scans the speculated tokens left to right, one token at a time, to see whether each matches the corresponding token sampled by the TLM during scoring. If it does, it is accepted as part of the plausible completion; if it doesn’t, the process stops, the token that came from the TLM is taken as the alternative to the rejected one, and the remaining speculation is rejected. If all are accepted, then an additional token is accepted. See Figure 6 for the possible acceptances when K=3.

    Figure 5: MRS with Greedy Sampling on DLM and TLM

    Figure 6: Potential completions from Speculation

  4. At this point, we would have generated at least 1 and at most K+1 tokens for the TLM with just one invocation of the TLM and K invocations of the DLM.
  5. The above speculate-accept interaction is repeated until the required n tokens are generated (a minimal sketch of one such round appears after this list).
  6. An important implementation detail, which we do not elaborate on here, is that the KV$ of the DLM and TLM must be purged of the tokens that were rejected.
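
Below is a minimal sketch of one speculate-and-accept round under greedy sampling. It assumes hypothetical helpers `dlm_greedy_generate(context, k)` (the K cheap DLM invocations of step 1) and `score_with_tlm(context, draft)` (the single TLM scoring invocation, as sketched earlier with the TLM bound in); the KV$ purging of step 6 is omitted.

```python
import numpy as np

def greedy_spd_round(dlm_greedy_generate, score_with_tlm, context_tokens, k):
    """One speculate-and-accept round of SpD with greedy sampling on DLM and TLM.

    Returns the tokens accepted on behalf of the TLM: at least 1, at most k + 1.
    """
    # Step 1: the DLM speculates k tokens autoregressively (k cheap invocations).
    draft = dlm_greedy_generate(context_tokens, k)          # [t1, ..., tk]

    # Step 2: the TLM scores the speculation with one invocation -> k + 1 pdfs.
    pdfs = score_with_tlm(context_tokens, draft)            # shape (k + 1, vocab)
    tlm_choices = np.argmax(pdfs, axis=-1)                  # greedy picks [v1, ..., v(k+1)]

    # Step 3: greedy MRS -- accept the matching prefix; on the first mismatch,
    # substitute the TLM's own token and discard the rest of the speculation.
    accepted = []
    for i, t in enumerate(draft):
        if int(t) == int(tlm_choices[i]):
            accepted.append(int(t))                         # speculated token accepted
        else:
            accepted.append(int(tlm_choices[i]))            # rejection: take the TLM token
            return accepted
    accepted.append(int(tlm_choices[k]))                    # all accepted: free bonus token
    return accepted
```

Repeating such rounds, and purging rejected tokens from both KV$s, yields between 1 and K+1 target-model tokens per TLM invocation.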

Multinomial Sampling

Qualcomm Cloud AI 100 supports the general case where the DLM and TLM perform multinomial sampling. In this case, when the TLM scores the input sequence and outputs conditional pdfs, the MRS scheme makes probabilistic accept/reject decisions with appropriate probabilities, so that the completions are sampled from the desired target distribution. For a further explanation of multinomial sampling, please see the extended version of this article here.
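
For reference, here is a minimal sketch of the per-token acceptance test used in the standard modified rejection sampling scheme from the speculative sampling literature; this is a generic illustration, not necessarily the exact Qualcomm Cloud AI 100 implementation.

```python
import numpy as np

def mrs_accept_or_resample(draft_token, p_tlm, q_dlm, rng):
    """Acceptance test for one speculated token in the multinomial case.

    draft_token was sampled from the DLM pdf q_dlm; p_tlm is the TLM pdf at the
    same position. Accept with probability min(1, p/q); on rejection, resample a
    replacement from the normalized residual max(0, p_tlm - q_dlm). Accepted or
    resampled tokens are then distributed exactly as the TLM would sample them.
    """
    p, q = p_tlm[draft_token], q_dlm[draft_token]
    if rng.random() < min(1.0, p / max(q, 1e-20)):
        return int(draft_token), True                         # accepted
    residual = np.maximum(p_tlm - q_dlm, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False  # rejected: replacement token
```

Here `rng` is a NumPy Generator, e.g. `np.random.default_rng()`.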

In summary, SpD involves the use of a smaller capacity model (DLM) running autoregressively to speculate a multi-token completion. Since DLM is small, it can be executed much faster than TLM. The MRS algorithm validates the speculation as a plausible completion from TLM by invoking the TLM just once and accepting a prefix of the speculation. There is a theoretical guarantee that there is no accuracy loss due to the speculate-accept handshake process between the DLM and the TLM.

Weight-Only Quantization using Microscaling (MX) Formats

In addition to Speculative Sampling, weight-only quantization using Microscaling (MX) formats can also achieve a ~2x speedup in LLM decoding.

In 2023, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. Qualcomm Cloud AI 100 has implemented MXFP6 (a 6-bit datatype) as a weight compression format. When the user selects this compilation option, the compiler will automatically compress the weights from FP32 or FP16 into MXFP6 format during the offline compilation phase. Compressing in this form saves 61% of the size of the weights and thus reduces the pressure on DRAM capacity.

At inference run time, the Qualcomm Cloud AI 100 performs on-the-fly decompression in software using its vector engine with an optimized decompression kernel. Decompression can be performed in parallel with weight fetching and computation, so the overhead is mostly hidden. After decompression, the calculations are performed as before in FP16 precision. The use of FP16 is acceptable since the LLMs remain DRAM constrained, so compute is not the bottleneck. FP16 also retains higher-precision activations, which mitigates the accuracy loss from weight quantization.
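
As a rough sketch of the idea behind the format (based on the public OCP MX specification as we understand it; the element grid below is a uniform simplification of the real FP6 encoding, and none of this reflects the device's internal implementation), the 61% figure follows from the per-block bit budget: 6 bits per element plus an 8-bit shared scale amortized over a 32-element block is 6.25 bits, versus 16 bits for FP16.

```python
import numpy as np

MX_BLOCK = 32        # elements sharing one scale in an MX block
ELEM_BITS = 6        # MXFP6 element width
SCALE_BITS = 8       # shared power-of-two scale per block

def mx6_savings_vs_fp16():
    bits_per_element = ELEM_BITS + SCALE_BITS / MX_BLOCK    # 6.25 bits/element
    return 1.0 - bits_per_element / 16.0                    # ~0.61, i.e. ~61% smaller

def fake_quantize_mx6_block(x):
    """Quantize-then-dequantize one 32-element block with a shared power-of-two scale.

    Illustrative only: elements are snapped to a uniform grid instead of the true
    non-uniform FP6 (E2M3) grid, whose maximum normal magnitude is 7.5.
    """
    amax = np.max(np.abs(x)) + 1e-30
    scale = 2.0 ** np.floor(np.log2(7.5 / amax))            # shared block scale
    scaled = np.clip(x * scale, -7.5, 7.5)
    step = 0.125                                            # simplified uniform spacing
    return np.round(scaled / step) * step / scale

print(f"Weight-size reduction vs FP16: {mx6_savings_vs_fp16():.0%}")   # ~61%
```

On the device, the compression itself happens offline in the compiler, and the vector engine restores FP16 blocks on the fly at run time, as described above.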

Experimental Results

This section presents the details of the performance speedup achieved on Qualcomm Cloud AI 100 using both greedy SpD and MX6 weight compression, relative to a baseline with no speculative decoding and parameters in FP16.

The speedup is reported on 3 different Qualcomm Cloud AI 100 variants:

  • Standard (75W TDP),
  • Pro (75W TDP), and
  • Ultra (150W TDP)

For diversity, we report the speedup on two networks, CodeGen 1-7B mono and Llama 2-7B-chat-FT, each with different context lengths. CodeGen 1 is a family of models for program synthesis. The mono subfamily is fine-tuned to produce Python programs from specifications in natural language. Llama 2-7B-chat-FT is a model fine-tuned by Qualcomm from Meta's Llama 2-7B.

Acceleration on Qualcomm Cloud AI 100 Ultra

Figure 1, from the beginning of the blog, shows three runs of Llama 2 7B-chat-FT for a particular prompt starting at the same time. The left column shows the execution of the model under a baseline condition with FP16 parameters; the middle column shows the model accelerated with MX6 compression; the right column shows the model accelerated with both MX6 compression and SpD.

The significant speedup is visible in the pace at which tokens are produced and completed in the rightmost column. Worth noting, the text highlighted in blue in the MX6 + SpD column is the speculated completion from the DLM that was accepted.

Figure 7 documents the speedup in the metric of generated tokens per second (t/s) observed in Figure 1.

Figure 7: Llama 2 7B Acceleration on Cloud AI 100 Ultra

Speedup on Qualcomm Cloud AI 100 Standard and Pro

A similar range of speedups can be achieved on the Standard and Pro variants of Qualcomm Cloud AI 100, as shown in Figure 8.

Figure 8: Range of speedups on Qualcomm Cloud AI 100 Standard and Pro

Configurations

This section provides the configurations used to measure this speedup.

Configuration             | CodeGen 1-7B                  | Llama 2-7B chat Finetuned
TLM                       | ~7B parameters in MXFP6       | ~7B parameters in MXFP6
DLM                       | ~350M parameters in FP16      | 115M parameters in FP16
Max Prompt Length         | 32                            | 128
Max Generation Length     | 224                           | 896
Speculation Length (K)    | 7                             | 7
Batch Size                | 1                             | 1
Diverse set of prompts    | Hello World, BFS, Topological Sort, Tarjan’s SCC, Linear Regression | Elicit information about weather, trip guidance, restaurant suggestions, car control, Music Search, Flight Search

Conclusions and Next Steps

Large Language Models can be accelerated by Speculative Decoding, increasing performance efficiency without a loss in accuracy. Weight compression using MXFP6 is another technique capable of speeding up LLMs 2-fold. Synergistically, both features together offer a multiplicative speedup of LLM decoding by a factor of ~4. Both techniques are available on Qualcomm Cloud AI 100 inference accelerators, which can be utilized in Amazon AWS and Cirrascale Cloud Services. Be sure to check out the model recipes provided for Speculative Decoding in the Qualcomm Cloud AI GitHub repository to streamline your LLM workflow!

If you're interested in the FAQs, appendix, and references from this article, please find them here.