Fundamentals of Voice UI: Part I: Building Blocks

Thursday 1/14/21 09:04am
|
Posted By Ana Schafer

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

Audio technologies are central to Qualcomm Technologies' audio products, delivering crisp, clear listening experiences across a wide range of devices. We are pleased to have DSP Concepts, a leading provider of audio development tools and IP that power over 40 million products, as part of our ecosystem community. DSP Concepts developed a flexible, high-performance reference design powered by the Qualcomm® QCS400 series of audio Systems-on-Chip (SoCs).

For this guest blog, the DSP Concepts team provided insights into the key algorithms of a voice UI system that affect the quality of the voice signal received, and ultimately the performance of the voice UI. If you need a primer on the building blocks of voice UI, see their previous blog, Building Voice Assistants for Home Audio.


The text that follows was provided by DSP Concepts.

Introduction

In the past few years, voice control has become ubiquitous, with millions of voice UI products launched, ranging from smart speakers and sound bars to Internet of Things (IoT) appliances. Regardless of the type of product, much of the performance of a voice UI system depends on the quality of the voice signal the system receives. The old maxim “garbage in, garbage out” applies as much to these systems as it does to any other technology. The better the ratio of desired signal (the user’s voice) to noise (any other sounds), the more reliably a voice UI system can work.

Voice UI systems receive their commands using multiple microphones, usually arranged into arrays controlled through digital signal processing. The accuracy of a voice recognition system depends on the ability of these arrays to focus on the user’s voice and reject unwanted stimuli, such as environmental noise or sounds emanating from the device itself. Complicating matters, microphone array design is complex and unfamiliar to many engineers, and the challenge grows with the number of microphones in the array. An engineer needs to determine which types of microphones will work best for the array, how many microphones to use, and in what physical configuration to place them. A processing algorithm is then needed to allow the array to identify the direction of the user’s voice and focus on that voice while rejecting other sounds. Many such algorithms are available, and all must be optimized for the performance of the microphones, the size and configuration of the array, and the acoustical effects of the enclosure in which they are mounted.

This blog post will address each of these variables, using data to show the effects of the various design choices, and to give product design engineers information to help them make the most appropriate choices for their application.

Key components of Voice UI

A high-level overview of voice UI blocks is provided in the previous blog on Building Voice Assistants for Home Audio, which also offered insights into the decisions that audio developers should consider when designing a voice-enabled product, and the benefits of the combined feature set of the Qualcomm® QCS400 series of audio Systems-on-Chip (SoCs). The block diagram is provided below for reference.

Figure 1: Block Diagram of typical Voice UI

The algorithm that makes a voice UI product possible is actually a collection of several algorithms, each with a specific function to help the microphone array focus on a user’s voice and ignore unwanted sounds. Here is a brief description of the algorithms typically used in voice UI.

Trigger/Wake Word

The performance of a voice UI system depends on the detection accuracy of trigger words – such as “Alexa” or “Ok Google” – that activate the device to respond to user commands. The trigger word must be recognized immediately and with reasonable accuracy for voice activation to work. Therefore, trigger word detection is mostly done locally on the device to avoid the latency of relying on Internet connectivity. To some degree, the device must always be active because it must constantly listen for the trigger word. The trigger word must be sufficiently complex and cannot be a word or phrase that is commonly used in conversation. This helps produce a distinctive waveform at the microphone output, increasing the probability that the audio front-end algorithms recognize the wake word. The trigger word also should not be too long: the longer the phrase, the more cumbersome the device becomes for the consumer to use. Typically, a trigger word of three to five syllables is the best choice.

Trigger word detection is evaluated using two performance criteria:

  • Number of false alarms per hour, which defines how often the algorithm indicates a trigger when none is present. This is typically measured by playing spoken content for a 24-hour period and counting the number of false wakeups.
  • Trigger detection percentage, which indicates how well the algorithm correctly detects the trigger phrase out of a given number of repetitions. This test is performed by repeating the trigger word under stationary noise conditions, such as fan or microwave noise, as well as non-stationary conditions, such as music or speech distractors in the background.
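As a rough illustration of how these two metrics are derived from a test run, the sketch below computes them from hypothetical logged counts (the function names and numbers are illustrative, not part of any DSP Concepts tooling):

    # Sketch: deriving the two trigger-word metrics from hypothetical test logs.
    def false_alarms_per_hour(num_false_wakeups: int, playback_hours: float) -> float:
        """False alarms per hour over a long playback of non-trigger speech."""
        return num_false_wakeups / playback_hours

    def trigger_detection_percentage(num_detected: int, num_utterances: int) -> float:
        """Percentage of scripted trigger-word utterances correctly detected."""
        return 100.0 * num_detected / num_utterances

    # Made-up example: 3 false wakeups over 24 hours, 92 of 100 triggers detected.
    print(false_alarms_per_hour(3, 24.0))          # 0.125 false alarms per hour
    print(trigger_detection_percentage(92, 100))   # 92.0 percent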

Figure 2: False Alarm Vs. Probability of Detection

Wake word models trained by trigger detection algorithms come in different sizes. Small models take less memory and processing but make more mistakes; large models require more resources but make fewer mistakes. Trigger word algorithms have an adjustable sensitivity setting that can be used to tune the model's trade-off between false alarms and trigger detection. The trade-off curve between the two metrics is shown in Figure 2. Designers tend to trade off trigger detection for fewer false accepts, because customers are less forgiving of false triggers but accept having to repeat a command occasionally. Therefore, it is recommended to tune the algorithm to achieve the desired rate of false alarms rather than a target trigger detection rate.

DSP Concepts has measured trigger algorithm performance under several speech levels and interfering noise conditions in multiple test rooms. We found that the most prominent factor affecting trigger detection performance is the signal-to-noise ratio (SNR) at the output of the microphone array. The graph below shows the trigger detection probability vs. SNR for a 1-mic scenario. A 6-dB improvement in SNR from the 0 dB condition (where the user’s voice is at the same level as the distractor) significantly improves the detection percentage, to 80%. In most voice UI applications, the user’s voice is only a few dB louder than the surrounding noise. Note that 80% is considered acceptable performance for most automatic speech recognition (ASR) systems.

Figure 3: Performance of trigger word detection as a function of SNR in babble noise
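Since SNR at the array output is the driving factor, it helps to be able to measure it. Below is a minimal sketch of one way to estimate it, assuming separate speech-only and noise-only recordings are available (the recording setup and variable names are assumptions):

    import numpy as np

    def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
        """Estimate SNR in dB from separate speech-only and noise-only recordings."""
        speech_power = np.mean(speech.astype(np.float64) ** 2)
        noise_power = np.mean(noise.astype(np.float64) ** 2)
        return 10.0 * np.log10(speech_power / noise_power)

    # Synthetic check: a 440 Hz tone roughly 6 dB above white noise at 16 kHz.
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    speech = 0.2 * np.sin(2 * np.pi * 440 * t)       # power ~0.02
    noise = 0.0707 * rng.standard_normal(16000)      # power ~0.005
    print(round(snr_db(speech, noise), 1))           # ~6.0 dB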

Direction of Arrival (DOA)

Once the trigger word has been recognized, the next step is to determine the direction of arrival of the user’s voice. The DOA algorithm then tells the beamformer algorithm in which direction it should focus.

The core function of a DOA algorithm is to examine the phase or time delay relationship of the signals coming from different microphones in the array, and use this information to determine which microphone received the sound first. However, the challenge is to filter out the reflections from walls, floor, ceiling and other objects in the room that will cause the user’s voice to arrive from multiple other directions and not just directly from the user’s mouth. To this end, a DOA algorithm includes precedence logic, which separates the louder initial arrival from the quieter reflections. This function electronically eliminates acoustical reflections within a room, and if carefully tuned, the algorithm is even able to reject reflections off nearby surfaces, such as a wall directly behind a smart speaker. Furthermore, DOA algorithms can also use decision logic to recalculate the position of the user’s mouth only when the incoming signal level is above the ambient noise level threshold.
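The blog does not disclose the specific DOA method used; a common textbook approach to the time-delay step is generalized cross-correlation with phase transform (GCC-PHAT). The sketch below estimates the delay, and hence an angle, for a single hypothetical microphone pair (the sample rate, spacing, and far-field angle formula are assumptions for illustration):

    import numpy as np

    def gcc_phat_delay(sig, ref, fs, max_tau):
        """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
        n = sig.size + ref.size
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        cross = SIG * np.conj(REF)
        cross /= np.abs(cross) + 1e-12           # phase transform: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = int(fs * max_tau)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    # Hypothetical pair: 50 mm spacing, 16 kHz, source arriving 2 samples later at mic 2.
    fs, spacing, c = 16000, 0.05, 343.0
    rng = np.random.default_rng(1)
    src = rng.standard_normal(4096)
    mic1, mic2 = src, np.roll(src, 2)
    tau = gcc_phat_delay(mic2, mic1, fs, max_tau=spacing / c)
    angle = np.degrees(np.arcsin(np.clip(tau * c / spacing, -1.0, 1.0)))
    print(f"delay = {tau * 1e6:.0f} us, estimated angle = {angle:.1f} deg")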

DOA algorithm performance can be measured by the average deviation from the true angle of arrival of the user’s voice. Figure 4 shows this average deviation as a function of SNR. Zero deviation indicates a perfect determination of the direction of arrival, while a deviation of 180 degrees indicates a completely inaccurate prediction. The graph shows that the algorithm performs reliably at SNRs above 0 dB, with a very small average error.

Figure 4. Consolidated DOA results. The Y-axis represents the SNR of the wake word utterance. The X-axis is the deviation error in degrees. The DOA algorithm performs well once the SNR is above 5 dB.

Beamformers

Beamformers are spatial filters that can help focus on sounds coming from a particular direction. This is achieved using multiple microphones in an array to improve the SNR by isolating the user's voice while rejecting sounds that come from other directions.

For example, if the user is on one side of the microphone array and an air conditioner is on the other side, the sound from the air conditioner arrives first at the microphone opposite the user, then arrives a fraction of a second later at the microphone closest to the user. The beamformer algorithm uses these time differences to null out the air conditioner sound while preserving the user’s voice.

The more microphones in an array, the more effective beamforming can be. An array with two microphones has a limited ability to cancel sounds, but an array with multiple microphones can cancel sounds coming from more directions. The fewer the number of microphones, the more the performance will vary as the ‘look angle’—the angle between the user’s voice and the front axis of the voice UI product—changes.

Let’s begin by looking at the performance of a 6-element circular beamformer with a diameter of 70 mm. The microphones are evenly distributed on a circular array as shown below. We designed the beamformer to have a look angle of 0 degrees. That is, it accepts audio arriving from a 0-degree angle and attenuates audio from other directions. The polar pattern is a function of frequency and is shown below.

Figure 5: Microphone distribution (left) and beam pattern (right) of a 6-microphone circular array with 70 mm diameter
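As a minimal sketch of how the geometry in Figure 5 translates into per-microphone alignment delays, the snippet below computes delay-and-sum steering delays for a 0-degree look angle under a far-field assumption (the delay-and-sum structure and the sample values are illustrative only; the actual beamformer design is not disclosed here):

    import numpy as np

    # 6 microphones evenly spaced on a 70 mm diameter circle, steered to 0 degrees.
    c = 343.0                                    # speed of sound, m/s
    radius = 0.070 / 2
    mic_angles = np.arange(6) * (2 * np.pi / 6)
    mic_xy = radius * np.column_stack((np.cos(mic_angles), np.sin(mic_angles)))

    look_dir = np.array([1.0, 0.0])              # 0-degree look angle (unit vector)

    # For a far-field plane wave, a mic further along the look direction hears the
    # wavefront earlier; delay each mic so the target signal lines up before summing.
    arrival_lead = mic_xy @ look_dir / c         # seconds each mic leads the array centre
    steer_delays = arrival_lead - arrival_lead.min()
    print(np.round(steer_delays * 1e6, 1))       # per-mic steering delays in microseconds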

The performance of a beamformer depends on how well the beam pattern of the microphone array can be optimized. However, a more intuitive way to understand beamformer performance is to look at how the SNR improvement of the array translates into voice UI performance. So instead of looking at beam patterns, at DSP Concepts we study the SNR enhancement of an array. For example, if we assume a speech level of 60 dB SPL, a background noise level of 50 dB, and a microphone SNR of 65 dB, then the microphone array is able to improve the SNR as shown below:

Figure 6: SNR improvement as a function of Frequency with a 6-mic circular array

The graph shows roughly 6 dB improvement above 1 kHz. The SNR improvement is relative to a single microphone. Evaluating based on SNR provides intuition into how much benefit the array will bring. A 6 dB improvement allows you to “stand 6 dB further away” – or twice as far – as would be possible with a single microphone.
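That rule of thumb follows from free-field propagation, where the level drops about 6 dB for every doubling of distance; a quick check of the arithmetic (illustrative only):

    def distance_factor(snr_gain_db: float) -> float:
        """How much further the talker can stand for the same effective SNR,
        assuming free-field (inverse-square) falloff of ~6 dB per doubling of distance."""
        return 10 ** (snr_gain_db / 20.0)

    print(round(distance_factor(6.0), 2))   # ~2.0x the distance of a single microphone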

In an upcoming blog in this series, we plan to provide insights into the effect that various microphone parameters – such as SNR, gain matching, and microphone spacing in the array – have on the performance of beamformers.

Echo Canceller

Voice UI devices that incorporate speakers must often contend with their own playback content as a distractor during a voice query. The device must subtract this content from the signals picked up by the microphone array to maintain voice detection accuracy. The challenge is that the content is significantly altered by the playback speaker, the DSP processing, and even the microphones picking up the waveform. Fortunately, the echo canceller algorithm can compare the output of the microphone array to the original playback signal before it is processed by the DSP, and calculate correction curves the algorithm can use to subtract the direct sound of the speaker from the waveform of the voice command.
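The blog does not say which adaptive filter structure is used; a common choice for this compare-and-subtract step is a normalized LMS (NLMS) filter. The single-microphone sketch below models the speaker-to-microphone echo path from the playback reference and subtracts the estimated echo (the filter length, step size, and synthetic echo path are assumptions):

    import numpy as np

    def nlms_echo_canceller(mic, ref, num_taps=1024, mu=0.5):
        """Single-channel NLMS sketch: estimate the echo path from the playback
        reference `ref` and subtract the estimated echo from the mic signal."""
        w = np.zeros(num_taps)                    # adaptive echo-path estimate
        out = np.zeros_like(mic, dtype=np.float64)
        for n in range(num_taps, mic.size):
            x = ref[n - num_taps:n][::-1]         # most recent reference samples
            echo_est = w @ x
            e = mic[n] - echo_est                 # residual = mic minus estimated echo
            w += mu * e * x / (x @ x + 1e-8)      # normalized LMS update
            out[n] = e
        return out

    # Synthetic check: the mic hears a delayed, attenuated copy of the playback.
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(16000)
    mic = 0.6 * np.concatenate((np.zeros(32), ref[:-32]))
    residual = nlms_echo_canceller(mic, ref, num_taps=64)
    print(f"residual power after convergence: {np.mean(residual[4000:] ** 2):.2e}")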

Room Acoustics

However, the playback content is also affected by acoustical reflections from the environment that make the playback sound arrive at each microphone in the array at a different time. The room modes and the absorptive effects of room furnishings also alter the spectral content of the playback signal. In order to subtract enough of these acoustical echoes from the microphone signals to achieve an acceptable SNR, the Acoustic Echo Cancellation (AEC) algorithm must “look for” sounds that match the program material within a certain margin of error (to compensate for changes to the waveform caused by acoustics), and over a defined time window that corresponds to the expected reverberation time. The time period over which the AEC looks for reflections is called the “echo tail length.” The longer the echo tail length, the more reflections can be canceled and the better the algorithm performs. However, longer tails require more memory and more processing.
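The memory and processing cost grows directly with the tail length, since the number of adaptive filter taps is the tail length times the sample rate (a 16 kHz rate is assumed below for illustration):

    fs = 16000                                    # assumed AEC sample rate in Hz
    for tail_ms in (50, 100, 150, 200):
        taps = int(fs * tail_ms / 1000)
        print(f"{tail_ms} ms tail -> {taps} filter taps per mic per playback channel")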

The plot below shows the Echo Return Loss Enhancement (ERLE) as a function of the tail length for rooms with different reverb times. ERLE is a measure of the attenuation of the speaker signal achieved by the echo canceller at the microphone. An ERLE of 25 dB is considered the minimum requirement for good performance, though the best AEC algorithms can cancel more than 30 dB of echo.

The graph shows that the most reverberant spaces require a longer echo tail to achieve good ERLE. A semi-anechoic room (Sound Room in the chart) is fairly easy to deal with but does not represent real-world usage, while the conference room is the most reverberant environment and requires the longest echo tail. It can also be seen that, beyond a certain echo tail length, longer tails provide only marginal improvement. DSP Concepts recommends a 150-200 msec echo tail for far-field smart speakers and Smart TVs in a living room.

Figure 7: Echo canceler performance in four rooms with increasing reverberation time. The larger rooms benefit from algorithms using long echo tails.
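ERLE itself is straightforward to compute from an echo-only test (no near-end speech): it is the ratio of echo power at the microphone to the power of the residual left after cancellation. A minimal sketch, assuming the two signals are available as arrays:

    import numpy as np

    def erle_db(mic_echo: np.ndarray, residual: np.ndarray) -> float:
        """Echo Return Loss Enhancement, measured with echo-only content."""
        return 10.0 * np.log10(np.mean(mic_echo ** 2) /
                               (np.mean(residual ** 2) + 1e-12))

    # e.g. erle_db(mic, residual) >= 25 dB meets the minimum target noted above.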

Linearity of the loudspeakers

AEC performance also depends on the linearity of the speaker output. At loud volume levels, speakers exhibit higher distortion, increasing the non-linear components that the AEC cannot easily cancel. The linearity of a loudspeaker is measured using Total Harmonic Distortion (THD). THD is reported as a percentage of the loudspeaker's signal level; the lower the THD, the more linear the speaker. The table below specifies a rule of thumb for the impact of THD on AEC performance. For example, with 1% THD the distortion level would be 40 dB lower than the signal level, and with 10% THD the distortion would be 20 dB below the signal level. In other words, by increasing the distortion from 1% to 10%, the ERLE of the echo canceller is reduced by 20 dB. We recommend a THD of less than 2% to achieve at least 30 dB of ERLE.

Figure 8: Impact of THD on the AEC performance
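The dB figures in that rule of thumb come directly from expressing THD as a level relative to the signal, i.e. 20·log10(THD/100). A quick check (illustrative arithmetic only):

    import math

    def thd_floor_db(thd_percent: float) -> float:
        """Level of the distortion products relative to the signal, in dB."""
        return 20.0 * math.log10(thd_percent / 100.0)

    for thd in (1.0, 2.0, 10.0):
        print(f"{thd:4.1f}% THD -> distortion {abs(thd_floor_db(thd)):.0f} dB below the signal")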

It is important to measure the THD of the entire system, including the speaker-to-microphone path. Simply measuring the acoustic output of the speaker is insufficient, since sound can also be conducted through the plastic enclosure and couple directly into the microphones.

Because of the distances between the microphones in an array, each microphone receives a slightly different set of acoustical echoes and a slightly different direct sound from the speaker, so achieving the maximum SNR requires separate AEC processing for each microphone.

Multi-channel Playback

Applications such as sound bars require multi-channel echo cancellers to compensate for each playback channel independently. Downmixing to a lower channel count typically results in degraded AEC performance. The figure below shows, on the left, 30-35 dB of ERLE for a 3-channel product using 3-channel echo cancellers. When the 3-channel content is downmixed to 2 channels to apply a stereo AEC, performance is degraded by 5-10 dB. The downside of multi-channel AEC is that it is MIPS intensive, since the CPU load increases with the number of filter taps used in the AEC. A shorter tail length and a larger sample block size can be used to trade off MIPS against the required AEC performance.

Figure 9: Effect of downmixing on Echo canceler performance
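As a very rough feel for why multi-channel AEC is MIPS intensive, the filtering cost alone scales with taps × playback channels × microphones × sample rate (the numbers below are hypothetical, and adaptation overhead is ignored):

    fs = 16000                                     # assumed sample rate in Hz

    def aec_macs_per_second(tail_ms: float, channels: int, mics: int) -> float:
        """Rough estimate: one multiply-accumulate per tap, per playback channel,
        per microphone, per sample (adaptation cost ignored)."""
        taps = fs * tail_ms / 1000
        return taps * channels * mics * fs

    # Hypothetical sound bar: 3 playback channels, 4 microphones, 150 ms echo tail.
    print(f"{aec_macs_per_second(150, 3, 4) / 1e6:.0f} MMAC/s")   # ~461 MMAC/s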

Noise Reduction

Although microphone array systems use directional pickup patterns to filter out unwanted sounds (i.e., noise), some of these unwanted sounds can be attenuated or eliminated with an algorithm that recognizes the characteristics that separate them from the desired signal. The unwanted sounds can then be removed, much as someone who dislikes lemon flavor might ignore the yellow candies in a bowl. A noise reduction algorithm can run on a single microphone or an array, so it can assist with trigger word recognition, and improve voice UI performance after all the other algorithms have done their jobs. Thus, noise reduction might be used in multiple stages of a voice UI signal processing chain.

Voice commands are momentary, as opposed to steady-state, events. Any sound that is always present, or that is repetitive, can be detected and removed from the signal coming from the microphone array. Examples include road noise in automobiles, and dishwasher and HVAC system noise in homes. Sounds that are above or below the frequency spectrum of the human voice can also be filtered out.
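One classic single-channel way to remove such steady-state noise is spectral subtraction: estimate the noise spectrum during pauses and subtract it from each frame. The sketch below is a minimal illustration of the idea, not DSP Concepts' algorithm (the frame size, overlap, and the assumption that the first frames contain only noise are all simplifications):

    import numpy as np

    def spectral_subtraction(x, noise_frames=10, frame=512, floor=0.05):
        """Minimal spectral subtraction: estimate a stationary noise magnitude
        spectrum from the first `noise_frames` frames and subtract it everywhere."""
        hop = frame // 2
        win = np.hanning(frame)
        noise_mag = np.mean([np.abs(np.fft.rfft(win * x[i * hop:i * hop + frame]))
                             for i in range(noise_frames)], axis=0)
        out = np.zeros(x.size, dtype=np.float64)
        n_frames = (x.size - frame) // hop + 1
        for i in range(n_frames):
            seg = win * x[i * hop:i * hop + frame]
            spec = np.fft.rfft(seg)
            mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)  # spectral floor
            out[i * hop:i * hop + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
        return out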

Noise reduction algorithms have been commonly used for many years, but most are optimized for cell phone applications rather than voice UI. They tend to highlight the frequency spectrum most critical for human comprehension rather than the frequency spectrum most critical for an electronic system to isolate and understand voice commands. Most noise reduction algorithms that are tuned for cell phones actually degrade voice UI performance. To put it simply, humans listen for different things than voice UI systems do.

One measure of how well a noise reduction algorithm works is to see how many additional dB of signal reduction it provides at the output of the echo canceler. The figure below shows the performance of DSP Concepts’ frequency domain-based noise reduction algorithm, reducing residual echoes by up to 12 dB.

Figure 10: Effects of a noise reduction algorithm on ERLE. The higher the curve, the more attenuation and thus the better the algorithm performs.

The subjective improvement in sound quality is instantly recognized, but will it improve the performance of the speech recognition algorithm? This requires additional measurements to quantify. The graph below reproduces one of the curves from Figure 2, but shows it with and without noise reduction. With noise reduction enabled, the curve shifts 2 dB to the left compared to the original content, showing that the noise reduction algorithm improves overall speech recognition by 2 dB.

Figure 11: Improvement in trigger word detection provided by noise reduction algorithm. When noise reduction is enabled, the curve shifts to the left by 2 dB, reflecting an improvement in the accuracy of trigger word detection.

Noise reduction performance also depends on the microphone array output, noise suppression algorithms used, and type of interfering noise to be cancelled. Single channel noise reduction algorithms typically can cancel only up to 10 dB of stationary noise. Cancelling out non-stationary noises such as music playing in the background or babble noise needs more advanced algorithms. DSP Concepts’ Adaptive Interference Canceler (AIC) algorithm rejects interfering sounds that are difficult to cancel out with a traditional beamformer, such as a TV playing in the living room or microwave noise in the kitchen. Unlike other adaptive cancellation techniques, DSP Concepts’ AIC algorithm does not require a reference signal to cancel out the interfering noises. Instead, it uses a combination of beamforming, adaptive spatial signal processing and machine learning to cancel out up to 30 dB of interference noise while also preserving the desired speech signal.

This concludes our discussion of the fundamentals of voice UI systems. Part II of this series will expand on beamformers, demonstrating how to optimize microphone beamformers and showing the impact of the number of microphones, microphone geometry, and spacing on beamformer performance.

Qualcomm QCS400 is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
