Fundamentals of Voice UI: Part II: Designing Optimum Beamformers

Tuesday 2/2/21 09:06am
|
Posted By Ana Schafer

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

In our series on Audio Technologies, DSP Concepts contributed a blog on Building Voice Assistants for Home Audio and Fundamentals of Voice UI: Part I: Building Blocks. Audio technologies are central to delivering crisp, clear listening experiences across the wide range of Qualcomm Technologies' audio products, and we are pleased to have DSP Concepts, a leading provider of audio development tools, as part of the Qualcomm ecosystem community.

For this next guest blog, DSP Concepts' team demonstrates how the different microphone types and array configurations affect the performance of voice UI systems.


The text that follows was provided by DSP Concepts.

Our previous blog post, Fundamentals of Voice UI: Part I, explained the algorithms required for a voice UI system. In this post, we will demonstrate how the different microphone types and array configurations affect the performance of voice UI systems.

As we discussed in the previous post, beamforming processes the signals from multiple omnidirectional microphones to focus on the sound coming from the direction of the most prominent source (i.e., the user’s voice) and to disregard sounds coming from other directions. A direction of arrival (DOA) algorithm first determines the desired direction of beam focus; the beamformer algorithm then passes the sound from the nearest microphone while manipulating the phase of the signals from the other microphones so that sounds coming from outside the beam are reduced in level.
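
To make the phase manipulation concrete, here is a minimal Python sketch of a narrowband delay-and-sum beamformer, the simplest textbook form of beamforming. It is not the DSP Concepts or Audio Weaver algorithm. The function names, the far-field plane-wave model, and the 343 m/s speed of sound are all assumptions made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)


def circular_array(num_mics, diameter_m):
    """Return (x, y) positions in meters of mics evenly spaced on a circle."""
    r = diameter_m / 2.0
    return [
        (r * math.cos(2 * math.pi * k / num_mics),
         r * math.sin(2 * math.pi * k / num_mics))
        for k in range(num_mics)
    ]


def arrival_delays(positions, angle_deg):
    """Arrival time (s) of a far-field plane wave at each mic,
    relative to the array center. 0 degrees points along +x."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    # A mic displaced toward the source hears the wavefront earlier,
    # so its delay is negative.
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in positions]


def delay_and_sum_gain(positions, look_deg, source_deg, freq_hz):
    """Magnitude response (1.0 = unity) of a beam steered to look_deg
    for a pure tone arriving from source_deg."""
    look = arrival_delays(positions, look_deg)
    src = arrival_delays(positions, source_deg)
    total = 0j
    for tau_look, tau_src in zip(look, src):
        # Propagation adds phase -2*pi*f*tau_src at each mic; the
        # beamformer compensates with +2*pi*f*tau_look before summing.
        total += cmath.exp(2j * math.pi * freq_hz * (tau_look - tau_src))
    return abs(total) / len(positions)
```

With `mics = circular_array(6, 0.071)`, the call `delay_and_sum_gain(mics, 0, 0, 1000)` returns unity gain because all phases align in the look direction, while a 1 kHz source at 180° yields a value below 1. A production beamformer would use optimized per-frequency weights rather than plain delays, so this sketch should not be expected to reproduce the figures in this post.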

The performance of a beamformer depends on how well the beam pattern of the microphone array can be optimized. However, a more intuitive way to understand beamformer performance is to look at how the signal-to-noise ratio (SNR) of the array translates into voice UI performance. So instead of looking at beam patterns, at DSP Concepts we study the SNR enhancement of an array. For example, if we assume a speech level of 60 dB sound pressure level (SPL), a background noise level of 50 dB SPL, and a microphone SNR of 65 dB, then the microphone array is able to improve the SNR as shown below:

Figure 1: SNR improvement as a function of Frequency with a 6-mic circular array

The graph shows roughly 6 dB of improvement above 1 kHz. The SNR improvement is relative to a single microphone. Evaluating based on SNR provides intuition into how much benefit the array will bring. A 6 dB improvement allows you to “stand 6 dB further away” (roughly twice as far, since free-field sound level falls by about 6 dB per doubling of distance) compared with what would be possible with a single microphone.
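
The "twice as far" figure follows from the inverse-distance law. A short sketch, assuming a point source in a free field (real rooms with reverberation will deviate from this); the function names are illustrative:

```python
import math


def distance_factor(snr_gain_db):
    """Factor by which the talker distance can grow for the same SNR,
    assuming sound pressure decays as 1/r (free field)."""
    return 10 ** (snr_gain_db / 20.0)


def level_drop_db(distance_ratio):
    """Level drop in dB when the distance is multiplied by distance_ratio."""
    return 20.0 * math.log10(distance_ratio)
```

Under this assumption, `distance_factor(6.0)` is very close to 2, and `level_drop_db(2.0)` is about 6.02 dB.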

Parameters Affecting Array Performance

This section presents the SNR performance of various microphone array designs. We studied a microphone array in which mics were arranged on a circle with a 71mm diameter. Rated SNR of these microphones was 65 dB. Testing was done in an environment with diffuse-field noise at 50 dB SPL, with a speech signal at average 60 dB SPL. Beamwidth was 45 degrees, and look angle was 0° except where otherwise specified. Processing of the signals was performed using DSP Concepts’ Audio Weaver, a modular DSP programming package.

Number of microphones

The polar plots below show the pickup patterns of circular arrays using one to seven microphones, measured at frequencies of 500, 1000, 2000 and 4000 Hz (thus covering most of the range of human speech). Ideally, the pickup pattern should show a tight beam pointed directly to the right, at the look angle of 0°, with little variation at different frequencies.

Figure 2: Microphone distribution (left) and polar pattern (right) of different mic configurations

As can be seen in the polar plots, increasing the number of microphones generally allows for a tighter, more focused pickup beam, but in certain cases adding microphones does not improve performance at all frequencies. For example, three microphones clearly produce a better result at all frequencies than two microphones. However, increasing the microphone count to four improves performance from 500 to 2000 Hz but degrades it at 4000 Hz. The two-, three- and four-microphone arrays produce significant off-axis lobing at 4000 Hz; this reduces system SNR and increases the chance of an inaccurate DOA determination, which could make the beamformer aim in the wrong direction.

The two-microphone array, in particular, does a relatively poor job of rejecting sounds from 180°. (The three- and four-mic arrays also exhibit this flaw, but to a significant degree only at 4000 Hz.) This flaw can be especially problematic if the unit is placed near a wall or other large sound-reflecting object, where the reflection might cause the voice UI system to think the user’s voice is coming from the wall instead of from the user.

The arrays of five and six microphones produce clearly better results, with tightly focused beams on the 0° axis, negligible off-axis lobing, and excellent rejection of sounds from 180°.

The graph below shows how the SNRs of the different mic arrays compare with respect to frequency. The higher the trace is on the chart, the better the SNR and the better the performance of the voice UI system should be.

Figure 3: SNR Improvement as a function of frequency for different microphone configurations

The 7-mic configuration with the center microphone gives the best performance. Removing the center microphone produces a dip in the high-frequency region. With the center mic present, the microphones are more closely spaced, which avoids spatial aliasing problems at high frequencies; without it, certain high frequencies drop out when the mic signals combine out of phase. The center mic therefore yields better SNR gain at high frequencies and hence better beamformer performance.

The six-microphone array shows a clear advantage in SNR at all frequencies from 1000 to 5500 Hz. As the number of microphones is reduced, overall SNR suffers, although reducing the number of mics can actually improve SNR within certain frequency bands.

The 4-microphone Trillium (3+1 configuration) works well across a wide range of frequencies. The center mic again boosts the SNR at high frequencies. DSP Concepts recommends the 4-mic Trillium configuration to achieve premium performance with 360-degree voice capture while keeping the design cost-effective.

The 2-mic array comes in broadside and end-fire configurations, as shown below:

Figure 4: End-fire and Broadside 2-mic orientation relative to the user

The end-fire design works fairly well at low frequencies because the signal arrives at the front and rear microphones at different times, and this time delay helps to steer the beam. At high frequencies, however, SNR drops due to spatial aliasing. The broadside array provides some SNR boost around 3 kHz but does not improve performance at low frequencies: the signal arrives at both microphones at the same time, so there is no delay to exploit and performance is poor. Broadside arrays are recommended only as a last resort, particularly when the product is flat and there is no room for alternate configurations.
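
The geometry behind the end-fire and broadside behavior can be sketched in a few lines of Python. The half-wavelength limit used here is a common rule of thumb for the onset of spatial aliasing, not a formula given by DSP Concepts. The function names, the far-field model, and the 343 m/s speed of sound are likewise assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)


def inter_mic_delay_s(spacing_m, angle_from_axis_deg):
    """Arrival-time difference between two mics for a far-field source.
    0 degrees = source in line with the mics (end-fire),
    90 degrees = source perpendicular to the mic axis (broadside)."""
    return spacing_m * math.cos(math.radians(angle_from_axis_deg)) / SPEED_OF_SOUND


def half_wavelength_limit_hz(spacing_m):
    """Frequency at which the spacing equals half a wavelength; above it,
    phase differences become ambiguous for end-fire arrival."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)
```

For a 71 mm spacing (two mics on opposite sides of the 71mm circle, which is an assumption about the test setup), the end-fire delay is about 207 microseconds, the broadside delay is effectively zero, and the half-wavelength limit is about 2.4 kHz.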

Microphone SNR

Considering that system SNR is critical to accurate voice recognition, it might be reasonable to assume that using microphones with higher SNR would improve voice UI performance. To test this assumption, total system SNR was tested with microphones rated at 65 dB and 70 dB SNR, each type arranged in arrays comprising one to six microphones. The main advantage of a high-SNR mic would occur at low frequencies because the improved SNR would permit more aggressive processing of low frequencies, which is where most environmental noise in homes and autos tends to occur.

The graphs below show how microphone SNR affected the performance of the different arrays, with system SNR shown relative to frequency. The higher the trace is on the chart, the better the SNR and the better the performance of the voice UI system should be. The tests were performed twice: once with 50 dB SPL ambient noise and once with 35 dB SPL ambient noise.

The graph below shows the results with a 50 dB SPL ambient noise field, which is what would be encountered in a typical residential living room with no active sources of direct sound (i.e., no voices, TV, loud appliances, etc.).

Figure 5: System SNR relative to frequency at 65 dB and 70 dB microphone SNR for different microphone configurations. Ambient noise at 50 dB SPL. Solid lines show the results with the 65 dB SNR microphone; dotted lines show the result with the 70 dB SNR microphone

In this case, the improvement gained by using microphones with better SNR is in most cases barely measurable and would not noticeably improve voice UI performance.

The graph below shows the same test conducted in a background noise level of 35 dB, which corresponds to a very quiet home environment.

Figure 6: System SNR relative to frequency at 65 dB and 70 dB microphone SNR for different microphone configurations. Ambient noise at 35 dB SPL. Solid lines show the results with the 65 dB SNR microphone; dotted lines show the result with the 70 dB SNR microphone

Under these conditions, using microphones with better rated SNR has a larger impact, in many cases increasing system SNR by about 1 dB. However, note that reducing ambient noise has a much larger impact on system SNR, typically improving it by about 14 dB. Thus, the benefits of a 1 dB improvement in microphone SNR on the overall voice UI system performance would be insignificant in this case.
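
A back-of-the-envelope model helps explain why microphone SNR matters so little here. The sketch below is a simplified single-microphone, broadband calculation, not the array simulation behind the figures. It assumes the usual convention that microphone SNR is rated against a 94 dB SPL reference tone, so that self-noise is 94 dB minus the rated SNR, and it assumes ambient noise and self-noise add as uncorrelated powers. Function names are illustrative.

```python
import math

REFERENCE_SPL_DB = 94.0  # conventional reference level for mic SNR ratings


def db_power_sum(*levels_db):
    """Combine uncorrelated noise levels given in dB."""
    return 10.0 * math.log10(sum(10 ** (lvl / 10.0) for lvl in levels_db))


def system_snr_db(speech_spl_db, ambient_spl_db, mic_snr_db):
    """SNR at one microphone with both ambient noise and mic self-noise."""
    self_noise_spl_db = REFERENCE_SPL_DB - mic_snr_db
    total_noise_db = db_power_sum(ambient_spl_db, self_noise_spl_db)
    return speech_spl_db - total_noise_db


# Speech at 60 dB SPL, as in the test conditions described earlier.
gain_at_50 = system_snr_db(60, 50, 70) - system_snr_db(60, 50, 65)
gain_at_35 = system_snr_db(60, 35, 70) - system_snr_db(60, 35, 65)
ambient_gain = system_snr_db(60, 35, 65) - system_snr_db(60, 50, 65)
```

In this simplified model, the 5 dB better microphone gains only a few hundredths of a dB at 50 dB SPL ambient noise and well under 1 dB at 35 dB SPL, while lowering the ambient noise from 50 to 35 dB SPL gains about 14 dB. That is consistent with the trend reported above, although the measured per-frequency results will differ.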

Microphone Gain Matching

Like other mechanical devices, microphones exhibit unit-to-unit inconsistency. The gain of two samples of the same microphone can vary substantially; a tolerance of ±3 dB (for a maximum difference of 6 dB in gain between two samples) is common. In arrays of multiple microphones, these inconsistencies might negatively affect system SNR and the overall performance of the voice UI system. Microphones with tighter gain tolerance, or with factory calibration measurements for each mic, are sometimes available, but they are typically more costly.

To evaluate the effects of microphone gain mismatch on system SNR, models of theoretical arrays of one to six perfectly matched microphones were tested. Gain mismatches of ±1, ±2 and ±3 dB were then introduced into the model, and the tests repeated.

The graphs below show how microphone gain tolerance affected the performance of the different arrays, with system SNR shown relative to frequency. The higher the trace is on the chart, the better the SNR and the better the performance of the voice UI system should be.

Figure 7: Effect of ±1 dB gain mismatch in the mic array. Solid lines show the results with perfect gain matching; dotted lines show the result with the gain mismatched at ±1 dB

Figure 8: Effect of ±2 dB gain mismatch in the mic array. Solid lines show the results with perfect gain matching; dotted lines show the result with the gain mismatched at ±2 dB

Figure 9: Effect of ±3 dB gain mismatch in the mic array. Solid lines show the results with perfect gain matching; dotted lines show the result with the gain mismatched at ±3 dB

The above charts show that gain mismatches in arrayed microphones can have a large negative impact on system SNR, often comparable to the impact that reducing the number of microphones might have. The effect is particularly noticeable in Figure 9, with the ±3 dB mismatch that is typical of the microphones used in voice UI systems.
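
One way to build intuition for this sensitivity: many beamformers reject an unwanted direction by time-aligning two microphone signals and subtracting them, and the cancellation can only be as deep as the gains are matched. The sketch below computes that limit for an idealized two-mic subtractive null. The source does not specify the beamformer structure used in these tests, so this is an illustrative assumption rather than a model of the charts above.

```python
import math


def null_depth_db(gain_mismatch_db):
    """Best-case rejection (dB) of a perfectly time-aligned interferer
    when two mic signals are subtracted with the given gain mismatch."""
    if gain_mismatch_db == 0:
        return math.inf  # perfectly matched mics cancel completely
    residual = abs(1.0 - 10 ** (gain_mismatch_db / 20.0))
    return -20.0 * math.log10(residual)
```

In this idealized case, a 1 dB mismatch limits rejection to roughly 18 dB, and a 3 dB mismatch limits it to under 8 dB.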

It is important to note here that these tests were performed on a theoretical array without an enclosure. Once the microphones are mounted in an enclosure, the gain and frequency response of the microphone will change, depending on how the microphones are mounted, where on the unit they are mounted, and the consistency of the acoustic seals around the microphones. For this reason, using microphones of better consistency, or supplied with factory calibration data, may not produce an optimal result because the acoustical effects of the enclosure and mounting will be inconsistent from microphone to microphone, and may introduce performance inconsistencies even with the most tightly matched microphones.

The best solution in this case is for the microphone gain to be measured with the microphones installed, and the gain for each microphone adjusted in software. Ideally, each unit would be individually measured and calibrated in the factory after the product is assembled, so that the software can compensate for any inherent gain mismatch in the mics as well as for mismatches caused by the acoustical effects of the enclosure.
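
A minimal sketch of such a software calibration step is shown below. It assumes that each assembled unit records the same test signal at the same acoustic level on every microphone (for example, from a calibrated source in a test fixture) and that a single broadband gain per microphone is sufficient. It does not correct frequency-response differences. The function names and the choice of the first microphone as reference are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibration_gains(recordings, reference_index=0):
    """Linear gain per mic that equalizes its RMS level to the reference mic.
    recordings: one list of samples per microphone."""
    levels = [rms(channel) for channel in recordings]
    reference = levels[reference_index]
    return [reference / level for level in levels]


def gains_to_db(gains):
    """Convert linear gains to dB for logging or storage."""
    return [20.0 * math.log10(g) for g in gains]


def apply_gains(recordings, gains):
    """Scale each channel by its calibration gain."""
    return [[g * s for s in channel] for channel, g in zip(recordings, gains)]
```

The resulting gains would typically be stored per unit at the factory and applied to each channel before the beamformer. A silent or disconnected microphone would produce a zero level, so a real implementation would need to check for that before dividing.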

Microphone spacing

Increasing the spacing of the microphones in an array might be expected to create greater differences in level among all the microphones because the difference in the source-to-microphone distances will be greater. It will also alter the relative phase differences among the microphones. To find out how spacing affects system SNR, arrays using two to six microphones were tested, with the microphones placed on circles ranging from 5 to 71mm in diameter. Although arrays with larger spacing may be impractical for many voice UI products, a three-mic array was also tested with the mics placed on circles measuring 40, 80, 160 and 320mm.
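
The size of the level difference can be estimated with the inverse-distance law. The sketch below assumes a point source in a free field, in line with a pair of microphones, and measures distance from the midpoint of the pair. The 1 m talker distance used in the note that follows is an assumption, not a figure from the tests.

```python
import math


def level_difference_db(source_distance_m, spacing_m):
    """Level difference (dB) between the nearer and farther mic of a pair
    for an in-line source, assuming 1/r pressure decay."""
    near = source_distance_m - spacing_m / 2.0
    far = source_distance_m + spacing_m / 2.0
    if near <= 0:
        raise ValueError("source must be farther away than half the spacing")
    return 20.0 * math.log10(far / near)
```

For a talker at 1 m, this gives about 0.6 dB for a 71 mm spacing and about 2.8 dB for a 320 mm spacing, so the level difference grows with spacing but remains modest at the spacings practical for most products.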

The following series of graphs show how microphone spacing affected the performance of the different arrays, with system SNR shown relative to frequency. The higher the trace is on the chart, the better the SNR and the better the performance of the voice UI system should be.

The graphs below show the result of different spacings on a two-mic array for an end-fire and a broadside position. In the end-fire position, SNR is significantly improved at the 40 and 71mm mic spacings. The 71mm spacing gives the best result within the human vocal range. However, in the broadside position, the best average SNR within the vocal range is achieved at the 20mm spacing.

Figure 10. Effect of microphone spacing on beamformer performance for 2-mic Broadside (top) and End-Fire (bottom) array

The graph below shows the results from a 3-mic triangular array. The 70mm spacing offers the best performance in the vocal range.

Figure 11: Effect of microphone spacing on beamformer performance for 3-mic Triangular array

The graphs below show the effects of microphone spacing on a four-mic array. Best results within the vocal range are obtained at 70mm, followed by 40mm. The Trillium configuration provides a flatter response at high frequencies.

Figure 12: Effect of microphone spacing on beamformer performance for 4-mic Trillium (top) and Square (bottom) configurations

The graphs below show the effects of microphone spacing on six-mic and seven-mic arrays. Best results within the vocal range are again obtained at 70mm.

Figure 13: Effect of microphone spacing on beamformer performance for 6-mic and 7-mic configurations

Based on these results, placing microphones on a circle measuring 70mm in diameter is generally the best choice with arrays of three to six microphones, provided there is sufficient physical space on the voice UI device. With a two-microphone array, results vary considerably depending on whether the source is in line with the microphones or broadside to the mic array. This result suggests that if a two-microphone array is used (either to cut costs or to accommodate specific device form factors), the device should be optimized for either end-fire or broadside arrival and designed so the user is more likely to address it from the optimal direction.

Conclusion

Parts I and II of the “Fundamentals of Voice UI” covered the key aspects of voice UI audio front end (AFE) design, including the major audio processing blocks and the impact of microphone arrays on the performance of beamforming algorithms. Successful design of voice assistants requires significant expertise in the development and integration of microphone arrays and the sophisticated audio processing algorithms that go along with the arrays on the target hardware. DSP Concepts recommends that product developers start with an industry-standard reference design that can help them rapidly deploy high-performance voice products. The Qualcomm® QCS400 series of audio System-on-Chips (SoCs) reference design gives developers a choice of 4- and 6-microphone configurations for voice UI, together with multichannel playback capability. The design provides an out-of-the-box implementation of voice assistant features to significantly expedite the product development cycle and thus reduce the cost and time to market of audio product lines.

Qualcomm QCS400 is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
