Q&A: DSPs and advanced noise suppression
The editors of DSP-FPGA.com have been looking forward to learning more about Audience since Will Strauss called their technology “breakthrough” after the 2008 Mobile World Congress. Recently, Audience’s Director of Product Marketing, Jim Bohac, answered questions for us.
DSP-FPGA.com: This technology is new enough and unusual enough that some background and an explanation of its basics would be appreciated.
JB: Audience is a voice processor company. We build an integrated circuit that emulates the human hearing system; one of its principal functions is to eliminate background noise from conversations. We provide a solution for cell phones that greatly improves voice quality by performing noise suppression, echo cancellation, and other voice enhancements on both the transmit and receive paths in the cell phone.
To understand the basic ideas behind Audience’s voice processor it will help to take a look at the two figures we have here.
Figure 1 is a visual representation of Computational Auditory Scene Analysis [CASA], which is the core processing philosophy behind the Audience voice processor; that is, hearing like you hear. A well-known illustration of CASA is the so-called cocktail party effect; at a busy party, one is able to follow a conversation even though other voices and background music are present. The Audience voice processor uses cues from the signal to categorize the sounds, and then uses signal processing to extract the voice signal of interest from the combined audio signal.
With CASA processing, we characterize the energy in the signal based on well-established perceptual cues like pitch, spatial position, and onset time, which allows us to group the energy in the signal according to the sound sources that created it, just like your brain does in real time. Then we pick the voice out of that signal, reconstruct that voice, and send it on to the rest of the system.
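As an illustration only, the grouping step described above can be caricatured in a few lines of code: components of the signal carry perceptual cues (here just pitch and onset time, with made-up tolerance values), components whose cues agree are grouped as one source, and the group matching the target voice is kept. None of this reflects Audience's actual implementation.

```python
# Toy sketch of CASA-style grouping: time-frequency components carrying
# perceptual cues (pitch, onset time) are grouped by source, and the
# group closest to the target voice's pitch is selected. Illustrative only;
# the cue names and tolerances are hypothetical.

def group_by_cues(components, pitch_tol=10.0, onset_tol=0.02):
    """Greedily group components whose pitch and onset cues agree."""
    groups = []
    for comp in components:
        for group in groups:
            ref = group[0]
            if (abs(comp["pitch"] - ref["pitch"]) < pitch_tol
                    and abs(comp["onset"] - ref["onset"]) < onset_tol):
                group.append(comp)
                break
        else:
            groups.append([comp])
    return groups

def pick_voice(groups, target_pitch):
    """Select the group whose mean pitch is closest to the target voice."""
    return min(groups, key=lambda g:
               abs(sum(c["pitch"] for c in g) / len(g) - target_pitch))

components = [
    {"pitch": 120.0, "onset": 0.00, "energy": 1.0},  # voice harmonic
    {"pitch": 122.0, "onset": 0.01, "energy": 0.8},  # voice harmonic
    {"pitch": 440.0, "onset": 0.50, "energy": 0.6},  # background music
]
groups = group_by_cues(components)   # two groups: voice and music
voice = pick_voice(groups, target_pitch=120.0)
```

The real system works on dense time-frequency energy rather than a handful of labeled components, but the principle is the same: cues first, grouping second, extraction last.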
On the top of Figure 2, you see a pictorial representation of the Fast Cochlea Transform [FCT] for a voice signal in a noisy background. The FCT transforms the digital audio stream into a three-dimensional, high-quality spectral representation of the sound mixture. Time is on the x axis, frequency on the y axis, and loudness in decibels [dB] is represented in color. The transformation provides optimum time-frequency resolution on a logarithmic frequency axis, without introducing frame artifacts, to allow the various components of the multiple sound sources to be characterized and separated from each other. The top part of Figure 2 represents a person talking on a noisy street corner with street noise, voices, and cell phones ringing in the background. The primary voice is evident in the higher intensity signals.
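To picture what a log-frequency spectral representation looks like, here is a generic sketch that bins an FFT spectrum into logarithmically spaced bands and reports each band's energy in dB. This is not Audience's FCT algorithm; the band count and frequency range are arbitrary example values.

```python
import numpy as np

# Sketch of a log-frequency (constant-Q-like) analysis, loosely analogous
# to the FCT output described in the text: band energies on a logarithmic
# frequency axis, expressed in dB. Not Audience's actual transform.

def log_band_energies(signal, fs, n_bands=32, f_lo=100.0, f_hi=8000.0):
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # log-spaced band edges
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        energies.append(10 * np.log10(band.sum() + 1e-12))  # dB
    return edges, np.array(energies)

fs = 16000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 200 * t)      # a low-frequency tone
edges, e = log_band_energies(sig, fs)
# The 200 Hz tone lands in one of the narrow low-frequency bands,
# where the log spacing gives fine resolution.
```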
The bottom part of Figure 2 shows the result of the Audience Voice Processor signal processing. We subtract all of the background noise out of the signal using the CASA process. You are seeing the results of our noise suppression algorithms. After performing the inverse Fast Cochlea Transform, clean voice is sent on to the vocoder of the cell phone system.
So instead of the vocoder trying to process both voice and noise and passing both the voice and noise into the network, you get a very clean signal. Systems that inhibit the cell phone transmitter when there is no voice work better, and more importantly, you get clean voice at the other end of the system where the listener is trying to hear you.
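The effect of "subtracting the background noise out of the signal" can be illustrated with a generic spectral-subtraction sketch. Audience's CASA processing is far more elaborate than this, and the noise estimate here is an oracle (we hand the function the true noise spectrum); the point is only to show noise-dominated energy being attenuated before the signal moves on.

```python
import numpy as np

# Generic spectral-subtraction sketch: attenuate per-bin magnitude by a
# noise estimate, keep the noisy phase, and resynthesize. Illustrative
# only; not Audience's CASA-based suppression.

def suppress_noise(noisy, noise_estimate, floor=0.05):
    """Per-bin magnitude subtraction with a spectral floor."""
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_estimate, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 300 * t)        # stand-in for a voice signal
noise = 0.3 * rng.standard_normal(fs)
noisy = voice + noise
noise_mag = np.abs(np.fft.rfft(noise))     # oracle noise estimate
cleaned = suppress_noise(noisy, noise_mag)

err_before = np.mean((noisy - voice) ** 2)
err_after = np.mean((cleaned - voice) ** 2)  # much smaller after suppression
```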
DSP-FPGA.com: What is the role of the Fast Cochlea Transform [FCT]?
JB: The FCT emulates exactly how the cochlea transforms the sound waves into the frequency domain, creating a representation of sound that your brain uses to hear.
The cochlea represents sounds using a logarithmic frequency scale, which gives good spectral resolution at low frequencies, allowing us to separate voices based on their pitch. It also has an optimal frequency-dependent trade-off between spectral resolution and latency, just like the human hearing system.
DSP-FPGA.com: Why FCT rather than Fast Fourier Transform [FFT]?
JB: The FFT uses a linear frequency scale, which gives unnecessarily high spectral resolution at high frequencies and inadequate resolution at low frequencies. Essential to our processing, and one of the key elements of the Audience Voice Processor, is a high-fidelity representation of the audio signal that replicates how you actually hear.
The higher resolution at lower frequencies allows us to pick out harmonics and other important sound attributes more accurately. So when we are doing things like looking at pitch, there are many cues we can use to pick out a voice from other sounds. Accurate pitch extraction is an important part of that process.
Also, the FCT is very efficient from a computational load and latency standpoint, giving us very low latency so that we can provide real-time processing of the audio signals.
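The linear-versus-logarithmic resolution trade-off can be made concrete with a bit of arithmetic. The frame size, sample rate, and band count below are generic example numbers, not Audience's parameters:

```python
import numpy as np

# Linear vs. logarithmic frequency resolution, with generic numbers.
# A 512-point FFT at 16 kHz has the same bin width at every frequency;
# log-spaced bands are narrow at low frequencies (good for resolving
# closely spaced pitch harmonics) and wide at high frequencies.

fs, n_fft = 16000, 512
fft_bin_width = fs / n_fft                # 31.25 Hz everywhere

edges = np.geomspace(100, 8000, 33)       # 32 log-spaced bands
low_band = edges[1] - edges[0]            # ~15 Hz wide at the bottom
high_band = edges[-1] - edges[-2]         # ~1 kHz wide at the top

# So two low-pitched voices 20 Hz apart fall into different log bands,
# yet can share a single 31.25 Hz linear FFT bin.
```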
DSP-FPGA.com: The FCT required the DSP to be purpose-built?
JB: The FCT and Inverse Fast Cochlea Transform [IFCT] were two of our processing elements that motivated the selection of a purpose-built DSP. We devised an optimized DSP with additional instructions specifically oriented toward the calculations used in audio processing. When we looked at the calculations required, particularly for the inverse FCT, which involves log-linear calculations, we found that existing DSPs would not give us the throughput required to achieve the low latency and processing power we need.
This means we have a smaller processing pipeline for specific audio functions, faster processing, and lower latency. Purpose-built instructions allow the FCT to run with low delay and let us do more calculations at lower power than a general-purpose DSP would.
DSP-FPGA.com: And the boost in processing power helps because?
JB: Having more processing power gives us a stronger ability to reduce background noise and to provide other audio enhancements, and it lets us do all of that at very low power. In the cell phone market, low latency, strong and consistent performance, and low power are critical parameters, and having our own customized DSP lets us achieve all three, which is required for a successful cell phone product.
DSP-FPGA.com: What applications is Audience looking at beyond those involving making cell phone conversations easier to hear in noisy environments?
JB: Once customers have made the decision to include strong noise suppression using the Audience Voice Processor, the handset contains a purpose-built audio processor, and there are a number of other things one can do in audio processing with that device.
You hear music the same way you hear a voice, so processing of any auditory signal benefits from using the FCT as the signal representation, because it is a better match for how human beings hear sound. That continues our philosophy of using the human hearing system as our model.
In a noise suppression application, we take the sounds in and categorize them, much as you do during a conversation with music in the background. Although all the sounds reach your ear, in the conversation we are having [for this Q&A, for example], your brain would pick out my voice as the sound of interest. We likewise group those sounds, much as your brain does, pick out the voice, perform an inverse FCT, and send that clean voice to the rest of the processing system.
You can also imagine the application of the FCT to other auditory sounds. If my conversation was uninteresting, you could focus in on music. So you could use the same core FCT to process other sounds of interest. It is a core building block to a lot of features we are going to be coming out with in the future.
DSP-FPGA.com: How does your FCT-based approach affect such features as echo cancellation?
JB: Because of the unique way we process signals, the echo looks just like any other distracter in the conversation. Traditional echo cancellers use a separate sampling path to estimate the echo path; we instead treat the echo as one more distracter in the mix of music, babble, and other sounds coming into the front end of our system. A benefit of this approach is that we can cancel echo without requiring a separate echo-cancelling subsystem.
Another feature that takes advantage of having a strong voice processor in the system is our voice equalization feature. Because we have a local estimate of the noise, in addition to using it for outbound speech, we found we could also use it to enhance the incoming speech from the far-end talker. If you are talking to me in a noisy restaurant, or a train comes by while you are on a platform, we can use the noise estimate to boost the incoming voice signal above the ambient noise, increasing the likelihood that you hear the person on the other end of the conversation accurately.
We also bring a dual benefit on the receive side, helping you hear conversations better by both suppressing noise on the received signal and dynamically boosting the voice signal when you are in a noisy situation. We have termed this feature Dual Downlink Signal Improvement.
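One simple way to picture the boosting half of this is a gain rule that keeps the incoming voice a fixed margin above the locally estimated ambient noise. The function, margin, and gain ceiling below are hypothetical illustrative choices, not Audience's actual voice-equalization algorithm.

```python
# Hypothetical sketch of receive-side voice equalization: use the local
# ambient-noise estimate to keep the incoming voice a fixed margin above
# the noise. The 6 dB margin and 18 dB cap are made-up example numbers.

def downlink_gain_db(voice_level_db, ambient_noise_db,
                     margin_db=6.0, max_gain_db=18.0):
    """Boost needed to keep incoming voice margin_db above local noise."""
    needed = (ambient_noise_db + margin_db) - voice_level_db
    return min(max(needed, 0.0), max_gain_db)  # never attenuate, cap boost

# Quiet room: voice already well above noise, so no boost is applied.
g_quiet = downlink_gain_db(voice_level_db=65.0, ambient_noise_db=40.0)
# Train passing: ambient noise jumps, so the incoming voice is boosted
# (here hitting the 18 dB cap).
g_train = downlink_gain_db(voice_level_db=65.0, ambient_noise_db=80.0)
```

A dynamic version of this would track the noise estimate frame by frame, so the boost rises and falls with the train going by.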
People try to find quiet places to make cell phone calls and avoid taking or making calls in noisy situations. We remove that restriction and deliver on the promise of a mobile phone that you can use anywhere you want, anytime you want.