Speech Enhancement
The presence of background noise significantly reduces the quality and intelligibility of speech, affecting the ability of a listener, whether hearing-impaired or normal-hearing, to understand what the speaker is saying. Speech enhancement algorithms are used to suppress such background noise and improve the perceptual speech quality and intelligibility. Applications such as mobile communications, automatic speech recognition, and hearing aids, together with the growing demand for speech-based human-computer interfaces, drive the effort to build more effective noise reduction algorithms.
Over the years, engineers have developed a variety of theoretically grounded and reasonably effective techniques to combat this issue. However, cleaning noisy speech still poses a challenge to the field of signal processing. Removing various types of noise is difficult due to the random nature of the noise and the inherent complexities of speech. Noise reduction techniques usually involve a trade-off between the amount of noise removed and the speech distortion introduced by processing the signal. The complexity and ease of implementation of noise reduction algorithms are also of concern, especially in applications involving portable devices such as mobile phones and digital hearing aids.
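As a concrete illustration of this trade-off, the following sketch implements classical magnitude spectral subtraction. It assumes a per-bin noise magnitude estimate `noise_est` (obtained, e.g., from speech-free frames) matching the STFT resolution; the function name and parameter values are illustrative, not a specific published algorithm. The over-subtraction factor `alpha` and the spectral floor `floor` are the knobs that balance noise removal against speech distortion and musical-noise artifacts.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_est, alpha=2.0, floor=0.05):
    """Magnitude spectral subtraction (illustrative sketch).

    alpha > 1 over-subtracts for stronger noise removal at the cost of
    more speech distortion; `floor` limits musical-noise artifacts.
    """
    f, t, X = stft(noisy, fs, nperseg=512)        # X: (n_bins, n_frames)
    mag, phase = np.abs(X), np.angle(X)
    # Subtract the noise magnitude estimate in every frame
    clean_mag = mag - alpha * noise_est[:, None]
    # Never let the result drop below a fraction of the noisy magnitude
    clean_mag = np.maximum(clean_mag, floor * mag)
    _, y = istft(clean_mag * np.exp(1j * phase), fs, nperseg=512)
    return y
```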
Attempting to classify noise reduction techniques, one can distinguish between single-sensor and multi-sensor algorithms. In the latter, speech quality and intelligibility can be improved by exploiting the spatial diversity of the speech and noise sources. Among these techniques, known as beamforming, one can further differentiate between fixed and adaptive beamformers. One can also divide noise reduction algorithms into those working over the entire spectrum and those using a subband approach. Because real-world noise is mostly coloured and does not affect the signal uniformly over the entire spectrum, the latter algorithms tend to perform better.
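The simplest fixed beamformer is the delay-and-sum design, which time-aligns the microphone signals towards a chosen look direction so that the target adds coherently while noise from other directions does not. The sketch below is a minimal frequency-domain implementation under far-field, free-field assumptions; the array geometry, variable names, and plane-wave signal model are ours.

```python
import numpy as np

def delay_and_sum(mics, fs, mic_positions, doa, c=343.0):
    """Fixed delay-and-sum beamformer steered towards direction `doa`.

    mics: (n_mics, n_samples) microphone signals.
    mic_positions: (n_mics, 3) coordinates in metres.
    doa: unit vector pointing from the array origin towards the source.
    """
    n_mics, n = mics.shape
    # Far-field plane wave: a mic closer to the source hears it earlier
    advances = mic_positions @ doa / c          # seconds, shape (n_mics,)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(mics, axis=1)
    # Delay each channel by its advance so the look direction adds coherently
    aligned = spectra * np.exp(-2j * np.pi * freqs[None, :] * advances[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n)
```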
To further improve performance, one can take advantage of properties of the human ear. Research into human auditory properties is ongoing; however, available models of the human auditory system have been successfully used to improve the performance of speech and audio coding algorithms. In these coding algorithms, the purpose is to retain only the perceptually relevant part of the signal, and this reduction of information allows the signal to be stored or transmitted using fewer bits. Existing noise suppression methods incorporating these same perceptual models have shown significant gains in performance. However, there is still room for improvement, and research into new methods continues.
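The key idea borrowed from perceptual coding is masking: noise lying below the masking threshold induced by the speech is inaudible and need not be suppressed. The following sketch computes a deliberately crude per-band suppression gain along these lines; the fixed-offset threshold, the equal-width bands standing in for Bark bands, and all names and values are simplifying assumptions, not a standard auditory model.

```python
import numpy as np

def perceptual_gain(speech_psd, noise_psd, n_bands=24, offset_db=10.0):
    """Crude perceptually motivated suppression gain for one frame.

    speech_psd, noise_psd: per-bin power estimates of clean speech and
    noise. Bins whose band noise falls below a rough masking threshold
    are left untouched, so processing distortion is spent only where
    the noise would actually be audible.
    """
    n_bins = speech_psd.shape[0]
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    gain = np.ones(n_bins)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Masking threshold: band speech power lowered by a fixed offset
        thr = speech_psd[lo:hi].mean() * 10 ** (-offset_db / 10)
        if noise_psd[lo:hi].mean() > thr:
            # Wiener-like attenuation only where the noise is audible
            gain[lo:hi] = speech_psd[lo:hi] / (speech_psd[lo:hi] + noise_psd[lo:hi])
    return gain
```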
The encouraging results obtained in automatic speech recognition (ASR) motivated researchers to develop speech-based human-computer interfaces (HCIs), confronting them with new problems that are currently addressed by the field of robust speech recognition. This field is characterized by two main approaches: the first aims to extend and improve existing speech enhancement techniques for use in front of the ASR decoder, while the second aims to adapt the decoder itself to cope with noisy inputs.
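A classic example of the first, front-end approach is cepstral mean and variance normalization, which removes stationary channel effects from the features before they reach the decoder. The sketch below assumes a matrix of cepstral features (e.g. MFCCs) computed elsewhere; the function name is ours.

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalization (CMVN).

    cepstra: (n_frames, n_coeffs) array, e.g. MFCCs for one utterance.
    Subtracting the utterance mean removes stationary convolutive
    distortion (channel/microphone); scaling by the standard deviation
    reduces the feature mismatch caused by additive noise.
    """
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0) + 1e-8   # guard against zero variance
    return (cepstra - mean) / std
```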
An emerging technique is audio-visual speech enhancement, which tries to exploit the well-known bimodal nature of human speech. The high degree of correlation between audio and visual speech has been investigated in depth in the literature, showing that facial measurements provide enough information to reasonably estimate the related speech acoustics. This plays a central role in the development of HCIs, especially in ASR, where traditional algorithms can be extended to deal with audio-visual features.
Reverberation, which arises when acoustic signals are emitted in non-anechoic environments, is often treated as an independent field but is tightly linked to speech enhancement. Indeed, it may considerably reduce intelligibility and ASR performance in hands-free communication systems. The perceptual effects of room acoustics are often considered to comprise two distinct properties: the coloration caused by the early reflections and the reverberation caused by the reverberant tail of the room impulse response (RIR). The aim of speech dereverberation is to recover the clean speech from the observed microphone signals and some knowledge, if available, about the RIR.
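When the RIR is known, the textbook approach is to equalize it with an inverse filter; since a room response is generally non-minimum-phase and has spectral nulls, a naive inverse is unstable. The sketch below uses simple Tikhonov-regularized frequency-domain inversion as one robust stand-in; the regularization constant, function name, and usage names are illustrative assumptions.

```python
import numpy as np

def regularized_inverse_filter(rir, n_fft=4096, eps=1e-3):
    """Tikhonov-regularized frequency-domain inverse of a known RIR.

    A direct inverse 1/H explodes at the spectral nulls of the room
    response; the regularizer `eps` trades perfect equalization for a
    stable, finite-energy filter. Assumes a time-invariant, known RIR.
    """
    H = np.fft.rfft(rir, n_fft)
    G = np.conj(H) / (np.abs(H) ** 2 + eps)    # regularized inverse
    return np.fft.irfft(G, n_fft)

# Usage sketch (names assumed): dereverberate a microphone signal `mic`
# recorded through `rir` by convolving it with the inverse filter.
# g = regularized_inverse_filter(rir)
# dry = np.convolve(mic, g)
```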
Our research currently focuses on the following main topics:
- Analysis and development of robust ASR front-ends for noisy speech recognition
- Analysis and development of robust adaptive beamformers
- Perceptually-motivated techniques for single and multi-channel speech enhancement
- Robust techniques for speech dereverberation using the inverse RIR
- Advanced algorithms based on audio-visual cues