2011
Dimitrios Dimitriadis, Petros Maragos, Alexandros Potamianos
On the effects of filterbank design and energy computation on robust speech recognition
Journal Article
IEEE Transactions on Audio, Speech and Language Processing, 19 (6), pp. 1504–1516, 2011, ISSN: 1558-7916.

@article{137,
  title = {On the effects of filterbank design and energy computation on robust speech recognition},
  author = {Dimitrios Dimitriadis and Petros Maragos and Alexandros Potamianos},
  url = {http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_Effects-Filterbank-Design-Energy-Computation-Robust-Speech-Recognition_ieeeTASLP_aug11.pdf},
  doi = {10.1109/TASL.2010.2092766},
  issn = {1558-7916},
  year = {2011},
  date = {2011-01-01},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  volume = {19},
  number = {6},
  pages = {1504--1516},
  abstract = {In this paper, we examine how energy computation and filterbank design contribute to the overall front-end robustness, especially when the investigated features are applied to noisy speech signals, in mismatched training-testing conditions. In prior work (“Auditory Teager energy cepstrum coefficients for robust speech recognition,” D. Dimitriadis, P. Maragos, and A. Potamianos, in Proc. Eurospeech'05, Sep. 2005), a novel feature set called “Teager energy cepstrum coefficients” (TECCs) has been proposed, employing a dense, smooth filterbank and alternative energy computation schemes. TECCs were shown to be more robust to noise and exhibit improved performance compared to the widely used Mel frequency cepstral coefficients (MFCCs). In this paper, we attempt to interpret these results using a combined theoretical and experimental analysis framework. Specifically, we investigate in detail the connection between the filterbank design, i.e., the filter shape and bandwidth, the energy estimation scheme and the automatic speech recognition (ASR) performance under a variety of additive and/or convolutional noise conditions. For this purpose: 1) the performance of filterbanks using triangular, Gabor, and Gammatone filters with various bandwidths and filter positions is examined under different noisy speech recognition tasks, and 2) the squared amplitude and Teager–Kaiser energy operators are compared as two alternative approaches of computing the signal energy. Our end-goal is to understand how to select the most efficient filterbank and energy computation scheme that are maximally robust under both clean and noisy recording conditions. Theoretical and experimental results show that: 1) the filter bandwidth is one of the most important factors affecting speech recognition performance in noise, while the shape of the filter is of secondary importance, and 2) the Teager–Kaiser operator outperforms (on the average and for most noise types) the squared amplitude energy computation scheme for speech recognition in noisy conditions, especially for large filter bandwidths. Experimental results show that selecting the appropriate filterbank and energy computation scheme can lead to significant error rate reduction over both MFCC and perceptual linear prediction (PLP) features for a variety of speech recognition tasks. A relative error rate reduction of up to $\sim$30\% for MFCCs and $\sim$39\% for PLPs is shown for the Aurora-3 Spanish Task.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2009
Dimitrios Dimitriadis, Alexandros Potamianos, Petros Maragos
A comparison of the squared energy and Teager-Kaiser operators for short-term energy estimation in additive noise
Journal Article
IEEE Transactions on Signal Processing, 57 (7), pp. 2569–2581, 2009, ISSN: 1053-587X.

@article{132,
  title = {A comparison of the squared energy and {Teager-Kaiser} operators for short-term energy estimation in additive noise},
  author = {Dimitrios Dimitriadis and Alexandros Potamianos and Petros Maragos},
  url = {http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisPotamianosMaragos_ComparisonSquaredAmpl-TKOper-EnergyEstimation_ieeetSP2008.pdf},
  doi = {10.1109/TSP.2009.2019299},
  issn = {1053-587X},
  year = {2009},
  date = {2009-01-01},
  journal = {IEEE Transactions on Signal Processing},
  volume = {57},
  number = {7},
  pages = {2569--2581},
  abstract = {Time-frequency distributions that evaluate the signal's energy content both in the time and frequency domains are indispensable signal processing tools, especially for nonstationary signals. Various short-time energy computation schemes are used in practice, including the mean squared amplitude and Teager-Kaiser energy approaches. Herein, we focus primarily on the short- and medium-term properties of these two energy estimation schemes, as well as on their performance in the presence of additive noise. To facilitate this analysis and generalize the approach, we use a harmonic noise model to approximate the noise component. The error analysis is conducted both in the continuous- and discrete-time domains, deriving similar conclusions. The estimation errors are measured in terms of normalized deviations from the expected signal energy and are shown to greatly depend on both the signals' spectral content and the analysis window length. When medium- and long-term analysis windows are employed, the Teager-Kaiser energy operator is proven superior to the common squared energy operator, provided that the spectral content of the noise is more lowpass than the corresponding signal content, and vice versa. However, for shorter window lengths, the Teager-Kaiser operator always outperforms the squared energy operator. The theoretical results are experimentally verified for synthetic signals. Finally, the performance of the proposed energy operators is evaluated for short-term analysis of noisy speech signals and the implications for speech processing applications are outlined.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
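For readers unfamiliar with the two energy schemes compared in the 2009 abstract, the following is a minimal sketch and not the authors' implementation. It assumes the standard discrete Teager-Kaiser form x[n]^2 - x[n-1]x[n+1]; the function names and the moving-average window are hypothetical choices made for illustration. For a pure sinusoid A cos(Ωn + φ) the operator returns the constant A² sin²(Ω), so its output is weighted by frequency, which is one way to read the abstract's condition about noise being more lowpass than the signal.

```python
import numpy as np


def squared_energy(x):
    """Instantaneous squared-amplitude energy, x[n]^2."""
    x = np.asarray(x, dtype=float)
    return x ** 2


def teager_kaiser(x):
    """Discrete Teager-Kaiser operator, x[n]^2 - x[n-1]*x[n+1].

    Defined only for interior samples, so the output is two samples
    shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def short_term_mean(energy, win):
    """Moving average of an energy sequence over `win` samples
    (hypothetical helper standing in for the analysis window)."""
    kernel = np.ones(win) / win
    return np.convolve(energy, kernel, mode="valid")


if __name__ == "__main__":
    amplitude, omega = 2.0, 2 * np.pi / 20   # illustrative values only
    n = np.arange(200)
    x = amplitude * np.cos(omega * n + 0.3)

    # Short-term averages of both energy measures over a 40-sample window.
    print(short_term_mean(squared_energy(x), 40)[:3])
    print(short_term_mean(teager_kaiser(x), 40)[:3])
```

On a clean sinusoid the squared-amplitude average approaches A²/2 over whole periods, while the Teager-Kaiser output is constant from sample to sample, which makes the window-length dependence discussed in the abstract easy to explore by adding noise and varying `win`.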