(+30) 210772-4709
- potam@central.ntua.gr
- Office 2.1.4
Biosketch
Alexandros Potamianos received the Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece, in 1990, the M.S. and Ph.D. degrees in Engineering Sciences from Harvard University, Cambridge, MA, USA, in 1991 and 1995, respectively, and the M.B.A. degree from the Stern School of Business, NYU, in 2002.
From 1991 to June 1993 he was a research assistant at the Robotics Lab, Harvard University. From 1993 to 1995 he was a research assistant at the Digital Signal Processing Lab, Georgia Tech. From 1995 to 1999 he was a Senior Technical Staff Member at the Speech and Image Processing Lab, AT&T Shannon Labs, Florham Park, NJ. From 1999 to 2002 he was a Technical Staff Member and Technical Supervisor at the Multimedia Communications Lab, Bell Labs, Lucent Technologies, Murray Hill, NJ. From 1999 to 2001 he was an adjunct Assistant Professor at the Department of Electrical Engineering, Columbia University, New York, NY. From 2003 to 2013 he was an adjunct Associate Professor at the Department of Electronic and Computer Engineering, Technical University of Crete, Chania, Greece. In the summer of 2013 he joined the School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, as an Associate Professor.
His current research interests include speech processing, analysis, synthesis and recognition, dialog and multi-modal systems, lexical semantics, nonlinear signal processing, natural language understanding, artificial intelligence and multimodal child-computer interaction.
Prof. Potamianos has authored or co-authored over 110 papers in professional journals and conferences (2,700 citations, h-index 25 on Google Scholar as of September 2013). He is a co-author of the paper "Creating conversational interfaces for children," which received a 2005 IEEE Signal Processing Society Best Paper Award, and a co-editor of the book "Multimodal Processing and Interaction: Audio, Video, Text" (Springer, 2008). He holds four patents. He has been a member of the IEEE Signal Processing Society since 1992 and a senior member since 2010. He is currently serving his third term on the IEEE Speech and Language Technical Committee and his first term on the IEEE Multimedia Signal Processing Technical Committee.
Publications
2017
A Zlatintsi, P Koutras, G Evangelopoulos, N Malandrakis, N Efthymiou, K Pastra, A Potamianos, P Maragos, "COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization," Journal Article, EURASIP Journal on Image and Video Processing, vol. 54, pp. 1–24, 2017. doi: 10.1186/s13640-017-0194. [PDF] http://robotics.ntua.gr/wp-content/publications/Zlatintsi+_COGNIMUSEdb_EURASIP_JIVP-2017.pdf
Abstract: Research related to computational modeling for machine-based understanding requires ground truth data for training, content analysis, and evaluation. In this paper, we present a multimodal video database, namely COGNIMUSE, annotated with sensory and semantic saliency, events, cross-media semantics, and emotion. The purpose of this database is manifold; it can be used for training and evaluation of event detection and summarization algorithms, for classification and recognition of audio-visual and cross-media events, as well as for emotion tracking. In order to enable comparisons with other computational models, we propose state-of-the-art algorithms, specifically a unified energy-based audio-visual framework and a method for text saliency computation, for the detection of perceptually salient events from videos. Additionally, a movie summarization system for the automatic production of summaries is presented. Two kinds of evaluation were performed: an objective one, based on the saliency annotation of the database, and an extensive qualitative human evaluation of the automatically produced summaries, in which we investigated what composes high-quality movie summaries; both verified the appropriateness of the proposed methods. The annotation of the database and the code for the summarization system can be found at http://cognimuse.cs.ntua.gr/database.
G Karamanolakis, E Iosif, A Zlatintsi, A Pikrakis, A Potamianos, "Audio-based Distributional Semantic Models for Music Auto-tagging and Similarity Measurement," Conference, Proc. MultiLearn2017: Multimodal Processing, Modeling and Learning for Human-Computer/Robot Interaction Workshop (in conjunction with the European Signal Processing Conference), Kos, Greece, 2017. [PDF] http://robotics.ntua.gr/wp-content/publications/Karamanolakis+_MultiLearn-17_ML7.pdf
Abstract: The recent development of Audio-based Distributional Semantic Models (ADSMs) enables the computation of audio and lexical vector representations in a joint acoustic-semantic space. In this work, these joint representations are applied to the problem of automatic tag generation. The predicted tags together with their corresponding acoustic representation are exploited for the construction of acoustic-semantic clip embeddings. The proposed algorithms are evaluated on the task of similarity measurement between music clips. Acoustic-semantic models are shown to outperform the state-of-the-art for this task and produce high-quality tags for audio/music clips.
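As a rough illustration of the clip-similarity pipeline described above (acoustic features, predicted tags, a joint clip embedding, then cosine similarity), here is a minimal sketch; the convex-combination weighting and all names below are illustrative assumptions, not the paper's exact model:

    import numpy as np

    def clip_embedding(acoustic_vec, tag_scores, tag_embeddings, alpha=0.5):
        # Hypothetical acoustic-semantic clip embedding: a convex combination
        # of the normalized acoustic vector and the score-weighted average of
        # the embeddings of the clip's predicted tags. Assumes the acoustic
        # vector and tag embeddings share the same dimension d.
        sem = tag_scores @ tag_embeddings                  # (d,) weighted tag average
        sem /= np.linalg.norm(sem) + 1e-12
        ac = acoustic_vec / (np.linalg.norm(acoustic_vec) + 1e-12)
        return alpha * ac + (1.0 - alpha) * sem

    def clip_similarity(e1, e2):
        # Cosine similarity between two clip embeddings.
        return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12))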
2016
G Karamanolakis, E Iosif, A Zlatintsi, A Pikrakis, A Potamianos, "Audio-Based Distributional Representations of Meaning Using a Fusion of Feature Encodings," Conference, Proc. Interspeech, 2016. [PDF] http://robotics.ntua.gr/wp-content/publications/Karamanolakis+_Interspeech16.PDF [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/karamanolakis16_interspeech.pdf
Abstract: Recently a "Bag-of-Audio-Words" approach was proposed [1] for the combination of lexical features with audio clips in a multimodal semantic representation, i.e., an Audio Distributional Semantic Model (ADSM). An important step towards the creation of ADSMs is the estimation of the semantic distance between clips in the acoustic space, which is especially challenging given the diversity of audio collections. In this work, we investigate the use of different feature encodings in order to address this challenge, following a two-step approach. First, an audio clip is categorized with respect to three classes, namely, music, speech and other. Next, the feature encodings are fused according to the posterior probabilities estimated in the previous step. Using a collection of audio clips annotated with tags, we derive a mapping between words and audio clips. Based on this mapping and the proposed audio semantic distance, we construct an ADSM model in order to compute the distance between words (lexical semantic similarity task). The proposed model is shown to significantly outperform (23.6% relative improvement in correlation coefficient) the state-of-the-art results reported in the literature.
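The two-step scheme in the abstract (classify each clip as music, speech, or other, then fuse the per-class feature encodings according to the classifier's posteriors) amounts to a posterior-weighted sum; a minimal sketch, with the classifier and encoders left abstract as assumptions:

    def fused_encoding(clip, class_posteriors, encoders):
        # class_posteriors: dict class name -> posterior probability,
        #   e.g. {"music": 0.7, "speech": 0.2, "other": 0.1} (hypothetical values)
        # encoders: dict class name -> function mapping a clip to a feature vector
        # The fused encoding is the posterior-weighted sum of per-class encodings.
        return sum(p * encoders[c](clip) for c, p in class_posteriors.items())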
2015
P Koutras, A Zlatintsi, E Iosif, A Katsamanis, P Maragos, A Potamianos, "Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization," Conference, Proc. IEEE Int'l Conf. on Image Processing (ICIP), Quebec, Canada, pp. 4361–4365, 2015. doi: 10.1109/ICIP.2015.7351630. [PDF] http://robotics.ntua.gr/wp-content/publications/KZIKMP_MovieSum2_ICIP-2015.pdf
Abstract: In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words on the movie subtitles. For the evaluation of the proposed system, we employ an elementary and non-parametric classification technique such as KNN. Detection results are reported on the MovSum database, using objective evaluations against ground-truth denoting the perceptually salient events, and human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, also improving the comprehension of the semantics.
A Zlatintsi, E Iosif, P Maragos, A Potamianos, "Audio Salient Event Detection and Summarization using Audio and Text Modalities," Conference, Proc. European Signal Processing Conference (EUSIPCO), Nice, France, 2015. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiEtAl_AudioTextSum-EUSIPCO-2015.pdf
Abstract: This paper investigates the problem of audio event detection and summarization, building on previous work [1, 2] on the detection of perceptually important audio events based on saliency models. We take a synergistic approach to audio summarization where saliency computation of audio streams is assisted by using the text modality as well. Auditory saliency is assessed by auditory and perceptual cues such as Teager energy, loudness and roughness, all known to correlate with attention and human hearing. Text analysis incorporates part-of-speech tagging and affective modeling. A computational method for the automatic correction of the boundaries of the selected audio events is applied, creating summaries that consist not only of salient but also meaningful and semantically coherent events. A non-parametric classification technique is employed, and results are reported on the MovSum movie database using objective evaluations against ground-truth designating the auditory and semantically salient events.
A Zlatintsi, P Koutras, N Efthymiou, P Maragos, A Potamianos, K Pastra, "Quality Evaluation of Computational Models for Movie Summarization," Conference, Proc. Int'l Conf. on Quality of Multimedia Experience (QoMEX), Costa Navarino, Messinia, Greece, 2015. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiEtAl_MovieSumEval-QoMEX2015.pdf
Abstract: In this paper we present a movie summarization system and we investigate what composes high-quality movie summaries in terms of user experience evaluation. We propose state-of-the-art audio, visual and text techniques for the detection of perceptually salient events from movies. The evaluation of such computational models is usually based on the comparison of the similarity between the system-detected events and some ground-truth data. For this reason, we have developed the MovSum movie database, which includes sensory and semantic saliency annotation as well as cross-media relations, for objective evaluations. The automatically produced movie summaries were qualitatively evaluated, in an extensive human evaluation, in terms of informativeness and enjoyability, accomplishing very high ratings of up to 80% and 90%, respectively, which verifies the appropriateness of the proposed methods.
2012
A Zlatintsi, P Maragos, A Potamianos, G Evangelopoulos, "A Saliency-Based Approach to Audio Event Detection and Summarization," Conference, Proc. European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 2012. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiMaragos+_SaliencyBasedAudioSummarization_EUSIPCO2012.pdf
Abstract: In this paper, we approach the problem of audio summarization by saliency computation of audio streams, exploring the potential of a modulation model for the detection of perceptually important audio events based on saliency models, along with various fusion schemes for their combination. The fusion schemes include linear, adaptive and nonlinear methods. A machine learning approach, where training of the features is performed, was also applied for the purpose of comparison with the proposed technique. For the evaluation of the algorithm we use audio data taken from movies, and we show that nonlinear fusion schemes perform best. The results are reported on the MovSum database, using objective evaluations (against ground-truth denoting the perceptually important audio events). Analysis of the selected audio segments is also performed against a labeled database with respect to audio categories, while a method for fine-tuning of the selected audio events is proposed.
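The linear and nonlinear fusion schemes compared in this line of work reduce to a few lines over per-cue saliency curves; a minimal sketch (uniform weights and min/max combiners are illustrative placeholders, not the paper's tuned configuration):

    import numpy as np

    def fuse_saliency(curves, scheme="linear", weights=None):
        # curves: array of shape (n_cues, n_frames); each row is one
        # normalized saliency curve (e.g. one per feature or modality).
        s = np.asarray(curves, dtype=float)
        if scheme == "linear":            # weighted sum of cues
            w = np.full(len(s), 1.0 / len(s)) if weights is None else np.asarray(weights)
            return w @ s
        if scheme == "max":               # nonlinear: strongest cue dominates
            return s.max(axis=0)
        if scheme == "min":               # nonlinear: all cues must agree
            return s.min(axis=0)
        raise ValueError(f"unknown scheme: {scheme}")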
2011
Dimitrios Dimitriadis, Petros Maragos, Alexandros Potamianos, "On the effects of filterbank design and energy computation on robust speech recognition," Journal Article, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 6, pp. 1504–1516, 2011. doi: 10.1109/TASL.2010.2092766. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_Effects-Filterbank-Design-Energy-Computation-Robust-Speech-Recognition_ieeeTASLP_aug11.pdf
Abstract: In this paper, we examine how energy computation and filterbank design contribute to the overall front-end robustness, especially when the investigated features are applied to noisy speech signals, in mismatched training-testing conditions. In prior work ("Auditory Teager energy cepstrum coefficients for robust speech recognition," D. Dimitriadis, P. Maragos, and A. Potamianos, in Proc. Eurospeech'05, Sep. 2005), a novel feature set called "Teager energy cepstrum coefficients" (TECCs) was proposed, employing a dense, smooth filterbank and alternative energy computation schemes. TECCs were shown to be more robust to noise and to exhibit improved performance compared to the widely used Mel frequency cepstral coefficients (MFCCs). In this paper, we attempt to interpret these results using a combined theoretical and experimental analysis framework. Specifically, we investigate in detail the connection between the filterbank design, i.e., the filter shape and bandwidth, the energy estimation scheme, and the automatic speech recognition (ASR) performance under a variety of additive and/or convolutional noise conditions. For this purpose: 1) the performance of filterbanks using triangular, Gabor, and Gammatone filters with various bandwidths and filter positions is examined under different noisy speech recognition tasks, and 2) the squared amplitude and Teager-Kaiser energy operators are compared as two alternative approaches to computing the signal energy. Our end goal is to understand how to select the most efficient filterbank and energy computation scheme that are maximally robust under both clean and noisy recording conditions. Theoretical and experimental results show that: 1) the filter bandwidth is one of the most important factors affecting speech recognition performance in noise, while the shape of the filter is of secondary importance, and 2) the Teager-Kaiser operator outperforms (on average and for most noise types) the squared amplitude energy computation scheme for speech recognition in noisy conditions, especially for large filter bandwidths. Experimental results show that selecting the appropriate filterbank and energy computation scheme can lead to significant error rate reduction over both MFCC and perceptual linear prediction (PLP) features for a variety of speech recognition tasks. A relative error rate reduction of up to ~30% for MFCCs and ~39% for PLPs is shown for the Aurora-3 Spanish Task.
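The two short-time energy schemes compared in the paper are each a one-liner over an analysis frame; a minimal sketch (the clamp against rare negative Teager values is an illustrative choice, not prescribed by the paper):

    import numpy as np

    def squared_amplitude_energy(frame):
        # Mean squared amplitude over the analysis window.
        return np.mean(frame ** 2)

    def teager_kaiser_energy(frame):
        # Short-time average of Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
        psi = frame[1:-1] ** 2 - frame[:-2] * frame[2:]
        return np.mean(np.maximum(psi, 0.0))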
2009
Dimitrios Dimitriadis, Alexandros Potamianos, Petros Maragos, "A comparison of the squared energy and Teager-Kaiser operators for short-term energy estimation in additive noise," Journal Article, IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2569–2581, 2009. doi: 10.1109/TSP.2009.2019299. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisPotamianosMaragos_ComparisonSquaredAmpl-TKOper-EnergyEstimation_ieeetSP2008.pdf
Abstract: Time-frequency distributions that evaluate the signal's energy content both in the time and frequency domains are indispensable signal processing tools, especially for nonstationary signals. Various short-time energy computation schemes are used in practice, including the mean squared amplitude and Teager-Kaiser energy approaches. Herein, we focus primarily on the short- and medium-term properties of these two energy estimation schemes, as well as on their performance in the presence of additive noise. To facilitate this analysis and generalize the approach, we use a harmonic noise model to approximate the noise component. The error analysis is conducted both in the continuous- and discrete-time domains, deriving similar conclusions. The estimation errors are measured in terms of normalized deviations from the expected signal energy and are shown to greatly depend on both the signal's spectral content and the analysis window length. When medium- and long-term analysis windows are employed, the Teager-Kaiser energy operator is proven superior to the common squared energy operator, provided that the spectral content of the noise is more lowpass than the corresponding signal content, and vice versa. However, for shorter window lengths, the Teager-Kaiser operator always outperforms the squared energy operator. The theoretical results are experimentally verified for synthetic signals. Finally, the performance of the proposed energy operators is evaluated for short-term analysis of noisy speech signals, and the implications for speech processing applications are outlined.
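For reference, the standard continuous- and discrete-time forms of the two operators being compared (textbook definitions, not notation specific to this paper):

    \Psi_c[x(t)] = \dot{x}^2(t) - x(t)\,\ddot{x}(t),
    \qquad
    \Psi_d[x(n)] = x^2(n) - x(n-1)\,x(n+1)

For a pure sinusoid x(t) = A cos(ωt + φ), Ψ_c[x] = A²ω², while the window-averaged squared amplitude tends to A²/2; the Teager-Kaiser estimate thus weights energy by frequency, which underlies the noise-dependent trade-offs analyzed in the paper.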
G Evangelopoulos, A Zlatintsi, G Skoumas, K Rapantzikos, A Potamianos, P Maragos, Y Avrithis, "Video Event Detection and Summarization Using Audio, Visual and Text Saliency," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, pp. 3553–3556, 2009. [PDF] http://robotics.ntua.gr/wp-content/publications/EvangelopoulosZlatintsiEtAl_VideoEventDetectionSummarizationUsingAVTSaliency_ICASSP09.pdf
Abstract: Detection of perceptually important video events is formulated here on the basis of saliency models for the audio, visual and textual information conveyed in a video stream. Audio saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color and motion. Text saliency is extracted from part-of-speech tagging on the subtitles information available with most movie distributions. The various modality curves are integrated in a single attention curve, where the presence of an event may be signified in one or multiple domains. This multimodal saliency curve is the basis of a bottom-up video summarization algorithm that refines results from unimodal or audiovisual-based skimming. The algorithm performs favorably for video summarization in terms of informativeness and enjoyability.
2008
G Evangelopoulos, K Rapantzikos, A Potamianos, P Maragos, A Zlatintsi, Y Avrithis, "Movie summarization based on audiovisual saliency detection," Conference, Proc. IEEE Int'l Conf. on Image Processing (ICIP), pp. 2528–2531, 2008. doi: 10.1109/ICIP.2008.4712308. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/ERPMZA_MovieSummarizAVSaliency_ICIP2008.pdf
Abstract: Based on perceptual and computational attention modeling studies, we formulate measures of saliency for an audiovisual stream. Audio saliency is captured by signal modulations and related multi-frequency band features, extracted through nonlinear operators and energy tracking. Visual saliency is measured by means of a spatiotemporal attention model driven by various feature cues (intensity, color, motion). Audio and video curves are integrated in a single attention curve, where events may be enhanced, suppressed or vanish. The presence of salient events is signified on this audiovisual curve by geometrical features such as local extrema, sharp transition points and level sets. An audiovisual saliency-based movie summarization algorithm is proposed and evaluated. The algorithm is shown to perform very well in terms of summary informativeness and enjoyability for movie clips of various genres.
Georgios Evangelopoulos, Konstantinos Rapantzikos, Petros Maragos, Yannis Avrithis, Alexandros Potamianos, "Audiovisual Attention Modeling and Salient Event Detection," Book Chapter, in Maragos, P.; Potamianos, A.; Gros, P. (Eds.): Multimodal Processing and Interaction: Audio, Video, Text, pp. 1–21, Springer US, Boston, MA, 2008, ISBN 978-0-387-76316-3. doi: 10.1007/978-0-387-76316-3_8. [Webpage] https://doi.org/10.1007/978-0-387-76316-3_8 [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/Evangelopoulos-et-al_Chapter-of-Book_MPIAVT_Maragos-et-aled_Springer2008_peprint.pdf
2005
Dimitrios Dimitriadis, Petros Maragos, Alexandros Potamianos, "Auditory Teager Energy Cepstrum Coefficients for Robust Speech Recognition," Conference, Proc. European Speech Processing Conference (Eurospeech/Interspeech), pp. 3013–3016, 2005. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_AuditTeagEnergCepstrumRobustSpeechRecogn_Interspeech2005.pdf
Abstract: In this paper, a feature extraction algorithm for robust speech recognition is introduced. The feature extraction algorithm is motivated by human auditory processing and the nonlinear Teager-Kaiser energy operator that estimates the true energy of the source of a resonance. The proposed features are labeled Teager Energy Cepstrum Coefficients (TECCs). TECCs are computed by first filtering the speech signal through a dense, non-constant-Q Gammatone filterbank and then estimating the "true" energy of the signal's source, i.e., the short-time average of the output of the Teager-Kaiser energy operator. Error analysis and speech recognition experiments show that the TECCs and the mel frequency cepstrum coefficients (MFCCs) perform similarly for clean recording conditions, while the TECCs perform significantly better than the MFCCs for noisy recognition tasks. Specifically, a relative word error rate improvement of 60% over the MFCC baseline is shown for the Aurora-3 database for the high-mismatch condition. Absolute error rate improvement ranging from 5% to 20% is shown for a phone recognition task in (various types of additive) noise.
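A minimal sketch of a TECC-like front end following the abstract's recipe (Gammatone filterbank, Teager-Kaiser energy, short-time averaging, log and DCT); the center frequencies, frame sizes, and the use of SciPy's scipy.signal.gammatone designer (SciPy >= 1.6) are illustrative assumptions, not the paper's exact configuration:

    import numpy as np
    from scipy.signal import gammatone, lfilter
    from scipy.fft import dct

    def teager(x):
        # Discrete Teager-Kaiser energy: Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def tecc_like(signal, fs, centers_hz, n_ceps=13, frame=400, hop=160):
        bands = []
        for fc in centers_hz:
            b, a = gammatone(fc, 'iir', fs=fs)      # 4th-order IIR Gammatone
            y = lfilter(b, a, signal)
            e = np.maximum(teager(y), 1e-10)        # instantaneous TK energy
            # short-time average of the TK energy, per frame
            bands.append([e[i:i + frame].mean()
                          for i in range(0, len(e) - frame, hop)])
        log_e = np.log(np.array(bands))             # (n_bands, n_frames)
        return dct(log_e, axis=0, norm='ortho')[:n_ceps].T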
2002
D Dimitriadis, P Maragos, A Potamianos, "Modulation features for speech recognition," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-377–I-380, 2002. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/dimitriadis2002.pdf
2001
Alexandros Potamianos, Petros Maragos, "Time-frequency distributions for automatic speech recognition," Journal Article, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 196–200, 2001. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_TFD-ASR_ieeetSAP2001.pdf
1999
Alexandros Potamianos, Petros Maragos, "Speech analysis and synthesis using an AM–FM modulation model," Journal Article, Speech Communication, vol. 28, no. 3, pp. 195–209, 1999. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_SpeechAnalSynthUsingAMFM-ModulModel_SpeCom1999.pdf
Petros Maragos, Alexandros Potamianos, "Fractal dimensions of speech sounds: Computation and application to automatic speech recognition," Journal Article, The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1925–1932, 1999. doi: 10.1121/1.426738. [Webpage] http://asa.scitation.org/doi/10.1121/1.426738 [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/MaragosPotamianos_SpeecFrDimRecogn_JASA1999.pdf
Abstract: The dynamics of airflow during speech production may often result in some small or large degree of turbulence. In this paper, the geometry of speech turbulence, as reflected in the fragmentation of the time signal, is quantified using fractal models. An efficient algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering is described, and its potential for speech segmentation and phonetic classification is discussed. Also reported are experimental results on using the short-time fractal dimension of speech signals at multiple scales as additional features in an automatic speech recognition system using hidden Markov models, which provide a modest improvement in speech recognition performance.
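The multiscale morphological estimate the abstract refers to can be sketched with the classical morphological covering method: dilate and erode the signal by flat structuring elements of growing scale s, take the area A(s) between the two envelopes as a cover of the signal graph, and read the dimension D off the slope of log(A(s)/s²) versus log(1/s). A minimal sketch (the scale range and the least-squares fit are illustrative assumptions):

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion

    def morphological_fractal_dimension(x, max_scale=20):
        # For a fractal of dimension D the cover area behaves like
        # A(s) ~ s^(2 - D), so log(A(s)/s^2) is linear in log(1/s)
        # with slope D.
        log_inv_s, log_area = [], []
        for s in range(1, max_scale + 1):
            size = 2 * s + 1                       # flat structuring element
            cover = grey_dilation(x, size=size) - grey_erosion(x, size=size)
            log_area.append(np.log(cover.sum() / s ** 2))
            log_inv_s.append(np.log(1.0 / s))
        slope, _ = np.polyfit(log_inv_s, log_area, 1)   # slope estimates D
        return slope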
1997
Petros Maragos, Alexandros Potamianos, "On Using Fractal Features of Speech Sounds in Automatic Speech Recognition," Conference, Proc. Eurospeech, 1997. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/maragos97_eurospeech.pdf
1995
Petros Maragos, Alexandros Potamianos, "Higher Order Differential Energy Operators," Journal Article, IEEE Signal Processing Letters, vol. 2, no. 8, pp. 152–154, 1995. doi: 10.1109/97.404130. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/MaragosPotamianos_HOEnergOper_ieeeSPL1995.pdf
Abstract: Instantaneous signal operators of integer orders k, Υ_k(x) = ẋ·x^(k−1) − x·x^(k) (where x^(k) denotes the k-th derivative), are proposed to measure the cross energy between a signal x and its derivatives. These higher order differential energy operators contain as a special case, for k = 2, the Teager-Kaiser (1990) operator. When applied to (possibly modulated) sinusoids, they yield several new energy measurements useful for parameter estimation or AM-FM demodulation. Applying them to sampled signals involves replacing derivatives with differences, which leads to several useful discrete energy operators defined on an extremely short window of samples.
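In LaTeX form, the operator family just described, with x^{(k)} denoting the k-th derivative:

    \Upsilon_k(x) = \dot{x}\, x^{(k-1)} - x\, x^{(k)}, \qquad
    \Upsilon_2(x) = \dot{x}^2 - x\,\ddot{x} \;\; \text{(the Teager-Kaiser operator)}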
P Maragos, A Potamianos, B Santhanam, "Instantaneous Energy Operators: Applications to Speech Processing and Communications," Conference, Proc. IEEE Workshop on Nonlinear Signal and Image Processing, Halkidiki, Greece, pp. 955–958, June 1995. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/nsip95.pdf
A Potamianos, P Maragos, "Speech formant frequency and bandwidth tracking using multiband energy demodulation," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 784–787, 1995. doi: 10.1109/ICASSP.1995.479811. [Webpage] http://ieeexplore.ieee.org/document/479811/ [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/speech-formant-frequency-and-bandwidth-tracking-using-multiband-.pdf
Abstract: In this paper, the amplitude and frequency (AM-FM) modulation model and a multiband demodulation analysis scheme are applied to formant frequency and bandwidth tracking of speech signals. Filtering by a bank of Gabor bandpass filters is performed to isolate each speech resonance in the signal. Next, the amplitude envelope (AM) and instantaneous frequency (FM) are estimated for each band using the energy separation algorithm (ESA). Short-time formant frequency and bandwidth estimates are obtained from the instantaneous amplitude and frequency signals; two frequency estimates are proposed and their relative merits are discussed. The short-time estimates are used to compute the formant locations and bandwidths. Performance and computational issues of the algorithm are discussed. Overall, multiband demodulation analysis (MDA) is shown to be a useful tool for extracting information from the speech resonances in the time-frequency plane.
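A minimal sketch of the demodulation step via the energy separation algorithm, here in its common discrete DESA-2 variant (a standard formulation from the energy-operator literature; the clipping guards are illustrative, and the input is assumed to be an already bandpass-filtered resonance):

    import numpy as np

    def tkeo(x):
        # Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def desa2(x, fs):
        # Demodulate a bandpass AM-FM signal into instantaneous
        # amplitude and frequency (DESA-2):
        #   Omega(n) = 0.5 * arccos(1 - Psi[y(n)] / (2 * Psi[x(n)])),
        #   |a(n)|   = 2 * Psi[x(n)] / sqrt(Psi[y(n)]),
        # with y(n) = x(n+1) - x(n-1).
        psi_x = np.maximum(tkeo(x)[1:-1], 1e-12)     # aligned with psi_y below
        y = x[2:] - x[:-2]
        psi_y = np.maximum(tkeo(y), 1e-12)
        arg = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
        omega = 0.5 * np.arccos(arg)                 # rad/sample
        amp = 2.0 * psi_x / np.sqrt(psi_y)
        return amp, omega * fs / (2.0 * np.pi)       # envelope, frequency in Hz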
1994
H M Hanson, P Maragos, A Potamianos, "A system for finding speech formants and modulations via energy separation," Journal Article, IEEE Transactions on Speech and Audio Processing, vol. 2, no. 3, pp. 436–443, 1994. doi: 10.1109/89.294358. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/HansonMaragosPotamianos_IterESA_ieeetSAP1994.pdf
Abstract: This correspondence presents an experimental system that uses an energy-tracking operator and a related energy separation algorithm to automatically find speech formants and amplitude/frequency modulations in voiced speech segments. Initial estimates of formant center frequencies are provided by either LPC or morphological spectral peak picking. These estimates are then shown to be improved by a combination of bandpass filtering and iterative application of energy separation.
A Potamianos, P Maragos, "A Comparison of the Energy Operator and Hilbert Transform Approaches for Signal and Speech Demodulation," Journal Article, Signal Processing, vol. 37, no. 1, pp. 95–120, 1994. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_ComparEnergOpHilbertTransfSigSpeechDemod_SigPro1994.pdf
A Potamianos, P Maragos, "Applications of Speech Processing Using an AM–FM Modulation Model and Energy Operators," Conference, Proc. European Signal Processing Conference (EUSIPCO), vol. III, pp. 1669–1672, 1994. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/Potamianos_ApplicSpeechProc_1994.pdf
1993
H M Hanson, P Maragos, A Potamianos, "Finding Speech Formants and Modulations via Energy Separation: With an Application to a Vocoder," Conference, Proc. Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP-93), Minneapolis, MN, 1993. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/hanson1993.pdf