(+30) 210772-4709
- potam@central.ntua.gr
- Office 2.1.4
Biosketch
Alexandros Potamianos received the Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece, in 1990, the M.S. and Ph.D. degrees in Engineering Sciences from Harvard University, Cambridge, MA, USA, in 1991 and 1995, respectively, and the M.B.A. degree from the Stern School of Business, NYU, in 2002.
From 1991 to June 1993 he was a research assistant at the Robotics Lab, Harvard University. From 1993 to 1995 he was a research assistant at the Digital Signal Processing Lab, Georgia Tech. From 1995 to 1999 he was a Senior Technical Staff Member at the Speech and Image Processing Lab, AT&T Shannon Labs, Florham Park, NJ. From 1999 to 2002 he was a Technical Staff Member and Technical Supervisor at the Multimedia Communications Lab, Bell Labs, Lucent Technologies, Murray Hill, NJ. From 1999 to 2001 he was an adjunct Assistant Professor at the Department of Electrical Engineering, Columbia University, New York, NY. From 2003 to 2013 he was an adjunct Associate Professor at the Department of Electronic and Computer Engineering, Technical University of Crete, Chania, Greece. In the summer of 2013 he joined the School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, as an Associate Professor.
His current research interests include speech processing, analysis, synthesis and recognition, dialog and multi-modal systems, lexical semantics, nonlinear signal processing, natural language understanding, artificial intelligence and multimodal child-computer interaction.
Prof. Potamianos has authored or co-authored over 110 papers in professional journals and conferences (2,700 citations, h-index 25 on Google Scholar as of September 2013). He is a co-author of the paper "Creating conversational interfaces for children," which received a 2005 IEEE Signal Processing Society Best Paper Award, and a co-editor of the book "Multimodal Processing and Interaction: Audio, Video, Text" (Springer, 2008). He holds four patents. He has been a member of the IEEE Signal Processing Society since 1992 and a senior member since 2010. He is currently serving his third term on the IEEE Speech and Language Technical Committee and his first term on the IEEE Multimedia Signal Processing Technical Committee.
Publications
2017
A Zlatintsi, P Koutras, G Evangelopoulos, N Malandrakis, N Efthymiou, K Pastra, A Potamianos, P Maragos, "COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization," Journal Article, EURASIP Journal on Image and Video Processing, vol. 54, pp. 1–24, 2017. doi: 10.1186/s13640-017-0194. [PDF] http://robotics.ntua.gr/wp-content/publications/Zlatintsi+_COGNIMUSEdb_EURASIP_JIVP-2017.pdf
Abstract: Research related to computational modeling for machine-based understanding requires ground truth data for training, content analysis, and evaluation. In this paper, we present a multimodal video database, namely COGNIMUSE, annotated with sensory and semantic saliency, events, cross-media semantics, and emotion. The purpose of this database is manifold; it can be used for training and evaluation of event detection and summarization algorithms, for classification and recognition of audio-visual and cross-media events, as well as for emotion tracking. In order to enable comparisons with other computational models, we propose state-of-the-art algorithms, specifically a unified energy-based audio-visual framework and a method for text saliency computation, for the detection of perceptually salient events from videos. Additionally, a movie summarization system for the automatic production of summaries is presented. Two kinds of evaluation were performed: an objective one, based on the saliency annotation of the database, and an extensive qualitative human evaluation of the automatically produced summaries, in which we investigated what composes high-quality movie summaries; both verified the appropriateness of the proposed methods. The annotation of the database and the code for the summarization system can be found at http://cognimuse.cs.ntua.gr/database.
G Karamanolakis, E Iosif, A Zlatintsi, A Pikrakis, A Potamianos, "Audio-based Distributional Semantic Models for Music Auto-tagging and Similarity Measurement," Conference, Proc. MultiLearn2017: Multimodal Processing, Modeling and Learning for Human-Computer/Robot Interaction Workshop (in conjunction with the European Signal Processing Conference), Kos, Greece, 2017. [PDF] http://robotics.ntua.gr/wp-content/publications/Karamanolakis+_MultiLearn-17_ML7.pdf
Abstract: The recent development of Audio-based Distributional Semantic Models (ADSMs) enables the computation of audio and lexical vector representations in a joint acoustic-semantic space. In this work, these joint representations are applied to the problem of automatic tag generation. The predicted tags together with their corresponding acoustic representation are exploited for the construction of acoustic-semantic clip embeddings. The proposed algorithms are evaluated on the task of similarity measurement between music clips. Acoustic-semantic models are shown to outperform the state-of-the-art for this task and produce high-quality tags for audio/music clips.
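As a rough illustration of the clip-similarity pipeline described above (acoustic features, predicted tags, a joint clip embedding, then cosine similarity), here is a minimal sketch; the convex-combination weighting and all names below are illustrative assumptions, not the paper's exact model:

    import numpy as np

    def clip_embedding(acoustic_vec, tag_scores, tag_embeddings, alpha=0.5):
        # Hypothetical acoustic-semantic clip embedding: a convex combination
        # of the normalized acoustic vector and the score-weighted average of
        # the embeddings of the clip's predicted tags. Assumes the acoustic
        # vector and tag embeddings share the same dimension d.
        sem = tag_scores @ tag_embeddings                  # (d,) weighted tag average
        sem /= np.linalg.norm(sem) + 1e-12
        ac = acoustic_vec / (np.linalg.norm(acoustic_vec) + 1e-12)
        return alpha * ac + (1.0 - alpha) * sem

    def clip_similarity(e1, e2):
        # Cosine similarity between two clip embeddings.
        return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12))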
2016
G Karamanolakis, E Iosif, A Zlatintsi, A Pikrakis, A Potamianos, "Audio-Based Distributional Representations of Meaning Using a Fusion of Feature Encodings," Conference, Proc. Interspeech, 2016. [PDF] http://robotics.ntua.gr/wp-content/publications/Karamanolakis+_Interspeech16.PDF [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/karamanolakis16_interspeech.pdf
Abstract: Recently a "Bag-of-Audio-Words" approach was proposed [1] for the combination of lexical features with audio clips in a multimodal semantic representation, i.e., an Audio Distributional Semantic Model (ADSM). An important step towards the creation of ADSMs is the estimation of the semantic distance between clips in the acoustic space, which is especially challenging given the diversity of audio collections. In this work, we investigate the use of different feature encodings in order to address this challenge, following a two-step approach. First, an audio clip is categorized with respect to three classes, namely, music, speech and other. Next, the feature encodings are fused according to the posterior probabilities estimated in the previous step. Using a collection of audio clips annotated with tags, we derive a mapping between words and audio clips. Based on this mapping and the proposed audio semantic distance, we construct an ADSM model in order to compute the distance between words (lexical semantic similarity task). The proposed model is shown to significantly outperform (23.6% relative improvement in correlation coefficient) the state-of-the-art results reported in the literature.
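The two-step scheme in the abstract (classify each clip as music, speech, or other, then fuse the per-class feature encodings according to the classifier's posteriors) amounts to a posterior-weighted sum; a minimal sketch, with the classifier and encoders left abstract as assumptions:

    def fused_encoding(clip, class_posteriors, encoders):
        # class_posteriors: dict class name -> posterior probability,
        #   e.g. {"music": 0.7, "speech": 0.2, "other": 0.1} (hypothetical values)
        # encoders: dict class name -> function mapping a clip to a feature vector
        # The fused encoding is the posterior-weighted sum of per-class encodings.
        return sum(p * encoders[c](clip) for c, p in class_posteriors.items())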
2015
P Koutras, A Zlatintsi, E Iosif, A Katsamanis, P Maragos, A Potamianos, "Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization," Conference, Proc. IEEE Int'l Conf. on Image Processing (ICIP), Quebec, Canada, pp. 4361–4365, 2015. doi: 10.1109/ICIP.2015.7351630. [PDF] http://robotics.ntua.gr/wp-content/publications/KZIKMP_MovieSum2_ICIP-2015.pdf
Abstract: In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words on the movie subtitles. For the evaluation of the proposed system, we employ an elementary and non-parametric classification technique such as KNN. Detection results are reported on the MovSum database, using objective evaluations against ground-truth denoting the perceptually salient events, and human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, also improving the comprehension of the semantics.
A Zlatintsi, E Iosif, P Maragos, A Potamianos, "Audio Salient Event Detection and Summarization using Audio and Text Modalities," Conference, Proc. European Signal Processing Conference (EUSIPCO), Nice, France, 2015. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiEtAl_AudioTextSum-EUSIPCO-2015.pdf
Abstract: This paper investigates the problem of audio event detection and summarization, building on previous work [1, 2] on the detection of perceptually important audio events based on saliency models. We take a synergistic approach to audio summarization where saliency computation of audio streams is assisted by using the text modality as well. Auditory saliency is assessed by auditory and perceptual cues such as Teager energy, loudness and roughness, all known to correlate with attention and human hearing. Text analysis incorporates part-of-speech tagging and affective modeling. A computational method for the automatic correction of the boundaries of the selected audio events is applied, creating summaries that consist not only of salient but also meaningful and semantically coherent events. A non-parametric classification technique is employed, and results are reported on the MovSum movie database using objective evaluations against ground-truth designating the auditory and semantically salient events.
A Zlatintsi, P Koutras, N Efthymiou, P Maragos, A Potamianos, K Pastra, "Quality Evaluation of Computational Models for Movie Summarization," Conference, Proc. Int'l Conf. on Quality of Multimedia Experience (QoMEX), Costa Navarino, Messinia, Greece, 2015. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiEtAl_MovieSumEval-QoMEX2015.pdf
Abstract: In this paper we present a movie summarization system and we investigate what composes high-quality movie summaries in terms of user experience evaluation. We propose state-of-the-art audio, visual and text techniques for the detection of perceptually salient events from movies. The evaluation of such computational models is usually based on the comparison of the similarity between the system-detected events and some ground-truth data. For this reason, we have developed the MovSum movie database, which includes sensory and semantic saliency annotation as well as cross-media relations, for objective evaluations. The automatically produced movie summaries were qualitatively evaluated, in an extensive human evaluation, in terms of informativeness and enjoyability, accomplishing very high ratings of up to 80% and 90%, respectively, which verifies the appropriateness of the proposed methods.
2012
A Zlatintsi, P Maragos, A Potamianos, G Evangelopoulos, "A Saliency-Based Approach to Audio Event Detection and Summarization," Conference, Proc. European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 2012. [PDF] http://robotics.ntua.gr/wp-content/publications/ZlatintsiMaragos+_SaliencyBasedAudioSummarization_EUSIPCO2012.pdf
Abstract: In this paper, we approach the problem of audio summarization by saliency computation of audio streams, exploring the potential of a modulation model for the detection of perceptually important audio events based on saliency models, along with various fusion schemes for their combination. The fusion schemes include linear, adaptive and nonlinear methods. A machine learning approach, where training of the features is performed, was also applied for the purpose of comparison with the proposed technique. For the evaluation of the algorithm we use audio data taken from movies, and we show that nonlinear fusion schemes perform best. The results are reported on the MovSum database, using objective evaluations (against ground-truth denoting the perceptually important audio events). Analysis of the selected audio segments is also performed against a labeled database with respect to audio categories, while a method for fine-tuning of the selected audio events is proposed.
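The linear and nonlinear fusion schemes compared in this line of work reduce to a few lines over per-cue saliency curves; a minimal sketch (uniform weights and min/max combiners are illustrative placeholders, not the paper's tuned configuration):

    import numpy as np

    def fuse_saliency(curves, scheme="linear", weights=None):
        # curves: array of shape (n_cues, n_frames); each row is one
        # normalized saliency curve (e.g. one per feature or modality).
        s = np.asarray(curves, dtype=float)
        if scheme == "linear":            # weighted sum of cues
            w = np.full(len(s), 1.0 / len(s)) if weights is None else np.asarray(weights)
            return w @ s
        if scheme == "max":               # nonlinear: strongest cue dominates
            return s.max(axis=0)
        if scheme == "min":               # nonlinear: all cues must agree
            return s.min(axis=0)
        raise ValueError(f"unknown scheme: {scheme}")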
2011
Dimitrios Dimitriadis, Petros Maragos, Alexandros Potamianos, "On the effects of filterbank design and energy computation on robust speech recognition," Journal Article, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 6, pp. 1504–1516, 2011. doi: 10.1109/TASL.2010.2092766. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_Effects-Filterbank-Design-Energy-Computation-Robust-Speech-Recognition_ieeeTASLP_aug11.pdf
Abstract: In this paper, we examine how energy computation and filterbank design contribute to the overall front-end robustness, especially when the investigated features are applied to noisy speech signals, in mismatched training-testing conditions. In prior work ("Auditory Teager energy cepstrum coefficients for robust speech recognition," D. Dimitriadis, P. Maragos, and A. Potamianos, in Proc. Eurospeech'05, Sep. 2005), a novel feature set called "Teager energy cepstrum coefficients" (TECCs) was proposed, employing a dense, smooth filterbank and alternative energy computation schemes. TECCs were shown to be more robust to noise and to exhibit improved performance compared to the widely used Mel frequency cepstral coefficients (MFCCs). In this paper, we attempt to interpret these results using a combined theoretical and experimental analysis framework. Specifically, we investigate in detail the connection between the filterbank design, i.e., the filter shape and bandwidth, the energy estimation scheme, and the automatic speech recognition (ASR) performance under a variety of additive and/or convolutional noise conditions. For this purpose: 1) the performance of filterbanks using triangular, Gabor, and Gammatone filters with various bandwidths and filter positions is examined under different noisy speech recognition tasks, and 2) the squared amplitude and Teager-Kaiser energy operators are compared as two alternative approaches to computing the signal energy. Our end goal is to understand how to select the most efficient filterbank and energy computation scheme that are maximally robust under both clean and noisy recording conditions. Theoretical and experimental results show that: 1) the filter bandwidth is one of the most important factors affecting speech recognition performance in noise, while the shape of the filter is of secondary importance, and 2) the Teager-Kaiser operator outperforms (on average and for most noise types) the squared amplitude energy computation scheme for speech recognition in noisy conditions, especially for large filter bandwidths. Experimental results show that selecting the appropriate filterbank and energy computation scheme can lead to significant error rate reduction over both MFCC and perceptual linear prediction (PLP) features for a variety of speech recognition tasks. A relative error rate reduction of up to ~30% for MFCCs and ~39% for PLPs is shown for the Aurora-3 Spanish Task.
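The two short-time energy schemes compared in the paper are each a one-liner over an analysis frame; a minimal sketch (the clamp against rare negative Teager values is an illustrative choice, not prescribed by the paper):

    import numpy as np

    def squared_amplitude_energy(frame):
        # Mean squared amplitude over the analysis window.
        return np.mean(frame ** 2)

    def teager_kaiser_energy(frame):
        # Short-time average of Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
        psi = frame[1:-1] ** 2 - frame[:-2] * frame[2:]
        return np.mean(np.maximum(psi, 0.0))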
2009
Dimitrios Dimitriadis, Alexandros Potamianos, Petros Maragos, "A comparison of the squared energy and Teager-Kaiser operators for short-term energy estimation in additive noise," Journal Article, IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2569–2581, 2009. doi: 10.1109/TSP.2009.2019299. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisPotamianosMaragos_ComparisonSquaredAmpl-TKOper-EnergyEstimation_ieeetSP2008.pdf
Abstract: Time-frequency distributions that evaluate the signal's energy content both in the time and frequency domains are indispensable signal processing tools, especially for nonstationary signals. Various short-time energy computation schemes are used in practice, including the mean squared amplitude and Teager-Kaiser energy approaches. Herein, we focus primarily on the short- and medium-term properties of these two energy estimation schemes, as well as on their performance in the presence of additive noise. To facilitate this analysis and generalize the approach, we use a harmonic noise model to approximate the noise component. The error analysis is conducted both in the continuous- and discrete-time domains, deriving similar conclusions. The estimation errors are measured in terms of normalized deviations from the expected signal energy and are shown to greatly depend on both the signal's spectral content and the analysis window length. When medium- and long-term analysis windows are employed, the Teager-Kaiser energy operator is proven superior to the common squared energy operator, provided that the spectral content of the noise is more lowpass than the corresponding signal content, and vice versa. However, for shorter window lengths, the Teager-Kaiser operator always outperforms the squared energy operator. The theoretical results are experimentally verified for synthetic signals. Finally, the performance of the proposed energy operators is evaluated for short-term analysis of noisy speech signals, and the implications for speech processing applications are outlined.
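For reference, the standard continuous- and discrete-time forms of the two operators being compared (textbook definitions, not notation specific to this paper):

    \Psi_c[x(t)] = \dot{x}^2(t) - x(t)\,\ddot{x}(t),
    \qquad
    \Psi_d[x(n)] = x^2(n) - x(n-1)\,x(n+1)

For a pure sinusoid x(t) = A cos(ωt + φ), Ψ_c[x] = A²ω², while the window-averaged squared amplitude tends to A²/2; the Teager-Kaiser estimate thus weights energy by frequency, which underlies the noise-dependent trade-offs analyzed in the paper.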
G Evangelopoulos, A Zlatintsi, G Skoumas, K Rapantzikos, A Potamianos, P Maragos, Y Avrithis, "Video Event Detection and Summarization Using Audio, Visual and Text Saliency," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, pp. 3553–3556, 2009. [PDF] http://robotics.ntua.gr/wp-content/publications/EvangelopoulosZlatintsiEtAl_VideoEventDetectionSummarizationUsingAVTSaliency_ICASSP09.pdf
Abstract: Detection of perceptually important video events is formulated here on the basis of saliency models for the audio, visual and textual information conveyed in a video stream. Audio saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color and motion. Text saliency is extracted from part-of-speech tagging on the subtitles information available with most movie distributions. The various modality curves are integrated in a single attention curve, where the presence of an event may be signified in one or multiple domains. This multimodal saliency curve is the basis of a bottom-up video summarization algorithm that refines results from unimodal or audiovisual-based skimming. The algorithm performs favorably for video summarization in terms of informativeness and enjoyability.
2008
G Evangelopoulos, K Rapantzikos, A Potamianos, P Maragos, A Zlatintsi, Y Avrithis, "Movie summarization based on audiovisual saliency detection," Conference, Proc. IEEE Int'l Conf. on Image Processing (ICIP), pp. 2528–2531, 2008. doi: 10.1109/ICIP.2008.4712308. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/ERPMZA_MovieSummarizAVSaliency_ICIP2008.pdf
Abstract: Based on perceptual and computational attention modeling studies, we formulate measures of saliency for an audiovisual stream. Audio saliency is captured by signal modulations and related multi-frequency band features, extracted through nonlinear operators and energy tracking. Visual saliency is measured by means of a spatiotemporal attention model driven by various feature cues (intensity, color, motion). Audio and video curves are integrated in a single attention curve, where events may be enhanced, suppressed or vanish. The presence of salient events is signified on this audiovisual curve by geometrical features such as local extrema, sharp transition points and level sets. An audiovisual saliency-based movie summarization algorithm is proposed and evaluated. The algorithm is shown to perform very well in terms of summary informativeness and enjoyability for movie clips of various genres.
Georgios Evangelopoulos, Konstantinos Rapantzikos, Petros Maragos, Yannis Avrithis, Alexandros Potamianos, "Audiovisual Attention Modeling and Salient Event Detection," Book Chapter, in Maragos, P.; Potamianos, A.; Gros, P. (Eds.): Multimodal Processing and Interaction: Audio, Video, Text, pp. 1–21, Springer US, Boston, MA, 2008, ISBN 978-0-387-76316-3. doi: 10.1007/978-0-387-76316-3_8. [Webpage] https://doi.org/10.1007/978-0-387-76316-3_8 [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/Evangelopoulos-et-al_Chapter-of-Book_MPIAVT_Maragos-et-aled_Springer2008_peprint.pdf
2005
Dimitrios Dimitriadis, Petros Maragos, Alexandros Potamianos, "Auditory Teager Energy Cepstrum Coefficients for Robust Speech Recognition," Conference, Proc. European Speech Processing Conference (Eurospeech/Interspeech), pp. 3013–3016, 2005. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_AuditTeagEnergCepstrumRobustSpeechRecogn_Interspeech2005.pdf
Abstract: In this paper, a feature extraction algorithm for robust speech recognition is introduced. The feature extraction algorithm is motivated by human auditory processing and the nonlinear Teager-Kaiser energy operator that estimates the true energy of the source of a resonance. The proposed features are labeled Teager Energy Cepstrum Coefficients (TECCs). TECCs are computed by first filtering the speech signal through a dense, non-constant-Q Gammatone filterbank and then estimating the "true" energy of the signal's source, i.e., the short-time average of the output of the Teager-Kaiser energy operator. Error analysis and speech recognition experiments show that the TECCs and the mel frequency cepstrum coefficients (MFCCs) perform similarly for clean recording conditions, while the TECCs perform significantly better than the MFCCs for noisy recognition tasks. Specifically, a relative word error rate improvement of 60% over the MFCC baseline is shown for the Aurora-3 database for the high-mismatch condition. Absolute error rate improvement ranging from 5% to 20% is shown for a phone recognition task in (various types of additive) noise.
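A minimal sketch of a TECC-like front end following the abstract's recipe (Gammatone filterbank, Teager-Kaiser energy, short-time averaging, log and DCT); the center frequencies, frame sizes, and the use of SciPy's scipy.signal.gammatone designer (SciPy >= 1.6) are illustrative assumptions, not the paper's exact configuration:

    import numpy as np
    from scipy.signal import gammatone, lfilter
    from scipy.fft import dct

    def teager(x):
        # Discrete Teager-Kaiser energy: Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def tecc_like(signal, fs, centers_hz, n_ceps=13, frame=400, hop=160):
        bands = []
        for fc in centers_hz:
            b, a = gammatone(fc, 'iir', fs=fs)      # 4th-order IIR Gammatone
            y = lfilter(b, a, signal)
            e = np.maximum(teager(y), 1e-10)        # instantaneous TK energy
            # short-time average of the TK energy, per frame
            bands.append([e[i:i + frame].mean()
                          for i in range(0, len(e) - frame, hop)])
        log_e = np.log(np.array(bands))             # (n_bands, n_frames)
        return dct(log_e, axis=0, norm='ortho')[:n_ceps].T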
2002
D Dimitriadis, P Maragos, A Potamianos, "Modulation features for speech recognition," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-377–I-380, 2002. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/dimitriadis2002.pdf
2001
Alexandros Potamianos, Petros Maragos, "Time-frequency distributions for automatic speech recognition," Journal Article, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 196–200, 2001. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_TFD-ASR_ieeetSAP2001.pdf
1999
Alexandros Potamianos, Petros Maragos, "Speech analysis and synthesis using an AM–FM modulation model," Journal Article, Speech Communication, vol. 28, no. 3, pp. 195–209, 1999. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_SpeechAnalSynthUsingAMFM-ModulModel_SpeCom1999.pdf
Petros Maragos, Alexandros Potamianos, "Fractal dimensions of speech sounds: Computation and application to automatic speech recognition," Journal Article, The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1925–1932, 1999. doi: 10.1121/1.426738. [Webpage] http://asa.scitation.org/doi/10.1121/1.426738 [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/MaragosPotamianos_SpeecFrDimRecogn_JASA1999.pdf
Abstract: The dynamics of airflow during speech production may often result in some small or large degree of turbulence. In this paper, the geometry of speech turbulence, as reflected in the fragmentation of the time signal, is quantified using fractal models. An efficient algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering is described, and its potential for speech segmentation and phonetic classification is discussed. Also reported are experimental results on using the short-time fractal dimension of speech signals at multiple scales as additional features in an automatic speech recognition system using hidden Markov models, which provide a modest improvement in speech recognition performance.
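The multiscale morphological estimate the abstract refers to can be sketched with the classical morphological covering method: dilate and erode the signal by flat structuring elements of growing scale s, take the area A(s) between the two envelopes as a cover of the signal graph, and read the dimension D off the slope of log(A(s)/s²) versus log(1/s). A minimal sketch (the scale range and the least-squares fit are illustrative assumptions):

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion

    def morphological_fractal_dimension(x, max_scale=20):
        # For a fractal of dimension D the cover area behaves like
        # A(s) ~ s^(2 - D), so log(A(s)/s^2) is linear in log(1/s)
        # with slope D.
        log_inv_s, log_area = [], []
        for s in range(1, max_scale + 1):
            size = 2 * s + 1                       # flat structuring element
            cover = grey_dilation(x, size=size) - grey_erosion(x, size=size)
            log_area.append(np.log(cover.sum() / s ** 2))
            log_inv_s.append(np.log(1.0 / s))
        slope, _ = np.polyfit(log_inv_s, log_area, 1)   # slope estimates D
        return slope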
1997
Petros Maragos, Alexandros Potamianos, "On Using Fractal Features of Speech Sounds in Automatic Speech Recognition," Conference, Proc. Eurospeech, 1997. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/maragos97_eurospeech.pdf
1995
Petros Maragos, Alexandros Potamianos, "Higher Order Differential Energy Operators," Journal Article, IEEE Signal Processing Letters, vol. 2, no. 8, pp. 152–154, 1995. doi: 10.1109/97.404130. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/MaragosPotamianos_HOEnergOper_ieeeSPL1995.pdf
Abstract: Instantaneous signal operators of integer orders k, Υ_k(x) = ẋ·x^(k−1) − x·x^(k) (where x^(k) denotes the k-th derivative), are proposed to measure the cross energy between a signal x and its derivatives. These higher order differential energy operators contain as a special case, for k = 2, the Teager-Kaiser (1990) operator. When applied to (possibly modulated) sinusoids, they yield several new energy measurements useful for parameter estimation or AM-FM demodulation. Applying them to sampled signals involves replacing derivatives with differences, which leads to several useful discrete energy operators defined on an extremely short window of samples.
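In LaTeX form, the operator family just described, with x^{(k)} denoting the k-th derivative:

    \Upsilon_k(x) = \dot{x}\, x^{(k-1)} - x\, x^{(k)}, \qquad
    \Upsilon_2(x) = \dot{x}^2 - x\,\ddot{x} \;\; \text{(the Teager-Kaiser operator)}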
P Maragos, A Potamianos, B Santhanam, "Instantaneous Energy Operators: Applications to Speech Processing and Communications," Conference, Proc. IEEE Workshop on Nonlinear Signal and Image Processing, Halkidiki, Greece, pp. 955–958, June 1995. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/nsip95.pdf
A Potamianos, P Maragos, "Speech formant frequency and bandwidth tracking using multiband energy demodulation," Conference, Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 784–787, 1995. doi: 10.1109/ICASSP.1995.479811. [Webpage] http://ieeexplore.ieee.org/document/479811/ [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/speech-formant-frequency-and-bandwidth-tracking-using-multiband-.pdf
Abstract: In this paper, the amplitude and frequency (AM-FM) modulation model and a multiband demodulation analysis scheme are applied to formant frequency and bandwidth tracking of speech signals. Filtering by a bank of Gabor bandpass filters is performed to isolate each speech resonance in the signal. Next, the amplitude envelope (AM) and instantaneous frequency (FM) are estimated for each band using the energy separation algorithm (ESA). Short-time formant frequency and bandwidth estimates are obtained from the instantaneous amplitude and frequency signals; two frequency estimates are proposed and their relative merits are discussed. The short-time estimates are used to compute the formant locations and bandwidths. Performance and computational issues of the algorithm are discussed. Overall, multiband demodulation analysis (MDA) is shown to be a useful tool for extracting information from the speech resonances in the time-frequency plane.
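A minimal sketch of the demodulation step via the energy separation algorithm, here in its common discrete DESA-2 variant (a standard formulation from the energy-operator literature; the clipping guards are illustrative, and the input is assumed to be an already bandpass-filtered resonance):

    import numpy as np

    def tkeo(x):
        # Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def desa2(x, fs):
        # Demodulate a bandpass AM-FM signal into instantaneous
        # amplitude and frequency (DESA-2):
        #   Omega(n) = 0.5 * arccos(1 - Psi[y(n)] / (2 * Psi[x(n)])),
        #   |a(n)|   = 2 * Psi[x(n)] / sqrt(Psi[y(n)]),
        # with y(n) = x(n+1) - x(n-1).
        psi_x = np.maximum(tkeo(x)[1:-1], 1e-12)     # aligned with psi_y below
        y = x[2:] - x[:-2]
        psi_y = np.maximum(tkeo(y), 1e-12)
        arg = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
        omega = 0.5 * np.arccos(arg)                 # rad/sample
        amp = 2.0 * psi_x / np.sqrt(psi_y)
        return amp, omega * fs / (2.0 * np.pi)       # envelope, frequency in Hz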
1994
H M Hanson, P Maragos, A Potamianos, "A system for finding speech formants and modulations via energy separation," Journal Article, IEEE Transactions on Speech and Audio Processing, vol. 2, no. 3, pp. 436–443, 1994. doi: 10.1109/89.294358. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/HansonMaragosPotamianos_IterESA_ieeetSAP1994.pdf
Abstract: This correspondence presents an experimental system that uses an energy-tracking operator and a related energy separation algorithm to automatically find speech formants and amplitude/frequency modulations in voiced speech segments. Initial estimates of formant center frequencies are provided by either LPC or morphological spectral peak picking. These estimates are then shown to be improved by a combination of bandpass filtering and iterative application of energy separation.
A Potamianos, P Maragos, "A Comparison of the Energy Operator and Hilbert Transform Approaches for Signal and Speech Demodulation," Journal Article, Signal Processing, vol. 37, no. 1, pp. 95–120, 1994. [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PotamianosMaragos_ComparEnergOpHilbertTransfSigSpeechDemod_SigPro1994.pdf
A Potamianos, P Maragos, "Applications of Speech Processing Using an AM–FM Modulation Model and Energy Operators," Conference, Proc. European Signal Processing Conference (EUSIPCO), vol. III, pp. 1669–1672, 1994. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/Potamianos_ApplicSpeechProc_1994.pdf
1993
H M Hanson, P Maragos, A Potamianos, "Finding Speech Formants and Modulations via Energy Separation: With an Application to a Vocoder," Conference, Proc. Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP-93), Minneapolis, MN, 1993. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/hanson1993.pdf