2017
Panagiotis Paraskevas Filntisis, Athanasios Katsamanis, Pirros Tsiakoulis, Petros Maragos
Video-realistic expressive audio-visual speech synthesis for the Greek language
Journal Article
Speech Communication, 95, pp. 137–152, 2017, ISSN: 01676393.

@article{345,
  title = {Video-realistic expressive audio-visual speech synthesis for the Greek language},
  author = {Panagiotis Paraskevas Filntisis and Athanasios Katsamanis and Pirros Tsiakoulis and Petros Maragos},
  url = {http://robotics.ntua.gr/wp-content/uploads/publications/FilntisisKatsamanisTsiakoulis+_VideoRealExprAudioVisSpeechSynthGrLang_SC17.pdf},
  doi = {10.1016/j.specom.2017.08.011},
  issn = {01676393},
  year = {2017},
  date = {2017-01-01},
  journal = {Speech Communication},
  volume = {95},
  pages = {137--152},
  abstract = {High quality expressive speech synthesis has been a long-standing goal towards natural human-computer interaction. Generating a talking head which is both realistic and expressive appears to be a considerable challenge, due to both the high complexity in the acoustic and visual streams and the large non-discrete number of emotional states we would like the talking head to be able to express. In order to cover all the desired emotions, a significant amount of data is required, which poses an additional time-consuming data collection challenge. In this paper we attempt to address the aforementioned problems in an audio-visual context. Towards this goal, we propose two deep neural network (DNN) architectures for Video-realistic Expressive Audio-Visual Text-To-Speech synthesis (EAVTTS) and evaluate them by comparing them directly both to traditional hidden Markov model (HMM) based EAVTTS, as well as a concatenative unit selection EAVTTS approach, both on the realism and the expressiveness of the generated talking head. Next, we investigate adaptation and interpolation techniques to address the problem of covering the large emotional space. We use HMM interpolation in order to generate different levels of intensity for an emotion, as well as investigate whether it is possible to generate speech with intermediate speaking styles between two emotions. In addition, we employ HMM adaptation to adapt an HMM-based system to another emotion using only a limited amount of adaptation data from the target emotion. We performed an extensive experimental evaluation on a medium sized audio-visual corpus covering three emotions, namely anger, sadness and happiness, as well as neutral reading style. Our results show that DNN-based models outperform HMMs and unit selection on both the realism and expressiveness of the generated talking heads, while in terms of adaptation we can successfully adapt an audio-visual HMM set trained on a neutral speaking style database to a target emotion. Finally, we show that HMM interpolation can indeed generate different levels of intensity for EAVTTS by interpolating an emotion with the neutral reading style, as well as in some cases, generate audio-visual speech with intermediate expressions between two emotions.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2008
George Caridakis, Olga Diamanti, Kostas Karpouzis, Petros Maragos
Automatic sign language recognition
Conference
Proceedings of the 1st ACM international conference on PErvasive Technologies Related to Assistive Environments - PETRA '08, 2008, ISBN: 9781605580678.

@conference{208,
  title = {Automatic sign language recognition},
  author = {George Caridakis and Olga Diamanti and Kostas Karpouzis and Petros Maragos},
  url = {http://portal.acm.org/citation.cfm?doid=1389586.1389687},
  doi = {10.1145/1389586.1389687},
  isbn = {9781605580678},
  year = {2008},
  date = {2008-01-01},
  booktitle = {Proceedings of the 1st ACM international conference on PErvasive Technologies Related to Assistive Environments - PETRA '08},
  pages = {1},
  abstract = {This work focuses on two of the research problems comprising automatic sign language recognition, namely robust computer vision techniques for consistent hand detection and tracking, while preserving the hand shape contour which is useful for extraction of features related to the handshape and a novel classification scheme incorporating Self-organizing maps, Markov chains and Hidden Markov Models. Geodesic Active Contours enhanced with skin color and motion information are employed for the hand detection and the extraction of the hand silhouette, while features extracted describe hand trajectory, region and shape. Extracted features are used as input to separate classifiers, forming a robust and adaptive architecture whose main contribution is the optimal utilization of the neighboring characteristic of the SOM during the decoding stage of the Markov chain, representing the sign class.},
  keywords = {},
  pubstate = {published},
  tppubtype = {conference}
}
2002
D Dimitriadis, P Maragos, A Potamianos
Modulation features for speech recognition
Journal Article
International Conference on Acoustics, 1, pp. I–377–I–380, 2002.

@article{76c,
  title = {Modulation features for speech recognition},
  author = {D Dimitriadis and P Maragos and A Potamianos},
  url = {http://robotics.ntua.gr/wp-content/uploads/publications/DimitriadisMaragosPotamianos_RobustAMFM_Features_SpeechRecognition_ieeeSPL2005.pdf},
  year = {2002},
  date = {2002-01-01},
  journal = {International Conference on Acoustics},
  volume = {1},
  pages = {I--377--I--380},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
D Dimitriadis, P Maragos, A Potamianos
Modulation features for speech recognition
Conference
International Conference on Acoustics, 1, 2002.

@conference{253,
  title = {Modulation features for speech recognition},
  author = {D Dimitriadis and P Maragos and A Potamianos},
  year = {2002},
  date = {2002-01-01},
  booktitle = {International Conference on Acoustics},
  volume = {1},
  pages = {I--377--I--380},
  keywords = {},
  pubstate = {published},
  tppubtype = {conference}
}
V Pitsikalis, P Maragos
Speech analysis and feature extraction using chaotic models
Conference
International Conference on Acoustics, 1, 2002.

@conference{252,
  title = {Speech analysis and feature extraction using chaotic models},
  author = {V Pitsikalis and P Maragos},
  year = {2002},
  date = {2002-01-01},
  booktitle = {International Conference on Acoustics},
  volume = {1},
  pages = {I--533--I--536},
  keywords = {},
  pubstate = {published},
  tppubtype = {conference}
}
Copyright Notice:
Some material presented is available for download to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
The work already published by the IEEE is under its copyright. Personal use of such material is permitted. However, permission to reprint/republish the material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of the work in other works must be obtained from the IEEE.