
- (+30) 210772-2420, 2964
- filby@central.ntua.gr
- Office 2.2.19
Biosketch
I was born in Athens in 1989. I received my Diploma Degree in Electrical Engineering and Computer Science from the National Technical University of Athens in October 2015.
Since November 2015 I have been a PhD candidate in the IRAL lab under the supervision of Prof. Petros Maragos.
My research interests lie at the crossroads of computer vision and audio processing for affective computing, including building affective multimodal virtual avatars and automatic emotion recognition systems.
Publications
2022
Niki Efthymiou, Panagiotis P. Filntisis, Petros Koutras, Antigoni Tsiami, Jack Hadfield, Gerasimos Potamianos, Petros Maragos, "ChildBot: Multi-robot perception and interaction with children," Robotics and Autonomous Systems, vol. 150, p. 103975, 2022, ISSN: 0921-8890, doi: 10.1016/j.robot.2021.103975.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2022_EfthymiouEtAl_ChildBot-MultiRobotPerception-InteractionChildren_RAS.pdf
Abstract: In this paper, we present an integrated robotic system capable of participating in and performing a wide range of educational and entertainment tasks collaborating with one or more children. The system, called ChildBot, features multimodal perception modules and multiple robotic agents that monitor the interaction environment and can robustly coordinate complex Child–Robot Interaction use-cases. In order to validate the effectiveness of the system and its integrated modules, we have conducted multiple experiments with a total of 52 children. Our results show improved perception capabilities in comparison to our earlier works that ChildBot was based on. In addition, we have conducted a preliminary user experience study, employing some educational/entertainment tasks, that yields encouraging results regarding the technical validity of our system and initial insights on the user experience with it.
Foivos Paraperas-Papantoniou, Panagiotis P. Filntisis, Petros Maragos, Anastasios Roussos, "Neural Emotion Director: Speech-Preserving Semantic Control of Facial Expressions in 'In-the-Wild' Videos," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022 (Oral, Best Paper Finalist).
Video: http://robotics.ntua.gr/wp-content/uploads/sites/2/Paraperas_cvpr2022_NED_oral_video.mp4
Poster: http://robotics.ntua.gr/wp-content/uploads/sites/2/Paraperas_cvpr2022_NED_poster.pdf
Supplementary material: http://robotics.ntua.gr/wp-content/uploads/sites/2/Paraperas_NED_CVPR2022_supplemental-material.pdf
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/Paraperas_NED-SpeechPreservingSemanticControlFacialExpressions_CVPR2022_paper.pdf
2021
N. Efthymiou, P. P. Filntisis, G. Potamianos, P. Maragos, "Visual Robotic Perception System with Incremental Learning for Child–Robot Interaction Scenarios," Technologies, vol. 9, no. 4, 2021, doi: 10.3390/technologies9040086.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2021_EfthymiouEtAl_VisualRobotPerceptionSystem-ChildRobotInteract_Technologies.pdf
Abstract: This paper proposes a novel lightweight visual perception system with Incremental Learning (IL), tailored to child–robot interaction scenarios. Specifically, this encompasses both an action and emotion recognition module, with the former wrapped around an IL system, allowing novel actions to be easily added. This IL system enables the tutor aspiring to use robotic agents in interaction scenarios to further customize the system according to children’s needs. We perform extensive evaluations of the developed modules, achieving state-of-the-art results on both the children’s action BabyRobot dataset and the children’s emotion EmoReact dataset. Finally, we demonstrate the robustness and effectiveness of the IL system for action recognition by conducting a thorough experimental analysis for various conditions and parameters.
P. Antoniadis, P. P. Filntisis, P. Maragos, "Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition," in Proc. 16th IEEE Int’l Conf. on Automatic Face and Gesture Recognition (FG-2021), 2021.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2021_AntoniadisEtAl_Emotion-GCN-FacialExpressionRecogn_FG.pdf
I. Pikoulis, P. P. Filntisis, P. Maragos, "Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild," in Proc. 16th IEEE Int’l Conf. on Automatic Face and Gesture Recognition (FG-2021), 2021.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2021_PikoulisEtAl_VideoEmotionRecognInTheWild-SemanticMultiStreamContext_FG.pdf
P. P. Filntisis, Niki Efthymiou, Gerasimos Potamianos, Petros Maragos, "An Audiovisual Child Emotion Recognition System for Child-Robot Interaction Applications," in Proc. European Signal Processing Conference (EUSIPCO), Dublin, Ireland, pp. 41-44, 2021.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2021_FilntisisEtAl_AV-ChildEmotionRecognSystem-ChildRobotInteract_EUSIPCO.pdf
Slides: http://robotics.ntua.gr/wp-content/uploads/sites/2/Filntisis_EUSIPCO2021_ChildEmotionRecogn_presentation_slides.pdf
Video: http://robotics.ntua.gr/wp-content/uploads/sites/2/Filntisis_EUSIPCO2021_ChildEmotionRecogn_presentation_video.mp4
C. Garoufis, A. Zlatintsi, P. P. Filntisis, N. Efthymiou, E. Kalisperakis, V. Garyfalli, T. Karantinos, L. Mantonakis, N. Smyrnis, P. Maragos, "An Unsupervised Learning Approach for Detecting Relapses from Spontaneous Speech in Patients with Psychosis," in Proc. IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI-2021), 2021.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/Garoufis_BHI2021_UnsupervisedLearningRelapseDetection_Paper.pdf
Poster presentation (video): http://robotics.ntua.gr/wp-content/uploads/sites/2/Garoufis_BHI21_Poster_presentation.mp4
Poster: http://robotics.ntua.gr/wp-content/uploads/sites/2/Garoufis_BHI21_Poster.pdf
2020
I. Maglogiannis, A. Zlatintsi, A. Menychtas, D. Papadimatos, P. P. Filntisis, N. Efthymiou, G. Retsinas, P. Tsanakas, P. Maragos, "An intelligent cloud-based platform for effective monitoring of patients with psychotic disorders," in Proc. Int’l Conf. on Artificial Intelligence Applications and Innovation (AIAI-2020), Halkidiki, Greece, 2020.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2020_MaglogiannisEtAl_e-Prevention_IntelligentCloudPlatform_AIAI-1.pdf
G. Retsinas, P. P. Filntisis, N. Efthymiou, E. Theodosis, A. Zlatintsi, P. Maragos, "Person Identification Using Deep Convolutional Neural Networks on Short-Term Signals from Wearable Sensors," in Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3657-3661, 2020.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/icassp2020_retsinas.pdf
P. P. Filntisis, N. Efthymiou, G. Potamianos, P. Maragos, "Emotion Understanding in Videos Through Body, Context, and Visual-Semantic Embedding Loss," in Proc. 16th European Conference on Computer Vision Workshops (ECCVW) - Workshop on Bodily Expressed Emotion Understanding, 2020.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/Emotion_understanding_in_videos_through_body__context__and_visual_semantic_embedding_loss-1.pdf
Code: https://github.com/filby89/NTUA-BEEU-eccv2020
2019
P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, "Fusing Body Posture With Facial Expressions for Joint Recognition of Affect in Child–Robot Interaction," IEEE Robotics and Automation Letters (with IROS option), vol. 4, no. 4, pp. 4011-4018, 2019, doi: 10.1109/LRA.2019.2930434.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2019_FilntisisEtAl_FuseBodyFace-AffectRecogn-ChildRobotInteract_ieeeRAL.pdf
Abstract: In this letter, we address the problem of multi-cue affect recognition in challenging scenarios such as child–robot interaction. Toward this goal we propose a method for automatic recognition of affect that leverages body expressions alongside facial ones, as opposed to traditional methods that typically focus only on the latter. Our deep-learning based method uses hierarchical multi-label annotations and multi-stage losses, can be trained both jointly and separately, and offers us computational models for both individual modalities, as well as for the whole body emotion. We evaluate our method on a challenging child–robot interaction database of emotional expressions collected by us, as well as on the GEneva multimodal emotion portrayal public database of acted emotions by adults, and show that the proposed method achieves significantly better results than facial-only expression baselines.
C. Garoufis, A. Zlatintsi, K. Kritsis, P. P. Filntisis, V. Katsouros, P. Maragos, "An Environment for Gestural Interaction with 3D Virtual Musical Instruments as an Educational Tool," in Proc. 27th European Signal Processing Conference (EUSIPCO-19), A Coruna, Spain, 2019.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/2019_GZKFKM_GestureInteractWithVirtualMusicInstrumentsForEducation_EUSIPCO-1-1.pdf
Poster: http://robotics.ntua.gr/wp-content/uploads/sites/2/GZKFKM_InteractionWithVRMIsForEducation_EUSIPCO19_poster.pdf
2018
N. Efthymiou, P. Koutras, P. P. Filntisis, G. Potamianos, P. Maragos, "Multi-View Fusion for Action Recognition in Child-Robot Interaction," in Proc. IEEE Int'l Conf. on Image Processing (ICIP), Athens, Greece, 2018.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/EfthymiouKoutrasFilntisis_MultiViewFusActRecognChildRobotInteract_ICIP18.pdf
Abstract: Answering the challenge of leveraging computer vision methods in order to enhance Human Robot Interaction (HRI) experience, this work explores methods that can expand the capabilities of an action recognition system in such tasks. A multi-view action recognition system is proposed for integration in HRI scenarios with special users, such as children, in which there is limited data for training and many state-of-the-art techniques face difficulties. Different feature extraction approaches, encoding methods and fusion techniques are combined and tested in order to create an efficient system that recognizes children pantomime actions. This effort culminates in the integration of a robotic platform and is evaluated under an alluring Children Robot Interaction scenario.
A. Zlatintsi, P. P. Filntisis, C. Garoufis, A. Tsiami, K. Kritsis, M. A. Kaliakatsos-Papakostas, A. Gkiokas, V. Katsouros, P. Maragos, "A Web-based Real-Time Kinect Application for Gestural Interaction with Virtual Musical Instruments," in Proc. Audio Mostly Conference (AM’18), Wrexham, North Wales, UK, 2018.
PDF: http://robotics.ntua.gr/wp-content/uploads/sites/2/ZlatintsiEtAl_WebBasedRealTimeKinectAppGestInteractVMI_ΑΜ18-1.pdf
A. Tsiami, P. Koutras, Niki Efthymiou, P. P. Filntisis, G. Potamianos, P. Maragos, "Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots," in Proc. IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 2018.
PDF: http://robotics.ntua.gr/wp-content/publications/2018_TsiamiEtAl_Multi3-MultisensorMultimodalChildInteractMultRobots_ICRA.pdf
Abstract: Child-robot interaction is an interdisciplinary research area that has been attracting growing interest, primarily focusing on edutainment applications. A crucial factor to the successful deployment and wide adoption of such applications remains the robust perception of the child's multimodal actions, when interacting with the robot in a natural and untethered fashion. Since robotic sensory and perception capabilities are platform-dependent and most often rather limited, we propose a multiple Kinect-based system to perceive the child-robot interaction scene that is robot-independent and suitable for indoors interaction scenarios. The audio-visual input from the Kinect sensors is fed into speech, gesture, and action recognition modules, appropriately developed in this paper to address the challenging nature of child-robot interaction. For this purpose, data from multiple children are collected and used for module training or adaptation. Further, information from the multiple sensors is fused to enhance module performance. The perception system is integrated in a modular multi-robot architecture demonstrating its flexibility and scalability with different robotic platforms. The whole system, called Multi3, is evaluated, both objectively at the module level and subjectively in its entirety, under appropriate child-robot interaction scenarios containing several carefully designed games between children and robots.
A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, "Far-Field Audio-Visual Scene Perception of Multi-Party Human-Robot Interaction for Children and Adults," in Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, 2018.
PDF: http://robotics.ntua.gr/wp-content/publications/2018_TsiamiEtAl_FarfieldAVperceptionHRI-ChildrenAdults_ICASSP.pdf
Abstract: Human-robot interaction (HRI) is a research area of growing interest with a multitude of applications for both children and adult user groups, as, for example, in edutainment and social robotics. Crucial, however, to its wider adoption remains the robust perception of HRI scenes in natural, untethered, and multi-party interaction scenarios, across user groups. Towards this goal, we investigate three focal HRI perception modules operating on data from multiple audio-visual sensors that observe the HRI scene from the far-field, thus bypassing limitations and platform-dependency of contemporary robotic sensing. In particular, the developed modules fuse intra- and/or inter-modality data streams to perform: (i) audio-visual speaker localization; (ii) distant speech recognition; and (iii) visual recognition of hand-gestures. Emphasis is also placed on ensuring high speech and gesture recognition rates for both children and adults. Development and objective evaluation of the three modules is conducted on a corpus of both user groups, collected by our far-field multi-sensory setup, for an interaction scenario of a question-answering "guess-the-object" collaborative HRI game with a "Furhat" robot. In addition, evaluation of the game incorporating the three developed modules is reported. Our results demonstrate robust far-field audio-visual perception of the multi-party HRI scene.
2017
Panagiotis Paraskevas Filntisis, Athanasios Katsamanis, Pirros Tsiakoulis, Petros Maragos, "Video-realistic expressive audio-visual speech synthesis for the Greek language," Speech Communication, vol. 95, pp. 137-152, 2017, ISSN: 0167-6393, doi: 10.1016/j.specom.2017.08.011.
PDF: http://robotics.ntua.gr/wp-content/uploads/publications/FilntisisKatsamanisTsiakoulis+_VideoRealExprAudioVisSpeechSynthGrLang_SC17.pdf
Abstract: High quality expressive speech synthesis has been a long-standing goal towards natural human-computer interaction. Generating a talking head which is both realistic and expressive appears to be a considerable challenge, due to both the high complexity in the acoustic and visual streams and the large non-discrete number of emotional states we would like the talking head to be able to express. In order to cover all the desired emotions, a significant amount of data is required, which poses an additional time-consuming data collection challenge. In this paper we attempt to address the aforementioned problems in an audio-visual context. Towards this goal, we propose two deep neural network (DNN) architectures for Video-realistic Expressive Audio-Visual Text-To-Speech synthesis (EAVTTS) and evaluate them by comparing them directly both to traditional hidden Markov model (HMM) based EAVTTS, as well as a concatenative unit selection EAVTTS approach, both on the realism and the expressiveness of the generated talking head. Next, we investigate adaptation and interpolation techniques to address the problem of covering the large emotional space. We use HMM interpolation in order to generate different levels of intensity for an emotion, as well as investigate whether it is possible to generate speech with intermediate speaking styles between two emotions. In addition, we employ HMM adaptation to adapt an HMM-based system to another emotion using only a limited amount of adaptation data from the target emotion. We performed an extensive experimental evaluation on a medium sized audio-visual corpus covering three emotions, namely anger, sadness and happiness, as well as neutral reading style. Our results show that DNN-based models outperform HMMs and unit selection on both the realism and expressiveness of the generated talking heads, while in terms of adaptation we can successfully adapt an audio-visual HMM set trained on a neutral speaking style database to a target emotion. Finally, we show that HMM interpolation can indeed generate different levels of intensity for EAVTTS by interpolating an emotion with the neutral reading style, as well as in some cases, generate audio-visual speech with intermediate expressions between two emotions.