- (+30) 210-772-2964
- antsiami@cs.ntua.gr
- Office 2.2.19
Biosketch
I was born in Athens in 1988. I received my Diploma degree in Electrical and Computer Engineering from the National Technical University of Athens in September 2012. My diploma thesis was on the Automatic Recognition of Music Signals.
Since July 2013 I have been a PhD student in the CVSP Group, School of ECE, NTUA, under the supervision of Prof. Petros Maragos, working in the general area of Audio and Speech Processing and their applications. My research interests lie primarily in the fields of Multichannel Speech Enhancement, Microphone Array Processing, and Acoustic Source Localization.
Publications
2018
A. Tsiami, P. Koutras, N. Efthymiou, P. P. Filntisis, G. Potamianos, P. Maragos, "Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots", Proc. IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 2018.
[PDF] http://robotics.ntua.gr/wp-content/publications/2018_TsiamiEtAl_Multi3-MultisensorMultimodalChildInteractMultRobots_ICRA.pdf
Abstract: Child-robot interaction is an interdisciplinary research area that has been attracting growing interest, primarily focusing on edutainment applications. A crucial factor to the successful deployment and wide adoption of such applications remains the robust perception of the child's multimodal actions, when interacting with the robot in a natural and untethered fashion. Since robotic sensory and perception capabilities are platform-dependent and most often rather limited, we propose a multiple Kinect-based system to perceive the child-robot interaction scene that is robot-independent and suitable for indoors interaction scenarios. The audio-visual input from the Kinect sensors is fed into speech, gesture, and action recognition modules, appropriately developed in this paper to address the challenging nature of child-robot interaction. For this purpose, data from multiple children are collected and used for module training or adaptation. Further, information from the multiple sensors is fused to enhance module performance. The perception system is integrated in a modular multi-robot architecture demonstrating its flexibility and scalability with different robotic platforms. The whole system, called Multi3, is evaluated, both objectively at the module level and subjectively in its entirety, under appropriate child-robot interaction scenarios containing several carefully designed games between children and robots.
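To illustrate the kind of multi-sensor fusion described above, here is a minimal late-fusion sketch in Python. The weighted averaging of per-sensor class posteriors, the equal sensor weights, and the toy gesture scores are illustrative assumptions, not the actual fusion rule of the Multi3 system.

import numpy as np

def fuse_posteriors(per_sensor_posteriors, weights=None):
    # per_sensor_posteriors: array of shape (num_sensors, num_classes),
    # each row a probability distribution over the same label set.
    p = np.asarray(per_sensor_posteriors, dtype=float)
    if weights is None:
        weights = np.ones(p.shape[0]) / p.shape[0]  # equal sensor weights
    w = np.asarray(weights, dtype=float)
    fused = w @ p                       # weighted average over sensors
    return fused / fused.sum()          # renormalize to a distribution

# Toy usage: three Kinect streams scoring the same three gesture classes.
scores = [[0.6, 0.3, 0.1],
          [0.5, 0.4, 0.1],
          [0.2, 0.7, 0.1]]
print(fuse_posteriors(scores))  # argmax of the fused distribution gives the decision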
A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, "Far-Field Audio-Visual Scene Perception of Multi-Party Human-Robot Interaction for Children and Adults", Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, 2018.
[PDF] http://robotics.ntua.gr/wp-content/publications/2018_TsiamiEtAl_FarfieldAVperceptionHRI-ChildrenAdults_ICASSP.pdf
Abstract: Human-robot interaction (HRI) is a research area of growing interest with a multitude of applications for both children and adult user groups, as, for example, in edutainment and social robotics. Crucial, however, to its wider adoption remains the robust perception of HRI scenes in natural, untethered, and multi-party interaction scenarios, across user groups. Towards this goal, we investigate three focal HRI perception modules operating on data from multiple audio-visual sensors that observe the HRI scene from the far-field, thus bypassing limitations and platform-dependency of contemporary robotic sensing. In particular, the developed modules fuse intra- and/or inter-modality data streams to perform: (i) audio-visual speaker localization; (ii) distant speech recognition; and (iii) visual recognition of hand-gestures. Emphasis is also placed on ensuring high speech and gesture recognition rates for both children and adults. Development and objective evaluation of the three modules is conducted on a corpus of both user groups, collected by our far-field multi-sensory setup, for an interaction scenario of a question-answering "guess-the-object" collaborative HRI game with a "Furhat" robot. In addition, evaluation of the game incorporating the three developed modules is reported. Our results demonstrate robust far-field audio-visual perception of the multi-party HRI scene.
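As a rough illustration of audio-visual speaker localization, the sketch below fuses an acoustic direction-of-arrival likelihood with soft visual evidence from face detections on a common azimuth grid. The product fusion rule, the Gaussian visual model, and all numeric values are assumptions for illustration, not the module developed in the paper.

import numpy as np

AZIMUTHS = np.linspace(-90.0, 90.0, 181)   # 1-degree azimuth grid

def visual_likelihood(face_azimuths_deg, sigma_deg=5.0):
    # Soft evidence from detected faces: one Gaussian bump per detection,
    # with a small floor so the audio evidence can still dominate.
    lik = np.full_like(AZIMUTHS, 1e-3)
    for az in face_azimuths_deg:
        lik += np.exp(-0.5 * ((AZIMUTHS - az) / sigma_deg) ** 2)
    return lik / lik.sum()

def fuse_localization(audio_lik, face_azimuths_deg):
    # Multiply audio and visual likelihoods and return the peak azimuth.
    joint = audio_lik * visual_likelihood(face_azimuths_deg)
    return AZIMUTHS[np.argmax(joint)]

# Toy usage: broad acoustic likelihood around +20 deg, one face detected at +18 deg.
audio_lik = np.exp(-0.5 * ((AZIMUTHS - 20.0) / 15.0) ** 2)
audio_lik /= audio_lik.sum()
print(fuse_localization(audio_lik, face_azimuths_deg=[18.0]))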
2017
A. C. Dometios, A. Tsiami, A. Arvanitakis, P. Giannoulis, X. S. Papageorgiou, C. S. Tzafestas, P. Maragos, "Integrated Speech-based Perception System for User Adaptive Robot Motion Planning in Assistive Bath Scenarios", Proc. 25th European Signal Processing Conference (EUSIPCO), Workshop "MultiLearn 2017: Multimodal Processing, Modeling and Learning for Human-Computer/Robot Interaction Applications", Kos, Greece, 2017.
[PDF] http://www.eurasip.org/Proceedings/Eusipco/Eusipco2017/wpapers/ML5.pdf
Abstract: Elderly people have augmented needs in performing bathing activities, since these tasks require body flexibility. Our aim is to build an assistive robotic bath system in order to increase the independence and safety of this procedure. Towards this end, the expertise of professional carers for bathing sequences and appropriate motions has to be adopted in order to achieve natural, physical human-robot interaction. The integration of communication and verbal interaction between the user and the robot during the bathing tasks is a key issue for such a challenging assistive robotic application. In this paper, we tackle this challenge by developing a novel integrated real-time speech-based perception system, which will provide the necessary assistance to frail senior citizens. The system is suitable for installation and use in a conventional home or hospital bathroom space. We employ both a speech recognition system with sub-modules to achieve smooth and robust human-system communication, and a low-cost depth camera for end-effector motion planning. With a variety of spoken commands, the system can be adapted to the user's needs and preferences. The washing commands instructed by the user are executed by a robotic manipulator, demonstrating the progress of each task. The smooth integration of all subsystems is accomplished by a modular and hierarchical decision architecture organized as a Behavior Tree. The system was experimentally tested by successful execution of scenarios from different users with different preferences.
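The decision layer above is organized as a Behavior Tree; the sketch below shows a minimal Python version of that idea, with Sequence and Selector composites and stubbed-out speech and motion actions. The node structure, the node names, and the placeholder behaviors are illustrative, not the deployed system's architecture.

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Sequence:
    # Ticks children in order; fails as soon as one child fails.
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Selector:
    # Ticks children in order; succeeds as soon as one child succeeds.
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

class Action:
    # Leaf node wrapping a plain function that returns SUCCESS or FAILURE.
    def __init__(self, fn):
        self.fn = fn
    def tick(self):
        return self.fn()

# Placeholder behaviors standing in for the speech and motion subsystems.
def heard_wash_command():
    print("speech module: waiting for a washing command"); return SUCCESS
def prompt_user_again():
    print("speech module: asking the user to repeat"); return SUCCESS
def plan_washing_motion():
    print("planner: computing end-effector path"); return SUCCESS
def execute_motion():
    print("manipulator: executing washing motion"); return SUCCESS

# One decision cycle: understand the command (or re-prompt), then plan and act.
tree = Sequence(
    Selector(Action(heard_wash_command), Action(prompt_user_again)),
    Action(plan_washing_motion),
    Action(execute_motion),
)
tree.tick()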
2016
A. Tsiami, A. Katsamanis, P. Maragos, A. Vatakis, "Towards a Behaviorally-Validated Computational Audiovisual Saliency Model", Proc. IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 2847-2851. DOI: 10.1109/ICASSP.2016.7472197
[PDF] http://robotics.ntua.gr/wp-content/publications/TKMV_BehaviorComputAVSaliencyModel_ICASSP2016.pdf
Abstract: Computational saliency models aim at predicting, in a bottom-up fashion, where human attention is drawn in the presented (visual, auditory or audiovisual) scene and have been proven useful in applications like robotic navigation, image compression and movie summarization. Despite the fact that well-established auditory and visual saliency models have been validated in behavioral experiments, e.g., by means of eye-tracking, there is no established computational audiovisual saliency model validated in the same way. In this work, building on biologically-inspired models of visual and auditory saliency, we present a joint audiovisual saliency model and introduce the validation approach we follow to show that it is compatible with recent findings of psychology and neuroscience regarding multimodal integration and attention. In this direction, we initially focus on the "pip and pop" effect which has been observed in behavioral experiments and indicates that visual search in sequences of cluttered images can be significantly aided by properly timed non-spatial auditory signals presented alongside the target visual stimuli.
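A minimal sketch of late audiovisual saliency fusion, in the spirit of the "pip and pop" setting where a synchronous, non-spatial sound boosts visual saliency: the weighted, audio-modulated combination below is one common fusion choice and an assumption here, not necessarily the model proposed in the paper.

import numpy as np

def fuse_saliency(visual_map, audio_saliency, w_static=0.5, w_audio=0.5):
    # visual_map: 2-D array of visual saliency values in [0, 1]
    # audio_saliency: scalar auditory saliency of the current frame in [0, 1]
    v = np.asarray(visual_map, dtype=float)
    # The sound acts as a frame-level gain on the visual map (non-spatial audio),
    # added to an unmodulated visual term.
    fused = w_static * v + w_audio * audio_saliency * v
    peak = fused.max()
    return fused / peak if peak > 0 else fused

# Toy usage: one salient visual location, boosted when a salient sound co-occurs.
vis = np.zeros((4, 4))
vis[1, 2] = 1.0
print(fuse_saliency(vis, audio_saliency=0.8))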
2015
Z. I. Skordilis, A. Tsiami, P. Maragos, G. Potamianos, L. Spelgatti, R. Sannino, "Multichannel Speech Enhancement Using MEMS Microphones", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2-6. ISBN: 978-1-4673-6997-8, DOI: 10.1109/ICASSP.2015.7178467
[PDF] http://robotics.ntua.gr/wp-content/uploads/publications/SkorTsiamMarPotSpelSan_MEMS-MCSE_ICASSP2015.pdf
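For context, the most basic multichannel enhancement front-end for such a microphone array is delay-and-sum beamforming; the sketch below, with known integer sample delays and a toy pulse, only illustrates that baseline idea and is not the processing evaluated in the paper.

import numpy as np

def delay_and_sum(channels, delays_samples):
    # channels: (num_mics, num_samples) array;
    # delays_samples: per-channel delay in samples to remove so the target aligns.
    x = np.asarray(channels, dtype=float)
    out = np.zeros(x.shape[1])
    for ch, d in zip(x, delays_samples):
        out += np.roll(ch, -int(d))   # advance each channel (np.roll wraps; fine for this toy)
    return out / x.shape[0]           # average the aligned channels

# Toy usage: the same pulse arrives 0, 2 and 5 samples late at three microphones.
pulse = np.zeros(32)
pulse[10] = 1.0
mics = np.stack([np.roll(pulse, d) for d in (0, 2, 5)])
enhanced = delay_and_sum(mics, delays_samples=[0, 2, 5])
print(enhanced.argmax())  # peak is back at sample 10 after alignment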
P. Maragos, V. Pitsikalis, A. Katsamanis, N. Kardaris, E. Mavroudi, I. Rodomagoulakis, A. Tsiami, "Multimodal Sensory Processing for Human Action Recognition in Mobility Assistive Robotics", Proc. IROS 2015 Workshop on Cognitive Mobility Assistance Robots, Hamburg, Germany, Sep. 2015.
[PDF] MaragosEtAl_MultiSensoryHumanActionRecogn-Robotics_IROS2015-Workshop.pdf
2014
A. Tsiami, I. Rodomagoulakis, P. Giannoulis, A. Katsamanis, G. Potamianos, P. Maragos, "ATHENA: A Greek Multi-Sensory Database for Home Automation Control", Proc. Int'l Conf. on Speech Communication and Technology (INTERSPEECH), Singapore, 2014, pp. 1608-1612.
[PDF] http://robotics.ntua.gr/wp-content/publications/Tsiami+_AthenaDatabase_INTERSPEECH2014.pdf
Abstract: In this paper we present a Greek speech database with real multi-modal data in a smart home two-room environment. In total, 20 speakers were recorded in 240 one-minute long sessions. The recordings include utterances of activation keywords and commands for home automation control, but also phonetically rich sentences and conversational speech. Audio, speaker movements and gestures were captured by 20 condenser microphones installed on the walls and ceiling, 6 MEMS microphones, 2 close-talk microphones and one Kinect camera. The new publicly available database exhibits adverse noise conditions because of background noises and acoustic events performed during the recordings to better approximate a realistic everyday home scenario. Thus, it is suitable for experimentation on voice activity and event detection, source localization, speech enhancement and far-field speech recognition. We present the details of the corpus as well as baseline results on multi-channel voice activity detection and spoken command recognition.
A. Tsiami, A. Katsamanis, P. Maragos, G. Potamianos, "Experiments in Acoustic Source Localization Using Sparse Arrays in Adverse Indoors Environments", Proc. European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 2014, pp. 2390-2394.
[PDF] http://robotics.ntua.gr/wp-content/publications/Tsiami+_AcousticSourceLocalization_EUSIPCO2014.pdf
Abstract: In this paper we experiment with 2-D source localization in smart homes under adverse conditions using sparse distributed microphone arrays. We propose some improvements to deal with problems due to high reverberation, noise and use of a limited number of microphones. These consist of a pre-filtering stage for dereverberation and an iterative procedure that aims to increase accuracy. Experiments carried out in relatively large databases with both simulated and real recordings of sources in various positions indicate that the proposed method exhibits a better performance compared to others under challenging conditions while also being computationally efficient. It is demonstrated that although reverberation degrades localization performance, this degradation can be compensated by identifying the reliable microphone pairs and disposing of the outliers.
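The usual building block for this kind of 2-D localization is the GCC-PHAT time-delay estimate between a microphone pair; a standard sketch of it follows. The paper's actual contributions (dereverberation pre-filtering, the iterative refinement, and the selection of reliable microphone pairs) are not reproduced here.

import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=1):
    # Estimate the delay (in seconds) of `sig` relative to `ref` using the
    # phase transform weighting, which keeps only phase information.
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                       # PHAT weighting
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Toy usage: white noise delayed by 8 samples at the second microphone.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
sig = np.roll(ref, 8)
print(gcc_phat(sig, ref, fs) * fs)   # approximately 8 samples of delay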
A. Katsamanis, I. Rodomagoulakis, G. Potamianos, P. Maragos, A. Tsiami, "Robust Far-Field Spoken Command Recognition for Home Automation Combining Adaptation and Multichannel Processing", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 5547-5551. ISSN: 1520-6149, DOI: 10.1109/ICASSP.2014.6854664
[PDF] http://robotics.ntua.gr/wp-content/uploads/publications/KatsamanisEtAl_MultichannelASR_DIRHA_icassp2014.pdf
Abstract: The paper presents our approach to speech-controlled home automation. We are focusing on the detection and recognition of spoken commands preceded by a key-phrase as recorded in a voice-enabled apartment by a set of multiple microphones installed in the rooms. For both problems we investigate robust modeling, environmental adaptation and multichannel processing to cope with a) insufficient training data and b) the far-field effects and noise in the apartment. The proposed integrated scheme is evaluated in a challenging and highly realistic corpus of simulated audio recordings and achieves F-measure close to 0.70 for key-phrase spotting and word accuracy close to 98% for the command recognition task.
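The key-phrase-then-command logic can be sketched at the symbolic level as below, operating on an already-recognized word stream for simplicity. The activation phrase, the tiny command grammar, and the matching rule are placeholders, since the real system works on far-field multichannel audio with adapted acoustic models rather than on text.

KEY_PHRASE = ("hi", "system")                       # assumed activation phrase
COMMANDS = {("turn", "on", "the", "lights"): "LIGHTS_ON",
            ("close", "the", "blinds"): "BLINDS_CLOSE"}

def spot_and_parse(words):
    # Scan a word sequence; once the key-phrase is found, match the following
    # words against the command grammar and return the first action found.
    w = [t.lower() for t in words]
    for i in range(len(w) - len(KEY_PHRASE) + 1):
        if tuple(w[i:i + len(KEY_PHRASE)]) == KEY_PHRASE:
            tail = tuple(w[i + len(KEY_PHRASE):])
            for pattern, action in COMMANDS.items():
                if tail[:len(pattern)] == pattern:
                    return action
    return None

print(spot_and_parse("hi system turn on the lights please".split()))  # LIGHTS_ON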