2014
Georgios Pavlakos, Stavros Theodorakis, Vassilis Pitsikalis, Athanasios Katsamanis, Petros Maragos: "Kinect-based multimodal gesture recognition using a two-pass fusion scheme". Conference paper, 2014 IEEE International Conference on Image Processing (ICIP 2014), pp. 1495-1499, 2014. ISBN: 9781479957514. DOI: 10.1109/ICIP.2014.7025299.
Links: [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/PTPΚΜ_MultimodalGestureRecogn2PassFusion_ICIP2014.pdf

Abstract: We present a new framework for multimodal gesture recognition that is based on a two-pass fusion scheme. In this, we deal with a demanding Kinect-based multimodal dataset, which was introduced in a recent gesture recognition challenge. We employ multiple modalities, i.e., visual cues, such as colour and depth images, as well as audio, and we specifically extract feature descriptors of the hands' movement, handshape, and audio spectral properties. Based on these features, we statistically train separate unimodal gesture-word models, namely hidden Markov models, explicitly accounting for the dynamics of each modality. Multimodal recognition of unknown gesture sequences is achieved by combining these models in a late, two-pass fusion scheme that exploits a set of unimodally generated n-best recognition hypotheses. The proposed scheme achieves 88.2% gesture recognition accuracy in the Kinect-based multimodal dataset, outperforming all recently published approaches on the same challenging multimodal gesture recognition task.

BibTeX:
@conference{165,
  title     = {Kinect-based multimodal gesture recognition using a two-pass fusion scheme},
  author    = {Georgios Pavlakos and Stavros Theodorakis and Vassilis Pitsikalis and Athanasios Katsamanis and Petros Maragos},
  url       = {http://robotics.ntua.gr/wp-content/uploads/publications/PTPΚΜ_MultimodalGestureRecogn2PassFusion_ICIP2014.pdf},
  doi       = {10.1109/ICIP.2014.7025299},
  isbn      = {9781479957514},
  year      = {2014},
  date      = {2014-01-01},
  booktitle = {2014 IEEE International Conference on Image Processing, ICIP 2014},
  pages     = {1495--1499},
  pubstate  = {published},
  tppubtype = {conference}
}
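As a rough illustration of the late-fusion idea described in the abstract, and not the authors' actual system, the following Python sketch pools the n-best hypotheses produced independently by each unimodal recognizer and rescores every candidate gesture-word sequence with a weighted sum of per-stream log scores. All names, weights, scores, and the back-off rule are hypothetical:

# Hypothetical sketch of late fusion over unimodal n-best hypotheses.
# Names (Hypothesis, fuse_nbest, stream weights) are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: tuple          # recognized gesture-word sequence
    log_score: float      # unimodal log score of that sequence

def fuse_nbest(nbest_per_stream, stream_weights):
    """First pass: pool the hypotheses proposed by each unimodal recognizer and
    rescore every candidate sequence with a weighted sum of per-stream log scores."""
    candidates = {hyp.words for nbest in nbest_per_stream.values() for hyp in nbest}
    fused = {}
    for words in candidates:
        score = 0.0
        for stream, nbest in nbest_per_stream.items():
            stream_scores = {h.words: h.log_score for h in nbest}
            # If a stream never proposed this sequence, back off to its worst score.
            score += stream_weights[stream] * stream_scores.get(
                words, min(stream_scores.values()))
        fused[words] = score
    # Return candidates ranked by fused score; a second pass could then re-segment
    # and rescore the top candidates using all modalities jointly.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up scores:
nbest = {
    "skeleton": [Hypothesis(("OK", "STOP"), -12.3), Hypothesis(("OK", "COME"), -14.1)],
    "audio":    [Hypothesis(("OK", "STOP"), -20.5), Hypothesis(("GO", "STOP"), -22.0)],
}
print(fuse_nbest(nbest, {"skeleton": 0.6, "audio": 0.4})[0])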
2009
Athanassios Katsamanis, George Papandreou, Petros Maragos: "Face active appearance modeling and speech acoustic information to recover articulation". Journal article, IEEE Transactions on Audio, Speech and Language Processing, 17(3), pp. 411-422, 2009. ISSN: 1558-7916. DOI: 10.1109/TASL.2008.2008740.
Links: [PDF] http://robotics.ntua.gr/wp-content/uploads/publications/KatsamanisPapandreouMaragos_AudiovisualSpeechInversion_ieee-j-aslp09.pdf

Abstract: We are interested in recovering aspects of the vocal tract's geometry and dynamics from speech, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed since the same speech acoustics can be produced by multiple articulatory configurations. To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme which also exploits visual information from the speaker's face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. Model switching is governed by a Markovian discrete process which captures articulatory dynamic information. Each constituent linear mapping is effectively estimated via canonical correlation analysis. In the described multimodal context, we investigate alternative fusion schemes which allow interaction between the audio and visual modalities at various synchronization levels. For facial analysis, we employ active appearance models (AAMs) and demonstrate fully automatic face tracking and visual feature extraction. Using the AAM features in conjunction with audio features such as Mel frequency cepstral coefficients (MFCCs) or line spectral frequencies (LSFs) leads to effective estimation of the trajectories followed by certain points of interest in the speech production system. We report experiments on the QSMT and MOCHA databases which contain audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both audio and visual modalities in a multistream hidden Markov model based scheme clearly improves performance relative to either audio-only or visual-only estimation.

BibTeX:
@article{130,
  title     = {Face active appearance modeling and speech acoustic information to recover articulation},
  author    = {Athanassios Katsamanis and George Papandreou and Petros Maragos},
  url       = {http://robotics.ntua.gr/wp-content/uploads/publications/KatsamanisPapandreouMaragos_AudiovisualSpeechInversion_ieee-j-aslp09.pdf},
  doi       = {10.1109/TASL.2008.2008740},
  issn      = {1558-7916},
  year      = {2009},
  date      = {2009-01-01},
  journal   = {IEEE Transactions on Audio, Speech and Language Processing},
  volume    = {17},
  number    = {3},
  pages     = {411--422},
  pubstate  = {published},
  tppubtype = {article}
}
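To make the piecewise-linear inversion idea concrete, here is a minimal, self-contained Python sketch on synthetic data. It is not the authors' implementation: ordinary least squares stands in for the canonical-correlation-based estimation, a nearest-centroid rule stands in for the Markovian regime switching, and all dimensions, names, and data are made up:

# Minimal sketch of piecewise-linear audiovisual-to-articulatory inversion.
# OLS replaces CCA and a nearest-centroid rule replaces Markov switching,
# purely for illustration on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
T, d_av, d_art, n_regimes = 500, 20, 6, 3

# Synthetic audiovisual features (stand-ins for MFCCs + AAM parameters)
# and articulatory targets generated from known regime-specific linear maps.
X = rng.standard_normal((T, d_av))
regimes = rng.integers(0, n_regimes, size=T)          # assumed known here; learned in practice
true_maps = rng.standard_normal((n_regimes, d_av, d_art))
Y = np.stack([X[t] @ true_maps[regimes[t]] for t in range(T)])
Y += 0.05 * rng.standard_normal((T, d_art))

# Training: one least-squares linear map per regime, plus a feature centroid
# used later to decide which regime (and hence which map) a frame belongs to.
maps, centroids = [], []
for r in range(n_regimes):
    idx = regimes == r
    W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    maps.append(W)
    centroids.append(X[idx].mean(axis=0))

# Inversion: pick the closest regime per frame and apply its linear map.
def invert(x_frame):
    r = int(np.argmin([np.linalg.norm(x_frame - c) for c in centroids]))
    return x_frame @ maps[r]

Y_hat = np.stack([invert(x) for x in X])
print("RMS articulatory error:", np.sqrt(np.mean((Y_hat - Y) ** 2)))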
George Papandreou, Athanassios Katsamanis, Vassilis Pitsikalis, Petros Maragos: "Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition". Journal article, IEEE Transactions on Audio, Speech and Language Processing, 17(3), pp. 423-435, 2009. ISSN: 1558-7916. DOI: 10.1109/TASL.2008.2011515.
Links: [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/PapandreouKatsamanisPitsikalisMaragos_MultimodalFusionUncertaintyCompensationAvasr_ieee-j-aslp09.pdf ; [Scopus] http://www.scopus.com/inward/record.url?eid=2-s2.0-44949227080&partnerID=40&md5=6edf7efa047e4239c0ea003cf525bf63

Abstract: While the accuracy of feature measurements heavily depends on changing environmental conditions, studying the consequences of this fact in pattern recognition tasks has received relatively little attention to date. In this paper, we explicitly take feature measurement uncertainty into account and show how multimodal classification and learning rules should be adjusted to compensate for its effects. Our approach is particularly fruitful in multimodal fusion scenarios, such as audiovisual speech recognition, where multiple streams of complementary time-evolving features are integrated. For such applications, provided that the measurement noise uncertainty for each feature stream can be estimated, the proposed framework leads to highly adaptive multimodal fusion rules which are easy and efficient to implement. Our technique is widely applicable and can be transparently integrated with either synchronous or asynchronous multimodal sequence integration architectures. We further show that multimodal fusion methods relying on stream weights can naturally emerge from our scheme under certain assumptions; this connection provides valuable insights into the adaptivity properties of our multimodal uncertainty compensation approach. We show how these ideas can be practically applied for audiovisual speech recognition. In this context, we propose improved techniques for person-independent visual feature extraction and uncertainty estimation with active appearance models, and also discuss how enhanced audio features along with their uncertainty estimates can be effectively computed. We demonstrate the efficacy of our approach in audiovisual speech recognition experiments on the CUAVE database using either synchronous or asynchronous multimodal integration models.

BibTeX:
@article{131,
  title     = {Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition},
  author    = {George Papandreou and Athanassios Katsamanis and Vassilis Pitsikalis and Petros Maragos},
  url       = {http://robotics.ntua.gr/wp-content/uploads/sites/2/PapandreouKatsamanisPitsikalisMaragos_MultimodalFusionUncertaintyCompensationAvasr_ieee-j-aslp09.pdf},
  doi       = {10.1109/TASL.2008.2011515},
  issn      = {1558-7916},
  year      = {2009},
  date      = {2009-01-01},
  journal   = {IEEE Transactions on Audio, Speech and Language Processing},
  volume    = {17},
  number    = {3},
  pages     = {423--435},
  pubstate  = {published},
  tppubtype = {article}
}
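The core of uncertainty compensation for Gaussian models is that an observation measured with noise variance s is scored against N(x; mu, var + s) rather than N(x; mu, var), so streams with unreliable measurements are automatically discounted. The Python sketch below illustrates this for a simple static, diagonal-covariance classifier with two streams; it is a hedged illustration under assumptions of my own (single Gaussians per class, made-up names and numbers), not the paper's full HMM-based framework:

# Hedged sketch of uncertainty-compensated Gaussian scoring and stream fusion.
# Assumes diagonal covariances and single Gaussians per class; all names and
# numbers are illustrative, not taken from the paper.
import numpy as np

def log_gauss_diag(x, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def fused_class_scores(streams, class_models):
    """streams: {name: (observation, measurement_noise_variance)}
    class_models: {class: {name: (mean, model_variance)}}
    Returns uncertainty-compensated log scores summed over streams."""
    scores = {}
    for c, per_stream in class_models.items():
        total = 0.0
        for name, (x, noise_var) in streams.items():
            mu, var = per_stream[name]
            total += log_gauss_diag(x, mu, var + noise_var)  # compensation step
        scores[c] = total
    return scores

# Toy audiovisual example: the audio stream is very noisy, so its misleading
# evidence is discounted and the clean visual stream dominates the decision.
models = {
    "yes": {"audio": (np.array([1.0]), np.array([0.5])), "video": (np.array([2.0]), np.array([0.5]))},
    "no":  {"audio": (np.array([-1.0]), np.array([0.5])), "video": (np.array([-2.0]), np.array([0.5]))},
}
obs = {"audio": (np.array([-0.8]), np.array([4.0])),   # noisy audio hints at "no"
       "video": (np.array([1.9]), np.array([0.1]))}    # clean video hints at "yes"
scores = fused_class_scores(obs, models)
print(max(scores, key=scores.get), scores)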
Copyright Notice:
Some material presented is available for download to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Work already published by the IEEE is covered by IEEE copyright. Personal use of this material is permitted. However, permission to reprint or republish this material for advertising or promotional purposes, to create new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of the work in other works must be obtained from the IEEE.