2017
A. Zlatintsi, P. Koutras, G. Evangelopoulos, N. Malandrakis, N. Efthymiou, K. Pastra, A. Potamianos and P. Maragos, "COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization," EURASIP Journal on Image and Video Processing, 54, pp. 1–24, 2017. doi: 10.1186/s13640-017-0194.
PDF: http://robotics.ntua.gr/wp-content/publications/Zlatintsi+_COGNIMUSEdb_EURASIP_JIVP-2017.pdf

Abstract: Research related to computational modeling for machine-based understanding requires ground-truth data for training, content analysis, and evaluation. In this paper, we present COGNIMUSE, a multimodal video database annotated with sensory and semantic saliency, events, cross-media semantics, and emotion. The purpose of this database is manifold: it can be used for training and evaluating event detection and summarization algorithms, for classification and recognition of audio-visual and cross-media events, and for emotion tracking. To enable comparisons with other computational models, we propose state-of-the-art algorithms for detecting perceptually salient events in video, specifically a unified energy-based audio-visual framework and a method for text saliency computation. Additionally, a movie summarization system for the automatic production of summaries is presented. Two kinds of evaluation were performed: an objective evaluation based on the saliency annotation of the database, and an extensive qualitative human evaluation of the automatically produced summaries, in which we investigated what constitutes a high-quality movie summary. Both evaluations verified the appropriateness of the proposed methods. The annotation of the database and the code for the summarization system can be found at http://cognimuse.cs.ntua.gr/database.
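The paper's unified energy-based audio-visual framework is considerably richer than anything that fits here, but as a rough illustration of an energy-based audio cue, the sketch below computes a frame-level audio saliency curve with the Teager-Kaiser energy operator, a nonlinear energy operator this line of work builds on. The 40 ms frame length and min-max normalization are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate the edge values
    return psi

def audio_saliency_curve(signal, sr, frame_ms=40):
    """Frame-level audio saliency: mean Teager-Kaiser energy per frame,
    min-max normalized to [0, 1] (assumed settings, for illustration)."""
    x = np.asarray(signal, dtype=np.float64)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    energy = teager_kaiser_energy(x)
    frames = energy[: n_frames * frame_len].reshape(n_frames, frame_len)
    curve = frames.mean(axis=1)
    return (curve - curve.min()) / (curve.max() - curve.min() + 1e-12)
```

A visual stream could be treated analogously (e.g., mean spatiotemporal saliency per frame), after which per-modality curves can be fused as in the 2013 paper below.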
2013
Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas and Yannis Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553–1568, 2013. doi: 10.1109/TMM.2013.2267205.
PDF: http://robotics.ntua.gr/wp-content/uploads/publications/EZPMRSA_MultimodalSaliencyFusionMovieSumAVTattention_ieeetMM13.pdf

Abstract: Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging of the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
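The paper evaluates several fusion schemes (linear and nonlinear, with fixed and adaptive weights); purely as a sketch of the simplest linear case, the code below fuses pre-computed, time-aligned per-modality saliency curves into a single multimodal curve and selects the top-saliency segments for a summary. The equal weights, the 20% skim ratio, and the minimum-run heuristic are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def fuse_saliency(streams, weights=None):
    """Linear fusion of per-modality saliency curves (e.g. aural, visual,
    textual), each sampled on the same frame grid and scaled to [0, 1]."""
    S = np.vstack(streams)                        # (n_modalities, n_frames)
    w = (np.full(S.shape[0], 1.0 / S.shape[0])    # equal weights by default
         if weights is None else np.asarray(weights, dtype=float))
    return w @ S                                  # multimodal saliency curve

def select_summary(curve, ratio=0.2, min_run=25):
    """Keep the top `ratio` fraction of frames by saliency, then drop runs
    shorter than `min_run` frames to avoid jittery one-frame selections.
    Returns a boolean mask over frames."""
    keep = curve >= np.quantile(curve, 1.0 - ratio)
    flags = np.concatenate(([0], keep.astype(np.int8), [0]))
    edges = np.flatnonzero(np.diff(flags))        # alternating run starts/ends
    for start, stop in zip(edges[::2], edges[1::2]):
        if stop - start < min_run:
            keep[start:stop] = False
    return keep
```

The boolean mask maps directly back to frame timestamps, so the selected runs can be cut and concatenated into the skim.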
2007
Konstantinos Rapantzikos, Georgios Evangelopoulos, Petros Maragos and Yannis Avrithis, "An audio-visual saliency model for movie summarization," in Proceedings of the 2007 IEEE 9th International Workshop on Multimedia Signal Processing (MMSP 2007), pp. 320–323, 2007. doi: 10.1109/MMSP.2007.4412882.
PDF: http://robotics.ntua.gr/wp-content/uploads/publications/RapantzikosEvangelopoulosMaragosEtAl_AVSaliencyDetectionSummarization_mmsp07.pdf

Abstract: A saliency-based method for generating video summaries is presented, which exploits coupled audiovisual information from both media streams. Efficient speech and image processing algorithms are used to detect key frames that are acoustically and visually salient. Promising results are shown from experiments on a movie database.
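The abstract does not spell out the exact selection rule, so the following is only a plausible sketch of picking key frames that are "acoustically and visually salient": frames are ranked by the product of the two time-aligned, [0, 1]-scaled saliency curves, which rewards frames scoring well in both modalities. The product combination and the top-k cutoff are assumptions, not the paper's method.

```python
import numpy as np

def audiovisual_key_frames(audio_sal, visual_sal, k=10):
    """Rank frames by the product of audio and visual saliency, so a key
    frame must be salient in *both* streams; return the indices of the
    top-k frames in temporal order."""
    av = np.asarray(audio_sal) * np.asarray(visual_sal)
    top = np.argpartition(av, -k)[-k:]   # unordered top-k by combined score
    return np.sort(top)                  # restore temporal order
```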
Copyright Notice:
Some material presented is available for download to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Work already published by the IEEE is under IEEE copyright. Personal use of such material is permitted. However, permission to reprint or republish the material for advertising or promotional purposes, for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of the work in other works must be obtained from the IEEE.