Towards behaviour-aware multimodal video summarization: integrating visual, audio, and textual cues for human-centric content analysis
Islam, Md Moinul (2025-06-12)
© 2025 Md Moinul Islam. Unless otherwise stated, reuse is permitted under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license (https://creativecommons.org/licenses/by/4.0/). Reuse is permitted provided that the source is appropriately credited and any changes are indicated. The use or reproduction of parts that are not the property of the author or authors may require permission directly from the respective rights holders.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202506124422
Abstract
Video summarization models for human-centric content struggle to capture the behavioural cues that convey essential interpersonal communication signals: existing approaches either rely solely on visual features or employ basic multimodal combinations that neglect synchronized behavioural signals across modalities. This thesis proposes a behaviour-aware multimodal video summarization pipeline that integrates visual, audio, and textual modalities to identify moments of communicative significance through synchronized behavioural signals, including gestures, facial expressions, and vocal prosody. The pipeline employs two complementary approaches: a transformer-based framework with cross-modal attention mechanisms, and a heuristic approach built on multimodal bonus words (terms emphasized across multiple modalities), which improve the semantic relevance and expressive clarity of summaries. To address dataset scarcity, a scalable pseudo-ground truth generation pipeline using large language models (LLMs) enables automated evaluation without human annotations. Experimental validation on the ChaLearn First Impressions dataset demonstrates substantial performance improvements over existing state-of-the-art methods across multiple evaluation metrics. Comprehensive analysis confirms that multimodal integration significantly outperforms individual modalities, while temporal alignment emerges as crucial for maintaining narrative coherence and summary quality. This work contributes a novel methodology for preserving both semantic content and behavioural expressiveness in video summarization, enabling applications in educational technology, content analysis, and human-computer interaction.
Collections
- Open access [42971]

