AffectVLM: Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D Facial Expression Recognition Using Vision-Language Model
Behzad, Muzammil; Zhao, Guoying (2025-08-06)
IEEE
M. Behzad and G. Zhao, "AffectVLM: Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D Facial Expression Recognition Using Vision-Language Model," 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), Tampa/Clearwater, FL, USA, 2025, pp. 1-6, doi: 10.1109/FG61629.2025.11099228
https://rightsstatements.org/vocab/InC/1.0/
© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202604232748
Abstract
In this paper, we introduce AffectVLM, a vision-language model that integrates multiple views for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To capture visual features effectively, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards an optimal feature representation. Additionally, we introduce augmented textual prompts to enhance the model’s linguistic capabilities and employ mixed view augmentation to expand the visual dataset. We also develop a Streamlit app for real-time interactive inference and add support for distributed learning. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.
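
As background for the contrastive language-image learning named in the title, the following is a minimal sketch of the generic symmetric contrastive (InfoNCE) objective used by CLIP-style models. It is not the gradient-friendly loss proposed in the paper, which the abstract does not specify. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are assumptions made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's loss).

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should assign high probability to column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign high probability to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings (hypothetical): two images and two text prompts.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy case the loss is close to zero when each image embedding points in the same direction as its paired text embedding, and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.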
Collections
- Open access [42834]
