Depression recognition from facial videos: Preprocessing and scheduling choices hide the architectural contributions
Lage Cañellas, Manuel; Álvarez Casado, Constantino; Nguyen, Le; Bordallo López, Miguel (2023-10-22)
Institution of Engineering and Technology
Lage Cañellas, M., Álvarez Casado, C., Nguyen, L. and Bordallo López, M. (2023), Depression recognition from facial videos: Preprocessing and scheduling choices hide the architectural contributions. Electron. Lett., 59: e12992. https://doi.org/10.1049/ell2.12992.
https://creativecommons.org/licenses/by-nc-nd/4.0/
© 2023 The Authors. Electronics Letters published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202311293418
Abstract
Deep learning models have been widely applied in video-based depression detection. However, the diversity of preprocessing, data augmentation, and optimization techniques makes it difficult to compare model architectures fairly. In this study, a standard ResNet-50 model is enhanced with specific face alignment methods and improved data augmentation, optimization, and scheduling techniques. Extensive experiments on two popular benchmark datasets (AVEC2013 and AVEC2014) yielded single-stream results that are competitive with sophisticated spatio-temporal models. Moreover, a score-level fusion approach based on two texture streams outperformed the state-of-the-art methods, achieving mean square errors of 5.82 and 5.50 on AVEC2013 and AVEC2014, respectively. These findings suggest that preprocessing and training configurations account for noticeable improvements that were originally attributed to the network architectures.
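The abstract mentions score-level fusion of two texture streams but does not state the fusion rule. As a minimal sketch of what score-level fusion can look like, the Python below assumes each stream has already produced one depression score per video and combines them with a weighted average; the equal weighting, the function names, and the numeric values are illustrative assumptions, not the authors' method or data. The two error functions are the ones commonly reported on the AVEC benchmarks.

```python
from math import sqrt
from statistics import fmean


def fuse_scores(stream_a, stream_b, weight_a=0.5):
    """Weighted average of per-video scores from two streams.

    The weighted-average rule and the default weight of 0.5 are
    assumptions for illustration; the abstract does not specify them.
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("both streams must score the same videos")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(stream_a, stream_b)]


def mean_absolute_error(predicted, target):
    """Average absolute difference between predictions and labels."""
    return fmean(abs(p - t) for p, t in zip(predicted, target))


def root_mean_square_error(predicted, target):
    """Square root of the average squared difference."""
    return sqrt(fmean((p - t) ** 2 for p, t in zip(predicted, target)))


if __name__ == "__main__":
    # Made-up scores for two videos; not taken from the paper.
    stream_a = [10.0, 20.0]
    stream_b = [14.0, 22.0]
    labels = [11.0, 23.0]

    fused = fuse_scores(stream_a, stream_b)
    print("fused:", fused)
    print("MAE:", mean_absolute_error(fused, labels))
    print("RMSE:", root_mean_square_error(fused, labels))
```

Fusing at the score level keeps the two streams independent during training, so each can use its own preprocessing and schedule, and only the final per-video predictions need to be aligned.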