OuluREPO – Oulun yliopiston julkaisuarkisto / University of Oulu repository

Audiovisual synchrony detection with optimized audio features

Sieranoja, Sami; Kinnunen, Tomi; Komulainen, Jukka; Hadid, Abdenour (2019-01-03)

 

URL:
https://doi.org/10.1109/SIPROCESS.2018.8600424

Publisher:
Institute of Electrical and Electronics Engineers

S. Sieranoja, M. Sahidullah, T. Kinnunen, J. Komulainen and A. Hadid, "Audiovisual Synchrony Detection with Optimized Audio Features," 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), Shenzhen, 2018, pp. 377-381, https://doi.org/10.1109/SIPROCESS.2018.8600424

https://rightsstatements.org/vocab/InC/1.0/
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2020041415345
Abstract

Audiovisual speech synchrony detection is an important part of talking-face verification systems. Prior work has focused primarily on visual features and joint-space models, while standard mel-frequency cepstral coefficients (MFCCs) have commonly been used to represent speech. We focus more closely on audio by studying the impact of the context window length used for delta feature computation and by comparing MFCCs with simpler energy-based features in lip-sync detection. For the visual side we select state-of-the-art hand-crafted lip-sync features, space-time auto-correlation of gradients (STACOG), and for joint-space modeling we use canonical correlation analysis (CCA). To enhance joint-space modeling, we also adopt deep CCA (DCCA), a nonlinear extension of CCA. Our results on the XM2VTS data indicate substantially enhanced audiovisual speech synchrony detection, with an equal error rate (EER) of 3.68%. Further analysis reveals that failed lip region localization and beardedness of the subjects account for most of the errors. Thus, the lip motion description is the bottleneck, while the use of novel audio features or joint-modeling techniques is unlikely to boost lip-sync detection accuracy further.
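
Two quantities named in the abstract, delta features over a context window and the equal error rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which delta formula was used, so the sketch assumes the widely used regression form, and the function names, the `half_window` parameter, and the toy scores are hypothetical. The EER routine is a simple threshold sweep that assumes higher scores mean "more synchronized".

```python
def delta_features(frames, half_window=2):
    """Regression-based deltas (an assumed, commonly used formula).

    frames: list of per-frame feature vectors (lists of floats).
    half_window: N, so the context spans 2*N + 1 frames.
    Edge frames are handled by repeating the first/last frame.
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, half_window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, half_window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def equal_error_rate(sync_scores, async_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    sync_scores: scores of genuinely synchronized clips (hypothetical data).
    async_scores: scores of desynchronized clips (hypothetical data).
    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two are closest.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(sync_scores) | set(async_scores)):
        far = sum(s >= thr for s in async_scores) / len(async_scores)
        frr = sum(s < thr for s in sync_scores) / len(sync_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


if __name__ == "__main__":
    # A linear ramp has slope 1 per frame away from the edges.
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(delta_features(ramp, half_window=1))
    print(delta_features(ramp, half_window=2))
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

Widening `half_window` smooths the delta estimate over more frames at the cost of temporal resolution, which is the trade-off the context window study is concerned with.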
