OuluREPO – University of Oulu repository
Understanding the impact of misaligned data on vision-language representation learning

Bashir, Suhail (2025-06-12)

 
Files
nbnfioulu-202506124429.pdf (6.593 MB)
nbnfioulu-202506124429_mods.xml (11.91 kB)
nbnfioulu-202506124429_pdfa_report.xml (400.0 kB)
© 2025, Suhail Bashir. This item is protected by copyright and/or related rights. You may use the item in the ways permitted by the copyright and related-rights legislation that applies to your use. For other uses, you need permission from the rights holders.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:oulu-202506124429
Abstract
Recent advancements in deep learning have led to a shift from modality-specific architectures, such as Convolutional Neural Networks (CNNs) for images and Recurrent Neural Networks (RNNs) for text, towards increasingly multimodal models. Vision-Language Models (VLMs) represent a significant advancement in this direction as they are trained on paired image-text data to learn rich, joint visual-linguistic representations. Foundational VLMs are trained on extensive web-scale datasets, enabling them to generalize across a wide range of vision-and-language tasks. These models demonstrate impressive performance in zero-shot, few-shot, and transfer learning settings, often achieving results comparable to fully supervised approaches. However, a significant drawback of such large-scale datasets is the inherent issue of misalignment between image-text pairs, in which portions of the image are not described by the corresponding text, and vice versa, making them a sub-optimal source of supervision. Traditional methods try to address this problem of misalignment by discarding the poorly aligned data and selecting only high-quality pairs. While effective, this approach is resource-intensive and can even be impractical in some applications, such as medical imaging and Earth observation, where the acquisition of data is challenging and the data is inherently misaligned, as each sensor is designed to capture distinct and complementary information.
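
To make the supervision signal concrete, the sketch below shows the symmetric contrastive (InfoNCE) objective that CLIP-style models typically optimise on batches of paired image-text embeddings. Misalignment undermines exactly this objective, because the diagonal of the batch similarity matrix is assumed to contain true matches. This is an illustrative PyTorch sketch with variable names of our own choosing, not code from the thesis.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the image and text encoders."""
    # L2-normalise so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal is assumed to hold the
    # true image-text matches, which is exactly the assumption misalignment violates.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 random 512-dimensional embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Under heavy misalignment many of these "positive" diagonal pairs are in fact unrelated, which gives one intuition for why a purely contrastive objective degrades while objectives with additional self-supervised terms can remain more robust.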

The aim of this thesis is to investigate the impact of image-text misalignment on state-of-the-art VLMs such as Contrastive Language-Image Pre-training (CLIP), Self-supervision with Images and Language Pre-training (SLIP), and Contrastive MultiModal (CoMM). To this end, we construct data splits with varying and systematically controlled levels of image-text alignment. We train different VLMs on these data splits and evaluate their performance across a range of downstream multimodal and vision tasks. Our findings indicate that CLIP performs poorly when trained on misaligned data, even at large scales (e.g., over one million samples). In contrast, recent models that incorporate additional self-supervised constraints in their pre-training objectives, such as CoMM, are more robust to this problem of misalignment in the pre-training data.
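
The following is a minimal sketch of one way such controlled misalignment could be injected into a paired dataset: a chosen fraction of captions is reassigned to other images, so the alignment level of a split is governed by a single rate parameter. The function name and the caption-rotation scheme are illustrative assumptions of ours, not the exact protocol used in the thesis.

```python
import random
from typing import List, Tuple

def inject_misalignment(pairs: List[Tuple[str, str]],
                        misalignment_rate: float,
                        seed: int = 0) -> List[Tuple[str, str]]:
    """pairs: list of (image_path, caption). Returns a copy in which roughly
    `misalignment_rate` of the captions no longer describe their image."""
    rng = random.Random(seed)
    pairs = list(pairs)
    n_misaligned = int(len(pairs) * misalignment_rate)
    chosen = rng.sample(range(len(pairs)), n_misaligned)

    # Rotate the selected captions by one position so each chosen image gets a
    # caption that belonged to a different image (a no-op if fewer than two
    # pairs are selected).
    captions = [pairs[i][1] for i in chosen]
    captions = captions[1:] + captions[:1]
    for i, caption in zip(chosen, captions):
        pairs[i] = (pairs[i][0], caption)
    return pairs

# Toy usage: a split in which half of the captions are mismatched.
split = inject_misalignment(
    [("img0.jpg", "a dog"), ("img1.jpg", "a cat"),
     ("img2.jpg", "a car"), ("img3.jpg", "a tree")],
    misalignment_rate=0.5,
)
```

Training the same model on splits generated at several rates (e.g. 0 %, 25 %, 50 %) and comparing downstream performance is one straightforward way to realise the "systematically controlled levels of alignment" described above.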
Collections
  • Open access [38841]