Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Youngtaek Oh1     Jae Won Cho2     Dong-Jin Kim3     In So Kweon1*     Junmo Kim1*
1KAIST         2Sejong University         3Hanyang University
(*Corresponding authors)
EMNLP 2024 (Long, Main)

Image-to-text retrieval examples on COCO-Counterfactuals. Our method consistently retrieves the correct captions over the negatives, demonstrating superior compositional reasoning in retrieval.

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks.

Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of a global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original ones, damaging the model's multi-modal representations.

To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity.

Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.

Introduction

Compositional reasoning remains a challenge for vision-language models (VLMs) like CLIP, which struggle to understand complex and fine-grained relationships between images and text. Current fine-tuning methods aimed at improving compositionality often reduce performance in multi-modal tasks. This trade-off is mainly due to global hard negative loss applied to single vector representations, which fails to capture subtle differences between similar texts.

Holistic comparison of fine-tuning methods. In previous models, enhancing compositional reasoning often degrades multi-modal task performance. Our $\texttt{FSC-CLIP}$ bridges this gap, minimizing these trade-offs.

The models are evaluated across 11 compositionality tasks, 21 zero-shot classification tasks, and 3 image-text retrieval tasks.

To overcome this limitation, we introduce $\texttt{FSC-CLIP}$, a new fine-tuning method for CLIP designed to enhance compositional reasoning without sacrificing multi-modal task performance. By incorporating local hard negative loss and selective calibrated regularization, our approach provides fine-grained supervision while preserving the integrity of multi-modal representations.

Method

Our method is designed to fine-tune CLIP using hard negative captions, incorporating Local Hard Negative (LHN) Loss and Selective Calibrated Regularization (SCR) to improve compositional understanding while preserving multi-modal performance.

Overall $\texttt{FSC-CLIP}$ framework. It integrates Local Hard Negative (LHN) Loss and Selective Calibrated Regularization (SCR), in addition to a global hard negative loss. The LHN loss captures similarities between image patches and text tokens, enabling finer differentiation between original and hard negative texts. SCR combines focal loss with label smoothing, effectively reducing the negative impact of hard negative losses.

Local Hard Negative (LHN) Loss

  • The goal is to improve fine-grained image-text alignment by focusing on token-patch level representations. This approach captures subtle differences between the original and hard negative (HN) texts that are often missed when using global similarity alone.
  • LHN loss computes local similarities between image patches and text tokens and aggregates them at the token level, sharpening the model's ability to differentiate original from hard negative texts and making the hard negative loss more effective (see the sketch below).
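
Below is a minimal PyTorch sketch of how such a local hard negative loss could be computed. The max-over-patches, mean-over-tokens pooling, the temperature value, and the function names are illustrative assumptions, not necessarily the exact formulation used in FSC-CLIP.

```python
import torch
import torch.nn.functional as F

def local_similarity(patch_emb, token_emb, token_mask):
    """Token-patch alignment score between one image and one caption (sketch).

    patch_emb: (P, D) image patch embeddings; token_emb: (T, D) text token
    embeddings; token_mask: (T,) with 1 for real tokens and 0 for padding.
    Assumed pooling: each token takes its best-matching patch (max over
    patches), then scores are averaged over valid tokens.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)
    sim = token_emb @ patch_emb.t()              # (T, P) cosine similarities
    per_token = sim.max(dim=-1).values           # best patch per token
    return (per_token * token_mask).sum() / token_mask.sum().clamp(min=1)

def lhn_loss(patch_emb, pos_tokens, pos_mask, neg_tokens, neg_mask, temperature=0.01):
    """Local HN loss for one (image, positive caption, hard negative caption)
    triple: the positive caption should score higher than its hard negative."""
    s_pos = local_similarity(patch_emb, pos_tokens, pos_mask) / temperature
    s_neg = local_similarity(patch_emb, neg_tokens, neg_mask) / temperature
    logits = torch.stack([s_pos, s_neg]).unsqueeze(0)   # (1, 2)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```
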
Selective Calibrated Regularization (SCR)

  • Hard negative texts are often too similar to original texts, potentially degrading representations through HN loss. SCR addresses this by selectively focusing on challenging HN samples and adjusting label assignments to account for the potential positiveness in hard negative texts.
  • SCR combines focal loss, which emphasizes harder-to-classify HN samples, with label smoothing, which assigns a slight positive margin to HN texts to account for their potential correctness. This mitigates the adverse effects of HN losses, preserving the model's representational integrity while still strengthening compositional learning (see the sketch below).
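
As a rough illustration, the sketch below combines focal re-weighting and label smoothing over positive-vs-hard-negative logits. The focusing parameter gamma and smoothing value eps are placeholders, not the settings reported in the paper.

```python
import torch
import torch.nn.functional as F

def scr_hn_loss(logits, gamma=2.0, eps=0.1):
    """Hard negative loss with Selective Calibrated Regularization (sketch).

    logits: (B, 2) scores for [original caption, hard negative caption].
    The focal term down-weights pairs the model already separates confidently,
    concentrating the loss on challenging hard negatives; label smoothing gives
    the HN caption a small positive mass, reflecting that it may be partly correct.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed targets: most of the mass on the original caption, eps on the HN one.
    targets = torch.stack([torch.full_like(logits[:, 0], 1.0 - eps),
                           torch.full_like(logits[:, 1], eps)], dim=-1)
    p_true = log_probs[:, 0].exp()               # probability of the original caption
    focal_weight = (1.0 - p_true) ** gamma       # emphasize hard, uncertain pairs
    ce = -(targets * log_probs).sum(dim=-1)      # smoothed cross-entropy per pair
    return (focal_weight * ce).mean()
```
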
Training Objective

$\texttt{FSC-CLIP}$ combines the standard CLIP loss ($\mathcal{L}_{\text{clip}}$) with the global HN loss ($\mathcal{L}^{g}_{\text{neg}}$) and the local HN loss ($\mathcal{L}^{l}_{\text{neg}}$). Both HN losses incorporate SCR, which applies focal loss and label smoothing to preserve representational integrity.
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{clip}} + \lambda_g \mathcal{L}^{g}_{\text{neg}} + \lambda_l \mathcal{L}^{l}_{\text{neg}}.$
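
A minimal sketch of how these three terms could be combined in code; the weights lambda_g and lambda_l are illustrative placeholders rather than the values used in the paper.

```python
def fsc_clip_total_loss(loss_clip, loss_g_neg, loss_l_neg, lambda_g=0.5, lambda_l=0.5):
    """Combine the standard CLIP contrastive loss with the SCR-regularized global
    and local hard negative losses. lambda_g and lambda_l are placeholder weights,
    not the values reported in the paper."""
    return loss_clip + lambda_g * loss_g_neg + lambda_l * loss_l_neg
```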

Experiments

Highlights

  • We fine-tune our model on three image-text datasets (COCO, CC-3M, and LAION-COCO), using a randomly sampled 100K subset of each.
  • We provide a comprehensive evaluation of methods across 11 compositionality benchmarks, 21 zero-shot classification tasks, and 3 image-text retrieval tasks.
  • $\texttt{FSC-CLIP}$ enhances compositionality to a level comparable to the high-performing DAC-LLM, while simultaneously maintaining strong multi-modal task performance.

A holistic comparison of fine-tuning methods applied to the pre-trained CLIP ViT-B/32 model. $\texttt{FSC-CLIP}$ achieves superior compositionality scores while maintaining strong multi-modal task performance. For each fine-tuning dataset, the best numbers are $\textbf{bold}$, and the second-best numbers are $\underline{\text{underlined}}$.

Related Links

Also, please check out our $\texttt{vl-compo}$ package, which enabled the comprehensive evaluation across diverse tasks in our work. It supports evaluation on a wide range of compositionality and multi-modal benchmarks, integrates various pre-trained and fine-tuned VLMs, and is continuously evolving.

Overall trends in pre-trained and fine-tuned CLIP models, covering 274 model checkpoints across 12 compositional reasoning tasks and 21 zero-shot classification tasks.

All the models and benchmarks used for evaluation are integrated into our $\texttt{vl-compo}$ package.

BibTeX

If you find our work useful for your research, please cite it using the following BibTeX:

@article{oh2024preserving,
  title={Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality},
  author={Oh, Youngtaek and Cho, Jae Won and Kim, Dong-Jin and Kweon, In So and Kim, Junmo},
  journal={arXiv preprint arXiv:2410.05210},
  year={2024},
}

@article{oh2024exploring,
  title={Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition},
  author={Oh, Youngtaek and Ahn, Pyunghwan and Kim, Jinhyung and Song, Gwangmo and Lee, Soonyoung and Kweon, In So and Kim, Junmo},
  journal={arXiv preprint arXiv:2406.09388},
  year={2024},
}