LLM2CLIP-EVA02-L-14-336.safetensors
V1

In this article, we introduce LLM2CLIP, an innovative solution that leverages the capabilities of Large Language Models (LLMs) to unlock the full potential of CLIP. By fine-tuning the LLM with contrastive learning in the caption space, our method extracts the LLM's advanced textual capabilities into its output embeddings. This significantly enhances the textual discriminability of CLIP's output layer, creating a more robust and effective framework.
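To illustrate the caption-space objective described above, the sketch below shows a symmetric contrastive loss over pairs of captions that describe the same image. The function names, pooling choice, and temperature are assumptions for illustration, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of caption pairs.

    anchor_emb / positive_emb: (B, D) embeddings of two captions of the same
    image, produced by the LLM (e.g. mean-pooled last hidden states).
    Captions of the same image are pulled together; all other captions in the
    batch act as negatives.
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```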
Our unique training process positions the fine-tuned LLM as a powerful teacher for CLIP’s visual encoder. This novel approach overcomes the limitations of the vanilla CLIP text encoder, enabling the integration of longer and more complex captions. The result is a remarkable improvement in the performance of cross-modal tasks.
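The sketch below illustrates this teacher setup under simplifying assumptions: the caption-tuned LLM is frozen, lightweight projection heads are added on both sides, and only the vision encoder and projections receive gradients from a standard image-text contrastive loss. Module names, dimensions, and the logit-scale initialization are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPStage2(nn.Module):
    """Second-stage sketch: the fine-tuned LLM serves as a frozen text teacher
    while the CLIP vision encoder (e.g. EVA02) and small projection heads are
    trained with the usual image-text contrastive objective."""

    def __init__(self, vision_encoder, llm_text_encoder,
                 vision_dim, text_dim, embed_dim=1280):
        super().__init__()
        self.vision_encoder = vision_encoder        # trainable
        self.llm_text_encoder = llm_text_encoder    # frozen teacher
        for p in self.llm_text_encoder.parameters():
            p.requires_grad_(False)
        self.visual_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)        # adapter on top of the LLM
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # ~ log(1 / 0.07)

    def forward(self, images, caption_tokens):
        img = F.normalize(self.visual_proj(self.vision_encoder(images)), dim=-1)
        with torch.no_grad():
            txt_feat = self.llm_text_encoder(caption_tokens)   # (B, text_dim), no gradients
        txt = F.normalize(self.text_proj(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(images.size(0), device=images.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```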
Key Achievements of LLM2CLIP:
- Boosted the performance of the previous state-of-the-art (SOTA) EVA02 model by 16.5% in both long-text and short-text retrieval tasks.
- Transformed a CLIP model trained exclusively on English data into a state-of-the-art cross-lingual model.
- Delivered consistent improvements in benchmarks when integrated into multimodal training with models like Llava 1.5, outperforming CLIP across nearly all evaluation metrics.
Our experiments confirm that LLM2CLIP is a game-changer, bringing comprehensive enhancements to cross-modal and multilingual capabilities. This repository provides a converted (safetensors) release of the model; the underlying work highlights the potential of combining LLMs and CLIP to achieve state-of-the-art performance across diverse tasks.
Keywords: LLM2CLIP, CLIP enhancement, contrastive learning, cross-modal tasks, cross-lingual model, multimodal training, SOTA performance.
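For completeness, below is a minimal way to inspect the released checkpoint named at the top of this card, assuming the safetensors Python package is installed. Mapping the tensors into a concrete EVA02/CLIP implementation is left to your own model code.

```python
# Load the converted checkpoint and list a few tensors.
from safetensors.torch import load_file

state_dict = load_file("LLM2CLIP-EVA02-L-14-336.safetensors")
print(f"{len(state_dict)} tensors in the checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```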
