The What and Why of Text-Image Modality Gap in CLIP Models

2024-08-26 15:56:36 +0200 | Jina AI

You can't just use a CLIP model to retrieve both text and images and sort the combined results by score. Why? Because of the modality gap. What is it, and where does it come from?