Question

Contrastive Language–Image Pretraining - How does CLIP work?

Answer

By jointly training an image encoder and a text encoder, CLIP learns a multi-modal embedding space that maximizes the cosine similarity between the image and text embeddings of the N real pairs in a batch while minimizing the cosine similarity of the N² − N incorrect pairings.
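
A minimal sketch of this symmetric contrastive objective in PyTorch, assuming `image_features` and `text_features` are the encoder outputs for the same batch of N matched pairs; the function name, temperature value, and shapes are illustrative, not the official implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N matrix of pairwise cosine similarities, scaled by temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the diagonal holds the real pairs
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text, then average the two losses
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2

# Example: a batch of N = 8 pairs with 512-dimensional embeddings
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```

Because every correct pair sits on the diagonal of the similarity matrix, the N² − N off-diagonal entries automatically serve as the negative pairings that the loss pushes apart.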