Question
Contrastive Language–Image Pretraining - How does CLIP work?
Answer
Given a batch of N (image, text) pairs, CLIP jointly trains an image encoder and a text encoder to learn a multi-modal embedding space. The training objective maximizes the cosine similarity between the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the N² − N incorrect pairings.
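Below is a minimal sketch of this symmetric contrastive objective, assuming PyTorch; the function name, fixed temperature value, and embedding shapes are illustrative placeholders rather than the exact CLIP implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: [N, d] embeddings of N (image, text) pairs."""
    # L2-normalize so dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N matrix of cosine similarities, scaled by a temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the correct labels lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

In this sketch, each row and column of the similarity matrix is treated as an N-way classification problem, which is what pushes the N matched pairs together and the N² − N mismatched pairs apart.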