Question

Contrastive Language–Image Pretraining - How does CLIP work?

Answer

CLIP jointly trains an image encoder and a text encoder to learn a multi-modal embedding space. Given a batch of N (image, text) pairs, training maximizes the cosine similarity between the image and text embeddings of the N correct pairs while minimizing the cosine similarity of the N² − N incorrect pairings in the batch.
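
As a minimal sketch of this contrastive objective (assuming PyTorch and embeddings already produced by the two encoders; the function and variable names are illustrative, not from the CLIP codebase):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_features, text_features: [N, d] tensors from the image and text
    encoders (any backbones producing d-dimensional vectors).
    """
    # L2-normalize so dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] matrix of cosine similarities, scaled by a temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy over rows (image -> text) and columns (text -> image):
    # this pulls up the N correct pairs and pushes down the N^2 - N wrong ones
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In practice the temperature is usually a learned parameter, and the two losses are averaged exactly as above so that the image-to-text and text-to-image directions contribute equally.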