This is done with two models in most standard biencoder approaches. This is how ...

This is done with two models in most standard biencoder approaches. This is how multimodal embedding search works. We want to train a model such that the location of the text embeddings that represent an item and the image embeddings for that item are colocated.