This is great to see. It looks like the size of the embedding vector is half the...

minimaxir · on Oct 26, 2023

The 768D-sized embeddings compared to OpenAI's 1536D embeddings are actually a feature outside of index size.

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that's the metric for RAG). Ideally you'd want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].

TeMPOraL · on Oct 26, 2023

Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweigh or remove them for similarity comparisons. And that map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of ada models like that?

karxxm · on Oct 26, 2023

You wrote „out of the box“, did you find a way to improve this?

teaearlgraycold · on Oct 26, 2023

You can do PCA or some other dimensionality reduction technique. That’ll reduce computation and improve signal/noise ratio when comparing vectors.

karxxm · on Oct 26, 2023

Unfortunately this is not feasible with a large amount of words due to the quadratic scaling. But thanks for the response!

minimaxir · on Oct 26, 2023

Not sure what you mean by large amount of words. You can fit a PCA on millions of vectors relatively performantly, then inference from it is just a matmul.

karxxm · on Oct 28, 2023

Not true. You need a distance matrix (for classical PCA it's a covariance matrix), which scales quadratically with the number of points you want to compare. If you have 1 Mio. vectors, each creating a float entry in the matrix, you will end up with approx (10^6)^2 / 2 unique values, which is roughly 2000Gb of memory.