MariaDB Vector: How it works. Part IV

This is the last post in the “MariaDB Vector: How it works” series. The first three covered storage, the in-memory representation, and HNSW modifications, all as implemented in MariaDB 11.8. This post is about a new feature in MariaDB 12.3: optimized distance calculation.

As I mentioned earlier, distance calculation is the most time-consuming part of vector search, taking 80–90% of the total search time. It is also linear in the number of dimensions: computing the distance between 1536-dimensional vectors takes twice as long as between 768-dimensional ones. And the modern trend is toward longer vectors; the latest OpenAI and Gemini models produce 3072-dimensional embeddings.

Let’s look at the OpenAI use case. OpenAI generates so-called Matryoshka embeddings. They have the magical property that long embeddings can be truncated and vector search will still work. It will work faster (because the vectors have fewer dimensions), but the recall will be worse. Typically a user is advised to do shortlisting and reranking: perform a vector search on short truncated vectors, then reorder the results based on the full long vectors.
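Shortlisting and reranking is easy to sketch in Python. The function and parameter names below are illustrative, not an actual API; brute-force scans stand in for the index lookups a real database would do:

```python
import numpy as np

def shortlist_and_rerank(query, dataset, k=10, prefix=256, shortlist=100):
    """Two-stage search over Matryoshka embeddings.

    dataset: (n, dim) array; query: (dim,) array.
    Stage 1 shortlists on the first `prefix` dimensions,
    stage 2 reranks the shortlist with full-length distances.
    """
    # Stage 1: cheap squared distances on truncated vectors.
    d_short = np.sum((dataset[:, :prefix] - query[:prefix]) ** 2, axis=1)
    candidates = np.argsort(d_short)[:shortlist]
    # Stage 2: exact squared distances, but only for the shortlist.
    d_full = np.sum((dataset[candidates] - query) ** 2, axis=1)
    return candidates[np.argsort(d_full)[:k]]
```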

But this is a manual process that requires users to think about Matryoshka embeddings and write complex queries to take advantage of them. Ideally the database would be able to make use of them automatically. And this paper inspired a new optimization that allows MariaDB 12.3 to do exactly that.

Very simply, instead of calculating the exact distance $$|\mathbf{a}-\mathbf{b}|^2 = d_{exact} = \sum_{i=1}^N (a_i-b_i)^2$$ we calculate it only over a vector prefix of a small length \(k \ll N\) and extrapolate: $$|\mathbf{a}-\mathbf{b}|^2 \approx d_{prefix} = {N \over k}\sum_{i=1}^k (a_i-b_i)^2$$ This relies on the extrapolated distance being close to the exact distance. And for Matryoshka embeddings it very much is:
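The two quantities are straightforward to write down in Python (a sketch of the formulas above, not MariaDB’s actual code):

```python
import numpy as np

def exact_distance(a, b):
    # d_exact: sum of (a_i - b_i)^2 over all N dimensions
    return np.sum((a - b) ** 2)

def prefix_distance(a, b, k):
    # d_prefix: the same sum over the first k dimensions only,
    # scaled by N/k to extrapolate to the full length
    n = len(a)
    return (n / k) * np.sum((a[:k] - b[:k]) ** 2)
```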

This is the ratio \(d_{prefix}/d_{exact}\) as a function of the prefix length \(k\). Note that starting from about 100–150 dimensions it is almost exactly 1.

But simply using \(d_{prefix}\) is basically “shortlisting”: performing a vector search on short truncated vectors. It doesn’t take advantage of the full long vectors at all. The smart trick in the paper was to conditionally fall back to the full length.

Note that during a vector search we are not interested in distances as such. We want to know whether a particular dataset vector is nearer to the query vector than the other dataset vectors we’ve seen so far. That is, we need to know whether the distance is smaller than a certain threshold. So let’s first extrapolate the distance from a short prefix of both vectors. If it predicts that the exact distance will be larger than the threshold, this dataset vector is too far from the query and we reject it. Otherwise we cannot tell and have to compute the exact distance over the full vector length. And this is the result:
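The reject-or-fall-back logic described above can be sketched like this (the function name and the default `k` are my illustrative choices):

```python
import numpy as np

def distance_with_shortcut(a, b, threshold, k=128):
    """Return the exact squared distance, or None when the vector can be
    rejected from the extrapolated prefix distance alone."""
    n = len(a)
    d_prefix = (n / k) * np.sum((a[:k] - b[:k]) ** 2)
    if d_prefix > threshold:
        return None  # predicted too far from the query: reject early
    return np.sum((a - b) ** 2)  # cannot tell, compute the exact distance
```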

It works well, but how do we know whether this optimization is applicable? It is not limited to Matryoshka embeddings[1], and we cannot really add a new option meaning “my vectors have the property that the distance calculated on a prefix can be reasonably extrapolated to the exact distance”. MariaDB has to detect it automatically.

It turns out that the standard deviation of the ratio \(d_{prefix}/d_{exact}\) is a very good criterion for whether the distance extrapolation property holds. Thus, for the first 10,000 distance calculations MariaDB does not use the optimization and only collects statistics, computing both the extrapolated and the exact distance, but never using the extrapolated one to return early. If it then finds that the standard deviation of the ratio lies below a threshold, it enables the optimization and starts short-cutting distance calculations. Otherwise the optimization is disabled completely, so there is no recall or performance loss on non-applicable datasets.

This is how MariaDB 12.3 manages to show 10–30% higher QPS than 11.8 at the same recall level. The performance benefit will only grow as embeddings become longer, since the prefix length is fixed and the prefix distance is computed in constant time, independent of the embedding length.


[1] This optimization does not require Matryoshka embeddings. Another way to create an applicable dataset is to multiply all its vectors by a random orthogonal matrix: this is like a rotation in a multi-dimensional space, and it preserves all distances and angles. Compare the mnist dataset before and after this operation:

Note how the \(d_{prefix}/d_{exact}\) fluctuates wildly for the original (blue) mnist dataset and stays close to 1 for the randomized (red) one.

Such a multiplication is easy to do in Python, for example:

import scipy.stats

# Random orthogonal matrix matching the dimensionality of the vectors.
rom = scipy.stats.ortho_group.rvs(dim=len(dataset[0]))
# Rotate every vector; all distances and angles are preserved.
for i in range(len(dataset)):
    dataset[i] = rom.dot(dataset[i])