MariaDB Vector: How it works. Part III

In the previous parts of this series we’ve seen how MariaDB stores vector indexes in a table and how to implement HNSW for good performance. But MariaDB does not implement HNSW exactly; it calls its vector search algorithm mHNSW, a modified HNSW. Let’s see how exactly it was modified.

Not so greedy!

HNSW, like many, if not most, graph-based vector search algorithms, is greedy. Think of it this way: when it needs to find just one nearest vector (ef=1), it walks the graph, always choosing the node that takes it closest to the target at this particular step.

This picture shows the problem with the greedy search. It starts from node A and wants to find the node nearest to the query node x. It walks from A to B. Then there are two possible paths, and it selects C, because C is closer to x than E. The search stops at D. Had it gone to E, it would have found F, which is much closer to x. But a greedy search always makes the decision that pays off the most here and now.

MariaDB Vector’s mHNSW diverges from other graph search algorithms by introducing a leniency factor \(\lambda\). Its search is no longer greedy. A leniency of \(\lambda = 1.1\), for example, means the search will consider not only the nearest node but all nodes that are within 10% of the nearest one. Let’s see how it affects the search.
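The difference between a greedy walk and a lenient one can be sketched in a few lines of Python. This is a deliberately simplified illustration, not MariaDB’s actual code: a toy graph reproducing the A–F picture above (the coordinates are invented), where \(\lambda = 1.0\) degenerates into the greedy walk and \(\lambda = 1.1\) keeps the almost-as-good path through E alive.

```python
import heapq
import math

# Toy graph reproducing the A..F picture above (coordinates are invented).
coords = {'A': (0, 0), 'B': (2, 0), 'C': (4, 2), 'D': (6, 3),
          'E': (4, -3), 'F': (9, -1), 'x': (10, 0)}
graph = {'A': ['B'], 'B': ['A', 'C', 'E'], 'C': ['B', 'D'],
         'D': ['C'], 'E': ['B', 'F'], 'F': ['E']}

def dist(a, b):
    return math.hypot(coords[a][0] - coords[b][0], coords[a][1] - coords[b][1])

def search(start, query, lam):
    """Best-first walk; lam = 1.0 follows only the nearest neighbour at
    each step (greedy), lam > 1.0 also follows neighbours within lam of it."""
    best, best_dist = start, dist(start, query)
    frontier = [(best_dist, start)]
    seen = {start}
    while frontier:
        d, node = heapq.heappop(frontier)
        if d < best_dist:
            best, best_dist = node, d
        # Unvisited neighbours, nearest to the query first.
        cand = sorted((dist(n, query), n) for n in graph[node] if n not in seen)
        for nd, n in cand:
            if nd <= lam * cand[0][0]:   # keep paths within lam of the nearest
                seen.add(n)
                heapq.heappush(frontier, (nd, n))
    return best

print(search('A', 'x', lam=1.0))  # 'D' — the greedy walk gets stuck
print(search('A', 'x', lam=1.1))  # 'F' — leniency keeps the path via E alive
```

With \(\lambda = 1.0\) the walk reaches D and stops, exactly as in the picture; with \(\lambda = 1.1\) the branch through E survives, because E is within 10% of C’s distance, and the search ends at F.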

Effects of leniency

This chart shows the dramatic effect of leniency. Note that with leniency at just 20% (\(\lambda=1.2\)) the recall at M=4 is better than the recall at M=32 and zero leniency (\(\lambda=1.0\), original HNSW greedy search)! It also shows how much the search speed drops — 20%-lenient search with M=4 is about half the speed of the greedy search with M=32, despite having much lower M.


On the other hand, inserts are much faster with smaller M, as the chart shows — with a lenient search one can build the index up to 10 times faster for the same recall! Comparing both charts one can see that \(\lambda=1.1\) seems to be optimal at least for the range of recalls 0.8–0.99. Greedy search is faster for low recall, while higher lenience pays off only at very high recalls (about 0.999 and up).

Specifics of the cosine distance

All of the above worked really well for the Euclidean distance; the charts above are for the gist-960-euclidean dataset. The first time I tried cosine I was in for a surprise — leniency didn’t quite work. That is, it worked, but it made the search much slower, destroying all the benefits of a lenient search.

In retrospect the problem was not too difficult to see. Leniency makes the search consider “almost good, but not quite” nodes and include them in the search. But with the cosine distance all vectors have essentially the same length, and the dataset looks somewhat like

Here we are searching for the blue vector on the right. And unlike the Euclidean distance case, most irrelevant vectors are closely packed together on the other end of the circle. This means that even a small leniency of 10% will cover a large number of very irrelevant vectors; they will all have to be considered by the search, making it slow.

What we want to do here is to keep the search rather greedy while we are far from the target, but gradually increase leniency as we get closer. This captures the best of both approaches — it won’t blow up the search with a large number of irrelevant vectors, but it will leniently consider alternative search paths when we’re close to the target.

MariaDB uses a sigmoid function for this. Any sigmoid would do, so I used a simple one: \( x \over \sqrt{1+x^2} \). It is scaled and shifted to produce 1.0 (no leniency, greedy search) for the farthest vector from the target and the desired leniency when a vector is close. For this to work MariaDB maintains the diameter \(D\) of the dataset — the largest distance between two vectors ever encountered. The final formula for the leniency becomes $$ 1+{\lambda - 1 \over 2} \left( 1 - {kx \over \sqrt{1+(k^2-1)x^2}} \right) $$ where \( x = {2d \over D} - 1 \) is the distance \(d\) between the current vector and the target, scaled to the -1..1 range; \(k=5\) is a slope factor; and \(\lambda\) is the leniency. This is the result (on the dbpedia-openai-500k-angular dataset):
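The formula is easy to check numerically. A small sketch (my own transcription of the formula above, not MariaDB’s source code) shows the two endpoints: the farthest vector gets leniency 1.0 (pure greedy), and a vector right at the target gets the full \(\lambda\).

```python
import math

def leniency(d, D, lam=1.1, k=5):
    """Distance-dependent leniency: 1.0 (greedy) for the farthest vector,
    approaching lam as the distance d to the target goes to 0.
    D is the dataset diameter, k the slope factor."""
    x = 2 * d / D - 1                                  # scale d into -1..1
    s = k * x / math.sqrt(1 + (k * k - 1) * x * x)     # scaled sigmoid, -1..1
    return 1 + (lam - 1) / 2 * (1 - s)

print(round(leniency(0.0, 2.0), 6))  # 1.1  — at the target: fully lenient
print(round(leniency(1.0, 2.0), 6))  # 1.05 — halfway: half the leniency
print(round(leniency(2.0, 2.0), 6))  # 1.0  — farthest vector: greedy
```

The slope factor \(k\) controls how sharply the leniency ramps up near the target: with \(k=5\), the function stays close to 1.0 for most of the distance range and only opens up as \(d\) approaches 0, which is exactly the behaviour wanted for cosine datasets.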

Bits and pieces

MariaDB provides a tuning parameter mhnsw_ef_search. HNSW additionally introduced ef_construction, which works similarly, but at index construction time. Its default value in pgvector is 64, the original paper used 100, and many HNSW implementations in ann-benchmarks are configured to set it to 200 or even 500. High ef_construction values are important for building a graph that can deliver good recall in searches, but they drastically increase index construction time. MariaDB has ef_construction hard-coded to a very low value of 10. It can afford that, because the lenient search takes care of improving the recall. As a result, setting ef_construction to 200 only marginally improves recall in MariaDB, while making the index construction three times slower. Lenient search allows MariaDB to keep ef_construction low and the index construction fast without compromising recall.

To be honest, MariaDB Vector has more deviations from HNSW; they were implemented in the early stages of MariaDB Vector development and provided improvements over the baseline. But when I re-benchmarked them for this blog post, it turned out that they no longer help. They will be removed in MariaDB 13.x.

To be continued

This is the main modification of HNSW in MariaDB. Everything from this and the previous parts was in the first MariaDB Vector release and is now part of MariaDB 11.8. The fourth blog post will explain how the optimized distance calculations in MariaDB 12.3 work.