MariaDB Vector: How it works. Part II
In the first post of this series, I described how the vector index is stored in a table and how it achieves full transactional behavior and ACID properties, compatible with the storage engine of the table the user created. But while the table provides persistent storage for the index, it is the in-memory part that gives it its performance. This is how it works.
Distance calculations
This is the most performance-sensitive part of HNSW. According to various estimates, distance calculations account for 80–90% of search time, and the cost of one calculation grows linearly with the vector length. OpenAI creates vectors of 1536 dimensions, which is a lot, and the trend is toward even more: newer OpenAI and Gemini models produce 3072. Thus this is the innermost of all loops, and every CPU instruction removed from it matters.
Euclidean distance is defined as \( |\mathbf{a}-\mathbf{b}| = \sqrt{\sum (a_i-b_i)^2} \). For the purpose of searching, the actual distances are not important; we only care whether one distance is greater than another. This means we can lose the expensive square root and compute \( |\mathbf{a}-\mathbf{b}|^2 = \sum (a_i-b_i)^2 \). Something like
// float *v1, *v2;  pointers to the two vectors, vec_size dimensions each
float dist = 0;
for (i = 0; i < vec_size; v1++, v2++, i++) {
  float d = *v1 - *v2;
  dist += d * d;  // accumulate squared per-dimension differences
}
But let’s expand the parentheses: \( |\mathbf{a}-\mathbf{b}|^2 = \sum (a_i-b_i)^2 = \sum (a_i^2+b_i^2-2a_ib_i)=|\mathbf{a}|^2+|\mathbf{b}|^2-2\,\mathbf{a}\cdot\mathbf{b}\). The great thing about this is that we can precalculate the squared absolute values of the vectors and only compute the dot product. Like this:
// float *v1, *v2;
// v1_abs and v2_abs hold the precalculated |v1|² and |v2|²
float dot_product = 0;
for (i = 0; i < vec_size; v1++, v2++, i++)
  dot_product += *v1 * *v2;
dist = v1_abs + v2_abs - 2 * dot_product;
which means one operation fewer in the loop body. Not much, but still noticeable, and it will become crucially important later.
Incidentally, this is the reason why MariaDB does not offer the “inner product” distance that many other databases have. Inner product is not a proper distance (it does not have a minimum when a vector is compared to itself); it is only used as a hack because in those databases it is faster than a proper Euclidean or cosine distance. But in MariaDB it would not be faster: distance calculations already use the inner product internally, so there is no need to resort to hacks. Unfortunately, we regularly keep getting such requests, and even code contributions.
Of course, the actual implementation, unlike the simplified loops above, uses SIMD instructions (for example, AVX2 or AVX-512 on x86-64, NEON on ARM, VSX on POWER9), processing eight or sixteen floats in one iteration of the loop.
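For illustration, here is a minimal AVX2 sketch of such a loop. It is my simplification, not MariaDB’s actual code: the function name is made up, vec_size is assumed to be a multiple of eight, FMA support is assumed, and the tail handling a real implementation needs is left out.

#include <immintrin.h>
#include <stddef.h>

// Dot product of two float vectors, eight lanes per iteration.
static float dot_product_avx2(const float *v1, const float *v2, size_t vec_size)
{
  __m256 acc = _mm256_setzero_ps();
  for (size_t i = 0; i < vec_size; i += 8) {
    __m256 a = _mm256_loadu_ps(v1 + i);
    __m256 b = _mm256_loadu_ps(v2 + i);
    acc = _mm256_fmadd_ps(a, b, acc);  // acc += a*b in eight lanes at once
  }
  // horizontal sum of the eight partial sums
  __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                         _mm256_extractf128_ps(acc, 1));
  lo = _mm_hadd_ps(lo, lo);
  lo = _mm_hadd_ps(lo, lo);
  return _mm_cvtss_f32(lo);
}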
Quantization
Vectors store dimensions as 32-bit floats: four bytes per dimension, and basically the standard; pretty much every vector database does that. Can we do better? Fewer bytes per dimension mean less data to read from the table (saving both space and time), less data to keep in memory (fitting more vectors in the same amount of RAM), and less data to send from RAM to the CPU for every dimension’s calculations, again saving time. Many vector databases optionally support 16-bit floats, which come in two kinds. float16 (also called fp16 or half-precision float) has a sign bit, 5 exponent bits and 10 mantissa bits, accommodating numbers from ±0.0000610 to ±65504. bfloat16 has a sign bit, 8 exponent bits and 7 mantissa bits, and can store numbers from ±1.18·10⁻³⁸ up to ±3.4·10³⁸.
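As a small aside, bfloat16 is simply the upper half of a float32, which is why it keeps the full float32 range while sacrificing precision. A tiny sketch (the function name is illustrative; real conversions usually round to nearest rather than truncate):

#include <stdint.h>
#include <string.h>

// bfloat16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits,
// i.e. the top 16 bits of an IEEE-754 float32.
static uint16_t float_to_bfloat16(float f)
{
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);  // type-pun via memcpy to avoid UB
  return (uint16_t)(bits >> 16);   // drop the 16 low mantissa bits
}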
But are they a good fit for vector search? Note that the vector’s absolute value depends quadratically on the coordinate values. Consider a 2D vector (123, 1). Its absolute value is \( \sqrt{123^2+1^2} \approx 123.004 \); it’s almost as if the second coordinate didn’t matter. Certainly, supporting a range of values from 10⁻³⁸ to 10³⁸, or even from 10⁻⁵ to 10⁵, makes no sense: when at least one coordinate is that large, the small ones have no effect at all. It’s a waste of bits to store the exponent; we’d better store what matters, the significant digits of the mantissa. And this is exactly what MariaDB does: it makes one of the coordinates that large, ignores the small ones, and uses all bits for the mantissa.
When indexing a vector, MariaDB takes the largest coordinate by absolute value, \( \max_i |a_i| \), introduces a scaling factor \( \alpha = {32767 \over \max_i |a_i|} \), and multiplies every coordinate by it. The resulting coordinates are stored as 16-bit signed integers. Not floats: there are no exponents here, all 15 value bits are used for the mantissa. The largest coordinate always becomes ±32767, and very small coordinates become zeros. They wouldn’t have mattered anyway; 32767 completely overshadows them. The scale factor is calculated per vector and stored as a 32-bit float in the index alongside each vector.
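In code, the quantization step could look roughly like this. This is a simplified sketch with illustrative names, not MariaDB’s actual code, and it assumes a non-zero vector:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Quantize a float vector to int16, returning the per-vector scale
// factor that converts quantized values back to the original units.
static float quantize_vector(const float *in, int16_t *out, size_t vec_size)
{
  float max_abs = 0;
  for (size_t i = 0; i < vec_size; i++)
    if (fabsf(in[i]) > max_abs)
      max_abs = fabsf(in[i]);
  float scale = max_abs / 32767.0f;           // the inverse of α above
  for (size_t i = 0; i < vec_size; i++)
    out[i] = (int16_t)lrintf(in[i] / scale);  // the largest coordinate maps
                                              // to ±32767, tiny ones to 0
  return scale;
}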
Now, technically, there is a case where small numbers matter. Remember, we compute the dot product \( \sum a_ib_i \), and it can happen that \( a_i \) is very large while \( b_i \) is very small. Say, \( a_i \) is 32767 and \( b_i \) is 1.01 before rounding and 1 after: the product is 32767. When \( b_i \) is 0.99 before rounding, it becomes 0 and the product is 0. Quite a difference, but it still doesn’t matter. Remember, vector search needs to find the vectors nearest to the query vector. Nearest vectors have similar values in all coordinates; a vector with such a striking difference in one coordinate (large against small) will not be among the nearest, it will be rejected early, and it won’t matter that its distance was calculated with an error.
All this allows us to convert 32-bit floats into 16-bit integers: they take half the space and have 15 bits of precision (vs. 10 in float16). And at this step it becomes very important to use the dot-product method for calculating distances. The resulting function looks like this (simplified):
// int16_t *v1, *v2;
// v1_abs/v2_abs and v1_scale/v2_scale are precalculated per vector
int64_t dot_product = 0;  // accumulate in a wide integer to avoid overflow
for (i = 0; i < vec_size; v1++, v2++, i++)
  dot_product += (int32_t)*v1 * *v2;
dist = v1_abs + v2_abs - 2 * v1_scale * v2_scale * dot_product;
except that SIMD instructions can process up to 32 coordinates in a single loop iteration. This one function handles both Euclidean and cosine distances; they differ only in the value of the precalculated scale factor.
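To see why one function can serve both distances, compare the two formulas. For Euclidean distance we compute \( |\mathbf{a}|^2+|\mathbf{b}|^2-2\,\mathbf{a}\cdot\mathbf{b} \) directly. The cosine distance \( 1 - {\mathbf{a}\cdot\mathbf{b} \over |\mathbf{a}||\mathbf{b}|} \) has exactly the same shape: store \( 1\over2 \) in place of each squared length and fold \( 1 \over \sqrt{2}\,|\mathbf{v}| \) into each vector’s precalculated scale factor, so that \( 2\,s_1 s_2\,(\mathbf{a}\cdot\mathbf{b}) = {\mathbf{a}\cdot\mathbf{b} \over |\mathbf{a}||\mathbf{b}|} \). (This is a sketch of the idea; the exact constants MariaDB stores may differ.)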
As a result, we halved the storage and memory requirements and got a 30–50% speedup with practically no recall loss. There are no drawbacks, and that’s why this is not an optional feature the user needs to configure, as in other databases: it’s the only way vector indexes work in MariaDB.
Bits and pieces
There are smaller bits of engineering that still play an important role in not making vector search slower; let’s just mention them in passing:
- The graph is completely immutable for reads: many threads can search it in parallel, with no locks and no communication between different search threads. This is why MariaDB Vector search scales well with the number of threads.
- Within one search, the distance for a pair of vectors is calculated only once; it is then cached and never recalculated. Remember, distance calculations are expensive.
- A search needs to keep track of the vectors it has visited; this is used to decide which neighbors of the current vector still need to be analyzed. MariaDB uses a Bloom filter for that, which is both very fast and memory-efficient (a toy sketch of the data structure follows after this list). The Bloom filter’s size is based on how many vectors one search is expected to visit, and the estimation formula is adjusted heuristically after every search. A Bloom filter is imprecise (a false positive makes the search skip a vector it never actually visited), so we lose some recall there, but more than make up for it with speed.
- The Bloom filter also uses SIMD instructions to check eight vectors in one call, and a vector’s neighbor list is specifically organized so that it can be fed directly to the Bloom filter.
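For readers unfamiliar with the data structure, here is a toy visited-set Bloom filter, illustrative only: all names are made up, and MariaDB’s actual filter sizes itself from the expected visit count and checks eight keys per call with SIMD, none of which is shown here.

#include <stddef.h>
#include <stdint.h>

// A bit array addressed by several hash functions; nbits is a power
// of two so that masking replaces the modulo operation.
typedef struct { uint64_t *bits; uint64_t nbits; } bloom_t;

// A simple 64-bit mixer standing in for a real seeded hash function.
static uint64_t hash64(uint64_t x, uint64_t seed)
{
  x += seed * 0x9e3779b97f4a7c15ULL;
  x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
  return x ^ (x >> 33);
}

static void bloom_insert(bloom_t *b, uint64_t key)
{
  for (uint64_t k = 0; k < 3; k++) {  // three hash functions
    uint64_t h = hash64(key, k) & (b->nbits - 1);
    b->bits[h >> 6] |= 1ULL << (h & 63);
  }
}

static int bloom_maybe_visited(const bloom_t *b, uint64_t key)
{
  for (uint64_t k = 0; k < 3; k++) {
    uint64_t h = hash64(key, k) & (b->nbits - 1);
    if (!(b->bits[h >> 6] & (1ULL << (h & 63))))
      return 0;  // definitely not visited
  }
  return 1;      // probably visited; false positives lose a bit of recall
}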
To be continued
Phew, that was a long post. Let’s stop here. The third part will talk about modifications to the HNSW algorithm itself. Modifications that explain the letter “m” in mHNSW.