MariaDB Vector
Sergei Golubchik
MariaDB
About me
- Chief Architect MariaDB
- MariaDB Foundation, board member
- MySQL core developer: 1998-2009
- MariaDB core developer: 2010-…
What is “Vector Search”
- Semantic text search
- Image search
- Music search
- … hybrid search, too
- RAG — Retrieval Augmented Generation
How?
- Convert the data texts (images, audio) into vectors
- Store the vectors in the database
- To search, convert the query text (image, audio) into a vector
What is a “Vector”?
- A list of floating point numbers
[0.4187, 0.8099, 0.82319, 0.5982, 0.03326]
- Typical length: 20–2000 numbers
- Search for the “nearest” vectors (distance sketch below)
- the search is approximate
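What “nearest” means in practice: a minimal sketch of the two distance measures used later in this talk (euclidean and cosine), in plain Python with made-up vectors.

import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# two illustrative 5-number vectors; the smaller the distance, the “nearer”
v1 = [0.4187, 0.8099, 0.82319, 0.5982, 0.03326]
v2 = [0.4321, 0.7865, 0.80011, 0.6100, 0.04012]
print(euclidean_distance(v1, v2), cosine_distance(v1, v2))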
What is a “MHNSW”?
- Vector search algorithm used in MariaDB
- HNSW = Hierarchical Navigable Small World
- Industry standard graph-based approximate nearest neighbor vector search algorithm
- MariaDB uses a modified variant with various enhancements (toy sketch of the core idea below)
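A toy sketch of the idea behind HNSW-style graph search, not MariaDB's actual MHNSW code: start from an entry point and greedily walk to whichever neighbor is closest to the query until no neighbor improves. The real algorithm does this over several layers and keeps a candidate list (ef) instead of a single node.

def greedy_search(graph, vectors, distance, query, entry_point):
    # graph: node id -> list of neighbor ids; vectors: node id -> vector
    current = entry_point
    current_dist = distance(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = distance(vectors[neighbor], query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current  # an approximate nearest neighbor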
The table
CREATE TABLE embeddings (
doc_id BIGINT UNSIGNED PRIMARY KEY,
embedding VECTOR(1536) NOT NULL,
VECTOR INDEX (embedding) M=8 DISTANCE=COSINE
);
Python
from openai import OpenAI
client = OpenAI()
model = "text-embedding-3-small"
def get_embedding(text):
    return client.embeddings.create(input=[text],
                                    model=model).data[0].embedding
Indexing
# text input: the vector as a string, converted by VEC_FromText()
import mariadb
v = get_embedding(document[i])
cur.execute("INSERT embeddings VALUES (%d, Vec_FromText(%s))",
            (i, str(v)))

# binary input: the vector passed directly as 32-bit floats
import mariadb
import array
v = get_embedding(document[i])
cur.execute("INSERT embeddings VALUES (%d, %s)",
            (i, array.array("f", v).tobytes()))
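Both snippets assume an open connection and cursor; a minimal setup could look like this (host, credentials and database name are placeholders).

import mariadb

conn = mariadb.connect(host="localhost", user="app",
                       password="********", database="vectordb")
cur = conn.cursor()
# ... run the INSERTs above, then make them durable
conn.commit()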
Searching
import mariadb
import array
q = get_embedding(user_question)
cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))
LlamaIndex
from llama_index.vector_stores.mariadb import MariaDBVector
vector_store = MariaDBVector.from_params(
    host="localhost",
    user="llamaindex",
    password="********",
    database="llamaindexdb",
    table_name="vectorstore",
    embed_dim=1536,
)
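A hedged usage sketch with the standard LlamaIndex API (the sample document and question are made up; an OpenAI key is assumed for the default embeddings and query engine).

from llama_index.core import Document, StorageContext, VectorStoreIndex

# wrap the MariaDB store into an index, add a document, ask a question
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text="MariaDB 11.7 adds a VECTOR data type and a vector index.")],
    storage_context=storage_context,
)
print(index.as_query_engine().query("What does MariaDB 11.7 add?"))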
LangChain
from langchain_mariadb import MariaDBStore, DistanceStrategy
from langchain_openai import OpenAIEmbeddings
import mariadb

vector_store = MariaDBStore(
    embeddings=OpenAIEmbeddings(),
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE,
    pool=mariadb.ConnectionPool(
        host="localhost",
        user="langchain",
        password="*********",
        database="langchaindb",
    ),
)
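A hedged usage sketch relying on the generic LangChain vector-store interface (the text and query are made up).

vector_store.add_texts(["MariaDB 11.7 adds a VECTOR data type and a vector index."])
for doc in vector_store.similarity_search("vector data type", k=5):
    print(doc.page_content)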
Tuning is hard
- Many (>10) algorithm parameters
- Impossible to describe what they do
- Complex interaction between parameters
- MariaDB’s solution: hide the complexity and auto-tune!
Server variables
SET @@mhnsw_max_cache_size = 16777216;
SET @@mhnsw_ef_search = 20;
SET @@mhnsw_default_distance = EUCLIDEAN;
SET @@mhnsw_default_m = 6;
mhnsw_max_cache_size
- HNSW is an in-memory algorithm
- it is slow otherwise
- set the cache as large as possible (example below)
- it grows as needed, up to the specified limit
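For example, reusing cur from the earlier snippets (the 1 GiB value is illustrative, not a recommendation; SET GLOBAL needs the appropriate privilege).

# let the MHNSW index cache grow up to 1 GiB, server-wide
cur.execute("SET GLOBAL mhnsw_max_cache_size = 1024*1024*1024")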
DISTANCE (mhnsw_default_distance)
- EUCLIDEAN or COSINE
- depends on the problem area
CREATE TABLE embeddings (
doc_id BIGINT UNSIGNED PRIMARY KEY,
embedding VECTOR(100) NOT NULL,
VECTOR INDEX (embedding) DISTANCE=COSINE
);
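A hedged sketch of a matching search against the earlier embeddings table: the metric can be named explicitly via VEC_DISTANCE_COSINE() (VEC_DISTANCE_EUCLIDEAN() exists as well); cur, q and array come from the searching example.

cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE_COSINE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))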
mhnsw_ef_search and M (mhnsw_default_m)
- slow and precise ↔ fast and imprecise?
- fast write / slow search ↔ slow write / fast search?
- mhnsw_ef_search affects only the search (per-session sketch below)
- M (mhnsw_default_m) affects both write and search
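A hedged per-session sketch: raise mhnsw_ef_search before a query to trade speed for recall (the value 50 is illustrative); cur, q and array come from the searching example.

cur.execute("SET @@mhnsw_ef_search = 50")   # higher = slower but more precise
cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))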
build time vs recall / qps vs recall
[benchmark charts not reproduced]
- small: 60k vectors, 784 dimensions, euclidean
- large: 1M vectors, 960 dimensions, euclidean
- hard: 290k vectors, 256 dimensions, cosine
Currently implemented
- VECTOR(N) data type
- Text and direct binary input
- VECTOR index type using MHNSW algorithm
- Two distance functions
- Two tuning parameters
- Comfortable performance
Currently implemented (more)
- WHERE clause with SQL semantics (LIMIT after WHERE)
- post-filtering (see the sketch after this list)
- few vector databases support it
- pgvector supports it since 0.8.0
- 16-bit precision
- half the size of a 32-bit float
- not half-width floats (those have only 8–10 bits of precision)
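A hedged sketch of such a query: the WHERE clause keeps its usual SQL meaning and LIMIT picks the 5 nearest among the rows that pass the filter (doc_id < 1000 is an arbitrary illustrative condition; cur, q and array come from the searching example).

cur.execute("""
    SELECT doc_id FROM embeddings
    WHERE doc_id < 1000
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))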
Supported by
- LlamaIndex — v0.12.11
- SpringAI — v1.0.0-M5
- Hibernate — 7.0
- LangChain — discussions/29203
- LangChain4J — pull/2225
- LangChainJS — pull/7360
Limitations
- Only one vector index per table
- Only NOT NULL values
- Only InnoDB, MyISAM, and Aria
- Partitioning is not supported
- XA is not supported
Future plans
- Distances (Hamming? L1? Jaccard?)
- Vector search algorithms (IVFFlat? DiskANN?)
- Observability improvements
- Hybrid search (with Reciprocal Rank Fusion)
- More performance (pre-filtering, quantization, etc.)
- Remove limitations
Where
- MariaDB Server 11.7
- MariaDB Enterprise Server 11.4