MariaDB Vector

Sergei Golubchik
MariaDB

About me

What is “Vector Search”

How?

What is a “Vector”?

What is a “MHNSW”?

Example

The table

CREATE TABLE embeddings (
        doc_id BIGINT UNSIGNED PRIMARY KEY,
        embedding VECTOR(1536) NOT NULL,
        VECTOR INDEX (embedding) M=8 DISTANCE=COSINE
);
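A quick sanity check of the new table from Python (a minimal sketch: host, credentials and database are placeholders, and it assumes the VEC_FromText()/VEC_ToText() conversion functions of MariaDB 11.7+):

import mariadb

# placeholder connection parameters -- adjust to your environment
conn = mariadb.connect(host="localhost", user="demo",
                       password="demo", database="test")
cur = conn.cursor()

# insert a dummy 1536-dimensional vector given as JSON-style text,
# then read it back in text form
dummy = str([0.1] * 1536)
cur.execute("INSERT INTO embeddings VALUES (1, Vec_FromText(%s))", (dummy,))
cur.execute("SELECT Vec_ToText(embedding) FROM embeddings WHERE doc_id = 1")
print(cur.fetchone()[0][:40], "...")
conn.commit()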

Python

from openai import OpenAI

client = OpenAI()
model = "text-embedding-3-small"

def get_embedding(text):
    return client.embeddings.create(input=[text],
                                    model=model).data[0].embedding

Indexing

Insert each document's embedding; here the vector is passed as JSON-style text and converted with Vec_FromText():

import mariadb

# cur: cursor on an open mariadb connection
v = get_embedding(document[i])

cur.execute("INSERT INTO embeddings VALUES (%d, Vec_FromText(%s))",
            (i, str(v)))

The same insert, passing the vector as raw float32 bytes instead of text:

import mariadb
import array

v = get_embedding(document[i])

# array("f", v).tobytes() produces the packed float32 representation
# that the VECTOR column stores
cur.execute("INSERT INTO embeddings VALUES (%d, %s)",
            (i, array.array("f", v).tobytes()))
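Putting it together, a complete indexing pass might look like this (a sketch only: the connection parameters and the documents list are placeholders):

import array
import mariadb

conn = mariadb.connect(host="localhost", user="demo",
                       password="demo", database="test")
cur = conn.cursor()

# documents: placeholder list of texts to embed and index
for i, doc in enumerate(documents):
    v = get_embedding(doc)
    cur.execute("INSERT INTO embeddings VALUES (%d, %s)",
                (i, array.array("f", v).tobytes()))

conn.commit()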

Searching

import mariadb
import array

q = get_embedding(user_question)

# the vector index turns this ORDER BY ... LIMIT into an ANN search
cur.execute("""
    SELECT doc_id FROM embeddings
        ORDER BY VEC_DISTANCE(%s, embedding)
        LIMIT 5
""", (array.array("f", q).tobytes(),))
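The query returns the ids of the five nearest documents; mapping them back to the source texts is a plain fetch (sketch, reusing the cursor and the documents list from the indexing example):

for (doc_id,) in cur.fetchall():
    print(doc_id, documents[doc_id][:80])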

LlamaIndex

from llama_index.vector_stores.mariadb import MariaDBVectorStore

vector_store = MariaDBVectorStore.from_params(
    host="localhost",
    user="llamaindex",
    password="********",
    database="llamaindexdb",
    table_name="vectorstore",
    embed_dim=1536
)
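From there the store plugs into the usual LlamaIndex flow (a sketch; documents and the OpenAI API key are assumed to be set up elsewhere):

from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents,
                                        storage_context=storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What does the vector index speed up?"))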

LangChain

import mariadb
from langchain_openai import OpenAIEmbeddings
from langchain_mariadb import MariaDBStore, DistanceStrategy

vector_store = MariaDBStore(
    embeddings=OpenAIEmbeddings(),
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE,
    pool=mariadb.ConnectionPool(
        host="localhost",
        user="langchain",
        password="*********",
        database="langchaindb"
    )
)
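Usage then follows the standard LangChain VectorStore interface (a sketch; the texts and the query are only illustrative):

vector_store.add_texts(["MariaDB 11.7 adds a VECTOR data type",
                        "MHNSW is MariaDB's modified HNSW index"])

for doc in vector_store.similarity_search("vector index", k=2):
    print(doc.page_content)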

Tuning

Tuning is hard

Server variables

SET @@mhnsw_max_cache_size = 16777216;

SET @@mhnsw_ef_search = 20;

SET @@mhnsw_default_distance = EUCLIDEAN;

SET @@mhnsw_default_m = 6;

mhnsw_max_cache_size

DISTANCE (mhnsw_default_distance)


CREATE TABLE embeddings (
        doc_id BIGINT UNSIGNED PRIMARY KEY,
        embedding VECTOR(100) NOT NULL,
        VECTOR INDEX (embedding) DISTANCE=COSINE
);
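Conceptually the two supported metrics are plain Euclidean distance and cosine distance (1 - cosine similarity); a pure-Python illustration of what the server computes:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(euclidean([1, 0], [0, 1]))   # 1.414..., depends on magnitudes
print(cosine([1, 0], [0, 1]))      # 1.0, depends only on the angle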

mhnsw_ef_search and M (mhnsw_default_m)

qps vs recall

[chart: QPS (0-5000) vs recall (0.3-1.0), one curve per M = 4, 5, 6, 8, 12, with ef_search increasing from 10 to 800 along each curve]

build time vs recall

[chart: index build time (0-2000) vs recall (0.6-1.0), one curve per M = 3, 4, 5, 6, 8, 12]
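The ef_search side of the tradeoff can be adjusted per session right before a query: larger values mean better recall but fewer queries per second (sketch, reusing q and the cursor from the search example):

import array

# widen the candidate list for this session
cur.execute("SET @@mhnsw_ef_search = 100")

cur.execute("""
    SELECT doc_id FROM embeddings
        ORDER BY VEC_DISTANCE(%s, embedding)
        LIMIT 5
""", (array.array("f", q).tobytes(),))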

Performance comparison

small: 60k vectors, 784 dimensions, euclidean

[charts: build time vs recall and QPS vs recall; series: mariadb, pgvector, qdrant, redisearch, weaviate]

large: 1M vectors, 960 dimensions, euclidean

[charts: build time vs recall and QPS vs recall; series: mariadb, pgvector, qdrant, redisearch, weaviate]

hard: 290k vectors, 256 dimensions, cosine

[charts: build time vs recall and QPS vs recall; series: mariadb, pgvector, qdrant, redisearch, weaviate]

Feature summary

Currently implemented

Currently implemented (more)

Supported by

Limitations

Future plans

Where