MariaDB Vector
Sergei Golubchik
MariaDB
About me
- Chief Architect MariaDB
- MariaDB Foundation, board member
- MySQL core developer: 1998-2009
- MariaDB core developer: 2010-…
What is “Vector Search”
- Semantic text search
- Image search
- Music search
- … hybrid search, too
- RAG — Retrieval Augmented Generation
How?
- Convert the data texts (images, audio) into vectors
- Store the vectors in the database
- To search, convert the query text (image, audio) into a vector
What is a “Vector”?
- A list of floating point numbers
[0.4187, 0.8099, 0.82319, 0.5982, 0.03326]
- Typical length: 20–2000 numbers
- Search for the “nearest” vectors (distance sketch below)
- the search is approximate
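What “nearest” means in practice: a minimal sketch of the two distance measures used later in this talk (euclidean and cosine), in plain Python with made-up vectors.

import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# two illustrative 5-number vectors; the smaller the distance, the “nearer”
v1 = [0.4187, 0.8099, 0.82319, 0.5982, 0.03326]
v2 = [0.4321, 0.7865, 0.80011, 0.6100, 0.04012]
print(euclidean_distance(v1, v2), cosine_distance(v1, v2))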
What is a “MHNSW”?
- Vector search algorithm used in MariaDB
- HNSW = Hierarchical Navigable Small World
- Industry standard graph-based approximate nearest neighbor vector search algorithm
- MariaDB uses a modified variant with various enhancements (toy sketch of the core idea below)
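A toy sketch of the idea behind HNSW-style graph search, not MariaDB's actual MHNSW code: start from an entry point and greedily walk to whichever neighbor is closest to the query until no neighbor improves. The real algorithm does this over several layers and keeps a candidate list (ef) instead of a single node.

def greedy_search(graph, vectors, distance, query, entry_point):
    # graph: node id -> list of neighbor ids; vectors: node id -> vector
    current = entry_point
    current_dist = distance(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = distance(vectors[neighbor], query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current  # an approximate nearest neighbor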
The table
CREATE TABLE embeddings (
doc_id BIGINT UNSIGNED PRIMARY KEY,
embedding VECTOR(1536) NOT NULL,
VECTOR INDEX (embedding) M=8 DISTANCE=COSINE
);
Python
from openai import OpenAI
client = OpenAI()
model = "text-embedding-3-small"
def get_embedding(text):
    return client.embeddings.create(input=[text],
                                    model=model).data[0].embedding
Indexing
# text input: the vector as a string, converted by VEC_FromText()
import mariadb
v = get_embedding(document[i])
cur.execute("INSERT embeddings VALUES (%d, Vec_FromText(%s))",
            (i, str(v)))

# binary input: the vector passed directly as 32-bit floats
import mariadb
import array
v = get_embedding(document[i])
cur.execute("INSERT embeddings VALUES (%d, %s)",
            (i, array.array("f", v).tobytes()))
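Both snippets assume an open connection and cursor; a minimal setup could look like this (host, credentials and database name are placeholders).

import mariadb

conn = mariadb.connect(host="localhost", user="app",
                       password="********", database="vectordb")
cur = conn.cursor()
# ... run the INSERTs above, then make them durable
conn.commit()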
Searching
import mariadb
import array
q = get_embedding(user_question)
cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))
LlamaIndex
from llama_index.vector_stores.mariadb import MariaDBVector
vector_store = MariaDBVector.from_params(
    host="localhost",
    user="llamaindex",
    password="********",
    database="llamaindexdb",
    table_name="vectorstore",
    embed_dim=1536,
)
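A hedged usage sketch with the standard LlamaIndex API (the sample document and question are made up; an OpenAI key is assumed for the default embeddings and query engine).

from llama_index.core import Document, StorageContext, VectorStoreIndex

# wrap the MariaDB store into an index, add a document, ask a question
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text="MariaDB 11.7 adds a VECTOR data type and a vector index.")],
    storage_context=storage_context,
)
print(index.as_query_engine().query("What does MariaDB 11.7 add?"))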
LangChain
from langchain_mariadb import MariaDBStore, DistanceStrategy
from langchain_openai import OpenAIEmbeddings
import mariadb

vector_store = MariaDBStore(
    embeddings=OpenAIEmbeddings(),
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE,
    pool=mariadb.ConnectionPool(
        host="localhost",
        user="langchain",
        password="*********",
        database="langchaindb",
    ),
)
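A hedged usage sketch relying on the generic LangChain vector-store interface (the text and query are made up).

vector_store.add_texts(["MariaDB 11.7 adds a VECTOR data type and a vector index."])
for doc in vector_store.similarity_search("vector data type", k=5):
    print(doc.page_content)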
Tuning is hard
- Many (>10) algorithm parameters
- Impossible to describe what they do
- Complex interaction between parameters
- MariaDB’s solution: hide the complexity and auto-tune!
Server variables
SET @@mhnsw_max_cache_size = 16777216;
SET @@mhnsw_ef_search = 20;
SET @@mhnsw_default_distance = EUCLIDEAN;
SET @@mhnsw_default_m = 6;
mhnsw_max_cache_size
- HNSW is an in-memory algorithm
- it is slow otherwise
- set the cache as large as possible (example below)
- it grows as needed, up to the specified limit
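For example, reusing cur from the earlier snippets (the 1 GiB value is illustrative, not a recommendation; SET GLOBAL needs the appropriate privilege).

# let the MHNSW index cache grow up to 1 GiB, server-wide
cur.execute("SET GLOBAL mhnsw_max_cache_size = 1024*1024*1024")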
DISTANCE (mhnsw_default_distance)
- EUCLIDEAN or COSINE
- depends on the problem area
CREATE TABLE embeddings (
doc_id BIGINT UNSIGNED PRIMARY KEY,
embedding VECTOR(100) NOT NULL,
VECTOR INDEX (embedding) DISTANCE=COSINE
);
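A hedged sketch of a matching search against the earlier embeddings table: the metric can be named explicitly via VEC_DISTANCE_COSINE() (VEC_DISTANCE_EUCLIDEAN() exists as well); cur, q and array come from the searching example.

cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE_COSINE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))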
mhnsw_ef_search and M (mhnsw_default_m)
- slow and precise ↔ fast and imprecise?
- fast write / slow search ↔ slow write / fast search?
- mhnsw_ef_search affects only the search (per-session sketch below)
- M (mhnsw_default_m) affects both write and search
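A hedged per-session sketch: raise mhnsw_ef_search before a query to trade speed for recall (the value 50 is illustrative); cur, q and array come from the searching example.

cur.execute("SET @@mhnsw_ef_search = 50")   # higher = slower but more precise
cur.execute("""
    SELECT doc_id FROM embeddings
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))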
build time vs recall / qps vs recall
[benchmark charts not reproduced]
- small: 60k vectors, 784 dimensions, euclidean
- large: 1M vectors, 960 dimensions, euclidean
- hard: 290k vectors, 256 dimensions, cosine
Currently implemented
- VECTOR(N) data type
- Text and direct binary input
- VECTOR index type using MHNSW algorithm
- Two distance functions
- Two tuning parameters
- Comfortable performance
Currently implemented (more)
- WHERE clause with SQL semantics (LIMIT after WHERE)
- post-filtering (see the sketch after this list)
- few vector databases support it
- pgvector supports it since 0.8.0
- 16-bit precision
- half the size of a 32-bit float
- not half-width floats (those have only 8–10 bits of precision)
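A hedged sketch of such a query: the WHERE clause keeps its usual SQL meaning and LIMIT picks the 5 nearest among the rows that pass the filter (doc_id < 1000 is an arbitrary illustrative condition; cur, q and array come from the searching example).

cur.execute("""
    SELECT doc_id FROM embeddings
    WHERE doc_id < 1000
    ORDER BY VEC_DISTANCE(%s, embedding)
    LIMIT 5
""", (array.array("f", q).tobytes(),))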
Supported by
- LlamaIndex — v0.12.11
- SpringAI — v1.0.0-M5
- Hibernate — 7.0
- LangChain — discussions/29203
- LangChain4J — pull/2225
- LangChainJS — pull/7360
Limitations
- Only one vector index per table
- Only NOT NULL values
- Only InnoDB, MyISAM, and Aria
- Partitioning is not supported
- XA is not supported
Future plans
- Distances (Hamming? L1? Jaccard?)
- Vector search algorithms (IVFFlat? DiskANN?)
- Observability improvements
- Hybrid search (with Reciprocal Rank Fusion)
- More performance (pre-filtering, quantization, etc.)
- Remove limitations
Where
- MariaDB Server 11.7
- MariaDB Enterprise Server 11.4