MariaDB Vector
- A project guided by the MariaDB Foundation, built with the MariaDB Server community. Main contributors – MariaDB plc, MariaDB Foundation, Amazon
- Enables fast vector search in a relational database.
- Keep your technology stack simple, no need for specialised datastores.
MariaDB Vector is now in preview!
MariaDB Vector brings vector similarity search to MariaDB. Vector similarity search is the general problem of storing vectors in a database and finding the ones closest to a query vector under a distance function.
This project will introduce specialized syntax in MariaDB as well as a new INDEX type to allow fast searching for vectors.
What’s available in the preview
One can create vector indexes on binary data. We’ll add a VECTOR data type soon after.
The indexing algorithm is a modified version of HNSW.
CREATE TABLE products (
name varchar(128),
description varchar(2000),
embedding BLOB NOT NULL,
VECTOR INDEX (embedding)
) ENGINE=InnoDB;
Vector distance functions:
# Euclidean distance
VEC_DISTANCE(embedding, VEC_FromText('[0.1, 0.4, 0.5, 0.3, 0.2]'))
Utility Vector Functions:
VEC_FromText('[...]')
VEC_ToText(<vector-bytes>)
Insert vectors
INSERT INTO products (name, description, embedding)
VALUES ('Coffee Machine',
'Built to make the best coffee you can imagine',
VEC_FromText('[0.3, 0.5, 0.2, 0.1]'))
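Application code usually holds an embedding as a list of floats, so it needs to be turned into the bracketed text shown in the INSERT above. The following is a minimal Python sketch of that conversion. The helper names `to_vec_text` and `from_vec_text` are hypothetical, and the sketch assumes VEC_FromText accepts a JSON-style array of numbers, as the examples on this page suggest.

```python
import json


def to_vec_text(values):
    """Format numbers as '[x, y, ...]' text for use with VEC_FromText (assumed format)."""
    floats = [float(v) for v in values]
    if not floats:
        raise ValueError("a vector needs at least one component")
    # allow_nan=False makes json.dumps raise ValueError on NaN or infinity,
    # which are not valid vector components in this text form.
    return json.dumps(floats, allow_nan=False)


def from_vec_text(text):
    """Parse '[x, y, ...]' text (such as VEC_ToText output) back into a list of floats."""
    return [float(v) for v in json.loads(text)]


embedding = [0.3, 0.5, 0.2, 0.1]
vec_text = to_vec_text(embedding)
print(vec_text)
```

The resulting string can be passed as a bound parameter to `VEC_FromText(...)`, so the vector text is not concatenated into the SQL statement itself.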
Vector search
SELECT p.name, p.description
FROM products AS p
ORDER BY VEC_DISTANCE(p.embedding,
VEC_FromText('[0.3, 0.5, 0.1, 0.3]'))
LIMIT 10
The MariaDB optimizer is tuned to leverage the vector index if the SELECT query has an ORDER BY VEC_DISTANCE clause and a LIMIT clause.
MariaDB Vector Performance
The MariaDB Vector preview implements a modified version of the Hierarchical Navigable Small Worlds (HNSW) algorithm. Search performance is comparable with other vector search implementations and surpasses them in scalability when multiple connections are used.
A full benchmark set is available at:
https://mariadb.com/resources/blog/how-fast-is-mariadb-vector/
Presentations
Overall introduction
Technical introduction
AI first applications with MariaDB Vector
Commentary at FOSDEM
Overview
For businesses: Use cases
- Recommendation systems:
- Build personalised product recommendations based on user preferences and behaviour.
- User interactions in natural language, not “search queries”.
- Similarity search:
- Implement powerful search functionalities to find similar images, documents, or multimedia content.
- Build your own technical guru with answers from your own documentation.
- Find related products in your store without manually labelling them.
- Machine learning:
- Store and retrieve vector representations of data for machine learning models.
- Easy clustering, get the closest data points, quickly.
For developers: How to use MariaDB Vector
- Get your preferred AI model set up to generate vector embeddings (OpenAI, Llama, Claude, Gemini, etc.). An easy one to try if you’re not familiar with any is https://huggingface.co/sentence-transformers.
- Add a vector column to your data table.
ALTER TABLE data ADD COLUMN embedding VECTOR(100);
- Create a specialised vector index.
CREATE VECTOR INDEX vec_index ON data (embedding);
- When adding data, ask your model for a Vector Embedding and store it alongside the document.
INSERT INTO data (document, embedding) VALUES (
'...a document...',
'...a vector with embeddings...');
- For any user prompt, ask your model first for the embedding and use it in the ORDER BY clause:
SELECT * FROM data WHERE document_owner_id=1234
ORDER BY VEC_DISTANCE(embedding, '...embeddings...') LIMIT 10
- This will get you the 10 most similar documents to the user prompt.
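The steps above can be sketched from the application side in Python. This is only an illustration under several assumptions: `embed` is a hypothetical stand-in for a real embedding model and produces no meaningful semantics, the `%s` placeholder style depends on the database driver you use, and wrapping the text in `VEC_FromText` follows the preview examples rather than a confirmed requirement for a VECTOR column.

```python
import json


def embed(text, dims=4):
    """Hypothetical placeholder for a real embedding model call (not semantic)."""
    counts = [0.0] * dims
    for i, byte in enumerate(text.encode("utf-8")):
        counts[(byte + i) % dims] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def insert_statement(document, embedding):
    """Build a parameterized INSERT that stores a document with its embedding."""
    sql = "INSERT INTO data (document, embedding) VALUES (%s, VEC_FromText(%s))"
    return sql, (document, json.dumps(embedding))


def search_statement(prompt_embedding, owner_id, limit=10):
    """Build a parameterized search for the documents closest to a prompt."""
    sql = (
        "SELECT * FROM data WHERE document_owner_id = %s "
        "ORDER BY VEC_DISTANCE(embedding, VEC_FromText(%s)) "
        f"LIMIT {int(limit)}"  # int() keeps the LIMIT value numeric
    )
    return sql, (owner_id, json.dumps(prompt_embedding))


doc = "...a document..."
insert_sql, insert_params = insert_statement(doc, embed(doc))
search_sql, search_params = search_statement(embed("user prompt"), owner_id=1234)
```

Each returned pair would be handed to a driver call such as `cursor.execute(sql, params)`. The same model must produce both the stored embeddings and the prompt embedding.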
For contributors: How the code is being developed
MariaDB Foundation coordinates:
- We set the direction
- We define specifications to match community requirements
- We review contributions on technical merits
- We ensure full CI/CD on buildbot.mariadb.org
- We promote use cases
- We evangelise and create learning tools for MariaDB
The MariaDB Server community develops:
- MariaDB plc codes vector index creation and search
- MariaDB plc codes other central infrastructure
- Numerous big tech industry players contribute developer resources and technical input: Amazon, Alibaba, Google, Microsoft, Automattic, PlanetScale, Acronis, Crayon, Wikimedia Foundation
- Individual community contributors develop alternative search algorithms
- Contributors work together in joint testing and benchmarking
Team
MariaDB Foundation
- Anna Widenius – Project manager
- Vicențiu Ciorbaru – Chief Development Officer MariaDB Foundation
- Vlad Bogolin – Use Case Developer, QA
- Kaj Arnö – MariaDB Foundation CEO
MariaDB plc
- Sergei Golubchik – Server Architect MariaDB Plc
- Michael Widenius – Founder of MySQL and MariaDB
Contributors
- Hugo Wen – Software Developer – Amazon
- Robert Silén – Use Case Evangelist, QA
- Patrick Reynolds – Software Engineer at PlanetScale
Project Technical Description
The current plan (subject to change):
- Utilize VARBINARY as a datatype for the first iteration. Other datatypes can be added later.
- Introduce a vector distance function in MariaDB:
VEC_DISTANCE(v1, v2)
- Introduce a way to create indexes on vector columns:
CREATE [HNSW|IVFFLAT|etc..] INDEX idx_name ON table (vector_column)
- Syntax of the form:
SELECT ... FROM table ORDER BY VEC_DISTANCE(column, constant) LIMIT n
will produce the approximate nearest n vectors to constant, according to VEC_DISTANCE.
Why is vector similarity search needed?
Vector search is important for any application that aims to build AI features. The basic use case is finding relevant documents based on a free-form user query.
To understand how this works we need to define a few concepts:
Vector embedding
A vector embedding is a way to represent words, phrases, or documents as dense vectors of real numbers. Words or phrases with similar meanings or contexts will have similar vector representations.
Large Language Model (LLM) for Embeddings
Large Language Models (LLMs) are powerful neural network models trained on vast amounts of text data to learn the patterns, structures, and meanings of language. Models used for embeddings take a piece of text as input and output its vector embedding.
Vector distances
Vector distances are measures used to quantify the similarity or dissimilarity between two vector embeddings. Commonly used measures include cosine similarity and Euclidean distance.
Note that one cannot mix and match embeddings from different models, because the key property of embeddings is their distance in relation to each other, and this distance varies from model to model. Each model has its strengths and weaknesses, and you will have to test different strategies to see which one works best.
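The two measures named in this section can be written out in a few lines of Python. This is a plain reference sketch of the standard formulas, not MariaDB's implementation, and the function names are chosen here for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Euclidean distance is sensitive to vector length, while cosine similarity only compares direction, so the two can rank the same candidates differently unless the embeddings are normalised.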
User experience
We’ll now assume we have a web store selling various consumer appliances. The simplest process for using vector embeddings to create an AI-powered search feature is:
- For each product in the shop’s inventory, concatenate the product name and description and send it to an Embedding LLM to generate a vector embedding. Store the resulting Vector in the same row as the product, as an additional column.
- Take user input as a string. Ex: “An espresso machine with a frother and a mill that can be connected to water mains directly and has a dedicated milk container.”
- Run the user input through the same Embedding LLM to generate a vector embedding.
- Retrieve the top 10 closest matches with an SQL query of the form:
SELECT name, description
FROM products p
ORDER BY VEC_DISTANCE(p.embedding, <embedding-from-user-query>)
LIMIT 10
This project seeks to optimize the performance of the last query, by using approximate nearest neighbour search.
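For comparison, the exact answer to that query can be computed by measuring the distance to every row and keeping the closest ones, as in the Python sketch below. The work grows linearly with the number of rows, which is the cost an approximate index is meant to avoid. The function name `nearest_exact` is chosen here, and the product rows other than the coffee machine are made up for illustration.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_exact(rows, query, limit=10):
    """Exact nearest neighbours: computes one distance per row, no index used."""
    return heapq.nsmallest(
        limit, rows, key=lambda row: euclidean_distance(row[1], query)
    )


# (name, embedding) pairs; the Kettle and Toaster vectors are invented examples.
products = [
    ("Coffee Machine", [0.3, 0.5, 0.2, 0.1]),
    ("Kettle", [0.2, 0.4, 0.1, 0.4]),
    ("Toaster", [0.9, 0.1, 0.8, 0.7]),
]

query = [0.3, 0.5, 0.1, 0.3]
for name, _ in nearest_exact(products, query, limit=2):
    print(name)
```

An approximate nearest neighbour index trades a small chance of missing one of the true closest rows for not having to visit every row.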