MariaDB Vector

  • A project guided by the MariaDB Foundation, built with the MariaDB Server community. Main contributors – MariaDB Corporation, MariaDB Foundation, Amazon
  • Enables fast vector search in a relational database.
  • Keep your technology stack simple, no need for specialised datastores.

MariaDB Vector is now in preview, soon available in 11.7 RC!

MariaDB Vector brings vector similarity search to MariaDB. Vector similarity search is the general problem of storing vectors in a database and finding them based on a distance function.

This project will introduce specialized syntax in MariaDB as well as a new INDEX type to allow fast searching for vectors.

What’s available in the upcoming RC

One can create vector indexes on binary data. We’ll add a VECTOR data type soon after.
The indexing algorithm is a modified version of HNSW.

CREATE TABLE products (
    name varchar(128),
    description varchar(2000),
    embedding VECTOR(1500) NOT NULL,
    VECTOR INDEX (embedding) M=8 DISTANCE=euclidean
) ENGINE=InnoDB;

Vector distance functions:

# Euclidean distance
VEC_DISTANCE(embedding, Vec_FromText('[0.1, 0.4, 0.5, 0.3, 0.2]'))

Utility vector functions:

VEC_FromText('[...]')  # JSON array of floats.
VEC_ToText(<vector-bytes>)
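
To illustrate what such a text-to-bytes conversion involves, here is a minimal Python sketch. It assumes the binary form is a packed array of little-endian 32-bit floats; that layout is an assumption made for illustration, not something stated above. The helper names vec_from_text and vec_to_text are hypothetical stand-ins, not MariaDB APIs.

```python
import json
import struct


def vec_from_text(text: str) -> bytes:
    """Parse a JSON array of numbers and pack it as float32 values.

    Assumption: little-endian 32-bit floats, chosen for illustration only.
    """
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    for v in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError("expected a JSON array of numbers")
    return struct.pack(f"<{len(values)}f", *values)


def vec_to_text(blob: bytes) -> str:
    """Unpack float32 values and render them as a JSON array."""
    if len(blob) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    values = struct.unpack(f"<{len(blob) // 4}f", blob)
    return json.dumps(list(values))
```

Under this assumption a vector with four components occupies 16 bytes. Values that are not exactly representable as 32-bit floats (0.3, for example) come back slightly different after a round trip.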

Insert vectors:

INSERT INTO products (name, description, embedding)
VALUES ('Coffee Machine',
        'Built to make the best coffee you can imagine',
        VEC_FromText('[0.3, 0.5, 0.2, 0.1]'))

Vector search:

SELECT p.name, p.description
FROM products AS p
ORDER BY VEC_DISTANCE(p.embedding,
                      VEC_FromText('[0.3, 0.5, 0.1, 0.3]'))
LIMIT 10

The MariaDB optimizer is tuned to leverage the vector index if the SELECT query has an ORDER BY VEC_DISTANCE clause and a LIMIT clause.

MariaDB Vector Performance

The MariaDB Vector preview implements a modified version of the Hierarchical Navigable Small Worlds (HNSW) algorithm. Search performance is comparable with other vector search implementations, and MariaDB Vector scales better when multiple connections are used.
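
For readers unfamiliar with HNSW, the following Python sketch shows the core idea on a single graph layer: a best-first walk over a neighbour graph that keeps a bounded list of the closest results found so far. This is not MariaDB's implementation, and it does not reflect MariaDB's modifications. It omits the layer hierarchy and incremental insertion, and it builds the graph by brute force. The names build_graph, search_layer, m and ef are illustrative.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_graph(points, m):
    """Link every point to its m nearest neighbours (brute force), symmetrically."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = sorted(others, key=lambda j: euclidean(p, points[j]))[:m]
        for j in nearest:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def search_layer(points, graph, query, entry, ef):
    """Best-first search from `entry`, keeping at most `ef` results.

    Returns point ids ordered from closest to farthest.
    """
    d0 = euclidean(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negated distance: worst result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # the closest candidate is farther than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = euclidean(query, points[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return [node for _, node in sorted((-nd, node) for nd, node in results)]
```

The search only computes distances for nodes it reaches through the graph, which is why this family of algorithms returns approximate results: a poorly connected graph or a small ef can miss a true nearest neighbour.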

Figure: Single-threaded query performance (up and to the right is better)
Figure: Multi-threaded query performance (up and to the right is better)

A full benchmark set is available at:
https://mariadb.com/resources/blog/how-fast-is-mariadb-vector/


Presentations

Overall introduction

Technical introduction

AI first applications with MariaDB Vector

Commentary at FOSDEM

Overview

For businesses: Use cases

  • Recommendation systems:
    • Build personalised product recommendations based on user preferences and behaviour.
    • User interactions in natural language, not “search queries”.
  • Similarity search:
    • Implement powerful search functionalities to find similar images, documents, or multimedia content.
    • Build your own technical guru with answers from your own documentation.
    • Find related products in your store without manually labelling them.
  • Machine learning:
    • Store and retrieve vector representations of data for machine learning models.
    • Easy clustering, get the closest data points, quickly.

For developers: How to use MariaDB Vector

  • Get your preferred AI model set up to generate vector embeddings (OpenAI, Llama, Claude, Gemini, etc.). An easy one to try if you’re not familiar with any is https://huggingface.co/sentence-transformers.
  • Add a vector column to your data table.
ALTER TABLE data ADD COLUMN embedding VECTOR(100);
  • Create a specialised vector index.
CREATE VECTOR INDEX vec_index ON data (embedding);
  • When adding data, ask your model for a Vector Embedding and store it alongside the document.
INSERT INTO data (document, embedding) VALUES (
  '...a document...',
  '...a vector with embeddings...');
  • For any user prompt, ask your model first for the embedding and use it in the ORDER BY clause:
SELECT * FROM data WHERE document_owner_id=1234
ORDER BY VEC_DISTANCE(embedding, '...embeddings...') LIMIT 10
  • This will get you the 10 most similar documents to the user prompt.
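
The steps above can be sketched from the application side as follows. This is a minimal Python sketch under stated assumptions: toy_embedding is a hypothetical stand-in for a real embedding model and produces no semantically meaningful vectors. The statements wrap the embedding in VEC_FromText, and the `?` placeholder style is an assumption about the database driver in use. No database connection is made; the functions only build the statement text and its parameters.

```python
import json


def toy_embedding(text, dims=100):
    """Hypothetical placeholder for a real embedding model (not meaningful)."""
    vec = [0.0] * dims
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def insert_statement(document, embedding):
    """Build the INSERT for one document and its embedding."""
    sql = "INSERT INTO data (document, embedding) VALUES (?, VEC_FromText(?))"
    return sql, (document, json.dumps(embedding))


def search_statement(owner_id, query_embedding, limit=10):
    """Build the nearest-documents query for one user prompt."""
    sql = (
        "SELECT * FROM data WHERE document_owner_id = ? "
        "ORDER BY VEC_DISTANCE(embedding, VEC_FromText(?)) "
        f"LIMIT {int(limit)}"
    )
    return sql, (owner_id, json.dumps(query_embedding))


# Usage: embed the prompt, then hand the statement and parameters to your driver.
sql, params = search_statement(1234, toy_embedding("quiet espresso machine"))
```

Passing the embedding as a bound parameter rather than concatenating it into the SQL string keeps the query safe from injection and lets the same statement text be reused across prompts.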

For contributors: How the code is being developed

MariaDB Foundation coordinates:

  • We set the direction
  • We define specifications to match community requirements
  • We review contributions on technical merits
  • We ensure full CI/CD on buildbot.mariadb.org
  • We promote use cases
  • We evangelise and create learning tools for MariaDB

The MariaDB Server community develops:

  • MariaDB plc codes vector index creation and search
  • MariaDB plc codes other central infrastructure
  • Numerous big tech industry players contribute developer resources and technical input: Amazon, Alibaba, Google, Microsoft, Automattic, PlanetScale, Acronis, Crayon, Wikimedia Foundation
  • Individual community contributors develop alternative search algorithms
  • Contributors work together in joint testing and benchmarking

Team

MariaDB Foundation

MariaDB Corporation

Contributors

Project Technical Description

The current plan:

  • Provide a VECTOR datatype.
  • Introduce vector distance functions in MariaDB: VEC_DISTANCE(v1, v2) and VEC_DISTANCE_COSINE(v1, v2).
  • Introduce a way to create indexes on vector columns: CREATE VECTOR INDEX idx_name ON table (vector_column)
  • Syntax of the form:
    SELECT ... FROM table ORDER BY VEC_DISTANCE(column, constant) LIMIT n will produce the approximate nearest n vectors to constant, according to VEC_DISTANCE.

Why is vector similarity search needed?

Vector search is important for any application that wants to build AI features. The basic use case is finding relevant documents based on a free-form user query.
To understand how this works we need to define a few concepts:

Vector embedding
A vector embedding is a way to represent words, phrases, or documents as dense vectors of real numbers. Words or phrases with similar meanings or contexts will have similar vector representations.

Large Language Model (LLM) for Embeddings
Large Language Models (LLMs) are powerful neural network models trained on vast amounts of text data to learn the patterns, structures, and meanings of language and output vector embeddings.

Vector distances
Vector distances are measures used to quantify the similarity or dissimilarity between two vector embeddings. Commonly used measures are cosine similarity and Euclidean distance.
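
As a concrete illustration of these two measures, here is a small Python sketch. The function names are illustrative, and the cosine distance shown uses the common definition of one minus cosine similarity, which is an assumption rather than a statement about how any MariaDB function is implemented.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Common convention: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)
```

The two measures can disagree: [1, 0] and [2, 0] point in the same direction, so their cosine distance is zero, yet their Euclidean distance is 1. Which one fits best depends on how the embedding model was trained.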

Note that one cannot mix and match embeddings from different models, because the key property of embeddings is their distance in relation to each other, and this distance varies from model to model. Each model has its strengths and weaknesses, so you will have to test different strategies to see which one works best.

User experience

Assume we have a web store selling various consumer appliances. The simplest process for using vector embeddings to create an AI-powered search feature is:

  • For each product in the shop’s inventory, concatenate the product name and description and send it to an Embedding LLM to generate a vector embedding. Store the resulting Vector in the same row as the product, as an additional column.
  • Take user input as a string. Ex: “An espresso machine with a frother and a mill that can be connected to water mains directly and has a dedicated milk container.”
  • Run the user input through the same Embedding LLM to generate a vector embedding.
  • Retrieve the top 10 closest matches with an SQL query of the form:
SELECT name, description
FROM products p
ORDER BY VEC_DISTANCE(p.embedding,
                      VEC_FromText(<embedding-from-user-query>))
LIMIT 10

This project seeks to optimize the performance of the last query, by using approximate nearest neighbour search.
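
To make clear what is being approximated, here is a minimal Python sketch of the exact answer to that query: compute the distance from the query vector to every row and keep the closest ones. The function name exact_nearest and the sample rows are illustrative, and Euclidean distance is assumed.

```python
import heapq
import math


def exact_nearest(rows, query, k=10):
    """Return the k rows closest to `query` by Euclidean distance.

    rows: iterable of (name, embedding) pairs. Every row is examined,
    so the cost grows linearly with the size of the table.
    """
    return heapq.nsmallest(k, rows, key=lambda row: math.dist(row[1], query))


# Illustrative two-dimensional embeddings (real ones have many more dimensions).
products = [
    ("Espresso machine", [0.9, 0.1]),
    ("Kettle", [0.5, 0.5]),
    ("Toaster", [0.1, 0.9]),
]
closest = exact_nearest(products, [0.8, 0.2], k=2)
```

An approximate nearest neighbour index trades a small chance of missing one of these exact results for not having to scan every row.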