MariaDB Vector in Laravel: insights on choosing an embedding model

laravel-mariadb-vector is an open-source project by Erik Ros, bringing MariaDB’s native vector search to Laravel’s Eloquent ORM. In his guest post, Erik shares how it works, and his insights about picking an embedding model.

I maintain laravel-mariadb-vector, a small open-source package that brings MariaDB’s native vector search to Laravel’s Eloquent ORM. It’s my first open-source project. It has over 100 installs and no marketing budget, and it exists because I needed it.

This post is a quick introduction, followed by an experiment with 2,942 job titles in English and Dutch. The experiment shows that the embedding model you pick, and how you use it, matter far more than you might expect.

The gist

Why vector search in MariaDB
How to use the package: migration, model, similarity query
The experiment: how much model choice, prompting, and language impact retrieval quality

Why MariaDB for vectors

I am a LAMP guy (Linux, Apache, MySQL, PHP). Well, back in the day I was; now the stack includes Nginx and MariaDB. I like to build REST backends in Laravel. Last year I had several projects where semantic search or matching might work well. For a recruitment platform, I needed to match candidates to job openings. For an AI newsroom, I used it for subject matching and deduplication. I am also building Archivus, an open-source, air-gapped document archive with RAG search (in development). All of these applications benefit from embeddings stored alongside the application data.

The ‘conventional’ answer in 2025 was: add a vector database, or switch to Postgres for pgvector. I am a creature of habit, so neither appealed. A separate vector store is another moving part to deploy, back up, and keep consistent with the primary database. And pgvector is good software, but not my stack. Migrating databases to gain one column type seemed excessive.

So I landed on MariaDB 11.8 LTS with vector search: a VECTOR(N) column type, VEC_DISTANCE_COSINE/VEC_DISTANCE_EUCLIDEAN functions, and a modified-HNSW VECTOR INDEX. Vectors land in the database I already run. Rock&Roll.

The only thing missing was Laravel integration. Eloquent didn’t support MariaDB VECTOR. So I built the missing piece.

The package in three snippets

Install:

composer require devilsberg/laravel-mariadb-vector

Migration — a vector column and its index are one line each:

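use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

Schema::create('job_titles', function (Blueprint $table) {
    $table->id();
    $table->string('title_en')->nullable();
    $table->string('title_nl')->nullable();
    $table->vector('embedding', 768)->nullable(); // 768 dimensions, matching the embedding model's output
    $table->vectorIndex('embedding');             // vector index for nearest-neighbor search
    $table->timestamps();
});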

Model + query — VectorCast converts between MariaDB’s binary vector format and plain PHP float arrays, and the query macros generate the VEC_DISTANCE_* SQL:

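use Illuminate\Database\Eloquent\Model;
// VectorCast ships with the package; import it from the package's namespace.

class JobTitle extends Model
{
    protected function casts(): array
    {
        // Read and write the embedding column as a plain PHP float array.
        return ['embedding' => VectorCast::class];
    }
}

// $queryVector is the embedding of the search text, produced by the same model as the stored vectors.
$neighbors = JobTitle::query()
    ->selectVectorDistance('embedding', $queryVector, 'distance')
    ->orderByVectorDistance('embedding', $queryVector)
    ->limit(5)
    ->get();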

That’s it.

What does it do?

Semantic search helps you find different words with similar meaning. For instance, if you search for a software engineer, you probably also want software developers in your results. Working with embeddings was new to me, so to get a better understanding of what to expect, I ran an experiment.

At the time I was working on a bilingual Dutch/English recruitment platform, so I needed to make sure that Dutch job titles could be searched semantically. For instance, a Dutch word for (software) developer is programmeur. If someone searches for programmeur, you want to find someone who calls themselves a software developer.

I built the experiment on public data: the ESCO occupation taxonomy — 2,942 occupations, each with an official English and Dutch title for the same concept. That gives a decent test: embed all titles, then for each Dutch title search the nearest English titles. The matching concept’s English title should rank first. “Verpleegkundige” should find “nurse”.
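The test described above boils down to one question per Dutch title: at which position does the matching English title show up? Below is a minimal in-memory sketch in plain PHP. It is a stand-in for the database query, not the code used in the experiment; the function names (cosineDistance, rankOfExpected) and the two-dimensional toy vectors are illustrative assumptions.

```php
<?php
declare(strict_types=1);

/** Cosine distance = 1 - cosine similarity. 0 means same direction; larger means further apart. */
function cosineDistance(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }
    $a = array_values($a);
    $b = array_values($b);
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $y = $b[$i];
        $dot += $x * $y;
        $normA += $x * $x;
        $normB += $y * $y;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Cosine distance is undefined for a zero vector.');
    }
    return 1.0 - $dot / sqrt($normA * $normB);
}

/**
 * 1-based rank of the expected concept among all candidates, ordered by distance to the query.
 *
 * @param array<string, float[]> $candidates concept id => embedding of the English title
 */
function rankOfExpected(array $queryVector, array $candidates, string $expectedId): int
{
    $distances = [];
    foreach ($candidates as $id => $vector) {
        $distances[$id] = cosineDistance($queryVector, $vector);
    }
    asort($distances); // ascending: nearest first, keys preserved

    // PHP turns numeric string keys into ints, so normalise before the strict search.
    $orderedIds = array_map('strval', array_keys($distances));
    $position = array_search($expectedId, $orderedIds, true);
    if ($position === false) {
        throw new InvalidArgumentException("Unknown concept id: {$expectedId}");
    }
    return $position + 1;
}

// Toy data: hypothetical 2-dimensional "embeddings" instead of real 384- or 768-dimensional ones.
$english = [
    'nurse'   => [1.0, 0.0],
    'lawyer'  => [0.0, 1.0],
    'plumber' => [0.7, 0.7],
];
$dutchQuery = [0.9, 0.1]; // stand-in for the embedding of "verpleegkundige"

$rankNurse  = rankOfExpected($dutchQuery, $english, 'nurse');  // 1
$rankLawyer = rankOfExpected($dutchQuery, $english, 'lawyer'); // 3
```

With real embeddings the ordering would come from the database; the in-memory version is only meant to make the ranking step explicit.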

I started with two locally-run models via Ollama, both popular defaults:

all-MiniLM — 384 dimensions, English-only
EmbeddingGemma 300m — 768 dimensions, multilingual (requires more storage)

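Semantic search ranks results by proximity in vector space, scored as a distance (lower = closer) or a similarity (higher = closer); the range depends on the metric. I measured three things: recall@1 and recall@5 (the share of queries where the correct English title appears in the top 1 or top 5 results) and MRR@10 (the mean of 1/rank of the correct title, counting 0 when it falls outside the top 10). I ran the test over the 2,822 pairs with distinct EN/NL labels; the other 120 have the same title in both languages: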

Model	recall@1	recall@5	MRR@10
all-MiniLM (English-only)	14.2%	23.2%	0.181
EmbeddingGemma (multilingual)	52.2%	71.3%	0.602

That surprised me: a 3.7× jump in how often the right title ranks first — recall@1 went from 14% to 52%, just from changing the model.
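For reference, the metrics in the table can all be derived from one number per query: the 1-based rank of the correct English title. The following is a minimal sketch in plain PHP using the standard definitions; the function names (recallAtK, mrrAtK) and the toy ranks are illustrative and not taken from the experiment's code. Because every query has exactly one correct answer, recall@k here is simply the hit rate within the top k.

```php
<?php
declare(strict_types=1);

/**
 * Share of queries whose correct answer appears within the top $k results.
 *
 * @param array<int|null> $ranks 1-based rank of the correct answer per query, null if it was not retrieved
 */
function recallAtK(array $ranks, int $k): float
{
    if ($ranks === []) {
        throw new InvalidArgumentException('At least one query is required.');
    }
    $hits = 0;
    foreach ($ranks as $rank) {
        if ($rank !== null && $rank <= $k) {
            $hits++;
        }
    }
    return $hits / count($ranks);
}

/**
 * Mean reciprocal rank, cut off at $k: a query contributes 1/rank when the
 * correct answer is within the top $k, and 0 otherwise.
 *
 * @param array<int|null> $ranks
 */
function mrrAtK(array $ranks, int $k): float
{
    if ($ranks === []) {
        throw new InvalidArgumentException('At least one query is required.');
    }
    $sum = 0.0;
    foreach ($ranks as $rank) {
        if ($rank !== null && $rank <= $k) {
            $sum += 1.0 / $rank;
        }
    }
    return $sum / count($ranks);
}

// Toy example with five queries (made-up ranks, not the experiment's data).
$ranks = [1, 3, 12, 1, null];

$recallAt1 = recallAtK($ranks, 1); // 2 of 5 queries -> 0.4
$recallAt5 = recallAtK($ranks, 5); // 3 of 5 queries -> 0.6
$mrrAt10   = mrrAtK($ranks, 10);   // (1 + 1/3 + 0 + 1 + 0) / 5, about 0.467
```

MRR rewards near misses: a correct title at rank 3 still contributes a third of a point, whereas recall@1 treats it the same as a complete miss.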

Specific failures are more instructive than the numbers. Ask all-MiniLM for the nearest English titles to “advocaat” (lawyer) and its top match is “advertising assistant”. It never learned Dutch, so it falls back to surface features, and “adv…” looks similar. “Leraar basisonderwijs” (primary school teacher) returns roofer, weaver, riveter. EmbeddingGemma puts lawyer and primary school teacher at rank 1.

The counterpoint: “loodgieter” (plumber) stumps the multilingual model too. Its top 3 is gauger, rigger, verger… (I had to actually look up what those mean), so neither model gets a perfect score. The lesson isn’t “use model X”; it’s “evaluate before you commit”.

Qualitative examples

The full top-3 makes the pattern obvious. Numbers in parentheses are cosine distance — smaller means closer in meaning:

Dutch query (expected English)	English-only model (all-MiniLM)	Multilingual model (EmbeddingGemma)
advocaat (lawyer)	advertising assistant (0.56), agronomist (0.63), advertising specialist (0.63)	lawyer (0.07), corporate lawyer (0.19), legal consultant (0.20)
leraar basisonderwijs (primary school teacher)	roofer (0.66), weaver (0.68), riveter (0.68)	primary school teacher (0.27), early years teacher (0.29), language school teacher (0.30)
vrachtwagenchauffeur (cargo vehicle driver)	dairy products maker (0.66), roustabout (0.67), confectioner (0.68)	cargo vehicle driver (0.16), moving truck driver (0.20), carriage driver (0.21)
dierenverpleegkundige (veterinary nurse)	conservator (0.66), dramaturge (0.67), subtitler (0.69)	veterinary nurse (0.21), animal physiotherapist (0.22), veterinary technician (0.24)
loodgieter (plumber)	riveter (0.50), clarifier (0.52), ombudsman (0.57)	gauger (0.25), rigger (0.28), verger (0.32) — both miss

The English-only model’s wrong answers cluster on surface spelling (advocaat → advertising) or pure noise (plumber → riveter); the multilingual model not only lands the right title but sits much closer to it (0.07 vs 0.56 for lawyer).

Going deeper: how you use the model matters too

Picking EmbeddingGemma over all-MiniLM was a no-brainer, but the next finding caught me off guard: the same model gets meaningfully better when you prompt it correctly. EmbeddingGemma is trained to read a short instruction prepended to the text: “task: search result | query: …” for a search query, and “title: none | text: …” for a stored item. I had been embedding raw titles. Adding the prompts lifted recall@1 from 52.2% to 59.6%, a gain of more than seven points.
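In code, adding the prompts is plain string concatenation done before the text goes to the embedding model. Here is a minimal sketch, assuming the Dutch titles act as search queries and the English titles as stored items; the two helper names are hypothetical, and only the prefix strings come from the description above.

```php
<?php
declare(strict_types=1);

/** Prefix for text that is used as a search query (hypothetical helper name). */
function embeddingGemmaQueryPrompt(string $text): string
{
    return 'task: search result | query: ' . $text;
}

/** Prefix for text that is stored and searched against, with no separate title (hypothetical helper name). */
function embeddingGemmaDocumentPrompt(string $text): string
{
    return 'title: none | text: ' . $text;
}

// The strings that would be sent to the embedding model instead of the raw titles.
$queryInput    = embeddingGemmaQueryPrompt('verpleegkundige'); // "task: search result | query: verpleegkundige"
$documentInput = embeddingGemmaDocumentPrompt('nurse');        // "title: none | text: nurse"
```

The prompt is part of the embedded text, so changing it means re-embedding the stored items as well as the queries.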

Model	recall@1	recall@5	MRR@10
all-MiniLM (English-only)	14.2%	23.2%	0.181
EmbeddingGemma — raw titles	52.2%	71.3%	0.602
EmbeddingGemma — with prompts	59.6%	77.2%	0.673
bge-m3 (1024-dim, multilingual)	58.5%	76.8%	0.663

No single model wins

That last table row is an interesting one. bge-m3 is bigger and more aggressively multilingual than EmbeddingGemma, so on paper it is the “better” model. Yet it only roughly tied with prompted EmbeddingGemma (58.5% vs 59.6% recall@1). And when I looked at where each model succeeds, they turned out to have opposite strengths. Take one concept, “bookkeeper”:

In English, EmbeddingGemma nails it: bookkeeper → accountant, accounting assistant, billing clerk. bge-m3 bungles it: bookkeeper → bookmaker, bookshop manager, zookeeper (matching letters, not meaning).
In Dutch, they swap. EmbeddingGemma gives boekhouder → uitgever van boeken (book publisher); bge-m3 recovers: boekhouder → hoofd boekhouding, assistent-boekhouder.

The model that’s stronger in English is weaker in Dutch, and vice versa. You can’t read that off a benchmark leaderboard — the only way to know is to test on your data, in your language. Because the embeddings live in MariaDB right next to that data, that test is just a query. To me that’s a real case for vectors in the database: the ease with which we evaluate and iterate.

Where this is going

For Archivus I’ve since combined semantic search with MariaDB’s full-text search, using Reciprocal Rank Fusion to merge the two result lists into a single ranking for the RAG search function. That’s probably worth its own post.
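As a rough illustration of the idea, Reciprocal Rank Fusion gives every document a score of 1 / (k + rank) for each result list it appears in and then sums those scores. Below is a minimal sketch in plain PHP. It is not the Archivus implementation; the function name, the document ids, and k = 60 (a commonly used default) are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Fuse several ranked result lists with Reciprocal Rank Fusion.
 *
 * @param array<int, array<int, int|string>> $rankings each inner list holds document ids, best first
 * @param int $k smoothing constant; 60 is a common default
 * @return array<int|string, float> document id => fused score, highest score first
 */
function reciprocalRankFusion(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ranking) {
        // array_values() guarantees positions 0, 1, 2, ... regardless of the input keys.
        foreach (array_values($ranking) as $position => $id) {
            $rank = $position + 1; // ranks are 1-based
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $rank);
        }
    }
    arsort($scores); // highest fused score first, keys preserved
    return $scores;
}

// Hypothetical document ids returned by the two searches, best match first.
$semanticResults = [12, 7, 3];
$fullTextResults = [7, 40, 12];

$fused = reciprocalRankFusion([$semanticResults, $fullTextResults]);
$order = array_keys($fused); // [7, 12, 40, 3]
```

Documents found by both searches (7 and 12) rise to the top. RRF only uses rank positions, so the cosine distances and full-text relevance scores never need to be put on a common scale.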

The package is MIT-licensed.

Issues and PRs are welcome, especially from anyone running it on datasets larger than mine. Questions are also welcome; you can reach me through my website or on LinkedIn.

The author Erik Ros is the founder of Devilsberg, where he leads engineering teams. He builds: