MariaDB Vector in Laravel: insights on choosing an embedding model
laravel-mariadb-vector is an open-source project by Erik Ros, bringing MariaDB’s native vector search to Laravel’s Eloquent ORM. In his guest post, Erik shares how it works, and his insights about picking an embedding model.
I maintain laravel-mariadb-vector, a small open source package that brings MariaDB’s native vector search to Laravel’s Eloquent ORM. It’s my first open source project. It has over 100 installs and no marketing budget, and it exists because I needed it.
This post is a quick introduction, plus an experiment with 2,942 job titles in English and Dutch that shows why the embedding model you pick, and how you use it, matter far more than you might expect.
The gist
- Why vector search in MariaDB
- How to use the package: migration, model, similarity query
- The experiment: how much model choice, prompting, and language impact retrieval quality
Why MariaDB for vectors
I am a LAMP guy (Linux, Apache, MySQL, PHP). Well, back in the day I was; now the stack includes Nginx and MariaDB. I like to build REST backends in Laravel. Last year, I wanted to build a few things where semantic search and matching might work well. For a recruitment platform, I needed to match candidates to job openings. For an AI newsroom, I used it for subject matching and deduplication. And I am building Archivus, an open-source, air-gapped document archive with RAG search (in development). All of these applications benefit from embeddings stored alongside the application data.
The ‘conventional’ answer in 2025 was: add a vector database, or switch to Postgres for pgvector. I am a creature of habit, so neither appealed. A separate vector store is another moving part to deploy, back up, and keep consistent with the primary database. And pgvector is good software, but not my stack. Migrating databases to gain one column type seemed excessive.
So I landed on MariaDB 11.8 LTS with vector search: a VECTOR(N) column type, VEC_DISTANCE_COSINE/VEC_DISTANCE_EUCLIDEAN functions, and a modified-HNSW VECTOR INDEX. Vectors land in the database I already run. Rock&Roll.
The only thing missing was Laravel integration. Eloquent didn’t support MariaDB VECTOR. So I built the missing piece.
The package in three snippets
Install:
    composer require devilsberg/laravel-mariadb-vector
Migration — a vector column and its index are one line each:
    Schema::create('job_titles', function (Blueprint $table) {
        $table->id();
        $table->string('title_en')->nullable();
        $table->string('title_nl')->nullable();
        $table->vector('embedding', 768)->nullable(); // 768 = EmbeddingGemma's output size
        $table->vectorIndex('embedding');
        $table->timestamps();
    });
Model + query — VectorCast converts between MariaDB’s binary vector format and plain PHP float arrays, and the query macros generate the VEC_DISTANCE_* SQL:
    class JobTitle extends Model
    {
        protected function casts(): array
        {
            return ['embedding' => VectorCast::class];
        }
    }

    // $queryVector is a plain PHP array of floats: the embedding of the search text.
    $neighbors = JobTitle::query()
        ->selectVectorDistance('embedding', $queryVector, 'distance')
        ->orderByVectorDistance('embedding', $queryVector)
        ->limit(5)
        ->get();
That’s it.
What does it do?
Semantic search helps you find different words with similar meanings. For instance, if you search for a software engineer, you probably also want software developers in your results. Working with embeddings was new to me, so to get a better understanding of what to expect, I ran an experiment.
At the time, I was working on a Dutch/bilingual recruitment platform, so I needed to make sure that Dutch job titles could be searched semantically. For instance, the Dutch word for (software) developer is programmeur. If someone searches for programmeur, you want to find someone who calls themselves a software developer.
I built the experiment on public data: the ESCO occupation taxonomy — 2,942 occupations, each with an official English and Dutch title for the same concept. That gives a decent test: embed all titles, then for each Dutch title search the nearest English titles. The matching concept’s English title should rank first. “Verpleegkundige” should find “nurse”.
I started with two locally-run models via Ollama, both popular defaults:
- all-MiniLM — 384 dimensions, English-only
- EmbeddingGemma 300m — 768 dimensions, multilingual (requires more storage)
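For readers who want to reproduce the setup, here is a minimal sketch of fetching embeddings from a local Ollama server in plain PHP. It is not the code used for the experiment. It assumes Ollama's /api/embed endpoint on the default port 11434 and a model tag such as embeddinggemma; the function names are made up for illustration.

```php
<?php
declare(strict_types=1);

/** Build the JSON body for Ollama's /api/embed endpoint (one request, many texts). */
function ollamaEmbedPayload(string $model, array $texts): string
{
    return json_encode(
        ['model' => $model, 'input' => array_values($texts)],
        JSON_THROW_ON_ERROR
    );
}

/** Extract the list of float vectors from an /api/embed response body. */
function parseEmbedResponse(string $json): array
{
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data) || !isset($data['embeddings']) || !is_array($data['embeddings'])) {
        throw new RuntimeException('Unexpected response: no "embeddings" key');
    }

    return array_map(
        fn (array $vector): array => array_map('floatval', $vector),
        $data['embeddings']
    );
}

/** Embed a batch of texts. Assumption: Ollama listens on its default local port. */
function ollamaEmbed(string $model, array $texts, string $baseUrl = 'http://localhost:11434'): array
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => ollamaEmbedPayload($model, $texts),
        'timeout' => 60,
    ]]);

    $body = file_get_contents($baseUrl . '/api/embed', false, $context);
    if ($body === false) {
        throw new RuntimeException('Could not reach Ollama at ' . $baseUrl);
    }

    return parseEmbedResponse($body);
}
```

With a running server, ollamaEmbed('embeddinggemma', ['verpleegkundige', 'nurse']) would return one float array per input text, ready to assign to the model's embedding attribute.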
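Semantic search ranks results by proximity in vector space, scored as a distance (lower = closer) or similarity (higher = closer); the range depends on the metric. I ran the test over the 2,822 pairs with distinct EN/NL labels (the remaining 120 occupations have the same title in both languages). Recall@k is the share of queries whose expected title appears in the top k results, and MRR@10 is the mean reciprocal rank of the expected title, counted as zero when it falls outside the top 10: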
| Model | recall@1 | recall@5 | MRR@10 |
|---|---|---|---|
| all-MiniLM (English-only) | 14.2% | 23.2% | 0.181 |
| EmbeddingGemma (multilingual) | 52.2% | 71.3% | 0.602 |
That surprised me: a 3.7× jump in how often the right title ranks first — recall@1 went from 14% to 52%, just from changing the model.
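The metrics in the table are simple to compute once you know, for each query, at which rank the expected title came back. The sketch below uses the standard definitions of recall@k and MRR@k. The function names are illustrative, and it is not the evaluation script behind the numbers above.

```php
<?php
declare(strict_types=1);

/** 1-based rank of the expected ID in a ranked result list, or null if it is absent. */
function rankOf(int $expectedId, array $rankedIds): ?int
{
    $position = array_search($expectedId, array_values($rankedIds), true);

    return $position === false ? null : $position + 1;
}

/** Share of queries whose expected result appears within the top $k. */
function recallAtK(array $ranks, int $k): float
{
    if ($ranks === []) {
        throw new InvalidArgumentException('Need at least one query');
    }
    $hits = 0;
    foreach ($ranks as $rank) {
        if ($rank !== null && $rank <= $k) {
            $hits++;
        }
    }

    return $hits / count($ranks);
}

/** Mean reciprocal rank; a miss, or a hit below rank $k, contributes zero. */
function mrrAtK(array $ranks, int $k): float
{
    if ($ranks === []) {
        throw new InvalidArgumentException('Need at least one query');
    }
    $sum = 0.0;
    foreach ($ranks as $rank) {
        if ($rank !== null && $rank <= $k) {
            $sum += 1 / $rank;
        }
    }

    return $sum / count($ranks);
}
```

For four queries whose expected titles land at ranks 1, 3, not found, and 12, recall@1 is 0.25, recall@5 is 0.5, and MRR@10 is (1 + 1/3) / 4, roughly 0.33.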
Specific failures are more instructive than the numbers. Ask all-MiniLM for the nearest English titles to “advocaat” (lawyer) and its top match is “advertising assistant”. It never learned Dutch, so it falls back to surface features, and “adv…” looks similar. “Leraar basisonderwijs” (primary school teacher) returns roofer, weaver, riveter. EmbeddingGemma puts lawyer and primary school teacher at rank 1.
The counterpoint: “loodgieter” (plumber) stumps the multilingual model too. Its top 3 is gauger, rigger, verger (I had to actually look up what those mean), so neither model gets a perfect score. The lesson isn’t “use model X”; it’s “evaluate before you commit”.
Qualitative examples
The full top-3 makes the pattern obvious. Numbers in parentheses are cosine distance — smaller means closer in meaning:
| Dutch query (expected English) | English-only model (all-MiniLM) | Multilingual model (EmbeddingGemma) |
|---|---|---|
| advocaat (lawyer) | advertising assistant (0.56), agronomist (0.63), advertising specialist (0.63) | lawyer (0.07), corporate lawyer (0.19), legal consultant (0.20) |
| leraar basisonderwijs (primary school teacher) | roofer (0.66), weaver (0.68), riveter (0.68) | primary school teacher (0.27), early years teacher (0.29), language school teacher (0.30) |
| vrachtwagenchauffeur (cargo vehicle driver) | dairy products maker (0.66), roustabout (0.67), confectioner (0.68) | cargo vehicle driver (0.16), moving truck driver (0.20), carriage driver (0.21) |
| dierenverpleegkundige (veterinary nurse) | conservator (0.66), dramaturge (0.67), subtitler (0.69) | veterinary nurse (0.21), animal physiotherapist (0.22), veterinary technician (0.24) |
| loodgieter (plumber) | riveter (0.50), clarifier (0.52), ombudsman (0.57) | gauger (0.25), rigger (0.28), verger (0.32) — both miss |
The English-only model’s wrong answers cluster on surface spelling (advocaat → advertising) or pure noise (plumber → riveter); the multilingual model not only lands the right title but sits much closer to it (0.07 vs 0.56 for lawyer).
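To make the numbers in the table concrete: cosine distance is usually defined as 1 minus the cosine similarity of two vectors, so 0 means "same direction" and values near 1 mean "unrelated". The sketch below computes it in plain PHP under that assumption. In the application itself the database does this work through VEC_DISTANCE_COSINE; the function name here is only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Cosine distance between two equal-length vectors: 1 - (a . b) / (|a| * |b|).
 * Ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite).
 */
function cosineDistance(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length');
    }

    $a = array_values($a);
    $b = array_values($b);
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $y = $b[$i];
        $dot += $x * $y;
        $normA += $x * $x;
        $normB += $y * $y;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Cosine distance is undefined for a zero vector');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}
```

Because only the direction of the vectors matters, scaling a vector does not change the result: [1, 2] and [2, 4] are at distance 0.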
Going deeper: how you use the model matters too
Picking EmbeddingGemma over all-MiniLM was a no-brainer, but the next finding caught me off guard: the same model gets meaningfully better when you prompt it correctly. EmbeddingGemma is trained to read a short instruction glued to the front of the text: “task: search result | query: …” for a search query, and “title: none | text: …” for a stored item. I had been embedding raw titles. Adding the prompts lifted recall@1 from 52.2% to 59.6%, another 7.4 points.
| Model | recall@1 | recall@5 | MRR@10 |
|---|---|---|---|
| all-MiniLM (English-only) | 14.2% | 23.2% | 0.181 |
| EmbeddingGemma — raw titles | 52.2% | 71.3% | 0.602 |
| EmbeddingGemma — with prompts | 59.6% | 77.2% | 0.673 |
| bge-m3 (1024-dim, multilingual) | 58.5% | 76.8% | 0.663 |
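The two prompt formats described above are easy to get wrong by hand, so it can help to wrap them in small helpers and call the right one depending on whether you are embedding a search query or a stored item. This is a sketch with hypothetical function names; the prefixes themselves are the ones quoted earlier.

```php
<?php
declare(strict_types=1);

/** Prefix for text that is used as a search query. */
function gemmaQueryPrompt(string $text): string
{
    return 'task: search result | query: ' . $text;
}

/** Prefix for text that is stored and searched against (no separate title here). */
function gemmaDocumentPrompt(string $text): string
{
    return 'title: none | text: ' . $text;
}

// Stored English titles get the document prompt, the Dutch search term the query prompt.
$stored = array_map('gemmaDocumentPrompt', ['nurse', 'lawyer']);
$query = gemmaQueryPrompt('verpleegkundige');
```

The prefixed strings are what you send to the embedding model; the database still stores only the resulting vector next to the raw title.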
No single model wins
That last table row is an interesting one. bge-m3 is bigger and more aggressively multilingual than EmbeddingGemma, so on paper it is the “better” model. It only tied (58.5% versus 59.6% recall@1 with prompts). And when I looked at where each model succeeds, they turned out to have opposite strengths. Take one concept, “bookkeeper”:
- In English, EmbeddingGemma nails it: bookkeeper → accountant, accounting assistant, billing clerk. bge-m3 bungles it: bookkeeper → bookmaker, bookshop manager, zookeeper, matching letters rather than meaning.
- In Dutch, they swap. EmbeddingGemma gives boekhouder → uitgever van boeken (book publisher); bge-m3 recovers: boekhouder → hoofd boekhouding, assistent-boekhouder.
The model that’s stronger in English is weaker in Dutch, and vice versa. You can’t read that off a benchmark leaderboard — the only way to know is to test on your data, in your language. Because the embeddings live in MariaDB right next to that data, that test is just a query. To me that’s a real case for vectors in the database: the ease with which we evaluate and iterate.
Where this is going
For Archivus, I’ve since combined semantic search with MariaDB’s full-text search: the RAG search function uses Reciprocal Rank Fusion to merge the two result lists into a single ranking. That’s probably worth its own post.
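As a rough illustration of the idea, and not the Archivus implementation, Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) in every list it appears in and sums those scores. The constant k = 60 is a commonly used default and is an assumption here, as is the function name.

```php
<?php
declare(strict_types=1);

/**
 * Merge several ranked lists of document IDs into one ranking.
 * Each list contributes 1 / ($k + rank) per document, with rank starting at 1.
 */
function reciprocalRankFusion(array $rankedLists, int $k = 60): array
{
    $scores = [];
    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $index => $id) {
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $index + 1);
        }
    }

    arsort($scores); // highest fused score first; keys (the IDs) are preserved

    return array_keys($scores);
}

$vectorHits = [3, 1, 2];   // IDs ordered by vector distance
$fullTextHits = [1, 4, 3]; // IDs ordered by full-text relevance
$fused = reciprocalRankFusion([$vectorHits, $fullTextHits]);
```

Documents 1 and 3 appear in both lists and therefore rise to the top of the fused ranking, ahead of documents that only one search found. RRF only needs ranks, so the vector distances and full-text scores never have to be put on a common scale.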
The package is MIT-licensed:
- packagist.org/packages/devilsberg/laravel-mariadb-vector
- github.com/erik-ros-devilsberg/laravel-mariadb-vector
Issues and PRs are welcome, especially from anyone running it on datasets larger than mine. Questions are welcome too: you can reach me through my website or on LinkedIn.
The author Erik Ros is the founder of Devilsberg, where he leads engineering teams. He builds:
- 1m.news, a minimalist indie news digest (private beta).
- laravel-mariadb-vector, his first open source package.