YouTube Semantic Search is the winner of the MariaDB AI RAG Hackathon innovation track

Dmytro Abramov - MariaDB AI RAG Hackathon innovation track winner

Last week we announced the winners of the MariaDB AI RAG Hackathon that we organized together with the Helsinki Python meetup group. Now it is time to take a deep dive into the innovation track winner. Dmytro Abramov put together an end-to-end RAG application for making technical meetup videos searchable with semantic relevance (based on their caption texts). We were impressed by the idea and the implementation, which goes well beyond just a proof of concept for AI RAG with MariaDB Vector.

Dmytro, tell us about yourself and why you decided to join the Hackathon?

My professional background is mainly in backend and platform engineering, but my master’s degree was actually in Data Science. Back then, we experimented with early text-generation models like LSTM and GRU, before modern LLMs became popular. Those experiments were fascinating, and I wanted to reconnect with that field.

With recent advancements in machine learning and AI making this area even more exciting, joining this hackathon felt like the perfect opportunity. Also, since I already had some experience with MariaDB and MySQL, it was easier for me to focus on creating something innovative instead of spending too much time learning entirely new tools.

Tell us about the use case. How did you come up with it?

The idea I implemented is about making videos searchable using semantic search. 

The motivation came from how hard it is to get into events like the Python Helsinki meetup. I often end up on the waiting list and miss out. At the same time, there’s a ton of new content from meetups and conferences—both local and international—but not enough time to watch everything.

So I wanted a way to quickly search through video recordings from such events, even across different countries.

Take us through your demo, please.

There are two phases: preparation and running. 

The preparation phase includes data fetching and processing: 

The tool starts by fetching captions from YouTube videos – where available. 

Captions on YouTube are typically rough: no punctuation, arbitrary time-based chunks, and little structure. I used a punctuation restoration model to reintroduce natural sentence boundaries, and semantic-aware chunking to combine related ideas into coherent passages.
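
To make that concrete, here is a minimal sketch of the punctuation-restoration and chunking step, assuming the deepmultilingualpunctuation package (which wraps oliverguhr’s punctuation models) and a simple word-count-based chunker; the actual chunking logic in the project may well differ.

    # Sketch: restore punctuation in raw captions, then group sentences into chunks.
    # Assumes: pip install deepmultilingualpunctuation
    import re
    from deepmultilingualpunctuation import PunctuationModel

    # Defaults to oliverguhr/fullstop-punctuation-multilang-large
    punctuator = PunctuationModel()

    def chunk_captions(raw_caption_text: str, max_words: int = 120) -> list[str]:
        """Turn unpunctuated caption text into roughly sentence-aligned chunks."""
        restored = punctuator.restore_punctuation(raw_caption_text)
        sentences = re.split(r"(?<=[.!?])\s+", restored)
        chunks, current, words = [], [], 0
        for sentence in sentences:
            current.append(sentence)
            words += len(sentence.split())
            if words >= max_words:
                chunks.append(" ".join(current))
                current, words = [], 0
        if current:
            chunks.append(" ".join(current))
        return chunks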

The processed text is embedded using an open-source model (everything runs locally), and each chunk is stored in a MariaDB database, along with:

  • The vector representation
  • Original text
  • Metadata (e.g., video ID, timestamp, title)
  • A direct link to the relevant segment in the video
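
As a rough illustration of the storage side, here is a sketch of what such a table and insert could look like with MariaDB Vector (11.7 or later) and the mariadb Python connector; the table layout and names are hypothetical, not necessarily the project’s actual schema.

    # Sketch: a possible MariaDB schema and insert for caption chunks (names are illustrative).
    import json
    import mariadb

    conn = mariadb.connect(host="localhost", user="app", password="app", database="ytsearch")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS caption_chunks (
            id INT AUTO_INCREMENT PRIMARY KEY,
            video_id VARCHAR(16) NOT NULL,
            title VARCHAR(255),
            start_seconds FLOAT NOT NULL,
            chunk_text TEXT NOT NULL,
            segment_url VARCHAR(255) NOT NULL,
            embedding VECTOR(384) NOT NULL,   -- all-MiniLM-L6-v2 produces 384-dimensional vectors
            VECTOR INDEX (embedding)
        )
    """)

    def store_chunk(video_id, title, start_seconds, chunk_text, embedding):
        """Insert one chunk plus its metadata and a deep link to the video segment."""
        url = f"https://www.youtube.com/watch?v={video_id}&t={int(start_seconds)}s"
        cur.execute(
            """INSERT INTO caption_chunks
               (video_id, title, start_seconds, chunk_text, segment_url, embedding)
               VALUES (?, ?, ?, ?, ?, VEC_FromText(?))""",
            (video_id, title, start_seconds, chunk_text, url,
             json.dumps([float(x) for x in embedding])),
        )
        conn.commit()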

How does the user then use the tool? 

I made two interfaces: 

  • A command-line tool for quick processing and searching
  • A lightweight web frontend (built in minutes with Cursor AI on top of the already implemented business logic)

Users can search for phrases like “dependency injection in Python” or “Temporal” (a workflow orchestration tool), and the system returns the most relevant video segments—even if those keywords aren’t in the video title or description.

Technically, what happens is:

  • The app runs the same embedding model on the user’s query.
  • It finds the nearest neighbors in vector space.
  • It returns links to the most relevant segments in the YouTube videos.
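
A hedged sketch of that flow in Python, assuming the same all-MiniLM-L6-v2 model loaded via sentence-transformers; nearest_chunks() is a hypothetical helper around the database lookup, and one possible version of it is sketched in the next answer.

    # Sketch of the query path; nearest_chunks() is a hypothetical helper
    # wrapping the nearest-neighbour SQL against MariaDB.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def search(query: str, k: int = 5) -> None:
        query_vector = model.encode(query)               # same model as at indexing time
        for row in nearest_chunks(query_vector, k=k):    # hypothetical database helper
            # Each row carries the metadata stored during the preparation phase.
            print(f"{row['title']} ({row['video_id']} @ {row['start_seconds']:.0f}s)")
            print(f"  {row['segment_url']}")

    search("dependency injection in Python")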

How does your app interact with MariaDB?

The app uses MariaDB as a vector store. After extracting and chunking captions from YouTube videos, it generates semantic embeddings for each chunk and stores them in MariaDB using the new VECTOR data type. Each entry includes metadata like video ID, timestamps, and the original text. 

When a user submits a search query, it’s also embedded and matched against the stored vectors using Euclidean distance to find the most relevant segments. It’s possible to configure the number of nearest neighbors to return.

The app uses raw SQL queries for now, but I structured the storage logic to be modular for easy migration to an ORM later if needed. Both MariaDB and the search app itself run in Docker containers.
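
For reference, a nearest-neighbour query of that kind could look roughly as follows, reusing the hypothetical caption_chunks table and cursor from the earlier sketch and MariaDB’s VEC_DISTANCE_EUCLIDEAN function:

    # Sketch: raw SQL nearest-neighbour lookup (reuses the `cur` cursor from the schema sketch).
    import json

    def nearest_chunks(query_vector, k: int = 5) -> list[dict]:
        vector_text = json.dumps([float(x) for x in query_vector])
        cur.execute(
            """SELECT video_id, title, start_seconds, chunk_text, segment_url,
                      VEC_DISTANCE_EUCLIDEAN(embedding, VEC_FromText(?)) AS distance
               FROM caption_chunks
               ORDER BY VEC_DISTANCE_EUCLIDEAN(embedding, VEC_FromText(?))
               LIMIT ?""",
            (vector_text, vector_text, k),
        )
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]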

You embed locally – what model did you use? How long does it take to prepare an hour’s worth of video?

I used open-source models from Hugging Face and ran them fully locally on my MacBook Pro.

The embedding model I used for generating the embeddings locally was all-MiniLM-L6-v2. Preparing an hour of video (from captions) typically took up to 30 seconds, which includes fetching captions, restoring punctuation using the model oliverguhr/fullstop-punctuation-multilang-large, splitting captions into sentences, chunking by tokens, generating embeddings, and inserting them into the MariaDB Vector store. But I designed the code in a way that makes it very easy to plug in other models as well.
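
For completeness, the embedding step itself is only a few lines with sentence-transformers; the model name is the only fixed choice in this sketch, and swapping in another Hugging Face model is a one-line change:

    # Sketch: embedding caption chunks locally with sentence-transformers.
    from sentence_transformers import SentenceTransformer

    EMBEDDING_MODEL = "all-MiniLM-L6-v2"   # pluggable: any sentence-transformers model name works

    model = SentenceTransformer(EMBEDDING_MODEL)
    chunk_embeddings = model.encode(chunks, show_progress_bar=True)   # `chunks` from the chunking step
    print(chunk_embeddings.shape)   # (number_of_chunks, 384) for all-MiniLM-L6-v2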

How does it fetch captions?

It uses a reverse-engineered YouTube captions API via the youtube-transcript-api Python library. It’s not scalable at large volumes—you risk being banned by YouTube—but it’s fine for demos and small-scale use.
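
A minimal sketch of that fetching step, assuming the classic get_transcript interface of youtube-transcript-api (newer releases of the library expose a slightly different, instance-based API):

    # Sketch: fetching caption segments for one video (youtube-transcript-api pre-1.0 interface).
    from youtube_transcript_api import YouTubeTranscriptApi

    def fetch_captions(video_id: str) -> list[dict]:
        # Each segment looks like {"text": "...", "start": 12.34, "duration": 3.2}
        return YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])

    # Example: segments = fetch_captions("<YouTube video ID>")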

Does it capture screen text as well, or only captions?

Currently, it only uses captions. But yes, ideally we could add screen text or slides later to improve the accuracy and add even more context and better chunking of data into segments.

How do I select YouTube videos? Can I give a whole playlist as input?

Currently, videos are selected manually by specifying individual YouTube video IDs or using a predefined list (JSON file). The application doesn’t directly support importing an entire YouTube playlist yet, but that’s a feature I’d definitely consider adding next, since it would make populating the database even more convenient. In particular, getting the data for a specific playlist should be supported by the official YouTube API.
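
As an illustration of what that future feature could build on, here is a hedged sketch using the official YouTube Data API v3 through google-api-python-client (an API key is assumed; this is not part of the current project):

    # Sketch: listing all video IDs of a playlist via the official YouTube Data API v3.
    from googleapiclient.discovery import build

    def playlist_video_ids(playlist_id: str, api_key: str) -> list[str]:
        youtube = build("youtube", "v3", developerKey=api_key)
        video_ids, page_token = [], None
        while True:
            response = youtube.playlistItems().list(
                part="contentDetails",
                playlistId=playlist_id,
                maxResults=50,
                pageToken=page_token,
            ).execute()
            video_ids += [item["contentDetails"]["videoId"] for item in response["items"]]
            page_token = response.get("nextPageToken")
            if not page_token:
                return video_ids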

Cool! Where can I try it out? 

The repo resides at https://github.com/abramovd/yt-semantic-search, and a demo video is available here (6:19).