Amazon contributes to MariaDB Vector
The MariaDB Vector preview was recently released, bringing much-awaited Vector Search functionality to MariaDB Server. Amazon has been one of the major open source contributors to MariaDB Vector. To share the excitement and get an inside view of what it’s like to contribute to MariaDB Server, I had a chat with Hugo Wen, a software engineer on the Amazon RDS team.
Hugo’s contributions to MariaDB Vector
Hugo Wen’s work on vector similarity search in MariaDB and MySQL started when Amazon’s leadership identified Vector Search functionality as a critical addition and decided to invest the Amazon RDS team’s time in contributing to MariaDB Vector. The goal was not just to enable an addition to the Amazon RDS offering, but to create a community contribution for anyone to use and improve on.
Why add vectors to a relational database? Hugo pointed out that there are already databases designed specifically for vectors, but they add extra layers to a database architecture and increase the overall cost of the service. Supporting vectors inside the relational database reduces complexity and simplifies maintenance, which makes it more cost-efficient. With MariaDB Server, users can keep data and vectors in two tables of the same database and access both with a single query.
During the design phase, Hugo looked at different existing implementations. Postgres’ pgvector is a good example that is fully open source. Hugo also looked at MySQL, but its implementation is hidden in the HeatWave backend service, which is only available to enterprise users. MySQL has since published a Vector data type, but it still does not include the most critical aspect of vectors: indexing and search based on distance calculation functions.
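To make "search based on distance calculation functions" concrete, here is a minimal sketch in plain Python. It is an exact, brute-force search over a handful of rows, not MariaDB Vector's index or API, and every name in it (`ROWS`, `nearest`, and so on) is made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(rows, query, k, distance=euclidean_distance):
    """Return the k rows closest to `query`, scanning every row (no index)."""
    return sorted(rows, key=lambda row: distance(row[1], query))[:k]

# Hypothetical (id, embedding) rows standing in for a table with a vector column.
ROWS = [(1, [0.0, 0.0]), (2, [1.0, 1.0]), (3, [5.0, 5.0])]

if __name__ == "__main__":
    for row_id, _ in nearest(ROWS, [0.9, 0.9], k=2):
        print(row_id)
```

A scan like this compares the query against every row, which is why an index that visits only a small part of the data matters once tables grow.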
Collaboration was kicked off by discussions on design and implementation between the Amazon team and MariaDB developers, including Sergei Golubchik (Chief Architect at MariaDB plc) and Vicentiu Ciorbaru of MariaDB Foundation.
During the development phase, Hugo submitted several Pull Requests: for implementations of the algorithm itself, for the graph storage, and for benchmarking.
Let’s start in reverse order.
Benchmarking is important throughout the development process, in order to see how changes affect the performance of MariaDB Vector. ANN-Benchmarks is a third-party tool used for comparing algorithms, and Hugo applied it to MariaDB early on. The tool’s results were used in Sergei’s blog post “How fast is MariaDB Vector”, which shows good results for MariaDB, surprisingly good for a first Preview release. Hugo’s Pull Request #3094 is still marked as draft, as Hugo is planning to update the tool for the final release of MariaDB Vector.
The good performance results are thanks to optimization of MariaDB Vector, which Hugo has also contributed to. The HNSW index initially saved individual connections instead of nodes with neighbour lists, which would become slow if a node had lots of neighbours. Hugo suggested saving only the node and its neighbour list, which improved performance significantly: “Comparing with the previous approach, the insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with random-xs-20-euclidean and random-s-100-euclidean datasets.” More performance improvements have been added since then, with Pull Request #3197.
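The difference between the two storage layouts can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not MariaDB's actual on-disk format: the graph, the function names and the "rows touched" count are all hypothetical stand-ins for reads against a table.

```python
# Hypothetical toy graph for one HNSW layer: node id -> neighbour ids.
GRAPH = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

def as_edge_rows(graph):
    """Layout A: one stored row per connection (source, destination)."""
    return [(src, dst) for src, nbrs in graph.items() for dst in nbrs]

def as_node_rows(graph):
    """Layout B: one stored row per node, carrying its whole neighbour list."""
    return {src: list(nbrs) for src, nbrs in graph.items()}

def neighbours_from_edges(edge_rows, node):
    """Collect neighbours from layout A; rows touched grows with neighbour count."""
    matching = [dst for src, dst in edge_rows if src == node]
    return matching, len(matching)

def neighbours_from_nodes(node_rows, node):
    """Collect neighbours from layout B; always a single row."""
    return node_rows[node], 1

if __name__ == "__main__":
    edges, nodes = as_edge_rows(GRAPH), as_node_rows(GRAPH)
    print(len(edges), "edge rows vs", len(nodes), "node rows")
    print(neighbours_from_edges(edges, 1))
    print(neighbours_from_nodes(nodes, 1))
```

In this toy model, fetching the neighbours of a well-connected node touches one row per connection in layout A but a single row in layout B, which is the intuition behind the change described above. The measured gains quoted from Hugo come from his ann-benchmark runs, not from this sketch.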
Now that the basic indexing and search functionality is in place for MariaDB Vector, more features are being considered. For the preview release, Hugo also contributed to Vector delete and update (PR #3321) as well as various bug fixes.
The preview release is out now, and Hugo is interested in feedback on, for example, MariaDB Vector’s data type. There has been discussion on whether to keep the current binary data type or to switch to a new vector data type.
Hugo comments that people also seem to be very curious about performance metrics for indexing and searching with specific amounts of data and vector dimensions. Results naturally depend heavily on the instance type and the data. One option could be to make the ANN-Benchmarks tooling easily available, so that users can generate metrics themselves.
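As a rough idea of what such self-generated metrics could look like, the sketch below computes the two numbers ANN-Benchmarks typically reports for a search: recall against exact results, and queries per second. It is plain Python with hypothetical helper names and does not use the ANN-Benchmarks code or any MariaDB API. The `search` argument is assumed to be whatever callable runs one query against your own setup.

```python
import time

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search found."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def queries_per_second(search, queries):
    """Run `search` once per query and report throughput on this machine."""
    start = time.perf_counter()
    for query in queries:
        search(query)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed if elapsed > 0 else float("inf")

if __name__ == "__main__":
    # Made-up ids: the approximate search missed one of four true neighbours.
    print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
```

Recall and throughput trade off against each other, so it makes sense to report them together and for the same instance type and dataset.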
If you have any feedback on MariaDB Vector, send it to the public MariaDB discuss mailing list at mariadb.com/kb/en/mailing-lists. For private feedback, please email foundation@mariadb.org. If you prefer chat, you can find us at mariadb.zulipchat.com.
Hugo’s Amazon contributions
MariaDB Vector is not Hugo Wen’s or Amazon’s first contribution to MariaDB. As one of the largest cloud providers, Amazon not only improves MariaDB for the benefit of Amazon Web Services, but also actively shares fixes and improvements back with everyone running MariaDB.
Amazon’s contributions have not gone unnoticed by the community: “Nice to see Cloud Providers as major contributor at #MariaDB. For me, this is an interesting counter argument to Cloud Providers only taking without giving back to Open Source.” (Jean-François Gagné on X)
Some other cases that Hugo has contributed to are:
- MDEV-33479, a security-related improvement to the MariaDB Authentication plugin – Unix-socket, which is being released with MariaDB 11.6.
- MDEV-27342, a bug fix related to using a system snapshot to make sure data is recoverable after a crash. Merged with a minor release of MariaDB 10.6.
- MDEV-31151, a bug fix for crashes that occurred after a long time under very heavy workloads after migrating to an ARM-based service. The fix was related to the pinbox allocator.
Hugo commented that he has always received a very fast response from Sergei Golubchik (Chief Architect at MariaDB plc) on his contributions. The time to a final merge has varied, but his suggestions have been taken seriously and discussed with MariaDB engineers to arrive at the best solution for MariaDB.
Forward and onwards
Hugo is super proud to have contributed to the MariaDB open source ecosystem and looks forward to keeping the collaboration going, in order to deliver more great features and bug fixes.
From us at MariaDB Foundation, thank you for your contributions, Hugo and Amazon!