A GSoC internship with the MariaDB Foundation


Introduction

Hi, my name is Kartik Soneji, and I am a second year student at Thadomal Shahani Engineering College, Mumbai. I have been programming since the age of 13. I started out with Java, then learnt a little C++ before diving head first into web development with HTML, CSS and JavaScript. I also programmed a bit in Python and Rust to see what all the hype was about.

My primary motivation for contributing to the open source community is that I strongly believe software should be free to copy, modify and study.

This year, I had the incredible opportunity to work with the MariaDB Foundation as a GSoC intern, and this blog post is a summary of my journey.

Description of the project

The problem

Issue: MDEV-12933 Sort out the compression library chaos
As MariaDB is getting more storage engines and as they’re getting more features, MariaDB can optionally use more and more compression libraries for various purposes.

InnoDB, TokuDB, RocksDB – they all can use different sets of compression libraries. Compiling them all in would result in a lot of run-time/rpm/deb dependencies, most of which will never be used by most users. Not compiling them in would result in requests from users to compile them in. While most users don’t need all of these libraries, many users use some of them.

Loading the compression libraries dynamically at runtime will allow the server to support any combination of libraries as per their availability.
This will also allow the server to implement support for more compression libraries without forcing them as hard dependencies.

The possible solutions we considered

We considered several alternatives; none of them is easy to implement, and each has its own advantages and shortcomings.

  • Using Squash

    Squash provides a single API to access many compression libraries, allowing applications a great deal of flexibility in choosing compression algorithms, including the option to pass that choice along to the user.

    • Pros
      • Gain access to ~30 libraries at once.
      • The API is already built.
      • This is a more stable, long-term solution than services.
    • Cons
      • Since Squash is an abstraction library, it doesn’t provide access to library-specific functions like:
        • LZ4_loadDict
        • LZ4_loadDictHC
        • ZSTD_compress_usingDict
        • ZSTD_compress_usingCDict
      • Adds another external dependency.
      • Need to propagate changes into 3rd party code.
  • Write our own Squash-like API.
    • Pros
      • Doesn’t add another external dependency.
      • We can tailor the API to our needs.
      • This is also a more stable, easier-to-maintain long-term solution.
        It is less likely to break, and easier to fix if one of the (many) components changes.
    • Cons
      • Not a trivial task.
      • Still need to propagate changes into 3rd party code.
  • Implement the dynamic loading as MariaDB services.
    • Pros
      • Might not need to modify 3rd party code.
      • The server handles version-checking and API/ABI mismatch.
      • Faster to implement.
    • Cons
      • Might have unintended consequences.
      • Changes in the storage engines or libraries might break the implementation.
      • Thus, this might turn into a high-maintenance project for some engines if they update the API frequently (like RocksDB).
        Although Connect, InnoDB and Mroonga seem to have quite a stable API.
      • There is still a chance that we might need to change 3rd party code, but most likely we will not need to.

The solution we picked

We decided to go with the services approach, for the reasons outlined in issue MDEV-22895:

  1. Using Squash requires changing code in all the storage engines that use compression libraries, including 3rd party storage engines, and this is problematic because it is very unlikely we will be able to propagate these changes upstream.
  2. The unified API that Squash provides does not support all the bits and pieces our current storage engines use from the compression libraries. Tweaking the storage engine code to get rid of API calls that Squash doesn’t support is too complicated, and amplifies the problem stated in point 1.

Project internals

Service Architecture diagram

Overview

MariaDB services are a generic mechanism for making server code available to MariaDB plugins.

Our plan was to use them as a proxy to provide compression APIs to plugins, if and only if the corresponding compression libraries are installed on the host and the MariaDB server can find them during startup.

The overall idea is that compression services use #defines to replace library functions with calls to a function pointer.

That is, if the original function call from a storage engine is

FOO_compress_buffer(src, srcLen, dst, dstLen);

then, the service turns it into

compression_service_foo->FOO_compress_buffer(src, srcLen, dst, dstLen);

with the help of preprocessor defines like this

#define FOO_compress_buffer(...) compression_service_foo->FOO_compress_buffer_ptr(__VA_ARGS__)

In theory, this way we can “trick” the storage engine plugins into calling our proxy function pointers (defined within a compression service) without needing to alter their source code directly.

Each library is handled by a separate compression service.
The “fake” library headers are in include/compression/<library>.h, and the loading code and dummy functions are in sql/compression/<library>.h.
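
For a hypothetical library foo, such a fake header is little more than a declaration of the service struct (described next) plus one redirect macro per wrapped function. A minimal sketch, with all names assumed:

/* include/compression/foo.h -- hypothetical sketch */
#ifndef COMPRESSION_SERVICE_FOO_INCLUDED
#define COMPRESSION_SERVICE_FOO_INCLUDED

//defined and initialized by the server at startup
extern struct compression_service_foo_st *compression_service_foo;

//redirect library calls through the service's function pointers
#define FOO_compress_buffer(...) compression_service_foo->FOO_compress_buffer_ptr(__VA_ARGS__)

#endif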

All the function pointers are put into one struct per compression service.

struct compression_service_foo_st{
    /* full type of function pointer */ FOO_compress_buffer_ptr;
    //more functions
};
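
For instance, with the (assumed) FOO_compress_buffer signature used in the dummy function below, the member would be declared as a plain function pointer:

struct compression_service_foo_st{
    foo_ret (*FOO_compress_buffer_ptr)(const char *src, int srcLen, char *dst, int *dstLen);
    //one pointer per wrapped library function
};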

The function pointers call the real functions (from the .so file) when the library is loaded, but the service has nowhere to forward the calls if the library is not present on the user’s system.
For that case, the pointers are initialized to dummy functions that return an error code, which is handled by the calling storage engine.

compression_service_foo->FOO_compress_buffer_ptr = DUMMY_FOO_compress_buffer;

The dummy function is defined to return an error code.

foo_ret DUMMY_FOO_compress_buffer(const char *src, int srcLen, char *dst, int *dstLen){
    return FOO_INTERNAL_ERROR;
}
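
From the storage engine’s point of view, nothing changes: it calls FOO_compress_buffer as usual and treats the returned error code like any other compression failure. A hypothetical caller (FOO_OK and store_uncompressed are made up for illustration):

foo_ret err = FOO_compress_buffer(src, srcLen, dst, dstLen); //expands to the service call
if (err != FOO_OK){
    //library not loaded (or compression failed): fall back to storing the data uncompressed
    store_uncompressed(src, srcLen);
}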

This project also adds a new switch, --use-compression, to the server, which allows libraries to be specified on server startup.
For example, --use-compression=lzma,lzo will only load the LZMA and LZO libraries, even if others are present.
The default behaviour is to load as many libraries as possible.

The server tries to load the specified libraries using dlopen.

void *library_handle = dlopen("libfoo.so", RTLD_NOW);

void *FOO_compress_buffer_ptr = dlsym(library_handle, "FOO_compress_buffer");

Finally, if the server is able to open the library and resolve all symbols successfully, then it replaces the dummy functions with the resolved ones.

compression_service_foo->FOO_compress_buffer_ptr = (/* typecast */) FOO_compress_buffer_ptr;
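
Putting these steps together, the per-library loading logic is roughly the following (a simplified sketch; the real code resolves every function in the service struct, not just one):

#include <dlfcn.h>

static int load_foo_library(){
    void *handle = dlopen("libfoo.so", RTLD_NOW);
    if (!handle)
        return 1; //library not installed: keep the dummy functions

    void *sym = dlsym(handle, "FOO_compress_buffer");
    if (!sym){
        dlclose(handle); //symbol missing: keep the dummy functions
        return 1;
    }

    //all symbols resolved: replace the dummies with the real functions
    compression_service_foo->FOO_compress_buffer_ptr =
        (foo_ret (*)(const char *, int, char *, int *)) sym;
    return 0;
}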

For a more thorough explanation and some sample code: Create a new compression service

The biggest challenges I faced during the coding period

Most of the project went smoothly, in large part due to the excellent roadmap laid out by Robert and Sergei Golubchik.
Even then, some things didn’t go as planned.

  • I remember looking at the massive codebase (which has been under development since 1995) and wondering how I would ever get comfortable with it. It was the first time I had to work with so much code in one place.
  • Compile errors were not fun, but linker errors were by far the worst kind of issue to debug.
    A lot of it was guess-and-check. Having no experience with the linker beyond basic theory also didn’t help.
  • Parts of the project have improper CMake usage, which needed workarounds that caused some delays.
    Specifically, the mechanism to detect and include library headers is hardcoded, even though CMake has built-in Find<Package> modules.

The state of the project at the end of the program

        | BZip2 | LZ4 | LZ4HC | LZMA | LZO | Snappy | ZStandard
Connect |   *   |     |       |      |     |        |
InnoDB  |   Y   |  Y  |       |  Y   |  Y  |   Y    |
Mroonga |       |  Y  |       |      |     |        |     Y
RocksDB |   Y   |  ?  |   ?   |      |     |   Y    |     ?

Y → Service is confirmed working, and tests are written.
* → Service is working, but needs tweaking or additional tests.
? → Service is implemented, but it might not be worth the additional complexity.

RocksDB proved to be quite difficult to work with, as it is tightly integrated with LZ4 and ZStandard.
From what I can tell from reading the code, it also doesn’t have fallback mechanisms in case of compression failure.
That means it is unlikely to work with our approach, which relies on the caller (i.e. RocksDB) to gracefully handle error conditions.

More details can be found on my GitLab repository.

Delivering my work and saying goodbye

All the changes are covered by this pull request.
The current patch implements support for all the libraries, along with global status variables like Compression_loaded_<library> that indicate whether each library is loaded (visible via SHOW GLOBAL STATUS).

These four months have been a great experience, and I had an awesome time learning about how the MariaDB server worked. Robert has been a great mentor, and patiently listened to my ideas.
I learnt a lot about writing safe and performant code. This was my first time working with a project of this size and impact, so I was a little nervous, but the experience will help me grow as a developer. I am more than a little sad about leaving the GSoC program, but I do plan to keep contributing to MariaDB, and other open source projects as well.

While we were able to get a lot done during GSoC, the project is still not quite finished. There are still some approaches I want to explore, and some features that I want to add.
I also want to fix some of the issues in the codebase that I noticed along the way.