Writing User-Defined Functions in Rust

One of the most straightforward ways to add functionality to MariaDB or MySQL server instances is by creating a user-defined function (UDF). These are compiled functions loaded from dynamic binaries that can be much more performant and flexible than what can be written in SQL, providing the same capabilities as builtin functions.

These UDFs are typically written in C or C++, but a library is now available that makes it easy to write them in Rust. This blog discusses some of the reasoning for writing this library, followed by a very basic usage example that doesn’t require any experience with the language.

Why Rust?

Extensions to MariaDB can be written in anything that can produce a compiled dynamic library, which is typically C or C++ (the languages of the server’s source itself). There is nothing wrong with this current approach, but being able to write them in Rust has some advantages:

  • The compiler always guarantees protection from the most common relevant CWEs (specifically overread/overwrite, use after free, null dereference, and race conditions)
  • Good use of type safety can enforce code correctness
  • Smart pointers prevent memory leaks (similar to C++, but Rust’s implementation is somewhat more straightforward)
  • Incredible toolchain; cargo is Rust’s default build system. Out of the box, you get:
    • Compiling (cargo check / cargo build)
    • Linting (cargo clippy)
    • Testing (cargo test, alias cargo t)
    • Documentation (cargo doc, usually cargo doc --document-private-items --open for libraries)
    • Dependency management (configured in Cargo.toml)

Databases are applications where performance bottlenecks can easily bog down the web services they drive. They must also take extreme care to avoid security issues, since easy to miss things like buffer overreads can mean compromising sensitive data. Rust is particularly well suited to this niche: it provides performance similar to or better than C and C++, but guarantees against those languages’ most common security pitfalls.

If you aren’t familiar with the language, you might question how C interfaces and lower-level programming are possible with the guarantees. The answer is fairly straightforward; anything that might introduce unsoundness (pointer operations, potentially thread-unsafe things, inline assembly, C FFI) are possible, but only within an unsafe {...} block. This means that these isolated few lines of code can be easily identified and thoroughly verified; any safe API that builds upon these blocks is then guaranteed to be sound as long as the unsafe sections are. Since the udf library handles all these unsafe operations for you, it is possible to write almost any UDF entirely in safe code!

Example UDF walkthrough

In this section, we will implement an extremely simple user-defined function and cover its writing, building, and usage aspects. To follow along, you will need a Rust compiler with version ≥1.65 (because of dependence on GATs); run rustup update if needed to get to the latest version. If you don’t yet have Rust, get it from https://rustup.rs/. Get the rust-analyzer language server if you are using an IDE (not required, but nice to have).

Workspace setup

The first step is to create a new project; Cargo makes this pretty easy;

# Create a new Rust project called test-udf
$ cargo new --lib test-udf

$ cd test-udf

# Simply validate that the code compiles, same as `cargo c`
$ cargo check

# Run the unit test, same as `cargo t`
$ cargo test

The above creates a directory called test-udf with Cargo.toml and src/lib.rs, then verifies everything is working correctly (the cargo new command makes a super simple example function and test). We need to updateCargo.toml to tell Cargo to produce the correct kind of output (a dynamic library) and to use udf as a dependency:

[package]
name = "test-udf"
version = "0.1.0"
edition = "2021"
publish = false # prevent accidentally publishing to crates.io

# Add this section to specify we want to create a C dynamic library
[lib]
crate-type = ["cdylib"]

[dependencies]
udf = "0.5" # our dependency on the `udf` crate

You can delete everything in lib.rs and our setup is complete.

UDF architecture

Let’s write a very simple UDF that performs a running total of integers.

A UDF typically needs to make three symbols available to the server:

  • An init call that validates argument type and performs memory allocation
  • A process call run once per row that produces a result
  • A deinit call that frees any memory from the setup

Common interfaces like this in Rust are grouped into traits. The BasicUdf trait is of interest here and provides interfaces for init and process (deinit is handled automatically). This trait should be implemented on a structure representing data to be shared among calls to process, which is called once per row. In this case, the data is just our current total.

struct RunningTotal(i64);

I am using an “tuple struct” syntax here which means you can access fields with numbers (some_struct.0, some_struct.1) rather than by names (some_struct.field). This is just a convenience as we only have one field, but you are more than welcome to use a standard struct (they’re identical behind the scenes):

struct RunningTotal {
  total: i64
}

In either case, our struct name will get converted to snake case to give the SQL function name (so Runningtotal becomes running_total(...)). This will be the basis of our implementation.

Basic structure

We now need to do three things:

  • Import needed types and functions. udf has a prelude module with the most commonly needed imports, so we can just import everything there;
  • Implement the BasicUdf trait for our struct; and
  • Add the #[register] macro to create the correct symbols.

The minimum compiling code looks like this

use udf::prelude::*;

struct RunningTotal(i64);

#[register]
impl BasicUdf for  RunningTotal {
    type Returns<'a> = i64;

    fn init(cfg: &UdfCfg<Init>, args: &ArgList<Init>) -> Result<Self, String> {
        todo!()
    }

    fn process<'a>(
        &'a mut self,
        cfg: &UdfCfg<Process>,
        args: &ArgList<Process>,
        error: Option<NonZeroU8>,
    ) -> Result<Self::Returns<'a>, ProcessError> {
        todo!()
    }
}

(hint: if you just type impl BasicUdf for RunningTotal {} then open the quick fix inside the brackets (ctrl+. on VSCode); it will offer to autofill the function signatures for you.)

Woah function signatures! The docs on BasicUdf go into detail about what everything here does, but let’s break it down simply:

type Returns<'a> = i64;

This is just where we specify the return type of our UDF. See the docs for more information about possible return types, but we just use i64 to represent a non-null integer. (Ignore the <'a> – that is relevant only when returning references)

fn init(cfg: &UdfCfg<Init>, args: &ArgList<Init>) -> Result<Self, String> {
    todo!()
}

This is our initialization function, which takes a configuration object cfg and a list of arguments args. Init on these types is just a marker to indicate where they’re being used (it’s tied to what methods are available).

This returns a Result, a builtin enum to represent a value that has different types for Ok and Success (Rust enums are “tagged unions” or “sum types”). So, from our function signature, we can tell that the type of a successful function call will be Self (i.e. RunningTotal, which gets saved for later use) and an error will be a String (which gets displayed to the user). Makes sense, right?

todo!() is a builtin macro that just matches whatever type signature is needed to compile. If we actually tried to run it would fail, but we can at least verify that there are no errors in our code structure:

$ cargo c
warning: `test-udf` (lib) generated 5 warnings (run `cargo fix --lib -p test-udf` to apply 5 suggestions)
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s

There are warnings about unused arguments, but the basic structure is all set!

UDF implementation: init

Now that we have our basic structure, let’s take a look at how to get some results.

The main goal of our init function is to validate arguments. Let’s look at the implementation then break it down:

fn init(_cfg: &UdfCfg<Init>, args: &ArgList<Init>) -> Result<Self, String> {
    if args.len() != 1 {
        return Err(format!("expected 1 argument; got {}", args.len()));
    }

    // Coerce everything to an integer
    args.get(0).unwrap().set_type_coercion(SqlType::Int);

    Ok(Self(0))
}

The first part here checks out argument count:

if args.len() != 1 {
    return Err(format!("expected 1 argument; got {}", args.len()));
}

The number of arguments should be one. If not, it creates a formatted error message string and returns it (Err(something) is how to construct a Result enum error variant).

The second logical block:

args.get(0).unwrap().set_type_coercion(SqlType::Int);

Uses .get(0) to attempt to get the first argument. This returns an Option<SqlArg> which is another builtin enum type like Result.

Option<T> has two possible variants: Some(T) to represent an existing value of type T, and None to represent nothing. unwrap() is used to get the inner value out of a Some() value, or panic if there is None, so we use it to get the argument at index 0 (first argument).

Note that panicking is a very bad idea in UDFs since it can cause the server to crash. We have already verified that there is a single argument here though, so unwrapping will not fail and is thus okay to use here.

Once we have our first argument representation, we are able to call set_type_coercion. This instructs the server to either coerce the first argument to an integer or cancel the function call with an error.

Ok(Self(0))

The last line simply creates a Self instance with 0 as its inner value, and wraps it in Ok to indicate success. In Rust, the last line of any block is that block’s value if it doesn’t end in a semicolon, so this statement is our return value. That’s all that is needed in the init phase.

UDF implementation: process

The process function also has a fairly simple body:

fn process<'a>(
    &'a mut self,
    _cfg: &UdfCfg<Process>,
    args: &ArgList<Process>,
    _error: Option<NonZeroU8>,
) -> Result<i64, ProcessError> {
    // Get the value as an integer and add it to our total
    self.0 += args.get(0).unwrap().value().as_int().unwrap_or(0);

    // The result is just our running total
    Ok(self.0)
}

The first line contains most of the logic, and uses combinators to keep things
terse. It does the following:

  • args.get(0).unwrap(): this gets the first argument, as discussed above. We are again okay to unwrap here because we validated our arguments (init and process get the same number of arguments when they are called, and we performed validation in init)
  • .value() gets the argument’s value, which is a SqlResult. This is an enum with variants for String, Real, Int, and Decimal. .as_int() is a convenience function that returns an Option. If the value is a non-null integer, it will return Some(i64); any other possibilities will return None. Because we set type coercion in init, we can reasonably expect that all values will be a potentially null integer.
  • unwrap_or(0) acts the same as unwrap() when the value is Some, but substitutes the specified value of 0 when the value is None. This means that any null values will use 0 so they don’t affect our running sum.

We add this to our struct’s inner value, self.0, which represents our current running total, and return it with Ok(self.0). With that, our process function is complete!

Unit testing

The udf crate provides functionality to thoroughly test UDF implementations without even loading them into SQL. If you aren’t familiar, it is worth looking at the basics of Rust’s unit testing for a brief outline, but basically any function marked #[test] will get run as one.

We need to update our udf dependency to use the mock feature so we can access that feature-gated module. Change the dependency line to look like the following:

udf = { version = "0.5", features = ["mock"] }

And add the following below our test UDF:

#[cfg(test)]
mod tests {
    // Use our parent module
    use super::*;
    use udf::mock::*;

    #[test]
    fn test_basic() {
        // Create a mock `UdfCfg` and mock `UdfArgs`
        let mut cfg = MockUdfCfg::new();

        // Create an array of mock row arguments. We will "run" our function with each
        // one and verify the result
        let mut row_args = [
            // Each entry here acts as a row. Format for the macro is
            // `([type] value, "attribute name", nullable)`
            mock_args![(Int None, "", false)],
            mock_args![(10, "", false)],
            mock_args![(Int None, "", false)],
            mock_args![(-20, "", false)],
        ];

        // This is the expected output after each of the above calls
        let outputs = [0i64, 10, 10, -10];

        // Run the `init` function on our mock data
        let mut rt = RunningTotal::init(cfg.as_init(), row_args[0].as_init()).unwrap();

        // Iterate through our list of arguments
        for (arglist, outval) in row_args.iter_mut().zip(outputs.iter()) {
            // Run the process function and verify the result
            let res = rt.process(cfg.as_process(), arglist.as_process(), None);
            assert_eq!(res, Ok(*outval));
        }
    }
}

Let’s check the result:

running 1 test
test tests::test_basic ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Excellent!

See this blog post’s repository for some more example unit tests. With good unit testing, it is possible to be highly confident that a UDF performs as expected without even loading it into our server.

Loading the function

The final test for our UDF is to actually run it in a server. Building a .so or .dll file that can be loaded into MariaDB is easy – cargo build --release produces the output, which will be located in target/release. If you have Docker however, testing in an isolated environment is even easier:

FROM rust:1.66 AS build

WORKDIR /build

COPY . .

# Use Docker buildkit to cache our build directory for quicker recompilations
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/build/target \
    cargo build --release \
    && mkdir /output \
    && cp target/release/libMY_CRATE_NAME.so /output

FROM mariadb:10.10

COPY --from=build /output/* /usr/lib/mysql/plugin/

This docker image uses cache, which is a feature of Docker buildkit. Make sure you are using a newer version of Docker, or have the environment variable set correctly to enable it. Or, remove the cache indicators.

Be sure to update file name at MY_CRATE_NAME (libtest_udf.so if you followed the earlier suggestion). The following runs and builds our image:

# Build the image
$ docker build . --tag mdb-blog-udf

# Run the image and name it mariadb_blog_udf for convenience
$ docker run --rm -d -e MARIADB_ROOT_PASSWORD=example --name mariadb_blog_udf mdb-blog-udf

# Enter the SQL console
$ docker exec -it mariadb_blog_udf mysql -pexample

Let’s load our function and test it, first with too many arguments and then with the correct argument count:

MariaDB [(none)]> CREATE FUNCTION running_total RETURNS integer SONAME 'libtest_udf.so';
Query OK, 0 rows affected (0.003 sec)

MariaDB [(none)]> select running_total(1, 2, 3);
ERROR 1123 (HY000): Can't initialize function 'running_total'; xpected 1 argument; got 3

MariaDB [(none)]> select running_total(10);
+-------------------+
| running_total(10) |
+-------------------+
|                10 |
+-------------------+
1 row in set (0.000 sec)

So far so good! Now a slightly harder test

MariaDB [(none)]> create database db; use db; create table t1 (val int);
Query OK, 1 row affected (0.000 sec)

Database changed
Query OK, 0 rows affected (0.023 sec)

MariaDB [db]> insert into t1(val) values (1),(2),(3),(NULL),(-100),(50),(123456789);
Query OK, 7 rows affected (0.002 sec)
Records: 7  Duplicates: 0  Warnings: 0

MariaDB [db]> select val, running_total(val) from t1;
+-----------+--------------------+
| val       | running_total(val) |
+-----------+--------------------+
|         1 |                  1 |
|         2 |                  3 |
|         3 |                  6 |
|      NULL |                  6 |
|      -100 |                -94 |
|        50 |                -44 |
| 123456789 |          123456745 |
+-----------+--------------------+
7 rows in set (0.000 sec)

Perfect!

Wrapup

We have successfully written a simple UDF that validates its arguments and stores some data among calls, all with just a few lines of code. We also performed both unit and (non-automated) integration testing, which are easy steps to make sure our program works as expected.

As mentioned, the code for this example is in this blog post’s repository. For anyone looking to explore a little further, the code could easily be turned into an aggregate UDF by implementing the AggregateUdf trait.

Helpful links:

Try the example or try your own function; you are welcome to join us on Zulip to discuss the results. Happy programming!