MariaDB Hidden Gem: Create Aggregate Function

Have you ever written a query where the GROUP BY was easy, but the aggregate was the problem?

You know how to group the rows.
You know what result you want for each group.
But none of the built-in aggregate functions really match your logic.

So you end up with a long expression using SUM(), CASE, IF(), GROUP_CONCAT(), JSON functions, or application-side code. It works, but it is not beautiful. And if you need the same logic in several places, it becomes even worse.

This is exactly the kind of problem MariaDB’s CREATE AGGREGATE FUNCTION solves.

It lets you create your own aggregate function in SQL, using stored-routine syntax, and iterate over the rows of the current group.

In other words, instead of forcing your problem into a complicated expression, you can write the aggregate almost the way you would describe it:

for each row in the group:
    apply my logic
return the result

So, in this blog, let me shine a spotlight on one of MariaDB’s features that deserves more attention, a hidden gem: CREATE AGGREGATE FUNCTION.

It is small, elegant, and very powerful.

And, in my opinion, MariaDB’s approach is better than PostgreSQL’s for many real-world use cases. PostgreSQL has custom aggregates, of course, but they are defined around a state type, transition functions, optional final functions, combine functions, serialization functions, and so on. That model is extremely powerful, especially for planner integration and parallel execution, but it is not how most users naturally think when they want to implement an aggregate.

MariaDB takes a different path.

In MariaDB, a stored aggregate function looks like a stored function that can iterate over the rows of the current group.

That is the important part.

You write the aggregate logic in SQL.

You initialize variables, loop through the rows of the group, and consume each row with FETCH GROUP NEXT ROW, and return the final value when there is no more data for that group.

This makes the custom aggregates approachable. You do not need to decompose your idea into multiple support functions. You do not need to invent an artificial internal state type. You can just write the algorithm.

A reminder: what is a custom aggregate?

An aggregate function takes multiple rows and returns one result.

For example:

SELECT department, SUM(salary)
FROM employees
GROUP BY department;

SUM() receives all salary values for each department and returns a single value per department.

With CREATE AGGREGATE FUNCTION, we can create our own SUM(), AVG(), MEDIAN(), or something much more domain-specific.

This is especially useful when the aggregation is not just arithmetic.

MariaDB’s syntax

The basic structure is:

DELIMITER //

CREATE AGGREGATE FUNCTION my_aggregate(x INT) RETURNS INT
BEGIN
  DECLARE result INT DEFAULT 0;

  DECLARE CONTINUE HANDLER FOR NOT FOUND
    RETURN result;

  LOOP
    FETCH GROUP NEXT ROW;

    -- use x here
    SET result = result + x;
  END LOOP;
END //

The magic instruction is:

FETCH GROUP NEXT ROW;

Each time it is executed, MariaDB fetches the next row from the current group and refreshes the function parameters with the values from that row.

When there are no more rows in the group, the NOT FOUND handler is executed, and the aggregate returns the final result.

This is a very natural mental model:

MariaDB Server gives you a group.
You loop over the rows in that group.
You compute your result.
You return it.

That’s it.

Why I like this approach

PostgreSQL’s aggregate model is based on state transition functions:

state + next row -> new state
state -> final result

That is a good abstraction, but it is also a framework. You need to create functions around the aggregate, then bind them together with CREATE AGGREGATE.

MariaDB’s model is more direct:

for each row in the group:
    do something
return result

That means the implementation is often much closer to the pseudo-code you had in your head.

For DBAs and developers who already know stored routines, this is a very low-friction feature. You can use SQL variables, conditions, temporary tables, loops, handlers, and built-in functions. You are writing an aggregate using the language you already use inside MariaDB Server.

MySQL does not have an equivalent SQL-level CREATE AGGREGATE FUNCTION. It has built-in aggregate functions, and it has some spatial aggregate functionality such as ST_Collect(), but it does not let you define your own aggregate in SQL like this. The same applies to forks that closely track MySQL behavior: there is nothing comparable to MariaDB’s stored aggregate functions.

A simple example: longest_streak()

Let’s start with a simple example that is useful and cannot be written as a normal SQL aggregate expression.

Imagine we have a table tracking whether a user was active on a given day:

CREATE TABLE user_activity ( 
   user_id INT NOT NULL, 
   activity_date DATE NOT NULL, 
   active TINYINT NOT NULL, 
   PRIMARY KEY (user_id, activity_date)
);

Some sample data:

INSERT INTO user_activity VALUES 
   (1, '2025-01-01', 1), 
   (1, '2025-01-02', 1), 
   (1, '2025-01-03', 0), 
   (1, '2025-01-04', 1), 
   (1, '2025-01-05', 1), 
   (1, '2025-01-06', 1), 
   (2, '2025-01-01', 1), 
   (2, '2025-01-02', 0), 
   (2, '2025-01-03', 1);

What we want is simple:

For each user, what is the longest streak of consecutive active days?

For user 1, the longest streak is 3, from 2025-01-04 to 2025-01-06.

This is not a normal aggregate like SUM(active).

SUM(active) would tell us how many active days the user had in total. It would not tell us the longest uninterrupted sequence.

To compute a streak, we need to remember what happened while reading the rows:

if the current row is active, increase the current streak
if the current row is not active, reset the current streak
keep track of the longest streak seen so far

That is exactly the kind of logic MariaDB’s CREATE AGGREGATE FUNCTION can express very naturally:

DELIMITER // 

CREATE OR REPLACE AGGREGATE FUNCTION longest_streak(active TINYINT) 
RETURNS INT
DETERMINISTIC
NO SQL 
BEGIN 
  DECLARE current_streak INT DEFAULT 0; 
  DECLARE best_streak INT DEFAULT 0; 
  
  DECLARE CONTINUE HANDLER FOR NOT FOUND 
     RETURN best_streak; 
  
  LOOP FETCH GROUP NEXT ROW; 
  
    IF active = 1 THEN  
      SET current_streak = current_streak + 1; 
      
      IF current_streak > best_streak THEN 
         SET best_streak = current_streak; 
      END IF; 
    ELSE 
      SET current_streak = 0; 
    END IF; 
  END LOOP; 
END // 

DELIMITER ;

Now we can use it like any other aggregate:

SELECT
  user_id,
  longest_streak(active) AS longest_active_streak
FROM user_activity
GROUP BY user_id;

The result is:

+---------+-----------------------+
| user_id | longest_active_streak |
+---------+-----------------------+
|       1 |                     3 |
|       2 |                     1 |
+---------+-----------------------+

Of course, this kind of problem can often be solved with a more complex query using window functions, subqueries, or common table expressions. But that is the point: the logic becomes a query trick.

This is a small example, but it clearly shows the value of stored aggregate functions. While useful, it may not work in more complex queries because it depends on the ordering of rows within a group. Fixing it is possible, but outside the scope of this post.

A more advanced example: group_union() for geometries

The previous example was intentionally simple. Now let’s look at something a bit more domain-specific.

MariaDB has ST_UNION(g1, g2), which returns the union of two geometries. That is useful, but sometimes we want to union all geometries in a group.

Something like this:

SELECT region_id, group_union(geom)
FROM parcels
GROUP BY region_id;

The idea is simple:

start with NULL
read each geometry in the group
ignore NULL values
if this is the first geometry, keep it
otherwise union it with the accumulated result
return the final geometry

Here is a simple implementation:

DELIMITER //

CREATE OR REPLACE AGGREGATE FUNCTION group_union(g GEOMETRY)
RETURNS GEOMETRY
DETERMINISTIC
NO SQL
BEGIN
  DECLARE result GEOMETRY DEFAULT NULL;

  DECLARE CONTINUE HANDLER FOR NOT FOUND
    RETURN result;

  LOOP
    FETCH GROUP NEXT ROW;

    IF g IS NOT NULL THEN
      IF result IS NULL THEN
        SET result = g;
      ELSE
        SET result = ST_UNION(result, g);
      END IF;
    END IF;
  END LOOP;
END //

DELIMITER ;

Now we can create a small example table:

CREATE TABLE parcels (
  region_id INT NOT NULL,
  parcel_id INT NOT NULL,
  geom GEOMETRY NOT NULL
);

Insert some polygons:

INSERT INTO parcels VALUES
  (1, 1, ST_GeomFromText('POLYGON((0 0,0 1,1 1,1 0,0 0))')),
  (1, 2, ST_GeomFromText('POLYGON((1 0,1 1,2 1,2 0,1 0))')),
  (2, 3, ST_GeomFromText('POLYGON((10 10,10 11,11 11,11 10,10 10))')),
  (2, 4, ST_GeomFromText('POLYGON((11 10,11 11,12 11,12 10,11 10))'));

And then aggregate them:

SELECT
  region_id,
  ST_AsText(group_union(geom)) AS union_wkt
FROM parcels
GROUP BY region_id;

The result is one geometry per region_id.

+-----------+------------------------------------------+
| region_id | union_wkt                                |
+-----------+------------------------------------------+
|         1 | POLYGON((0 0,0 1,2 1,2 0,0 0))           |
|         2 | POLYGON((10 10,10 11,12 11,12 10,10 10)) |
+-----------+------------------------------------------+

This is exactly the kind of problem where a custom aggregate is useful. The function is not just collecting values. It applies domain-specific logic to all rows in a group.

Why not just use ST_Collect()?

ST_Collect() and ST_UNION() are not the same operation.

Collecting geometries means putting them into a collection. Unioning geometries means computing the geometric union: boundaries can be dissolved, overlapping shapes can be merged, and the result can be a geometry different from the individual inputs.

So group_union() is useful when the business question is:

“What is the resulting area covered by all these geometries?”

not merely:

“Can you put all these geometries in one collection?”

More than spatial

The geometry example is nice because it is visual and practical, but this feature is not limited to GIS.

Stored aggregate functions are useful when the aggregate requires logic that is awkward, unreadable, or impossible to achieve with built-in SQL aggregates alone.

Some examples:

first_non_empty(value)
json_object_merge_agg(doc)
histogram_bucket(value)
custom_score(event_type, event_weight, created_at)
median(value)

The important point is that the aggregate can be expressed as procedural SQL over the rows of a group.

That is a very powerful abstraction.

A note about performance

This does not mean every custom aggregate should be written this way.

Built-in aggregate functions are optimized, implemented internally, and should always be preferred when they already do what you need.

A stored aggregate function is best when:

the logic is domain-specific
the aggregate does not exist as a built-in
keeping the logic inside the database simplifies the application
the function improves readability and reuse
correctness and maintainability are more important than trying to force everything into a huge SQL expression

As always: test with your data, check execution plans, and benchmark.

But the productivity gain is real.

List them all

As a DBA, you might want to list all available custom aggregate procedures on your MariaDB Server.

This is the query to fetch them:

SELECT db, name, type, aggregate, created 
FROM mysql.proc WHERE type = 'FUNCTION'   
 AND aggregate = 'GROUP' ORDER BY db, name;
+------+-----------------+----------+-----------+---------------------+
| db   | name            | type     | aggregate | created             |
+------+-----------------+----------+-----------+---------------------+
| fred | group_union     | FUNCTION | GROUP     | 2026-06-03 22:54:17 |
| fred | longest_streak  | FUNCTION | GROUP     | 2026-06-04 11:11:20 |
+------+-----------------+----------+-----------+---------------------+
2 rows in set (0.001 sec)

Conclusion

CREATE AGGREGATE FUNCTION makes MariaDB more extensible from SQL itself.

You do not need a plugin.
You do not need a C UDF.
You do not need to change the server.
You do not need to push the logic to the application.

You can create a reusable aggregate directly in the database.

And the syntax maps very naturally to how people think about aggregation: loop over the rows of the current group and return a result.

That is simple.
That is powerful.
That is very MariaDB.

For me, this is one of those features that shows how MariaDB Server can innovate in practical ways while still keeping SQL approachable.

PostgreSQL’s model is powerful, but MariaDB’s model is nicer to write.

And MySQL has nothing comparable.

So next time you need a custom aggregation, remember that in MariaDB, the answer might be as simple as: CREATE AGGREGATE FUNCTION ... and a loop over the group!

As usual, enjoy MariaDB!