Post-mortem: PHP and MariaDB Docker issue

Years ago, I watched a film with my children (now adults themselves) called Meet the Robinsons. A running theme of the film is that it is OK to make mistakes, because it is from those mistakes that we learn and “keep moving forward”.

An unfortunate perfect storm of several problems meant that on 21 February 2024, the mariadb:latest Docker image would not work with PHP and NodeJS clients. One of the things I helped introduce at the MariaDB Foundation is the practice of doing post-mortems, not just when things go wrong but when they go right too. These are not meant to figure out who was at fault or to pin blame. They are meant to find the failings in our processes that allowed things to happen, so that we can learn not to repeat them and put things in place so they can’t happen again. We actually do these for each release cycle to document what happened and to see if there are things we can automate and improve in the future.

In this case, it affected many users of PHP applications who use Docker to get MariaDB Server. So, I figured we should be as transparent as possible as to what happened, how we resolved it, and how we intend to stop it happening in future.

With that, let’s dive in.

Foreword

The Docker update of mariadb:latest to 11.3 included a Debian package configuration file change for the default collation. That collation triggered a bug in PHP mysqlnd and the NodeJS connector which prevented applications from logging in.

In MySQL 4.0 the default collation was sent from server to client as a single-byte ID from 1 to 255. In MySQL 4.1 (released in 2003) the direction of this communication was changed to client to server, but server to client was still supported for compatibility, and remains so to this day. MariaDB’s newer collations use IDs higher than 255, so the compatibility collation byte is sent as zero and there are extended bytes elsewhere.

The configuration file in the Debian packages was updated to use a more modern default server collation in PR 2775, which was merged in 11.3. PHP’s mysqlnd and the NodeJS connector still validate the server collation byte: when they see a zero, they generate an error, and this error aborts the authentication process.
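
To make the failure mode concrete, here is a minimal Python sketch of the legacy collation byte in the server's initial handshake and of a client that rejects a zero. It is an illustration only, not the mysqlnd or NodeJS connector code. The packet layout is a simplified reading of the handshake format, the function names and `HandshakeError` are hypothetical, and the ID 2304 is just a stand-in for "some ID above 255".

```python
import struct

LEGACY_ID = 45      # utf8mb4_general_ci, fits in one byte
EXTENDED_ID = 2304  # illustrative ID above 255; not taken from the post


class HandshakeError(Exception):
    """Raised by the strict client when it dislikes the collation byte."""


def build_handshake_prefix(server_version: str, collation_id: int) -> bytes:
    """Build the first part of a simplified initial handshake payload."""
    # Only one byte is available here, so IDs above 255 are sent as zero.
    compat_byte = collation_id if collation_id <= 255 else 0
    return (
        bytes([10])                                   # protocol version
        + server_version.encode("ascii") + b"\x00"    # null-terminated version
        + struct.pack("<I", 1)                        # connection id
        + b"A" * 8                                    # first part of the scramble
        + b"\x00"                                     # filler
        + struct.pack("<H", 0)                        # lower capability flags
        + bytes([compat_byte])                        # legacy collation byte
    )


def read_collation_byte(payload: bytes) -> int:
    """Locate the legacy collation byte in the simplified payload."""
    end_of_version = payload.index(b"\x00", 1)
    # Skip the terminator, connection id, scramble, filler and flags.
    return payload[end_of_version + 1 + 4 + 8 + 1 + 2]


def strict_client_check(collation_byte: int) -> int:
    """Mimic a client that still validates the server-sent collation."""
    if collation_byte == 0:
        raise HandshakeError("server sent unknown collation 0")
    return collation_byte


def tolerant_client_check(collation_byte: int) -> int:
    """Mimic a client that ignores the legacy byte and picks its own collation."""
    return collation_byte


for cid in (LEGACY_ID, EXTENDED_ID):
    byte = read_collation_byte(build_handshake_prefix("11.3.2-MariaDB", cid))
    try:
        strict_client_check(byte)
        outcome = "login continues"
    except HandshakeError as exc:
        outcome = f"login aborted ({exc})"
    print(f"collation id {cid}: byte={byte}, strict client: {outcome}")
```

The point of the sketch is that the server does nothing invalid by sending zero. The login only fails because the strict client treats a byte it no longer needs as mandatory.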

Sequence of events

To shed light on what happened, here is a detailed account of events. If you are only interested in the big picture, feel free to skip ahead to the “Causes” heading.

Where provided, times are in UTC.

2023-10-19

A pull request was merged which changed the default server collation for Debian packages in a configuration file. Note: MariaDB Docker images use Debian packaging.

2023-12-08

Jira MDEV-32975 was opened, in which a user found that the above commit broke PHP applications when using the Debian packages for MariaDB 11.3.1.

2024-01-26

Alexander Barkov created a compatibility patch as a resolution for MDEV-32975 and put it up for code review.

2024-02-20 23:45

The MariaDB Docker image was updated so that mariadb:latest pointed to 11.3.2 instead of 11.2.3, as 11.3.2 was the new latest stable release of MariaDB.

2024-02-21 07:34

A community user opened a MariaDB Docker issue about authentication failures when using the Docker image.

2024-02-21 09:44

A PHP project GitHub issue was opened by the community for the same reason.

2024-02-21 10:21

Andrew Hutchings was alerted to the problem via Twitter and started investigating.

2024-02-21 10:34

Both Vicențiu Ciorbaru (who noticed via the Docker GitHub issue) and Andrew alerted the rest of the Foundation staff internally at the same time.

2024-02-21 11:32

Andrew identified the cause as MDEV-32975. He requested that it be escalated to Blocker status, which happened immediately. The reason for the severity change was communicated to the core MariaDB developers.

2024-02-21 11:52

After some internal discussions, Andrew posted a workaround on the MariaDB Docker GitHub ticket. Within half an hour, the community had tested this, and it appeared to be successful.

2024-02-21 16:59

After lots of research and conversation, Andrew figured out the details of the root cause. He updated the PHP ticket with these details, which caused it to be reopened.

2024-02-21 17:45

Andrew internally passed on all the information so far to the rest of the team at the end of his day, but continued to assist community members with this issue into the night. He later communicated to the team that NodeJS was also found to be affected.

2024-02-21 21:56

Daniel Black, who manages the MariaDB Docker image, came on-line and started to work on the problem.

2024-02-21 22:50

Daniel opened CONJS-281 to cover the issue for NodeJS.

2024-02-22 00:52

Daniel opened a pull request to MariaDB’s Docker repository to work around the problem.

2024-02-22 01:06

Daniel created a pull request for Docker which pointed to the MariaDB Docker update with the workaround. It was merged 4 minutes later.

2024-02-22 04:43

The updated Docker image was published on Docker Hub.

2024-02-22 05:04

The MariaDB Docker GitHub ticket was marked as closed. External communication happened afterwards.

Causes

Unfortunately, a perfect storm of several problems caused this. If even one of them had not occurred, the incident would not have happened:

  1. PHP mysqlnd and the NodeJS connector tried to validate data in a byte that hasn’t been used for 20 years. We did not expect this, although it is understandable that the check was never removed.
  2. The bug was actually fixed by MDEV-32975 (Default charset doesn’t work with PHP MySQLi extension), but the patch was not reviewed prior to 11.3 GA so it wasn’t in the code base.
  3. The combination of the above two led to the configuration change in PR 2775 (MDEV-32336 deb default config – use collation-server = utf8mb4_uca1400_ai_ci) causing the breakage.
  4. We have a CI test that would have picked up the issue early, but MDBF-637 (missing debuginfo packages due to reprepro) means that the CI builder in question is currently broken.
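
Cause 3 came down to a single line in a packaged configuration file. As a rough illustration of how such a line could be flagged before release, here is a small Python sketch that scans option-file text for `collation-server` and reports collations that are likely to need an ID above 255. This is not a check the Foundation runs: the sample file content and the function names are hypothetical, and the `uca1400` substring test is an assumption standing in for a real lookup of collation IDs on the server.

```python
from typing import Optional

# Option groups that a MariaDB server reads from its option files.
SERVER_SECTIONS = {"mysqld", "mariadb", "mariadbd", "server"}

# Hypothetical file content, modelled on the setting named in MDEV-32336.
SAMPLE_CNF = """\
[client]
default-character-set = utf8mb4

[mariadb]
# newer default collation
collation-server = utf8mb4_uca1400_ai_ci
"""


def find_server_collation(cnf_text: str) -> Optional[str]:
    """Return the last collation-server value set in a server section."""
    section = None
    found = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and !include directives.
        if not line or line.startswith(("#", ";", "!")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section not in SERVER_SECTIONS:
            continue
        key, _, value = line.partition("=")
        # Option names accept both dashes and underscores.
        if key.strip().replace("_", "-").lower() == "collation-server":
            found = value.strip()
    return found


def may_need_extended_id(collation: str) -> bool:
    """Assumption: UCA 14.0.0 collations are the ones with IDs above 255."""
    return "uca1400" in collation.lower()


collation = find_server_collation(SAMPLE_CNF)
if collation is not None and may_need_extended_id(collation):
    print(f"warning: {collation} may break clients that validate the legacy byte")
```

A check like this is only a cheap early warning. An end-to-end test with a real PHP or NodeJS client, such as the one described in cause 4, is what actually proves that logins work.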

Insights

  1. The communities around WordPress and Nextcloud in particular are very responsive and communicative when there are issues.
  2. Debian packages for 11.3 are still affected. Under the current release model there will be no more 11.3 releases, so this will be fixed in 11.4.
  3. We do have processes in place to avoid and detect issues like this, but when stars are badly aligned, processes do fail.

Recommendations

These are the recommendations we are putting into place to resolve this properly and stop something similar happening in the future.

  1. The patch for MDEV-32975 needs to be reviewed and merged, or corrected and merged if more work is needed.
  2. The release notes for 11.3 need to be updated to notify Debian users who could be affected by this, along with how to implement the workaround.
  3. We need to fix MDBF-637 so that the tests which would have caught this are run, and to make amd64-rhel8-wordpress a required tester for pull requests.
  4. We should make this post-mortem public via a blog post (this blog post).

Conclusion

Hopefully, this gives some insight into what happens when a problem arises with a MariaDB Server release. We would like to thank the community for their rapid reporting, response and feedback. Without you, we wouldn’t have been able to resolve this as quickly as we did.

Featured image from Digits.co.uk, used under a CC license.

Published by Andrew Hutchings

Chief Contributions Officer for the MariaDB Foundation