Post-mortem: PHP and MariaDB Docker issue
Years ago, I watched a film with my children (now adults themselves) called Meet the Robinsons. A running theme of the film is that it is OK to make mistakes, because it is from those mistakes that we learn and “keep moving forward”.
An unfortunate perfect storm of several problems meant that on 21st February 2024, the `mariadb:latest` Docker image would not work with PHP and NodeJS clients. Now, one of the things I helped introduce into the MariaDB Foundation is the concept of doing post-mortems, not just when things go wrong but also when they go right. These are not held to figure out who was at fault, or to pin blame, but to identify the failings in our processes that allowed something to happen, so that we can learn not to repeat them and put safeguards in place so they cannot happen again. We actually do these for each release cycle to document what happened and to see if there are things we can automate and improve in the future.
In this case, it affected many users of PHP applications who use Docker to get MariaDB Server. So, I figured we should be as transparent as possible about what happened, how we resolved it, and how we intend to stop it from happening in the future.
With that, let’s dive in.
Foreword
The Docker update of `mariadb:latest` to 11.3 included a change to the default collation in a Debian package configuration file. That collation triggered a bug in PHP's mysqlnd and the NodeJS connector which prevented applications from logging in.
In MySQL 4.0, the default collation was sent from server to client as a single-byte ID in the range 1-255. MySQL 4.1 (released in 2003) reversed this direction, so the collation is now sent from client to server, but the server-to-client byte was kept for compatibility and remains to this day. MariaDB's newer collations use IDs higher than 255, so the compatibility byte is sent as zero and the full ID is carried in extended bytes elsewhere in the handshake.
The configuration file in the Debian packages was updated to use a more modern default server collation in PR 2775, which was merged for 11.3. PHP's mysqlnd and the NodeJS connector still validate the server collation: when they see a zero, they generate an error, and that error aborts the authentication process.
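To illustrate the failure mode, here is a minimal sketch of this kind of client-side check. It is not the actual mysqlnd or NodeJS connector code (the function name and error text are invented for illustration); it only shows the logic that aborts the login:

```python
# Hypothetical sketch of the client-side collation check; not the real
# mysqlnd or NodeJS connector code. In the handshake packet, the server
# advertises its default collation as a single byte (historic IDs 1-255).
def check_server_collation(collation_byte: int) -> int:
    if collation_byte == 0:
        # MariaDB sends 0 here when the real collation ID exceeds 255
        # (e.g. utf8mb4_uca1400_ai_ci, the new 11.3 Debian default);
        # a client that rejects 0 aborts authentication at this point.
        raise ConnectionError("unknown server collation ID 0")
    return collation_byte
```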
Sequence of events
To shed light on what happened, here is a detailed account of events. If you are only interested in the big picture, feel free to skip ahead to the “Causes” section.
Where provided, times are in UTC.
2023-10-19
A pull request (PR 2775) was merged which changed the default server collation for Debian packages in a configuration file, roughly as sketched below. Note: MariaDB Docker images use Debian packaging.
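For illustration, the change amounted to a configuration entry along these lines (a sketch: the exact file, section header, and surrounding settings in the packages may differ):

```ini
# Sketch of the Debian default-collation change from PR 2775; the file
# location is illustrative. Debian packages keep server settings in
# drop-in files under /etc/mysql/mariadb.conf.d/.
[mariadb]
collation-server = utf8mb4_uca1400_ai_ci  # ID > 255; handshake byte becomes 0
```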
2023-12-08
Jira issue MDEV-32975 was opened, in which a user reported that the above change broke PHP applications using the Debian packages for MariaDB 11.3.1.
2024-01-26
Alexander Barkov created a compatibility patch as a resolution for MDEV-32975 and put it up for code review.
2024-02-20 23:45
The MariaDB Docker image was updated so that `mariadb:latest` pointed to 11.3.2 instead of 11.2.3, as this was the new latest stable release of MariaDB.
2024-02-21 07:34
A MariaDB Docker GitHub issue was opened by a community user who was having authentication issues when using the Docker image.
2024-02-21 09:44
A PHP project GitHub issue was opened by the community for the same reason.
2024-02-21 10:21
Andrew Hutchings was alerted to the problem via Twitter and started investigating.
2024-02-21 10:34
Both Vicențiu Ciorbaru (who noticed via the Docker GitHub issue) and Andrew alerted the rest of the Foundation staff internally at the same time.
2024-02-21 11:32
Andrew identified the cause as MDEV-32975. He requested it be escalated to Blocker status (it immediately was), and the reason for the severity change was communicated to the core MariaDB developers.
2024-02-21 11:52
After some internal discussion, Andrew posted a workaround on the MariaDB Docker GitHub issue, along the lines sketched below. Within half an hour, the community had tested it, and it appeared to be successful.
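The posted workaround boiled down to overriding the default server collation with one whose ID fits in a single byte. A sketch of that approach (the exact invocation and collation in the posted workaround may have differed; utf8mb4_general_ci is just one pre-11.3 choice):

```bash
# Sketch of the workaround: arguments after the image name are passed
# straight to the server, overriding the packaged default collation
# with one whose ID fits in a single byte.
docker run --name mariadb \
  -e MARIADB_ROOT_PASSWORD=my-secret-pw \
  -d mariadb:latest \
  --character-set-server=utf8mb4 \
  --collation-server=utf8mb4_general_ci
```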
2024-02-21 16:59
After a lot of research and conversation, Andrew figured out the details of the root cause and updated the PHP ticket with them, which caused that ticket to be reopened.
2024-02-21 17:45
Andrew internally passed on all the information gathered so far to the rest of the team at the end of his day, but continued to assist community members with the issue into the night. He later reported to the team that the NodeJS connector was also affected.
2024-02-21 21:56
Daniel Black, who manages the MariaDB Docker image, came online and started to work on the problem.
2024-02-21 22:50
Daniel opened CONJS-281 to cover the issue for the NodeJS connector.
2024-02-22 00:52
Daniel opened a pull request to MariaDB’s Docker repository to work around the problem.
2024-02-22 01:06
Daniel created a pull request for Docker's official images repository pointing to the MariaDB Docker update with the workaround. It was merged four minutes later.
2024-02-22 04:43
The updated Docker image was published on Docker Hub.
2024-02-22 05:04
The MariaDB Docker GitHub ticket was marked as closed. External communication happened afterwards.
Causes
Unfortunately, a perfect storm of several problems caused this. If even one of them had not occurred, we would not have had the problems we did:
- PHP's mysqlnd and the NodeJS connector try to validate data in a handshake byte that hasn't been used for 20 years. This was unexpected by us, but it is understandable that the check was never removed.
- The bug had actually been fixed by the patch for MDEV-32975 (Default charset doesn't work with PHP MySQLi extension), but that patch was not reviewed prior to the 11.3 GA, so it was not in the code base.
- The combination of the above two meant that the configuration change in PR 2775 (MDEV-32336: deb default config – use collation-server = utf8mb4_uca1400_ai_ci) caused the breakage.
- We have a CI test that would have picked up the issue early, but MDBF-637 (missing debuginfo packages due to reprepro) means the CI builder in question is currently broken.
Insights
- The communities around WordPress and Nextcloud in particular are very responsive and communicative when there are issues.
- Debian packages for 11.3 are still affected; under the current release model there will be no more 11.3 releases, so this will be fixed in 11.4.
- We do have processes in place to avoid and detect issues like this, but when the stars are badly aligned, processes can fail.
Recommendations
These are the recommendations we are putting in place to resolve this properly and to stop something similar from happening in the future.
- The patch for MDEV-32975 needs to be reviewed and merged, or corrected and merged if more work is needed.
- The release notes for 11.3 need to be updated to notify Debian users that could be affected by this, along with how to implement the workaround.
- We need to fix MDBF-637 to make sure the tests that would have caught this are run, and to make amd64-rhel8-wordpress a required tester for pull requests.
- We should make this post-mortem public via a blog post (this blog post).
Conclusion
Hopefully, this provides an insight as to what happens when a problem arises with a MariaDB Server release. We would like to thank the community for their rapid reporting, response and feedback. Without you, we wouldn’t have been able to resolve this as quickly as we did.
Featured image from Digits.co.uk, used under a CC license.
Comments
This was poorly handled. Instead of this, the MariaDB team could have:
* reverted commit https://github.com/MariaDB/server/pull/2775/files
* released 11.3.3 (with the fix) and announced that this release was an exception and that there will be no more updates, due to the new release model at https://mariadb.com/kb/en/mariadb-release-model/
Now every new user trying out MariaDB between today and the 11.4 stable release will either give up after failing to connect their app to the server OR edit the default config like a caveman (the workaround). You will lose many prospective new clients who download the latest stable (11.3) from the official website because of this.
Hi Jon! We appreciate your input on this.
In fact, the first thought that went through our minds was to consider an emergency release. We always do this when we release a server that is “unusable”, regardless of the release model. In this case, we took a step back and analyzed the affected users.
1. Most of the affected users are the ones running the docker images using the latest tag. For these users the problem is 100% resolved as of now.
2. RPM and binary tarball users are not affected at all.
3. The majority of Debian & Ubuntu users still primarily use their respective distribution repositories, which ship one of the LTS branches and are unaffected.
4. The specific users affected in this case are those using the DEB repositories on mariadb.org. Yes, these users will have to change the config to get the server to work (a sketch follows this list). We will add a specific note for this release on our downloads page about the need for the config workaround.
5. Publishing a new release takes at least one or two days from initial commit to when the packages are live, and it involves a lot of mirror syncing. Overall, it takes some time for a fix to actually reach users through the normal channels, and our focus was on fixing the problem as quickly as possible for those actually reporting outages, primarily container users in this case.
6. We may still decide to do an 11.3.3 release if we get additional complaints from the community.
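For the affected DEB-repository users mentioned in point 4, the config change is an override along these lines (a sketch: the drop-in file name is invented, and any collation with an ID below 256, such as utf8mb4_general_ci, would do):

```ini
# Hypothetical drop-in file, e.g.
# /etc/mysql/mariadb.conf.d/99-collation-workaround.cnf
# Overrides the 11.3 default with a collation whose ID fits in one byte.
[mariadb]
collation-server = utf8mb4_general_ci
```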
I like the timeline, the sequence of events, the full disclosure, and whatnot.
But I can’t help wondering: can you post-mortem the MySQL Connector/NET breakage that lasted 1.5 years (https://stackoverflow.com/questions/74060289/mysqlconnection-open-system-invalidcastexception-object-cannot-be-cast-from-d/78025438#78025438), in an LTS release, including apps depending on this connector such as PowerBI and Toad, which did not seem to bother you much? Debian was released with this LTS, and the connector would not work. There was no configuration file that could be changed to fix it. The only workaround was recompilation with another connector (which is of course out of the question for PowerBI or Toad). No urgency, no fixes in firehose mode, no talk of a re-release. Why not?
While there are clearly missteps and mistakes, the response seems to have had a lot of successes: clear communication, a focus on immediate workarounds and stability while coming up with a plan for the long-term solution, collaboration with the affected communities on long-term fixes they can implement… Lots of good things.
One thing that stands out to me is that `:latest` is generally considered an unsafe, moving tag; I only use it in my CI testing to keep an eye out for possible problems like this. That might color my interpretation of the incident quite a bit. The fact that the tag moved at 11.3.2 caused a good bit of confusion for me, as I would have expected 11.3.0 to be the cut-over, so I was looking in the wrong places and having trouble understanding some of the communications. It wasn't until this write-up that it really made sense.
Another thing that stands out: do major language connectors and standard query features not get tested against each release? This seems like a step that would trigger blockers, or at least some sort of outreach to language maintainers about potential breaking changes.
FWIW, the Drupal community was quietly watching and ready to respond as needed too 😉
Thanks for the great write up!
Hi James,
Many thanks for your comment.
I believe it is Docker convention that `:latest` is the latest stable release of the project. 11.3.2 was the newly released GA for the 11.3 series, which is why the tag switched to it from the 11.2 series at that point (11.3.1 was an RC). But we should consider a bleeding-edge tag of some description, exactly for the use case you mention. I shall talk to the team about this.
We do normally test against many language connectors and features, but clearly we need to improve on that; steps are being taken right now to bring those issues to the surface.
Glad the Drupal project was watching, hoping to see some of your members at the CloudFest Hackathon 🙂