How deep can a bug be?
Last year I filed a bug report, MDEV-33603, on what looked like a benign problem: the optimizer taking a different code path in a particular trivial-looking test. Its benign-looking nature led to me not looking at it until last week. The “benign” bug, as it turned out, is a bug in an OpenSSL optimization on IBM POWER, which is maybe not the lowest level of “how deep”, but it’s certainly a long way down from the high-level (above storage engines) optimizer decisions in MariaDB.

I feel I need to start this story by justifying why it was left so long. I’ll contain this to a paragraph. In summary, the test failure wasn’t continuous-integration-worker dependent (so not a hardware fault), was exclusively on RHEL 9 and CentOS Stream 9, and only on POWER (ppc64le). It hit two tests, and was in a MariaDB Connector, so my assumption was a quirky, one-off, low-impact compiler bug. It also started in MariaDB 11.4, for reasons that will remain a bit of a mystery until later.
So down the rabbit hole we start, looking at what’s going on. As this test case was exclusive to PPC64LE on a particular OS, getting that environment was key. IBM is a platinum sponsor, which assists in paying the entire Foundation staff and in achieving general openness, adoption and continuity; that translated into me working at my own volition on solving this bug. IBM is also a Buildbot hardware sponsor, which is what allowed us to see this bug in the first place and gave us a remotely accessible test environment to debug it.
Our Buildbot environment is container based, so a pull of quay.io/mariadb-foundation/bb-worker:centosstream9 and I’ve got the environment covered. A quick build from a clone of the server code, run the test, connect.mysql_index, and luckily the failure turned out to be always repeatable. The wrong result didn’t appear on the first query, so the test case was simplified down to the minimal SQL setup plus the single query that produced the wrong result.
The SQL query below got an incorrect query plan. There’s no ORDER BY in the query, so the rows returned were still a valid answer, just not produced the way the test planned.
SELECT * FROM t2 WHERE id <= 3
The way to debug how queries are executed, like a normal DBA on MariaDB, is to use the excellent Optimizer Trace (SET optimizer_trace='enabled=on', then read information_schema.OPTIMIZER_TRACE after running the query), which shows exactly what was considered, measured, and decided upon as the way to execute a query. An extract from what occurred on POWER in the test environment was:
"setup_range_conditions": [],
"analyzing_range_alternatives": {
"range_scan_alternatives": [
{
"index": "PRIMARY",
"ranges": ["(id) <= (3)"],
"rowid_ordered": false,
"using_mrr": false,
"index_only": false,
"rows": 2,
"cost": 0.004554858,
"chosen": false,
"cause": "cost"
}
],
...
"considered_access_paths": [
{
"access_type": "scan",
"rows": 7,
"rows_after_filter": 2,
"rows_out": 2,
"cost": 0.113371474,
"index_only": false,
"chosen": true
}
],
This shows the optimizer considered the PRIMARY KEY, estimated it would fetch two rows at a cost of 0.004554858 (an expected relative time), but rejected it ("chosen": false) on cost, and instead chose a plan costing 0.113371474, roughly 25 times more expensive. At this point I’m still considering an odd but plausible compiler bug, but to prove it I’ll need to find the cause, so I can file a Red Hat bug report to get it corrected. Before going further, checking the optimizer trace in a working environment yielded:
+ "range_scan_alternatives": [
+ {
+ "index": "PRIMARY",
+ "ranges": ["(id) <= (3)"],
+ "rowid_ordered": false,
+ "using_mrr": false,
+ "index_only": false,
+ "rows": 2,
+ "cost": 0.004554858,
+ "chosen": true
+ }
Same cost calculation, and this time it was chosen. This narrows down where in the codebase I should be looking. Using an rr recording of the same trimmed-down test case locally, plus some mtr --manual-gdb sessions in container terminals restarted many, many times (unfortunately IBM POWER doesn’t support rr yet), I narrowed down where the two executions diverged and the query planning structures changed. This ended up in get_key_scans_params, which as it happens is the function that adds "chosen": true and "chosen": false to the optimizer trace. There’s a particular point in this function that has the code form below (a runnable sketch of the decision follows the snippet):
if (read_time > found_read_time ....)
...
trace_idx.add("chosen", true);
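In spirit, the decision being made at that point is just “is this index cheaper than the best plan found so far?”. Below is a minimal, runnable sketch using the two costs from the trace above; the variable names mirror the real function, everything else is simplified.

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* read_time: the cost to beat, i.e. the best plan so far (here the table scan).
     * found_read_time: the cost just computed for the PRIMARY key range scan. */
    double read_time       = 0.113371474;
    double found_read_time = 0.004554858;

    bool chosen = read_time > found_read_time;              /* the cheaper index should win */
    printf("\"chosen\": %s\n", chosen ? "true" : "false");  /* prints "chosen": true */
    return 0;
}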
At this point the debugger showed the strangest thing: read_time was 0. It had a value when the function started, the 0.113371474 cost. The found_read_time of 0.004554858 (lower, but not less than 0) is not chosen as the query plan, because the cost to beat is now 0 (zero; how do I do a faster-than-0 query plan!?), instead of the original value from the function start. To get the obvious out of the way: there’s no code that changes its value (until after this if condition), it isn’t passed by reference (pointer) anywhere, and it’s passed by value (copied) in here, so it has changed value without apparent cause.
A small peek at the disassembled code (I’ll save you from the beauty) for this function shows the value is passed in as register f1, it’s then copied to f29, and the code (above) that does the comparison is using f29. The 64-bit ELF ABI Specification for OpenPOWER Architecture document (the shorter version is “how function calls work on OpenPOWER”) describes f29 as a register that must be saved (and restored) by the callee (not the caller), and something down the stack isn’t doing that.
Hours later I find myself 20 layers deeper in the call stack, where the CONNECT storage engine, the type of the t2 table, is performing a TLS connection to a remote server. Why TLS? It’s the default from 11.4, even, so it appears, when connecting over a unix socket (and there is the answer to why this only started in 11.4). To confirm what I was seeing, I disabled TLS connections for unix sockets initiated from the MariaDB server with the minor test patch below.
diff --git a/sql-common/client.c b/sql-common/client.c
index e6ad0e97c18..7cb8c070f1c 100644
--- a/sql-common/client.c
+++ b/sql-common/client.c
@@ -2089,7 +2089,7 @@ static int send_client_reply_packet(MCPVIO_EXT *mpvio,
if (mpvio->db)
mysql->client_flag|= CLIENT_CONNECT_WITH_DB;
- if (vio_type == VIO_TYPE_NAMEDPIPE)
+ if (vio_type == VIO_TYPE_NAMEDPIPE || vio_type == VIO_TYPE_SOCKET)
{
mysql->server_capabilities&= ~CLIENT_SSL;
mysql->options.use_ssl= 0;
Test case resolved, but that doesn’t solve the problem. A real-world user just needs to point a CONNECT engine table at a remote server and they’ll hit the same issue. The TLS nature of a connection shouldn’t be changing a query plan. One Red Hat bug report against OpenSSL (RHEL-122599) and a somewhat troubled sleep later, and it’s time to verify whether the problem is still in OpenSSL upstream. It is (issue 28961), and I’ve now found the functions to blame (ppc_aes_gcm_encrypt/ppc_aes_gcm_decrypt) that clobber the f29 register, 34 functions higher in the call stack. Notably the AES calculation clobbers the v29 vector register, which shares half its register with the f29 floating point register in the POWER architecture. My own attempts to implement the POWER assembly save/restore failed, but in the meantime git blame allowed me to ping the original author, Danny Tsen from IBM.
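To make that failure mode concrete, here is a small self-contained sketch of it. This is not MariaDB or OpenSSL code; it’s an illustration I put together, assuming ppc64le, GCC-style inline assembly and a POWER8-or-later CPU, and the names (clobber_demo.c, bad_callee, good_callee, cost_across_call) are made up for the demo. One callee scribbles over vs29/f29 without preserving it, the way the buggy hand-written assembly does; the other preserves it, the way the ELFv2 ABI requires. Exact register allocation is up to the compiler, so treat it as a demonstration rather than a test case.

/* clobber_demo.c (hypothetical file name) - build with gcc -O2 on ppc64le */
#include <stdio.h>

/* Buggy callee: zeroes VSX register 29, whose doubleword 0 is the f29
 * floating point register, and neither saves it nor tells the compiler
 * it was touched. This mimics the offending hand-written assembly. */
__attribute__((noinline)) static void bad_callee(void)
{
    __asm__ volatile("xxlxor 29, 29, 29");
}

/* Correct callee: same scribbling, but the f29 half is saved first and
 * put back before returning, which is the callee-saved contract. Real
 * code would spill to its stack frame with stfd/lfd; a GPR stash via
 * mfvsrd/mtvsrd keeps this C sketch simple. */
__attribute__((noinline)) static void good_callee(void)
{
    unsigned long saved;
    __asm__ volatile("mfvsrd %0, 29" : "=r"(saved));  /* save f29 (vs29 dword 0) */
    __asm__ volatile("xxlxor 29, 29, 29");            /* scribble over vs29/f29  */
    __asm__ volatile("mtvsrd 29, %0" : : "r"(saved)); /* restore the f29 half    */
}

/* Park a cost value in f29, a register the ABI says callees must preserve,
 * make the call, then read f29 back. For brevity this helper itself bends
 * the rules by touching f29 behind the compiler's back; it's a demo, not
 * production code. */
__attribute__((noinline)) static double cost_across_call(double cost,
                                                         void (*callee)(void))
{
    double after;
    __asm__ volatile("fmr 29, %0" : : "d"(cost));
    callee();
    __asm__ volatile("fmr %0, 29" : "=d"(after));
    return after;
}

int main(void)
{
    double read_time = 0.113371474;  /* the cost the range scan had to beat */

    printf("after good callee: %f\n", cost_across_call(read_time, good_callee));
    printf("after bad  callee: %f\n", cost_across_call(read_time, bad_callee));
    /* Expected output: 0.113371 then 0.000000, and since 0.0 > 0.004554858
     * is false, the cheap PRIMARY key range scan loses, exactly as above. */
    return 0;
}

A real fix lives in the assembly itself and has the good_callee shape: save the non-volatile VSX and floating point registers the code uses on entry, and restore them on exit.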
Danny Tsen responds and comes back with a fix in PR #28990 two days later. Thank you Danny ❤️.
So a little benign bug from last year is now described on the PR (by me) as:
On ppc64le systems, applications may exhibit undefined behavior because the ppc_aes_gcm_encrypt and ppc_aes_gcm_decrypt functions overwrite certain floating-point and VSX registers. As a result, values stored in these registers by calling functions may be replaced with intermediate AES-GCM computation data upon return, leading to data corruption or unpredictable results.
ref: https://github.com/openssl/openssl/pull/28990#issuecomment-3458481314
Any POWER/ppc64le application calling into OpenSSL while it has values in a specific set of VSX or floating point registers might be exhibiting unexpected behavior because of this. Yes, yikes: code that has been core in OpenSSL for almost 4 years, and I’m the first to find it?
IBM people inadvertently created this bug in software for their own hardware (they are only human), and they submitted a fix rapidly when advised. That’s the nature of open source: quick bug fixes, without contracts, just communication and rapid evolution. It’s IBM’s provisioning of hardware to open source projects like MariaDB that made this bug visible and debuggable, and many will benefit.
It’s IBM’s generosity in being a Foundation Sponsor, a major sponsor at that, that enables me to dive into these rabbit holes. Sometimes that creates very tangible benefits for the entire ecosystem, and in this case for many of their customers. The stability of year-on-year sponsorship facilitates stable employment where engineers can grow skills, fix bugs, and occasionally blog about them.
Engineering makes open source happen: smooth and often unnoticed effort. That effort being paid for, via donations and sponsorship or whatever funding model a project has, is the only way a project survives.
Now, after validating Danny Tsen’s fix, I’m just waiting for it to be reviewed and merged into OpenSSL. After that it gets backported, picked up by RHEL (+ FIPS re-validation?) and other distributions, and finally makes it back to the MariaDB Foundation for us to update our Buildbot containers. Then we’ll have at least two tests green again. In the meantime, I’m enjoying a coffee, about to start the next rabbit hole adventure.