MariaDB metrics errors – a post-mortem
I’m going to start this blog post by saying that I made a mistake, a mistake that means all of the metrics blog posts so far have been made with erroneous data. As part of our openness value I will give a post-mortem of the issue here.
Before we look into what went wrong, I first need to give a bit of background. The commit metrics are generated using a tool called “gitdm”, a “Git Data Miner” that was designed to generate commit statistics for the Linux kernel. Our fork of it lives in the metrics repository and includes some customisations that fit MariaDB Server’s needs better.
The way gitdm works is that you generate a git log and pipe it into the tool, which then produces the metrics output files based on this data. We have our own scripts to automate all of this for the time periods that matter for the MariaDB Server metrics that the MariaDB Foundation generates.
When generating the git log to be processed, it is important that the commits for all active development branches are included. This is because there are fixes that could go into one version which are not necessary for later versions. Git is smart enough to de-duplicate commits that have been merged between branches when you retrieve a commit log for multiple branches. This means that code changes that go into many versions of MariaDB should only be counted once.
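To illustrate the de-duplication described above, here is a minimal Python sketch. It does not call git at all: the commit IDs are invented placeholders, and the set union only models the idea that a commit reachable from several branches is listed once in a combined log.

```python
# Toy model of how a multi-branch commit log counts shared commits once.
# The commit IDs below are invented placeholders, not real MariaDB commits.
reachable = {
    "10.5": {"a1", "b2", "c3"},
    "10.6": {"a1", "b2", "c3", "d4"},  # everything from 10.5 merged up, plus one fix
}


def unique_commits(branches, reachable):
    """Return the set of commits reachable from any of the given branches."""
    seen = set()
    for branch in branches:
        seen |= reachable[branch]
    return seen


# Adding up separate per-branch logs counts the shared commits twice.
naive_total = sum(len(commits) for commits in reachable.values())
# A single log over both branches counts each commit once.
deduplicated_total = len(unique_commits(["10.5", "10.6"], reachable))

print(naive_total, deduplicated_total)
```

The important detail is that this only works for commits that are literally the same commit on both branches. A change that was recreated under a new commit ID would not be merged by the union.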
When initially setting things up and trialling gitdm I used the --all parameter on git to retrieve the commit information from all branches. This was then included in the scripts. What I overlooked is that this is a little bit too broad for MariaDB Server.
MariaDB Server’s git tree also includes things such as feature branches by the MariaDB developers, as well as preview branches. In addition, the --all parameter pulls in git tags and pretty much everything that could be considered a commit.
I realised the mistake a couple of weeks ago, notified the MariaDB Foundation CEO and estimated that the error rate should be in the region of 5-10%. Last week I made a correction so that the script only used the commits from the active development branches (the branches called 10.0 – 10.12). Based on this I was able to find the extent of the problem.
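The difference between the two approaches can be sketched in Python as follows. This is not the code from the metrics repository. The helper names, the `origin` remote, the `--numstat -M` output flags and the 2019 date range are all assumptions made for illustration. Only the `--all` parameter and the branch names 10.0 to 10.12 come from the text above.

```python
import subprocess


def release_branches(first_minor=0, last_minor=12, remote="origin"):
    """Build ref names such as origin/10.0 ... origin/10.12 (remote name assumed)."""
    return [f"{remote}/10.{minor}" for minor in range(first_minor, last_minor + 1)]


def git_log_command(refs, since=None, until=None):
    """Assemble a git log command restricted to the given refs."""
    # The output flags are an assumption; the real scripts may use others.
    command = ["git", "log", "--numstat", "-M"]
    if since:
        command.append(f"--since={since}")
    if until:
        command.append(f"--until={until}")
    return command + list(refs)


def run_git_log(repo_path, refs, since=None, until=None):
    """Run git log inside repo_path and return its text output."""
    result = subprocess.run(
        git_log_command(refs, since, until),
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Too broad: every ref, including feature branches, preview branches and tags.
too_broad = git_log_command(["--all"])
# Narrower: only the active development branches, for a hypothetical date range.
corrected = git_log_command(release_branches(), since="2019-01-01", until="2019-12-31")
```

Note that `run_git_log` will fail if any of the listed refs does not exist in the local clone, so in practice the branch list would need to match what the repository actually contains.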
I wasn’t too far off with my estimate for 2019: the error was around 13%. Unfortunately things get worse with the data for subsequent years.
I then investigated why the error rate was so high and found two causes, one expected, one unexpected.
The expected cause was feature branches that had not yet been merged. There is a lot of in-progress development in MariaDB. Some of it has been ongoing for years, but a few things have been left behind. It should be expected that the number of in-development features will increase as we get closer to the present.
As for the unexpected, this is to do with preview branches. There is a branch called preview-10.11-preview as an example, which has a different history from 10.11 even though it contains many of the same commits. This means that many commits in the preview branch were counted twice. With 10.11 and the preview branch being more recent branches, it stands to reason that this would also cause a higher error rate as we get closer to the present.
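One way to spot this kind of double counting is sketched below in Python. It rests on an assumption the post does not spell out: that the shared changes exist under different commit hashes on the two branches, which is why git cannot de-duplicate them. The sample commits are invented, and fingerprinting on author, author date and subject is only a heuristic. Comparing the IDs produced by git patch-id would be a stricter alternative.

```python
from collections import defaultdict
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str
    author: str
    author_date: str
    subject: str


def find_duplicated_changes(commits):
    """Group commits that look like the same change under different hashes."""
    by_fingerprint = defaultdict(set)
    for commit in commits:
        # Heuristic fingerprint: these fields normally survive a rebase or cherry-pick.
        fingerprint = (commit.author, commit.author_date, commit.subject)
        by_fingerprint[fingerprint].add(commit.sha)
    return {fp: shas for fp, shas in by_fingerprint.items() if len(shas) > 1}


# Invented sample data: the first two entries are the same change on two branches.
log = [
    Commit("1111aaa", "dev@example.org", "2022-09-01T10:00:00", "Add feature X"),
    Commit("2222bbb", "dev@example.org", "2022-09-01T10:00:00", "Add feature X"),
    Commit("3333ccc", "dev@example.org", "2022-09-02T09:30:00", "Fix typo"),
]

duplicates = find_duplicated_changes(log)
print(len(duplicates))
```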
This second cause took a little while to figure out, because it required several dives into the git log to look at unexpected differences. From there I had to dig into the history and origin of each one to find the source of the problem.
Learnings from this
There is the old adage “validate your inputs”. In this case it is very true. I neglected to validate the input to gitdm well enough to spot the mistake I had made. That being said, when a mistake was found it was raised immediately and fully investigated.
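As an illustration of what such input validation could look like, here is a small Python sketch that compares the commit count from a broad log against the count from a trusted reference set of branches. The function names, the 2% tolerance and the sample counts are hypothetical, and the post does not say how the error percentages were actually calculated, so the formula is an assumption too.

```python
def relative_excess(broad_count, reference_count):
    """Fraction by which the broad count exceeds the reference count."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return (broad_count - reference_count) / reference_count


def input_looks_sane(broad_count, reference_count, tolerance=0.02):
    """Return True when the two counts differ by no more than the tolerance."""
    return abs(relative_excess(broad_count, reference_count)) <= tolerance


# Invented counts, chosen only so that the excess works out to 13%.
excess = relative_excess(1130, 1000)
print(f"{excess:.0%}", input_looks_sane(1130, 1000))
```

A check like this would not explain why the counts differ, but it would flag the discrepancy before any metrics are published.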
I have corrected the scripts in the metrics repository. Should you be eager enough to run them, they will already generate more accurate metrics. In the more likely scenario that you would expect us to provide the corrected numbers, I ask you to please wait until our next contribution metrics blog post in December. We will then have fresh content on the upcoming releases.