Machine Learning straight through SQL

Machine learning is one area that cannot succeed without data. Traditionally, machine learning frameworks read it from CSV files or similar data sources. This brings an interesting set of challenges because in most cases the data is stored in databases, not simple raw files. It takes time and effort to move data from one format to another. Additionally, one needs to write some code (usually python) to prepare the data just like the ML framework expects it.

I was pleasantly surprised when I saw during the MariaDB Server Fest that MindsDB, an automatic machine learning system, presented their integration with MariaDB. I decided to give it a try and as I do have a background in Artificial Intelligence, I figured I’d use one of the challenges on kaggle.com to test it out. There is a challenge I did a long time ago for predicting the number of rental bike rides given weather conditions such as temperature, wind and the time of day.

My goal was to see if I can get a decent result (significantly better than the baseline) with MindsDB and MariaDB, without actively touching Machine Learning frameworks or doing tricks, such as feature engineering, data augmentation, etc.
Note, this post is not a simple step-by-step tutorial. Instead I also dig a bit deeper to explain how things work.
With all this out of the way let’s proceed with the experiment!

Initial setup

There are a few pieces required to get the system to work. I will try to explain the commands I ran, but they may differ slightly depending on your Linux distribution or if you are using a different Operating System, such as Windows or OS X.

First of all, I installed MariaDB. I happen to be using a development version of MariaDB (10.6), but this should not affect the results in any way. There are plenty of tutorials on how to install MariaDB. The simplest method for me is via the repository config tool.

MindsDB docs suggest to use the root user within MariaDB for easy setup, but I wanted to see what rights it actually needs. So I created a user mindsdb@localhost, with a password. To get it to work, I granted the global FILE privilege and all rights on the mindsdb database.

CREATE USER mindsdb@localhost;
SET PASSWORD for mindsdb@localhost=PASSWORD("password");
GRANT FILE on *.* to mindsdb@localhost;
GRANT ALL on mindsdb.* to mindsdb@localhost;

Second, I installed MindsDB. As it’s written in Python and I want to keep my system clean, I prefer to use virtual environments. Thus I created a virtual environment and installed MindsDB with pip.

mkdir ~/mindsdb/ && cd ~/mindsdb/ && virtualenv venv -p python3
pip install mindsdb

MindsDB uses a lot of external modules so this took some time. Once installation is done, we’re almost ready to get going. One final step is to configure MindsDB to talk to the right MariaDB Server.

MindsDB takes in a JSON configuration file. This is the one I used:

{
    "api": {
        "http": {
            "host": "0.0.0.0",
            "port": "47334"
        },
        "mysql": {
            "host": "127.0.0.1",
            "log": {
                "console_level": "INFO",
                "file": "mysql.log",
                "file_level": "INFO",
                "folder": "logs/",
                "format": "%(asctime)s - %(levelname)s - %(message)s"
            },
            "port": "47335",
            "user": "root"
            "password": "password",
        }
    },
    "config_version": "1.2",
    "debug": true,
    "integrations": {
        "default_mariadb": {
            "enabled": true,
            "host": "localhost",
            "password": "root",
            "port": 3306,
            "type": "mariadb",
            "user": "root"
        }
    },
    "interface": {
        "datastore": {
            "enabled": true
        },
        "mindsdb_native": {
            "enabled": true
        }
    },
    "storage_dir": "/home/vicentiu/Workspace/MindsDB/datastore"
}

With this configuration file in place, one needs to start MindsDB. Using the virtualenv:

source venv/bin/activate
python -m mindsdb --config=./config.json --api=http,mysql

The way MindsDB works is rather interesting. On one hand, it connects to a MariaDB Server to create some special tables, called AI Tables. The connection details are specified in the “integrations” part of the config. These AI Tables are actually tables using the CONNECT engine to connect to an outside database. On the other hand the outside database is actually MindsDB itself! Let’s look at these AI Tables:

MariaDB> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mindsdb            |
| mysql              |
| performance_schema |
| test               |
+--------------------+
MariaDB> USE mindsdb;
MariaDB> SHOW CREATE TABLE commands\G
CREATE TABLE `commands` (
  `command` varchar(500) DEFAULT NULL
) ENGINE=CONNECT DEFAULT CHARSET=latin1
CONNECTION='mysql://root_default_mariadb:password@127.0.0.1:47335/mindsdb/commands'
`TABLE_TYPE`=MYSQL

Notice the commands table: it’s connecting to MindsDB on the port specified in the config file: 47335. In this case, MindsDB is pretending to be a MySQL/MariaDB Server! That’s how MariaDB is able to get information back from MindsDB.

The data

Now that we have the system running, time to put it to the test. The Kaggle challenge I was referencing has the training data as a CSV file. I created a table for the data and I filled it using LOAD DATA INFILE.

MariaDB> USE test;
MariaDB> CREATE TABLE test.`bike_data` (
  `datetime` datetime DEFAULT NULL,
  `season` int(11) DEFAULT NULL,
  `holiday` int(11) DEFAULT NULL,
  `workingday` int(11) DEFAULT NULL,
  `weather` int(11) DEFAULT NULL,
  `temp` double DEFAULT NULL,
  `atemp` double DEFAULT NULL,
  `humidity` double DEFAULT NULL,
  `windspeed` double DEFAULT NULL,
  `casual` int(11) DEFAULT NULL,
  `count` int(11) DEFAULT NULL);

MariaDB> LOAD DATA INFILE '/home/vicentiu/Downloads/train.csv'
         INTO TABLE bike_data columns terminated by ',';

Note: LOAD DATA INFILE tries to insert the first line, which is the table header. I deleted that from the file before loading.

Training the model

Finally, we can get MindsDB to start training. To start training, a single INSERT command into an AI Table does the trick:

INSERT INTO `predictors`
       (`name`, `predict`, `select_data_query`)
VALUES ('bikes_model', 'count', 'SELECT * FROM test.bike_data');

What this does is it tells MindsDB to create a prediction model named “bikes_model” and to predict the “counts” column. The data used to create the model is fetched via the SELECT query SELECT * FROM test.bike_data. 30 minutes later, I have the model trained.

When a model is trained, MindsDB creates another table in the MariaDB’s mindsdb database. This table has the same name as the model (bikes_model) and again, is set-up as a CONNECT table.

Inspecting the model

Remember the --api=http for starting MindsDB? That is for starting up MindsDB Scout, a graphical web tool for inspecting the model’s performance. It’s accessible over the port specified in the config file (http://0.0.0.0:47334)

A confusion matrix gives some insight into where the model is strong and where it is weak.

We can see the model is good at predicting small number of rides, but weaker with larger numbers.

Using the model

The final step is actually doing predictions. To run a prediction, one must select from the table matching the model previously trained.

MariaDB> SELECT count, count_confidence
         FROM mindsdb.bikes_model
         WHERE datetime='2011-01-20 00:00:00' AND
               season='1' AND
               holiday='0' AND
               workingday='1' AND
               weather='1' AND
               temp='10.66' AND
               atemp='11.365' AND
               humidity='56' AND
               windspeed='26.0027';
+-------+------------------+
| count | count_confidence |
+-------+------------------+
| 33    |           0.9675 |
+-------+------------------+

The parameters used by MindsDB to predict values are in the WHERE clause. I tried different ways of passing parameters, but currently it only works with fixed values. Passing in a non-deterministic function (such as RAND) or a separate column will not work. However simple arithmetic and constant functions do work as MariaDB’s optimizer substitutes the arithmetic to constant values before passing the whole query onward to the CONNECT Engine. (To see this in action you can use EXPLAIN FORMAT=JSON on the select query).

Final scoring

To generate all the predictions for the test set from Kaggle.com, I wrote a short python script to generate all the necessary selects (one for each test case). A little bit of text editing magic afterwards to format the output as was required for Kaggle and I generate a submission.

The final score placed MindsDB around halfway through the competition scores. The baseline was a score of approximately 1.5, where a lower number is better. My best take was 0.5 with a lot of effort, while MindsDB performed decently well with a score of 0.9. I’m sure that with a bit more work, perhaps some feature augmentation and training different models for different seasons, I could help MindsDB’s AutoML functionality get closer to my best score.

With that said, I did not expect MindsDB to beat my score, but to provide a “good enough” result, with no data science tricks. For that I can consider the experiment a success!

Some future ideas after trying to do AI from within MariaDB:

Configurable location for where the MindsDB tables are created. Not using a hard coded database named “mindsdb” would be nice.
More feedback about the training process and outcome exposed via SQL would be useful. For example MindsDB outputs the progress on the command line such as epochs trained, model architecture, etc. Having this exposed (and any other relevant details) through an AI Table can help users debug and fine tune their data.
Potentially a hard one: Train models and keep them up-to-date with streaming data (for example by reading the replication logs)

Having Machine Learning available straight from MariaDB can be a valuable tool for any application developer seeking to use MariaDB as their database of choice. There is a little bit of work to get all the pieces moving, but you only have to do it once. Afterwards, any ORM framework or database communication libraries should work without a problem, as it’s only about running SQL queries behind the scenes. Looking forward to hearing more about future development in this area!