Crunching XML files with MariaDB

In January I wrote my first post about node.js and MariaDB . In February I continued with a second post about using jQuery and some GIS calculations. Now it is time for the third and this time the main focus is not so much on GIS functionality, but instead on the capabilities MariaDB has for handling piles of unstructured data. In this case I’ll be focusing on crunching a pile of XML files without importing the XML data itself.

MariaDB 10.0, which had its GA launch in early April, includes the Connect engine. It has been developed to dynamically access all kinds of data sources, from other databases over ODBC connections to web log files. The Connect engine establishes a connection to the data source and exposes the data source to MariaDB as a table. Let’s get started!

To begin with, check that you do not have the Connect engine installed, which you shouldn’t if you haven’t played with it before. It’s a separate install. To verify if it’s installed (or not) run the SHOW ENGINES command. The Connect engine is simply called CONNECT in the output. I’m running this demo on Ubuntu 13.10 so the following instructions focus on that platform, but I’ll indicate how it’s done on other platforms as well.

Make sure you’ve configured APT to use the MariaDB repository. Instructions for that are found in the MariaDB repository configuration tool part of downloads.mariadb.org. Once done with adding the repository the Connect engine can be installed with a normal apt-get command. For example:

sudo apt-get install mariadb-connect-engine-10.0
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libpango1.0-0 libpangox-1.0-0 libpangoxft-1.0-0
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  autotools-dev libltdl-dev libltdl7 libodbc1 libtool libxml2-dev odbcinst
  odbcinst1debian2 unixodbc unixodbc-dev
Suggested packages:
  libtool-doc libmyodbc odbc-postgresql tdsodbc unixodbc-bin autoconf automaken
  gfortran fortran95-compiler gcj pkg-config
The following NEW packages will be installed:
  autotools-dev libltdl-dev libltdl7 libodbc1 libtool libxml2-dev
  mariadb-connect-engine-10.0 odbcinst odbcinst1debian2 unixodbc unixodbc-dev
0 upgraded, 11 newly installed, 0 to remove and 7 not upgraded.
Need to get 2,472 kB of archives.
After this operation, 9,807 kB of additional disk space will be used.
Do you want to continue [Y/n]?

Notice that on RPM platforms you would do something similar with yum, i.e. yum install MariaDB-connect-engine .

Next, tell MariaDB to enable the Connect engine:

INSTALL SONAME 'ha_connect';

Now we’re ready with all prerequisites to actually start implementing the solution. Since I’m continuing on the development demo I built earlier, which is about showing running tracks on Google Maps I’m going to continue with that. Also in this case I will use GPX files. GPX files are stored by many GPS devices including running watches and smartphones. GPX itself is a data representation format with a specific XML schema:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="runtastic - makes sports funtastic, http://www.runtastic.com"
  xsi:schemaLocation="http://www.topografix.com/GPX/1/1
    http://www.topografix.com/GPX/1/1/gpx.xsd
    http://www.garmin.com/xmlschemas/GpxExtensions/v3
    http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd
    http://www.garmin.com/xmlschemas/TrackPointExtension/v1
    http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd"
  xmlns="http://www.topografix.com/GPX/1/1"
  xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
  xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <metadata>
  <desc>Tillbaka för att plocka upp bilen från service</desc>
  <copyright author="www.runtastic.com">
  <year>2014</year>
  <license>http://www.runtastic.com</license>
  </copyright>
  <link href="http://www.runtastic.com">
  <text>runtastic</text>
  </link>
  <time>2013-09-23T13:30:54.000Z</time>
  </metadata>
  <trk>
  <link href="http://www.runtastic.com/sport-sessions/123539841">
  <text>Visit this link to view this activity on runtastic.com</text>
  </link>
  <trkseg>
  <trkpt lon="24.8140869140625000" lat="60.1844253540038991">
  <ele>1.84204363822937</ele>
  <time>2013-09-23T13:30:54.000Z</time>
  </trkpt>
  <trkpt lon="24.8141384124756001" lat="60.1844062805175994">
  <ele>1.72201466560364</ele>
  <time>2013-09-23T13:30:56.000Z</time>
  </trkpt>
...

For the demo I’ll use a directory /home/rasmus/nodegpxmariadb/public/uploads/ which I’ve preloaded with a set of GPX files. Now starts the interesting part. I want to make the data in these XML formatted files available to MariaDB. It’s pretty straight forward by creating a new table with the Connect engine:

CREATE TABLE GPXSource (
  filename VARCHAR(100) NOT NULL special=FILEID,
  lon double(20,16) NOT NULL field_format='@',
  lat double(20,16) NOT NULL field_format='@',
  ele double(20,16) NOT NULL field_format="*[local-name()='ele']",
  time datetime date_format="YYYY-MM-DD 'T' hh:mm:ss '.000Z'"
  field_format="*[local-name()='time']"
) ENGINE=CONNECT table_type=XML
  file_name="/home/rasmus/nodegpxmariadb/public/uploads/runtastic_*.gpx"
  tabname="//*[local-name()='trkseg']" multiple=1;

In the CREATE TABLE statement you can see that Connect is specified as the engine, the table_type option is set to XML, multiple is set to 1 to allow for multiple files and file_name includes the path and the pattern of the XML files that I want to make accessible. In the tabname option the XML node under which the data resides is given. Data in this node will be represented as table rows. In this case the node is called trkseg. This node includes all the trkpt -nodes that we’re interested in. A single trkpt node represents a track point stored at certain intervals while running. One track point includes longitude, latitude, elevation and time. By combining these trackpoints you’ll have a representation of the running track. That is what I’m going to do.

In the CREATE TABLE statement there is one more thing to notice. The option tabname and the table field ele both include namespace directives to give to the XML parser so that it knows how to get hold of the right node. This was a bit tricky, but it has to do with the use of the namespaces of GPX and the XML parser that is used. The Connect engine uses libxml2 on Linux.

With the table GPXSource now in place let’s try to do a SELECT over it:

MariaDB [running]> SELECT filename, COUNT(*) FROM GPXSource GROUP BY filename ORDER BY filename;
+--------------------------------------------------------------------------------+----------+
| filename                                                                       | COUNT(*) |
+--------------------------------------------------------------------------------+----------+
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140325_1154_Running.gpx |     3383 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140408_1133_Running.gpx |     1043 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140411_1341_Running.gpx |     2977 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140417_1140_Running.gpx |     2441 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140423_1150_Running.gpx |     2710 |
+--------------------------------------------------------------------------------+----------+
5 rows in set (0.12 sec)

Voila! We can see that it works and I get out the number of trackpoints per GPX file.

Let’s try another SELECT that I will make use of in the actual application. We want to get the duration of each of the runs by calculating the difference between first and last trackpoint for each GPX file:

SELECT filename, TIMEDIFF(MAX(time), MIN(time)) AS duration  FROM GPXSource GROUP BY filename ORDER BY filename;
+--------------------------------------------------------------------------------+----------+
| filename                                                                       | duration |
+--------------------------------------------------------------------------------+----------+
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20130923_1631_Running.gpx | 00:20:44 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20130925_1008_Running.gpx | 00:22:39 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140325_1154_Running.gpx | 02:14:15 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140408_1133_Running.gpx | 00:44:32 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140411_1341_Running.gpx | 01:59:09 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140417_1140_Running.gpx | 01:37:15 |
| /home/rasmus/nodegpxmariadb/public/uploads/runtastic_20140423_1150_Running.gpx | 01:56:12 |
+--------------------------------------------------------------------------------+----------+
7 rows in set (0.26 sec)

Now I have all the data I need to be able to create my sample application. In this blog post I won’t dive into the Node.js application code part, but I have made all source code available in the same way as before on Github, https://github.com/rasmushoj/nodejs-gpx-mariadb . Hopefully you find it useful. The end result however accessing the XML files through MariaDB without importing the data looks like this: