The MariaDB project is pleased to announce a special preview release of MariaDB 10.0.9 with significant performance gains on FusionIO devices. This is is a beta-quality preview release.
The latest work between MariaDB and FusionIO has focused on dramatically improving performance of MariaDB on the high-end SSD drives produced by Fusion-IO and at the same time delivering much better endurance for the drives themselves. Furthermore, FusionIO flash memory solutions increase transactional database performance. MariaDB includes specialized improvements for FusionIO devices, leveraging a feature of the NVMFS filesystem on these popular, high performance solid state disks. Using this feature, MariaDB 10 can eliminate some of the overhead within the InnoDB storage engine when used with FusionIO devices.
In below Figure 1 shows a legacy architecture of SSDs on the left and the FusionIO architecture on the right.
When Innodb writes to the filesystem, there is generally no guarantee that a given write operation will be complete (not partial) in cases of a poweroff event, or if the operating system crashes at the exact moment a write is being done.
Without detection or prevention of partial writes, the integrity of the database can be compromised after recovery. Therefore, Innodb has had a mechanism to detect and ignore partial writes via the InnoDB Doublewrite Buffer (also innodb_checksum can be used to detect a partial write).
Doublewrites, controlled by the innodb_doublewrite system variable, comes with its own set of problems. Especially on SSD, writing each page twice can have detrimental effects (write leveling). Also the endurance of SSD is at stake since there is a maximum of writes it will be able to handle before it needs to be replaced. By writing twice the expected lifetime is cut by half.
Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write (causing a torn page condition), InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.
A better solution is to directly ask the filesystem to provide an atomic (all or nothing) write guarantee. Currently this is only available on the NVMFS filesystem on FusionIO devices that provide atomic write functionality. This functionality is supported by MariaDB’s XtraDB and InnoDB storage engines. To use atomic writes instead of the doublewrite buffer, add:
innodb_use_atomic_writes = 1
to the my.cnf config file.
For more information about this feature see https://mariadb.com/kb/en/fusionio-directfs-atomic-write-support/
InnoDB compressed tables
By using the InnoDB table options for compression, you can create tables where the data is stored in compressed form. Compression can help to improve both raw performance and scalability. The compression means less data is transferred between disk and memory, and takes up less space on disk and in memory. The benefits are increased for tables with secondary indexes, because index data is compressed also. Compression is important for SSD storage devices, because….
InnoDB stores uncompressed data in 16K pages and these 16K pages are compressed into a fixed compressed page size of 1K, 2K, 4K, 8K. This compressed page size is chosen at table creation time using KEY_BLOCK_SIZE parameter. Compression is performed using regular software compression libraries (zlib).
Because pages are frequently updated, B-tree pages require special treatment. It is essential to minimize the number of times B-tree nodes are split, as well as to minimize the need to uncompress and recompress their content. Therefore, InnoDB maintains some system information in the B-tree node in uncompressed form, thus facilitating certain in-place updates. For example, this allows rows to be delete-marked and deleted without any compression operation.
Furhermore, InnoDB attempts to avoid unnecessary uncompression and recompression of index pages when they are changed. Within each B-tree page, the system keeps an uncompressed “modification log” to record changes made to the page. Updates and inserts of small records may be written to this modification log without requiring the entire page to be completely reconstructed.
When the space for the modification log runs out, InnoDB uncompresses the page, applies the changes and recompresses the page. If recompression fails, the B-tree nodes are split and the process is repeated until the update or insert succeeds.
To avoid frequent compression failures in write-intensive workloads, such as for OLTP applications, InnoDB reserves some empty space (padding) in the page, so that the modification log fills up sooner and the page is recompressed while there is still enough room to avoid splitting it. The amount of padding space left in each page varies as the system keeps track of the frequency of page splits.
However, all these have clear drawbacks:
- Space: Both uncompressed and compressed pages stored in buffer pool
- Access: Updates are applied to both copies in memory
- CPU consumption
- Software compression libraries (decompress on read from disk, recompress on split)
- Split & Recompress & Rebalance when mlog overflows
- Capacity benefit
- Fixed compression page size – sets a bound on compression benefit
- Modification log and padding takes space decreasing the benefits
- Poor adoption
- Code is very complex and performance decrease compared to uncompressed tables significant
Solution: Page compression
Instead of storing both compressed and uncompressed pages on buffer pool store only a uncompressed 16KB pages in buffer pool. This avoids very complex logic on when page needs to be recompressed or when to add a change to mlog. Similarly, there is no need to do page splits etc. Before creating a page compressed table, make sure the
innodb_file_per_table configuration option is enabled, and
innodb_file_format is set to
This work was done in co-operation with FusionIO, especially with:
When a page is modified, it is compressed just before it is written (fil layer) and only a compressed size (aligned to sector boundary) is written. If compression fails because of compression failure we write uncompressed page to the file space. Then we trim unused 512B sectors in compressed page by:
fallocate(file, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, trim_len);
This leads to a situation where NVMFS file system reports that less space is used on media. If this fallocate call fails an error is reported on an error log and trim is not anymore used and server can continue normally.
When a page is read, it is decompressed before it is on buffer pool. All this is implemented by using a new page type FIL_PAGE_PAGE_COMPRESSED. Every page has a 38-byte FIL header (FIL is a shortened form of “file”). The header contains a field which is used to indicate the page type, which determines the structure of the rest of the page. The structure of the new FIL header is:
More detailed description on items at Figure 2:
- FIL_SPAGE_SPACE_OR_CHKSUM: Currently no checksums are calculated from compressed page because the original calculated checksum is part of the compressed data. Therefore, this field is initialized to value BUF_NO_CHECKSUM_MAGIC.
- FIL_PAGE_OFFSET: The page number.
- FIL_PAGE_PREV and FIL_PAGE_NEXT: Pointers to the logical previous and next page for this page type.
- FIL_PAGE_LSN: The 64-bit log sequence number (LSN) of the last modification of the page.
- FIL_PAGE_TYPE: FIL_PAGE_PAGE_COMPRESSED = 34354 is stored.
- FIL_PAGE_FILE_FLUSH_LSN: Used compression algorithm for this page is stored here. Currently values FIL_PAGE_COMPRESSION_ZLIB = 1 or FIL_PAGE_COMPRESSION_LZ4 = 2 are supported.
- FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID: Space id of the file is stored here.
- FIL_PAGE_DATA: Actual compressed size of the original page is stored starting from here using 2 bytes.
- Finally, original page in compressed format is stored to position FIL_PAGE_DATA + FIL_PAGE_COMPRESSED_SIZE = 38+2=40.
Note that FIL trailer is not stored (it is stored on original compressed page).
This new page type allows pages of the table to be uncompressed (if compression fails), compressed with ZLIB and compressed with LZ4 in the same file space. For example page 5 could be uncompressed, page 45 compressed with ZLIB and page 60 compressed with LZ4. However, current implementation allows only uncompressed and compressed pages inside a single file space. LZ4 is not compiled by default because many distributions do not have this library by default. Therefore, if you want to use LZ4 you need to download necessary source code from http://code.google.com/p/lz4/, compile and install it. After that you need to compile MariaDB from source distribution.
In MariaDB we have implemented page compression so that page compression, atomic writes and used compression level can be configured if needed per table basis. New create table options are implemented using an engine-defined table attributes (see https://mariadb.com/kb/en/engine-defined-new-tablefieldindex-attributes/). This avoids adding unnecessary extensions to SQL-syntax.
- CREATE TABLE T0 (A INTEGER) ENGINE=InnoDB PAGE_COMPRESSED=1;
CREATE TABLE A(B INTEGER) ENGINE=InnoDB PAGE_COMPRESSED=1 PAGE_COMPRESSION_LEVEL=6
CREATE TABLE B(C INTEGER) ENGINE=InnoDB ATOMIC_WRITES=ON;
CREATE TABLE t3 (a int KEY, b int) DATA DIRECTORY=’/dev/fioa’ PAGE_COMPRESSED=1 PAGE_COMPRESSION_LEVEL=4 ATOMIC_WRITES=’ON’;
Now that we had a new page type and underlaying storage system could provide us a very fast throughput, we noted that InnoDB page flushing does not scale well. This is because InnoDB page flushing is single threaded.
Solution: Multi-threaded flush
To achieve better throughput and to decrease the operation latencies we have developed a multi-threaded flush method inside InnoDB. This new feature can be enabled by configuration parameter innodb_use_mtflush=1 and number of threads used can be configured using innodb_mtflush_threads=.
This new feature is implemented as traditional producer multiple consumers concept like:
- Work tasks are inserted into the work-queue (wq)
- Completions are based on the type of operations performed and as a result the WRITE
compression/flush operation completions get posted to return queue wr_cq.
Actual producer (single threaded) works as follows:
loop: sleep so that one iteration takes roughly one second flush LRU: for each buffer pool instance send a work item to flush LRU scan depth pages in chunks send work items to multi-threaded flush work threads wait until we have received reply for all work items flush flush list: calculate target number of pages to flush for each instance set up a work item to flush (target number / # of instances) number pages send work items to multi-threaded flush work thread wait until we have received reply for all items
And consumers as follows:
loop (until not shutdown): wait until a work item is received from work queue if work_item->type is EXIT insert a reply message to return queue pthread_exit(); if work_item->type is WRITE call buf_mtflu_flush_pool_instance() for this work_item when we reach to os layer (os0file.cc) we compress the page if table uses page compression (fil0pagecompress.cc) set up reply message containin number of flushed pages insert a reply message to return queue
This means that we can compress/decompress pages in parallel.
Firstly, we have used LinkBench  which is based on traces from production databases that store “social graph” data at Facebook, a major social network. LinkBench provides a realistic and challenging test for persistent storage of social and web service data. We have used following configuration for page compression tests:
[mysqld] innodb_buffer_pool_size = 50G innodb_use_native_aio = 1 innodb_use_mtflush = 1 innodb_file_per_table = 1 innodb_doublewrite = 0 innodb_use_fallocate = 1 innodb_use_atomic_writes = 1 innodb_use_trim = 1 innodb_buffer_pool_instances = 16 innodb_mtflush_threads = 16 innodb_use_lz4=1 innodb_flush_method = O_DIRECT innodb_thread_concurrencty = 32 innodb_write_io_threads = 32 innodb_read_io_threads = 32 innodb_file_format=barracuda innodb_lru_scan_depth=2000 innodb_io_capacity=30000 innodb_io_capacity_max=35000
For row-compressed and uncompressed we have used following setup:
innodb_buffer_pool_size = 50G innodb_use_native_aio = 1 innodb_use_mtflush = 0 innodb_file_per_table = 1 innodb_doublewrite = 1 innodb_use_fallocate = 1 innodb_use_atomic_writes = 0 innodb_use_trim = 0 innodb_buffer_pool_instances = 16 innodb_mtflush_threads = 0 innodb_use_lz4=0 innodb_flush_method = O_DIRECT innodb_thread_concurrencty = 32 innodb_write_io_threads = 32 innodb_read_io_threads = 32 innodb_file_format=barracuda innodb_lru_scan_depth=2000 innodb_io_capacity=30000 innodb_io_capacity_max=35000
Thus only difference on configuration is that for row-compressed (i.e. ROW_FORMAT=COMPRESSED) and uncompressed tables do not use multi-threaded flush, they use doublewrite buffer and they do not use atomic writes. We will use InnoDB storage engine (5.6.15) on following tests. We start of showing the storage usage of 10x LinkBench database (~100G) in Figure 3.
Figure 3: Storage usage.
Linkbench database is populated as follows:
./bin/linkbench -D dbid=linkdb -D ohst=127.0.0.1 -D user=root -D password= -D maxid1=100000001 -c config/MyConfig.properties -l
And LinkBench measure phase is run with following setup:
nohup ./bin/linkbench -D dbid=linkdb -D ohst=127.0.0.1 -D user=root -D password= -D maxid1=100000001 -c config/MyConfig.properties -cvstats stats.cvs -cvsvtream stream.cvs -D requests=50000000 -D maxtime=21600 -r &
These results are shown in Figure 4.
Figure 4: LinkBench benchmark results.
Secondly, we have used TPC-C  like benchmark. Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity either executed on-line or queued for deferred execution. The database is comprised of nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC). In below Figure 5 MySQL compression actually means InnoDB compressed tables (i.e. ROW_FORMAT=COMPRESSED).
This run with following TPC-C setup:
tpcc_start -h localhost -P3306 -d tpcc1000 -r root -p "" -w 1000 -c 32 -r 30 -l 3600
Figure 5. TPC-C results.
MariaDB include high-end SSD optimizations especially targeting Fusion-IO devices that increase performance, gives the devices a longer lifetime and improves compression ratio:
- With atomic writes there is an out-of-box performance increase of about 30% and by enabling fast checksum for XtraDB the increase is 50% 
- The introduction of page compression solves the complex logic recompressions and page splits familiar in MySQL InnoDB’s row compression. By using page compression the compression ratio is leading to better performance and there are less writes to disk.
- Multi-threaded flush provides better throughput and decreases operation latencies delivering a performance boost
By enabling these features on top of a Fusion-IO device one will get a better compression ratio compared to MySQL InnoDB’s own compression, less writes to disk and a significant performance boost compared to MySQL InnoDB compression.
 Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1185-1196. DOI=10.1145/2463676.2465296 http://doi.acm.org/10.1145/2463676.2465296.
 MairaDB Introduces Atomic Writes, https://blog.mariadb.org/mariadb-introduces-atomic-writes/