MariaDB Introduces Atomic Writes
When dealing with high-performance, low-latency storage devices such as SSD cards, one finds bottlenecks in new places. This is a story about one such bottleneck and how to work around it.
One unique feature of InnoDB is the double write buffer. This buffer was implemented to recover from half-written pages. A half-written page can result from a power failure while InnoDB is writing a page (16KB = 32 sectors) to disk. On reading that page, InnoDB would be able to discover the corruption from the mismatch of the page checksum. However, in order to recover, an intact copy of the page would be needed.
The double write buffer provides such a copy. Whenever InnoDB flushes a page to disk, it is first written to the double write buffer. Only when the buffer is safely flushed to disk does InnoDB write the page to its final destination. When recovering, InnoDB scans the double write buffer and, for each valid page in the buffer, checks whether the page in the data file is valid too. If it is not, the page is restored from the copy in the buffer.
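The mechanism can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not InnoDB's actual code: the page layout (payload plus a CRC32 trailer), the in-memory dictionaries standing in for the buffer and the data file, and all function names are hypothetical and chosen for illustration.

```python
import zlib

PAGE_SIZE = 16 * 1024   # page size from the text: 16KB
SECTOR_SIZE = 512       # 16KB = 32 sectors
CHECKSUM_LEN = 4        # assumed layout: CRC32 trailer at the end of the page


def make_page(payload: bytes) -> bytes:
    """Pad the payload to fill the page and append a CRC32 of the body."""
    body = payload.ljust(PAGE_SIZE - CHECKSUM_LEN, b"\0")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_LEN, "big")


def is_valid(page: bytes) -> bool:
    """A page is valid if it has full length and its stored checksum matches."""
    if len(page) != PAGE_SIZE:
        return False
    body, stored = page[:-CHECKSUM_LEN], page[-CHECKSUM_LEN:]
    return zlib.crc32(body).to_bytes(CHECKSUM_LEN, "big") == stored


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a power failure: only the first sectors of the new page land."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


def flush_page(doublewrite, datafile, page_no, page, crash_after_sectors=None):
    """Write to the double write buffer first, then to the final destination."""
    doublewrite[page_no] = page  # assumed to be safely on disk before step 2
    if crash_after_sectors is None:
        datafile[page_no] = page
    else:
        datafile[page_no] = torn_write(datafile[page_no], page, crash_after_sectors)


def recover(doublewrite, datafile):
    """Restore every data file page that is broken but has a valid buffer copy."""
    restored = []
    for page_no, copy in doublewrite.items():
        if is_valid(copy) and not is_valid(datafile.get(page_no, b"")):
            datafile[page_no] = copy
            restored.append(page_no)
    return restored


if __name__ == "__main__":
    datafile = {7: make_page(b"old row data")}
    doublewrite = {}
    new_page = make_page(b"new row data")
    flush_page(doublewrite, datafile, 7, new_page, crash_after_sectors=5)
    print("page valid after crash:", is_valid(datafile[7]))
    print("restored pages:", recover(doublewrite, datafile))
    print("page valid after recovery:", is_valid(datafile[7]))
```

The sketch also makes the cost visible: every flushed page is written twice and checksummed, which is exactly the overhead discussed next.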
Both the checksum calculation and the double writing consume time and thus reduce the performance of page flushing. The effect becomes visible only with fast storage and heavy write load. For testing it is possible to disable both features. However, I strongly discourage doing this in a production environment.
Currently, in order to use atomic writes, it is necessary to use the DirectFS file system, which is a part of the Fusion IO SDK. Wlad Vaintroub from Monty Program AB, in cooperation with FusionIO developers, implemented the necessary changes in InnoDB/XtraDB to use the new feature. If all tablespaces reside on DirectFS/FusionIO and thus support atomic writes, the new variable innodb_use_atomic_writes=1 will switch to using atomic writes instead of the double write buffer. The patch is already in MariaDB-10.0.2 and will also be included in MariaDB 5.5.31. The user documentation of the feature is available in this Knowledge Base article.
Now for numbers! First tests have shown that atomic writes are of little to no benefit for small data sets (where the data set fits into RAM). However, with fast SSDs one can now afford to have a much bigger data set (compared to RAM size) because reads and writes are much faster.
The following numbers are for the Sysbench OLTP benchmark, using a data set of 100GB (400 million rows in 16 tables) but only 16GB of RAM in the InnoDB buffer pool. Performance was of course lower than with the usual 10GB data set. Numbers:
Metric | small data set (10G) | big data set (100G) |
---|---|---|
max ro tps | 8000 | 3000 |
max rw tps | 6000 | 1800 |
For the big data set the following configurations have been compared:
- double write, InnoDB page checksum
- atomic write, InnoDB page checksum
- atomic write, XtraDB fast page checksum
- atomic write, no page checksum
Threads | double write | atomic write | gain | atomic write + fast checksum | gain | atomic write + no checksum | gain |
---|---|---|---|---|---|---|---|
1 | 159.81 | 164.34 | +2.8% | 179.68 | +12.4% | 184.72 | +15.6% |
2 | 316.65 | 343.21 | +8.4% | 378.61 | +19.6% | 391.72 | +23.7% |
4 | 544.05 | 635.26 | +16.8% | 699.6 | +28.6% | 726.55 | +33.5% |
8 | 830.37 | 1062.8 | +28.0% | 1176.8 | +41.7% | 1214.7 | +46.3% |
16 | 1054.7 | 1421.2 | +34.7% | 1570.7 | +48.9% | 1610.1 | +52.7% |
32 | 1208.3 | 1615.1 | +33.7% | 1736.6 | +43.7% | 1767.4 | +46.3% |
64 | 1286.9 | 1673.2 | +30.0% | 1793.2 | +39.3% | 1833.7 | +42.5% |
128 | 1266.2 | 1653.1 | +30.6% | 1824.5 | +44.1% | 1875.6 | +48.1% |
256 | 1139.4 | 1505 | +32.1% | 1586.3 | +39.2% | 1618.3 | +42.0% |
Conclusions:
- enabling atomic writes yields roughly 30% better write performance out of the box (at 16 or more threads)
- using the fast checksum from XtraDB boosts this to between 40% and nearly 50%
- disabling checksums completely adds only a few percentage points on top of that
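The percentages in the table are gains relative to the double write baseline at the same thread count. As a quick check, they can be recomputed from the raw throughput values; the snippet below is a minimal sketch that does so for two rows copied from the table (the function name `gain_percent` is just an illustrative choice).

```python
def gain_percent(baseline: float, value: float) -> float:
    """Relative gain of value over baseline, in percent, rounded to one decimal."""
    return round((value / baseline - 1.0) * 100.0, 1)


# threads: (double write, atomic write, atomic + fast checksum, atomic + no checksum)
# Values copied from the 16-thread and 64-thread rows of the table above.
results = {
    16: (1054.7, 1421.2, 1570.7, 1610.1),
    64: (1286.9, 1673.2, 1793.2, 1833.7),
}

if __name__ == "__main__":
    for threads, (double_write, atomic, fast, no_checksum) in results.items():
        print(
            threads,
            gain_percent(double_write, atomic),
            gain_percent(double_write, fast),
            gain_percent(double_write, no_checksum),
        )
```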
Benchmark details:
MariaDB-5.5.30 (lp trunk). Sysbench-0.5 (Lua enabled). Complex OLTP benchmark. 400 million rows in 16 tables. 16GB InnoDB buffer pool, 4G redo logs.
As always the benchmark scripts and results are available to anybody.
Do atomic writes reduce flash storage wear? Have you measured the effect on the total number of blocks written to the block device?
Yes, atomic writes do extend endurance by significantly reducing write amplification.
Will this ever work on non-FusionIO SSDs?
That depends on other manufacturers adopting the API. It’s a public API.
I’ve been having a hard time finding information about other file systems implementing atomic writes, so I’m curious whether this has changed much in the years since this article was posted.