Whole system SSD caching
using dm-cache and initramfs

Linux 3.9+

SSDs are coming; however, the cost of a solid-state bit is still very high. So switching to an SSD means either a decrease in available space (though that isn't such a problem today) or moving only part of the file system to the SSD. But another solution is to use the SSD as an additional cache for the whole file system, combining the speed of the SSD with the capacity of the HDD.

In this article we'll consider applying the dm-cache technology to SSD caching of the root file system. Caching of non-root partitions is relatively easy to set up (it doesn't require modification of the initramfs); see for example ssd-caching-using-dmcache-tutorial. Most instructions about SSD caching describe such setups and don't consider caching of the root partition, but "/" benefits from caching no less.

Now

Now, Linux offers two SSD caching technologies out of the box: bcache and dm-cache.

Both technologies are comparable in their characteristics but have significant differences in configuration. Bcache needs a partition to be prepared (formatted with a special superblock added), so it's hard to apply bcache to an existing file system, especially the root one, and it's practically impossible to connect bcache on the fly. Dm-cache is simpler to configure, though it requires creating a special metadata partition in addition to the cache partition itself.

Don't repeat this at home

Dm-cache is still (as of Linux 3.12) an EXPERIMENTAL feature of the kernel, so recompile yours right now to enable it if it's disabled! The whole configuration of dm-cache is performed by a single command:

# dmsetup create <cache_name> --table \
  '0 <n_of_blocks> cache /dev/<metadata_dev> /dev/<cache_dev> /dev/<orig_dev> 512 1 writeback default 0'

(The single quotes are required so that the table is passed as a single argument.) For example,

# dmsetup create sdb2c --table \
  '0 391168 cache /dev/sda4 /dev/sda3 /dev/sdb2 512 1 writeback default 0'

For details see the official documentation. I'll only note that n_of_blocks is not the byte size of the partition; it's the size of the origin device in 512-byte sectors, which can be found using blockdev --getsz /dev/XXX.
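For instance, for the origin device /dev/sdb2 from the example above:

# blockdev --getsz /dev/sdb2
391168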

Before executing this command we need the original partition orig_dev, formatted or not (actually it can be any block device), and two partitions on the SSD: one for the cache and one for the metadata.

The size of the metadata partition can be calculated by a formula that was undocumented at the time of this writing: 4 MiB + 16 bytes * n_of_blocks. However, I just created a partition half the size of my SSD device, thereby reserving free space for use by the SSD controller (that's advisable for better performance and a longer life of the flash memory).
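As a sketch of the arithmetic, assuming n_of_blocks in this formula means the number of cache blocks (device sectors divided by the 512-sector block size), the metadata size for the 391168-sector example above would be:

$ echo $(( 4*1024*1024 + 16 * (391168 / 512) ))
4206528

This partition should then be TRIMed by the command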

# fstrim -v /mnt/disk

Or by any other convenient method (hdparm? blkdiscard?), for example by a method which doesn't require creating a file system on the partition; however, I failed to find one quickly. Next, we should erase the superblock of the partition, because otherwise dm-cache will try to activate this partition instead of preparing it:

# umount /mnt/disk
# dd if=/dev/zero of=/dev/<metadata> bs=4k count=1

Now we can execute dmsetup create and see a new device cache_name in the /dev/mapper directory. This is our cached partition, which should be mounted instead of the original one.
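For example, with the device from the example above and a hypothetical mount point:

# mount /dev/mapper/sdb2c /mnt/cached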

Root

The problem is that all access to the root file system must always go through the cache, because the SSD cache is non-volatile, so so-called dirty blocks remain unflushed after a shutdown or reboot (unlike temporary buffers in RAM). Any bypassing of the cache will therefore result in incorrect operation. That's why we should activate the cache at the same time as the main file system. In this article we show how to do this using initramfs (btw, there are examples for mkinitcpio on the internet too).

Initramfs is a boot-time version of our OS, comprising the kernel and the modules necessary for booting. To be able to activate the cache early enough, we should add the dm_mod, dm_cache and dm_cache_mq modules to /etc/initramfs-tools/modules, one per line:
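dm_mod
dm_cache
dm_cache_mq

Next, we need to add the cache activation commands as a script in the /etc/initramfs-tools/scripts/init-premount/ directory. Like this: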

#!/bin/sh
PREREQ=""
case "$1" in
prereqs) echo "$PREREQ"; exit 0 ;;
esac
# activate the cache device before the root file system is mounted
dmsetup create sdb6c --table \
  '0 62498816 cache /dev/sda3 /dev/sda1 /dev/sdb6 512 1 writeback mq 0'

Scripts in init-premount are executed very early in the boot sequence (the prereqs stanza at the top is the standard initramfs-tools script header; don't forget to make the script executable with chmod +x). Now we're ready to build the initrd image:

# rm /boot/initrd.img-`uname -r`
# update-initramfs -c -k `uname -r`

This image will be placed in a folder like /boot and executes before the main system. update-initramfs may warn you about an inaccessible device, but the image will be built correctly. We also need to update fstab, substituting our target partition with the cache device /dev/mapper/<cache_name>.
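For example, a hypothetical fstab entry for the cached root from the script above (the file system type and mount options here are assumptions):

# hypothetical /etc/fstab entry; fs type and options are assumptions
/dev/mapper/sdb6c  /  ext4  errors=remount-ro  0  1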

To check that specific modules or scripts have made it into the initrd image, you can list its contents:

$ lsinitramfs /boot/initrd.img-`uname -r` | grep XXX

The nuance here is that since we are caching the root, we also need to reconfigure the boot manager. I didn't find a way to force update-grub to change the root device, so I just manually changed the root device to /dev/mapper/XXX in the grub console, booted, and then ran update-grub. This makes os-prober generate the right grub menus, but with two root partitions identical by UUID (the original and the cache). This isn't such a problem if you mount just one of them the old-school way, via /dev/mapper/XXX. You can also completely disable UUID identification by adding the GRUB_DISABLE_LINUX_UUID=true option to the grub settings.
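That is, a sketch of the relevant line in /etc/default/grub, after which the config should be regenerated with update-grub:

GRUB_DISABLE_LINUX_UUID=true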

Some debugging of the initramfs can be done in the busybox console included in the initrd image by default. I don't know how to get to this console in a normal way, but if you pass a deliberately nonexistent device as your root in grub, the system will automatically drop into busybox after the rootdelay. In any case, I strongly recommend keeping several initrd's on the machine so you don't end up with an unbootable system.

The last thing is dm-cache's shutdown sequence. Dm-cache is a statistics-based cache; the statistics are used to optimize and organize the cache. But if you don't call dmsetup suspend and dmsetup remove, the metadata will be left unclean, and at the next boot dm-cache will reset all the statistics counters. In the case of non-root partitions we can easily call these commands during system shutdown, but as far as I know, Debian unmounts root after all of the scriptable steps of the shutdown sequence. So it seems that we can't do a clean shutdown of dm-cache for the root device at all. A possible way to work around this problem could be to modify the initramfs at the place where it executes switch_root. But I myself just hope that, since I very rarely reboot or switch off my laptop, the statistics will be OK most of the time. It should be noted that the loss of statistics doesn't mean the loss of data or of the cache: the contents of the blocks will be preserved in any case.
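For a non-root cached device, the clean shutdown mentioned above is a sequence like this (assuming the device is called cached):

# umount /dev/mapper/cached
# dmsetup suspend cached
# dmsetup remove cached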

Disabling the cache

At some point you may need to disable the cache, to reorganize your disk system or to remove one of the disks. You can do it like this:

$ sudo umount /dev/mapper/cached
$ sudo dmsetup table cached
0 1048576000 cache 252:3 252:4 252:0 512 1 writeback default 0
$ sudo dmsetup suspend cached
$ sudo dmsetup reload cached \
  --table '0 1048576000 cache 252:3 252:4 252:0 512 0 cleaner 0'
$ sudo dmsetup resume cached
$ sudo dmsetup wait cached

This will switch the cache policy from default/mq to cleaner and thus force all dirty blocks to be flushed. In the case of root, of course, you can't just unmount it. But you can build an additional initramfs image that executes these commands before mounting the root. Or you can just use any Linux LiveCD with kernel 3.9+ and execute the commands manually. Finally, you can change your root device in the grub menu back to the original device; in this case the cache will fail to set up (it requires the target device to be unmounted) and you will get busybox with full freedom of action.
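To check that all dirty blocks have actually been flushed, you can inspect the cache status and wait for the dirty-block counter to drop to zero (see the kernel's dm-cache documentation for the exact field layout of the status line):

$ sudo dmsetup status cached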

By the way, kernels before 3.11 have a bug that reports a non-zero number of dirty blocks when you change the cache policy twice. In reality all data will be completely flushed.

Conclusion

As the developers write, the current architecture of dm-cache isn't perfect and could be improved through integration with the virtual memory manager. The author of bcache comments that dm-cache is not really a cache, because it caches fixed-size blocks, unlike bcache, which caches variable-size blocks. There's also the possibility of improving caching further by integrating it with the file system modules and using more information about the data being cached.

Anyway, although I haven't run any performance tests, I've read a lot of the mailing lists, and I'm inclined to optimism.

Links:

See also alternative technologies: flashcache and EnhanceIO.

 

shitpoet@gmail.com
