Using TRIM/DISCARD with Ceph RBD and libvirt

TRIM/DISCARD

Using TRIM/DISCARD you can give back free space to a Ceph cluster. Normally, any thin provisioned block device will keep on growing until its maximum size while being used. Using the DISCARD command a underlying block device can be instructed to discard blocks which do not contain data.

In the case of Ceph’s RBD we can shrink our RBD images again which gives us back free space in our Ceph cluster.

Libvirt

Using this feature is only supported if you use VirtIO-SCSI and not if you use plain VirtIO.

Some searching brought me to this XML for my Ubuntu 15.10 guest:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <auth username='admin'>
    <secret type='ceph' uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
  </auth>
  <source protocol='rbd' name='libvirt/image1'>
    <host name='hostname.of.my.ceph.monitor'/>
  </source>
  <target dev='sda' bus='scsi'/>
  <controller type='scsi' index='0' model='virtio-scsi'/>
</disk>

Inside the guest

I tried a Ubuntu 15.10 guest but this should be supported in any other modern Linux guest.

lspci shows me:

root@ubuntu1510:~# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895a
root@ubuntu1510:~#

And I have a sda block device which my guest uses:

root@ubuntu1510:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            230M     0  230M   0% /dev
tmpfs            49M  4.6M   45M  10% /run
/dev/sda1       9.3G  1.3G  7.6G  15% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/0
root@ubuntu1510:~#

Now I can run fstrim which will trim the block device:

root@ubuntu1510:~# fstrim -v /
/: 128 MiB (134217728 bytes) trimmed
root@ubuntu1510:~#

Rebuilding libvirt under CentOS 7.1 with RBD storage pool support

If you want to use CentOS 7.1 for your hypervisors with Apache CloudStack and Ceph’s RBD as Primary Storage you need to rebuild libvirt.

CloudStack requires libvirt to be built with RBD storage pool support. It uses libvirt to manage RBD volumes. By default libvirt under CentOS is not built with this support. (On Ubuntu it is btw).

Rebuilding from source

First we need to install a couple of packages:

$ yum install -y rpm-build gcc make ceph-devel

Now we need to download the sRPM:

$ wget http://vault.centos.org/centos/7.1.1503/os/Source/SPackages/libvirt-1.2.8-16.el7.src.rpm

Create a rpmbuild directory:

$ mkdir /root/rpmbuild

Now edit /root/.rpmmacros so that it contains:

%_topdir    /root/rpmbuild

Install the sRPM:

$ rpm -i libvirt-1.2.8-16.el7.src.rpm

Open the /root/rpmbuild/SPECS/libvirt.spec file and look for:

%else
    %define with_storage_rbd      0
%endif

Change this to:

%else
    %define with_storage_rbd      1
%endif

Now build the RPM:

$ cd /root/rpmbuild
$ rpmbuild -ba SPECS/libvirt.spec

After a couple of minutes you should have RPMs with RBD storage pool support enabled!

Calculating RADOS objects for RBD images

Ceph’s RBD (RADOS Block Device) is just a thin wrapper on top of RADOS, the object store of Ceph.

It stripes (by default) over 4MB objects in RADOS. It’s very simple to calculate which RADOS object corresponds with which sector on your RBD image/block device.

First you have to find out the block device’s object prefix name and the stripe size:

ceph@daisy:~$ sudo rbd info test
rbd image 'test':
	size 128 MB in 32 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.1066.2ae8944a
	format: 1
ceph@daisy:~$

In this case the stripe size is 4MB (order 2^22) and the object name prefix is rb.0.1066.2ae8944a

With one line of Perl we can calculate the object name in RADOS:

perl -e 'printf "BLOCK_NAME_PREFIX.%012x\n", ((SECTOR_OFFSET * 512) / (4 * 1024 * 1024))'

Let’s say that we want the object for sector 1 of our block device:

perl -e 'printf "rb.0.1066.2ae8944a.%012x\n", ((0 * 512) / (4 * 1024 * 1024))'

This tells us that we need to fetch object rb.0.1066.2ae8944a.000000000000 from RADOS. This can be done using the ‘rados’ command:

sudo rados -p rbd get rb.0.1066.2ae8944a.000000000000 rb.0.1066.2ae8944a.000000000000

Voila, you just fetched 4MB of your drive. Might be useful if you want to do some data recovery or such.