Thursday, May 19, 2011

Shared storage for OpenStack based on DRBD

Storage is a tricky part of a cloud environment. We want it to be fast, network-accessible, and as reliable as possible. One way is to go to the shop and buy yourself a SAN solution from a prominent vendor for serious money. Another way is to take commodity hardware and use open source magic to turn it into distributed network storage. Guess what we did?

We have several primary goals. First, our storage has to be reliable: we want to survive both minor and major hardware crashes, from a single HDD failure to the loss of a whole host. Second, it must be flexible enough to slice quickly and easily, and to resize slices as we like. Third, we will manage and mount our storage from cloud nodes over the network. And, last but not least, we want decent performance from it.

For now, we have decided on the DRBD driver for our storage. DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network-based RAID-1. It has lots of features, has been tested and is reasonably stable.

DRBD has been supported by the Linux kernel since version 2.6.33. It is implemented as a kernel module and included in the mainline. We can install the DRBD driver and command line interface tools using a standard package distribution mechanism; in our case it is Fedora 14:
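The installation boils down to a single package manager call; exact package names can vary between Fedora releases, so treat this as a sketch:

```shell
# Install the DRBD userland tools (drbdadm, drbdsetup, etc.).
# The kernel module itself ships with mainline kernels since 2.6.33.
yum install drbd drbd-utils
```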

The DRBD configuration file is /etc/drbd.conf, but usually it contains only 'include' statements. The configuration itself resides in global_common.conf and *.res files inside /etc/drbd.d/. An important parameter in global_common.conf is 'protocol'. It defines the sync level of the replication:

  • A (async). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has been placed in the local TCP send buffer. Data loss is possible in case of fail-over.

  • B (semi-sync or memory-sync). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Data loss is unlikely unless the primary node is irrevocably destroyed.

  • C (sync). Local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. This is the default replication mode.
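The relevant fragment of global_common.conf might then look like this (a sketch; other sections omitted):

```
# /etc/drbd.d/global_common.conf (fragment)
global {
    usage-count no;
}
common {
    # Fully synchronous replication: no data loss on single-node failure
    protocol C;
}
```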

Other sections of the common configuration are usually left blank and can be redefined in per-resource configuration files. To create a usable resource, we must create a configuration file for our resource in /etc/drbd.d/drbd0.res. Basic parameters for the resource are:

  • Name of the resource. Defined with the 'resource' keyword, which opens the main configuration section.

  • The 'on' directive opens a host configuration section. Only two 'on' host sections are allowed per resource. Parameters common to both hosts can be defined once in the main resource configuration section.

  • The 'address' directive is unique to each host and must contain the IP address and port number on which the DRBD driver listens.

  • 'device' directive defines the path to the device created on the host for the DRBD resource.

  • 'disk' is the path to the back-end device for the resource. This can be a hard drive partition (e.g. /dev/sda1), a software or hardware RAID device, an LVM Logical Volume, or any other block device configured by the Linux device-mapper infrastructure.

  • 'meta-disk' defines how DRBD stores its meta-data: 'internal' keeps it on the same back-end device as user data, while 'external' puts it on a separate device.

Configuration Walkthrough

We are creating a relatively simple configuration: one DRBD resource shared between two nodes. On each node, the back-end for the resource is the software RAID-0 (stripe) device /dev/md3 made of two disks. The hosts are connected back-to-back by Gigabit Ethernet interfaces with private addresses.

As we need write access to the resource on both nodes, we must make it 'primary' on both nodes. A DRBD device in the primary role can be used unrestrictedly for read and write operations. This mode is called 'dual-primary' mode. Dual-primary mode requires additional configuration. In the 'startup' section directive, 'become-primary-on' is set to 'both'. In the 'net' section, the following is recommended:

The 'allow-two-primaries' directive allows both ends to send data.
Next, three parameters define the error handling policy.
Finally, 'sndbuf-size' is set to 0 to allow dynamic adjustment of the TCP send buffer size.
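A sketch of the two sections described above; the after-sb-* policies shown here follow the DRBD User's Guide recommendations for dual-primary setups:

```
startup {
    become-primary-on both;
}

net {
    allow-two-primaries;                  # both nodes may hold the Primary role
    after-sb-0pri discard-zero-changes;   # split-brain recovery policies,
    after-sb-1pri discard-secondary;      # as recommended by the DRBD
    after-sb-2pri disconnect;             # User's Guide for dual-primary
    sndbuf-size 0;                        # auto-tune the TCP send buffer
}
```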

Resource configuration with all of these considerations applied will be as follows:
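A minimal sketch of such a resource file; the hostnames, IP addresses and port are example values and must match your own nodes:

```
# /etc/drbd.d/drbd0.res
resource drbd0 {
    device    /dev/drbd0;
    disk      /dev/md3;
    meta-disk internal;

    startup {
        become-primary-on both;
    }

    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        sndbuf-size 0;
    }

    on host1 {
        address 10.0.0.1:7789;
    }
    on host2 {
        address 10.0.0.2:7789;
    }
}
```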

Enabling Resource For The First Time

To create the device /dev/drbd0 for later use, we use the drbdadm command:
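Something along these lines, run on both nodes (the resource name matches the example configuration above):

```shell
# Initialize the DRBD meta-data area for the resource
drbdadm create-md drbd0
```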

After the front-end device is created, we bring the resource up:
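A sketch of the individual steps:

```shell
# Attach the resource to its backing device
drbdadm attach drbd0
# Load the synchronization parameters
drbdadm syncer drbd0
# Connect to the peer node
drbdadm connect drbd0
```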

This command set must be executed on both nodes. We may collapse the steps drbdadm attach, drbdadm syncer, and drbdadm connect into one, by using the shorthand command drbdadm up.
Now we can observe the /proc/drbd virtual status file and get the status of our resource:
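For example:

```shell
# Show the connection state, roles and disk states of all resources
cat /proc/drbd
```

Before the initial synchronization, both nodes typically report a state along the lines of cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent.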

We must now synchronize the resources on both nodes. If we want to replicate data that is already on one of the drives, it is important to run the next command on the host that contains the data. Otherwise, it can be issued on either of the two hosts.
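Assuming host1 holds the data, the command would be:

```shell
# Make this node Primary and the source of the initial full sync,
# overwriting whatever the peer currently holds
drbdadm -- --overwrite-data-of-peer primary drbd0
```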

This command puts the node host1 in 'primary' mode and makes it the synchronization source. This is reflected in the status file /proc/drbd:

We can adjust the syncer rate to make initial and background synchronization faster. To speed up the initial sync, the drbdsetup command is used:
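For example (110M is an illustrative value, close to Gigabit Ethernet wire speed):

```shell
# Temporarily raise the resync rate for the initial synchronization
drbdsetup /dev/drbd0 syncer -r 110M
# Revert to the rate from the configuration file afterwards
drbdadm adjust drbd0
```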

This allows us to consume almost all bandwidth of Gigabit Ethernet. The background syncer rate is configured in the corresponding config file section:
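The section might look like this (33M is an example, roughly 0.3 of Gigabit Ethernet throughput):

```
syncer {
    rate 33M;   # background resync rate, ~0.3 of the slowest I/O path
}
```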

The exact rate depends on available bandwidth and should be about 0.3 of the bandwidth of the slowest I/O subsystem involved (network or disk); in our experience, a higher rate actually slows things down, because the resync traffic starts to interfere with the regular data flow.

LVM Over DRBD Configuration

Configuration of LVM over DRBD requires changes to /etc/lvm/lvm.conf. First, a Physical Volume is created:
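The PV is created on the DRBD device, not on the backing md device:

```shell
# Write an LVM Physical Volume signature onto the DRBD device
pvcreate /dev/drbd0
```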

This command writes LVM Physical Volume data on the drbd0 device and also on the underlying md3 device. This can pose a problem, as LVM's default behavior is to scan all block devices for LVM PV signatures. This means two devices with the same UUID will be detected and an error will be issued. This can be avoided by excluding /dev/md3 from scanning in the /etc/lvm/lvm.conf file by using the 'filter' parameter:
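A sketch of such a filter, accepting DRBD devices and rejecting the backing device:

```
# /etc/lvm/lvm.conf (fragment)
# Accept (a) DRBD devices as PVs, reject (r) the underlying md device
filter = [ "a|/dev/drbd.*|", "r|/dev/md3|" ]
```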

The vgscan command must be executed after the file is changed. It forces LVM to discard its configuration cache and re-scan the devices for PV signatures.
Different 'filter' configurations can be used, but it must ensure that: 1. DRBD devices used as PVs are accepted (included); 2. Corresponding lower-level devices are rejected (excluded).

It is also necessary to disable the LVM write cache:
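In /etc/lvm/lvm.conf:

```
# /etc/lvm/lvm.conf (fragment)
# Disable the persistent device cache so stale entries for the
# rejected backing device are not reused
write_cache_state = 0
```

After changing this setting, the stale cache file (usually /etc/lvm/cache/.cache) should also be deleted.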

These steps must be repeated on the peer node. Now we can create a Volume Group using the configured PV /dev/drbd0, and a Logical Volume in this VG. Execute these commands on one of the nodes:
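For example (the VG name and LV size here are illustrative; we use a name that fits the OpenStack setup described below):

```shell
# Create a Volume Group on the DRBD-backed PV
vgcreate nova-volumes /dev/drbd0
# Create a Logical Volume in it
lvcreate -L 10G -n vol1 nova-volumes
```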

To make use of this VG and LV on the peer node, we must activate them there:
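On the peer node (the VG name is the illustrative one used above):

```shell
# Re-scan block devices for PV signatures, then activate the VG
vgscan
vgchange -a y nova-volumes
```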

When the new PV is configured, it is possible to proceed to adding it to the Volume Group or creating a new one from it. This VG can be used to create Logical Volumes as usual.

We are going to install OpenStack on the nodes with shared storage as a private cloud controller. The architecture of our system presumes that storage volumes will reside on the same nodes as nova-compute. This makes it very important to have some level of disaster survival on the cloud nodes.

With DRBD we can survive any I/O errors on one of the nodes. DRBD's internal error handling can be configured to mask such errors and go to diskless mode. In this mode, all I/O operations are transparently redirected from the failed node to the peer. This gives us time to restore the faulty disk system.

If we have a major system crash, we still have all of the data on the second node. We can use it to restore or replace the failed system. A network failure can put us into a 'split brain' situation, where data differs between hosts. This is dangerous, but DRBD also has rather powerful mechanisms to deal with these kinds of problems.