
Tuesday, February 14, 2012

Under the hood of Swift. The Ring

This is the first post in a series that summarizes our analysis of the Swift architecture. We've tried to highlight some points that are not clear enough in the official documentation; our primary source was an in-depth look at the source code. The Ring is the vital part of the Swift architecture. This half database, half configuration file keeps track of where all data resides in the cluster. For each possible path to any stored entity in the cluster, the Ring points to the particular device on the particular physical node.

There are three types of entities that Swift recognizes: accounts, containers and objects. Each type has a ring of its own, but all three rings are built the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for these tasks: RingBuilder and Ring, respectively.

Ring data structure

Each of the three rings in Swift is a structure that consists of three elements:
  • a list of devices in the cluster, also known as devs in the Ring class;
  • a list of lists of device ids that maps each partition to the devices holding its replicas, stored in a variable named _replica2part2dev_id;
  • an integer number of bits to shift an MD5-hashed account/container/object path by, in order to calculate the partition index for that hash (the partition shift value, part_shift).
List of devices
A list of devices includes all storage devices (disks) known to the ring. Each element of this list is a dictionary of the following structure:
  • id (integer) - index into the devices list
  • zone (integer) - zone the device resides in
  • weight (float) - the relative weight of the device compared to the other devices in the ring
  • ip (string) - IP address of the server containing the device
  • port (integer) - TCP port the server uses to serve requests for the device
  • device (string) - disk name of the device in the host system, e.g. sda1; it is used to identify the disk mount point under /srv/node on the host system
  • meta (string) - general-use field for storing arbitrary information about the device; not used by servers directly
Some device management can be performed using values in this list. First, for removed devices, the 'id' value is set to 'None'; device IDs are generally not reused. Second, setting 'weight' to 0.0 temporarily disables the device, as no partitions will be assigned to it.
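
For illustration, a single entry in the devices list might look like this (all values below are hypothetical):

{'id': 1, 'zone': 2, 'weight': 100.0, 'ip': '10.1.1.2', 'port': 6000,
 'device': 'sdb1', 'meta': ''}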
Partitions assignment list
This data structure is a list of N elements, where N is the replica count for the cluster. The default replica count is 3. Each element of the partitions assignment list is an array('H'), a compact and efficient Python array of unsigned short integers. These values are indices into the list of devices (see the previous section). So, each array('H') in the partitions assignment list represents a mapping of partitions to device IDs.

The ring takes a configurable number of bits from a path's MD5 hash and converts it to an integer. This number is used as an index into the array('H'); the array element it points to holds the ID of the device to which the partition is mapped. The number of bits kept from the hash is known as the partition power, and 2 raised to the partition power is the partition count.

For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.
Partition Shift Value
This is the number of bits taken from the MD5 hash of the '/account[/container[/object]]' path to calculate the partition index for that path. The partition index is calculated by interpreting the leading bits of the hash as an integer.
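
As a rough sketch of how the partition index is derived (the path and the partition power of 18 are illustrative assumptions, and the direct hashing of the bare path is a simplification; current Swift versions also mix a cluster-wide hash path suffix into the path before hashing):

from hashlib import md5
from struct import unpack_from

part_power = 18                    # hypothetical partition power
part_shift = 32 - part_power       # partition shift value stored in the ring

path = '/account/container/object'
# take the first 4 bytes of the MD5 digest as a big-endian integer
# and keep only the leading part_power bits
partition = unpack_from('>I', md5(path.encode('utf-8')).digest())[0] >> part_shift
print(partition)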

Ring operation

The structure described above is stored as a pickled (see Python pickle) and gzipped (see Python gzip.GzipFile) file. There are three files, one per ring, and usually their names are:
account.ring.gz
container.ring.gz
object.ring.gz
These files must exist in the /etc/swift directory on every Swift cluster node, both Proxy and Storage, as services on all of these nodes use them to locate entities in the cluster. Moreover, the ring files on all nodes in the cluster must have the same contents, so that the cluster can function properly.

There are no internal Swift mechanisms that can guarantee that the ring is consistent, i.e. that the gzip file is not corrupt and can be read. Swift services have no way to tell whether all nodes have the same version of the rings. Maintenance of ring files is the administrator's responsibility. These tasks can be automated by means external to Swift itself, of course.

The Ring allows any Swift service to identify which Storage nodes to query for a particular storage entity. The method Ring.get_nodes(account, container=None, obj=None) is used to identify the target Storage nodes for the given path (/account[/container[/object]]). It returns a tuple of the partition and a list of node dictionaries. The partition is used for constructing the local path to the object file or the account/container database. The node dictionaries have the same structure as the devices in the list of devices (see above).
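
A minimal usage sketch (the account, container and object names are made up; in the Swift releases current at the time of writing, the Ring constructor takes the path to the ring file):

from swift.common.ring import Ring

# load the object ring from its pickled, gzipped file
object_ring = Ring('/etc/swift/object.ring.gz')

# get the partition and the nodes holding replicas of this object
partition, nodes = object_ring.get_nodes('AUTH_demo', 'photos', 'puppy.jpg')
for node in nodes:
    print(partition, node['ip'], node['port'], node['device'])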

Ring management

Swift services cannot change the Ring. The Ring is managed by the swift-ring-builder script. When a new Ring is created, the administrator first specifies the builder file and the main parameters of the Ring: the partition power (or partition shift value), the number of replicas of each partition in the cluster, and the time in hours before a specific partition can be moved again:
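
For example, a builder file for the account ring could be created like this (a partition power of 18, 3 replicas and a 1-hour move interval are illustrative values):

swift-ring-builder account.builder create 18 3 1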


When the temporary builder file structure has been created, the administrator should add devices to the Ring. For each device, the required values are the zone number, the IP address of the Storage node, the port on which the server is listening, the device name (e.g. sdb1), optional device metadata (e.g. model name, installation date or anything else) and the device weight:
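
The add command packs these values into a single argument; the zone, IP address, port, device name, metadata and weight below are illustrative:

swift-ring-builder account.builder add z1-10.1.1.2:6002/sdb1_"disk model X" 100.0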


The device weight is used to distribute partitions between the devices: the greater the device weight, the more partitions are assigned to that device. The recommended initial approach is to use devices of the same size across the cluster and set a weight of 100.0 for each device. For devices added later, the weight should be proportional to their capacity. At this point, all devices that will initially be in the cluster should be added to the Ring. Consistency of the builder file can be verified before creating the actual Ring file:
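
The builder file can be checked with the validate command:

swift-ring-builder account.builder validate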


In case of successful verification, the next step is to distribute partitions between devices and create the actual Ring file. This is called 'rebalancing' the Ring. The process is designed to move as few partitions as possible to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:
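
Rebalancing is done with:

swift-ring-builder account.builder rebalance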


The whole procedure must be repeated for all three rings: account, container and object. The resulting .ring.gz files should be pushed to all nodes in the cluster. The builder files are also needed for future changes to the rings, so they should be backed up and kept in a safe place. One approach is to store them in Swift itself as ordinary objects.

Physical disk usage

A partition is essentially a block of data stored in the cluster. This does not mean, however, that disk usage is constant across partitions. Distribution of objects between partitions is based on the hash of the object path, not on the object size or other parameters. Objects are not partitioned, which means that an object is kept as a single file in the storage node's file system (except for very large objects, greater than 5 GB, which can be uploaded in segments - see the Swift documentation).

A partition mapped to a storage device is actually a directory in the structure under /srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on the size of the objects that have been placed into this partition by mapping the hash of the object path through the Ring.
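
For instance, a single object replica typically ends up in a path of roughly the following shape (the device name, partition number, hash and timestamp are made up):

/srv/node/sdb1/objects/174826/5c3/aabf80196cbf6a7184a1d09d8f35c5c3/1328791200.43170.data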

In conclusion, it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I'm going to write about how to solve these problems in one of the following posts.

More information

More information about the Swift Ring can be found in the following sources:
Official Swift documentation - base source for description of data structure
Swift Ring source code on Github - code base of Ring and RingBuilder Swift classes.
Blog of Chmouel Boudjnah - contains useful Swift hints

Friday, August 12, 2011

LDAP identity store for OpenStack Keystone

After some time working with an OpenStack installation that uses an existing LDAP directory for authentication, we encountered one big problem. The latest Dashboard code dropped support for the old bare authentication in favor of a Keystone-based one. At that time Keystone had no support for multiple authentication backends, so we had to develop this feature.
Now we have basic support for LDAP authentication in Keystone, which provides a subset of the functionality that was present in Nova. Currently, the main limitation is the inability to actually integrate with an existing LDAP tree due to limitations in the backend, but it works fine in an isolated corner of LDAP.
So, after a long time of coding and fighting with new upstream workflows, we can give you a chance to try it out.
To do it, one should:
  1. Make sure that all necessary components are installed. They are Nova, Glance, Keystone and Dashboard.

    Since the latter pair is still in incubator, you’ll have to download them from the source repository:
  2. Set up Nova to authorize requests in Keystone:

    It assumes that you’re in the same dir where you’ve downloaded Keystone sources. Replace nova.conf path if it differs in your Nova installation.
  3. Add schema information to your LDAP installation.

    It heavily depends on your LDAP server. There is a common .schema file and an .ldif for the latest version of OpenLDAP in the keystone/keystone/backends/ldap/ dir. For a local OpenLDAP installation, this will do the trick (if you haven't changed the dir after the previous steps):

  4. Modify the Keystone configuration at keystone/etc/keystone.conf to use the LDAP backend:
    • add keystone.backends.ldap to the backends list in [DEFAULT] section;
    • remove Tenant, User, UserRoleAssociation and Token from the backend_entities list in [keystone.backends.sqlalchemy] section;
    • add a new section (don't forget to change the URL, user and password to match your installation); a sketch of this section is given after this list.
  5. Make sure that the ou=Groups,dc=example,dc=com and ou=Users,dc=example,dc=com subtrees exist, or configure the LDAP backend to use other ones by adding the tenant_tree_dn, role_tree_dn and user_tree_dn parameters to the [keystone.backends.ldap] section of the config file.
  6. Run Nova, Keystone and Dashboard as usual.
  7. Create some users, tenants, endpoints, etc. in Keystone using the keystone/bin/keystone-manage command, or just run keystone/bin/sample-data.sh to add test ones.

  8. Now you can authenticate in Dashboard using the credentials of one of the created users. Note that from this point on, all user, project and role management should be done through Keystone, using either the keystone-manage command or the syspanel in Dashboard.
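
To illustrate steps 4 and 5, the LDAP-related part of keystone.conf might end up looking roughly like the sketch below. The ldap_url, ldap_user and ldap_password option names and all values here are assumptions for illustration; check the sample configuration shipped with the LDAP backend for the exact option names in your version:

[keystone.backends.ldap]
ldap_url = ldap://localhost
ldap_user = cn=Manager,dc=example,dc=com
ldap_password = secret
# tenant_tree_dn, role_tree_dn and user_tree_dn can be added here
# to point the backend at non-default subtrees (see step 5)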

Thursday, June 30, 2011

vCider Virtual Switch Overview

A couple of months ago, Chris Marino, CEO at vCider, stopped by the Mirantis office and gave a very interesting presentation on the vCider networking solution for clouds. A few days later, he kindly provided me with beta access to their product.

A few days ago, vCider announced public availability of the product. So now it's a good time to blog about my experience concerning it.


About vCider Virtual Switch

To make a long story short, vCider Virtual Switch allows you to build a virtual Layer 2 network across several Linux boxes; these boxes might be virtual machines (VMs) in a cloud (or even in different clouds), or they might be physical servers.

The flow is pretty simple: you download a package (DEBs and RPMs are available on the site) and install it on all of the boxes for which you will create a network. No configuration is required except for creating a file with an account token.

After that, all you have to do is to visit the vCider Dashboard and create networks and assign nodes to them.

So to start playing with that, I created two nodes on Rackspace and created a virtual network for them, using the 192.168.87.0/24 address space.

A new network interface appeared on each of the boxes:

On the first box:

5: vcider-net0:  mtu 1442 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether ee:cb:0b:93:34:45 brd ff:ff:ff:ff:ff:ff
inet 192.168.87.1/24 brd 192.168.87.255 scope global vcider-net0
inet6 fe80::eccb:bff:fe93:3445/64 scope link
valid_lft forever preferred_lft forever

and on the second one:

7: vcider-net0:  mtu 1442 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether 6e:8e:a0:e9:a0:72 brd ff:ff:ff:ff:ff:ff
inet 192.168.87.4/24 brd 192.168.87.255 scope global vcider-net0
inet6 fe80::6c8e:a0ff:fee9:a072/64 scope link
valid_lft forever preferred_lft forever

tracepath output looks like this:

root@alice:~# tracepath 192.168.87.4
1: 192.168.87.1 (192.168.87.1) 0.169ms pmtu 1442
1: 192.168.87.4 (192.168.87.4) 6.677ms reached
1: 192.168.87.4 (192.168.87.4) 0.338ms reached
Resume: pmtu 1442 hops 1 back 64
root@alice:~#

arping also works fine:

novel@bob:~ %> sudo arping -I vcider-net0 192.168.87.1
ARPING 192.168.87.1 from 192.168.87.4 vcider-net0
Unicast reply from 192.168.87.1 [EE:CB:0B:93:34:45] 0.866ms
Unicast reply from 192.168.87.1 [EE:CB:0B:93:34:45] 1.030ms
Unicast reply from 192.168.87.1 [EE:CB:0B:93:34:45] 0.901ms
^CSent 3 probes (1 broadcast(s))
Received 3 response(s)
novel@bob:~ %>

Performance

One of the most important questions is performance. First, I used iperf to measure bandwidth on the public interfaces:

novel@bob:~ %> iperf -s -B xx.yy.94.250
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address xx.yy.94.250
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34231
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.3 sec 12.3 MBytes 9.94 Mbits/sec
[ 5] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34232
[ 5] 0.0-20.9 sec 12.5 MBytes 5.02 Mbits/sec
[SUM] 0.0-20.9 sec 24.8 MBytes 9.93 Mbits/sec
[ 6] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34233
[ 6] 0.0-10.6 sec 12.5 MBytes 9.92 Mbits/sec
[ 4] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34234
[ 4] 0.0-10.6 sec 12.5 MBytes 9.94 Mbits/sec
[ 5] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34235
[ 5] 0.0-10.5 sec 12.4 MBytes 9.94 Mbits/sec
[ 6] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34236
[ 6] 0.0-10.6 sec 12.6 MBytes 9.94 Mbits/sec
[ 4] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34237
[ 4] 0.0-10.7 sec 12.6 MBytes 9.94 Mbits/sec
[ 5] local xx.yy.94.250 port 5001 connected with xx.yy.84.110 port 34238
[ 5] 0.0-10.6 sec 12.6 MBytes 9.93 Mbits/sec

So it gives an average bandwidth of ~9.3 Mbit/sec.

And here's the same test via vCider network:

novel@bob:~ %> iperf -s -B 192.168.87.4
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 192.168.87.4
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60977
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.5 sec 11.4 MBytes 9.10 Mbits/sec
[ 5] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60978
[ 5] 0.0-10.5 sec 11.4 MBytes 9.05 Mbits/sec
[ 6] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60979
[ 6] 0.0-10.6 sec 11.4 MBytes 9.03 Mbits/sec
[ 4] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60980
[ 4] 0.0-10.4 sec 11.2 MBytes 9.03 Mbits/sec
[ 5] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60981
[ 5] 0.0-10.5 sec 11.4 MBytes 9.06 Mbits/sec
[ 6] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60982
[ 6] 0.0-10.4 sec 11.3 MBytes 9.05 Mbits/sec
[ 4] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60983
[ 4] 0.0-20.8 sec 11.2 MBytes 4.51 Mbits/sec
[SUM] 0.0-20.8 sec 22.4 MBytes 9.05 Mbits/sec
[ 5] local 192.168.87.4 port 5001 connected with 192.168.87.1 port 60984
[ 5] 0.0-10.5 sec 11.3 MBytes 9.03 Mbits/sec

It gives an average bandwidth of 8.5 Mbit/sec, which is about 91% of the original bandwidth - not bad, I believe.

For the sake of experiment, I tried to emulate TAP networking using openvpn. I chose the quickest configuration possible and just ran openvpn on the server this way:

# openvpn --dev tap0

and on the client:

# openvpn --remote xx.yy.94.250 --dev tap0

As you might guess, openvpn runs in user space and it tunnels traffic over the public
interfaces on the boxes I use for tests.

And I conducted another iperf test:

novel@bob:~ %> iperf -s -B 192.168.37.4
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 192.168.37.4
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53923
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.5 sec 11.2 MBytes 8.97 Mbits/sec
[ 5] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53924
[ 5] 0.0-10.5 sec 11.1 MBytes 8.88 Mbits/sec
[ 6] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53925
[ 4] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53926
[ 6] 0.0-10.4 sec 11.1 MBytes 8.90 Mbits/sec
[ 4] 0.0-20.6 sec 10.8 MBytes 4.38 Mbits/sec
[SUM] 0.0-20.6 sec 21.8 MBytes 8.90 Mbits/sec
[ 5] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53927
[ 5] 0.0-10.4 sec 11.0 MBytes 8.87 Mbits/sec
[ 6] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53928
[ 6] 0.0-10.3 sec 10.9 MBytes 8.90 Mbits/sec
[ 4] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53929
[ 4] 0.0-10.5 sec 11.1 MBytes 8.88 Mbits/sec
[ 5] local 192.168.37.4 port 5001 connected with 192.168.37.1 port 53930
[ 5] 0.0-10.3 sec 10.9 MBytes 8.88 Mbits/sec

It gives an average bandwidth of 8.3 Mbit/sec, which is 89% of the original bandwidth. It's just a little slower than vCider Virtual Switch, which is very good for openvpn, but I have to note it's not quite a fair comparison:

  • I don't use encryption in my openvpn setup
  • Real-world openvpn configuration will be much more complex
  • I believe openvpn will scale significantly worse as the number of machines in the network grows, since openvpn works in client/server mode while vCider works in p2p mode and uses a central service to grab metadata such as routing information, etc.


Also, it seems to me that the vCider team's own comparison to openvpn is helpful, as they have a note on it in the FAQ -- be sure to check it out.

Support

It's a pleasure to note that the vCider team is very responsive. As I started testing the product at quite an early stage, I spotted some issues, though none of them were critical. It's a great pleasure to see that they were all fixed in the next version.

Conclusion

vCider Virtual Switch is a product that behaves as expected, performs well, is completely documented and is easy to use. The vCider team provides good support as well.

It seems that for relatively small setups within a single trusted environment, e.g. about 5-8 VMs within a single cloud provider, where traffic encryption and performance are not that critical, one could go with an openvpn setup. However, when either security or performance becomes important, or the size of the setup increases, vCider Virtual Switch would be a good choice.

I am looking forward to new releases, and specifically I'm very curious about multicast support and an exposed API for managing networks.

Further reading

* vCider Home Page
* vCider Virtual Switch FAQ
* Wikipedia article on OSI model
* OpenVPN Home Page

Thursday, June 9, 2011

Clustered LVM on DRBD resource in Fedora Linux

As Florian Haas pointed out in a comment on my previous post, our shared storage configuration requires special precautions to avoid corruption of data when the two hosts connected via DRBD try to manage LVM volumes simultaneously. Generally, these precautions concern locking LVM metadata operations while running DRBD in 'dual-primary' mode.

Let's examine it in detail. The LVM locking mechanism is configured in the 'global' section of /etc/lvm/lvm.conf. The 'locking_type' parameter is the most important here: it defines which locking mechanism LVM uses while changing metadata. It can be equal to:

  • '0': disables locking completely - it's dangerous to use;
  • '1': default, local file-based locking. It knows nothing about the cluster and possible conflicting metadata changes;
  • '2': uses an external shared library, defined by the 'locking_library' parameter;
  • '3': uses built-in LVM clustered locking;
  • '4': read-only locking which forbids any changes of metadata.


The simplest way is to use local locking on one of the DRBD peers and to disable metadata operations on the other one. This has a serious drawback, though: our Volume Groups and Logical Volumes will not be activated automatically upon creation on the other, 'passive' peer. This is not good for a production environment and cannot be automated easily.

But there is another, more sophisticated way. We can use Linux-HA (Heartbeat) coupled with the LVM resource agent. It automates activation of newly created LVM resources on the shared storage, but still provides no locking mechanism suitable for 'dual-primary' DRBD operation.

It should be noted that full support for clustered locking in LVM can be achieved with the lvm2-cluster Fedora RPM package from the standard repository. It contains the clvmd service, which runs on all hosts in the cluster and coordinates LVM locking on the shared storage. In our case, the cluster consists of just the two DRBD peers.

clvmd requires a cluster engine in order to function properly. It is provided by the cman service, which is installed as a dependency of lvm2-cluster (other dependencies may vary from installation to installation):
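
On Fedora this boils down to something like the following (the exact set of dependencies pulled in may differ):

# yum install lvm2-cluster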



The only thing we need the cluster for is the use of clvmd, so the configuration of the cluster itself is pretty basic. Since we don't need advanced features like automated fencing yet, we specify manual handling. As we have only 2 nodes in the cluster, we can tell cman about it. The configuration for cman resides in the /etc/cluster/cluster.conf file:
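
A minimal sketch of such a cluster.conf, assuming two nodes named node1.example.com and node2.example.com (the node names and the cluster name are placeholders), could look like this:

<?xml version="1.0"?>
<cluster name="clvmd-cluster" config_version="1">
  <!-- two-node cluster: one vote is enough for quorum -->
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="node1.example.com" votes="1" nodeid="1"/>
    <clusternode name="node2.example.com" votes="1" nodeid="2"/>
  </clusternodes>
</cluster>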



The clusternode name should be a fully qualified domain name and should be resolvable via DNS or present in /etc/hosts. The number of votes is used to determine the quorum of the cluster. In this case, we have two nodes, one vote per node, and expect one vote to be enough to make the cluster run (to have a quorum), as configured by the cman 'expected_votes' attribute.

The second thing we need to configure is the cluster engine (corosync). Its configuration goes to /etc/corosync/corosync.conf:
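
A minimal sketch of the totem section, assuming the back-to-back network is 192.168.1.0/24 (replace with your own network and multicast settings):

totem {
    version: 2
    secauth: off
    interface {
        ringnumber: 0
        # network address of the interconnect, not a host address
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}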



The bindnetaddr parameter must contain a network address, not a host address. We configure corosync to work on the eth1 interfaces, which connect our nodes back-to-back over a 1 Gbps network. Also, we should configure iptables to accept multicast traffic on both hosts.

It's noteworthy that these configurations should be identical on both cluster nodes.

After the cluster has been prepared, we can change the LVM locking type in /etc/lvm/lvm.conf on both DRBD-connected nodes:
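
The relevant change is the locking type in the 'global' section of lvm.conf:

global {
    # built-in clustered locking, handled by clvmd
    locking_type = 3
}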



Then we start the cman and clvmd services on both DRBD peers to get our cluster ready for action:
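
For example:

# service cman start
# service clvmd start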



Now, as we already have a Volume Group on the shared storage, we can easily make it cluster-aware:
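
Assuming the Volume Group is named vg_shared, as in the rest of this post:

# vgchange -c y vg_shared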



Now we see the 'c' flag in VG Attributes:
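
For example (the 'c' in the last position of the attribute string marks a clustered VG):

# vgs -o vg_name,vg_attr vg_shared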



As a result, Logical Volumes created in the vg_shared volume group will be active on both nodes, and clustered locking is enabled for operations with volumes in this group. LVM commands can be issued on both hosts, and clvmd takes care of possible concurrent metadata changes.

Thursday, May 19, 2011

Shared storage for OpenStack based on DRBD

Storage is a tricky part of the cloud environment. We want it to be fast, to be network-accessible and to be as reliable as possible. One way is to go to the shop and buy yourself a SAN solution from a prominent vendor for solid money. Another way is to take commodity hardware and use open source magic to turn it into distributed network storage. Guess what we did?

We have several primary goals ahead. First, our storage has to be reliable; we want to survive both minor and major hardware crashes, from HDD failure to host power loss. Second, it must be flexible enough to slice fast and easily and to resize slices as we like. Third, we will manage and mount our storage from cloud nodes over the network. And, last but not least, we want decent performance from it.

For now, we have decided on the DRBD driver for our storage. DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network-based RAID-1. It has lots of features, has been tested and is reasonably stable.

DRBD has been supported by the Linux kernel since version 2.6.33. It is implemented as a kernel module and included in the mainline. We can install the DRBD driver and command line interface tools using a standard package distribution mechanism; in our case it is Fedora 14:
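
On Fedora 14 the userland tools can be installed roughly like this (the package name may differ between distributions; the kernel module itself ships with the mainline kernel):

# yum install drbd-utils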


The DRBD configuration file is /etc/drbd.conf, but usually it contains only 'include' statements. The configuration itself resides in global_common.conf and *.res files inside /etc/drbd.d/. An important parameter in global_common.conf is 'protocol'. It defines the sync level of the replication:

  • A (async). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has been placed in the local TCP send buffer. Data loss is possible in case of fail-over.


  • B (semi-sync or memory-sync). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Data loss is unlikely unless the primary node is irrevocably destroyed.


  • C (sync). Local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. This is the default replication mode.


Other sections of the common configuration are usually left blank and can be redefined in per-resource configuration files. To create a usable resource, we must create a configuration file for our resource in /etc/drbd.d/drbd0.res. Basic parameters for the resource are:

  • Name of the resource. Defined with the 'resource' keyword, which opens the main configuration section.


  • 'on' directive opens the host configuration section. Only 2 'on' host sections are allowed per resource. Common parameters for both hosts can be defined once in the main resource configuration section.


  • The 'address' directive is unique to each host and must contain the IP address and port number on which the DRBD driver listens.


  • 'device' directive defines the path to the device created on the host for the DRBD resource.


  • 'disk' is the path to the back-end device for the resource. This can be a hard drive partition (e.g. /dev/sda1), a software or hardware RAID device, an LVM Logical Volume or any other block device configured by the Linux device-mapper infrastructure.


  • 'meta-disk' defines how DRBD stores meta-data. It can be 'internal' when meta-data resides on the same back-end device as user data, or 'external' on a separate device.


Configuration Walkthrough

We are creating a relatively simple configuration: one DRBD resource shared between two nodes. On each node, the back-end for the resource is the software RAID-0 (stripe) device /dev/md3, made of two disks. The hosts are connected back-to-back by Gigabit Ethernet interfaces with private addresses.



As we need write access to the resource on both nodes, we must make it 'primary' on both nodes. A DRBD device in the primary role can be used unrestrictedly for read and write operations. This mode is called 'dual-primary' mode, and it requires additional configuration. In the 'startup' section, the 'become-primary-on' directive is set to 'both'. In the 'net' section, the following is recommended:



The 'allow-two-primaries' directive allows both ends to send data.
Next, three parameters define I/O error handling.
The 'sndbuf-size' is set to 0 to allow dynamic adjustment of the TCP buffer size.

Resource configuration with all of these considerations applied will be as follows:
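
A sketch of what such a drbd0.res could look like is shown below; the host names, IP addresses, port and the after-sb-* recovery policies are illustrative assumptions and should be adapted to your environment:

resource drbd0 {
  protocol C;
  device    /dev/drbd0;
  disk      /dev/md3;
  meta-disk internal;

  startup {
    become-primary-on both;
  }

  net {
    allow-two-primaries;
    # example split-brain recovery policies; choose ones suitable for your data
    after-sb-0pri discard-zeroes-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    sndbuf-size 0;
  }

  on host1 {
    address 192.168.1.1:7789;
  }
  on host2 {
    address 192.168.1.2:7789;
  }
}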



Enabling Resource For The First Time

To create the device /dev/drbd0 for later use, we use the drbdadm command:
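
Assuming the resource is named drbd0, its metadata is initialized with:

# drbdadm create-md drbd0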



After the front-end device is created, we bring the resource up:
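
In step-by-step form (equivalent to the 'drbdadm up drbd0' shorthand mentioned below):

# drbdadm attach drbd0
# drbdadm syncer drbd0
# drbdadm connect drbd0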



This command set must be executed on both nodes. We may collapse the steps drbdadm attach, drbdadm syncer, and drbdadm connect into one, by using the shorthand command drbdadm up.
Now we can observe the /proc/drbd virtual status file and get the status of our resource:
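
For example (at this point both nodes typically report the resource as Connected, Secondary/Secondary and Inconsistent/Inconsistent, since the initial sync has not happened yet):

# cat /proc/drbd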



We must now synchronize the resources on both nodes. If we want to replicate data that is already on one of the drives, it's important to run the next command on the host that contains the data. Otherwise, it can be issued on either of the two hosts.
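
With the DRBD tools of this generation, this is typically done by forcing one node to become primary and the synchronization source, e.g.:

# drbdadm -- --overwrite-data-of-peer primary drbd0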



This command puts the node host1 in 'primary' mode and makes it the synchronization source. This is reflected in the status file /proc/drbd:



We can adjust the syncer rate to make the initial and background synchronization faster. To speed up the initial sync, the drbdsetup command is used:
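
For example, to temporarily raise the resync rate to roughly the limit of a Gigabit link (the value is illustrative):

# drbdsetup /dev/drbd0 syncer -r 110M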



This allows us to consume almost all bandwidth of Gigabit Ethernet. The background syncer rate is configured in the corresponding config file section:
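
A sketch of the corresponding section in the resource configuration, assuming a Gigabit link and the ~0.3 rule of thumb from the next paragraph:

syncer {
  rate 33M;
}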



The exact rate depends on the available bandwidth and should be about 0.3 of the throughput of the slowest I/O subsystem (network or disk). DRBD seems to slow the resync down if it interferes with the data flow.

LVM Over DRBD Configuration

Configuration of LVM over DRBD requires changes to /etc/lvm/lvm.conf. First, a physical volume is created:
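
For example:

# pvcreate /dev/drbd0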



This command writes the LVM Physical Volume signature on the drbd0 device, and thus also on the underlying md3 device. This can pose a problem, as LVM's default behavior is to scan all block devices for LVM PV signatures; two devices with the same UUID will be detected and an error issued. This can be avoided by excluding /dev/md3 from scanning in the /etc/lvm/lvm.conf file using the 'filter' parameter:
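
A sketch of such a filter in the 'devices' section (the exact patterns depend on your device naming):

devices {
    # accept DRBD devices, reject the backing md3 device, accept everything else
    filter = [ "a|^/dev/drbd.*|", "r|^/dev/md3$|", "a|.*|" ]
}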



The vgscan command must be executed after the file is changed. It forces LVM to discard its configuration cache and re-scan the devices for PV signatures.
Different 'filter' configurations can be used, but they must ensure that: 1. DRBD devices used as PVs are accepted (included); 2. The corresponding lower-level devices are rejected (excluded).

It is also necessary to disable the LVM write cache:
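
In the 'devices' section of lvm.conf (after this change it is also wise to remove any stale cache file, such as /etc/lvm/cache/.cache):

devices {
    write_cache_state = 0
}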



These steps must be repeated on the peer node. Now we can create a Volume Group using the configured PV /dev/drbd0, and a Logical Volume in this VG. Execute these commands on one of the nodes:
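
For example (the Logical Volume name and size are made up):

# vgcreate vg_shared /dev/drbd0
# lvcreate -L 10G -n lv_test vg_shared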



To make use of this VG and LV on the peer node, we must make it active on it:
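
On the peer node:

# vgchange -a y vg_shared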



When the new PV is configured, it is possible to proceed to adding it to the Volume Group or creating a new one from it. This VG can be used to create Logical Volumes as usual.

Conclusion
We are going to install OpenStack on nodes with shared storage as a private cloud controller. The architecture of our system presumes that storage volumes will reside on the same nodes as nova-compute. This makes it very important to have some level of disaster survival on the cloud nodes.

With DRBD we can survive any I/O errors on one of the nodes. DRBD's internal error handling can be configured to mask such errors and go to diskless mode, in which all I/O operations are transparently redirected from the failed node to its peer. This gives us time to restore the faulty disk system.

If we have a major system crash, we still have all of the data on the second node and can use it to restore or replace the failed system. A network failure can put us into a 'split brain' situation, when data differs between the hosts. This is dangerous, but DRBD also has rather powerful mechanisms to deal with this kind of problem.

Monday, May 16, 2011

Make your bet on open source infrastructure computing


Today we are launching our company blog, focused on open source infrastructure computing. We plan to cover various emerging technologies and market paradigms related to this segment of IT. As you might imagine, we did not choose this topic by accident. Aside from being the focus of our blog, it is also the focus of Mirantis as a company. Employing Silicon Valley industry veterans backed by 150 open source hackers and programming champions from Russia, we have built this company because we believe in a few basic principles. I felt there was no better way to open our blog than to share these principles with the world. So here we go:

1. Cloud Drives Adoption of Open Source

Until recently, the biggest selling point of commercial enterprise software was its reliability and scalability when it comes to mission-critical tasks. Open source was considered OK by enterprises for tactical purposes, but a no-no for mission-critical, enterprise-wide stuff. Now, after Amazon, Rackspace, salesforce.com, etc. have built out their systems on top of what's now largely available in open source, the argument that OSS is unreliable no longer holds water.

Moreover, today, cloud essentially refers to a new paradigm for the delivery of IT services, i.e. an economic model that revolves around "pay for what you get, when you get it." Surprisingly, it took enterprises a very long time to accept this approach, but last year was pivotal in showing that it is gaining traction and is the way of the future. Open source has historically been monetized with a model that is much closer to "cloud" than that of commercial software. In the case of commercial software, you buy the license and pay for implementation upfront; if you are lucky enough to implement, you continue to pay a subscription, which is sold in various forms - support, service assurance, etc. With open source, you always implement first; if it works, you may (or may not) buy commercial support, which is also frequently sold as a subscription service. Therefore, as enterprises wrap their mindset around cloud, they shy further away from the traditional commercial software model and move closer to the open source / services focused model.

2. OSS is The Future of Enterprise Infrastructure Computing

I expect that enterprise adoption of open source will be particularly concentrated in the infrastructure computing space, i.e. open source databases (NoSQL, MySQL instead of Oracle, DB2, etc.), application servers (SpringDM, JBoss vs. WebSphere, WebLogic), messaging engines (RabbitMQ vs. Tibco), infrastructure monitoring and security tools, etc. Adoption of OSS initiatives higher up the stack (Alfresco, Compiere ERP, Pentaho, etc.) will, in my opinion, lag behind infrastructure projects. One of the reasons is greater end user dependence on tools that are higher up the stack. If you have 100 employees who are used to getting their BI reports in Cognos, it is hard to get them to switch to Pentaho and get used to the new user interface and report formats. However, if your Cognos BI runs on Oracle, switching it to MySQL will likely only affect a few IT folks, while the 100 users will not notice the difference.

More importantly, however, the lower down the stack you are, the more “techie” the consumer of your product is. The more techie your consumer, the more likely he is to a) prefer customizing the product to the process and not the other way around; b) ultimately contribute to the open source product. Lower level OSS products tend to be more popular and more in demand overall. The extreme example would be to look at operating system vs. end user apps. Linux powers more than half of enterprise servers, but how many people use open source text editing software?

3. Public PaaS is not for Everyone

An alternative to dealing with infrastructure computing is to not deal with it at all and use a platform like Google App Engine or Force.com to build your apps. Why deal with lower end of the stack at all if the guys that know how to do it best already today allow you to use their platform? I believe that PaaS will become the dominant answer in the SMB market, however, organizations that fall in the category of “technology creators” such as cloud service vendors themselves, financial services, large internet portals etc. will always want to keep control over their entire stack to be able to innovate ahead of the curve and remain vendor independent. Therefore, technology driven companies (those that differentiate with technology) will be the primary market for proprietary OSS based infrastructure computing.

4. Infrastructure Computing is Nobody’s Core Competency

Although infrastructure computing is a necessary component in every organization and most technology driven companies want to have full control over their entire stack, there are no technology companies out there that differentiate themselves based on the awesomeness of their infrastructure stack. Yes, everybody knows that Google’s application infrastructure is great and so is that of salesforce.com, but in the end, the customers don’t care if it takes 2K servers to power salesforce.com or 100K servers, as long as the features are there. In that context, it almost always makes sense to outsource infrastructure computing functions to some third party so as to enable the company to focus on those aspects of its technology that differentiate it from the competition.