Thursday, April 12, 2012
The New Open Source Superpower
Today is yet another important day in the history of OpenStack. The initial list of founding organizations for the independent OpenStack foundation has been announced and we, at Mirantis, are proud to be on that list.
While there is a lot to talk about regarding what this means for the infrastructure cloud market, I’d like to focus on what it illustrates about the sheer momentum of the OpenStack beast. The non-profit legal entity that will house OpenStack has not yet been formed, but 18 organizations have already pledged significant financial (and not only financial) support to the foundation. The current financing model calls for a $500K/year commitment from a Platinum sponsor and $50-$200K/year from a Gold sponsor. Judging by the current composition of the supporting organizations, it is clear that the new foundation will launch with an initial budget north of $5M.
So how does this measure up against the rest of the FLOSS ecosystem? Well, there is a reason why OpenStack has been repeatedly tagged as the Linux of the cloud. With a $5M annual budget, the newly formed OpenStack foundation takes the second spot in the entire FLOSS world. And it is second only to… you guessed it… the Linux Foundation itself. According to the Form 990 filed by the Linux Foundation in 2010, its operating revenues were $9.6M. Yes, the Linux Foundation budget is still double that of OpenStack’s… but… come on… Linux holds close to 20% of the server market. It also happens to power the majority of all mobile devices. OpenStack = Linux was a vision… judging by these numbers, this vision may soon be realized.
Another interesting thing these numbers reveal is why OpenStack (unlike CloudStack) has opted to create its own foundation, rather than surrendering everything to the governance of the Apache Foundation. With the Apache Foundation's budget under $1M, OpenStack eats it for breakfast.
Now many of you will argue that none of this matters. The Apache Foundation houses many great projects that are far more mature and popular than OpenStack… true. But can you tell me how many of these are truly vendor agnostic? And I am not talking about developer tools like Ant, Maven, Beehive, etc. All Apache projects fall into two categories: they are either developer tools or vendor-centric enterprise products: Tomcat (VMware), Hadoop (Cloudera), Cloud.com (which will now be Citrix =)).
In my opinion, there is a reason for this, and it is somewhat tied to foundation budgets. Open source is heavily driven by marketing. The number one open source company, Red Hat, spends 2-3x more on marketing relative to its revenue than any of its closed source competitors. Ultimately, it is the marketing spend on an open source project that heavily affects its vendor independence status. If the entire spend comes from a single pocket, a single vendor dominates that product.
Unlike most Apache open source projects, OpenStack (while still under Rackspace) was backed by a significant marketing and PR budget. Consequently, when foundation plans were being discussed, it was the desire to continue this centralized marketing effort that precluded OpenStack from considering the Apache Foundation as its home. A significant chunk of the $5M raised will be spent by the foundation to promote and protect the OpenStack brand and the projects the foundation will house. In a sense, this implies that anyone seeking to derail the vendor-independent status of OpenStack will need a marketing budget comparable to the $5M the foundation has raised… I say this is a decent barrier to start with.
Thursday, April 5, 2012
Some Brutally Honest Thoughts on Citrix’s Defection
When I first heard the announcement about Cloud.com being spun off into the Apache Foundation, my initial reaction was to interpret the event as a hostile move by one of the OpenStack community insiders. Citrix is one of the founding members of OpenStack, with representation on the project policy board; the company has been quite active evangelizing the community through various events and code contributions. So why, all of a sudden, make a move that may appear to undermine the OpenStack momentum?
Let’s take a look at the history. When Citrix bought Cloud.com for more than $200 million in July 2011, insider information suggested the company had revenue of only a few million. While high valuations were not uncommon in the cloud space, a 40x revenue multiple is quite unusual. Why did Citrix do it? The only answer that comes to mind is that it wanted to quickly gain credibility in the cloud market.
I believe that corporate politics and relationships also played a role in this deal. Cloud.com was backed by Redpoint Ventures, which had an existing track record of selling its portfolio companies to Citrix. But, more importantly, Cloud.com founder and CEO Sheng Liang was also the founder and CTO of Teros Networks, a Web security company that was acquired by the very same Citrix just a few years before Cloud.com was founded. In fact, I am pretty sure that, in some sense, Cloud.com was Citrix’s skunk works project; acquisition by Citrix was a key part of the Cloud.com business plan. While there is nothing wrong with the approach, and I can only compliment the strategy, the early connection between Citrix and Cloud.com was key to its successful exit and the events that followed.
Just one year before the acquisition of Cloud.com, OpenStack was announced at OSCON and nobody knew what to think of it. It took the open source community by storm, and it soon became evident to all those competing for open cloud dominance that simply ignoring the OpenStack phenomenon was not an option. “Open cloud strategy” soon became synonymous with “OpenStack strategy”. Citrix, itself a founding member of OpenStack, was in a bit of a tight spot. One choice was to abandon its Cloud.com project. Given the OpenStack momentum at the time, this would inevitably have translated into the swift death of Cloud.com and $17 million in losses for the VCs backing it. Alternatively, Citrix could go all in, acquire the Cloud.com community to boost its credibility in the open source cloud space, and take a stab at creating the dominant distribution of OpenStack, ultimately becoming to OpenStack what Red Hat has become to Linux. In the end, the scales tipped towards the latter option. In May 2011, Citrix announced its distribution of OpenStack: Project Olympus. Two months thereafter, the Cloud.com acquisition was announced.
However, when the dust settled, it became evident that Citrix’s involvement with Cloud.com and OpenStack (Project Olympus), instead of being complementary as Citrix had anticipated, was perceived as strange and surprising. CloudStack is Java based, whereas OpenStack is all Python. On the compute side, CloudStack focused on Xen, whereas the dominant hypervisor for OpenStack so far has been KVM. CloudStack was licensed under the GPL, and OpenStack under Apache 2.0. Ultimately, Citrix’s Cloud.com acquisition sent confusing messages to both communities and to Citrix’s customer base. A few months after the acquisition, the Cloud.com community had little momentum left. At the same time, the OpenStack community remained wary of Citrix due to its involvement with CloudStack. Consequently, not much happened with Project Olympus between its announcement over a year ago and its official abandonment with the latest announcement.
Today, Citrix announced that Cloud.com will find a new home with the Apache Foundation. Is it a hostile move that will undermine OpenStack? I see it more as an act of desperation. Clearly, that wasn’t the initial plan when Citrix first acquired Cloud.com. Citrix failed to build a community around Cloud.com, miscalculated the synergies between the two communities, got trumped by OpenStack’s momentum, and dumped what’s left of Cloud.com on the Apache Foundation. Citrix has already announced twice before that CloudStack would be open source, yet has received no outside contributions to date. The last commit to Cloud.com on GitHub by a non-Citrix employee is dated several months ago.
At this point, Citrix has a spotty history when it comes to open source. Open source is built on trust, and Citrix is hard to trust right now. Having burned bridges in its last two communities (Xen / Linux) and now OpenStack, it is going to be a big challenge for the company to revive CloudStack from its present semi-dead state.
Tuesday, February 14, 2012
Under the hood of Swift. The Ring
There are three types of entities that Swift recognizes: accounts, containers and objects. Each type has a ring of its own, but all three rings are put together the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for these tasks: RingBuilder and Ring, respectively.
Ring data structure
Each of the three rings in Swift is a structure that consists of three elements:
- a list of devices in the cluster, also known as devs in the Ring class;
- a list of lists of device ids indicating partition-to-data assignments, stored in the variable named _replica2part2dev_id;
- an integer number of bits to shift an MD5-hashed path to the account/container/object to calculate the partition index for the hash (the partition shift value, part_shift).
List of devices
A list of devices includes all storage devices (disks) known to the ring. Each element of this list is a dictionary of the following structure:

Key | Type | Value |
---|---|---|
id | integer | Index of the device in the devices list |
zone | integer | Zone the device resides in |
weight | float | The relative weight of the device compared to the other devices in the ring |
ip | string | IP address of the server containing the device |
port | integer | TCP port the server uses to serve requests for the device |
device | string | Disk name of the device in the host system, e.g. sda1. It is used to identify the disk mount point under /srv/node on the host system |
meta | string | General-use field for storing arbitrary information about the device. Not used by servers directly |
Partitions assignment list
This data structure is a list of N elements, where N is the replica count for the cluster. The default replica count is 3. Each element of the partitions assignment list is an array('H'), Python's compact and efficient array of short unsigned integer values. These values are indexes into the list of devices (see the previous section). So, each array('H') in the partitions assignment list represents a mapping of partitions to device IDs.

The ring takes a configurable number of bits from a path's MD5 hash and converts it into an integer. This number is used as an index into the array('H'); the array element it points to designates the ID of the device to which the partition is mapped. The number of bits kept from the hash is known as the partition power, and 2 to the partition power is the partition count.
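A minimal Python sketch of this calculation; the partition power of 18 is an illustrative assumption, and the real Ring class additionally handles loading, caching and (in recent Swift versions) salting of the path:

    import hashlib
    import struct

    def get_partition(path, part_shift):
        # Unpack the first 4 bytes of the MD5 digest as a big-endian
        # unsigned integer, then drop the low bits with the partition shift.
        digest = hashlib.md5(path.encode('utf-8')).digest()
        return struct.unpack_from('>I', digest)[0] >> part_shift

    # with a partition power of 18, part_shift = 32 - 18 = 14
    partition = get_partition('/account/container/object', 14)
    # _replica2part2dev_id[replica][partition] then gives the id of the
    # device holding that replica of the partition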
For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.
Partition Shift Value
This is the number of bits taken from the MD5 hash of the '/account/[container/[object]]' path to calculate the partition index for the path. The partition index is calculated by translating the binary portion of the hash into an integer, as in the sketch shown earlier.

Ring operation
The structure described above is stored as a pickled (see Python pickle) and gzipped (see Python gzip.GzipFile) file. There are three files, one per ring, and usually their names are:

account.ring.gz
container.ring.gz
object.ring.gz
These files must exist in the /etc/swift directory on every Swift cluster node, both Proxy and Storage, as services on all these nodes use them to locate entities in the cluster. Moreover, the ring files on all nodes in the cluster must have the same contents, so the cluster can function properly.

There are no internal Swift mechanisms that can guarantee that the ring is consistent, i.e. that the gzip file is not corrupt and can be read. Swift services have no way to tell whether all nodes have the same version of the rings. Maintenance of ring files is the administrator's responsibility. These tasks can be automated by means external to Swift itself, of course.
The Ring allows any Swift service to identify which Storage node to query for a particular storage entity. The method Ring.get_nodes(account, container=None, obj=None) is used to identify the target Storage node for the given path (/account[/container[/object]]). It returns a tuple of the partition and a list of node dictionaries. The partition is used for constructing the local path to the object file or the account/container database. The node dictionaries have the same structure as the devices in the list of devices (see above).
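A short usage sketch; the ring file location matches the defaults described above, and the path components are illustrative:

    from swift.common.ring import Ring

    # load the object ring from its usual location
    ring = Ring('/etc/swift/object.ring.gz')

    # find the partition and the storage nodes responsible for an object
    partition, nodes = ring.get_nodes('account', 'container', 'object')
    for node in nodes:
        print('%s:%s/%s' % (node['ip'], node['port'], node['device']))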
Ring management
Swift services cannot change the Ring. The Ring is managed by the swift-ring-builder script. When a new Ring is created, the administrator first specifies the builder file and the main parameters of the Ring: the partition power (or partition shift value), the number of replicas of each partition in the cluster, and the time in hours before a specific partition can be moved in succession.

Once the temporary builder file structure is created, the administrator should add devices to the Ring. For each device, the required values are the zone number, the IP address of the Storage node, the port on which the server is listening, the device name (e.g. sdb1), optional device metadata (e.g., model name, installation date or anything else) and the device weight.
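A sketch of both steps for the account ring; the partition power, node address and weight are illustrative values:

    # create the builder file: 2^18 partitions, 3 replicas,
    # at least 1 hour between moves of a given partition
    swift-ring-builder account.builder create 18 3 1

    # add a device: zone 1, node 192.168.1.10, account server port 6002,
    # disk sdb1, weight 100
    swift-ring-builder account.builder add z1-192.168.1.10:6002/sdb1 100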
The device weight is used to distribute partitions between the devices. The greater the device weight, the more partitions are going to be assigned to that device. The recommended initial approach is to use devices of the same size across the cluster and assign a weight of 100.0 to each device. For devices added later, the weight should be proportional to the capacity. At this point, all devices that will initially be in the cluster should be added to the Ring. The consistency of the builder file can be verified before creating the actual Ring file:
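For example, for the account ring:

    swift-ring-builder account.builder validate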
In case of successful verification, the next step is to distribute partitions between devices and create the actual Ring file. This is called 'rebalancing' the Ring. The process is designed to move as few partitions as possible to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:
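Again for the account ring:

    swift-ring-builder account.builder rebalance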
The whole procedure must be repeated for all three rings: account, container and object. The resulting .ring.gz files should be pushed to all nodes in the cluster. The builder files are also needed for future changes to the rings, so they should be backed up and kept in a safe place. One approach is to store them in the Swift storage itself as ordinary objects.
Physical disk usage
A partition is essentially a block of data stored in the cluster. This does not mean, however, that disk usage is constant for all partitions. The distribution of objects between partitions is based on the hash of the object path, not the object size or other parameters. Objects are not partitioned, which means that an object is kept as a single file in the storage node file system (except for very large objects, greater than 5 GB, which can be uploaded in segments; see the Swift documentation).

The partition mapped to a storage device is actually a directory in the structure under /srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on the size of the objects that have been placed into the partition by mapping the hash of the object path to the Ring.
In conclusion, it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I'm going to write about how to solve these problems in one of the following posts.
More information
More information about the Swift Ring can be found in the following sources:

Official Swift documentation - the base source for the description of the data structure
Swift Ring source code on GitHub - the code base of the Ring and RingBuilder Swift classes
Blog of Chmouel Boudjnah - contains useful Swift hints
Monday, January 30, 2012
Introducing OpenStackAgent for Xen-based Clouds. What?
What it is all about
Not long ago we were working on the deployment of an OpenStack Cactus-based public cloud using Xen as the underlying hypervisor. One of the problems we faced was Windows guest instances failing to set their administrator passwords to the ones generated by nova on instance creation. As it turned out, the overall process of compute-to-guest communication in an OpenStack-Xen environment is rather tricky (see the illustration). One of the core components of the process is the so-called guest agent: a special user space service which runs within the guest OS and executes commands provided from outside.

Originally we used the guest agent implementation provided by Rackspace. One can find the source code both for *nix and Windows OS on the Launchpad page. Although the project seemed quite stable at the moment, the service built from the C# code, combined with the Cactus version of the nova plugin for Xen, was unable to set the password for Windows instances. Deep log analysis revealed the problem at the stage of cryptography engine initialization. It should be noted that the procedure of resetting the administrator's password is complex in itself. It starts with a Diffie-Hellman key exchange between compute and the guest agent. Next, the password is encrypted for the sake of security and sent to the agent via the public channel, i.e. the Xen Store.

Since the deadline was coming in several hours, we had no time to set up a proper environment for debugging, and therefore we decided to take a rather immature step which turned out to be a success afterwards. We hastily implemented our own guest agent service using the pywin32 library. Later on, it acquired several additional features, including an MSI installer, and grew into a separate project named OpenStackAgent. And now we would like to introduce it to the community.

Friday, September 23, 2011
What is this Keystone anyway?
The simplest way to authenticate a user is to ask for credentials (login+password, login+keys, etc.) and check them against some database. But when it comes to lots of separate services, as it is in the OpenStack world, we have to rethink that. The main problem is the inability to use one user entity to be authorized everywhere. For example, a user expects Nova to take one's credentials and create or fetch some images in Glance or set up networks in Quantum. This cannot be done without a central authentication and authorization system.
So now we have one more OpenStack project - Keystone. It is intended to incorporate all common information about users and their capabilities across the other services, along with a list of these services themselves. We have spent some time explaining to our friends what, why, and how it is, and now we have decided to blog about it. What follows is an explanation of every entity that drives Keystone’s life. Of course, this explanation can become outdated in no time, since the Keystone project is very young and is developing very fast.
The first basis is the user. Users are users; they represent someone or something that can gain access through Keystone. Users come with credentials that can be checked, such as passwords or API keys.
The second one is the tenant. It represents what is called the project in Nova, meaning something that aggregates a number of resources in each service. For example, a tenant can have some machines in Nova, a number of images in Swift/Glance, and a couple of networks in Quantum. Users are always bound to some tenant by default.
The third and last authorization-related kind of object is the role. It represents a group of users that is assumed to have some access to resources, e.g. some VMs in Nova and a number of images in Glance. Users can be added to any role either globally or in a tenant. In the first case, the user gains the access implied by the role to the resources in all tenants; in the second case, one's access is limited to the resources of the corresponding tenant. For example, a user can be an operator of all tenants and an admin of his own playground.
Now let’s talk about service discovery capabilities. With the first three primitives, any service (Nova, Glance, Swift) can check whether or not a user has access to resources. But to access some service in the tenant, the user has to know that the service exists and find a way to reach it. So the basic objects here are services. They are actually just some distinguished names. The roles we've just talked about can be not only general but also bound to a service. For example, when Swift requires administrator access to create some object, it should not require the user to have administrator access to Nova too. To achieve that, we should create two separate Admin roles: one bound to Swift and another bound to Nova. After that, admin access to Swift can be given to a user with no impact on Nova, and vice versa.
To access a service, we have to know its endpoint. So there are endpoint templates in Keystone that provide information about all existing endpoints of all existing services. One endpoint template provides a list of URLs to access an instance of a service. These URLs are the public, private and admin ones. The public one is intended to be accessible from the outside world (like http://compute.example.com), the private one can be used for access from a local network (like http://compute.example.local), and the admin one is used in case admin access to the service is separated from common access (as it is in Keystone).
Now we have the global list of services that exist in our farm and we can bind tenants to them. Every tenant can have its own list of service instances and this binding entity is named the endpoint, which “plugs” the tenant to one service instance. It makes it possible, for example, to have two tenants that share a common image store but use distinct compute servers.
This is a long list of entities that are involved in the process, but how does it actually work?
- To access some service, users provide their credentials to Keystone and receive a token. The token is just a string that is connected to the user and tenant internally by Keystone. This token travels between services with every user request, or with requests generated by one service to another to process the user's request.
- The users find the URL of the service they need. If the user, for example, wants to spawn a new VM instance in Nova, one can find the URL to Nova in the list of endpoints provided by Keystone and send an appropriate request.
- After that, Nova verifies the validity of the token in Keystone and should create an instance from some image by the provided image ID and plug it into some network.
- At first, Nova passes this token to Glance to get the image stored somewhere in there.
- After that, it asks Quantum to plug this new instance into a network; Quantum verifies whether the user has access to the network in its own database and to the interface of VM by requesting info in Nova.
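As an illustration of the first step of this flow, a token request might look like the sketch below. The host, credentials and tenant name are assumptions, and the exact payload shape varied between early Keystone revisions; this matches the later v2.0 form:

    curl -H "Content-Type: application/json" \
         -d '{"auth": {"passwordCredentials": {"username": "joe", "password": "secret"}, "tenantName": "playground"}}' \
         http://keystone.example.com:5000/v2.0/tokens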
Friday, September 16, 2011
Cloudpipe Image Creation Automation
The process of creating an image involves a lot of manual steps which crave to be automated. To simplify these steps, I wrote a simple script that uses some libvirt features to provide a fully automated solution, such that you don't even have to bother with preparing the base VM manually.
The solution can be found on GitHub and consists of 3 parts:
- The first, ubuntukickstart.sh, is the main part. Only this part should be executed. When you run it, it will configure the virtual network and PXE. Then it will start a new VM and install a minimal Ubuntu server by kickstart, so the installation is fully automated and unattended.
- The second, cloudpipeconf.sh, is used to turn the minimal Ubuntu server into a cloudpipe image. It is executed when the VM is ready for this conversion.
- The last, ssh.fs, is used to ssh into the VM and shut it down.
More detailed information about how it works can be found in the README file.
Don’t hesitate to leave a comment if you have any questions or concerns.
Friday, August 12, 2011
LDAP identity store for OpenStack Keystone
Now we have basic support for LDAP authentication in Keystone, which provides a subset of the functionality that was present in Nova. Currently, the main limitation is the inability to actually integrate with an existing LDAP tree due to limitations in the backend, but it works fine in an isolated corner of the LDAP tree.
So, after a long time of coding and fighting with new upstream workflows, we can give you a chance to try it out.
To do it, one should:
- Make sure that all necessary components are installed. They are Nova, Glance, Keystone and Dashboard.
Since the latter pair is still in incubation, you’ll have to download them from the source repository:
- Set up Nova to authorize requests in Keystone:
This assumes that you’re in the same dir where you’ve downloaded the Keystone sources. Replace the nova.conf path if it differs in your Nova installation.
- Add schema information to your LDAP installation.
This heavily depends on your LDAP server. There is a common .schema file and an .ldif for the latest version of OpenLDAP in the keystone/keystone/backends/ldap/ dir. For a local OpenLDAP installation, this will do the trick (if you haven’t changed the dir after the previous steps):
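A sketch, assuming a cn=config-style OpenLDAP and an .ldif named keystone.ldif in that directory (both are assumptions; adjust to your server):

    sudo ldapadd -Y EXTERNAL -H ldapi:/// -f keystone/keystone/backends/ldap/keystone.ldif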
- Modify Keystone configuration at keystone/etc/keystone.conf to use ldap backend:
- add keystone.backends.ldap to the backends list in [DEFAULT] section;
- remove Tenant, User, UserRoleAssociation and Token from the backend_entities list in [keystone.backends.sqlalchemy] section;
- add new section (don’t forget to change URL, user and password to match your installation):
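A sketch of such a section; the option names follow the LDAP backend of that era, and the values are assumptions:

    [keystone.backends.ldap]
    ldap_url = ldap://localhost
    ldap_user = cn=Admin,dc=example,dc=com
    ldap_password = secret
    backend_entities = ['Tenant', 'User', 'UserRoleAssociation', 'Token']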
- Make sure that the ou=Groups,dc=example,dc=com and ou=Users,dc=example,dc=com subtrees exist, or set the LDAP backend to use any other ones by adding tenant_tree_dn, role_tree_dn and user_tree_dn parameters to the [keystone.backends.ldap] section in the config file.
- Run Nova, Keystone and Dashboard as usual.
- Create some users, tenants, endpoints, etc. in Keystone by using keystone/bin/keystone-manage command or just run keystone/bin/sample-data.sh to add the test ones.
Now you can authenticate in Dashboard using the credentials of one of the created users. Note that from this point on, all user, project and role management should be done through Keystone, using either the keystone-manage command or the syspanel in Dashboard.
Thursday, May 19, 2011
Shared storage for OpenStack based on DRBD
We have several primary goals ahead. First, our storage has to be reliable. We want to survive both minor and major hardware crashes - from HDD failure to host power loss. Second, it must be flexible enough to slice it fast and easily, and to resize slices as we like. Third, we will manage and mount our storage from cloud nodes over the network. And, last but not least, we want decent performance from it.
For now, we have decided on the DRBD driver for our storage. DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network-based RAID-1. It has lots of features, has been tested and is reasonably stable.
DRBD has been supported by the Linux kernel since version 2.6.33. It is implemented as a kernel module and included in the mainline. We can install the DRBD driver and command line interface tools using a standard package distribution mechanism; in our case it is Fedora 14:
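On Fedora 14 the kernel module already ships with the mainline kernel, so only the userspace tools are required; a sketch (package naming per the Fedora repos of the time):

    yum install drbd-utils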
The DRBD configuration file is /etc/drbd.conf, but usually it contains only 'include' statements. The configuration itself resides in global_common.conf and *.res files inside /etc/drbd.d/. An important parameter in global_common.conf is 'protocol'. It defines the sync level of the replication:
- A (async). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has been placed in the local TCP send buffer. Data loss is possible in case of fail-over.
- B (semi-sync or memory-sync). Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Data loss is unlikely unless the primary node is irrevocably destroyed.
- C (sync). Local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. This is the default replication mode.
Other sections of the common configuration are usually left blank and can be redefined in per-resource configuration files. To create a usable resource, we must create a configuration file for our resource in /etc/drbd.d/drbd0.res. Basic parameters for the resource are:
- Name of the resource. Defined with the 'resource' parameter, which opens the main configuration section.
- The 'on' directive opens a host configuration section. Only 2 'on' host sections are allowed per resource. Parameters common to both hosts can be defined once in the main resource configuration section.
- The 'address' directive is unique to each host and must contain the IP address and port number on which the DRBD driver listens.
- The 'device' directive defines the path to the device created on the host for the DRBD resource.
- 'disk' is the path to the back-end device for the resource. This can be a hard drive partition (e.g. /dev/sda1), a soft- or hardware RAID device, an LVM Logical Volume or any other block device configured by the Linux device-mapper infrastructure.
- 'meta-disk' defines how DRBD stores meta-data. It can be 'internal', when meta-data resides on the same back-end device as user data, or 'external', on a separate device.
Configuration Walkthrough
We are creating a relatively simple configuration: one DRBD resource shared between two nodes. On each node, the back-end for the resource is the software RAID-0 (stripe) device /dev/md3, made of two disks. The hosts are connected back-to-back via Gigabit Ethernet interfaces with private addresses.
As we need write access to the resource on both nodes, we must make it 'primary' on both. A DRBD device in the primary role can be used unrestrictedly for read and write operations. This mode is called 'dual-primary' mode. Dual-primary mode requires additional configuration. In the 'startup' section, the 'become-primary-on' directive is set to 'both'. In the 'net' section, the following is recommended:
The 'allow-two-primaries' directive allows both ends to send data.
Next, three parameters define the automatic recovery policies for split-brain and error situations.
The 'sndbuf-size' is set to 0 to allow dynamic adjustment of the TCP buffer size.
Resource configuration with all of these considerations applied will be as follows:
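A sketch of the resulting /etc/drbd.d/drbd0.res; the hostnames, addresses and the split-brain recovery policies are assumptions based on a typical DRBD 8.3 dual-primary setup (the original listing was not preserved):

    resource drbd0 {
        protocol C;
        startup {
            become-primary-on both;
        }
        net {
            allow-two-primaries;
            # assumed recovery policies for split-brain situations
            after-sb-0pri discard-zero-changes;
            after-sb-1pri discard-secondary;
            after-sb-2pri disconnect;
            # allow dynamic adjustment of the TCP buffer size
            sndbuf-size 0;
        }
        on host1 {
            device    /dev/drbd0;
            disk      /dev/md3;
            address   192.168.10.1:7789;
            meta-disk internal;
        }
        on host2 {
            device    /dev/drbd0;
            disk      /dev/md3;
            address   192.168.10.2:7789;
            meta-disk internal;
        }
    }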
Enabling Resource For The First Time
To create the device /dev/drbd0 for later use, we use the drbdadm command:
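Presumably the metadata initialization step; a sketch with the resource name from the configuration above:

    drbdadm create-md drbd0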
After the front-end device is created, we bring the resource up:
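That is, on each node (resource name as configured above):

    drbdadm attach drbd0
    drbdadm syncer drbd0
    drbdadm connect drbd0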
This command set must be executed on both nodes. We may collapse the steps drbdadm attach, drbdadm syncer, and drbdadm connect into one by using the shorthand command drbdadm up.
Now we can observe the /proc/drbd virtual status file and get the status of our resource:
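For example:

    cat /proc/drbd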
We must now synchronize the resources on both nodes. If we want to replicate data that is already on one of the drives, it's important to run the next command on the host which contains the data. Otherwise, it can be issued on either of the two hosts.
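On host1 this looks like the following sketch (DRBD 8.3 syntax; the resource name matches the configuration above):

    drbdadm -- --overwrite-data-of-peer primary drbd0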
This command puts the node host1 in 'primary' mode and makes it the synchronization source. This is reflected in the status file /proc/drbd:
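On host1 the status line typically shows something like (abbreviated):

    0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent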
We can adjust the syncer rate to make the initial and background synchronization faster. To speed up the initial sync, the drbdsetup command is used:
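A sketch; the device path matches the resource above, and the rate is an assumption sized for Gigabit Ethernet:

    drbdsetup /dev/drbd0 syncer -r 110M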
This allows us to consume almost all bandwidth of Gigabit Ethernet. The background syncer rate is configured in the corresponding config file section:
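This goes in the resource's 'syncer' section; the value is an assumption following the rule of thumb described next:

    syncer {
        rate 33M;
    }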
The exact rate depends on the available bandwidth and should be about 0.3 of the throughput of the slowest I/O subsystem (network or disk). DRBD seems to slow synchronization down if it interferes with the data flow.
LVM Over DRBD Configuration
Configuration of LVM over DRBD requires changes to /etc/lvm/lvm.conf. First, a physical volume is created:
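For example:

    pvcreate /dev/drbd0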
This command writes LVM Physical Volume data on the drbd0 device and also on the underlying md3 device. This can pose a problem, as LVM's default behavior is to scan all block devices for LVM PV signatures. This means two devices with the same UUID would be detected and an error issued. This can be avoided by excluding /dev/md3 from scanning in the /etc/lvm/lvm.conf file by using the 'filter' parameter:
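A sketch of a filter for this setup; it accepts DRBD devices and rejects the backing md3 device (the exact patterns depend on your other block devices):

    filter = [ "a|/dev/drbd.*|", "r|/dev/md3|" ]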
The vgscan command must be executed after the file is changed. It forces LVM to discard its configuration cache and re-scan the devices for PV signatures.
Different 'filter' configurations can be used, but they must ensure that: 1. DRBD devices used as PVs are accepted (included); 2. The corresponding lower-level devices are rejected (excluded).
It is also necessary to disable the LVM write cache:
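Also in /etc/lvm/lvm.conf:

    # disable the write cache
    write_cache_state = 0
    # after changing this, the stale cache file /etc/lvm/cache/.cache
    # should also be removed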
These steps must be repeated on the peer node. Now we can create a Volume Group using the configured PV /dev/drbd0, and a Logical Volume in this VG. Execute these commands on one of the nodes:
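A sketch; the VG/LV names and the size are assumptions:

    vgcreate vg_drbd /dev/drbd0
    lvcreate --name lv_nova --size 100G vg_drbd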
To make use of this VG and LV on the peer node, we must make it active on it:
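With the assumed VG name from above:

    vgscan
    vgchange -a y vg_drbd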
When the new PV is configured, it is possible to proceed to adding it to the Volume Group or creating a new one from it. This VG can be used to create Logical Volumes as usual.
Conclusion
We are going to install OpenStack on nodes with shared storage as a private cloud controller. The architecture of our system presumes that storage volumes will reside on the same nodes as nova-compute. This makes it very important to have some level of disaster survival on the cloud nodes.
With DRBD we can survive any I/O errors on one of the nodes. DRBD's internal error handling can be configured to mask any errors and go to diskless mode. In this mode, all I/O operations are transparently redirected from the failed node to the replica. This gives us time to restore the faulty disk system.
If we have a major system crash, we still have all of the data on the second node. We can use it to restore or replace the failed system. A network failure can put us into a 'split brain' situation, where data differs between the hosts. This is dangerous, but DRBD also has rather powerful mechanisms to deal with these kinds of problems.
Wednesday, May 18, 2011
OpenStack Deployment on Fedora using Kickstart
In this article, we discuss our approach to performing an OpenStack installation on Fedora using our RPM repository and Kickstart. When we first started working with OpenStack, we found that the most popular platform for deploying OpenStack was Ubuntu, which seemed like a viable option for us, as there are packages available for it, as well as plenty of documentation. However, because our internal infrastructure runs on Fedora, instead of migrating the full infrastructure to Ubuntu, we decided to make OpenStack Fedora-friendly. The challenge in using Fedora, however, is that there aren't any packages, nor is there much documentation available. Details of how we worked around these limitations are discussed below.
OpenStack RPM Repository
Of course, installing everything from sources and bypassing the system's package manager is always an option, but this approach has some limitations:
- OpenStack has a lot of dependencies, so it's hard to track them all
- Installations that bypass the system's package manager take quite some time (compared to executing a single Yum installation)
- When some packages are installed from repositories, and some are installed from sources, managing upgrades can become quite tricky
Because of these limitations, we decided to create RPMs for Fedora. In order to avoid reinventing the wheel, we've based these RPMs on the RHEL6 OpenStack packages, as RHEL6 and Fedora are fairly similar. There are two sets of packages available, for different OpenStack versions.
There are two key metapackages:
- node-full: installs a complete cloud controller infrastructure, including RabbitMQ, dnsmasq, etc.
- node-compute: installs only the node-compute services
To use the repository, just install the RPM:
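The repo-release RPM URL from the original post is not preserved, so the URL below is a placeholder; the metapackage names are the ones introduced above:

    # install the release package that sets up the Yum repository
    rpm -Uvh http://REPO-HOST/openstack-repo-release.noarch.rpm
    # then, for a full cloud controller:
    yum install node-full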
In addition to installing everything with a single "yum install" command, we also need to perform the configuration. For a bare metal installation, we've created a Kickstart script. Kickstart by itself is a set of answers for the automated installation of a Fedora distribution. We use it for automated host provisioning with PXE. The post-installation part of the Kickstart script was extended to include the OpenStack installation and configuration procedures.
Cloud Controller
To begin with, you can find the post-installation part of the Kickstart file for deploying a cloud controller below.
There are basic settings you will need to change. In our case, we are using a MySQL database.
Your server must be accessible by hostname, because RabbitMQ uses "node@host" identification. Also, because OpenStack uses hostnames to register services, if you want to change the hostname, you must stop all nova services and RabbitMQ, and then start them again after making the change. So make sure you set a resolvable hostname.
Add required repos and install the cloud controller.
qemu 0.14+ is needed to support creating custom images.
(UPD: Fedora 15 release already has qemu 0.14.0 in repository)
If you're running nova under a non-privileged user ("nova" in this case), the libvirt configs should be changed to provide access to the libvirtd unix socket for nova services. Access over TCP is required for live migration, so all of our nodes should have read/write access to the TCP socket.
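A sketch of the relevant /etc/libvirt/libvirtd.conf settings; the group name and the permissive TCP auth are assumptions for a trusted private network:

    # allow the "nova" group to use the read-write unix socket
    unix_sock_group = "nova"
    unix_sock_rw_perms = "0770"
    # expose the TCP socket for live migration (trusted network assumed)
    listen_tcp = 1
    auth_tcp = "none"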
Now we can apply our db credentials to the nova config and generate the root certificate.
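A sketch of the relevant entries; Cactus-era nova.conf used flag-style lines, and the credentials here are assumptions:

    # /etc/nova/nova.conf
    --sql_connection=mysql://nova:nova@localhost/nova
    --rabbit_host=localhost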
And finally, we add the services to "autostart", prepare the database, and run the migration. Don't forget to set up the root password for the MySQL server.
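A sketch of these steps; the exact list of services depends on the metapackage (the service names are assumptions):

    # enable the nova services at boot
    for svc in api scheduler network compute; do
        chkconfig openstack-nova-$svc on
    done
    # create the nova database schema
    nova-manage db sync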
Compute Node
The Compute Node script is much simpler:

The config section differs very little; there is a cloud controller IP variable, which points to the full nova infrastructure and other supporting services, such as MySQL and rabbit.
The code is very similar to the cloud controller script, except that it installs the openstack-nova-node-compute package instead of node-full.
You must change the Cloud Controller IP address (the CC_IP variable) for a Compute Node installation.
IMPORTANT NOTE: All of your compute nodes should have their time synchronized with the cloud controller for heartbeat control.