Modern Middle Manager
Primarily my musings on the practical application of technology and management principles at a financial services company.
Continuous Availability -- Well, Almost

Friday, July 11, 2003  

What can you do with a little VMware ESX Server, Perl scripting and Firewire drives?

Though we are a small IT shop, we like to dream big. Dreams like continuous availability come to mind. Naturally, with our size and the cost it would take to actually create "continuous availability," we aren't going to get it. Maybe, just maybe, there are baby steps we can take. Well, it turns out there might be.

Behind The Technology

Our latest project looks at migrating our VMware GSX virtual servers (about 90% of our existing server base; the rest are Dell 1655MC blades) to VMware's ESX Server platform. Why? It turns out that ESX lets us take snapshots of our virtual servers without interrupting operations, much as Microsoft claims Windows 2003 does with its Volume Shadow Copy Service. The snapshots rely on a combination of VMware's disk logging and its Perl API.

ESX Server virtual disks come in several flavors; we're only concerned with two of them here: persistent and undoable. Writes to a persistent disk go straight to the disk. An undoable disk works like a database: it consists of the main disk and a transaction log, called the redo disk, which collects writes until they are committed to the main disk.
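To make the difference between the two modes concrete, here is a minimal toy model in Python. It is only an illustration of the idea: the class `VirtualDisk` and its methods are hypothetical names invented for this sketch, and none of it is VMware's actual disk format or Perl API.

```python
class VirtualDisk:
    """Toy model of a virtual disk: maps block numbers to bytes.

    Hypothetical illustration only; not VMware's implementation.
    """

    def __init__(self, mode="persistent"):
        if mode not in ("persistent", "undoable"):
            raise ValueError("mode must be 'persistent' or 'undoable'")
        self.mode = mode
        self.base = {}   # the main disk
        self.redo = {}   # pending writes, used only in undoable mode

    def write(self, block, data):
        # Persistent: the write lands on the main disk immediately.
        # Undoable: the write is held in the redo log until committed.
        if self.mode == "persistent":
            self.base[block] = data
        else:
            self.redo[block] = data

    def read(self, block):
        # The guest always sees the newest data, so check the redo log first.
        if block in self.redo:
            return self.redo[block]
        return self.base.get(block)

    def commit(self):
        # Fold the redo log into the main disk, like committing a transaction.
        self.base.update(self.redo)
        self.redo.clear()

    def discard(self):
        # Throw the pending writes away, like rolling a transaction back.
        self.redo.clear()


disk = VirtualDisk(mode="undoable")
disk.write(0, b"boot")
disk.commit()            # b"boot" is now on the main disk
disk.write(0, b"oops")   # held in the redo log only
disk.discard()           # main disk still holds b"boot"
```

The database comparison shows up directly in the model: `commit` and `discard` play the roles of committing and rolling back a transaction.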

The key to performing a snapshot is adding a redo disk to a persistent disk using the Perl API provided by VMware. The process is something like this: take a VM, add a redo disk to every persistent disk it uses, back up the persistent disk (which stays unchanged while new writes go to the redo disk), then commit and delete the redo disk. Caveat: the virtual machine freezes briefly while the redo disk is committed to the persistent disk. We aren't 24x7, so that works OK for us; however, there is apparently an advanced method that uses two redo disks to avoid this kind of interruption. We'll explore that in version 2.
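The sequence above can be sketched with a small simulation. This is not the Perl script itself and does not call the VMware Perl API; `PersistentDisk`, `add_redo`, `commit_redo` and `snapshot` are made-up names for a toy model, and the "guest writes" are simulated by a list passed in by the caller.

```python
class PersistentDisk:
    """Toy persistent disk that can temporarily take a redo log.

    Hypothetical model for illustration; not the VMware Perl API.
    """

    def __init__(self, blocks):
        self.base = dict(blocks)
        self.redo = None  # None means no redo log is attached

    def add_redo(self):
        if self.redo is not None:
            raise RuntimeError("a redo log is already attached")
        self.redo = {}

    def write(self, block, data):
        # With a redo log attached, the main disk is left untouched.
        target = self.base if self.redo is None else self.redo
        target[block] = data

    def read(self, block):
        if self.redo is not None and block in self.redo:
            return self.redo[block]
        return self.base.get(block)

    def commit_redo(self):
        # Apply the logged writes to the main disk, then drop the log.
        self.base.update(self.redo)
        self.redo = None


def snapshot(disk, writes_during_backup=()):
    """Back up `disk` while the (simulated) guest keeps writing."""
    disk.add_redo()                      # step 1: divert new writes
    try:
        for block, data in writes_during_backup:
            disk.write(block, data)      # simulated guest activity
        backup = dict(disk.base)         # step 2: copy the frozen main disk
    finally:
        disk.commit_redo()               # step 3: commit and delete the log
    return backup


disk = PersistentDisk({0: b"a", 1: b"b"})
backup = snapshot(disk, writes_during_backup=[(1, b"B"), (2, b"c")])
```

After the call, `backup` holds the disk as it was when the redo log was added, while `disk` contains the writes made during the backup. The `finally` clause reflects one design choice worth copying into a real script: the redo log should be committed even if the copy fails, so the VM is never left running on a stray log.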

Our Goal
With the script done, it's time to look at the entire process. The strategy is to create online backups overnight, once per week. A Linux server will run the script on the weekend, gather all of the virtual server disks and push them to a Firewire drive. That Firewire drive will be placed into a regular rotation to our recovery site. With our critical data asynchronously mirrored to the recovery site filer and the virtual servers updated regularly on the Firewire drive, we should achieve our goal -- a reasonable recovery time objective.
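The "push them to a Firewire drive" step could look something like the following Python sketch. It assumes the drive is already mounted as an ordinary directory; the function names and any paths are hypothetical, and the checksum comparison is an extra safeguard added here, not something the process above specifies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def push_to_drive(disk_files, drive_dir):
    """Copy each backed-up disk file to the drive and verify the copy.

    Returns a dict of file name -> SHA-256 of the copy on the drive.
    """
    drive_dir = Path(drive_dir)
    drive_dir.mkdir(parents=True, exist_ok=True)
    copied = {}
    for src in map(Path, disk_files):
        dest = drive_dir / src.name
        shutil.copy2(src, dest)          # copy data and timestamps
        src_sum, dest_sum = sha256_of(src), sha256_of(dest)
        if src_sum != dest_sum:
            raise IOError(f"checksum mismatch for {src.name}")
        copied[src.name] = dest_sum
    return copied


# Hypothetical usage; both paths are placeholders:
# push_to_drive(["/backups/server1.dsk"], "/mnt/firewire/weekly")
```

Keeping the returned checksums alongside the drive gives the recovery site a way to confirm the files survived the trip.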

The Next Iteration
Looking ahead, we can start performing backups once per day. Most of our data is on a NetApp filer, so we're already getting hourly snapshots of critical data (databases, Exchange information stores, web sites and user files). The next step is to take daily overnight backups and stage them on a filer volume at the data center that in turn takes daily and weekly snapshots. Those server disks will be placed on tape regularly and stored offsite. This method should allow us to roll back any server to a prior point in time quickly.
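Rolling a server back to "a prior point in time" comes down to picking the newest backup taken at or before the moment you want. A minimal sketch of that selection, assuming the backup times are available as a list of `datetime` values (the function name `pick_backup` and the sample dates are invented for illustration):

```python
from bisect import bisect_right
from datetime import datetime


def pick_backup(backup_times, target):
    """Return the latest backup time at or before `target`, or None."""
    times = sorted(backup_times)
    idx = bisect_right(times, target)   # number of backups taken <= target
    return times[idx - 1] if idx else None


# Hypothetical nightly backups at 1 a.m.
nightly = [
    datetime(2003, 7, 5, 1, 0),
    datetime(2003, 7, 6, 1, 0),
    datetime(2003, 7, 7, 1, 0),
]
chosen = pick_backup(nightly, datetime(2003, 7, 6, 12, 0))
```

Here `chosen` is the July 6 backup, the last one taken before noon on July 6. With daily backups, the worst case is losing a day of server configuration changes; the data itself is covered separately by the hourly filer snapshots.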

posted by Henry Jenkins | 7/11/2003 04:25:00 PM
