When Kyle's colocated KVM instance won't boot, he has to pull out every troubleshooting trick to get his most valuable server back on-line.
New Linux users often ask me “what is the best way to learn about Linux?” My advice always comes down to this: install and use Linux (any distribution will do, although something stable works better), and play around with it. Inevitably, you will break something, and then instead of re-installing, force yourself to fix what you broke. That's my advice, because I've personally learned more about Linux by fixing my own problems than just about any other way. After years of doing this, you start to build confidence in your Linux troubleshooting skills, so that no matter what problem comes your way, you figure if you work at it long enough, you can solve it.
That confidence was put to the test recently when I had a problem with a KVM host. After a power outage, it refused to boot a virtual machine that was my primary personal server for just about everything. In this article, I walk through a problem that almost had me stumped and show how I was able to find a solution in an unorthodox place (at least for me).
Before I dive too deep into my problem, it would help to understand my setup. Although I do have servers at home, my primary server is colocated in a data center. I share the server with a friend, so the physical server simply acts as a secured KVM host, and I split the server's RAM and CPU across two virtual machines 50/50. All of my most important services, from primary DNS and e-mail for me and my immediate family, to a number of different Web sites and blogs, and even my main Irssi session, sit on one of those two VMs. I end up hosting secondary DNS and e-mail from a server on my home connection, but due to a one-megabit upstream connection, I don't host much else at home for the outside world.
One day (while a relative happened to be visiting from out of town), I noticed that both my main server and the physical server that was hosting it were unavailable. I notified my contact at the data center, and it ended up being an accidental power outage that affected my cabinet. I was taking my relative out to the coast for the day, far away from decent cell-phone reception. So, since there wasn't much I could do, I assumed that long before I got back into town that afternoon, power would be restored, and other than losing over a year's uptime, I would be back up and running.
The first time I knew there was a real problem was when I got back into town and my main server still was down. I could log in to the physical host, however, so at first I wasn't too worried. After all, I had seen KVM instances fail to recover from a physical host reboot before. In the past, it was either from not setting a VM to start at boot or sometimes even a wayward libvirt AppArmor profile that got in the way. Usually once I logged in to the physical host, I could change any bad settings, disable any troublesome AppArmor profile, then manually launch my VM with virsh. This time was different.
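For reference, the usual recovery in that situation takes only a couple of virsh commands on the physical host (www.example.net here stands in for the VM's actual name):

$ virsh list --all                 # see whether the VM is defined but shut off
$ virsh autostart www.example.net  # make sure it starts at host boot next time
$ virsh start www.example.net      # launch it by hand right now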
When my VM wouldn't boot manually, I was ready to blame AppArmor. It had blocked VMs from booting in the past, but this time, neither setting the libvirtd AppArmor profile to complain mode, nor disabling every AppArmor profile, nor even forcefully stopping AppArmor seemed to help. I even resorted to rebooting the physical host to heed AppArmor's warning that forcibly stopping it while profiles are loaded may cause some of them to misbehave. Nothing helped. When I connected a console to the VM as it booted, I started seeing initial kernel errors as though it were having trouble mounting the root filesystem. Great. Did the power outage corrupt my data?
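For the record, those AppArmor steps amount to something like the following (this assumes the apparmor-utils package is installed; the profile name is the one libvirt ships on my Ubuntu hosts):

$ sudo aa-status                                      # list loaded profiles and their modes
$ sudo aa-complain /etc/apparmor.d/usr.sbin.libvirtd  # switch the libvirtd profile to complain mode
$ sudo service apparmor teardown                      # forcibly unload every profile (last resort)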
The next step in the troubleshooting process was to attempt to boot from a rescue disk. With KVM, it's relatively easy to add a local ISO image as though it were a CD-ROM. So after not much effort, I discovered I could, in fact, boot a rescue disk and confirmed from the rescue disk I could mount my VM's drives, and the data did not seem corrupted. So then why wouldn't it boot? After I ran a manual fsck from the rescue disk, I attempted to reload GRUB, and that was when I got my first strange clue about the nature of the problem—even from the rescue disk, I wasn't able to write to the filesystem reliably. I would get virtual ATA resets, even though I seemed to be able to read fine.
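In case you've never booted a KVM instance from a rescue ISO, one way to do it with virsh looks roughly like this (the VM name, ISO path and target device are examples, not my exact setup):

$ virsh attach-disk www.example.net /var/lib/libvirt/images/rescue.iso hdc \
    --type cdrom --mode readonly
$ virsh edit www.example.net   # for a powered-off VM, add the cdrom device here
                               # and set <boot dev='cdrom'/> in the <os> section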
So, I assumed I had some level of corruption with that particular VM, but because my data wasn't affected, I figured in the worst case, I could spawn a fresh VM and migrate the data over. So, that's what I tried next using the ubuntu-vm-builder wrapper script I used previously to build my VM. The VM seemed to spawn fine; however, once again, even this brand-new VM refused to boot properly and had the same strange disk errors.
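For anyone who hasn't used it, ubuntu-vm-builder (invoked as vmbuilder on newer releases) builds a complete KVM guest from a single command. The invocation below is a rough sketch with placeholder options, not my exact command:

$ sudo vmbuilder kvm ubuntu --suite=lucid --flavour=virtual --arch=amd64 \
    --hostname=test1.example.net --mem=512 --libvirt=qemu:///system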
It was at this point that my troubleshooting steps started to get a bit hazy, because I started trying more desperate things. I booted different kernel versions in GRUB (after all, the kernel had been updated a few times in the year the server had been up). I audited all of the filesystem permissions on my VM disk images, and I tried to launch the VMs as root just in case. I even tried converting one VM's disks from qcow2 to raw with no results. Even Web searches came up empty. This server had been down longer than it ever had before, and I was starting to run out of options.
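For the curious, the qcow2-to-raw conversion itself is a one-liner with qemu-img (the filenames here are placeholders):

$ qemu-img convert -f qcow2 -O raw disk0.qcow2 disk0-raw.img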
My first break came when I decided to copy the VM I had just spawned over to almost identical hardware I had at home running the same distribution and see if I could reproduce the problem there. I picked the new VM simply because qcow2 disk images grow on demand, so it happened to have the smallest disks and was the fastest to sync over. The process was pretty straightforward. First, I exported that KVM instance's configuration XML file with virsh on the colocated host:
$ virsh dumpxml test1.example.net > test1.example.net.xml
Then I copied that XML file to my home server, created a local directory named after this VM to store its disk images and synced them over from the physical host:
$ mkdir test1.example.net
$ rsync -avx --progress remotehost:/var/lib/libvirt/images/test1.example.net/ \
    test1.example.net/
Once the disk images were copied, I had to edit the test1.example.net.xml file, because the disk images now were stored in a new location. After I did that, I used virsh again to import this XML configuration file and start the VM:
$ virsh define test1.example.net.xml
$ virsh start test1.example.net
The VM actually started! Although I still had no idea what the problem was on the colocated server, I felt pretty confident that if I could sync over my main server, it would run on this home machine. Of course, with a 12Mb-down, 1Mb-up connection at home, it was going to take a bit longer to copy the 45GB of disk images for this VM. Other than the time it took, the process was essentially the same as with the test machine, except that once the VM booted, I had to change its network configuration to reflect its new public IP.
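On an Ubuntu 10.04 guest, that change is just an edit to /etc/network/interfaces followed by a networking restart; something like the following, with made-up addresses:

# /etc/network/interfaces (addresses are placeholders)
auto eth0
iface eth0 inet static
    address 203.0.113.10
    netmask 255.255.255.0
    gateway 203.0.113.1

$ sudo /etc/init.d/networking restart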
With my server back up and running, I just had to change a number of DNS entries and firewall rules to reflect the new IP, and even with my slower upstream connection at home, I at least had some breathing room to troubleshoot the problem on the colocated server.
Now that my VM and its data were safe and services were restored (if a bit slow), I felt free to perform more drastic steps on my colocated server. The first step was trying to figure out what was so different about it compared to my home server. They had the same Ubuntu 10.04 server install and most of the same packages. Luckily, I had a number of old cached libvirt and KVM packages on my home server, so at first I iterated through all of those packages to see if the problem was due to some upgrade. Once I exhausted that, I tried different kernel versions on the physical host and still no results.
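Each of those downgrade tests boiled down to stopping libvirt, installing the older cached .deb files directly with dpkg and retrying the VM; roughly like this (the package filenames are illustrative):

$ sudo /etc/init.d/libvirt-bin stop
$ sudo dpkg -i qemu-kvm_*.deb libvirt-bin_*.deb
$ sudo /etc/init.d/libvirt-bin start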
Believe me when I tell you that during that week I tried every troubleshooting measure I could think of before I finally went to the second-to-last resort. The fact that I was even considering this should tell you how desperate I was getting. The last resort would be to do a complete re-install from scratch—something I wasn't ready to do yet. I was desperate enough though that I went with the second-to-last resort: an in-place distribution upgrade from 10.04 to 12.04. Once the dust settled, I tried my small test image, and it actually worked. We were back in business.
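If you've never done an in-place Ubuntu upgrade, the LTS-to-LTS path is essentially one command (this assumes /etc/update-manager/release-upgrades is set to Prompt=lts, the default on LTS releases):

$ sudo apt-get install update-manager-core
$ sudo do-release-upgrade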
Well, we were almost back in business. See, I had been using that server at home for a number of days now, and between the e-mail, blogs and other services, it had a lot of new data on it. This meant I couldn't just start up the image that was already on the colocated server. I had to sync up the changes from my home server.
The real trick to this was that I couldn't just sync the server hot. For one, the disk would be changing all the time, and two, I didn't want to risk having the same server running in weird states on two different physical hosts. This meant syncing the actual disk images. The problem was that while the 45GB disk images synced to my house relatively quickly over my 12Mb-downstream (plus the server was already down at the time, so downtime wasn't a consideration), syncing the same data up with my 1Mb upstream was going to take a long time—too long for a pure cold sync to be a solution, as I just couldn't have that much downtime.
The solution here was going to be two-fold, and it was based on a few assumptions I could make:
Although a fair number of files had changed on my local VM instance, the actual size of the change was relatively small compared to the size of the disk images.
rsync has an excellent mechanism for syncing over only the parts of large files that have changed.
A lot of the changes in my qcow2 files were likely going to be at the end of those files anyway.
If I use rsync with the --inplace option, it will modify the existing disk image on the remote machine directly and save disk space and time.
So my plan for phase 1 was to run rsync from physical host to physical host and sync over the qcow2 disk images hot while the VM was running and tell rsync to sync the disk images in place. Because I could assume the remote images would be somewhat corrupted anyway (that's the downside of syncing a disk image while the disk is being used), I didn't have to care about --inplace leaving behind a potentially corrupted file if it were stopped midway through the sync. I could clean it up later.
The advantage of doing the phase 1 rsync hot was that I could get all of the main differences between the home and colocated images sorted out while the server was still running at home. I even could potentially run that rsync multiple times leading up to phase 2 just to make sure it was as up to date as it could be. Here are the rsync commands I used to perform the phase 1 hot sync:
$ rsync -avz --progress --inplace disk0.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk0.qcow2
$ rsync -avz --progress --inplace disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2
Between rsync's syncing only the bits that changed and the fact that I used -z to compress the data before it was transferred, I was able to sync these files way faster than you would think possible on a 1Mb connection. Of course, these commands ended up saturating my bandwidth at home, so since I wasn't under time pressure for the hot sync to complete, I ended up setting a bandwidth limit of 10 kilobytes per second for the larger disk1.qcow2 image:
$ rsync -avz --progress --inplace --bwlimit=10 disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2
Once phase 1 was complete, I could start with phase 2. I needed the phase 2 rsync to run while the VM was powered off so I could make sure the disk wasn't being written to during the sync. Otherwise, I would risk corruption on the filesystem. Because this required downtime, I picked a proper maintenance window for my server when it would be less busy, finished a final phase 1 hot sync a few hours before, then halted the VM cleanly before I performed the final syncs:
$ rsync -avz --progress --inplace disk0.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk0.qcow2
$ rsync -avz --progress --inplace disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2
Because of the previous work of syncing up the disk images, the final cold sync took only an hour or two, with most of that time spent by rsync seeking through the local and remote images to confirm they were in sync. Once the commands completed, I was able to power up the server again on my colocated host, change its IPs back, and I was back in business.