I've been using Xenserver for a couple of years now, after VMWare made using ESXi 5.5 ridiculous with a mandatory paid upgrade for the privilege of using version 10 VMs and a web UI. Overall, I've made Xenserver angry a few times, but the extensive documentation on the xe CLI has never failed to help me repair and correct my gaffes.
For my home lab, I keep a wonderful little Intel NUC with a 500GB HGST spinner. It is quiet, power efficient, and with 16GB of memory crammed in I can host all the VMs I require for my personal use and research. There are numerous HOWTOs on the web about this, so I won't duplicate good work on getting one of these set up. What I will talk about in this article is the utter failboat I sailed in trying to get Xenserver 7 installed.
Set sail for Fail
So, my current installed Xenserver version is 6.5, all patched up. I decided that I would go the route of an in-place upgrade, rather than a clean install. I wanted to see if it would work (it sort-of did!), and I wanted to not have to piece my VM disk assignments back together with the storage repositories.
I wrote the installation ISO to a USB drive with Rufus in Windows, updated my XenCenter version to 7.0.1, and booted the NUC from the install media. After a couple of questions it asked me if I wanted to do an upgrade, or a clean install. I chose upgrade, it backed up the 6.5 install, did its thing, and rebooted.
This is where everything went sideways. After booting successfully, the xsconsole showed that no management network was configured. I've dealt with these sorts of errors before, and they usually have to do with the node being part of a pool, missing the pool master, and thus getting a bit peeved. However, my NUC isn't in a pool, and never has been, so I was a little confused about what I was seeing.
I dropped into the shell and tried to issue some xe commands, and got back a strange "connection refused" error.
If you've never used Xenserver, xe is a CLI to the XAPI servicing stack. It's how the 'dom0' (control domain, basically a privileged VM for management) communicates with the xen hypervisor and does...everything really. And it was broken.
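For anyone who wants to script that kind of "is XAPI even answering?" check, here is a minimal Python sketch. It runs a command, captures whatever comes back, and reports success or failure instead of raising. The helper name `probe_cli` and its `timeout_s` parameter are made up for illustration; the only Xenserver-specific piece is the command you hand it, for example `["xe", "host-list"]`.

```python
import subprocess
from typing import List, Tuple


def probe_cli(command: List[str], timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run a command and return (succeeded, combined stdout/stderr text).

    Hypothetical helper for illustration: it never raises for the common
    failure modes, so a broken toolstack shows up as (False, message).
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold error text into the output
            universal_newlines=True,   # decode bytes to str
            timeout=timeout_s,
        )
    except FileNotFoundError:
        return False, "command not found: " + command[0]
    except subprocess.TimeoutExpired:
        return False, "timed out after %.0f s" % timeout_s
    # A non-zero exit code is treated as "the service did not answer".
    return result.returncode == 0, result.stdout.strip()
```

Calling `probe_cli(["xe", "host-list"])` on a healthy host should return `True` plus the host listing; on a host in the state described here it would return `False` along with whatever error text xe printed.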
I looked around in some logs, but didn't find anything very useful at first. Since it was pretty broken, I decided that maybe the upgrade just didn't go well and I should roll back and try again. I rebooted from the install media, and this time it offered a rollback or a clean install. I chose rollback, and indeed that is what I received. After a reboot things were back to normal.
The next day, I tried again, and once again received a broken install. I was able to get the networking status to show up in xsconsole by issuing a command from the shell, but it would 'break' again almost immediately. I should note that ifconfig did show that eth0 was set up properly, and the xenbr0 bridge interface was using the correct static IP. I could ping my border device, etc. XAPI was hosed though. I couldn't start the service (it would immediately crash), and checking the service status would sometimes show all the proper entries being started, and other times would only show the multipath daemon.
Even more strange was that when the networking was briefly up, the xsconsole showed that the current running version of Xenserver was...6.5? A check from the shell showed that I was in fact running version 7, though.
I figured it was time to cut my losses and just do the full clean install. I rolled back to 6.5, backed up the VM config data to my main SR, and then booted once again from the install media, choosing the clean install option.
Turns out, "clean install" isn't so clean.
On the next boot I still had broken everything. At this juncture, I was starting to get concerned. If a clean install didn't solve the issue, I had something deeper than a poorly-migrated config or something.
I booted the install media again, and...was presented with the option to roll back to 6.5.
As they say on the internet these days, "Wat?"
Thus began several hours of sanity-checking and such, only to come to the conclusion that the "clean" installer doesn't actually do anything to prevent the remade partitions from carrying over old data. I even switched to the shell, used fdisk to clean the partition table and write a new GPT block, and it STILL offered to roll me back to 6.5 on the next reboot.
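One general fact is worth spelling out: rewriting a partition table only changes the table. The filesystem data inside the old partitions stays on disk, and if the new partitions start at the same offsets, the old filesystems line up again. Whether that is what the installer was detecting is an assumption, not something confirmed here. As a rough sketch under that assumption, and further assuming the old install used an ext-family filesystem, the Python below checks a disk or image file for the ext superblock magic at a given partition offset and can zero the primary superblock. The function names and the `partition_offset` parameter are hypothetical, made up for illustration.

```python
import struct

EXT_SUPERBLOCK_OFFSET = 1024  # superblock starts 1024 bytes into the filesystem
EXT_SUPERBLOCK_SIZE = 1024    # size of the primary superblock
EXT_MAGIC_OFFSET = 56         # position of the s_magic field in the superblock
EXT_MAGIC = 0xEF53            # magic number shared by ext2, ext3 and ext4


def has_ext_signature(path: str, partition_offset: int) -> bool:
    """Return True if an ext superblock magic sits at this partition offset."""
    with open(path, "rb") as disk:
        disk.seek(partition_offset + EXT_SUPERBLOCK_OFFSET + EXT_MAGIC_OFFSET)
        raw = disk.read(2)
    # The magic is stored as a little-endian 16-bit value.
    return len(raw) == 2 and struct.unpack("<H", raw)[0] == EXT_MAGIC


def wipe_ext_signature(path: str, partition_offset: int) -> None:
    """Zero the primary superblock so the old filesystem is no longer detected.

    Destructive: only the primary superblock is cleared; backup superblocks
    deeper in the filesystem are left untouched.
    """
    with open(path, "r+b") as disk:
        disk.seek(partition_offset + EXT_SUPERBLOCK_OFFSET)
        disk.write(b"\x00" * EXT_SUPERBLOCK_SIZE)
```

Because the backup superblocks are left alone, treat this as a detection aid more than a full wipe, and point it at an image file before trusting it anywhere near a real disk.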
I ended up swapping in an entirely different hard disk (because that was faster than writing 46GB of 0's) and finally shucked the persistent 'phantom' 6.5 install.
What it failed to do was solve my problem.
Now that I could completely rule out any carry-over config issues from 6.5, I wasn't left with a lot of possibilities to work with.
The xensource.log file showed much weirdness. The crux appeared to be an error early during initialization, where XAPI tries to start some sort of database, but can't contact the control domain to do so. What is bizarre is that the error is a "divided_by_zero" error. This starts a chain reaction of failing commands, all with the same "divided_by_zero" error code.
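When one error cascades like this, the useful line is usually the first occurrence, not the hundredth. A small Python sketch for pulling that out of a log: it reports the line number of the first match and how many lines match in total. The function names are hypothetical, and the search string is simply the error text quoted above.

```python
from typing import Iterable, Optional, Tuple


def first_and_count(lines: Iterable[str], needle: str) -> Tuple[Optional[int], int]:
    """Return (1-based line number of the first match, total matching lines)."""
    first = None
    count = 0
    for number, line in enumerate(lines, start=1):
        if needle in line:
            count += 1
            if first is None:
                first = number  # remember where the cascade began
    return first, count


def scan_log(path: str, needle: str) -> Tuple[Optional[int], int]:
    """Apply first_and_count to a log file, tolerating odd bytes in the file."""
    with open(path, errors="replace") as log:
        return first_and_count(log, needle)
```

On the host this would be something like `scan_log("/var/log/xensource.log", "divided_by_zero")`, assuming the log lives at that path; the lines just before the first match are the ones worth reading closely.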
This set me down the path of some hardware diagnostics. I was pretty skeptical though, since my Xenserver 6.5 installation runs perfectly, regularly using more than 12GB of RAM and running for weeks or months at a time. And sure enough, the RAM came back clean, as did both hard disks.
In the end, I clean installed 6.5, restored my SR config, and restored the VM metadata. I don't know at this point if the issue was with the NUC's hardware, the Xenserver installation media not handling the hardware correctly, or some actual bug that is triggered by my config. I'll give it a couple of weeks, and if no Google results come up for others having my issue, I'll go through the trouble of getting the weird xensource.log contents into a bug tracker somewhere.