It’s been a while since I’ve posted to the blog. I had (and have) aspirations of writing here on a regular basis, not every day but certainly more often than I have been lately. I don’t have time to post every day (or multiple times a day) about news happening in the Sysadmin part of the world. There are better sites out there for that type of thing, and this site doesn’t need to replicate work that’s already being done better elsewhere.
I want to focus more on longer/better but less frequent articles. I want to continue writing posts more like the Unifi post. This one is about the importance of reading release notes for all the bits of software sysadmins are responsible for in a modern datacenter.
I just finished a major software upgrade for my company’s production VMware cluster. It was running vSphere 5.5 xxxx and needed to be upgraded to 5.5 Update 3, both to address a bug we were experiencing at our current version and to pick up the wide range of security fixes released between the two builds. Seems simple enough, right? I mean, just log in to the vSphere client, connect to the vSphere Update Manager, and go to town.
Not so much. I’ve got an approved maintenance window of 3 hours a week, same 3 hours every Thursday. The business knows that’s the time upgrades happen, but everything needs to be back in a running state before 10 PM. I can’t get all of this done in one 3 hour block, so things need to be kept happy and running between maintenance windows.
Besides vSphere, I also needed to account for the following:
- Trend Micro Deep Security
  - Has various hooks into each host in order to inspect and protect the guest VMs. Needs to support both the existing ESXi build and Update 3. I also needed to confirm that the new version of DSM would work with the existing appliances, since they could only be upgraded as each host was upgraded in turn.
- vShield Networking and Security
  - Needed to be upgraded to address bugs, etc., but also needs to land on a version that is supported by Deep Security, the version of ESXi I was currently running, and the version of ESXi I would be going to.
- Nutanix Controller VMs (NOS)
  - Although there were no known issues at the time of Update 3a’s release, I waited approximately 2 weeks for Nutanix to do internal QA with their code and Update 3a to ensure there were no tricky gotchas waiting for me. That’s great, because it’s one less thing I need to worry about, and it isn’t like I didn’t have a couple of maintenance windows’ worth of other updates that needed to be applied prior to rolling out the updated hypervisor anyway.
- Horizon View Desktops
  - Needed to upgrade to a version of Horizon View that supported both the current build of ESXi I was on and Update 3a. The VMware Product Interoperability Matrix listed no such version. I had to open a ticket with VMware support to verify which build of View I should go to. The matrix has since been updated to show that version 6.1.1 was the magic build for me.
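The common thread in that list is finding a version of each product that supports both the ESXi build I was on and the one I was going to. Here is a minimal Python sketch of that check. The version labels and build names are placeholders I made up for illustration, not real entries from any vendor's compatibility matrix, and `bridge_versions` is a hypothetical helper name.

```python
# Hypothetical compatibility data: product version -> ESXi builds it supports.
# These labels are placeholders, NOT real interoperability matrix entries.
SUPPORTED_ESXI = {
    "view-A": {"esxi-current"},
    "view-B": {"esxi-current", "esxi-target"},
    "view-C": {"esxi-target"},
}


def bridge_versions(matrix, current_build, target_build):
    """Return the versions that support both builds.

    A version that supports both lets the product keep working while
    hosts are upgraded one at a time across several maintenance windows.
    """
    needed = {current_build, target_build}
    return sorted(
        version for version, builds in matrix.items() if needed <= builds
    )


if __name__ == "__main__":
    print(bridge_versions(SUPPORTED_ESXI, "esxi-current", "esxi-target"))
```

An empty result is the situation I hit with Horizon View: the published matrix listed nothing that covered both builds, so the next step was a support ticket rather than a guess.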
After a lot of checking, double-checking, and note-taking, I had a comprehensive set of steps in OmniFocus that would result in an updated cluster and could be completed in chunks spread across several weeks with no downtime outside of the Thursday night maintenance window.
That process was:
- Upgrade vShield Network
- Update Deep Security Manager
- Upgrade vCenter Server Appliance
- Upgrade Horizon View Connection Server
- Upgrade Nutanix Controller software
- Begin updating the hypervisor on each host, one at a time.
  - Pick first host
  - Put host in maintenance mode
  - Upgrade vShield Endpoint Driver
  - Upgrade Trend Micro Filter Driver
  - Upgrade physical NIC drivers for ESXi (update needed)
  - Remove old Trend Micro appliance
  - Provision new Trend Micro appliance
  - Apply vSphere updates
  - Exit maintenance mode
  - Verify Nutanix Controller services restarted and rejoined the cluster
  - Repeat for additional hosts
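For anyone who would rather generate that checklist than type it out per host, here is a rough Python sketch. It only builds the list of steps as text; it does not talk to vCenter or any other product. The host names in the example are made up, and `build_checklist` is a hypothetical helper.

```python
# Cluster-wide steps, done once, in the order listed above.
CLUSTER_STEPS = [
    "Upgrade vShield Network",
    "Update Deep Security Manager",
    "Upgrade vCenter Server Appliance",
    "Upgrade Horizon View Connection Server",
    "Upgrade Nutanix Controller software",
]

# Steps repeated for every host, one host at a time.
PER_HOST_STEPS = [
    "Put host in maintenance mode",
    "Upgrade vShield Endpoint Driver",
    "Upgrade Trend Micro Filter Driver",
    "Upgrade physical NIC drivers for ESXi",
    "Remove old Trend Micro appliance",
    "Provision new Trend Micro appliance",
    "Apply vSphere updates",
    "Exit maintenance mode",
    "Verify Nutanix Controller services restarted and rejoined the cluster",
]


def build_checklist(hosts):
    """Return the full ordered checklist: cluster steps, then each host's steps."""
    checklist = list(CLUSTER_STEPS)
    for host in hosts:
        checklist.extend(f"{host}: {step}" for step in PER_HOST_STEPS)
    return checklist


if __name__ == "__main__":
    # Host names are hypothetical examples.
    for number, step in enumerate(build_checklist(["esx01", "esx02"]), start=1):
        print(f"{number:2d}. {step}")
```

Keeping the cluster-wide steps separate from the per-host steps makes the ordering constraint obvious: nothing in the per-host list starts until every manager and appliance above it is already on a version that supports both ESXi builds.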
I was lucky. I managed to just barely squeak by without needing to do multiple updates of a single product to get up to date. If I had waited much longer, I’d have had to upgrade vSphere partway, upgrade View, then upgrade vSphere the rest of the way, then finish updating View.
I’ve got resources in the cluster such that we can continue to run at 100% load with one host out of the cluster. I could power off test VMs and other non-critical servers to free up resources so that more than one host could be down at a time. But at the end of the day, I decided that the time saved by rebooting multiple hosts at once would be eaten up by jumping through all the hoops needed to make that possible, so I just took down one host at a time and vMotion’d everything around. Getting everything updated and making it through two reboots of a physical server (rebooting a VM has us all so spoiled, such a fast reboot cycle versus booting a physical server) took about an hour per host. I ended up doing two hosts (back to back) in a maintenance window, so it took a few weeks to get everything done.
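The scheduling math is simple, but it is worth writing down before committing to a plan. This is a small Python sketch of it. The 3-hour window, roughly one hour per host, and two hosts per window come from this post; the host count of 8 is an assumption purely for illustration, since the real number depends on your cluster.

```python
import math

WINDOW_HOURS = 3       # from the post: one 3-hour window each Thursday
HOURS_PER_HOST = 1     # from the post: roughly an hour per host
HOSTS_PER_WINDOW = 2   # from the post: two hosts back to back per window


def windows_needed(host_count, hosts_per_window):
    """Number of weekly maintenance windows needed to cover every host."""
    return math.ceil(host_count / hosts_per_window)


def spare_hours(window_hours, hours_per_host, hosts_per_window):
    """Time left in a window as a buffer for slow reboots or rollbacks."""
    return window_hours - hours_per_host * hosts_per_window


if __name__ == "__main__":
    host_count = 8  # ASSUMPTION: example cluster size, not from the post
    print(windows_needed(host_count, HOSTS_PER_WINDOW), "windows")
    print(spare_hours(WINDOW_HOURS, HOURS_PER_HOST, HOSTS_PER_WINDOW), "spare hour(s) per window")
```

Squeezing a third host into each window would shorten the schedule, but it leaves no buffer at all if a physical server takes longer than expected to come back.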
In news that will come as a shock to absolutely no one who reads a Sysadmin blog, before I got all my hosts upgraded to the latest and greatest build… a new round of patches was released. Don’t get me wrong, bugs need fixed and security holes need patched. I’m glad to receive improvements and updates. I just need to not let it go so long between update cycles. It makes it a real pain to get it all sorted out.