Cisco ACI Fabric Upgrade Process

In this write-up, I’m going to cover the upgrade process for a typical Cisco ACI fabric.  Overall, the ACI fabric upgrade process is fairly straightforward and, with sufficient redundancy of connected devices, can be performed without any adverse impact to production traffic.

It’s worth mentioning that not all upgrades are the same – make sure to review the release notes for the versions involved to check for any caveats or release-specific instructions that may exist.

We’ll start by downloading the .iso and .bin files from CCO.  Two images are needed: the aci-apic-dk9.(version).iso file for the APICs, and the aci-n9000-dk9.(version).bin file for the Nexus 9Ks.  The same switch image is used for both the spines and the leafs.  Together, the files are approximately 4.0-4.5GB in size.  If internet bandwidth or the transfer speed from the management workstation to the APIC is a concern, these files can be staged well ahead of the planned maintenance activity.
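
Because the images are large and may be staged days ahead of the window, it can be worth confirming that they arrived intact before uploading them.  The following is a minimal sketch, assuming the download page lists a checksum for each image that you copy by hand; the function names are my own and not part of any Cisco tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha512", chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte image never sits fully in memory."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex, algorithm="sha512"):
    """Compare a local file against a checksum pasted from the download page.

    The published value is normalized (whitespace stripped, lowercased)
    because copy-and-paste often adds a trailing newline or uppercase hex.
    """
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

Run matches_published() once against the .iso and once against the .bin; a False result means the file should be downloaded again rather than uploaded to the APIC.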

After download, we’ll need to upload both of these files to an APIC.  From the GUI, navigate to Admin > Firmware, right-click Firmware Repository, choose Upload Firmware to APIC and upload the two files.

After uploading, it will take several minutes for the files to be processed and appear in the repository listing.  You may also see an aci-catalog file appear, which was contained within the .iso bundle – you won’t find it online.  Note that it’s only necessary to upload these files to one APIC in the cluster, not to every cluster member.

With the files in place – and after confirming the upgrade order provided in the release notes – upgrading the APIC image is first on our list.  Right-click Controller Firmware and select Controller Upgrade to begin the process.

Even though this work may be occurring within a maintenance window, it is advisable to check the health and proper operation of all redundant devices and forwarding paths.  If a secondary or failover device is not working properly, the time to discover and resolve the issue should be before an upgrade, not during or after.  These processes should be incorporated into an application test plan, which would ideally be executed before, during, and after any major upgrade processes.

Once the health and operation of the ACI fabric and external devices have been confirmed, select the target version from the drop-down, and hit Submit.  The APICs will upgrade sequentially, usually taking about 35-40 minutes each.

Each APIC will reboot itself, so expect your browser session to be reset as a normal part of the process.

After logging back in, check to ensure that all APICs in the cluster are successfully upgraded.  This lab configuration has only one member.
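
On a multi-member cluster, the same check can be scripted rather than eyeballed.  The sketch below only inspects a response that has already been fetched and decoded; it assumes the data came from a REST class query for firmwareCtrlrRunning and has the usual imdata layout.  Treat the class name, the attribute names, and the sample values as assumptions to verify against the object model of your release.

```python
def controllers_not_at(target_version, response):
    """Return the DNs of controllers whose running version differs from the target.

    `response` is assumed to look like:
    {"imdata": [{"firmwareCtrlrRunning": {"attributes": {"dn": ..., "version": ...}}}]}
    """
    stale = []
    for item in response.get("imdata", []):
        attrs = item["firmwareCtrlrRunning"]["attributes"]
        if attrs["version"] != target_version:
            stale.append(attrs["dn"])
    return stale


# Illustrative payload for a two-member cluster; the values are made up.
sample = {
    "imdata": [
        {"firmwareCtrlrRunning": {"attributes": {
            "dn": "topology/pod-1/node-1/sys/ctrlrfwstatuscont/ctrlrrunning",
            "version": "1.2(1k)"}}},
        {"firmwareCtrlrRunning": {"attributes": {
            "dn": "topology/pod-1/node-2/sys/ctrlrfwstatuscont/ctrlrrunning",
            "version": "1.1(4e)"}}},
    ]
}

# An empty list means every controller reports the target version.
lagging = controllers_not_at("1.2(1k)", sample)
```

An empty result is the signal to move on to the switches; anything else names the controller that still needs attention.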

Next up – leafs and spines.  Under Fabric Node Firmware, there are two separate objects that determine how the upgrade will occur – Firmware Groups and Maintenance Groups.

Firmware Groups determine which version each switch will be upgraded to when an upgrade is executed.  Typically, all switches will target the same firmware version – although that certainly does not have to be the case; an ACI fabric can actually run three versions of firmware concurrently.  This could be helpful if a partial fabric upgrade does not resolve an issue, as a third version could be deployed – likely with Cisco TAC’s direction – to further troubleshoot and isolate a problem.

For now, we’ll target v11.2(1k) – the version corresponding to APIC v1.2(1k) – across all switches and hit Submit.  Changing and confirming this setting does not begin the actual upgrade – we’ll do that in the next step.
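
The pairing of APIC v1.2(1k) with switch v11.2(1k) suggests a simple rule: the switch release carries the same string with ten added to the major number.  The sketch below encodes that rule.  Generalizing from this single pair is my assumption, so always confirm the matching switch release in the release notes rather than relying on the arithmetic.

```python
import re

# Matches versions written as major.minor(patch), for example 1.2(1k).
_VERSION = re.compile(r"^(\d+)\.(\d+)\((\w+)\)$")


def switch_version_for(apic_version):
    """Derive the switch firmware version assumed to pair with an APIC version.

    Assumption: the switch major number is the APIC major number plus ten,
    which is consistent with the 1.2(1k) -> 11.2(1k) pairing used here.
    """
    match = _VERSION.match(apic_version)
    if match is None:
        raise ValueError(f"unrecognized version format: {apic_version!r}")
    major, minor, patch = match.groups()
    return f"{int(major) + 10}.{minor}({patch})"
```

A helper like this is mostly useful as a sanity check in a pre-upgrade script, flagging a Firmware Group whose target does not line up with the APIC release.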

Maintenance Groups define which nodes will execute the upgrade simultaneously.  This is where proper cabling plan design comes into play.  If dual-homed hosts are connected to leafs 201 & 202, 203 & 204, or 205 & 206, the fabric upgrade can be divided into groups of even- and odd-numbered nodes, as one member of each of these pairings can be rebooted (along with half of the spines) with minimal or no service disruption.

Given the above scenario, each group’s upgrade takes 50% of the fabric’s forwarding capacity offline – some loss of forwarding capacity during these upgrades is simply unavoidable.  Any disruption to single-homed hosts, layer-2 connections, or layer-3 devices must be taken into account as well.
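
The even/odd split described above is easy to sanity-check before committing it to Maintenance Groups.  This is a small sketch using the leaf IDs from the example; the spine IDs 101 and 102 and the function names are hypothetical.

```python
def split_by_parity(node_ids):
    """Divide node IDs into an odd-numbered group and an even-numbered group."""
    odd = sorted(n for n in node_ids if n % 2 == 1)
    even = sorted(n for n in node_ids if n % 2 == 0)
    return odd, even


def pairs_split_across_groups(pairs, group_a, group_b):
    """True if every dual-homed pair has exactly one member in each group."""
    a, b = set(group_a), set(group_b)
    return all(
        (x in a and y in b) or (x in b and y in a)
        for x, y in pairs
    )


# Spine IDs are hypothetical; leaf IDs and pairings follow the example above.
nodes = [101, 102, 201, 202, 203, 204, 205, 206]
leaf_pairs = [(201, 202), (203, 204), (205, 206)]

odd_group, even_group = split_by_parity(nodes)
safe = pairs_split_across_groups(leaf_pairs, odd_group, even_group)
```

If pairs_split_across_groups() returns False, some pair of leafs would reboot together, and any hosts dual-homed to that pair would lose both uplinks at once.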

After setting the target firmware version under the Firmware Group object in the previous step, it’s now time to begin the fabric upgrade process.  Select the first Maintenance Group, right-click, and choose Upgrade Now.

The switch firmware is pushed from the APIC to the leafs and spines, the nodes reboot, and the discovery process reinitializes the switches back into the fabric.  The health score will naturally take a dip – half of the fabric just went offline, after all.

After the first maintenance group’s members are back online, I recommend executing the appropriate infrastructure validation test plans in a production environment.  That means not only checking ACI fabric health, but also ensuring that all redundant paths, hosts, and protocols have returned to healthy and operational states, as they are about to be put to the test again with the upgrade of the second group of nodes.
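
Part of that validation can be expressed as a simple gate: do not start the next group until every node in the previous one is back and running the target version.  The sketch below works on plain dictionaries; the keys id, state, and version are hypothetical names for whatever your inventory source provides, not APIC attribute names.

```python
def nodes_blocking_next_group(group_nodes, target_version):
    """List (node_id, reason) for each node that is not yet ready.

    Each node is a dict with hypothetical keys: "id", "state", "version".
    A node is ready when its state is "active" and it runs the target version.
    """
    blocking = []
    for node in group_nodes:
        if node["state"] != "active":
            blocking.append((node["id"], f"state is {node['state']}"))
        elif node["version"] != target_version:
            blocking.append((node["id"], f"running {node['version']}"))
    return blocking


# Illustrative snapshot of the first group partway through its upgrade.
first_group = [
    {"id": 101, "state": "active", "version": "11.2(1k)"},
    {"id": 201, "state": "inactive", "version": "11.1(4e)"},
    {"id": 203, "state": "active", "version": "11.1(4e)"},
]

blockers = nodes_blocking_next_group(first_group, "11.2(1k)")
```

An empty list only says the switches are back; it does not replace the application and redundancy tests described above.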

After the infrastructure has stabilized, right-click on the second maintenance group and select Upgrade Now.  Repeat for additional groups, as necessary.

Once all switches are upgraded, again execute any and all test and validation plans.  ACI fabric health should recover to previous metrics.  Any new faults that exist may warrant additional investigation.
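
Spotting new faults is easier with a snapshot taken before the upgrade to compare against.  Here is a minimal sketch, assuming each fault has been reduced to a (code, affected object) pair; the fault codes and object names are placeholders, not real output.

```python
def new_faults(before, after):
    """Return faults present after the upgrade that were absent before it.

    Each fault is a (code, affected_object) tuple, so the same code raised
    on a different object still counts as new.
    """
    return sorted(set(after) - set(before))


# Placeholder snapshots; real ones would come from the fabric's fault list.
pre_upgrade = [
    ("F1001", "node-201/eth1/10"),
    ("F1002", "node-204/psu-2"),
]
post_upgrade = [
    ("F1002", "node-204/psu-2"),
    ("F1001", "node-205/eth1/3"),
]

to_investigate = new_faults(pre_upgrade, post_upgrade)
```

Faults that were already present before the window can then be set aside, leaving only the entries that appeared during the upgrade.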

Our fabric is completely upgraded!  With three APICs and two maintenance groups, I allocate two hours of actual upgrade time to a given fabric – not considering execution of any test plans or troubleshooting that may be involved.
