Last week (on the 24th) I performed a routine ONTAP upgrade across 5 of my 3170 filers, going from 7.3.5 to 7.3.5.1P5. The upgrade was meant to prevent a system panic that had already happened twice on our standalone 3170 SnapVault target system because of Bug 446493. Here is the bug description:
Much disk and shelf hardware can be managed by an ANSI-standard technology called SCSI Enclosure Services (SES). To support SES-related processing, the SES subsystem of Data ONTAP schedules various periodic actions, using an internal timeout mechanism.
Due to a software defect, under certain conditions, the SES subsystem may set an excessive number of timers, with new timers being set before old ones expire. If this continues, an internal callout table will fill up, triggering an interruption of processing.
One condition in which the problem can occur is during initial setup and configuration of the storage system: when the “cluster-setup wizard” is run, it asks the administrator for configuration input as follows:
Do you want to create a new cluster or join an existing cluster? {create, join}: (Login timeout will occur in 60 seconds)
When installing releases in which the defect is present, if this prompt is allowed to time out, an interruption may occur at some later time.
However, the callout table can also fill up during routine production, if storage events occur in rapid succession, such that SES scans are rapidly invoked. Such events may include:
- a continuing series of disk errors
- breakages in disk-communication links
- adding new shelves
- power-cycling shelves
- removing a power supply
- shelf firmware updates
- shelf faults
- takeover/giveback
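To make the failure mode concrete, here is a toy model of a fixed-size timer table in which new timers are set before old ones expire. It is only a sketch of the mechanism the description outlines, not Data ONTAP's actual implementation; the function name, the parameters, and all of the numbers are made up for illustration.

```python
def simulate_callout_table(capacity, timer_lifetime, scan_interval,
                           timers_per_scan, max_ticks):
    """Return the tick at which the table overflows, or None if it never does.

    All parameters are hypothetical; they only illustrate the mechanism.
    """
    expiries = []  # expiry tick of every timer still pending
    for tick in range(max_ticks):
        # Timers whose expiry tick has been reached free their slots.
        expiries = [t for t in expiries if t > tick]
        # Every scan_interval ticks, a scan sets a batch of new timers.
        if tick % scan_interval == 0:
            for _ in range(timers_per_scan):
                if len(expiries) >= capacity:
                    return tick  # table is full: the "interruption"
                expiries.append(tick + timer_lifetime)
    return None


# Scans spaced as far apart as the timer lifetime: slots are reused.
calm = simulate_callout_table(capacity=100, timer_lifetime=10,
                              scan_interval=10, timers_per_scan=5,
                              max_ticks=1000)

# Scans on every tick with long-lived timers: the table fills up.
storm = simulate_callout_table(capacity=100, timer_lifetime=50,
                               scan_interval=1, timers_per_scan=5,
                               max_ticks=1000)

print("calm:", calm, "storm:", storm)
```

In the calm case the old timers expire before the next scan, so the table never grows past one batch. In the storm case scans arrive faster than timers expire, which is the pattern a run of back-to-back storage events would produce.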
We were hitting this bug under the “continuing series of disk errors” event, which caused the SES scans to fill the timer-callout table. When this happened, the system would panic and reboot. After the upgrade I performed all standard checkouts, and everything appeared to be functioning normally and within standards, so I closed the upgrade process and marked it as successful.
Then on the 26th we attempted to perform an allocation using the NMC. Every step appeared to go well: all of the checks passed and we hit commit, only to have the process come back with an error message indicating that it had failed. After we opened a support call with NetApp and provided both screen captures of the error and the steps to reproduce it, it was determined that we had run into another bug. This time we hit Bug 474612. Here is the bug description:
system-cli API returns cli-result-value with an invalid return status. This invalid return status may break OnCommand and other third party applications utilizing NMSDK.
Basically, the NMC executes the requested commands, but the filer CLI returns a response code that the NMC does not expect and in fact does not recognize. When this happens, the NMC does not know how to proceed, and the command fails without performing the provisioning. Apparently this issue was introduced in ONTAP 7.3.5.1P4 and still exists in 1P5, so the solution is to downgrade ONTAP to 7.3.5.1P3. In that version, bug 446493 is resolved and bug 474612 has not yet been introduced (as of this writing, bug 474612 is NOT resolved in any version of ONTAP).
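As a rough illustration of why an unexpected status stops the whole operation, here is a sketch of a client that only knows how to continue on statuses it recognizes. This is not the NMC's or the NMSDK's real code: the status values, the `provision` function, and the `run_cli` callable are all hypothetical stand-ins.

```python
# Hypothetical status codes: the bug description does not list the real values.
EXPECTED_STATUSES = {0, 1}  # assumed: 0 = success, 1 = command failed


class UnrecognizedStatus(Exception):
    """Raised when the CLI returns a status the client does not know."""


def provision(run_cli, commands):
    """Run each command through run_cli and stop at the first bad status.

    run_cli is a stand-in callable that takes a command string and
    returns an integer status.
    """
    done = []
    for command in commands:
        status = run_cli(command)
        if status not in EXPECTED_STATUSES:
            # The client cannot interpret the result, so it gives up.
            raise UnrecognizedStatus(f"{command!r} returned status {status}")
        if status != 0:
            raise RuntimeError(f"{command!r} failed with status {status}")
        done.append(command)
    return done
```

With a `run_cli` that always returns 0, every command is recorded as done. With one that returns a value outside the expected set, the first command raises `UnrecognizedStatus` and nothing after it runs.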
After performing the downgrade to ONTAP 7.3.5.1P3 tonight, I performed all normal checkouts and also ran the pending allocation via the NMC to verify that bug 474612 is not present in this version of ONTAP. Happily, the allocation went through without issue, and from what we can tell all aspects of the filers are functioning normally. Our next ONTAP upgrade will have to be to 7.3.6.
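For future upgrades, the version and bug combinations from this post can be kept in a small lookup and checked before picking a target release. The table below only records what is described here, it is not an official NetApp compatibility matrix, and the names `KNOWN_BUGS`, `bugs_present`, and `clean_versions` are my own.

```python
# Bugs present per release, as described in this post only.
KNOWN_BUGS = {
    "7.3.5": {"446493"},
    "7.3.5.1P3": set(),
    "7.3.5.1P4": {"474612"},
    "7.3.5.1P5": {"474612"},
}


def bugs_present(version):
    """Return the set of known bugs for a version, or None if unlisted."""
    return KNOWN_BUGS.get(version)


def clean_versions():
    """Return the listed versions that carry neither bug."""
    return sorted(v for v, bugs in KNOWN_BUGS.items() if not bugs)


print(clean_versions())
```

Any release not in the table (7.3.6, for example) returns `None` from `bugs_present`, meaning its status still has to be checked against the release notes.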