Hard Drive failures and mitigation: brochure, theory, practice

I was asked to explain drive failures and our mitigation efforts for our servers to some director-level-and-above folks, after one server had issues due to a bad drive.

“How are we not monitoring for these types of failures, and how, when drives are hot-swappable and mirrored, do we have problems?”

The underlying message was clear: since a server had problems due to a disk failure, we must not be doing things right.

First, we’ll start where the skies are bright, and the sales reps smile.

The Brochure

Proper drive mirroring combined with hot-swappable drives prevents drive errors from taking down systems.

The salesman will nod and assure you that with THEIR system, you can get 100 9’s of uptime for the low low price of a large check.

The brochure is always awesome, so I won’t elaborate further.

I believe I have a unicorn promised to me in brochure-land. I hope they are feeding it while I’m here.

In Theory

Drives are paired up and mirrored to mitigate dying drives.

We have monitoring to detect bad drives, and we quickly replace any drive identified as going bad.
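
To make that concrete, here's a minimal sketch of what such monitoring can boil down to, in Python: parse the per-device error counters from Solaris's iostat -En and flag anything noisy. The threshold and the choice of counters are arbitrary placeholders for illustration, not our actual tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag drives whose 'iostat -En' error counters look
suspicious. The threshold and the decision to sum hard + transport
errors are arbitrary illustrations, not real production monitoring."""
import re
import subprocess

ERROR_THRESHOLD = 5  # arbitrary; tune for your environment

# Matches the first line of each device stanza in Solaris 'iostat -En',
# e.g. "sd2  Soft Errors: 0 Hard Errors: 3 Transport Errors: 12"
STANZA = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)

def drive_errors():
    """Return {device: (soft, hard, transport)} parsed from iostat -En."""
    out = subprocess.check_output(["iostat", "-En"], text=True)
    counts = {}
    for line in out.splitlines():
        m = STANZA.match(line)
        if m:
            counts[m.group(1)] = tuple(int(x) for x in m.group(2, 3, 4))
    return counts

if __name__ == "__main__":
    for dev, (soft, hard, transport) in sorted(drive_errors().items()):
        if hard + transport > ERROR_THRESHOLD:
            print(f"ALERT: {dev} hard={hard} transport={transport}")
```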

In Practice

In Practice, Theory holds up the overwhelming majority of the time.

We hot swap and replace bad drives quite often, with few ill effects.

But, and there is always a ‘but’.

Drives don’t always die in a graceful fashion. Most of the time, they go quietly into that binary-speckled good night, but sometimes we get the theater production of:

Lingering death of a drive

On most servers, the drives share the same communication bus, typically a SCSI bus. Some drives die in a slow, agonizing fashion: they will still take some requests, but respond slowly or reset the bus often. Systems with this type of failure can sometimes be seen ‘glitching’: running fine for most things, but inexplicably hanging for brief periods. Other times there is no glitching, just a series of errors in the log files.
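
As a rough illustration of scanning those log files, here's a sketch that tallies suspicious-looking messages per device. It assumes a Solaris-style /var/adm/messages, and the match patterns are illustrative guesses; the exact wording varies by driver and OS release.

```python
#!/usr/bin/env python3
"""Sketch: count suspicious disk messages per device in the syslog.
Assumes a Solaris-style /var/adm/messages; the patterns below are
illustrative guesses, since exact wording varies by driver/release."""
import re
from collections import Counter

LOG = "/var/adm/messages"  # assumed location
SUSPECT = re.compile(r"Error for Command|transport failed|timeout|reset",
                     re.IGNORECASE)
DEVICE = re.compile(r"\b(sd\d+|c\d+t\d+d\d+)\b")  # sd instances / ctd names

def suspect_counts(path=LOG):
    """Tally lines that match a suspicious pattern, keyed by device."""
    hits = Counter()
    with open(path) as fh:
        for line in fh:
            if SUSPECT.search(line):
                m = DEVICE.search(line)
                hits[m.group(1) if m else "unknown"] += 1
    return hits

if __name__ == "__main__":
    for dev, n in suspect_counts().most_common():
        print(f"{dev}: {n} suspicious log lines")
```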

When properly identified, drive replacements tend to go smoothly on these systems.

Squawking, shrieking death of a drive

In this case, the dying drive will foul the SCSI bus, preventing the good drives sharing the bus from communicating properly.

This is the scenario we faced with the server that got director-level attention.

Drive #2 on the system was going bad and squawking on the bus so much that it was causing bus resets and preventing our properly mirrored OS drives (drives 0 and 1) from doing their jobs.

Removing the bad drive #2 allowed drives 0 and 1 to continue their good work.
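
One way to sanity-check the “something is fouling the whole bus” hypothesis is to time a small read from every drive on the bus: if all of them are slow or erroring, suspect a single squawker rather than trusting whichever drive the logs happen to blame. A rough sketch, with assumed device paths and an arbitrary threshold:

```python
#!/usr/bin/env python3
"""Sketch: time a small raw read from each drive sharing a bus.
Device paths and the 'slow' threshold are assumptions for
illustration; run as root, and expect blocking reads to hang
on a truly sick bus."""
import os
import time

DEVICES = [  # assumed ctd paths for the drives on the shared bus
    "/dev/rdsk/c0t0d0s2",
    "/dev/rdsk/c0t1d0s2",
    "/dev/rdsk/c0t2d0s2",
]
SLOW_SECONDS = 1.0  # arbitrary

def probe(dev, nbytes=8192):
    """Return ('ok', seconds) or ('error', message) for one small read."""
    start = time.time()
    try:
        fd = os.open(dev, os.O_RDONLY)
        try:
            os.read(fd, nbytes)
        finally:
            os.close(fd)
    except OSError as exc:
        return ("error", str(exc))
    return ("ok", time.time() - start)

if __name__ == "__main__":
    results = {dev: probe(dev) for dev in DEVICES}
    for dev, res in results.items():
        print(dev, res)
    unhealthy = [d for d, (status, val) in results.items()
                 if status == "error" or val > SLOW_SECONDS]
    if len(unhealthy) == len(DEVICES):
        print("Every drive is slow/erroring: suspect one device "
              "fouling the shared bus, not just the one the logs blame.")
```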

The head-faking death of a drive

This can be the hardest of all drive failures to diagnose. The system starts glitching and complaining about a drive going bad.

We’ll call the bad drive “Bad disk 1”.

System logs are filled with tales of “Bad disk 1” doing many bad, bad things: failed reads, failed writes, SCSI bus timeouts, etc.

But it could be that the supposedly “Good drive 0” is head-faking, fouling the bus so hard that it looks like “Bad disk 1” is the one going bad.

In reality, it’s drive 0 which has gone south on us.

We have had a few cases where we removed the indicated bad drive, only to find that the remaining ‘good’ drive left in the system is useless. In that case, we swap the drives, putting the pulled drive back and pulling its partner instead, and retry.
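
Written out as a sketch, that procedure looks something like the following; every helper here is a hypothetical stand-in for a hands-on step, not a real API.

```python
#!/usr/bin/env python3
"""Sketch of the swap-and-retry procedure for a head-faking drive.
All helpers are hypothetical stand-ins for physical, hands-on steps."""

def pull(drive):
    print(f"pull {drive}")       # hypothetical: physically remove it

def reinstall(drive):
    print(f"reinstall {drive}")  # hypothetical: put it back in the bay

def system_healthy():
    return False                 # hypothetical: watch logs / glitching

def replace_suspect(indicated_bad, partner):
    pull(indicated_bad)          # believe the logs first
    if system_healthy():
        return indicated_bad     # the logs were right
    # Head-fake: the 'good' partner may be the real culprit.
    reinstall(indicated_bad)
    pull(partner)
    if system_healthy():
        return partner
    raise RuntimeError("both drives suspect; escalate")
```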

There is no known heuristic to differentiate between the “Squawking, shrieking death of a drive” and “The head-faking death of a drive”.

This head-faking scenario is why we keep the indicated bad drives we remove from systems around for a few days.

Thankfully, drive failures of the latter two kinds are by and large rare.
