{"id":376,"date":"2012-02-23T16:31:43","date_gmt":"2012-02-23T16:31:43","guid":{"rendered":"http:\/\/www.pasko.net\/wordpress\/?p=376"},"modified":"2012-02-23T16:31:43","modified_gmt":"2012-02-23T16:31:43","slug":"hard-drive-failures-and-mitigation-brochure-theory-practice","status":"publish","type":"post","link":"https:\/\/www.pasko.net\/wordpress\/2012\/02\/23\/hard-drive-failures-and-mitigation-brochure-theory-practice\/","title":{"rendered":"Hard Drive failures and mitigation: brochure,  theory, practice"},"content":{"rendered":"<p id=\"top\">I was asked to explain drive failures and our mitigation efforts on our servers to some folks at Director level and above, after one server had issues due to a bad drive.<\/p>\n<p style=\"padding-left: 30px;\"><em>&#8220;How are we not monitoring for these types of failures, and how when drives are hot swappable and mirrored do we have problems?&#8221;<\/em><\/p>\n<p>The underlying message was clear: since a server had problems due to a disk failure, we must not be doing things right.<\/p>\n<p>First, we\u2019ll start where the skies are bright and the sales reps smile.<\/p>\n<h2>The Brochure<\/h2>\n<p>Proper drive mirroring combined with hot-swappable drives prevents drive errors from taking down systems.<\/p>\n<p>The salesman will nod and assure you that with THEIR system, you can get <strong>100 9\u2019s<\/strong> of uptime for the low low price of a large check.<\/p>\n<p>The brochure is always awesome, so I won\u2019t elaborate further.<\/p>\n<p>I believe I have a unicorn promised to me in brochure-land. 
I hope they are feeding it while I\u2019m here.<\/p>\n<h2>In Theory<\/h2>\n<p>Drives are paired up and use mirroring to mitigate dying drives.<\/p>\n<p>Monitoring detects the bad drives, and we quickly replace any drive identified as going bad.<\/p>\n<h2>In Practice<\/h2>\n<p>In Practice, Theory holds up the vast majority of the time.<\/p>\n<p>We hot-swap bad drives quite often, with few ill effects.<\/p>\n<p>But, and <em><strong>there is always a \u2018but\u2019.<\/strong><\/em><\/p>\n<p>Drives don\u2019t always die in a graceful fashion. Most of the time, they go quietly into that binary-speckled good night, but sometimes we get the theater production of:<\/p>\n<h2>Lingering death of a drive<\/h2>\n<p>On most servers, the drives share the same communication bus, typically a SCSI bus. Some drives die in a slow, agonizing fashion: they still take some requests, but respond slowly or reset the bus often. Systems with this type of failure can sometimes be seen \u2018glitching\u2019: running fine for many things, but inexplicably hanging for brief periods. Other times, there will be no glitching, just a series of errors in the log files.<\/p>\n<p>When properly identified, drive replacements tend to go smoothly on these systems.<\/p>\n<h2>Squawking shrieking death of a drive<\/h2>\n<p>In this case, the dying drive fouls the SCSI bus, preventing the good drives sharing the bus from communicating properly.<\/p>\n<p>This is the scenario we faced with our server that got Director-level attention.<\/p>\n<p>Drive #2 on the system was going bad and squawking on the bus so much that it was causing bus resets and preventing our properly mirrored OS drives (drives 0 and 1) from doing their jobs.<\/p>\n<p>Removing bad drive #2 allowed drives 0 and 1 to continue their good work.<\/p>\n<h2>The head-faking death of a drive<\/h2>\n<p>This can be the hardest to diagnose of all drive failures. 
The system starts glitching and complaining about a drive going bad.<\/p>\n<p>We\u2019ll call the bad drive \u201cBad disk 1\u201d.<\/p>\n<p>System logs are filled with tales of \u201cBad disk 1\u201d doing many <strong>bad bad<\/strong> things: failed reads, failed writes, SCSI bus timeouts, etc.<\/p>\n<p>But it could be that supposedly \u201cGood drive 0\u201d is head-faking and fouling the bus so hard that it only looks like \u201cBad disk 1\u201d is going bad.<\/p>\n<p>In reality, it\u2019s drive 0 which has gone south on us.<\/p>\n<p>We have had a few cases where we removed the indicated bad drive, only to find out the remaining \u2018good\u2019 drive left in the system is useless. In this case, we swap the drives and retry.<\/p>\n<p>There is no known heuristic to differentiate between the \u201cSquawking shrieking death of a drive\u201d and \u201cThe head-faking death of a drive\u201d.<\/p>\n<p>This head-faking scenario is the reason we keep the indicated bad drives we remove from systems around for a few days.<\/p>\n<p>Thankfully, drive failures of the above two kinds are by and large rare.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was asked to explain drive failures and mitigation efforts to some Director level and Aboves relative to our servers, after one server had issues due to a bad drive. 
&#8220;How are we not monitoring for these types of failures, &hellip; <a href=\"https:\/\/www.pasko.net\/wordpress\/2012\/02\/23\/hard-drive-failures-and-mitigation-brochure-theory-practice\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[6,7,1],"tags":[],"class_list":["post-376","post","type-post","status-publish","format-standard","hentry","category-nerd","category-solaris","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/posts\/376","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/comments?post=376"}],"version-history":[{"count":0,"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/posts\/376\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/media?parent=376"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/categories?post=376"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pasko.net\/wordpress\/wp-json\/wp\/v2\/tags?post=376"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}