Catastrophic disk failure, where was smartd?

Wed Mar 26 18:28:01 UTC 2008

Bruno Wolff III wrote:
> On Wed, Mar 26, 2008 at 08:35:49 -0500,
>   "David G. Mackay" <mackay_d at bellsouth.net> wrote:
>> Shouldn't there have been some indication of problems prior to the
>> failure?
> 
> Only if you are lucky. Someone at Google published some information about
> smart around a year ago. In cases where catastrophic failures occur, for a high
> percentage there is no warning from smart.
> 

The big issue is that most SMART implementations don't scan the disk for 
bad blocks. In my experience several years ago, with 1000+ disks in 
service, the #1 failure was bad blocks, and SMART did little to catch 
that. The #2 failure was failure to spin up at all, but this seemed to be 
confined to certain batches.

One thing I would do is run a simple "dd if=/dev/sdx of=/dev/null bs=1M" on 
all of my disks, maybe once per week or once per month, to scan them myself. 
If the disk detects a sector getting too many errors (still correctable with 
the extra ECC bits it has), it will move the data from the bad sector to a 
spare and mark the bad sector bad, and I believe SMART counts when this has 
been done.
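
For anyone who wants to run that scan from cron, here is a rough sketch of how 
it could be wrapped. The function names scan_device and scan_all are made up 
for illustration, the device paths are placeholders, and bs=1M assumes GNU dd. 
It only reads from the devices and discards the data; it is not a replacement 
for a proper surface test.

```shell
#!/bin/sh
# Sketch only: scan_device and scan_all are hypothetical helper names.
# Assumes GNU dd (bs=1M) and read permission on the devices.

scan_device() {
    dev=$1
    # Read the whole device sequentially and throw the data away.
    # dd exits non-zero if it cannot open the device or a read fails.
    if dd if="$dev" of=/dev/null bs=1M 2>/dev/null; then
        echo "OK: $dev"
    else
        echo "READ ERROR: $dev" >&2
        return 1
    fi
}

scan_all() {
    status=0
    for d in "$@"; do
        # Keep going after a failure so every disk still gets read.
        scan_device "$d" || status=1
    done
    return $status
}

# Example (placeholder device names), e.g. from a weekly cron job:
#   scan_all /dev/sda /dev/sdb
```

Afterwards, "smartctl -A /dev/sdx" (from smartmontools) lists the drive's 
attributes; on many drives the reallocated sector count shows up there, which 
is presumably the counter mentioned above.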

                                Roger