Wednesday, December 05, 2007

Failed drive slice in a Software RAID after resync

One of the things that I do periodically on my servers is run an mdadm resync check. Because this can put a heavy strain on the disk subsystem, I strongly suggest that you have good backups in place first. My home systems run a check about once a month; servers at work run a check early on Tuesday mornings.

The script is very simple; you fire off the check by writing "check" to the sync_action attribute that md exposes in sysfs for each array.

#!/bin/sh
# Tells mdadm to verify that the arrays are synchronized.
# This deals with the issue where a seldom-read disk block has gone bad
# by doing a daily/weekly verification of the array.

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
echo check > /sys/block/md2/md/sync_action
echo check > /sys/block/md3/md/sync_action
echo check > /sys/block/md4/md/sync_action
echo check > /sys/block/md5/md/sync_action
echo check > /sys/block/md6/md/sync_action
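
The check runs in the background, so it's easy to see how it went afterwards. A minimal sketch of how you might watch the progress and results, using md0 as the example (the sysfs mismatch counter is standard, though interpret a non-zero value with care on a busy array):

# Progress shows up in /proc/mdstat while a check is running.
cat /proc/mdstat

# After the check completes, a non-zero count here means the check
# found blocks that differ between the mirrors.
cat /sys/block/md0/md/mismatch_cnt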


In this case, all of my RAID slices verified correctly except one. I'm running triple-active RAID1 arrays: instead of keeping a hot-spare disk, I put live data onto all three disks and use all three actively.

See also Failing hard drive in a Software RAID

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sdb1[1] sda1[0]
      256896 blocks [3/3] [UUU]

md2 : active raid1 sdc3[2] sdb3[1] sda3[0]
      12289600 blocks [3/3] [UUU]

md4 : active raid1 sdc5[2] sdb5[1] sda5[0]
      33551616 blocks [3/3] [UUU]

md3 : active raid1 sdc6[2] sdb6[1] sda6[0]
      1052160 blocks [3/3] [UUU]

md5 : active raid1 sdc7[2] sdb7[1] sda7[0]
      64010880 blocks [3/3] [UUU]

md6 : active raid1 sdc8[2] sdb8[1] sda8[0]
      267257216 blocks [3/3] [UUU]

md7 : active raid1 sdf1[2] sde1[1] sdd1[0]
      488383936 blocks [3/3] [UUU]

md1 : active raid1 sdc2[3](F) sdb2[1] sda2[0]
      12289600 blocks [3/2] [UU_]

unused devices: <none>


The md1 array is my / (root) partition. Since the rest of the disk slices appear to be fine, I'm going to proceed with the assumption that it was a minor glitch.

Step 0: Analyze the failure

The first sign of trouble was the (F) showing up in /proc/mdstat. Apparently I don't yet have mdadm configured in monitor mode, so it didn't e-mail me when it found the error.
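
For reference, getting those e-mails only takes a MAILADDR line in /etc/mdadm.conf plus a running monitor. A minimal sketch (the address is a placeholder; on Red Hat-style systems the mdmonitor service normally starts this for you at boot):

# In /etc/mdadm.conf -- where mdadm sends event notifications:
#   MAILADDR root@example.com

# Run the monitor as a daemon, scanning all arrays, polling every 30 minutes:
/sbin/mdadm --monitor --scan --daemonise --delay=1800

Without that in place, the evidence is in the kernel log: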

# grep "sdc2" messages
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:12:06 fw1-shimo kernel: disk 2, wo:1, o:0, dev:sdc2


The full detail from the mdadm resync:

# grep "Dec 4 09" messages | grep "md:"
Dec 4 09:08:33 fw1-shimo kernel: md: md6: sync done.
Dec 4 09:08:33 fw1-shimo kernel: md: syncing RAID array md1
Dec 4 09:08:33 fw1-shimo kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 09:08:33 fw1-shimo kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Dec 4 09:08:33 fw1-shimo kernel: md: using 128k window, over a total of 12289600 blocks.
Dec 4 09:11:31 fw1-shimo kernel: md: md1: sync done.
#


And finally, evidence from the logs showing that sdc itself was having issues:

Dec 4 09:11:34 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:34 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: (BMDMA stat 0x0)
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:35 fw1-shimo kernel: ata2: EH complete
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:37 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:50 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:51 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:52 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:52 fw1-shimo setroubleshoot: SELinux is preventing /usr/sbin/sendmail.postfix (system_mail_t) "read" to /dev/md1 (proc_mdstat_t). For complete SELinux messages. run sealert -l d5c655f4-6fc3-445b-ab9d-3b21336cb2d0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:53 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:54 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:54 fw1-shimo kernel: sd 2:0:0:0: SCSI error: return code = 0x08000002
Dec 4 09:11:54 fw1-shimo kernel: sdc: Current: sense key: Medium Error
Dec 4 09:11:54 fw1-shimo kernel: Additional sense: Unrecovered read error - auto reallocate failed
Dec 4 09:11:55 fw1-shimo kernel: end_request: I/O error, dev sdc, sector 25091744
Dec 4 09:11:55 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:55 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:55 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdb: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:56 fw1-shimo kernel: sdb: Write Protect is off
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdb: drive cache: write back
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:57 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:57 fw1-shimo kernel: counted 127, received 15
Dec 4 09:11:58 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:11:58 fw1-shimo kernel: Operation continuing on 2 devices
Dec 4 09:11:58 fw1-shimo kernel: blk: request botched
Dec 4 09:11:58 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:59 fw1-shimo kernel: counted 112, received 16
Dec 4 09:11:59 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:59 fw1-shimo kernel: blk: request botched
Dec 4 09:11:59 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:12:00 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:00 fw1-shimo kernel: counted 96, received 16
Dec 4 09:12:00 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:01 fw1-shimo kernel: blk: request botched
Dec 4 09:12:01 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:01 fw1-shimo kernel: counted 80, received 16
Dec 4 09:12:01 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:02 fw1-shimo kernel: blk: request botched
Dec 4 09:12:02 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:02 fw1-shimo kernel: counted 64, received 16
Dec 4 09:12:02 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:03 fw1-shimo kernel: blk: request botched
Dec 4 09:12:03 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:12:03 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:03 fw1-shimo kernel: counted 48, received 16
Dec 4 09:12:04 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:04 fw1-shimo kernel: blk: request botched
Dec 4 09:12:04 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:04 fw1-shimo kernel: counted 32, received 16
Dec 4 09:12:05 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:05 fw1-shimo kernel: blk: request botched
Dec 4 09:12:05 fw1-shimo kernel: ata3.00: WARNING: zero len r/w req
Dec 4 09:12:06 fw1-shimo last message repeated 5 times


Step 1: Drop the failed slice

# /sbin/mdadm /dev/md1 --fail /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md1
# /sbin/mdadm /dev/md1 --remove /dev/sdc2
mdadm: hot removed /dev/sdc2
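
At this point the array should report itself as degraded, running on two of three devices. A quick way to confirm:

# Confirm that md1 is now degraded with sdc2 gone.
/sbin/mdadm --detail /dev/md1
cat /proc/mdstat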


Step 2: Zero out the failed slice

My thinking here is that by zeroing out the failed slice, I can force the SATA disk to remap any sectors that have gone bad.

# dd if=/dev/zero of=/dev/sdc2
dd: writing to `/dev/sdc2': Input/output error
24577993+0 records in
24577992+0 records out
12583931904 bytes (13 GB) copied, 1916.7 seconds, 6.6 MB/s


Well, that's not a good sign (and the disk was clicking a bit). So I'll run smartctl and check the disk's SMART info (see Monitoring Hard Disks with SMART).

# /usr/sbin/smartctl -i -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD400LJ
Serial Number: S0H2J1KLA07831
Firmware Version: ZZ100-15
User Capacity: 400,088,457,216 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Wed Dec 5 09:43:36 2007 EST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled


However, the "-Hc" output of smartctl says the disk's health is still "PASSED" rather than "FAILING", so it's possible the disk doesn't need to be retired just yet.

# /usr/sbin/smartctl -Hc -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (7640) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 130) minutes.
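
Note the self-test execution status above (121): a previous self-test already failed on its read element. It may be worth kicking off a fresh test and pulling the logs; a short sketch using the same -d ata flag as above:

# Start a short offline self-test (about 2 minutes, per the values above).
/usr/sbin/smartctl -t short -d ata /dev/sdc

# After the recommended polling time, read the self-test results...
/usr/sbin/smartctl -l selftest -d ata /dev/sdc

# ...and the device's error log, which may show the media errors
# from the resync.
/usr/sbin/smartctl -l error -d ata /dev/sdc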


Personally, since the drive makes clicking noises and threw an I/O error during the dd wipe, I'm going to swap it out.
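
For the swap itself, the procedure is the same as in the "Failing hard drive in a Software RAID" post linked above. A minimal sketch, assuming the replacement drive enumerates as /dev/sdc again (double-check the device names before running this, since it overwrites sdc's partition table):

# Copy the partition table from a known-good member to the new disk.
/sbin/sfdisk -d /dev/sda | /sbin/sfdisk /dev/sdc

# Add the new slice back into md1; repeat for the other mdN/sdcN pairs.
/sbin/mdadm /dev/md1 --add /dev/sdc2

# Watch the rebuild.
cat /proc/mdstat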

2 comments:

Anonymous said...

Thank you for the tips. The exact same thing happened to me on a machine at work. The hard drive seems to be fine, but I'm getting some SMART errors:
Error 12 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 6e e2 42 e5

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
37 00 01 6e e2 42 e5 00 01:11:05.027 SET MAX ADDRESS EXT
27 00 00 6e e2 42 e0 00 01:11:02.946 READ NATIVE MAX ADDRESS EXT
37 00 00 6e e2 42 e5 00 01:11:02.886 SET MAX ADDRESS EXT
27 00 01 6e e2 42 e0 00 01:11:02.662 READ NATIVE MAX ADDRESS EXT
37 00 01 6e e2 42 e5 00 01:11:02.596 SET MAX ADDRESS EXT

Thomas said...

As an addendum to my original post: even though the drive clicks at times, it still hasn't failed the weekly mdadm rebuild/resync.

#!/bin/sh
# Tells mdadm to verify that the arrays are synchronized.
# This deals with the issue where a seldom-read disk block has gone bad
# by doing a daily/weekly verification of the array.

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
echo check > /sys/block/md2/md/sync_action
echo check > /sys/block/md3/md/sync_action
echo check > /sys/block/md4/md/sync_action
echo check > /sys/block/md5/md/sync_action
echo check > /sys/block/md6/md/sync_action
echo check > /sys/block/md7/md/sync_action

So the flaky drive is still in use, with the spare hard drive sitting on top of the server case, ready to be swapped in.