Sunday, October 03, 2010

3ware 9650SE failing RAID6 array in Linux

I'm using a 3ware 9650SE 16-port controller in my Linux server, set up as one big RAID6 unit that Linux sees as a single disk. However, it's currently operating in degraded mode due to at least one disk failure. See 3ware RAID maintenance with tw_cli for a link to the 3ware documentation PDF. See also Fixing a degraded disk on the 3ware raid (blur).

# tw_cli help

Copyright (c) 2009 AMCC
AMCC/3ware CLI (version 2.00.09.012)

Commands  Description
-------------------------------------------------------------------
show      Displays information about controller(s), unit(s) and port(s).
flush     Flush write cache data to units in the system.
rescan    Rescan all empty ports for new unit(s) and disk(s).
update    Update controller firmware from an image file.
commit    Commit dirty DCB to storage on controller(s).     (Windows only)
/cx       Controller specific commands.
/cx/ux    Unit specific commands.
/cx/px    Port specific commands.
/cx/phyx  Phy specific commands.
/cx/bbu   BBU specific commands.                               (9000 only)
/cx/ex    Enclosure specific commands.                       (9690SA only)
/ex       Enclosure specific commands.                      (9KSX/SE only)

Certain commands are qualified with constraints of controller type/model
support.  Please consult the tw_cli documentation for explanation of the
controller-qualifiers.

Type help <command> to get more details about a particular command.
For more detail information see tw_cli's documentation. 

So if we take a look at the drives installed:

# tw_cli show

Ctl   Model        (V)Ports  Drives   Units   NotOpt  RRate   VRate  BBU
------------------------------------------------------------------------
c6    9650SE-16ML  16        7        1       1       4       4      1  

The columns are as follows (a quick scripted check on them is sketched after this list):

Ports - # of drive ports on the card
Drives - # of drives connected
Units - # of RAID units created on the card
NotOpt - "not optimal"
RRate - "rebuild rate"
VRate - "verify rate"
BBU - Battery backup
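
Since NotOpt is the quickest "is anything wrong here" flag, it's worth checking it automatically. Here's a minimal sketch, assuming the column layout shown above (which may differ between tw_cli versions) and a placeholder alert address:

#!/bin/sh
# Sketch: alert if the NotOpt column of "tw_cli show" is non-zero.
# NotOpt is the 6th column in the layout above; verify for your tw_cli version.
ALERT=raid-alerts@example.com   # placeholder address
tw_cli show | awk '/^c[0-9]/ {print $1, $6}' | while read ctl notopt; do
    if [ "$notopt" != "0" ]; then
        echo "Controller $ctl reports $notopt not-optimal unit(s)" \
            | mail -s "3ware RAID warning on $(hostname)" "$ALERT"
    fi
done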

# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     4889.37   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     NOT-PRESENT      -      -           -             -
p1     NOT-PRESENT      -      -           -             -
p2     NOT-PRESENT      -      -           -             -
p3     NOT-PRESENT      -      -           -             -
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     OK               u0     698.63 GB   1465149168    5QD4####            
p7     OK               u0     698.63 GB   1465149168    3QD0####            
p8     NOT-PRESENT      -      -           -             -
p9     OK               u0     698.63 GB   1465149168    3QD0####            
p10    NOT-PRESENT      -      -           -             -
p11    OK               u0     698.63 GB   1465149168    5QD3####            
p12    OK               u0     698.63 GB   1465149168    3QD0####            
p13    OK               u0     698.63 GB   1465149168    5QD4####            
p14    OK               u0     698.63 GB   1465149168    3QD0####            
p15    NOT-PRESENT      -      -           -             -

As you can see, port 8 and port 10 have failed, which means our RAID6 array is in dire shape. After testing, one of the drives had failed completely; the other was merely suspect and was put back into the array. I did the rebuild in the BIOS, but while it's rebuilding you will see the following:

# tw_cli /c6/u0 show all
/c6/u0 status = DEGRADED-RBLD
/c6/u0 is rebuilding with percent completion = 13%(A)
/c6/u0 is not verifying, its current state is DEGRADED-RBLD
/c6/u0 is initialized.
/c6/u0 Write Cache = off
/c6/u0 volume(s) = 1
/c6/u0 name = vg6                  
/c6/u0 serial number = 5QD40L3K00005F00#### 
/c6/u0 Ignore ECC policy = off       
/c6/u0 Auto Verify Policy = off       
/c6/u0 Storsave Policy = protection  
/c6/u0 Command Queuing Policy = on        
/c6/u0 Parity Number = 2         

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-6    DEGRADED-RBLD  13%(A)  -       -     64K     4889.37   
u0-0     DISK      OK             -       -       p14   -       698.481   
u0-1     DISK      OK             -       -       p13   -       698.481   
u0-2     DISK      OK             -       -       p12   -       698.481   
u0-3     DISK      OK             -       -       p11   -       698.481   
u0-4     DISK      DEGRADED       -       -       p10   -       698.481   
u0-5     DISK      OK             -       -       p9    -       698.481   
u0-6     DISK      DEGRADED       -       -       -     -       698.481   
u0-7     DISK      OK             -       -       p7    -       698.481   
u0-8     DISK      OK             -       -       p6    -       698.481   
u0/v0    Volume    -              -       -       -     -       4889.37

Specifically, we can see that the array is 13% of the way through the rebuild after only 32 minutes. I have not yet replaced the drive on port 8, as I'm going to wait for the array to finish rebuilding before I jostle it again.
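
For the record, the 3ware CLI guide also documents kicking off a rebuild without rebooting into the BIOS; the syntax below is from memory, so double-check it against the guide for your firmware. The loop underneath just polls the progress line shown above:

# Starting the rebuild from the CLI (verify the exact syntax in the CLI guide):
#   tw_cli /c6/u0 start rebuild disk=10
#
# Poll the rebuild progress every ten minutes:
while true; do
    date
    tw_cli /c6/u0 show all | grep -i 'percent completion'
    sleep 600
done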

Notes:

I strongly recommend that you feed the output of "tw_cli /c# show" and "tw_cli /c#/u# show all" into text files daily and parse them for issues, or mail them to a monitoring email address. Being able to tell the technician to pull drive XYZ with a specific serial # helps eliminate errors, but that's hard to do if you don't keep track of your serial numbers.
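
A minimal sketch of that daily dump, assuming a made-up /reports/raid directory, a placeholder monitoring address, and an illustrative list of "unhealthy" status strings; adjust the controller and unit numbers to match yours:

#!/bin/sh
# Daily 3ware status dump (sketch); paths and address are placeholders.
CTL=c6
UNIT=u0
OUT=/reports/raid
TODAY=$(date +%Y-%m-%d)
mkdir -p "$OUT"
tw_cli /$CTL show           > "$OUT/$CTL-show-$TODAY.txt"
tw_cli /$CTL/$UNIT show all > "$OUT/$CTL-$UNIT-show-all-$TODAY.txt"
# Mail the full unit report if anything looks unhealthy.
if grep -Eqi 'DEGRADED|REBUILD|INOPERABLE' "$OUT/$CTL-$UNIT-show-all-$TODAY.txt"; then
    mail -s "3ware unit $UNIT not optimal on $(hostname)" raid-monitor@example.com \
        < "$OUT/$CTL-$UNIT-show-all-$TODAY.txt"
fi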

On the systems I administer, we have a /reports/configuration folder where we consolidate all those types of reports. Things like the output of pvscan, lvscan, df, ntpq, /proc/mdstat, etc. all get dumped into text files daily and then committed to the central SVN repository for the server with FSVS. When things go bad later, we can step back through the SVN repository and look at the various reports at previous points in time.
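
As a rough illustration of that setup (the file names and commit message are my own, and FSVS is assumed to already be configured to track /reports):

#!/bin/sh
# Sketch: dump daily configuration reports and commit them with FSVS.
DIR=/reports/configuration
mkdir -p "$DIR"
pvscan           > "$DIR/pvscan.txt"   2>&1
lvscan           > "$DIR/lvscan.txt"   2>&1
df               > "$DIR/df.txt"
ntpq -p          > "$DIR/ntpq.txt"     2>&1
cat /proc/mdstat > "$DIR/mdstat.txt"
tw_cli /c6 show  > "$DIR/tw_cli-c6.txt"
# Commit today's snapshot so we can step back through old states later.
fsvs commit -m "daily configuration reports" "$DIR"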
