Wednesday, January 28, 2009

Removing a failed, non-existent drive from Software RAID

So, a drive has failed, you've replaced it on the fly (using hot-swap SATA), and now you need to remove the old RAID slice from the array.

For example:

md0 : active raid1 sdi1[0] sdc1[2] sdb1[3](F) sda1[1]
264960 blocks [3/3] [UUU]


In this case, sdb1 is marked as failed (F), and sdi1 is the slice from the newly added drive (via SATA hot-plug). So we want to remove sdb1 with mdadm's remove command:

# mdadm /dev/md0 --remove /dev/sdb1
mdadm: cannot find /dev/sdb1: No such file or directory


Oops, we can't do that: we already swapped out the failed drive (sdb), so the /dev/sdb1 device node no longer exists.
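
To spot this situation before running into the error, a small shell sketch along these lines may help. It is only an illustration: `list_detached_failed` is a made-up helper name, it assumes mdstat-style text on stdin, and the optional directory argument (default `/dev`) is there mainly so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: print "array member" for each member flagged (F)
# in mdstat-style input whose device node is missing under $1 (default /dev).
list_detached_failed() {
    dev_dir="${1:-/dev}"
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                member = $i
                sub(/\[[0-9]+\]\(F\)$/, "", member)   # sdb1[3](F) -> sdb1
                print $1, member
            }
    }' | while read -r array member; do
        # A missing node means "--remove <node>" cannot find the device.
        [ -e "$dev_dir/$member" ] || echo "$array $member"
    done
}

# On a live system: list_detached_failed < /proc/mdstat
```

Any line it prints names a failed slice that can no longer be removed by device name.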

The answer is found in the mdadm man page for the remove feature:

-r, --remove remove listed devices. They must not be active. i.e. they should be failed or spare devices. As well as the name of a device file (e.g. /dev/sda1) the words failed and detached can be given to --remove. The first causes all failed device to be removed. The second causes any device which is no longer connected to the system (i.e an open returns ENXIO) to be removed. This will only succeed for devices that are spares or have already been marked as failed.

So instead of specifying the name of the failed RAID slice, we should use the following command:

# mdadm /dev/md0 -r detached  
mdadm: hot removed 8:17


And there you have it: the failed RAID slice that is no longer connected to the system has been removed. It will not show up in "/proc/mdstat" any more.
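
If more than one array is affected, the same removal can be scripted. The following is a minimal sketch rather than a tested tool: `remove_detached` and the `DRY_RUN` variable are names invented here, `/dev/md1` is a hypothetical second array, and `--remove detached` is the long form of the `-r detached` invocation shown above.

```shell
# Hypothetical wrapper: run "mdadm <array> --remove detached" for each array
# given; with DRY_RUN=1 the commands are only printed, not executed.
remove_detached() {
    for array in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "mdadm $array --remove detached"
        else
            mdadm "$array" --remove detached
        fi
    done
}

# Preview first, then run for real (needs root):
#   DRY_RUN=1 remove_detached /dev/md0 /dev/md1
#   remove_detached /dev/md0 /dev/md1
```

Previewing with `DRY_RUN=1` first is a cheap way to confirm the array names before touching anything.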

2 comments:

GabrieleV said...

This has driven me crazy! Thank you!

Simon Field said...

Wonderful - I knew there had to be a way! Many thanks.