Replacing a (silently) failing disk in a ZFS pool

Maybe I can’t read, but I have the feeling that official documentation explains every single corner case for a given tool, except the one you will actually need. Today’s struggle: replacing a disk in a FreeBSD ZFS pool.

What? There’s a shitton of docs on this topic! Are you stupid?

I don’t know, maybe. Yet none of them covered the process in a simple, straightforward and complete manner. Here’s the story:

My personal FreeBSD NAS had felt sluggish since yesterday, and this morning I saw these horrible messages popping up in my syslog console:

Jul  2 12:49:53 <kern.crit> newcoruscant kernel: ahcich1: Timeout on slot 8 port 0
Jul 2 12:49:53 <kern.crit> newcoruscant kernel: ahcich1: is 00000000 cs 00000000 ss 00000300 rs 00000300 tfd 40 serr 00000000 cmd 0000c917
Jul 2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 50 25 e9 40 3b 00 00 00 00 00
Jul 2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Jul 2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): Retrying command
Jul 2 12:51:02 <kern.crit> newcoruscant kernel: cant/memory/memory-inactive: ds[0] = 52350976.000000
Jul 2 12:51:02 <kern.crit> newcoruscant kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)

Yeah… that bad.
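To get a quick idea of which disk is misbehaving, the timeouts can be counted per device. Here is a minimal sketch, assuming the log format shown above; the function name count_cam_timeouts and the log path in the usage comment are my own assumptions, not something from the FreeBSD docs:

```sh
#!/bin/sh
# Count "CAM status: Command timeout" lines per device in a kernel log read
# from stdin. Assumes the FreeBSD format shown above, e.g.
#   (ada1:ahcich1:0:0:0): CAM status: Command timeout
count_cam_timeouts() {
    awk '
        /CAM status: Command timeout/ {
            # Find the "(device:bus:...)" token and keep only the device name.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\(/) {
                    split(substr($i, 2), parts, ":")
                    hits[parts[1]]++
                    break
                }
            }
        }
        END { for (dev in hits) print dev, hits[dev] }
    '
}

# Usage (the log path is an assumption, adjust to your syslog setup):
#   count_cam_timeouts < /var/log/messages
```

A device that shows up here again and again is a good candidate for a closer look with smartctl.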

The first thing that struck me was that ZFS seemed perfectly fine with that:

root@newcoruscant:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 2h26m with 0 errors on Tue Jun 25 12:08:56 2019
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
            ada2p4  ONLINE       0     0     0

errors: No known data errors

But the input/output error thrown by smartctl -a /dev/ada1 made things clear: I needed to replace this disk quickly!
Thanks to past-me, there was already a disk ready for this task at ada3. So, after trustingly reading the zpool administration guide, and in particular Replacing a Functioning Device, I entered:

# zpool replace zroot ada1p4 ada3p4

Except it didn’t run as expected:

cannot open 'ada3p4': no such GEOM provider
must be a full path or shorthand device name

What a fantastic and explicit error message just to say that ada3 doesn’t have a partition table, and therefore no ada3p4.
I am no FreeBSD guru, only a very occasional user, so no, I am not used to GEOM, gpart, GELI, etc. Finally, this very well written Stack Exchange post showed me how to replicate the correct partition table onto the new disk:

# gpart backup ada0|gpart restore -F ada3

Now zpool replace zroot ada1p4 ada3p4 would work! I also did not forget to copy the boot code to the new disk, as instructed by both the documentation and zpool:

# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3 
partcode written to ada3p1
bootcode written to ada3
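In hindsight, the "no such GEOM provider" error could have been caught before calling zpool replace by checking that the target partition actually has a device node. A minimal sketch, assuming providers show up under /dev; provider_exists and the DEV_DIR override are hypothetical names of mine, the latter only there so the check can be tried without a real disk:

```sh
#!/bin/sh
# Return 0 if the named provider has a device node, 1 otherwise.
# DEV_DIR is a hypothetical override used only to make the check testable;
# it defaults to /dev.
provider_exists() {
    [ -e "${DEV_DIR:-/dev}/$1" ]
}

# Usage: only attempt the replace once the target partition exists.
#   provider_exists ada3p4 && zpool replace zroot ada1p4 ada3p4
```

This only tells you that the partition exists, not that it has the right size or type, so gpart show is still worth a look.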

And at last the resilvering was taking place:

root@newcoruscant:~ # zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 2 11:21:24 2019
        3.91M scanned out of 1.84T at 38.5K/s, (scan is slow, no estimated time)
        1.30M resilvered, 0.00% done
config:

        NAME             STATE     READ WRITE CKSUM
        zroot            ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            ada0p4       ONLINE       0     0     0
            replacing-1  ONLINE       0     0     0
              ada1p4     ONLINE       0     0     0
              ada3p4     ONLINE       0     0     0
            ada2p4       ONLINE       0     0     0

errors: No known data errors

But… at less than 40K/s! Turns out that, very logically, the failing disk and its timeouts were slowing down the resilvering. So I learned that to avoid this kind of situation, you should offline the failing disk from the zpool:

# zpool offline zroot ada1p4

And then:

$ sudo zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 2 16:01:22 2019
        514G scanned out of 1.84T at 167M/s, 2h20m to go
        170G resilvered, 27.22% done
config:

        NAME                        STATE     READ WRITE CKSUM
        zroot                       DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            ada0p4                  ONLINE       0     0     0
            replacing-1             DEGRADED     0     0     8
              15084350875675872541  OFFLINE      0     0     0  was /dev/ada1p4
              ada3p4                ONLINE       0     0     0
            ada2p4                  ONLINE       0     0     0

errors: No known data errors

Much better. Once the resilvering finished, everything was working correctly again:

$ sudo zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 628G in 2h52m with 0 errors on Tue Jul 2 18:53:48 2019
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0
            ada3p4  ONLINE       0     0     0
            ada2p4  ONLINE       0     0     0

errors: No known data errors
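While waiting for a resilver, the progress figure can be pulled out of zpool status instead of reading the whole report each time. A minimal sketch, assuming the "resilvered, N% done" line format seen in the outputs above (other ZFS versions may print it differently); resilver_progress is a name I made up:

```sh
#!/bin/sh
# Print the "% done" figure from `zpool status` output read on stdin.
# Matches the in-progress line, e.g. "170G resilvered, 27.22% done".
# Prints nothing when no resilver is running.
resilver_progress() {
    awk '/resilvered,/ && /done/ { print $3; exit }'
}

# Usage:
#   zpool status zroot | resilver_progress
```

The final report says "resilvered 628G in 2h52m" without the comma, so the function stays silent once the job has finished.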

I read that you should zpool remove the failing disk at the end of this operation, but when I tried to do so:

root@newcoruscant:~ # zpool remove zroot ada1p4
cannot remove ada1p4: no such device in pool
root@newcoruscant:~ # zpool remove zroot 15084350875675872541
cannot remove 15084350875675872541: no such device in pool

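So I guess zpool did it itself: as far as I understand, zpool replace detaches the old device automatically once the resilver completes.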
Now it’s time to buy and add a new spare for the next disk that fails…
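
As a reminder for next time, here is the whole procedure as a dry run that only prints the commands, with the failing disk taken offline first so it cannot slow down the resilver. This is a sketch of an ordering the story suggests, not one tested here as a whole; plan_disk_replace is a hypothetical helper, and the hard-coded -i 1 assumes the boot partition is the first one, as on my disks:

```sh
#!/bin/sh
# Print (do not run) the replacement steps, failing disk offlined first.
# Arguments: pool, failing partition, healthy reference disk, new disk,
# index of the ZFS partition on the new disk.
plan_disk_replace() {
    pool=$1 old_part=$2 ref_disk=$3 new_disk=$4 idx=$5
    echo "zpool offline $pool $old_part"
    echo "gpart backup $ref_disk | gpart restore -F $new_disk"
    # -i 1 assumes the freebsd-boot partition is partition 1.
    echo "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $new_disk"
    echo "zpool replace $pool $old_part ${new_disk}p${idx}"
}

# Values taken from the story above.
plan_disk_replace zroot ada1p4 ada0 ada3 4
```

Printing instead of executing leaves room to check every device name before anything touches the pool.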