RAID-Z3 causes a fatal hang on scrub/import on 9.0-BETA2/amd64. By fatal hang, I mean: (1) the hard drive LEDs freeze in a static on or off state (rather than flashing to indicate drive activity) and stay there; (2) the console no longer responds to any keypress, such as the space bar or Control-Alt-F2; (3) the system stops responding to pings entirely.

I first noticed this when I ran "zdb pool" while a "zpool scrub pool" was in progress, and the system crashed. I had assumed "zdb pool" was a read-only operation that would just give me some interesting metadata to page through. When I later tried to narrow down what was faulty, I stayed well away from that command (although in the configuration that did work correctly, RAID-Z2, "zdb pool" caused no problem). My guess is that "zdb pool" consumed too much memory and the machine crashed as a result. This was the first boot on which the machine had been up and the array had been created.

The first time I attempted "zpool import pool" after the initial creation, I could see all drives being accessed for about a minute (positive activity), but after that minute the system stalled fatally, as described above. I had tried "zpool scrub -s pool", and was only able to see the data at all by running "zpool export pool && zpool import -o readonly=on pool". When I then tried importing it read-write again, it stalled. It was not necessary for the pool to have been disconnected without a clean dismount: when I repeated the experiment with a freshly created pool (after a proper "zpool destroy" of the old one), the "zpool import" or "zpool scrub" alone triggered the fatal stall.

I sincerely hope this is helpful. I've switched to RAID-Z2 for now, unfortunately. Rest assured, I would be able to do much more rigorous testing on ZFS. If this problem is confirmed and fixed by 9.0, I can offer to help uncover more bugs with a debug kernel enabled. In the meantime I need to move forward.

Fix: Unknown. I can confirm that if I use RAID-Z2 and run many "zpool import" and "zpool export" commands back to back, as well as "zpool scrub", there is no problem at all.

How-To-Repeat:

zpool create -O checksum=sha256 -O compression=gzip-9 pool raidz3 gpt/foo*.eli
zfs create -o checksum=sha256 -o compression=gzip-9 -o copies=3 pool/pond
zpool scrub pool
# or: zpool export pool && zpool import pool

(Both of these seem to trigger the fatal stall described above.)

The following conditions may or may not be relevant; I don't have the resources or time to check. But: (1) the drives are 3TB each; (2) I partitioned each drive with GPT as one large labelled partition with 99% of the capacity allocated to it; (3) I am using geli on that large partition. If these factors appear to be contributing, note that when I create a RAID-Z2 pool instead of RAID-Z3, there is no problem at all. I can also confirm that the entirety of each drive is accessible, since I did a full dd across the whole drive (partition table, metadata and all), so it is not a matter of the kernel not seeing the drive size properly. In any case I would expect a graceful error from the kernel rather than this kind of stall. I haven't attempted to get past the stall itself, for example with kernel debugging, but the reproducibility of the problem suggests that may not be necessary.
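For completeness, here is a rough sketch of how each drive was prepared and how the pool was exercised afterwards. The device name ada0, the gpt/foo* labels, and the geli parameters below are illustrative placeholders rather than the exact invocations used; only the zpool/zfs lines are taken verbatim from How-To-Repeat above.

# Per-drive preparation (repeat for each disk; ada0/foo0 are placeholders)
gpart create -s gpt ada0
gpart add -t freebsd-zfs -l foo0 ada0    # one large labelled partition
geli init -s 4096 /dev/gpt/foo0          # geli parameters are illustrative
geli attach /dev/gpt/foo0                # yields /dev/gpt/foo0.eli

# Pool and filesystem creation, as in How-To-Repeat
zpool create -O checksum=sha256 -O compression=gzip-9 pool raidz3 gpt/foo*.eli
zfs create -o checksum=sha256 -o compression=gzip-9 -o copies=3 pool/pond

# Stress cycle used to confirm RAID-Z2 does not hang; with RAID-Z3 a single
# import or scrub is enough to trigger the stall.
i=0
while [ $i -lt 5 ]; do
    zpool export pool
    zpool import pool
    i=$((i + 1))
done
zpool scrub pool
zpool status pool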
Responsible Changed From-To: freebsd-bugs->freebsd-fs
Over to maintainer(s).
The following kernel trace may be of relevance (console output shown below). I receive this on 9.0-BETA3.

lock order reversal:
 1st 0xfffffe000656c278 zfs (zfs) @ /usr/src/sys/kern/vfs_vnops.c:618
 2nd 0xfffffe017795e098 ufs (ufs) @ /usr/src/sys/kern/vfs_subr.c:2134
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x807
__lockmgr_args() at __lockmgr_args+0x109c
ffs_lock() at ffs_lock+0x8c
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x47
vget() at vget+0x7b
vm_fault_hold() at vm_fault_hold+0x1976
trap_pfault() at trap_pfault+0x118
trap() at trap+0x39b
calltrap() at calltrap+0x8
--- trap 0xc, rip = 0xffffffff80b0aa8d, rsp = 0xffffff82331b5640, rbp = 0xffffff82331b56a0 ---
copyin() at copyin+0x3d
zfs_freebsd_write() at zfs_freebsd_write+0x46f
VOP_WRITE_APV() at VOP_WRITE_APV+0x103
vn_write() at vn_write+0x2a2
dofilewrite() at dofilewrite+0x85
kern_writev() at kern_writev+0x6c
sys_write() at sys_write+0x55
amd64_syscall() at amd64_syscall+0x3ba
Xfast_syscall() at Xfast_syscall+0xf7
--- syscall (4, FreeBSD ELF64, sys_write), rip = 0x80094533c, rsp = 0x7fffffffd9d8, rbp = 0x80065b000 ---
For bugs matching the following criteria:
Status: In Progress; Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped.
^Triage: I'm sorry that this PR did not get addressed in a timely fashion. By now, the version that it was created against is long out of support. Please re-open if it is still a problem on a supported version.