Concurrent zfs command invocations may lead to a race / subsystem lockup. For instance, this is the current state, which has not changed for at least 30 minutes (the system got into it after issuing concurrent zfs commands):

===Cut===
[root@san1:~]# ps ax | grep zfs
    9  -  DL     7:41,34 [zfskern]
57922  -  Is     0:00,01 sshd: zfsreplica [priv] (sshd)
57924  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
57925  -  Is     0:00,00 csh -c zfs list -t snapshot
57927  -  D      0:00,00 zfs list -t snapshot
58694  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
58695  -  D      0:00,00 /sbin/zfs list -t all
59512  -  Is     0:00,02 sshd: zfsreplica [priv] (sshd)
59516  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
59517  -  Is     0:00,00 csh -c zfs list -t snapshot
59520  -  D      0:00,00 zfs list -t snapshot
59552  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59553  -  D      0:00,00 /sbin/zfs list -t all
59554  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59555  -  D      0:00,00 /sbin/zfs list -t all
59556  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59557  -  D      0:00,00 /sbin/zfs list -t all
59558  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59559  -  D      0:00,00 /sbin/zfs list -t all
59560  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59561  -  D      0:00,00 /sbin/zfs list -t all
59564  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59565  -  D      0:00,00 /sbin/zfs list -t all
59570  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59571  -  D      0:00,00 /sbin/zfs list -t all
59572  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59573  -  D      0:00,00 /sbin/zfs list -t all
59574  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
59575  -  D      0:00,00 /sbin/zfs list -t all
59878  -  Is     0:00,02 sshd: zfsreplica [priv] (sshd)
59880  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
59881  -  Is     0:00,00 csh -c zfs list -t snapshot
59883  -  D      0:00,00 zfs list -t snapshot
60800  -  Is     0:00,01 sshd: zfsreplica [priv] (sshd)
60806  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
60807  -  Is     0:00,00 csh -c zfs list -t snapshot
60809  -  D      0:00,00 zfs list -t snapshot
60917  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
60918  -  D      0:00,00 /sbin/zfs list -t all
60950  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
60951  -  D      0:00,00 /sbin/zfs list -t all
60966  -  Is     0:00,02 sshd: zfsreplica [priv] (sshd)
60968  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
60969  -  Is     0:00,00 csh -c zfs list -t snapshot
60971  -  D      0:00,00 zfs list -t snapshot
61432  -  Is     0:00,03 sshd: zfsreplica [priv] (sshd)
61434  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
61435  -  Is     0:00,00 csh -c zfs list -t snapshot
61437  -  D      0:00,00 zfs list -t snapshot
61502  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61503  -  D      0:00,00 /sbin/zfs list -t all
61504  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61505  -  D      0:00,00 /sbin/zfs list -t all
61506  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61507  -  D      0:00,00 /sbin/zfs list -t all
61508  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61509  -  D      0:00,00 /sbin/zfs list -t all
61510  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61511  -  D      0:00,00 /sbin/zfs list -t all
61512  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61513  -  D      0:00,00 /sbin/zfs list -t all
61569  -  I      0:00,01 /usr/local/bin/sudo /sbin/zfs list -t all
61570  -  D      0:00,00 /sbin/zfs list -t all
61851  -  Is     0:00,02 sshd: zfsreplica [priv] (sshd)
61853  -  I      0:00,00 sshd: zfsreplica@notty (sshd)
61854  -  Is     0:00,00 csh -c zfs list -t snapshot
61856  -  D      0:00,00 zfs list -t snapshot
57332  7  D+     0:00,04 zfs rename data/esx/boot-esx03 data/esx/boot-esx03_orig
58945  8  D+     0:00,00 zfs list
62119  3  S+     0:00,00 grep zfs
[root@san1:~]# ps ax | grep ctladm
62146  3  S+     0:00,00 grep ctladm
[root@san1:~]#
===Cut===

This seems to be the operation that locks up the system:

zfs rename data/esx/boot-esx03 data/esx/boot-esx03_orig

The dataset info:

===Cut===
# zfs get all data/esx/boot-esx03
NAME                 PROPERTY              VALUE                  SOURCE
data/esx/boot-esx03  type                  volume                 -
data/esx/boot-esx03  creation              Wed Aug  2 15:48 2017  -
data/esx/boot-esx03  used                  8,25G                  -
data/esx/boot-esx03  available             9,53T                  -
data/esx/boot-esx03  referenced            555M                   -
data/esx/boot-esx03  compressratio         1.06x                  -
data/esx/boot-esx03  reservation           none                   default
data/esx/boot-esx03  volsize               8G                     local
data/esx/boot-esx03  volblocksize          8K                     default
data/esx/boot-esx03  checksum              on                     default
data/esx/boot-esx03  compression           lz4                    inherited from data
data/esx/boot-esx03  readonly              off                    default
data/esx/boot-esx03  copies                1                      default
data/esx/boot-esx03  refreservation        8,25G                  local
data/esx/boot-esx03  primarycache          all                    default
data/esx/boot-esx03  secondarycache        all                    default
data/esx/boot-esx03  usedbysnapshots       0                      -
data/esx/boot-esx03  usedbydataset         555M                   -
data/esx/boot-esx03  usedbychildren        0                      -
data/esx/boot-esx03  usedbyrefreservation  7,71G                  -
data/esx/boot-esx03  logbias               latency                default
data/esx/boot-esx03  dedup                 off                    inherited from data/esx
data/esx/boot-esx03  mlslabel                                     -
data/esx/boot-esx03  sync                  standard               default
data/esx/boot-esx03  refcompressratio      1.06x                  -
data/esx/boot-esx03  written               555M                   -
data/esx/boot-esx03  logicalused           586M                   -
data/esx/boot-esx03  logicalreferenced     586M                   -
data/esx/boot-esx03  volmode               dev                    inherited from data
data/esx/boot-esx03  snapshot_limit        none                   default
data/esx/boot-esx03  snapshot_count        none                   default
data/esx/boot-esx03  redundant_metadata    all                    default
===Cut===

Since the dataset is only 8G, it is unlikely that the rename should take this long, especially considering the disks are idle. This happened twice in a row, and as a result all zfs/zpool commands stopped working. I manually panicked the system to obtain crashdumps. The crashdumps are located here: http://san1.linx.playkey.net/r332096M/ along with a brief description and the full kernel/module binaries. Please note that vmcore.0 is from another panic; the crashdumps for this lockup are number 1 (unfortunately, no txt files were saved) and number 2.
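For reference, here is a minimal sketch of the kind of concurrent workload that was running when the hang appeared. The loop counts and the sh wrapper are my own construction; only the dataset names and the individual zfs commands are taken from the output above.

===Cut===
#!/bin/sh
# Sketch: run several read-only zfs commands in parallel with a rename,
# mirroring what the replication jobs and the manual rename were doing.

# start a few background readers, remembering their PIDs
pids=""
for i in 1 2 3 4 5; do
    ( while :; do zfs list -t snapshot >/dev/null 2>&1; done ) &
    pids="$pids $!"
done

# the rename that raced with the readers; when it wedges, the process
# stays in state D in "ps ax" and never returns
zfs rename data/esx/boot-esx03 data/esx/boot-esx03_orig &&
zfs rename data/esx/boot-esx03_orig data/esx/boot-esx03

# stop the background readers
kill $pids
===Cut===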
Got another lockup and induced a manual panic; the crashdump is available via the URL above, number 5 (vmcore.5.gz).
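For anyone reproducing this, the collection procedure is the standard FreeBSD one, roughly as sketched below; the dump partition name is a placeholder for whatever dumpon(8) is pointed at locally.

===Cut===
# configure a dump device if one is not already set up
dumpon /dev/ada0p3            # placeholder partition; see dumpon(8)

# force an immediate kernel panic; the box reboots afterwards
sysctl debug.kdb.panic=1

# on the next boot, savecore(8) stores vmcore.N / info.N here
ls /var/crash
===Cut===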
Is this still a problem?
Yup, but this particular PR is a duplicate of either https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229958 (or vice versa) or https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226499. It's highly reproducible: roughly 1 in 5 zfs renames leads to a lockup. I plan to schedule a maintenance window on our production site, reproduce this, and get the procstat -kk output of the zfs rename process that you requested in 226499.
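Roughly what I intend to run while the rename is wedged, before panicking the box (the pgrep pattern and output paths here are just illustrative):

===Cut===
# kernel stack of the wedged rename process
pid=$(pgrep -f "zfs rename")
procstat -kk $pid > /var/tmp/zfs-rename-procstat.txt

# kernel stacks of everything else touching ZFS, for context
procstat -kk -a | grep -i zfs > /var/tmp/zfs-all-procstat.txt
===Cut===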
So maybe let's close this bug and continue in one of the more detailed / better-researched ones?
Yup, that seems reasonable.
*** This bug has been marked as a duplicate of bug 226499 ***