Bug 215635

Summary: LOR in zfs
Product: Base System Reporter: Dan Langille <dvl>
Component: kern Assignee: freebsd-fs (Nobody) <fs>
Status: New ---    
Severity: Affects Only Me    
Priority: ---    
Version: 10.3-RELEASE   
Hardware: Any   
OS: Any   

Description Dan Langille freebsd_committer freebsd_triage 2016-12-28 16:18:11 UTC
I saw this in my logs this morning.  This server is the same one which has been locking up, most nights, at about 0301 UTC.

Dec 28 03:12:20 knew kernel: lock order reversal:
Dec 28 03:12:20 knew kernel: 1st 0xfffff80418bea068 zfs (zfs) @ /usr/src/sys/kern/vfs_extattr.c:171
Dec 28 03:12:20 knew kernel: 2nd 0xfffff803f973c040 filedesc structure (filedesc structure) @ /usr/src/sys/kern/vfs_lookup.c:213
Dec 28 03:12:20 knew kernel: KDB: stack backtrace:
Dec 28 03:12:20 knew kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03b95f2fa0
Dec 28 03:12:20 knew kernel: kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe03b95f3050
Dec 28 03:12:20 knew kernel: witness_checkorder() at witness_checkorder+0xc7e/frame 0xfffffe03b95f30e0
Dec 28 03:12:20 knew kernel: _sx_slock() at _sx_slock+0x46/frame 0xfffffe03b95f3120
Dec 28 03:12:20 knew kernel: namei() at namei+0x19a/frame 0xfffffe03b95f31f0
Dec 28 03:12:20 knew kernel: vn_open_cred() at vn_open_cred+0x10d/frame 0xfffffe03b95f3350
Dec 28 03:12:20 knew kernel: zfs_setextattr() at zfs_setextattr+0x204/frame 0xfffffe03b95f3670
Dec 28 03:12:20 knew kernel: VOP_SETEXTATTR_APV() at VOP_SETEXTATTR_APV+0xa7/frame 0xfffffe03b95f36a0
Dec 28 03:12:20 knew kernel: extattr_set_vp() at extattr_set_vp+0x153/frame 0xfffffe03b95f3770
Dec 28 03:12:20 knew kernel: sys_extattr_set_file() at sys_extattr_set_file+0xf4/frame 0xfffffe03b95f39a0
Dec 28 03:12:20 knew kernel: amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe03b95f3ab0
Dec 28 03:12:20 knew kernel: Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe03b95f3ab0
Dec 28 03:12:20 knew kernel: --- syscall (356, FreeBSD ELF64, sys_extattr_set_file), rip = 0x80086637a, rsp = 0x7fffffffe668, rbp = 0x7fffffffe700 ---
Comment 1 Dan Langille freebsd_committer freebsd_triage 2016-12-28 16:29:26 UTC
In case it's relevant:

The server has 3x LSI SAS 2008 cards.

It boots off zfs, a 10-drive raidz2.

There is a second 10-drive raidz3.

8 drives from one zpool are all on one HBA.

8 drives from the other zpool are on another HBA.

Two drives from each zpool are on a third HBA (i.e. that HBA has two drives from one pool and two drives from the other).

The zroot zpool has been around for several years, but was recently moved from one box to another, getting a new M/B & new HBAs.  Shortly afterwards, the raidz3 was added to the box. Since that addition, the system has frozen most nights at about 0301.

The host has 17 jails, all of which have stock /etc/crontab entries (i.e. daily periodic runs).

By freeze, I mean:

- ssh to the server fails to connect
- login at console accepts the password but does not present a prompt, though it still responds to CTRL-T (https://twitter.com/DLangille/status/804738019857199105)
- Backup jobs that used this server as the destination continued to work.
- nagios flips out over the services provided by the jails/host
- postfix stops responding (both host and jails)
- 'tail -F /var/log/messages' in an existing ssh session continued to stream entries

refs:

* https://twitter.com/DLangille/media (lots of screenshots)
* https://twitter.com/search?q=%23frozenserver&src=typd
* http://dan.langille.org/2016/12/09/system-freezes-up-with-lots-of-sonewconn-listen-queue-overflow/
* http://dan.langille.org/2016/12/14/server-freeze-2014-12-14/
Comment 2 Dan Langille freebsd_committer freebsd_triage 2016-12-28 16:41:57 UTC
Last night, I triggered the server freeze by running 'tar -czf'.  I took a dump from that and will provide it upon request.

See the output of 'show lockedvnods' here: https://twitter.com/DLangille/status/813870668513153024