Summary: panic: softdep_deallocate_dependencies: dangling deps
Product: Base System
Component: kern
Severity: Affects Some People
Reporter: Deepak Ukey <deepak.ukey>
Assignee: freebsd-fs (Nobody) <fs>
CC: babupalit, cem, charles.orbello, deepak.ukey, emaste, imp, kib, mckusick, vangyzen
Description Deepak Ukey 2018-01-24 09:38:16 UTC
We are running the test case below on a SmartRAID controller. The test triggers a kernel panic on FreeBSD 11, and the dump points to file system functions.

Test case: We installed FreeBSD 11 on a drive attached to a Microsemi storage controller in a Supermicro X10 system. We created a partition with gpart on a logical volume behind the Microsemi storage controller and started a write operation with dd. During the write, we removed the logical volume. The logical volume was destroyed, but the kernel panicked. The crash dump is below. The same issue is also observed on FreeBSD 10.

-------------------------------------------------------------------------
g_vfs_done():da5p1[READ(offset=324971036672, length=32768)]error = 6
(da5:smartpqi0:0:5:0): Periph destroyed
fsync: giving up on dirty
0xfffff801058a2b10: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 103 mountedhere 0xfffff80010e84c00
    flags (VI_DOOMED|VI_ACTIVE)
    v_object 0xfffff8010fa41960 ref 0 pages 817
    cleanbuf 98 dirtybuf 2
    lock type devfs: EXCL by thread 0xfffff8000b000000 (pid 778, login, tid 100169)
    dev da5p1
panic: softdep_deallocate_dependencies: dangling deps
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80aada97 at kdb_backtrace+0x67
#1 0xffffffff80a6bb76 at vpanic+0x186
#2 0xffffffff80a6b9e3 at panic+0x43
#3 0xffffffff80d14646 at softdep_deallocate_dependencies+0x76
#4 0xffffffff80b09435 at brelse+0x165
#5 0xffffffff80b27551 at flushbuflist+0x131
#6 0xffffffff80b2700b at bufobj_invalbuf+0x9b
#7 0xffffffff80b2a10e at vgonel+0x17e
#8 0xffffffff80b2a720 at vgone+0x40
#9 0xffffffff80934968 at devfs_delete+0x1d8
#10 0xffffffff80934e3a at devfs_populate_loop+0x21a
#11 0xffffffff80934c0a at devfs_populate+0x2a
#12 0xffffffff80939e1b at devfs_populate_vp+0x9b
#13 0xffffffff80937e0c at devfs_lookup+0x2c
#14 0xffffffff8104b853 at VOP_LOOKUP_APV+0x83
#15 0xffffffff80b1d0c1 at lookup+0x701
#16 0xffffffff80b1c576 at namei+0x486
#17 0xffffffff80b331fa at kern_chflagsat+0x9a
-------------------------------------------------------------------------

Could you please help us resolve this problem?

PS: If the same test case is executed with FIO (without any partition), it runs fine.
Comment 1 Kirk McKusick 2018-01-24 22:10:39 UTC
Created attachment 190043
patch for 11-CURRENT

Please try the attached patch (which is relative to 11-CURRENT).
Comment 2 commit-hook 2018-01-26 18:18:03 UTC
A commit references this bug:

Author: mckusick
Date: Fri Jan 26 18:17:11 UTC 2018
New revision: 328444
URL: https://svnweb.freebsd.org/changeset/base/328444

Log:
  For many years the message "fsync: giving up on dirty" has occasionally appeared on UFS/FFS filesystems. In some cases it was promptly followed by a panic of "softdep_deallocate_dependencies: dangling deps". This fix should eliminate both of these occurrences.

  Submitted by: Andreas Longwitz <longwitz at incore.de>
  Reviewed by: kib
  Tested by: Peter Holm (pho)
  PR: 225423
  MFC after: 1 week

Changes:
  head/sys/kern/vfs_default.c
Comment 3 Deepak Ukey 2018-01-29 13:06:17 UTC
(In reply to Kirk McKusick from comment #1)
Thanks for the patch. I applied it and ran the same test, but the system still crashes. The crash dump is below.

Device da0p1 went missing before all of the data could be written to it; expect data loss.
panic: devfs_fsync: vop_stdfsync failed.
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b24077 at kdb_backtrace+0x67
#1 0xffffffff80ad93e2 at vpanic+0x182
#2 0xffffffff80ad9253 at panic+0x43
#3 0xffffffff80985bdf at devfs_fsync+0x8f
#4 0xffffffff8110cbcd at VOP_FSYNC_APV+0x8d
#5 0xffffffff80bb2dfe at sched_sync+0x3be
#6 0xffffffff80a90055 at fork_exit+0x85
#7 0xffffffff80f847fe at fork_trampoline+0xe
Uptime: 6m4s

Thanks,
Deepak
Comment 4 Kirk McKusick 2018-02-01 01:36:16 UTC
Created attachment 190238
10-current patch for lost disk panic

Please try this patch on your 10-current system.
Comment 5 Kirk McKusick 2018-02-06 19:29:07 UTC
Fixed with -r328444 and -r328643. MFC to 11 with -r328764 and -r328944. MFC to 10 with -r328765 and -r328946.
Comment 6 Deepak Ukey 2018-02-07 08:52:55 UTC
(In reply to Kirk McKusick from comment #4)
Hi, I applied the patch you provided on top of your first patch, but I am still seeing the crash. The crash dump is below. This time the system ran for a few seconds before it crashed.

panic: softdep_deallocate_dependencies: dangling deps
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80aada97 at kdb_backtrace+0x67
#1 0xffffffff80a6bb76 at vpanic+0x186
#2 0xffffffff80a6b9e3 at panic+0x43
#3 0xffffffff80d14646 at softdep_deallocate_dependencies+0x76
#4 0xffffffff80b09435 at brelse+0x165
#5 0xffffffff80b27551 at flushbuflist+0x131
#6 0xffffffff80b2700b at bufobj_invalbuf+0x9b
#7 0xffffffff80b2a10e at vgonel+0x17e
#8 0xffffffff80b2a720 at vgone+0x40

Thanks,
Deepak
Comment 7 Kirk McKusick 2018-02-10 18:27:05 UTC
(In reply to Deepak Ukey from comment #6) Please let me know what version of the system you are running (uname -a) so I can prepare a patch for you to test.
Comment 8 Deepak Ukey 2018-02-12 04:52:16 UTC
(In reply to Kirk McKusick from comment #7)
Hi, I am using the following version:

FreeBSD pmc 11.0-RELEASE-p1 FreeBSD 11.0-RELEASE-p1 #0: Mon Jan 29 09:08:26 IST 2018 root@pmc:/usr/obj/usr/src/sys/GENERIC amd64

Regards,
Deepak
Comment 9 Kirk McKusick 2018-02-12 18:42:07 UTC
(In reply to Deepak Ukey from comment #8)
What are you doing to cause this panic? More specifically, what do I need to do to reproduce it? Based on the backtrace, it appears that you are removing a disk from an active filesystem.
Comment 10 Deepak Ukey 2018-02-13 05:06:14 UTC
(In reply to Kirk McKusick from comment #9)
Hi,

Setup details: I have one SATA drive (/dev/da0) attached to the Microsemi storage controller, and my OS is on a drive attached to the on-board AHCI controller.

Steps to reproduce the issue:
1) Create the GPT partitioning scheme on the drive: # gpart create -s gpt /dev/da0
2) Create the partition: # gpart add -t freebsd-ufs -l gpusrfs -a <512k> /dev/da0
3) Format it: # newfs -U /dev/da0p1
4) Create the mount directory: # mkdir /mnt/<Directory>
5) Mount the drive partition: # mount /dev/<da0p1> /mnt/<Directory>
7) Write data using the dd command: dd if=/dev/zero of=test bs=10240 count=100000
8) While the dd from step 7 is running, remove the drive attached to the Microsemi storage controller.

Please let me know if you have any questions.

Thanks,
Deepak
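The steps above can be sketched as a small dry-run helper that only prints the command sequence, so it can be reviewed before anything is run as root on a FreeBSD test machine. This is a hedged illustration, not part of the report: the function names, the default disk name, and the mount directory /mnt/test are placeholders, and writing the dd output file under the mount point is an assumption (the report uses a relative of=test and has no step 6).

```python
import shlex


def repro_commands(disk="da0", mount_dir="/mnt/test"):
    """Return the reproduction commands as argument lists (nothing is executed).

    `disk` and `mount_dir` are placeholders; the report leaves the mount
    directory unspecified.
    """
    dev = f"/dev/{disk}"
    part = f"{dev}p1"
    return [
        ["gpart", "create", "-s", "gpt", dev],
        ["gpart", "add", "-t", "freebsd-ufs", "-l", "gpusrfs", "-a", "512k", dev],
        ["newfs", "-U", part],
        ["mkdir", mount_dir],
        ["mount", part, mount_dir],
        # Assumption: the test file lives on the newly mounted filesystem.
        ["dd", "if=/dev/zero", f"of={mount_dir}/test", "bs=10240", "count=100000"],
    ]


def render(commands):
    """Join each argument list into one shell-quoted line."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(render(repro_commands()))
```

The final step, pulling the drive while dd is still writing, is a physical action and is deliberately left out of the helper.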
Comment 11 Deepak Ukey 2018-02-27 06:59:52 UTC
(In reply to Deepak Ukey from comment #10)
(In reply to Kirk McKusick from comment #9)
Hi Kirk,
Are you able to reproduce the issue? Is there any update on it?

Thanks,
Deepak
Comment 12 Kirk McKusick 2018-02-27 07:16:30 UTC
Yes, I can easily reproduce it now that I know what you are doing. The filesystem has never been able to handle the removal of disks while running. Handling the removal or failure of disks is a project that we have started working on as part of ``hardening'' the filesystem. It requires an overhaul through not only the filesystem, but all the way down the I/O stack. We are at least a year from having this component of the hardening working.
Comment 13 Eric van Gyzen 2018-11-15 21:02:41 UTC
I get the same panic on recent head (r340361) when force-unmounting /dev. Does that fall under the same "hardening" umbrella, or should it be treated differently?

# mount -t ufs
/dev/gpt/scratch on /scratch (ufs, local, soft-updates)
# echo hello > /scratch/hello
# umount -f /dev
fsync: giving up on dirty (error = 35)
0xfffff8002d19b000: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 7 rdev 0xfffff800035a2c00
    flags (VI_DOOMED|VI_ACTIVE)
    v_object 0xfffff8002d9e6600 ref 0 pages 15
    cleanbuf 3 dirtybuf 1
    lock type devfs: EXCL by thread 0xfffff8002d014580 (pid 729, umount, tid 100397)
    dev gpt/scratch
panic: softdep_deallocate_dependencies: dangling deps
panic() at panic+0x43/frame 0xfffffe001beed4d0
softdep_deallocate_dependencies() at softdep_deallocate_dependencies+0x76/frame 0xfffffe001beed4f0
brelse() at brelse+0x176/frame 0xfffffe001beed540
flushbuflist() at flushbuflist+0x147/frame 0xfffffe001beed5a0
bufobj_invalbuf() at bufobj_invalbuf+0x9f/frame 0xfffffe001beed600
vgonel() at vgonel+0x15e/frame 0xfffffe001beed670
vflush() at vflush+0x22c/frame 0xfffffe001beed7c0
devfs_unmount() at devfs_unmount+0x43/frame 0xfffffe001beed800
dounmount() at dounmount+0x4b1/frame 0xfffffe001beed860
sys_unmount() at sys_unmount+0x310/frame 0xfffffe001beed980
amd64_syscall() at amd64_syscall+0x278/frame 0xfffffe001beedab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe001beedab0
--- syscall (22, FreeBSD ELF64, sys_unmount), rip = 0x8002d6a3a, rsp = 0x7fffffffc628, rbp = 0x7fffffffcb60 ---
Comment 14 Eric van Gyzen 2018-11-15 21:10:28 UTC
I realize "umount -f /dev" is insane under normal operation. I'm wondering if the resulting panic indicates a problem that can be hit by other, more normal operations.
Comment 15 Kirk McKusick 2018-11-15 21:24:06 UTC
(In reply to Eric van Gyzen from comment #14) Indeed `umount -f /dev' is insane. The /dev filesystem does not use UFS or soft updates, so I assume that you are getting the panic because of interaction with /dev/gpt/scratch which is a UFS filesystem. To the extent that the problem is with /dev/gpt/scratch then yes, the hardening work that we are doing should resolve that panic. The hardening work is proceeding, albeit slowly...
Comment 16 Konstantin Belousov 2018-11-16 08:55:25 UTC
(In reply to Kirk McKusick from comment #15) Unmounting /dev reclaims the devfs vnodes, in particular, the devvp which is referenced by the mount. Currently most of the UFS io bypasses devfs vnode layer, but this is not an architectural property. We cannot guarantee that UFS operations would not require a VOP call into devfs which is not grounded by a panic (see vfs_default.c).
Comment 17 Kirk McKusick 2018-11-16 09:03:00 UTC
(In reply to Konstantin Belousov from comment #16) Given that a successful unmount of /dev makes the system all but unusable, would it be sensible to have the /dev at the root simply return EINVAL if it is asked to be unmounted? Obviously copies of it in jails can and should be unmountable, but it seems to me that the original /dev from which all other /dev's are derived should not be.
Comment 18 Konstantin Belousov 2018-11-16 09:17:36 UTC
(In reply to Kirk McKusick from comment #17)
Unmounting /dev is fatal only when root is on UFS. If you boot over NFS, or e.g. boot from UFS but then re-root into tmpfs, /dev is not that significant. Also, I do not think it is worth so much hand-holding of the user as to impose policy on mount points.
Comment 19 Kirk McKusick 2018-11-16 09:35:00 UTC
(In reply to Konstantin Belousov from comment #18) If you unmount /dev you lose your console, /dev/null, /dev/zero, /dev/random, etc. A system without these fails to work pretty quickly (as I discovered when I created a jail and forgot to put a /dev in it). I expect a system running with any local filesystem (e.g., UFS, ZFS, ext2fs) would crash and burn pretty quickly if its access to disk was removed. Of course, in insecure mode (which is what we run in by default) we allow root to scribble all over kernel memory through /dev/kmem so unmounting /dev is no worse than that.
Comment 20 Konstantin Belousov 2018-11-16 09:45:58 UTC
(In reply to Kirk McKusick from comment #19)
You only lose usermode access to the console. Imagine that somebody wants to unmount and then mount /dev again, for whatever reason, e.g. as part of a more involved re-rooting. I do not see why we should prevent this. For the same reason, we do not prevent ifconfig down on the interface that was used for NFS boot.
Comment 21 Kirk McKusick 2018-11-16 09:57:50 UTC
(In reply to Konstantin Belousov from comment #20) We can always fall back to the analogy that purveyors of Unix are like purveyors of rope. It is dangerous stuff and if you are not careful, you will hang yourself.
Comment 22 Arpan Palit 2019-01-22 12:51:28 UTC
Hi,

I am using the FreeBSD stable/11 branch and still hit the same dangling deps panic, because the b_ioflags flag has been reset (i.e., BIO_ERROR is not set). I also observed that in most cases b_error is 0. The test environment is at the following commit:

------------------------------------------------------------------------------
commit 7b249ab3e16a3d41d0a58a43d7d89137a1c9ec00 (HEAD -> stable/11, origin/stable/11)
Author: delphij <delphij@FreeBSD.org>
Date: Tue Jan 22 04:20:52 2019 +0000
------------------------------------------------------------------------------

backtrace:
==========
fsync: giving up on dirty (error = 6)
0xfffff8029cbda938: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 15 rdev 0xfffff80247038400
    flags (VI_DOOMED|VI_ACTIVE)
    v_object 0xfffff80247d320f0 ref 0 pages 95
    cleanbuf 10 dirtybuf 2
    lock type devfs: EXCL by thread 0xfffff8029c84d620 (pid 774, df, tid 100720)
    dev ada1p1
panic: softdep_deallocate_dependencies: dangling deps
cpuid = 57
KDB: stack backtrace:
#0 0xffffffff80b20d27 at kdb_backtrace+0x67
#1 0xffffffff80add3c7 at vpanic+0x177
#2 0xffffffff80add453 at panic+0x43
#3 0xffffffff80d84376 at softdep_deallocate_dependencies+0x76
#4 0xffffffff80b7cd2c at brelse+0x16c
#5 0xffffffff80b9bc5d at flushbuflist+0x15d
#6 0xffffffff80b9b891 at bufobj_invalbuf+0x81
#7 0xffffffff80b9f06e at vgonel+0x18e
#8 0xffffffff80b9f68f at vgone+0x2f
#9 0xffffffff809aba42 at devfs_delete+0x1a2
#10 0xffffffff809ac1e2 at devfs_populate_loop+0x2b2
#11 0xffffffff809abf1a at devfs_populate+0x4a
#12 0xffffffff809b0dcc at devfs_populate_vp+0x8c
#13 0xffffffff809afa0f at devfs_getattr+0x1f
#14 0xffffffff810dc8a7 at VOP_GETATTR_APV+0xf7
#15 0xffffffff80bad553 at vn_stat+0xa3
#16 0xffffffff80bab44f at vn_statfile+0x4f
#17 0xffffffff80a8a8a9 at kern_fstat+0xa9

Am I missing a patch, or does the issue still require a fix?

Thanks,
Arpan
Comment 23 Warner Losh 2019-01-22 17:16:44 UTC
The issue is well understood. We've hit an error. We have dirty buffers that are interdependent. We can't just throw them away because of that interdependence, but have to carefully unwind the dependency tree. That code hasn't been written yet, and rather than continue when we know we've lost data, we panic. This panic is known to be too aggressive because we also know that there's no way the data could be flushed to media: the media is gone, never to return. There are no patches available (unless Kirk has some I've not noticed).
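To make the interdependence argument concrete, here is a toy model in Python. It is only a sketch under stated assumptions and is not the soft updates code: the buffer names, the `refs` mapping, and the `DanglingDeps` exception (standing in for the panic) are all hypothetical. Each buffer records which other buffers its pending work references; discarding a buffer that is still referenced leaves a dangling dependency, while unwinding discards referencing buffers first.

```python
class DanglingDeps(Exception):
    """Stands in for the panic in this toy model."""


def release(buf, refs):
    """Discard one buffer; refuse if another buffer still references it."""
    holders = sorted(b for b, needs in refs.items() if buf in needs)
    if holders:
        raise DanglingDeps(f"{buf} is still referenced by {holders}")
    del refs[buf]


def unwind(refs):
    """Discard every buffer in an order that never leaves a dangling reference."""
    refs = {b: set(needs) for b, needs in refs.items()}  # work on a copy
    order = []
    while refs:
        referenced = set().union(*refs.values())
        # A buffer may go once no remaining buffer references it.
        free = sorted(b for b in refs if b not in referenced)
        if not free:
            raise DanglingDeps("cyclic references; no safe order in this model")
        for b in free:
            release(b, refs)
            order.append(b)
    return order


if __name__ == "__main__":
    # Hypothetical chain: a directory block references an inode block,
    # which in turn references a bitmap block.
    refs = {"dirblock": {"inode"}, "inode": {"bitmap"}, "bitmap": set()}
    print(unwind(refs))
```

In this model, calling `release("bitmap", refs)` directly raises `DanglingDeps`, which mirrors the "throw them away" case, while `unwind` walks the chain from the outermost referencing buffer inward.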
Comment 24 Kirk McKusick 2021-08-01 00:22:01 UTC
As of the release of 13.0 the UFS filesystem recovers from a disk dying or disappearing by forcibly unmounting the filesystem. As part of the forcible unmount all dirty buffers and soft update dependencies are removed. Thus this panic no longer occurs. Verified by Peter Holm with his stress tests.
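The behavior described here can be pictured with a toy model in which losing the device drops the dirty buffers and their dependency records in one step, so nothing is left to dangle. This is a hedged illustration only, not the 13.0 kernel code; the class `ToyMount` and its methods are hypothetical names.

```python
class ToyMount:
    """Toy stand-in for a mounted filesystem with dirty, interdependent buffers."""

    def __init__(self):
        self.mounted = True
        self.dirty = set()  # names of dirty buffers
        self.deps = {}      # buffer -> set of buffers it references

    def write(self, buf, needs=()):
        """Mark a buffer dirty and record what it references."""
        self.dirty.add(buf)
        self.deps[buf] = set(needs)

    def device_gone(self):
        """Forcible unmount: discard buffers and dependency records together.

        Returns the number of dirty buffers whose data was lost.
        """
        lost = len(self.dirty)
        self.dirty.clear()
        self.deps.clear()
        self.mounted = False
        return lost


if __name__ == "__main__":
    m = ToyMount()
    m.write("dirblock", needs=["inode"])
    m.write("inode")
    print("buffers lost:", m.device_gone())
```

The data in the discarded buffers is still lost in this model; the point is only that buffers and dependency records go away together instead of one buffer being released while another still references it.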