Bug 194606 - filesystem unmount deadlock on 10.1 and head when TRIM enabled at unmount after r268815, MFC of 268205
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RC2
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2014-10-26 09:51 UTC by Guido Falsi
Modified: 2014-10-29 13:51 UTC

Description Guido Falsi freebsd_committer freebsd_triage 2014-10-26 09:51:06 UTC
While performing some tests with NanoBSD (FreeBSD 10.1-RC3) on ALIX hardware, I discovered a lockup when unmounting filesystems.

This hardware is a small motherboard that uses a CF card as its main storage.

I usually enable TRIM support on these. NanoBSD mounts filesystems read-only, and I use scripts to mount/unmount filesystems when changes need to be saved.

I have seen a deadlock when unmounting. With a debugging kernel I got this:

root@qtest:~ [0]# umount /cfg
panic: detach with active requests
KDB: stack backtrace:
db_trace_self_wrapper(c0968053,c08ea7f0,c2d48800,c23d6bc8,c0536a16,...) at db_trace_self_wrapper+0x2d/frame 0xc23d6b98
kdb_backtrace(c09639e1,c09fa7e8,c095761d,c23d6c54,c095761d,...) at kdb_backtrace+0x30/frame 0xc23d6c00
vpanic(c09fa682,100,c095761d,c23d6c54,c23d6c54,...) at vpanic+0x80/frame 0xc23d6c24
kassert_panic(c095761d,c09575b3,c2d7acc0,4c7,c2d7acc0,...) at kassert_panic+0xe9/frame 0xc23d6c48
g_detach(c2d7acc0,4,c095725c,1c2,c09c8d5c,...) at g_detach+0x1d3/frame 0xc23d6c64
g_wither_washer(c09f7df4,0,c0956544,124,0,...) at g_wither_washer+0x109/frame 0xc23d6c90
g_run_events(0,c23d6d08,c095d42a,3dc,0,...) at g_run_events+0x40/frame 0xc23d6ccc
fork_exit(c05c4e60,0,c23d6d08) at fork_exit+0x7f/frame 0xc23d6cf4
fork_trampoline() at fork_trampoline+0x8/frame 0xc23d6cf4
--- trap 0, eip = 0, esp = 0xc23d6d40, ebp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100006 ]
Stopped at      kdb_enter+0x3d: movl    $0,kdb_why
db>

I played around with ddb and discovered this:

db> show geom 0xc2e98b40
consumer: 0xc2e98b40
  class:    VFS (0xc09c8d5c)
  geom:     ffs.ada0s3 (0xc3293600)
  provider: ada0s3 (0xc2e7e200)
  access:   r0w0e0
  flags:    0x0030
  nstart:   19
  nend:     18

This shows nstart != nend, while g_detach asserts that they are equal.

Going up the chain, I find that its providers also have nstart - nend == 1:

db> show geom 0xc2e9b7c0
consumer: 0xc2e9b7c0
  class:    PART (0xc09c96b0)
  geom:     ada0 (0xc2e7e780)
  provider: ada0 (0xc2e7e500)
  access:   r2w0e0
  flags:    0x0030
  nstart:   1430
  nend:     1429
db> show geom 0xc2e7e500
provider: ada0 (0xc2e7e500)
  class:        DISK (0xc09c8890)
  geom:         ada0 (0xc2e7e580)
  mediasize:    4017807360
  sectorsize:   512
  stripesize:   0
  stripeoffset: 0
  access:       r2w0e0
  flags:         (0x0030)
  error:        0
  nstart:       2085
  nend:         2084
  consumer: 0xc2e9a700 (ada0), access=r0w0e0, flags=0x0030
  consumer: 0xc2e9b480 (ada0), access=r0w0e0, flags=0x0030
  consumer: 0xc2e9b7c0 (ada0), access=r2w0e0, flags=0x0030
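
The nstart/nend mismatch in the ddb output above can be illustrated with a minimal sketch in C. This is not the GEOM source: struct io_counters and the helper functions are hypothetical names that only mirror the two counters ddb prints, assuming nstart counts issued requests and nend counts completed ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters ddb prints; not the GEOM struct. */
struct io_counters {
	unsigned nstart;	/* requests issued */
	unsigned nend;		/* requests completed */
};

/* A request is handed down: it counts as started. */
static void
io_start(struct io_counters *c)
{
	c->nstart++;
}

/* A request is reported as finished: it counts as ended. */
static void
io_done(struct io_counters *c)
{
	c->nend++;
}

/* Requests that were started but never completed. */
static unsigned
io_outstanding(const struct io_counters *c)
{
	return (c->nstart - c->nend);
}

/* Detaching is only safe when nothing is still in flight. */
static bool
can_detach(const struct io_counters *c)
{
	return (io_outstanding(c) == 0);
}
```

With 19 starts and 18 completions, as in the VFS consumer above, io_outstanding() is 1 and can_detach() is false, which is the condition the assertion trips on.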

Having no idea how to debug further, I started testing various revisions and finally found that the commit that broke it is r268815, which MFCed r268205. Disabling TRIM on the FS also "fixes" the problem, which seems to confirm that this change is involved.

Since this depends on hardware support for TRIM, I have been unable to reproduce it in VirtualBox. I'm sorry I can't provide a reproducible test case.

I'm CCing imp, who committed r268815, hoping he has some more insight into this.

This also affects head, obviously.

I'm available for any further testing or information needed.

Thanks in advance.
Comment 1 Warner Losh freebsd_committer freebsd_triage 2014-10-26 14:03:48 UTC
If you disable trim, does the problem go away?

I had a hard time scrounging up a CF card to test with on a SATA system.

I'm guessing that I've dropped a biodone given the debug output you've posted.
Comment 2 Guido Falsi freebsd_committer freebsd_triage 2014-10-26 14:19:50 UTC
(In reply to Warner Losh from comment #1)
> If you disable trim, does the problem go away?

Yes, it does go away. It is easy to test: simply mounting the FS, editing a file with vi and unmounting it causes the panic, if not on the first try then within 2-3.

> 
> I had a hard time scrounging up a CF card to test with on a SATA system.
> 
> I'm guessing that I've dropped a biodone given the debug you've posted.

Maybe. Unfortunately, I don't know much about the kernel and the VFS system, so I can't really help with the code.

Looking at it, I noticed that before that commit the value of softc->trim_running is changed before any operation is performed, while after the patch the code calls the new functions that perform the operation before changing that value, which is now changed after the conditional (line 1506). I don't really know what that variable means, but could this be related?
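
To make the ordering difference concrete, here is a rough sketch in C. It is not the ata_da.c code: only the name trim_running comes from the driver, while struct softc_sketch, issue_delete() and the two start functions are hypothetical and exist only to show what the issuing code would observe under each ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state; only trim_running is a name from the driver. */
struct softc_sketch {
	bool trim_running;	 /* a delete request is considered in flight */
	bool flag_seen_by_issue; /* value of trim_running seen while issuing */
};

/* Hypothetical helper standing in for the functions that issue the delete. */
static void
issue_delete(struct softc_sketch *sc)
{
	sc->flag_seen_by_issue = sc->trim_running;
}

/* Ordering described for the code before the commit: flag first, then issue. */
static void
start_flag_first(struct softc_sketch *sc)
{
	sc->trim_running = true;
	issue_delete(sc);
}

/* Ordering described for the code after the commit: issue first, then flag. */
static void
start_issue_first(struct softc_sketch *sc)
{
	issue_delete(sc);
	sc->trim_running = true;
}
```

Both orderings end with trim_running set. They differ only in what code running during the issue step sees, so whether this matters depends on whether anything in that step reads the flag, which I have not verified.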

If you have a patch, I'll be happy to test it and report back. I can perform any kind of test, since this is not production hardware.
Comment 3 commit-hook freebsd_committer freebsd_triage 2014-10-26 18:41:11 UTC
A commit references this bug:

Author: smh
Date: Sun Oct 26 18:41:01 UTC 2014
New revision: 273704
URL: https://svnweb.freebsd.org/changeset/base/273704

Log:
  Fix CF ERASE breakage caused by 268205.

  This prevents BIO_DELETE requests getting stuck in the TRIM queue which
  results in a panic on shutdown due to outstanding requests.

  PR:		194606
  Reported by:	Guido Falsi
  Reviewed by:	mav
  MFC after:	3 days
  Sponsored by:	Multiplay

Changes:
  head/sys/cam/ata/ata_da.c
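
The failure mode named in the log (delete requests stuck in a queue, leaving outstanding requests at shutdown) can be sketched in C as follows. This is an illustration under stated assumptions, not the committed change: the capability flags, struct delete_queue and the functions are hypothetical, and the reason requests get stuck here (a dispatcher that does not cover the device's capability) is chosen purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits; not the real driver's flag names. */
enum {
	CAP_TRIM = 1 << 0,
	CAP_CFA_ERASE = 1 << 1
};

/* Hypothetical bookkeeping for delete requests. */
struct delete_queue {
	int pending;	/* requests waiting in the queue */
	int started;	/* requests accepted from the upper layer */
	int completed;	/* requests reported back as done */
};

/* Accept a delete request: it is queued and counted as started. */
static void
enqueue_delete(struct delete_queue *q)
{
	q->pending++;
	q->started++;
}

/*
 * Drain the queue if the dispatcher handles a capability the device has.
 * Returns the number of requests completed in this call.
 */
static int
run_queue(struct delete_queue *q, int device_caps, int handled_caps)
{
	int done = 0;

	if ((device_caps & handled_caps) == 0)
		return (0);	/* no branch matches: requests stay queued */
	while (q->pending > 0) {
		q->pending--;
		q->completed++;
		done++;
	}
	return (done);
}

/* Shutdown is only clean when every started request has completed. */
static bool
safe_to_detach(const struct delete_queue *q)
{
	return (q->started == q->completed);
}
```

In this model a device that only offers CAP_CFA_ERASE, driven by a dispatcher that only handles CAP_TRIM, never completes its queued request, so safe_to_detach() stays false until the dispatcher also covers CAP_CFA_ERASE.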
Comment 4 commit-hook freebsd_committer freebsd_triage 2014-10-29 11:12:16 UTC
A commit references this bug:

Author: smh
Date: Wed Oct 29 11:11:55 UTC 2014
New revision: 273818
URL: https://svnweb.freebsd.org/changeset/base/273818

Log:
  MFS10 r273814
  MFC r273704

  Fix ATA CF ERASE breakage caused by 268205

  PR:		194606
  Approved by:	re (marius)
  Sponsored by:	Multiplay

Changes:
_U  releng/10.1/
  releng/10.1/sys/cam/ata/ata_da.c
Comment 5 Guido Falsi freebsd_committer freebsd_triage 2014-10-29 13:51:19 UTC
Thanks!