222027 – panic on non-zero RACCT destroy

Bug 222027 - panic on non-zero RACCT destroy

Summary: panic on non-zero RACCT destroy

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:

Duplicates (1):	210315 (view as bug list)
Depends on:
Blocks:

Reported:	2017-09-03 15:14 UTC by Larry Rosenman
Modified:	2018-12-06 13:45 UTC (History)
CC List:	4 users (show)

See Also:	210315

Attachments
Latest (2017-10-07) text dump (103.50 KB, application/x-tar) 2017-10-08 13:15 UTC, Larry Rosenman	no flags	Details
info (2017-10-07) from last dump (486 bytes, text/plain) 2017-10-08 13:16 UTC, Larry Rosenman	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Larry Rosenman freebsd_committer

2017-09-03 15:14:11 UTC

panic: destroying non-empty racct: 3399680 allocated for resource 4

cpuid = 8
time = 1504448756
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe535fec7600
vpanic() at vpanic+0x19c/frame 0xfffffe535fec7680
kassert_panic() at kassert_panic+0x126/frame 0xfffffe535fec76f0
racct_destroy_locked() at racct_destroy_locked+0x7a/frame 0xfffffe535fec7720
racct_destroy() at racct_destroy+0x35/frame 0xfffffe535fec7740
uifree() at uifree+0x8b/frame 0xfffffe535fec7760
crfree() at crfree+0x85/frame 0xfffffe535fec7780
thread_wait() at thread_wait+0x9b/frame 0xfffffe535fec77a0
proc_reap() at proc_reap+0x3fe/frame 0xfffffe535fec77e0
proc_to_reap() at proc_to_reap+0x36c/frame 0xfffffe535fec7830
kern_wait6() at kern_wait6+0x28d/frame 0xfffffe535fec78d0
sys_wait4() at sys_wait4+0x78/frame 0xfffffe535fec7ac0
amd64_syscall() at amd64_syscall+0x769/frame 0xfffffe535fec7bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe535fec7bf0
--- syscall (7, FreeBSD ELF64, sys_wait4), rip = 0x79629a, rsp = 0x7fffffffcf78, rbp = 0x7fffffffd5f0 ---
Uptime: 2d19h33m23s

borg.lerctr.org /var/crash # cat version.txt
FreeBSD 12.0-CURRENT #5 r323021: Wed Aug 30 12:44:11 CDT 2017
    root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER
borg.lerctr.org /var/crash #

Comment 1 Larry Rosenman freebsd_committer

2017-09-03 15:15:27 UTC

This happens on a large poudriere run.

Comment 2 Enji Cooper freebsd_committer

2017-09-14 19:37:10 UTC

I don't think avg@ can really provide much input here. This looks like a wait*(2)/proc change caused this issue.

Comment 3 Larry Rosenman freebsd_committer

2017-10-08 13:14:22 UTC

This is still happening.  I'll attach the text dumps I Have.

Comment 4 Larry Rosenman freebsd_committer

2017-10-08 13:15:33 UTC

Created attachment 186996 [details]
Latest (2017-10-07) text dump

Comment 5 Larry Rosenman freebsd_committer

2017-10-08 13:16:02 UTC

Created attachment 186997 [details]
info (2017-10-07) from last dump

Comment 6 Andriy Gapon freebsd_committer

2018-03-05 06:15:29 UTC

I've also just got this, also at the end of a poudriere run:

panic: destroying non-empty racct: 2334720 allocated for resource 4
cpuid = 0
curthread: 0xfffff80192300580
stack: 0xfffffe090f856000 - 0xfffffe090f85a000
stack pointer: 0xfffffe090f859910
time = 1520178482
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff804670bb = db_trace_self_wrapper+0x2b/frame 0xfffffe090f859840
kdb_backtrace() at 0xffffffff806d4c79 = kdb_backtrace+0x39/frame 0xfffffe090f8598f0
vpanic() at 0xffffffff80699da6 = vpanic+0x166/frame 0xfffffe090f859930
kassert_panic() at 0xffffffff80699adb = kassert_panic+0x16b/frame 0xfffffe090f8599a0
racct_destroy_locked() at 0xffffffff80689f82 = racct_destroy_locked+0x72/frame 0xfffffe090f8599d0
racct_destroy() at 0xffffffff80689ee5 = racct_destroy+0x35/frame 0xfffffe090f8599f0
prison_racct_free_locked() at 0xffffffff80667adc = prison_racct_free_locked+0x4c/frame 0xfffffe090f859a10
prison_racct_free() at 0xffffffff80667a2d = prison_racct_free+0x6d/frame 0xfffffe090f859a30
prison_racct_detach() at 0xffffffff80667b8e = prison_racct_detach+0x3e/frame 0xfffffe090f859a50
prison_deref() at 0xffffffff8066583e = prison_deref+0x2fe/frame 0xfffffe090f859a90
prison_remove_one() at 0xffffffff80666ca5 = prison_remove_one+0x145/frame 0xfffffe090f859ac0
sys_jail_remove() at 0xffffffff80666b49 = sys_jail_remove+0x1f9/frame 0xfffffe090f859b10
syscallenter() at 0xffffffff808fd92b = syscallenter+0x2cb/frame 0xfffffe090f859b50
amd64_syscall() at 0xffffffff808fd4e7 = amd64_syscall+0x17/frame 0xfffffe090f859bf0
Xfast_syscall() at 0xffffffff808e485b = Xfast_syscall+0xfb/frame 0xfffffe090f859bf0

Comment 7 Andriy Gapon freebsd_committer

2018-04-06 17:17:31 UTC

I think that my "fix" for bug 210315 was not a real fix.
I have another theory now which, if it holds, may explain all panics reported so far in that bug and in this bug.

I suspect that the problem could be with this pattern used in many places where process credentials change:

  proc_set_cred(p, newcred);
  PROC_UNLOCK(p);
  racct_proc_ucred_changed(p, oldcred, newcred);

You can see this in do_jail_attach() as well as in sys_setloginclass(), sys_setuid() and alike.

Now, let's compare this to how RACCT_RSS is updated by vm_daemon():
  PROC_LOCK(p)
  racct_set(p, RACCT_RSS, rsize);
  PROC_UNLOCK(p);

This happens concurrently to jail_attach or setuid, etc, so it's possible that once in a while this update will happen while the process has already been assigned the newcred, but before racct_proc_ucred_changed() is called.  In this case the new value would be applied to the process and its _new_ credentials will be adjusted.  But when racct_proc_ucred_changed() is called we will transfer the process's resource from oldcred to newcred.  So, essentially we will removed more than needed (provided the delta was positive) from oldcred and add more than needed to newcred.

I think that the theory explains the panic in comment #6 where a jail got some extra RACCT_RSS.  It also seems to explain bug# 210315, comment #2 when an underflow happened in the global space (prison0) during shutdown.

I think that doing racct_proc_ucred_changed() under the proc lock would prevent the problem, but there must be a good reason why that function requires that it is not called with under the lock...
So, not sure...

Comment 8 Andriy Gapon freebsd_committer

2018-04-06 17:27:05 UTC

*** Bug 210315 has been marked as a duplicate of this bug. ***

Comment 9 Andriy Gapon freebsd_committer

2018-04-11 13:43:24 UTC

I devised a scenario for this bug.  Running it for 10 minutes reliably reproduces the crash.

In one shell executed this small script as root:
    while true ; do ( sleep 2 ; exec su -c xuser -m root -c ':' ) & sleep 0.001 ; done
The idea behind this script is to make a process noticed by vm_daemon so that its RACCT_RSS gets set and then call setloginclass() hoping to expose the race described in comment #7.
"xuser" login class is selected because it's not used by anything else, so the experiment is more pure.

In another shell executed this dtrace command:
    dtrace -w -n 'sys_setloginclass:entry { self->t = 1; } sys_setloginclass:return { self->t = 0; }  proc_set_cred:entry /self->t/ { chill(100000000); }'
The idea is to make proc_set_cred() take much longer than usual while holding the proc lock, so that vm_daemon has much higher chances of running into that lock and getting blocked on it.  When sys_setloginclass() releases the lock, vm_daemon gets a chance to run and expose the race.

When stop the test and initiate a graceful reboot, I get a panic about RACCT_RSS underflow in "root" login class:
panic: racct_adjust_resource: resource 4 usage < 0
cpuid = 2
time = 1523437499
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe002b46d860
vpanic() at vpanic+0x19c/frame 0xfffffe002b46d8e0
kassert_panic() at kassert_panic+0x126/frame 0xfffffe002b46d950
racct_adjust_resource() at racct_adjust_resource+0xd2/frame 0xfffffe002b46d980
racct_set_locked() at racct_set_locked+0x135/frame 0xfffffe002b46d9d0
racct_set() at racct_set+0x51/frame 0xfffffe002b46da00
vmspace_exit() at vmspace_exit+0x147/frame 0xfffffe002b46da40
exit1() at exit1+0x5ad/frame 0xfffffe002b46dab0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe002b46dac0
amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe002b46dbf0
fast_syscall_common() at fast_syscall_common+0x105/frame 0xfffffe002b46dbf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x8008eb77a, rsp = 0x7fffffffe1a8, rbp = 0x7fffffffe1c0 ---

(kgdb) bt
#0  kdb_enter (why=0xffffffff814ee179 "panic", msg=<optimized out>) at /usr/devel/svn/autoro/sys/kern/subr_kdb.c:479
#1  0xffffffff80ac6549 in vpanic (fmt=<optimized out>, ap=0xfffffe002b46d920) at /usr/devel/svn/autoro/sys/kern/kern_shutdown.c:826
#2  0xffffffff80ac62f6 in kassert_panic (fmt=0xffffffff814ec548 "%s: resource %d usage < 0") at /usr/devel/svn/autoro/sys/kern/kern_shutdown.c:723
#3  0xffffffff80ab6ea2 in racct_adjust_resource (racct=0xfffff800074c6dd0, resource=4, amount=<optimized out>) at /usr/devel/svn/autoro/sys/kern/kern_racct.c:526
#4  0xffffffff80ab6965 in racct_set_locked (p=<optimized out>, resource=4, amount=<optimized out>, force=0) at /usr/devel/svn/autoro/sys/kern/kern_racct.c:618
#5  0xffffffff80ab67d1 in racct_set (p=0xfffff800149f7000, resource=4, amount=0) at /usr/devel/svn/autoro/sys/kern/kern_racct.c:739
#6  0xffffffff80de9577 in vmspace_container_reset (p=<optimized out>) at /usr/devel/svn/autoro/sys/vm/vm_map.c:318
#7  vmspace_exit (td=0xfffff8005d353000) at /usr/devel/svn/autoro/sys/vm/vm_map.c:427
#8  0xffffffff80a8175d in exit1 (td=0xfffff8005d353000, rval=<optimized out>, signo=0) at /usr/devel/svn/autoro/sys/kern/kern_exit.c:403
#9  0xffffffff80a811ad in sys_sys_exit (td=0xffffffff81d58408 <cnputs_mtx>, uap=<optimized out>) at /usr/devel/svn/autoro/sys/kern/kern_exit.c:181
#10 0xffffffff80f7df7b in syscallenter (td=0xfffff8005d353000) at /usr/devel/svn/autoro/sys/amd64/amd64/../../kern/subr_syscall.c:134
#11 amd64_syscall (td=0xfffff8005d353000, traced=0) at /usr/devel/svn/autoro/sys/amd64/amd64/trap.c:936

(kgdb) fr 3
#3  0xffffffff80ab6ea2 in racct_adjust_resource (racct=0xfffff800074c6dd0, resource=4, amount=<optimized out>) at /usr/devel/svn/autoro/sys/kern/kern_racct.c:526
526                     KASSERT(RACCT_IS_SLOPPY(resource) || RACCT_IS_DECAYING(resource),
(kgdb) p *racct
$1 = {r_resources = {15961661599, 98304, 794624, 0, -1314816, 0, 3, 0, 51683328, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2817797305501, 132851, 0, 0, 0, 0}, r_rule_links = {lh_first = 0x0}}

(kgdb) p *p->p_ucred->cr_loginclass
$4 = {lc_next = {le_next = 0xfffff800030692c0, le_prev = 0xfffff8000377b380}, lc_name = "root", '\000' <repeats 28 times>, lc_refcount = 6, lc_racct = 0xfffff800074c6dd0}

(kgdb) p *p->p_ucred->cr_loginclass->lc_racct
$5 = {r_resources = {15961661599, 98304, 794624, 0, -1314816, 0, 3, 0, 51683328, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2817797305501, 132851, 0, 0, 0, 0}, r_rule_links = {lh_first = 0x0}}

Comment 10 Andriy Gapon freebsd_committer

2018-04-12 10:00:54 UTC

See https://reviews.freebsd.org/D15048

Comment 11 commit-hook freebsd_committer

2018-04-20 13:08:35 UTC

A commit references this bug:

Author: avg
Date: Fri Apr 20 13:08:05 UTC 2018
New revision: 332816
URL: https://svnweb.freebsd.org/changeset/base/332816

Log:
  call racct_proc_ucred_changed() under the proc lock

  The lock is required to ensure that the switch to the new credentials
  and the transfer of the process's accounting data from the old
  credentials to the new ones is done atomically.  Otherwise, some updates
  may be applied to the new credentials and then additionally transferred
  from the old credentials if the updates happen after proc_set_cred() and
  before racct_proc_ucred_changed().

  The problem is especially pronounced for RACCT_RSS because
  - there is a strict accounting for this resource (it's reclaimable)
  - it's updated asynchronously by the vm daemon
  - it's updated by setting an absolute value instead of applying a delta

  I had to remove a call to rctl_proc_ucred_changed() from
  racct_proc_ucred_changed() and make all callers of latter call the
  former as well.  The reason is that rctl_proc_ucred_changed, as it is
  implemented now, cannot be called while holding the proc lock, so the
  lock is dropped after calling racct_proc_ucred_changed.  Additionally,
  I've added calls to crhold / crfree around the rctl call, because
  without the proc lock there is no gurantee that the new credentials,
  owned by the process, will stay stable.  That does not eliminate a
  possibility that the credentials passed to the rctl will get stale.
  Ideally, rctl_proc_ucred_changed should be able to work under the proc
  lock.

  Many thanks to kib for pointing out the above problems.

  PR:		222027
  Discussed with:	kib
  No comment:	trasz
  MFC after:	2 weeks
  Differential Revision: https://reviews.freebsd.org/D15048

Changes:
  head/sys/kern/kern_jail.c
  head/sys/kern/kern_loginclass.c
  head/sys/kern/kern_prot.c
  head/sys/kern/kern_racct.c
  head/sys/kern/kern_rctl.c