Bug 156168 - [nfs] [panic] Kernel panic under concurrent access over NFS
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.2-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Rick Macklem
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-04 08:10 UTC by niakrisn
Modified: 2011-11-22 01:40 UTC
Description niakrisn 2011-04-04 08:10:10 UTC
I have 3 apache22-itk web servers with DOCUMENT_ROOT shared over NFS.
Sometimes I get a kernel panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x10
fault code              = supervisor write, page not present
instruction pointer     = 0x20:0xc0aa3236
stack pointer           = 0x28:0xea1ae528
frame pointer           = 0x28:0xea1ae5f4
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 78988 (httpd)
trap number             = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xc08e0d07 at kdb_backtrace+0x47
#1 0xc08b1dc7 at panic+0x117
#2 0xc0be4b43 at trap_fatal+0x323
#3 0xc0be4dc0 at trap_pfault+0x270
#4 0xc0be5305 at trap+0x465
#5 0xc0bcbebc at calltrap+0x6
#6 0xc0aa89c7 at clnt_call_private+0xf7
#7 0xc0a97dcb at nlm_get_rpc+0x19b
#8 0xc0a98379 at nlm_host_get_rpc+0x169
#9 0xc0a949eb at nlm_clearlock+0xeb
#10 0xc0a95d2a at nlm_advlock_internal+0x9ca
#11 0xc0a9651a at nlm_advlock+0x3a
#12 0xc0a80239 at nfs_advlock+0xa9
#13 0xc0c038c7 at VOP_ADVLOCK_APV+0x47
#14 0xc0875dee at closef+0xfe
#15 0xc087653f at kern_close+0x17f
#16 0xc087661a at close+0x1a
#17 0xc08eca39 at syscallenter+0x329
Uptime: 6d22h54m7s
Physical memory: 3059 MB
Dumping 335 MB: 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2011-04-09 20:59:33 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 nonesuch 2011-09-29 16:32:12 UTC
All,
  I am seeing a similar crash on 7.3-RELEASE-p2 amd64 when using
apache-1.3.34 with accf_http and an NFS docroot.
The servers that have crashed are all FreeBSD 7.3-RELEASE amd64.
Hardware is HP DL145 G2.
They have 2G of RAM and 2G of swap with one single-core Opteron CPU.


We are using the following sysctls:

kern.ipc.maxsockbuf=2097152
kern.ipc.nmbclusters=32768
kern.ipc.somaxconn=1024
kern.maxfiles=131072
kern.maxfilesperproc=32768
net.inet.tcp.inflight.enable=0
net.inet.tcp.path_mtu_discovery=0
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.recvbuf_max=8388608
net.inet.tcp.recvspace=32768
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.sendbuf_max=8388608
net.inet.tcp.sendspace=32768
net.inet.udp.recvspace=42080
net.isr.direct=1
vm.pmap.shpgperproc=600


Uptime prior to the crash was not consistent: the other system was up
for 11 days, this one for 6 days.

Here are the contents of my crash dump:


[root@web29 /var/crash]# kgdb /boot/kernel/kernel /var/crash/vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x258
fault code              = supervisor read data, page not present
instruction pointer     = 0x8:0xffffffff8051a66d
stack pointer           = 0x10:0xffffff803e69b1c0
frame pointer           = 0x10:0xffffff0001b50ae0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 9336 (libhttpd.ep)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 6d5h18m39s
Physical memory: 2034 MB
Dumping 1451 MB: 1436 1420 1404 1388 1372 1356 1340 1324 1308 1292
1276 1260 1244 1228 1212 1196 1180 1164 1148 1132 1116 1100 1084 1068
1052 1036 1020 1004 988 972 956 940 924 908 892 876 860 844 828 812
796 780 764 748 732 716 700 684 668 652 636 620 604 588 572 556 540
524 508 492 476 460 444 428 412 396 380 364 348 332 316 300 284 268
252 236 220 204 188 172 156 140 124 108 92 76 60 44 28 12

Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from
/boot/kernel/accf_http.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/accf_http.ko
#0  doadump () at pcpu.h:195
195     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x0000000000000004 in ?? ()
#2  0xffffffff805285f9 in boot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:418
#3  0xffffffff80528a02 in panic (fmt=0x104 <Address 0x104 out of
bounds>) at /usr/src/sys/kern/kern_shutdown.c:574
#4  0xffffffff807ec813 in trap_fatal (frame=0xffffff0001b50ae0,
eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:777
#5  0xffffffff807ecbe5 in trap_pfault (frame=0xffffff803e69b110,
usermode=0) at /usr/src/sys/amd64/amd64/trap.c:693
#6  0xffffffff807ed50c in trap (frame=0xffffff803e69b110) at
/usr/src/sys/amd64/amd64/trap.c:464
#7  0xffffffff807d614e in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:218
#8  0xffffffff8051a66d in _mtx_lock_sleep (m=0xffffff002f3d7a80,
tid=18446742974226565856, opts=Variable "opts" is not available.
)
    at /usr/src/sys/kern/kern_mutex.c:339
#9  0xffffffff80701f60 in clnt_dg_create (so=0xffffff00017755a0,
svcaddr=0xffffff803e69b310, program=100000, version=4, sendsz=Variable
"sendsz" is not available.
)
    at /usr/src/sys/rpc/clnt_dg.c:259
#10 0xffffffff806e97c9 in nlm_get_rpc (sa=Variable "sa" is not available.
) at /usr/src/sys/nlm/nlm_prot_impl.c:327
#11 0xffffffff806e9d39 in nlm_host_get_rpc (host=0xffffff0001705000)
at /usr/src/sys/nlm/nlm_prot_impl.c:1199
#12 0xffffffff806e680f in nlm_clearlock (host=0xffffff0001705000,
ext=0xffffff803e69b9a0, vers=4, timo=0xffffff803e69b9d0,
    retries=2147483647, vp=0xffffff004881edc8, op=2,
fl=0xffffff803e69bac0, flags=64, svid=9336, fhlen=32,
fh=0xffffff803e69b750,
    size=689) at /usr/src/sys/nlm/nlm_advlock.c:943
#13 0xffffffff806e7801 in nlm_advlock_internal (vp=0xffffff004881edc8,
id=Variable "id" is not available.
) at /usr/src/sys/nlm/nlm_advlock.c:355
#14 0xffffffff806e8166 in nlm_advlock (ap=Variable "ap" is not available.
) at /usr/src/sys/nlm/nlm_advlock.c:392
#15 0xffffffff806ced28 in nfs_advlock (ap=0xffffff803e69ba90) at
/usr/src/sys/nfsclient/nfs_vnops.c:3153
#16 0xffffffff804f40e2 in closef (fp=0xffffff0073716d80,
td=0xffffff0001b50ae0) at vnode_if.h:1036
#17 0xffffffff804f462b in kern_close (td=0xffffff0001b50ae0,
fd=Variable "fd" is not available.
) at /usr/src/sys/kern/kern_descrip.c:1125
#18 0xffffffff807ece67 in syscall (frame=0xffffff803e69bc80) at
/usr/src/sys/amd64/amd64/trap.c:920
#19 0xffffffff807d635b in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:339
#20 0x00000008009c5b1c in ?? ()
Previous frame inner to this frame (corrupt stack?)

-- 
mark saad | nonesuch@longcount.org
Comment 3 Rick Macklem freebsd_committer freebsd_triage 2011-10-20 23:42:03 UTC
State Changed
From-To: open->feedback


I have sent the person that reported this a patch to test 
and am waiting for feedback. I've taken responsibility for this. 


Comment 4 Rick Macklem freebsd_committer freebsd_triage 2011-10-20 23:42:03 UTC
Responsible Changed
From-To: freebsd-fs->rmacklem


I have sent the person that reported this a patch for testing 
and will update the status when I hear back from them.
Comment 5 dfilter service freebsd_committer freebsd_triage 2011-11-03 14:38:17 UTC
Author: rmacklem
Date: Thu Nov  3 14:38:03 2011
New Revision: 227059
URL: http://svn.freebsd.org/changeset/base/227059

Log:
  Both a crash reported on freebsd-current on Oct. 18 under the
  subject heading "mtx_lock() of destroyed mutex on NFS" and
  PR# 156168 appear to be caused by clnt_dg_destroy() closing
  down the socket prematurely. When to close down the socket
  is controlled by a reference count (cs_refs), but clnt_dg_create()
  checks for sb_upcall being non-NULL to decide if a new socket
  is needed. I believe the crashes were caused by the following race:
    clnt_dg_destroy() finds cs_refs == 0 and decides to delete socket
    clnt_dg_destroy() then loses race with clnt_dg_create() for
      acquisition of the SOCKBUF_LOCK()
    clnt_dg_create() finds sb_upcall != NULL and increments cs_refs to 1
    clnt_dg_destroy() then acquires SOCKBUF_LOCK(), sets sb_upcall to
      NULL and destroys socket
  
  This patch fixes the above race by changing clnt_dg_destroy() so
  that it acquires SOCKBUF_LOCK() before testing cs_refs.
  
  Tested by:	bz
  PR:		156168
  Reviewed by:	dfr
  MFC after:	2 weeks

Modified:
  head/sys/rpc/clnt_dg.c

Modified: head/sys/rpc/clnt_dg.c
==============================================================================
--- head/sys/rpc/clnt_dg.c	Thu Nov  3 14:36:56 2011	(r227058)
+++ head/sys/rpc/clnt_dg.c	Thu Nov  3 14:38:03 2011	(r227059)
@@ -1001,12 +1001,12 @@ clnt_dg_destroy(CLIENT *cl)
 	cs = cu->cu_socket->so_rcv.sb_upcallarg;
 	clnt_dg_close(cl);
 
+	SOCKBUF_LOCK(&cu->cu_socket->so_rcv);
 	mtx_lock(&cs->cs_lock);
 
 	cs->cs_refs--;
 	if (cs->cs_refs == 0) {
 		mtx_unlock(&cs->cs_lock);
-		SOCKBUF_LOCK(&cu->cu_socket->so_rcv);
 		soupcall_clear(cu->cu_socket, SO_RCV);
 		clnt_dg_upcallsdone(cu->cu_socket, cs);
 		SOCKBUF_UNLOCK(&cu->cu_socket->so_rcv);
@@ -1015,6 +1015,7 @@ clnt_dg_destroy(CLIENT *cl)
 		lastsocketref = TRUE;
 	} else {
 		mtx_unlock(&cs->cs_lock);
+		SOCKBUF_UNLOCK(&cu->cu_socket->so_rcv);
 		lastsocketref = FALSE;
 	}
 
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
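
For readers following the race described in the r227059 log, here is a
rough single-threaded replay of that interleaving. It is only a sketch:
struct sock_model and every function in it are hypothetical stand-ins
invented for illustration, not the code in sys/rpc/clnt_dg.c, and the
locks are modelled purely by the order in which the steps are called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the shared per-socket state; not the kernel structs. */
struct sock_model {
	int  cs_refs;     /* reference count, standing in for cs->cs_refs */
	bool upcall_set;  /* true while sb_upcall would be non-NULL */
	bool destroyed;   /* true once the socket state has been torn down */
};

/* clnt_dg_create() side: reuse the socket if the upcall is still installed. */
static bool model_create(struct sock_model *s)
{
	if (!s->upcall_set)
		return false;   /* caller would set up a fresh socket instead */
	s->cs_refs++;
	return true;
}

/* Old destroy, step 1: drop the ref and test cs_refs without the sockbuf lock. */
static bool old_destroy_decide(struct sock_model *s)
{
	s->cs_refs--;
	return s->cs_refs == 0;
}

/* Old destroy, step 2: runs later, once the sockbuf lock has been won. */
static void old_destroy_teardown(struct sock_model *s)
{
	s->upcall_set = false;
	s->destroyed = true;
}

/* Patched destroy: test and teardown form one step under the sockbuf lock. */
static bool new_destroy(struct sock_model *s)
{
	s->cs_refs--;
	if (s->cs_refs != 0)
		return false;
	s->upcall_set = false;
	s->destroyed = true;
	return true;
}

/* Replays the racy order; true means a live ref points at destroyed state. */
bool replay_old_ordering(void)
{
	struct sock_model s = { 1, true, false };
	bool last = old_destroy_decide(&s);  /* cs_refs 1 -> 0, decides to delete */
	bool reused = model_create(&s);      /* create slips in: cs_refs 0 -> 1 */

	if (last)
		old_destroy_teardown(&s);    /* socket destroyed under the new ref */
	return reused && s.destroyed && s.cs_refs > 0;
}

/* With the patch, create can only run wholly before or wholly after destroy. */
bool replay_new_ordering(bool create_first)
{
	struct sock_model s = { 1, true, false };
	bool reused;

	if (create_first) {
		reused = model_create(&s);   /* cs_refs 1 -> 2 */
		new_destroy(&s);             /* cs_refs 2 -> 1, nothing torn down */
	} else {
		new_destroy(&s);             /* last ref: upcall cleared first */
		reused = model_create(&s);   /* sees no upcall, does not reuse */
	}
	return reused && s.destroyed && s.cs_refs > 0;
}
```

In this model replay_old_ordering() reports the dangling reference,
while replay_new_ordering() does not for either order. It says nothing
about the other races that the stable/7 log mentions.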
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2011-11-21 16:17:48 UTC
State Changed
From-To: feedback->closed


I believe this bug is fixed by r227059 which has been MFC'd 
to stable/8 r227601.
Comment 7 dfilter service freebsd_committer freebsd_triage 2011-11-22 01:33:17 UTC
Author: rmacklem
Date: Tue Nov 22 01:32:57 2011
New Revision: 227810
URL: http://svn.freebsd.org/changeset/base/227810

Log:
  MFC: r227059
  Both a crash reported on freebsd-current on Oct. 18 under the
  subject heading "mtx_lock() of destroyed mutex on NFS" and
  PR# 156168 appear to be caused by clnt_dg_destroy() closing
  down the socket prematurely. When to close down the socket
  is controlled by a reference count (cs_refs), but clnt_dg_create()
  checks for sb_upcall being non-NULL to decide if a new socket
  is needed. I believe the crashes were caused by the following race:
    clnt_dg_destroy() finds cs_refs == 0 and decides to delete socket
    clnt_dg_destroy() then loses race with clnt_dg_create() for
      acquisition of the SOCKBUF_LOCK()
    clnt_dg_create() finds sb_upcall != NULL and increments cs_refs to 1
    clnt_dg_destroy() then acquires SOCKBUF_LOCK(), sets sb_upcall to
      NULL and destroys socket
  
  This patch fixes the above race by changing clnt_dg_destroy() so
  that it acquires SOCKBUF_LOCK() before testing cs_refs.
  This is a slightly modified patch for stable/7. It fixes the
  above race, although others still exist, since some patches
  such as r193272 cannot be MFC'd.
  
  Tested by:	nonesuch at longcount.org (Mark Saad)
  PR:		kern/156168

Modified:
  stable/7/sys/rpc/clnt_dg.c
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/cddl/contrib/opensolaris/   (props changed)
  stable/7/sys/contrib/dev/acpica/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/rpc/clnt_dg.c
==============================================================================
--- stable/7/sys/rpc/clnt_dg.c	Tue Nov 22 00:35:30 2011	(r227809)
+++ stable/7/sys/rpc/clnt_dg.c	Tue Nov 22 01:32:57 2011	(r227810)
@@ -811,18 +811,22 @@ clnt_dg_destroy(CLIENT *cl)
 	while (cu->cu_threads)
 		msleep(cu, &cs->cs_lock, 0, "rpcclose", 0);
 
+	mtx_unlock(&cs->cs_lock);
+	SOCKBUF_LOCK(&cu->cu_socket->so_rcv);
+	mtx_lock(&cs->cs_lock);
 	cs->cs_refs--;
 	if (cs->cs_refs == 0) {
-		mtx_destroy(&cs->cs_lock);
-		SOCKBUF_LOCK(&cu->cu_socket->so_rcv);
+		mtx_unlock(&cs->cs_lock);
 		cu->cu_socket->so_upcallarg = NULL;
 		cu->cu_socket->so_upcall = NULL;
 		cu->cu_socket->so_rcv.sb_flags &= ~SB_UPCALL;
 		SOCKBUF_UNLOCK(&cu->cu_socket->so_rcv);
+		mtx_destroy(&cs->cs_lock);
 		mem_free(cs, sizeof(*cs));
 		lastsocketref = TRUE;
 	} else {
 		mtx_unlock(&cs->cs_lock);
+		SOCKBUF_UNLOCK(&cu->cu_socket->so_rcv);
 		lastsocketref = FALSE;
 	}
 