158197 – [geom] geom_cache with size>1000 leads to panics

Bug 158197 - [geom] geom_cache with size>1000 leads to panics

Summary: [geom] geom_cache with size>1000 leads to panics

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	Unspecified
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	freebsd-geom (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-06-23 10:40 UTC by bsd
Modified:	2018-05-28 19:45 UTC (History)
CC List:	1 user (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description bsd 2011-06-23 10:40:08 UTC

Using geom_cache with size>1000 leads to panics in g_up/g_down (g_cache_deliver/g_cache_read)

panic44:    full log http://paste.org.ru/?l1sjnk
Tracing pid 13 tid 100014 td 0xfffffe0002c20460
kdb_enter() at kdb_enter+0x3b
panic() at panic+0x180
g_cache_deliver() at g_cache_deliver+0x274
g_cache_done() at g_cache_done+0x80
g_io_schedule_up() at g_io_schedule_up+0xce
g_up_procbody() at g_up_procbody+0x72
fork_exit() at fork_exit+0x135
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff8000275d00, rbp = 0 ---

panic45:    full log http://paste.org.ru/?egp7hk
Tracing pid 13 tid 100015 td 0xfffffe0002c20000
g_cache_read() at g_cache_read+0xee
g_cache_start() at g_cache_start+0x34e
g_io_schedule_down() at g_io_schedule_down+0x1d4
g_down_procbody() at g_down_procbody+0x72
fork_exit() at fork_exit+0x135
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800027ad00, rbp = 0 ---

Fix: 

Using smaller size for gcache or turning it off with kern.geom.cache.enable=0
How-To-Repeat: Create geom cache with parameters like "gcache create -b128k -s2000 test".
(seems gcache works stable if using size 100-200, but panics on 1000+).
ports & src trees work good as test pattern for filesystem.  
  
panics occur during massive io operations:
"find /mnt/test" or reading all files often leads to panic in g_down;
dumping/restoring entire fs (ports+src) always triggers panic.

Comment 1 pdc 2011-06-23 11:11:08 UTC

Trace from older revision (with full vmcore)



Dump header from device /dev/ad4s1b
Architecture: amd64
Architecture Version: 2
Dump Length: 2973687808B (2835 MB)
Blocksize: 512
Dumptime: Wed Mar 16 03:33:59 2011
Hostname: cel.home
Magic: FreeBSD Kernel Dump
Version String: FreeBSD 9.0-CURRENT #0 r219141: Thu Mar 3 06:07:03 EET 2011
root@cel.home:/usr/obj/var/src9/sys/PDC90
Panic String:
Dump Parity: 2510468568
Bounds: 32
Dump Status: good




Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer = 0x20:0xffffffff8135924e
stack pointer = 0x28:0xffffff800007ab30
frame pointer = 0x28:0xffffff800007ab70
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 4 (g_down)
Physical memory: 8175 MB
Dumping 2835 MB: 2820 2804 2788 2772 ...





(kgdb) #0 doadump () at /var/src9/sys/kern/kern_shutdown.c:250
#1 0xffffffff802cb7fc in db_fncall (dummy1=Variable "dummy1" is not  
available.
)
at /var/src9/sys/ddb/db_command.c:548
#2 0xffffffff802cbb31 in db_command (last_cmdp=0xffffffff80bc5340,  
cmd_table=Variable "cmd_table" is not available.

) at /var/src9/sys/ddb/db_command.c:445
#3 0xffffffff802cbd80 in db_command_loop ()
at /var/src9/sys/ddb/db_command.c:498
#4 0xffffffff802cddf9 in db_trap (type=Variable "type" is not available.
) at /var/src9/sys/ddb/db_main.c:229
#5 0xffffffff8054e5b1 in kdb_trap (type=9, code=0, tf=0xffffff800007aa80)
at /var/src9/sys/kern/subr_kdb.c:533
#6 0xffffffff807cd88d in trap_fatal (frame=0xffffff800007aa80,  
eva=Variable "eva" is not available.
)
at /var/src9/sys/amd64/amd64/trap.c:798
#7 0xffffffff807cde5c in trap (frame=0xffffff800007aa80)
at /var/src9/sys/amd64/amd64/trap.c:614
#8 0xffffffff807b86ff in calltrap ()
at /var/src9/sys/amd64/amd64/exception.S:228
#9 0xffffffff8135924e in g_cache_read (sc=0xfffffe0008851800,
bp=0xfffffe019f8e2488)
at /var/src9/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:116
#10 0xffffffff813597f2 in g_cache_start (bp=0xfffffe019f8e2488)
at /var/src9/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:377
#11 0xffffffff804b82c4 in g_io_schedule_down (tp=Variable "tp" is not  
available.
)
at /var/src9/sys/geom/geom_io.c:632
#12 0xffffffff804b86af in g_down_procbody ()
at /var/src9/sys/geom/geom_kern.c:118
#13 0xffffffff804ee58a in fork_exit (
callout=0xffffffff804b8640 <g_down_procbody>, arg=0x0,
frame=0xffffff800007ac50) at /var/src9/sys/kern/kern_fork.c:885
#14 0xffffffff807b8bae in fork_trampoline ()
at /var/src9/sys/amd64/amd64/exception.S:569
#15 0x0000000000000000 in ?? ()
#16 0x0000000000000000 in ?? ()
#17 0x0000000000000001 in ?? ()
#18 0x0000000000000000 in ?? ()
#19 0x0000000000000000 in ?? ()





(kgdb) fr 9
#9 0xffffffff8135924e in g_cache_read (sc=0xfffffe0008851800,  
bp=0xfffffe019f8e2488) at  
/var/src9/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:116
116 TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used);
(kgdb) p sc->sc_usedlist
$1 = {tqh_first = 0xfffffe00a451e680, tqh_last = 0xfffffe00a451e6b8}
(kgdb) p dp
$2 = (struct g_cache_desc *) 0xfffffe00a451e680
(kgdb) p d_used
No symbol "d_used" in current context.

(kgdb) x 0xfffffe00a451e680
0xfffffe00a451e680: 0xdeadc0de
(kgdb) x 0xfffffe00a451e6b8
0xfffffe00a451e6b8: 0xdeadc0de
(kgdb) x dp
0xfffffe0008851800: 0x069cd300
(kgdb) x 0xfffffe019f8e2488
0xfffffe019f8e2488: 0x00000001

Comment 2 Mark Linimon freebsd_committer

2011-06-24 01:27:56 UTC

Responsible Changed
From-To: freebsd-amd64->freebsd-geom

Over to maintainer(s).

Comment 3 bsd 2011-09-08 12:57:57 UTC

Still affects on BETA1.

Thu Sep  8 13:58:49 EEST 2011

FreeBSD pp.fss.sunpp 9.0-BETA1 FreeBSD 9.0-BETA1 #0 r224843: Fri Aug 26
16:53:42 EEST 2011     root@pp.fss.sunpp:/usr/obj/usr/src/sys/PDC90  amd64

panic:

Unread portion of the kernel message buffer:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff8110a130
stack pointer	        = 0x28:0xffffff800024db40
frame pointer	        = 0x28:0xffffff800024db80
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 13 (g_down)
Dumping 1080 out of 4079
MB:..2%..11%..21%..32%..42%..51%..61%..72%..82%..91%


#6  0xffffffff807ee40d in trap_fatal (frame=0xffffff800024da90,
eva=Variable "eva" is not available.
)
       at /usr/src/sys/amd64/amd64/trap.c:800
#7  0xffffffff807ee9d5 in trap (frame=0xffffff800024da90)
       at /usr/src/sys/amd64/amd64/trap.c:616
#8  0xffffffff807d93bf in calltrap ()
       at /usr/src/sys/amd64/amd64/exception.S:228

#9  0xffffffff8110a130 in g_cache_read (sc=0xfffffe00097c7200,
       bp=0xfffffe00c5f75570)
       at
/usr/src/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:117

(kgdb) fr 9
#9  0xffffffff8110a130 in g_cache_read (sc=0xfffffe00097c7200,
bp=0xfffffe00c5f75570)
       at
/usr/src/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:117
117                     TAILQ_REMOVE(&sc->sc_usedlist, dp, d_used);

(kgdb) p sc->sc_usedlist
$2 = {tqh_first = 0xfffffe0083a7b980, tqh_last = 0xfffffe0083a7b9b8}

(kgdb) p *(struct g_cache_desc *) dp
$6 = {d_bno = -2401050962867404578, d_data = 0xdeadc0dedeadc0de <Address
0xdeadc0dedeadc0de out of bounds>, d_biolist = 0xdeadc0dedeadc0de,
     d_atime = -2401050962867404578, d_flags = -559038242, d_next = {le_next
= 0xdeadc0dedeadc0de, le_prev = 0xdeadc0dedeadc0de}, d_used = {
       tqe_next = 0xdeadc0dedeadc0de, tqe_prev = 0xdeadc0dedeadc0de}}

(kgdb) p d_used
No symbol "d_used" in current context.




#10 0xffffffff8110a6f2 in g_cache_start (bp=0xfffffe00c5f75570)
       at
/usr/src/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:378

#11 0xffffffff804c6d43 in g_io_schedule_down (tp=Variable "tp" is not
available.
)
       at /usr/src/sys/geom/geom_io.c:632

#12 0xffffffff804c71d2 in g_down_procbody (arg=Variable "arg" is not
available.
)
       at /usr/src/sys/geom/geom_kern.c:110
#13 0xffffffff804fe685 in fork_exit (
       callout=0xffffffff804c7160 <g_down_procbody>, arg=0x0,
       frame=0xffffff800024dc50) at /usr/src/sys/kern/kern_fork.c:941
#14 0xffffffff807d98ee in fork_trampoline ()
       at /usr/src/sys/amd64/amd64/exception.S:603

----

More detailed core.txt available at
http://bak.sg01.com/tmp/panic55a.txt

Comment 4 bsd 2011-09-08 13:11:46 UTC

more correct printing

(kgdb)  p *sc
$6 = {sc_geom = 0xfffffe00097bb300, sc_type = 1, sc_bshift = 19, sc_bsize  
= 524288, sc_tail = 33822343168, sc_mtx = {lock_object = {
       lo_name = 0xffffffff8110bd5a "GEOM CACHE mutex", lo_flags =  
16973824, lo_data = 0, lo_witness = 0xffffff8000651c80},
     mtx_lock = 18446741874726241376}, sc_callout = {c_links = {sle =  
{sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xffffff807d5412c0}},
     c_time = 267371554, c_arg = 0xfffffe00097c7200, c_func =  
0xffffffff81109950 <g_cache_go>, c_lock = 0xfffffe00097c7220, c_flags = 6,  
c_cpu = 0},
   sc_desclist = {{lh_first = 0xfffffe0083a7b300}, {lh_first =  
0xfffffe0102d0f900}, {lh_first = 0xfffffe004c53bc00}, {
       lh_first = 0xfffffe0083a7b400}, {lh_first = 0xfffffe0017725600},  
{lh_first = 0xfffffe00096c2480}, {lh_first = 0xfffffe0009814a00}, {
       lh_first = 0xfffffe0083a7b080}}, sc_usedlist = {tqh_first =  
0xfffffe0083a7b980, tqh_last = 0xfffffe0083a7b9b8}, sc_zone =  
0xfffffe000974dc80,
   sc_maxent = 100, sc_nent = 39, sc_nused = 1, sc_invalid = 0, sc_reads =  
300805, sc_readbytes = 2057474048, sc_cachereads = 4918,
   sc_cachereadbytes = 43359232, sc_cachehits = 3857, sc_cachemisses =  
78918, sc_cachefull = 77857, sc_writes = 380543, sc_wrotebytes =  
8069953536}

Comment 5 pdc 2011-10-05 08:29:12 UTC

Panics occur when more than one instance of gcache are active.

So, having 2 geom caches for /usr and /usr/obj, and making install of ports
with WRKDIRPREFIX=/usr/obj/work leads to panic, no matter which size of  
geom caches used.

But single active instance of gcache for /usr/obj seems to show no problems
("inactive" instances are present, but not used to mount filesystems).

Comment 6 bsd 2012-08-06 10:54:53 UTC

Increasing kern.geom.cache.timeout sysctl (from 10 to 30-40 secs) rises  
the panic risk.
Increasing the block size forces cache usage & overfill.
Problem mostly appears after number of CacheFull events, cache memory  
release and filling it again with read operations.

So sometimes such situation comes (Entries > Size!):

# gcache list usr0
Geom name: usr0
WroteBytes: 2001752576
Writes: 93866
CacheFull: 65
CacheMisses: 1387
CacheHits: 4120
CacheReadBytes: 22902272
CacheReads: 5442
ReadBytes: 27303424
Reads: 6262
InvalidEntries: 0
UsedEntries: 0
Entries: 1025
TailOffset: 21474705408
BlockSize: 131072
Size: 1024


Few more info from the fresh stable version:

-----

   Architecture: amd64
   Architecture Version: 1
   Dump Length: 833024B (0 MB)
   Blocksize: 512
   Dumptime: Mon Aug  6 04:54:04 2012
   Hostname: cel.home
   Magic: FreeBSD Text Dump
   Version String: FreeBSD 9.1-PRERELEASE #1 r239036M: Sat Aug  4 15:38:54  
EEST 2012
     root@cel.home:/usr/obj/usr/src/sys/PDC90
   Panic String: Bad tailq NEXT(0xfffffe0012be7ab8->tqh_last) != NULL
   Dump Parity: 3722464325
   Bounds: 148
   Dump Status: good



panic: Bad tailq NEXT(0xfffffe0012be7ab8->tqh_last) != NULL
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff802defca = db_trace_self_wrapper+0x2a
kdb_backtrace() at 0xffffffff805a52e7 = kdb_backtrace+0x37
panic() at 0xffffffff8056e088 = panic+0x1d8
g_cache_deliver() at 0xffffffff81038b46 = g_cache_deliver+0x276
g_cache_read() at 0xffffffff81038d13 = g_cache_read+0x1a3
g_cache_start() at 0xffffffff8103920e = g_cache_start+0x34e
g_io_schedule_down() at 0xffffffff80506c24 = g_io_schedule_down+0x1d4
g_down_procbody() at 0xffffffff805070a2 = g_down_procbody+0x72
fork_exit() at 0xffffffff80540565 = fork_exit+0x135
fork_trampoline() at 0xffffffff807fbeae = fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800027fcf0, rbp = 0 ---
KDB: enter: panic



db:1:on>  run lockinfo
db:2:lockinfo> show locks
exclusive sleep mutex GEOM CACHE mutex (GEOM CACHE mutex) r = 0  
(0xfffffe0012be7a20) locked @  
/usr/src/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:262
db:2:locks>  show alllocks
Process 77823 (perl5.14.2) thread 0xfffffe01b6d288e0 (131440)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe0214f6a0f8) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77822 (perl5.14.2) thread 0xfffffe0136bbf470 (100575)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe00b9a488f0) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77821 (perl5.14.2) thread 0xfffffe0184d158e0 (150562)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe00417a93a0) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77820 (perl5.14.2) thread 0xfffffe0118db5470 (149413)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe006bfa50f8) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77819 (perl5.14.2) thread 0xfffffe020654f000 (102636)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe00410abe40) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77818 (perl5.14.2) thread 0xfffffe02066428e0 (127656)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe01817cdb98) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 77817 (perl5.14.2) thread 0xfffffe0012ca1470 (100117)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe0172cf8b98) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 6250 (darktable) thread 0xfffffe01816de000 (100720)
exclusive sleep mutex pmap (pmap) r = 0 (0xfffffe01810a6a60) locked @  
/usr/src/sys/amd64/amd64/pmap.c:3533
exclusive rw pvh global (pvh global) r = 0 (0xffffffff80e4cc00) locked @  
/usr/src/sys/amd64/amd64/pmap.c:3532
exclusive sleep mutex vm object (standard object) r = 0  
(0xfffffe0181366000) locked @ /usr/src/sys/vm/vm_fault.c:1067
shared sx user map (user map) r = 0 (0xfffffe01810a69a8) locked @  
/usr/src/sys/vm/vm_fault.c:823
Process 6250 (darktable) thread 0xfffffe01816de470 (100718)
shared sx user map (user map) r = 0 (0xfffffe01810a69a8) locked @  
/usr/src/sys/vm/vm_fault.c:823
Process 3201 (mysqld) thread 0xfffffe00bec8e8e0 (133280)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe0181d6cb98) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 3201 (mysqld) thread 0xfffffe00410f2000 (100545)
exclusive sx so_rcv_sx (so_rcv_sx) r = 0 (0xfffffe018119c3a0) locked @  
/usr/src/sys/kern/uipc_sockbuf.c:148
Process 13 (geom) thread 0xfffffe0010177000 (100016)
exclusive sleep mutex GEOM CACHE mutex (GEOM CACHE mutex) r = 0  
(0xfffffe0012be7a20) locked @  
/usr/src/sys/modules/geom/geom_cache/../../../geom/cache/g_cache.c:262
db:2:alllocks>  show lockedvnods
Locked vnodes
db:1:lockedvnods>  show pcpu
cpuid        = 0
dynamic pcpu = 0x538a00
curthread    = 0xfffffe0010177000: pid 13 "g_down"
curpcb       = 0xffffff800027fd00
fpcurthread  = none
idlethread   = 0xfffffe001015d470: tid 100003 "idle: cpu0"
curpmap      = 0xffffffff80c30190
tssp         = 0xffffffff80e37fe0
commontssp   = 0xffffffff80e37fe0
rsp0         = 0xffffff800027fd00
gs32p        = 0xffffffff80e36138
ldt          = 0xffffffff80e36178
tss          = 0xffffffff80e36168
spin locks held:
db:1:pcpu>  bt
Tracing pid 13 tid 100016 td 0xfffffe0010177000
kdb_enter() at 0xffffffff805a4fab = kdb_enter+0x3b
panic() at 0xffffffff8056e081 = panic+0x1d1
g_cache_deliver() at 0xffffffff81038b46 = g_cache_deliver+0x276
g_cache_read() at 0xffffffff81038d13 = g_cache_read+0x1a3
g_cache_start() at 0xffffffff8103920e = g_cache_start+0x34e
g_io_schedule_down() at 0xffffffff80506c24 = g_io_schedule_down+0x1d4
g_down_procbody() at 0xffffffff805070a2 = g_down_procbody+0x72
fork_exit() at 0xffffffff80540565 = fork_exit+0x135
fork_trampoline() at 0xffffffff807fbeae = fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800027fcf0, rbp = 0 ---
db:1:bt>

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Comment 7 Ivan Klymenko 2013-11-12 10:07:39 UTC

I have a similar situation but on the 10.0-BETA3
http://docs.freebsd.org/cgi/mid.cgi?20131112112825.0be85aee

Comment 8 Eugene Grosbein 2015-03-02 08:25:48 UTC

Today I had the same panic with 8.4-STABLE and single gcache instance serving disk slice containing the system. Unfortunally, crashdump was not generated but I have screenshot with KDB_TRACE:

http://www.grosbein.net/img/panic-gcache.jpg

After reboot:

# gcache list
Geom name: ca0
WroteBytes: 0
Writes: 0
CacheFull: 0
CacheMisses: 20
CacheHits: 88
CacheReadBytes: 324096
CacheReads: 108
ReadBytes: 328192
Reads: 116
InvalidEntries: 0
UsedEntries: 9
Entries: 9
TailOffset: 500107182080
BlockSize: 131072
Size: 1024
Providers:
1. Name: cache/ca0
   Mediasize: 500107280896 (465G)
   Sectorsize: 512
   Mode: r0w0e0
Consumers:
1. Name: mirror/gm0s1
   Mediasize: 500107281408 (465G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 258048
   Mode: r0w0e0

Comment 9 Eitan Adler freebsd_committer

2018-05-28 19:45:32 UTC

batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.