161768 – [ahci] [panic] Panics after AHCI timeouts

Bug 161768 - [ahci] [panic] Panics after AHCI timeouts

Summary: [ahci] [panic] Panics after AHCI timeouts

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	9.0-BETA2
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2011-10-18 11:50 UTC by Alexey Shuvaev
Modified:	2022-10-17 12:19 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Alexey Shuvaev 2011-10-18 11:50:04 UTC

For some time (around half a year actually :) under some heavy disk load
my desktop panics. The panics are not 100% reproducible, two situations
which could lead to a panic are:
 - svn up in /usr/src
 - last stage of openoffice building (I think, during the packaging stage)

Updating the sources is less probable to panic the system (I would give
~30% of probability), but building OOO is close to 100% to cause the panic.

The root cause seems to be the Samsung hard drive in use -
it times out on some NCQ slots under heavy load.
Same problem is reported, for example here:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/157397

Earlier this year (I would say March - May), AHCI handled these timeouts,
at least to the point when the disk (and, possibly, the controlled too)
were completely wedged (only power cycling had brought them back again,
reset was not sufficient). From ~June, however, the system paniced
right after the first timeout. So it is this panic immediately after ahci
timeout which could be hopefully fixed.

Some info about the system:

 From dmesg.boot:
 [snip]
 ahci0: <ATI IXP700 AHCI SATA controller> port
 +0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem
 +0xfe5ffc00-0xfe5fffff irq 19 at device 17.0 on pci0
 ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
 ahcich0: <AHCI channel> at channel 0 on ahci0
 ahcich1: <AHCI channel> at channel 1 on ahci0
 ahcich2: <AHCI channel> at channel 2 on ahci0
 ahcich3: <AHCI channel> at channel 3 on ahci0
 ahcich4: <AHCI channel> at channel 4 on ahci0
 ahcich5: <AHCI channel> at channel 5 on ahci0
 [snip]
 ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
 ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
 ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
 ada0: Command Queueing enabled
 ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
 ada0: Previously was known as ad4
 [snip]
 
 ~> cat /etc/rc.local
 #!/bin/sh
 ddb script kdb.enter.panic="capture on; show pcpu; trace; ps; show locks;
 +alltrace; show alllocks; show lockedvnods; call doadump; reset"
 sysctl debug.debugger_on_panic=0
 
 I have two cores:
 ~> ll /var/crash/
 total 2628524
 -rw-r--r--  1 root  wheel           2 Oct  7 15:12 bounds
 -rw-r-----  1 root  wheel      224897 Aug  5 18:07 core.txt.3
 -rw-r-----  1 root  wheel      151478 Oct  7 15:13 core.txt.5
 -rw-r-----  1 root  wheel         475 Aug  5 18:06 info.3
 -rw-r-----  1 root  wheel         478 Oct  7 15:12 info.5
 -rw-r--r--  1 root  wheel           5 Jan 26  2011 minfree
 -rw-r-----  1 root  wheel  1390616576 Aug  5 18:07 vmcore.3
 -rw-r-----  1 root  wheel  1371832320 Oct  7 15:13 vmcore.5
 
 However, I do not have the old kernel for the core N 3 anymore
 (but the current kernel is the one which produced core N 5).
 If it is needed I can provide any information from the core N 5
 (and core N 3, if it is still possible), test patches, etc.
 
 
 From core.txt.3:
 [snip]
 Unread portion of the kernel message buffer:
 
 Fatal trap 9: general protection fault while in kernel mode
 cpuid = 0; apic id = 00
 instruction pointer     = 0x20:0xffffffff8092988e
 stack pointer           = 0x28:0xffffff8000256a80
 frame pointer           = 0x28:0xffffff8000256aa0
 code segment            = base rx0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
 processor eflags        = interrupt enabled, resume, IOPL = 0
 current process         = 12 (swi4: clock)
 trap number             = 9
 panic: general protection fault
 cpuid = 0
 VNASSERT failed
 Uptime: 0xfffffe0073591780: 3dtag ufs, type VDIR
 7h59m    usecount 1, writecount 0, refcount 4 mountedhere 0
 45s
     flags ()
     v_object 0xfffffe00888501b0 ref 0 pages 1
     lock type ufs: UNLOCKED
         ino 2709060, on dev ada0p6
 VNASSERT failed
 0xfffffe0073591780: tag ufs, type VDIR
     usecount 1, writecount 0, refcount 4 mountedhere 0
     flags ()
     v_object 0xfffffe00888501b0 ref 0 pages 1
     lock type ufs: UNLOCKED
         ino 2709060, on dev ada0p6
 Dumping 1326 out of 7914 MB:..2%..11%..21%..31%..42%..51%..61%..72%..81%..91%
 [snip]
 #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252
 252             if (textdump && textdump_pending) {
 (kgdb) #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252
 #1  0xffffffff8081f3ca in kern_reboot (howto=260)
     at /usr/src/sys/kern/kern_shutdown.c:430
 #2  0xffffffff8081ee61 in panic (fmt=Variable "fmt" is not available.
 )
     at /usr/src/sys/kern/kern_shutdown.c:595
 #3  0xffffffff80b04b70 in trap_fatal (frame=0x9, eva=Variable "eva" is not
 +available.
 )
     at /usr/src/sys/amd64/amd64/trap.c:805
 #4  0xffffffff80b05145 in trap (frame=0xffffff80002569d0)
     at /usr/src/sys/amd64/amd64/trap.c:616
 #5  0xffffffff80aef91f in calltrap ()
     at /usr/src/sys/amd64/amd64/exception.S:228
 #6  0xffffffff8092988e in igmp_slowtimo () at /usr/src/sys/netinet/igmp.c:2088
 #7  0xffffffff8088648b in pfslowtimo (arg=0xffffffff8130b440)
     at /usr/src/sys/kern/uipc_domain.c:518
 #8  0xffffffff80832e0a in softclock (arg=Variable "arg" is not available.
 )
     at /usr/src/sys/kern/kern_timeout.c:564
 #9  0xffffffff807f6676 in intr_event_execute_handlers (p=Variable "p" is not
 +available.
 )
     at /usr/src/sys/kern/kern_intr.c:1257
 #10 0xffffffff807f74b2 in ithread_loop (arg=0xfffffe0005144be0)
     at /usr/src/sys/kern/kern_intr.c:1270
 #11 0xffffffff807f3b85 in fork_exit (
     callout=0xffffffff807f7400 <ithread_loop>, arg=0xfffffe0005144be0,
     frame=0xffffff8000256c50) at /usr/src/sys/kern/kern_fork.c:941
 #12 0xffffffff80aefe4e in fork_trampoline ()
     at /usr/src/sys/amd64/amd64/exception.S:603
 [snip]
 [last message in dmesg]
 ahcich0: Timeout on slot 29 port 0
 ahcich0: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 40 serr 00000000
 +cmd 0000fc17
 [snip]
 
 
 From core.txt.5:
 [snip]
 Unread portion of the kernel message buffer:
 Memory modified after free 0xfffffe000416e200(248) val=79e8800 @
 +0xfffffe000416e200
 panic: Most recently used by cred
 
 cpuid = 2
 Uptime: 20h11m1s
 Dumping 1308 out of 7914 MB:..2%..12%..21%..31%..41%..51%..62%..71%..81%..91%
 [snip]
 #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252
 252             if (textdump && textdump_pending) {
 (kgdb) #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:252
 #1  0xffffffff808234aa in kern_reboot (howto=260)
     at /usr/src/sys/kern/kern_shutdown.c:430
 #2  0xffffffff80822f41 in panic (fmt=Variable "fmt" is not available.
 )
     at /usr/src/sys/kern/kern_shutdown.c:595
 #3  0xffffffff80a6f7b4 in mtrash_ctor (mem=Variable "mem" is not available.
 ) at /usr/src/sys/vm/uma_dbg.c:137
 #4  0xffffffff80a6f01c in uma_zalloc_arg (zone=0xfffffe021ffe0700, udata=0x0,
     flags=258) at /usr/src/sys/vm/uma_core.c:2018
 #5  0xffffffff808108be in malloc (size=Variable "size" is not available.
 ) at uma.h:305
 #6  0xffffffff8081c21f in crget () at /usr/src/sys/kern/kern_prot.c:1809
 #7  0xffffffff8081c269 in crdup (cr=0xfffffe0143103300)
     at /usr/src/sys/kern/kern_prot.c:1911
 #8  0xffffffff808c5ca6 in kern_accessat (td=0xfffffe0007dd7000, fd=-100,
     path=0x80065c000 <Address 0x80065c000 out of bounds>,
     pathseg=UIO_USERSPACE, flags=Variable "flags" is not available.
 ) at /usr/src/sys/kern/vfs_syscalls.c:2201
 #9  0xffffffff8086719a in syscallenter (td=0xfffffe0007dd7000,
     sa=0xffffff8223f67bb0) at /usr/src/sys/kern/subr_trap.c:344
 #10 0xffffffff80b0b43c in syscall (frame=0xffffff8223f67c50)
     at /usr/src/sys/amd64/amd64/trap.c:910
 #11 0xffffffff80af617d in Xfast_syscall ()
     at /usr/src/sys/amd64/amd64/exception.S:384
 #12 0x000000080062dbdc in ?? ()
 Previous frame inner to this frame (corrupt stack?)
 [snip]
 [last message in dmesg]
 ahcich0: Timeout on slot 29 port 0
 ahcich0: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 40 serr 00000000 cm
 d 0000fc17
 [snip]

How-To-Repeat: I don't know, if it is chipset specific problem, or AHCI timeout recovery
path is generally broken. At least one can try to provoke recoverable
AHCI timeout and see, if the system panics.

Comment 1 Armin Pirkovitsch 2011-10-18 14:53:52 UTC

same problem here:
machine 1:
ahci0: <JMicron JMB363 AHCI SATA controller> mem 0xfbcfe000-0xfbcfffff 
irq 16 at device 0.0 on pci4
ahci0: AHCI v1.00 with 2 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahci1: <Intel ICH10 AHCI SATA controller> port 
0x9c00-0x9c07,0x9880-0x9883,0x9800-0x9807,0x9480-0x9483,0x9400-0x941f 
mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0
ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich4: <AHCI channel> at channel 2 on ahci1
ahcich5: <AHCI channel> at channel 3 on ahci1
ahcich6: <AHCI channel> at channel 4 on ahci1
ahcich7: <AHCI channel> at channel 5 on ahci1
ada0 at ahcich2 bus 0 scbus4 target 0 lun 0
ada0: <Corsair Force 3 SSD 1.3> ATA-8 SATA 3.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad10
ada1 at ahcich3 bus 0 scbus5 target 0 lun 0
ada1: <SAMSUNG HD154UI 1AG01118> ATA-7 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad12
ada2 at ahcich4 bus 0 scbus6 target 0 lun 0
ada2: <SAMSUNG HD154UI 1AG01118> ATA-7 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad14

machine 2:
ahci0: <Intel 5 Series/3400 Series AHCI SATA controller> port 
0x5058-0x505f,0x5084-0x5087,0x5050-0x5057,0x5080-0x5083,0x5020-0x503f 
mem 0xb7806000-0xb78067ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Corsair Force 3 SSD 1.3> ATA-8 SATA 3.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 114473MB (234441648 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4

I doubt it is a Samsung problem (same problem on my Corsair SSDs) and my 
assumption that it is an ata-intel driver problem seems to be wrong as 
well (as you have a non-intel sata controller)

Comment 2 Alexander Motin freebsd_committer

2011-10-18 16:48:53 UTC

What do you mean by the "same problem"?

Command timeouts -- they could have million different reasons
(controllers, disks, firmwares, cables, power, radio interference, ...)
and just yesterday I've promised you to experiment more with error recovery.

Panics -- your backtraces look completely different from reported in
this PR and I have already comment to your that panic happens in file
system code. Ask file system people please.

-- 
Alexander Motin

Comment 3 Armin Pirkovitsch 2011-10-18 17:03:03 UTC

In both cases first the disk stops to reply and then the system dies 
somehow - sounds similar to me.
Imho the real problem is not the panic but the stuff that happens before 
- the panic is just the result of some operation which was unable to 
finish cleanly and therefor panics (in my case the fs which was unable 
to finish writing).

And the same ahcich0: ... stuff is happening prior to the panic in this 
pr - that's why I saw some relevance
(See Alexey's mail to current@ "Re: Panics after AHCI timeouts" at 15:13 
(+2) 18.Oct.2011  )

Comment 4 Eitan Adler freebsd_committer

2017-12-31 08:01:19 UTC

For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped

Comment 5 Graham Perrin freebsd_committer

2022-10-17 12:19:17 UTC

Keyword: 

    crash

– in lieu of summary line prefix: 

    [panic]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>