Bug 145211 - [panic] Memory modified after free
Summary: [panic] Memory modified after free
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: sparc64
Version: 9.0-CURRENT
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-sparc64 (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-30 16:50 UTC by Nathaniel Filardo
Modified: 2011-12-12 18:59 UTC
CC List: 0 users

See Also:


Description Nathaniel Filardo 2010-03-30 16:50:03 UTC
Kernel panic.  No dump to disk was made.  Moreover, despite having KDB
turned on, the system did not drop to a db> prompt.

login: Memory modified after free 0xfffff80019f97000(2048) val=dead0003 @ 0xfffff80019f97000
Memory modified after free 0xfffff8000569f000(2048) val=dead0003 @ 0xfffff8000569f000       
Memory modified after free 0xfffff80005686800(2048) val=dead0003 @ 0xfffff80005686800       
Memory modified after free 0xfffff800056dd800(2048) val=dead0003 @ 0xfffff800056dd800       
Memory modified after free 0xfffff800054ba800(2048) val=dead0003 @ 0xfffff800054ba800       
Memory modified after free 0xfffff8000565b000(2048) val=dead0003 @ 0xfffff8000565b000       
Memory modified after free 0xfffff80005609800(2048) val=dead0003 @ 0xfffff80005609800       
Memory modified after free 0xfffff80005608000(2048) val=dead0003 @ 0xfffff80005608000       
Memory modified after free 0xfffff80005695800(2048) val=dead0003 @ 0xfffff80005695800       
Memory modified after free 0xfffff8000563e800(2048) val=dead0003 @ 0xfffff8000563e800       
Memory modified after free 0xfffff800055c2000(2048) val=dead0003 @ 0xfffff800055c2000       
Memory modified after free 0xfffff80019f77800(2048) val=dead0003 @ 0xfffff80019f77800       
Memory modified after free 0xfffff8001920b000(2048) val=dead0003 @ 0xfffff8001920b000       
Memory modified after free 0xfffff80019fae000(2048) val=dead0003 @ 0xfffff80019fae000       
Memory modified after free 0xfffff800055a6800(2048) val=dead0003 @ 0xfffff800055a6800       
Memory modified after free 0xfffff8000565e000(2048) val=dead0003 @ 0xfffff8000565e000       
Memory modified after free 0xfffff80005641800(2048) val=dead0003 @ 0xfffff80005641800       
Memory modified after free 0xfffff80005675000(2048) val=dead0003 @ 0xfffff80005675000       
Memory modified after free 0xfffff8000564c800(2048) val=dead0003 @ 0xfffff8000564c800       
panic: pcib: PCI bus B error AFAR 0 AFSR 0 PCI CSR 0x10730b2aff IOMMU 0x3060003 STATUS 0x2a0
cpuid = 1

On pcib bus B I seem to have the following devices:

pcib0: <Sun Host-PCI bridge> mem 0x4000ff00000-0x4000ff0afff,0x4000fc10000-0x4000fc1701f,0x7f600000000-0x7f6000000ff,0x4000ff80000-0x4000ff8ffff irq 2035,2032,2033,2036,2019 on nexus0
pcib0: Tomatillo, version 4, IGN 0x1f, bus B, 66MHz
pcib0: DVMA map: 0xc0000000 to 0xdfffffff 65536 entries
pci0: <OFW PCI bus> on pcib0
bge0: <Broadcom BCM5704 A3, ASIC rev. 0x002003> mem 0x200000-0x20ffff,0x110000-0x11ffff at device 2.0 on pci0
bge1: <Broadcom BCM5704 A3, ASIC rev. 0x002003> mem 0x400000-0x40ffff,0x120000-0x12ffff at device 2.1 on pci0
atapci0: <AcerLabs M5229 UDMA100 controller> port 0x900-0x907,0x918-0x91b,0x910-0x917,0x908-0x90b,0x920-0x92f at device 13.0 on pci1
atapci0: [ITHREAD]
atapci0: using PIO transfers above 137GB as workaround for 48bit DMA access bug, expect reduced performance

There's only a DVD drive attached to atapci0, and the driver for that is not loaded.

pcib3: <Sun Host-PCI bridge> mem 0x4000ef00000-0x4000ef0afff,0x4000ec10000-0x4000ec1701f,0x7c600000000-0x7c6000000ff,0x4000ef80000-0x4000ef8ffff irq 1907,1904,1905,1908,1893 on nexus0
pcib3: Tomatillo, version 4, IGN 0x1d, bus B, 66MHz
pcib3: DVMA map: 0xc0000000 to 0xdfffffff 65536 entries
pci3: <OFW PCI bus> on pcib3
bge2: <Broadcom BCM5704 A3, ASIC rev. 0x002003> mem 0x200000-0x20ffff,0x110000-0x11ffff at device 2.0 on pci3
bge3: <Broadcom BCM5704 A3, ASIC rev. 0x002003> mem 0x400000-0x40ffff,0x120000-0x12ffff at device 2.1 on pci3
atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3
ata8: <ATA channel 4> on atapci1
ata9: <ATA channel 5> on atapci1
ata10: <ATA channel 6> on atapci1
ata11: <ATA channel 7> on atapci1
ad0: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master UDMA100 SATA 3Gb/s
ad1: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master UDMA100 SATA 3Gb/s
ad2: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master UDMA100 SATA 3Gb/s
ad3: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master UDMA100 SATA 3Gb/s

These four disks form a RAIDZ.

Kernel configuration options that seem relevant:

options         SMP
options         KDB
options         INVARIANTS
options         INVARIANT_SUPPORT
options         WITNESS
options         WITNESS_SKIPSPIN
device          ata
device          atadisk
nodevice        atapicd
nodevice        atapifd
nodevice        atapist
device          atamarvell
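
(For context: the "Memory modified after free" lines come from UMA's
INVARIANTS-time trash checking, which poisons freed items and verifies the
poison when an item is handed out again.  The following is a simplified
userland sketch of that idea, not the actual sys/vm/uma_dbg.c code; the
0xdeadc0de poison value is recalled from the FreeBSD sources rather than
quoted from the tree.)

#include <stdint.h>
#include <stdio.h>

#define UMA_JUNK        0xdeadc0deU     /* assumed poison value */

/* On free: fill the item with the poison pattern. */
static void
trash_fill(void *mem, int size)
{
        uint32_t *p = mem;
        int i;

        for (i = 0; i < size / (int)sizeof(*p); i++)
                p[i] = UMA_JUNK;
}

/*
 * On the next allocation: any word that has lost the poison was written
 * after free, so complain in the format seen in the log above.
 */
static void
trash_check(void *mem, int size)
{
        uint32_t *p = mem;
        int i;

        for (i = 0; i < size / (int)sizeof(*p); i++)
                if (p[i] != UMA_JUNK)
                        printf("Memory modified after free %p(%d) val=%x @ %p\n",
                            mem, size, p[i], (void *)&p[i]);
}

int
main(void)
{
        uint32_t item[4];

        trash_fill(item, sizeof(item));
        item[2] = 0xdead0003;           /* simulate a write after free */
        trash_check(item, sizeof(item));
        return (0);
}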

What more would be useful to know?

How-To-Repeat: Unknown; the crash has happened twice so far, once with a kernel from
January after weeks of uptime and once with a kernel from yesterday after
only a few hours.  The system routinely survives multiple zfs scrubs of
the four disks hanging off of pci3, so if it's an ATA bug it's a funny one.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2010-03-30 22:31:03 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-sparc64

Might be specific to sparc64.
Comment 2 Nathaniel Filardo 2010-03-31 19:49:40 UTC
It occurs to me to add that at least the second crash was correlated with a
burst of traffic on bge2, which usually sits idle.  FWIW, bge0 and bge3 are
typically busy, and bge1 is not connected.  Is it possible that this is a
bge bug?  I'll be recreating the busy-bge2 scenario to test other things
anyway and will report should it trigger a panic again.

While I'm recovering from filing an underinformative bug report, I'll note
that the machine is a Sun Fire V210 (with 2 GB of RAM and two 1 GHz CPUs).
Anything else that would help?

--nwf;
Comment 3 Anton Shterenlikht 2010-03-31 20:05:26 UTC
On Wed, Mar 31, 2010 at 06:50:12PM +0000, Nathaniel W Filardo wrote:
>  It occurs to me to add that at least the second crash was correlated with a
>  burst of traffic on bge2, which usually sits idle.  FWIW, bge0 and bge3 are
>  typically busy, and bge1 is not connected.  Is it possible that this is a
>  bge bug?  I'll be recreating the busy-bge2 scenario to test other things
>  anyway and will report should it trigger a panic again.

FWIW I've had this twice on ia64 -current.
It also seems to follow bge activity,
but not sure about the "bursts":

http://seis.bris.ac.uk/~mexas/freebsd/ia64/rx2600/tzav/messages


-- 
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423
Comment 4 marius 2010-04-01 12:23:59 UTC
>  
> Memory modified after free 0xfffff80005675000(2048) val=dead0003 @ 0xfffff80005675000
> Memory modified after free 0xfffff8000564c800(2048) val=dead0003 @ 0xfffff8000564c800
> panic: pcib: PCI bus B error AFAR 0 AFSR 0 PCI CSR 0x10730b2aff IOMMU 0x3060003 STATUS 0x2a0

This is the IOMMU reporting an error as STX_PCI_CTRL_MMU_ERR is set in
the PCI CSR and TOM_PCI_IOMMU_ERR is set in the IOMMU CSR. Moreover, the
TOM_PCI_IOMMU_INVALID_ERR set in the latter suggests that a DMA buffer
was used after it had been unloaded (and thus the TTE invalidated). So
it's quite likely that both the UMA and the IOMMU complaints are caused
by the same problem. Unfortunately, neither allows the culprit to be
identified. If you could move the traffic in question from bge2 to bge1
and either use r206020 or apply the following patch, that should allow
at least the driver involved, i.e. ata(4) or bge(4), to be identified,
as either would additionally indicate whether pcib0 or pcib3 triggered
the panic.
http://people.freebsd.org/~marius/psycho_schizo_device_get_nameunit.diff
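
(To illustrate the lifecycle Marius describes in bus_dma(9) terms, here is a
generic sketch; it is not code from ata(4) or bge(4), the function name and
parameters are made up for illustration, and tag creation, the load callback
and error handling are elided.)

#include <sys/param.h>
#include <sys/bus.h>
#include <machine/bus.h>

static void
dma_lifecycle_sketch(bus_dma_tag_t tag, bus_dmamap_t map, void *buf,
    bus_size_t len, bus_dmamap_callback_t *cb, void *cb_arg)
{
        /* Loading the map programs IOMMU TTEs covering the buffer (DVMA). */
        bus_dmamap_load(tag, map, buf, len, cb, cb_arg, BUS_DMA_NOWAIT);

        /* ... the device performs DMA using the DVMA address ... */

        bus_dmamap_sync(tag, map, BUS_DMASYNC_POSTREAD);

        /* Unloading invalidates the TTEs; the DVMA address is now dead. */
        bus_dmamap_unload(tag, map);

        /*
         * If the device still holds the old DVMA address (e.g. via a stale
         * descriptor) and DMAs again after the unload, the Tomatillo hits an
         * invalid TTE and raises TOM_PCI_IOMMU_INVALID_ERR, i.e. the
         * "PCI bus B error" panic above.  The same class of buffer
         * use-after-free would also explain the UMA "Memory modified after
         * free" complaints.
         */
}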

Which version of if_bge.c were you running when the panic occurred?

Marius
Comment 5 Nathaniel Filardo 2010-04-01 16:52:48 UTC
On Thu, Apr 01, 2010 at 01:23:59PM +0200, Marius Strobl wrote:
> This is the IOMMU reporting an error as STX_PCI_CTRL_MMU_ERR is set in
> the PCI CSR and TOM_PCI_IOMMU_ERR is set in the IOMMU CSR. Moreover, the
> TOM_PCI_IOMMU_INVALID_ERR set in the latter suggests that a DMA buffer
> was used after it had been unloaded (and thus the TTE invalidated). So
> it's quite likely that both the UMA and the IOMMU complaints are caused
> by the same problem. Unfortunately, neither allows the culprit to be
> identified.

Thank you for decoding that for me.

> If you could move the traffic in question from bge2 to bge1
> and either use r206020 or apply the following patch, that should allow
> at least the driver involved, i.e. ata(4) or bge(4), to be identified,
> as either would additionally indicate whether pcib0 or pcib3 triggered
> the panic.
> http://people.freebsd.org/~marius/psycho_schizo_device_get_nameunit.diff

Just csup'd and am now rebuilding; will let you know.

> Which version of if_bge.c were you running when the panic occurred?

$FreeBSD: src/sys/dev/bge/if_bge.c,v 1.284 2010/03/25 17:17:35 yongari Exp $
Comment 6 Marius Strobl freebsd_committer freebsd_triage 2010-06-13 14:56:00 UTC
State Changed
From-To: open->feedback

This is expected to be fixed by r208862 (r208995 in stable/7, r208993 in 
stable/8), especially if a high number of if_iqdrops were seen. Could 
you please re-test with that revision in place?
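
(One way to watch if_iqdrops while re-testing is from userland via
getifaddrs(3) and struct if_data; a minimal sketch follows, with "bge2" only
as an example interface name.)

#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        struct ifaddrs *ifap, *ifa;

        if (getifaddrs(&ifap) != 0)
                return (1);
        for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
                /* AF_LINK entries carry the per-interface statistics. */
                if (ifa->ifa_addr == NULL ||
                    ifa->ifa_addr->sa_family != AF_LINK ||
                    strcmp(ifa->ifa_name, "bge2") != 0)
                        continue;
                printf("%s if_iqdrops: %lu\n", ifa->ifa_name,
                    ((struct if_data *)ifa->ifa_data)->ifi_iqdrops);
        }
        freeifaddrs(ifap);
        return (0);
}
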
Comment 7 Marius Strobl freebsd_committer freebsd_triage 2011-12-12 18:58:46 UTC
State Changed
From-To: feedback->closed

Close due to feedback timeout.