Bug 253237 - sys/netgraph/ng_macfilter_test:main test panics on RISC-V QEMU
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: riscv Any
Importance: --- Affects Only Me
Assignee: Mitchell Horne
URL: https://reviews.freebsd.org/D29064
Reported: 2021-02-03 22:11 UTC by Alex Richardson
Modified: 2021-03-08 23:07 UTC

Description Alex Richardson freebsd_committer freebsd_triage 2021-02-03 22:11:41 UTC
Running `cd /usr/tests/ && kyua test` on a RISC-V image built today (commit 65618fdda0f272a823e6701966421bdca0efa301) results in the following panic:

sys/netgraph/ng_macfilter_test:main  ->  WARNING: attempt to domain_add(netgraph) after domainfinalize()
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex ng_node (ng_node) r = 0 (0xffffffd002139c70) locked @ /local/scratch/alr48/cheri/freebsd/sys/netgraph/ng_base.c:2325
stack backtrace:
#0 0xffffffc00031de5a at witness_checkorder+0xe78
#1 0xffffffc00031ef58 at witness_warn+0x3f4
#2 0xffffffc00054619a at do_trap_supervisor+0x3a8
#3 0xffffffc000545e56 at do_trap_supervisor+0x64
#4 0xffffffc000536718 at cpu_exception_handler_supervisor+0x68
t[0] == 0xffffffd07fd97280
t[1] == 0xffffffc0a5552b6c
t[2] == 0xffffffc0a4d6b1d0
t[3] == 0xffffffc0002999de
t[4] == 0x0000000000000000
t[5] == 0xffffffc0a60c52b0
t[6] == 0x0000000000000001
s[0] == 0x0000000000000000
s[1] == 0xffffffd002139c88
s[2] == 0xffffffd002139c00
s[3] == 0x0000000000001000
s[4] == 0xffffffd002139c68
s[5] == 0x00000000ffffffff
s[6] == 0xffffffd002139c00
s[7] == 0x0000000000000001
s[8] == 0x0000000000004f68
s[9] == 0xffffffd0268a03b0
s[10] == 0xffffffd001448cf0
s[11] == 0x0000000000000000
a[0] == 0x0000000000000002
a[1] == 0x0000000000000000
a[2] == 0x0000000000000000
a[3] == 0x0000000000100000
a[4] == 0x0000000000000000
a[5] == 0xffffffc0007d0f68
a[6] == 0xffffffd07fd8d380
a[7] == 0x0000000000000027
ra == 0xffffffc0a554df30
sp == 0xffffffc0a60c5870
gp == 0x0000000000000000
tp == 0xffffffd026ddf980
sepc == 0xffffffc0a554df3e
sstatus == 0x8000000200006120
panic: Fatal page fault at 0xffffffc0a554df3e: 0000000000000000
cpuid = 0
time = 1612390054
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_fetch_ksymtab() at db_fetch_ksymtab+0x15c
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x148
panic() at panic+0x26
do_trap_supervisor() at do_trap_supervisor+0x500
do_trap_supervisor() at do_trap_supervisor+0x64
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x68
--- exception 13, tval = 0
KDB: enter: panic
[ thread pid 90029 tid 103721 ]
Stopped at      kdb_enter+0x44: sd      zero,0(a0)
db>
Comment 1 Alex Richardson freebsd_committer freebsd_triage 2021-02-03 22:27:41 UTC
Seems to be reproducible. I rebuilt the image and ran only the netgraph tests:

```
To change this login announcement, see motd(5).
root@freebsd-riscv64:~ # cd /usr/tests/
root@freebsd-riscv64:/usr/tests # kyua test sys/netgraph/
sys/netgraph/ng_macfilter_test:main  ->  WARNING: attempt to domain_add(netgraph) after domainfinalize()
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex ng_node (ng_node) r = 0 (0xffffffd002139a70) locked @ /local/scratch/alr48/cheri/freebsd/sys/netgraph/ng_base.c:2325
stack backtrace:
#0 0xffffffc00031de56 at witness_checkorder+0xe78
#1 0xffffffc00031ef54 at witness_warn+0x3f4
#2 0xffffffc00054619a at do_trap_supervisor+0x3a8
#3 0xffffffc000545e56 at do_trap_supervisor+0x64
#4 0xffffffc000536718 at cpu_exception_handler_supervisor+0x68
t[0] == 0xffffffd07fd97280
t[1] == 0xffffffc0a4e0fb6c
t[2] == 0xffffffc098c77750
t[3] == 0xffffffc0002999d6
t[4] == 0x0000000000000000
t[5] == 0xffffffc0985482b0
t[6] == 0x0000000000000001
s[0] == 0x0000000000000000
s[1] == 0xffffffd002139a88
s[2] == 0xffffffd002139a00
s[3] == 0x0000000000001000
s[4] == 0xffffffd002139a68
s[5] == 0x00000000ffffffff
s[6] == 0xffffffd002139a00
s[7] == 0x0000000000000001
s[8] == 0x0000000000004f68
s[9] == 0xffffffd009ea3b10
s[10] == 0xffffffd001448d70
s[11] == 0x0000000000000000
a[0] == 0x0000000000000002
a[1] == 0x0000000000000000
a[2] == 0x0000000000000000
a[3] == 0x0000000000100000
a[4] == 0x0000000000000000
a[5] == 0xffffffc0007d6a40
a[6] == 0xffffffd07fd8d380
a[7] == 0x0000000000000027
ra == 0xffffffc0a4e0af30
sp == 0xffffffc098548870
gp == 0x0000000000000000
tp == 0xffffffd009e04780
sepc == 0xffffffc0a4e0af3e
sstatus == 0x8000000200006120
panic: Fatal page fault at 0xffffffc0a4e0af3e: 0000000000000000
cpuid = 0
time = 1612391189
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_fetch_ksymtab() at db_fetch_ksymtab+0x15c
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x148
panic() at panic+0x26
do_trap_supervisor() at do_trap_supervisor+0x500
do_trap_supervisor() at do_trap_supervisor+0x64
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x68
--- exception 13, tval = 0
KDB: enter: panic
[ thread pid 781 tid 100044 ]
Stopped at      kdb_enter+0x44: sd      zero,0(a0)
db>
```

I start QEMU with:

$ /local/scratch/alr48/cheri/output/sdk/bin/qemu-system-riscv64cheri -M virt -m 2048 -nographic -bios default \
    -kernel /local/scratch/alr48/cheri/output/freebsd-riscv64/boot/kernel/kernel \
    -drive if=none,file=/local/scratch/alr48/cheri/output/freebsd-riscv64.img,id=drv,format=raw \
    -device virtio-blk-device,drive=drv \
    -device virtio-net-device,netdev=net0 \
    -netdev 'user,id=net0,ipv6=off'
Comment 2 Mitchell Horne freebsd_committer freebsd_triage 2021-02-04 18:13:49 UTC
(In reply to Alex Richardson from comment #1)

Within QEMU, I can reproduce the panic with only the following command:

$ ngctl mkpeer vtnet0: macfilter lower ether

I have not yet had the chance to debug much further. My guess is that the netgraph module has been subtly broken on the platform all along, and r368443 was just the first to add tests for it.
Comment 3 Lutz Donnerhacke freebsd_committer freebsd_triage 2021-02-05 09:51:28 UTC
Could you point me to a link describing how to set up a test environment for this type of CPU?
Comment 4 Alex Richardson freebsd_committer freebsd_triage 2021-02-05 10:11:19 UTC
If you have space in your home directory, the fastest way would be:

$ git clone https://github.com/CTSRD-CHERI/cheribuild
$ cd cheribuild
$ ./cheribuild.py qemu build-and-run-freebsd-riscv64
  .... wait for the git clone + build to complete
  .... You should now have a QEMU login prompt
root
# cd /usr/tests && kyua test sys/netgraph/
Comment 5 Mitchell Horne freebsd_committer freebsd_triage 2021-03-04 18:10:39 UTC
I think I've narrowed this down to a pretty unsatisfying bug. See the linked review.
Comment 6 commit-hook freebsd_committer freebsd_triage 2021-03-04 21:01:03 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0d3b3beeb253e09b2b6b3805065594aecc7e2c2f

commit 0d3b3beeb253e09b2b6b3805065594aecc7e2c2f
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-03-04 17:52:45 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-03-04 20:59:58 +0000

    riscv: fix errors in some atomic type aliases

    This appears to be a copy-and-paste error that has simply been
    overlooked. The tree contains only two calls to any of the affected
    variants, but recent additions to the test suite started exercising the
    call to atomic_clear_rel_int() in ng_leave_write(), reliably causing
    panics.

    Apparently, the issue was inherited from the arm64 atomic header. That
    instance was addressed in c90baf6817a0, but the fix did not make its way
    to RISC-V.

    Note that the particular test case ng_macfilter_test:main still appears
    to fail on this platform, but this change reduces the panic to a
    timeout.

    PR:             253237
    Reported by:    Jenkins, arichardson
    Reviewed by:    kp, arichardson
    MFC after:      3 days
    Differential Revision:  https://reviews.freebsd.org/D29064

 sys/riscv/include/atomic.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2021-03-08 14:04:59 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cc24f5bc6f6eb56a959bd23ebb051d3bf6ebf670

commit cc24f5bc6f6eb56a959bd23ebb051d3bf6ebf670
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-03-04 17:52:45 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-03-08 14:03:01 +0000

    riscv: fix errors in some atomic type aliases

    This appears to be a copy-and-paste error that has simply been
    overlooked. The tree contains only two calls to any of the affected
    variants, but recent additions to the test suite started exercising the
    call to atomic_clear_rel_int() in ng_leave_write(), reliably causing
    panics.

    Apparently, the issue was inherited from the arm64 atomic header. That
    instance was addressed in c90baf6817a0, but the fix did not make its way
    to RISC-V.

    Note that the particular test case ng_macfilter_test:main still appears
    to fail on this platform, but this change reduces the panic to a
    timeout.

    PR:             253237
    Reported by:    Jenkins, arichardson
    Reviewed by:    kp, arichardson

    (cherry picked from commit 0d3b3beeb253e09b2b6b3805065594aecc7e2c2f)

 sys/riscv/include/atomic.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
Comment 8 commit-hook freebsd_committer freebsd_triage 2021-03-08 23:05:35 UTC
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=03572a87a84cde47f672480d3c5485713b7c39fb

commit 03572a87a84cde47f672480d3c5485713b7c39fb
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-03-04 17:52:45 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-03-08 23:04:25 +0000

    riscv: fix errors in some atomic type aliases

    This appears to be a copy-and-paste error that has simply been
    overlooked. The tree contains only two calls to any of the affected
    variants, but recent additions to the test suite started exercising the
    call to atomic_clear_rel_int() in ng_leave_write(), reliably causing
    panics.

    Apparently, the issue was inherited from the arm64 atomic header. That
    instance was addressed in c90baf6817a0, but the fix did not make its way
    to RISC-V.

    Note that the particular test case ng_macfilter_test:main still appears
    to fail on this platform, but this change reduces the panic to a
    timeout.

    PR:             253237
    Reported by:    Jenkins, arichardson
    Reviewed by:    kp, arichardson
    Approved by:    re (gjb)

    (cherry picked from commit 0d3b3beeb253e09b2b6b3805065594aecc7e2c2f)
    (cherry picked from commit cc24f5bc6f6eb56a959bd23ebb051d3bf6ebf670)

 sys/riscv/include/atomic.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)