Bug 250361 - Kernel trap #9 in sys_semop
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Konstantin Belousov
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2020-10-15 08:56 UTC by Olef
Modified: 2020-10-29 11:24 UTC
Description Olef 2020-10-15 08:56:48 UTC
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff80c4bff0
stack pointer           = 0x28:0xfffffe0013690870
frame pointer           = 0x28:0xfffffe0013690870
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 41551 (copy_db)
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1602757996
KDB: stack backtrace:
#0 0xffffffff80c1d297 at kdb_backtrace+0x67
#1 0xffffffff80bd05cd at vpanic+0x19d
#2 0xffffffff80bd0423 at panic+0x43
#3 0xffffffff810a7d2c at trap_fatal+0x39c
#4 0xffffffff810a713c at trap+0x6c
#5 0xffffffff81081a0c at calltrap+0x8
#6 0xffffffff80c4bb99 at sys_semop+0x729
#7 0xffffffff810a88e4 at amd64_syscall+0x364
#8 0xffffffff81082330 at fast_syscall_common+0x101


Backtrace from kgdb:
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bd01c8 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bd0629 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:877
#4  0xffffffff80bd0423 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:804
#5  0xffffffff810a7d2c in trap_fatal (frame=0xfffffe00136907b0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810a713c in trap (frame=0xfffffe00136907b0) at /usr/src/sys/amd64/amd64/trap.c:221
#7  <signal handler called>
#8  0xffffffff80c4bff0 in semu_alloc (td=<optimized out>) at /usr/src/sys/kern/sysv_sem.c:420
#9  semundo_adjust (td=0xfffff8000fe17000, supptr=0xfffffe00136908e0, semid=1, semseq=1, semnum=0, adjval=1)
    at /usr/src/sys/kern/sysv_sem.c:468
#10 0xffffffff80c4bb99 in sys_semop (td=0xfffff8000fe17000, uap=<optimized out>)
    at /usr/src/sys/kern/sysv_sem.c:1337
#11 0xffffffff810a88e4 in syscallenter (td=0xfffff8000fe17000)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#12 amd64_syscall (td=0xfffff8000fe17000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1186
#13 <signal handler called>

In case it is worth mentioning, my /boot/loader.conf contains:
kern.ipc.semopm=300
kern.ipc.semume=500

The system is running in a VMware instance, in case this is relevant.
Comment 1 Konstantin Belousov freebsd_committer 2020-10-15 11:48:19 UTC
Can you provide a minimal reproducer for the issue?
Comment 2 Olef 2020-10-16 09:11:39 UTC
Hi,

I've tried to create something that replicates the behavior, but unfortunately I haven't had any luck with it.

What happens internally is that a main process forks off roughly 50 smaller processes that need to do system maintenance on 50 database files, and the fault occurs when they attach to the shared memory.
Comment 3 Konstantin Belousov freebsd_committer 2020-10-16 10:00:19 UTC
Without a reproducer I cannot say anything.  Perhaps try 12.2; there was a fix that might be relevant: r358242, the MFC of r357984.
Comment 4 Olef 2020-10-16 12:05:56 UTC
Fair enough.

Is there a way to find out what the calling process was actually doing to cause this?
kgdb only gives me the kernel fault, but doesn't give me anything on the state of the calling process.
Comment 5 Konstantin Belousov freebsd_committer 2020-10-16 12:24:39 UTC
You can try something such as ktrace, but this would be hard because the
system panics and the records are not written.  A sync NFS mount from another
machine might help, but I do not expect it to.

So are you able to reproduce it at will, even with a complex scenario?
Try 12.2 or HEAD; you can install only the kernel.  Perhaps enable INVARIANTS
when doing so.
Comment 6 Olef 2020-10-16 12:40:47 UTC
I can reproduce it with relative ease; I'm on 7 vmcore files so far.

Updating now to 12.2-RC2
Comment 7 Konstantin Belousov freebsd_committer 2020-10-17 09:08:34 UTC
I suspect I have figured it out; please try the patch from
https://reviews.freebsd.org/D26826

That said, I am curious why you need to adjust semume.
Comment 8 Olef 2020-10-19 09:29:30 UTC
Hi,

Thanks. I did not get the panic this morning after upgrading to 12.2, but I will check again tomorrow. If the fault persists, I'll apply your patch.

I needed to increase SEMUME because some processes were complaining that they could not semget (EINVAL).
Comment 9 Konstantin Belousov freebsd_committer 2020-10-19 10:59:37 UTC
(In reply to Olef from comment #8)
I am quite sure that the issue I described in the review exists, and since it
is a memory-corruption kind of bug, when and how it manifests itself is quite
specific to the kernel, machine, and load.  I suggest you add the patch to your
kernel and manually retry the procedure that caused the panic several times.
Comment 10 Olef 2020-10-19 12:50:30 UTC
I will. Would it also manifest itself in 12.2, or shall I create a new VM?
Comment 11 Konstantin Belousov freebsd_committer 2020-10-19 13:13:21 UTC
(In reply to Olef from comment #10)
The issue that the patch fixes is present in HEAD, stable/12, and all 12.x
releases.  But since it is memory corruption, its specific manifestation can be
arbitrary; for instance, you might get data corruption instead of a panic.

Yes, you can test with a 12.2 VM.
Comment 12 Olef 2020-10-22 07:12:21 UTC
Hi,

So, in 12.2-RC2 I did indeed still get the kernel panics after the initial upgrade.
After patching the kernel I have not seen them again in the last 3 days, so all seems to work fine.

Thanks for your help!

PS: Would increasing SEMUSZ also have done the trick?
Comment 13 Konstantin Belousov freebsd_committer 2020-10-22 09:27:35 UTC
(In reply to Olef from comment #12)
Yes, increasing kern.ipc.semusz would also help, but you need to calculate
carefully how large to set it.  For instance, it is arch-dependent.
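
The dependency between the two tunables can be sketched in C as follows. This is only an illustration, not the kernel source: the struct layouts and the names undo_sketch, sem_undo_sketch, round_up_to_long, and semusz_for are assumptions made for the example. It shows why the size of a per-process undo structure has to grow with the number of undo entries (semume), and why the result depends on the architecture through pointer size and alignment.

```c
#include <stddef.h>

/* Hypothetical stand-in for one undo entry; field names are assumptions. */
struct undo_sketch {
    short          adjval;  /* adjustment to apply on process exit */
    short          num;     /* semaphore number within the set */
    int            id;      /* semaphore set identifier */
    unsigned short seq;     /* sequence number of the set */
};

/* Hypothetical stand-in for the per-process undo structure. */
struct sem_undo_sketch {
    void  *next;            /* list linkage (pointer-sized, arch-dependent) */
    void **prev;
    void  *proc;            /* owning process */
    short  cnt;             /* number of entries in use */
    struct undo_sketch ent[1]; /* really holds up to semume entries */
};

/* Round a byte count up to a multiple of sizeof(long). */
static size_t round_up_to_long(size_t bytes)
{
    return (bytes + sizeof(long) - 1) / sizeof(long) * sizeof(long);
}

/*
 * Bytes needed for one undo structure that can hold `semume` entries:
 * the fixed header plus `semume` entries, rounded up for alignment.
 */
size_t semusz_for(size_t semume)
{
    size_t raw = offsetof(struct sem_undo_sketch, ent) +
        semume * sizeof(struct undo_sketch);

    return round_up_to_long(raw);
}
```

Under these assumptions, semusz_for(500) is much larger than semusz_for(10), so a semusz value sized for a small semume cannot hold the entries allowed after kern.ipc.semume=500 is set, and writes past the end of the structure would corrupt neighbouring memory.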
Comment 14 commit-hook freebsd_committer 2020-10-22 09:28:26 UTC
A commit references this bug:

Author: kib
Date: Thu Oct 22 09:28:12 UTC 2020
New revision: 366932
URL: https://svnweb.freebsd.org/changeset/base/366932

Log:
  sysv_sem: semusz depends on semume.

  Size of the per-process semaphore undo structure (semusz) depends on
  the number of the per-process undos.  If kern.ipc.semume is adjusted,
  semusz must be adjusted as well, and it makes no sense to delegate
  adjustment to user.  Make it automatic.

  Reported and tested by:	Olef <o.vandestadt@gmail.com>
  PR:	250361
  Reviewed by:	jhb, markj
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week
  Differential revision:	https://reviews.freebsd.org/D26826

Changes:
  head/sys/kern/sysv_sem.c
Comment 15 commit-hook freebsd_committer 2020-10-29 11:09:57 UTC
A commit references this bug:

Author: kib
Date: Thu Oct 29 11:09:48 UTC 2020
New revision: 367128
URL: https://svnweb.freebsd.org/changeset/base/367128

Log:
  MFC r366932:
  sysv_sem: semusz depends on semume.

  PR:	250361

Changes:
_U  stable/12/
  stable/12/sys/kern/sysv_sem.c
Comment 16 commit-hook freebsd_committer 2020-10-29 11:19:59 UTC
A commit references this bug:

Author: kib
Date: Thu Oct 29 11:19:48 UTC 2020
New revision: 367129
URL: https://svnweb.freebsd.org/changeset/base/367129

Log:
  MFC r366932:
  sysv_sem: semusz depends on semume.

  PR:	250361

Changes:
_U  stable/11/
  stable/11/sys/kern/sysv_sem.c
Comment 17 Olef 2020-10-29 11:24:53 UTC
FYI,

I was able to reproduce this problem on an Intel NUC; applying the patch solved it.