Bug 264273

Summary: www/nginx: 13.0-p11 crashes after 12 > 13 upgrade: m_pullup -> ipfw_chk -> ipfw_check_frame -> ... -> vn_sendfile
Product: Base System Reporter: BB Lister <bblister>
Component: kern Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Overcome By Events    
Severity: Affects Only Me CC: bblister, chris, cy, joneum, lwhsu, rob2g2-freebsd, zlei
Priority: --- Keywords: crash, needs-qa
Version: 13.0-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
the pkg info of the nginx that is causing kernel panics none

Description BB Lister 2022-05-27 05:26:36 UTC
I performed a binary upgrade from 12 to 13 eight hours ago. I went to sleep, and in the morning I saw that 2 kernel panics had occurred.

Both traces indicate that it all started from nginx, specifically from vn_sendfile.

nginx is installed from binary packages.

This is a production server running a couple of websites inside a cloud VPS.
It ran 12.2-RELEASE-p14 rock solid, without any reboot or panic. The configuration has not changed.

I have now issued sysctl net.inet.tcp.tso=0
to see if it makes the server more reliable.
If the kernel still panics, I will have to disable sendfile in nginx.

I cannot switch to stable and perform kernel patches, because this is in production. I can issue commands or change configuration or display specific config files. In



>uname -a
FreeBSD arch.ece.uowm.gr 13.0-RELEASE-p11 FreeBSD 13.0-RELEASE-p11 #0: Tue Apr  5 18:54:35 UTC 2022     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64


Kernel panics:

May 27 04:13:49 arch kernel: Fatal trap 12: page fault while in kernel mode
May 27 04:13:49 arch kernel: cpuid = 2; apic id = 02
May 27 04:13:49 arch kernel: fault virtual address      = 0x58
May 27 04:13:49 arch kernel: fault code         = supervisor read data, page not present
May 27 04:13:49 arch kernel: instruction pointer        = 0x20:0xffffffff81086c80
May 27 04:13:49 arch kernel: stack pointer              = 0x28:0xfffffe0073dedf50
May 27 04:13:49 arch kernel: frame pointer              = 0x28:0xfffffe0073dedf50
May 27 04:13:49 arch kernel: code segment               = base rx0, limit 0xfffff, type 0x1b
May 27 04:13:49 arch kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
May 27 04:13:49 arch kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
May 27 04:13:49 arch kernel: current process            = 65527 (nginx)
May 27 04:13:49 arch kernel: trap number                = 12
May 27 04:13:49 arch kernel: panic: page fault
May 27 04:13:49 arch kernel: cpuid = 2
May 27 04:13:49 arch kernel: time = 1653612365
May 27 04:13:49 arch kernel: KDB: stack backtrace:
May 27 04:13:49 arch kernel: #0 0xffffffff80c57535 at kdb_backtrace+0x65
May 27 04:13:49 arch kernel: #1 0xffffffff80c09f11 at vpanic+0x181
May 27 04:13:49 arch kernel: #2 0xffffffff80c09d83 at panic+0x43
May 27 04:13:49 arch kernel: #3 0xffffffff8108b1a7 at trap_fatal+0x387
May 27 04:13:49 arch kernel: #4 0xffffffff8108b1ff at trap_pfault+0x4f
May 27 04:13:49 arch kernel: #5 0xffffffff8108a85d at trap+0x27d
May 27 04:13:49 arch kernel: #6 0xffffffff81061f08 at calltrap+0x8
May 27 04:13:49 arch kernel: #7 0xffffffff80c9c38f at m_pullup+0x1af
May 27 04:13:49 arch kernel: #8 0xffffffff821162bf at ipfw_chk+0x3fcf
May 27 04:13:49 arch kernel: #9 0xffffffff8211805c at ipfw_check_frame+0x13c
May 27 04:13:49 arch kernel: #10 0xffffffff80d422c7 at pfil_run_hooks+0x97
May 27 04:13:49 arch kernel: #11 0xffffffff80d23bf4 at ether_output_frame+0x94
May 27 04:13:49 arch kernel: #12 0xffffffff80d23b08 at ether_output+0x6b8
May 27 04:13:49 arch kernel: #13 0xffffffff80db3ca5 at ip_output_send+0x75
May 27 04:13:49 arch kernel: #14 0xffffffff80db3ac2 at ip_output+0x12b2
May 27 04:13:49 arch kernel: #15 0xffffffff80dc9ab4 at tcp_output+0x1b04
May 27 04:13:49 arch kernel: #16 0xffffffff80ddb189 at tcp_usr_send+0x229
May 27 04:13:49 arch kernel: #17 0xffffffff80c07e2a at vn_sendfile+0x197a
May 27 04:13:49 arch kernel: Uptime: 1h42m58s







--------------------Another Panic--------------------

May 27 05:09:25 arch kernel: Fatal trap 12: page fault while in kernel mode
May 27 05:09:25 arch kernel: cpuid = 0; apic id = 00
May 27 05:09:25 arch kernel: fault virtual address      = 0x580
May 27 05:09:25 arch kernel: fault code         = supervisor read data, page not present
May 27 05:09:25 arch kernel: instruction pointer        = 0x20:0xffffffff81086c80
May 27 05:09:25 arch kernel: stack pointer              = 0x28:0xfffffe00bb153f50
May 27 05:09:25 arch kernel: frame pointer              = 0x28:0xfffffe00bb153f50
May 27 05:09:25 arch kernel: code segment               = base rx0, limit 0xfffff, type 0x1b
May 27 05:09:25 arch kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
May 27 05:09:25 arch kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
May 27 05:09:25 arch kernel: current process            = 97681 (nginx)
May 27 05:09:25 arch kernel: trap number                = 12
May 27 05:09:25 arch kernel: panic: page fault
May 27 05:09:25 arch kernel: cpuid = 0
May 27 05:09:25 arch kernel: time = 1653615832
May 27 05:09:25 arch kernel: KDB: stack backtrace:
May 27 05:09:25 arch kernel: #0 0xffffffff80c57535 at kdb_backtrace+0x65
May 27 05:09:25 arch kernel: #1 0xffffffff80c09f11 at vpanic+0x181
May 27 05:09:25 arch kernel: #2 0xffffffff80c09d83 at panic+0x43
May 27 05:09:25 arch kernel: #3 0xffffffff8108b1a7 at trap_fatal+0x387
May 27 05:09:25 arch kernel: #4 0xffffffff8108b1ff at trap_pfault+0x4f
May 27 05:09:25 arch kernel: #5 0xffffffff8108a85d at trap+0x27d
May 27 05:09:25 arch kernel: #6 0xffffffff81061f08 at calltrap+0x8
May 27 05:09:25 arch kernel: #7 0xffffffff80c9c38f at m_pullup+0x1af
May 27 05:09:25 arch kernel: #8 0xffffffff8211f2bf at ipfw_chk+0x3fcf
May 27 05:09:25 arch kernel: #9 0xffffffff8212105c at ipfw_check_frame+0x13c
May 27 05:09:25 arch kernel: #10 0xffffffff80d422c7 at pfil_run_hooks+0x97
May 27 05:09:25 arch kernel: #11 0xffffffff80d23bf4 at ether_output_frame+0x94
May 27 05:09:25 arch kernel: #12 0xffffffff80d23b08 at ether_output+0x6b8
May 27 05:09:25 arch kernel: #13 0xffffffff80db3ca5 at ip_output_send+0x75
May 27 05:09:25 arch kernel: #14 0xffffffff80db3ac2 at ip_output+0x12b2
May 27 05:09:25 arch kernel: #15 0xffffffff80dc9ab4 at tcp_output+0x1b04
May 27 05:09:25 arch kernel: #16 0xffffffff80ddb189 at tcp_usr_send+0x229
May 27 05:09:25 arch kernel: #17 0xffffffff80c07e2a at vn_sendfile+0x197a
May 27 05:09:25 arch kernel: Uptime: 56m29s
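
A minimal sketch for reading the fault addresses in the two panics above (0x58 and 0x580). Both fall inside the first page of the address space, which is normally left unmapped, so a read there is often a small offset from a NULL pointer. That interpretation is an assumption, not a diagnosis. PAGE_SIZE and the helper names are illustrative and not part of any FreeBSD tool.

```python
import re

# Assumption: 4 KiB base pages, as on amd64.
PAGE_SIZE = 4096

# Matches the "fault virtual address = 0x..." line of a trap message.
FAULT_RE = re.compile(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)")


def fault_address(panic_text):
    """Return the fault address as an int, or None if the line is absent."""
    match = FAULT_RE.search(panic_text)
    return int(match.group(1), 16) if match else None


def in_first_page(addr):
    """True when the address lies in the first page (near-NULL access)."""
    return 0 <= addr < PAGE_SIZE


# Lines copied from the two panics above.
for line in (
    "fault virtual address      = 0x58",
    "fault virtual address      = 0x580",
):
    addr = fault_address(line)
    print(hex(addr), "in first page:", in_first_page(addr))
```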
Comment 1 BB Lister 2022-05-27 08:31:43 UTC
Setting sysctl net.inet.tcp.tso=0 did not help. Another panic (slightly different) occurred, as shown below.
I reverted the change with
sysctl net.inet.tcp.tso=1
and have now disabled sendfile in nginx (sendfile off; in the http{} block).

The new panic message, with net.inet.tcp.tso=0 and sendfile enabled, is:

May 27 11:12:33 arch kernel: Fatal trap 12: page fault while in kernel mode
May 27 11:12:33 arch kernel: cpuid = 0; apic id = 00
May 27 11:12:33 arch kernel: fault virtual address      = 0x148
May 27 11:12:33 arch kernel: fault code         = supervisor read data, page not present
May 27 11:12:33 arch kernel: instruction pointer        = 0x20:0xffffffff81086c80
May 27 11:12:33 arch kernel: stack pointer              = 0x28:0xfffffe0073be2060
May 27 11:12:33 arch kernel: frame pointer              = 0x28:0xfffffe0073be2060
May 27 11:12:33 arch kernel: code segment               = base rx0, limit 0xfffff, type 0x1b
May 27 11:12:33 arch kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
May 27 11:12:33 arch kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
May 27 11:12:33 arch kernel: current process            = 12 (swi1: netisr 0)
May 27 11:12:33 arch kernel: trap number                = 12
May 27 11:12:33 arch kernel: panic: page fault
May 27 11:12:33 arch kernel: cpuid = 0
May 27 11:12:33 arch kernel: time = 1653634930
May 27 11:12:33 arch kernel: KDB: stack backtrace:
May 27 11:12:33 arch kernel: #0 0xffffffff80c57535 at kdb_backtrace+0x65
May 27 11:12:33 arch kernel: #1 0xffffffff80c09f11 at vpanic+0x181
May 27 11:12:33 arch kernel: #2 0xffffffff80c09d83 at panic+0x43
May 27 11:12:33 arch kernel: #3 0xffffffff8108b1a7 at trap_fatal+0x387
May 27 11:12:33 arch kernel: #4 0xffffffff8108b1ff at trap_pfault+0x4f
May 27 11:12:33 arch kernel: #5 0xffffffff8108a85d at trap+0x27d
May 27 11:12:33 arch kernel: #6 0xffffffff81061f08 at calltrap+0x8
May 27 11:12:33 arch kernel: #7 0xffffffff80c9c38f at m_pullup+0x1af
May 27 11:12:33 arch kernel: #8 0xffffffff821172bf at ipfw_chk+0x3fcf
May 27 11:12:33 arch kernel: #9 0xffffffff8211905c at ipfw_check_frame+0x13c
May 27 11:12:33 arch kernel: #10 0xffffffff80d422c7 at pfil_run_hooks+0x97
May 27 11:12:33 arch kernel: #11 0xffffffff80d23bf4 at ether_output_frame+0x94
May 27 11:12:33 arch kernel: #12 0xffffffff80d23b08 at ether_output+0x6b8
May 27 11:12:33 arch kernel: #13 0xffffffff80db3ca5 at ip_output_send+0x75
May 27 11:12:33 arch kernel: #14 0xffffffff80db3ac2 at ip_output+0x12b2
May 27 11:12:33 arch kernel: #15 0xffffffff80dc9ab4 at tcp_output+0x1b04
May 27 11:12:33 arch kernel: #16 0xffffffff80dc127b at tcp_do_segment+0x2c9b
May 27 11:12:33 arch kernel: #17 0xffffffff80dbd81e at tcp_input+0xabe
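
A minimal sketch for comparing the backtraces above. It extracts the function names from frames #7 to #17 of an nginx-context panic and of the netisr-context panic, then prints the leading frames they share. The helper names are illustrative; the frame lines are copied from the traces in this report.

```python
import re

# Matches frames such as "#7 0xffffffff80c9c38f at m_pullup+0x1af".
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+at\s+(\w+)\+0x[0-9a-f]+")


def frame_names(trace_text):
    """Return function names in the order the frames appear in the text."""
    return [m.group(2) for m in FRAME_RE.finditer(trace_text)]


def common_prefix(a, b):
    """Return the leading frames shared by two backtraces."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Frames #7..#17 from the panic with current process = nginx.
NGINX_TRACE = """
#7 0xffffffff80c9c38f at m_pullup+0x1af
#8 0xffffffff821162bf at ipfw_chk+0x3fcf
#9 0xffffffff8211805c at ipfw_check_frame+0x13c
#10 0xffffffff80d422c7 at pfil_run_hooks+0x97
#11 0xffffffff80d23bf4 at ether_output_frame+0x94
#12 0xffffffff80d23b08 at ether_output+0x6b8
#13 0xffffffff80db3ca5 at ip_output_send+0x75
#14 0xffffffff80db3ac2 at ip_output+0x12b2
#15 0xffffffff80dc9ab4 at tcp_output+0x1b04
#16 0xffffffff80ddb189 at tcp_usr_send+0x229
#17 0xffffffff80c07e2a at vn_sendfile+0x197a
"""

# Frames #7..#17 from the panic with current process = swi1: netisr 0.
NETISR_TRACE = """
#7 0xffffffff80c9c38f at m_pullup+0x1af
#8 0xffffffff821172bf at ipfw_chk+0x3fcf
#9 0xffffffff8211905c at ipfw_check_frame+0x13c
#10 0xffffffff80d422c7 at pfil_run_hooks+0x97
#11 0xffffffff80d23bf4 at ether_output_frame+0x94
#12 0xffffffff80d23b08 at ether_output+0x6b8
#13 0xffffffff80db3ca5 at ip_output_send+0x75
#14 0xffffffff80db3ac2 at ip_output+0x12b2
#15 0xffffffff80dc9ab4 at tcp_output+0x1b04
#16 0xffffffff80dc127b at tcp_do_segment+0x2c9b
#17 0xffffffff80dbd81e at tcp_input+0xabe
"""

shared = common_prefix(frame_names(NGINX_TRACE), frame_names(NETISR_TRACE))
print(" <- ".join(shared))
```

Both paths share every frame from m_pullup up to tcp_output and differ only in what called tcp_output (tcp_usr_send from vn_sendfile in one case, tcp_do_segment from tcp_input in the other).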
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-28 00:50:41 UTC
@Reporter Can you confirm:

- For nginx, what is the package version? Please include `pkg info nginx` output as an attachment
- For the upgrade, were packages updated after base upgrade, or left with the same version?
- Is the panic reproducible without ipfw enabled?

If you are able to enable kernel crash processing, that would be great: 

  https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/
Comment 3 BB Lister 2022-05-28 05:24:16 UTC
After I disabled sendfile in nginx, the machine has been stable for 24 hours without any panic.

After the binary upgrade of the base system, I performed a binary upgrade of the packages using:
pkg bootstrap -f -y
pkg-static upgrade -f -y


Concerning the version of nginx:

#pkg info | grep nginx
nginx-1.20.2_9,2               Robust and small WWW server

I will upload the nginx pkg info as an attachment.

I cannot disable ipfw, because this is a production server in the cloud and I won't leave it without a firewall. If there is a tool that automatically converts the ipfw rules (over 1500) to pf or another firewall, I could do it.

I also enabled

dumpdev=AUTO in rc.conf

Is this enough for the RELEASE kernel that I am using?

Should I turn sendfile back on in nginx to trigger a panic?
Comment 4 BB Lister 2022-05-28 05:25:20 UTC
Created attachment 234282 [details]
the pkg info of the nginx that is causing kernel panics

pkg info nginx
Comment 5 Jochen Neumeister freebsd_committer freebsd_triage 2022-05-28 08:43:16 UTC
After upgrading to another release, it is necessary to rebuild all ports: https://docs.freebsd.org/en/books/handbook/cutting-edge/#updating-upgrading-freebsdupdate
See point 24.2.3.2.

Nginx is now at version 1.22.x in the ports tree.
Comment 6 BB Lister 2022-05-28 22:14:20 UTC
I only use binary packages.
I performed the binary upgrade of the packages using:

pkg bootstrap -f -y
pkg-static upgrade -f -y

#more /etc/pkg/FreeBSD.conf
FreeBSD: {
  url: "pkg+http://pkg.FreeBSD.org/${ABI}/quarterly",
  mirror_type: "srv",
  signature_type: "fingerprints",
  fingerprints: "/usr/share/keys/pkg",
  enabled: yes
}


Perhaps the nginx binary package has not been updated in http://pkg.FreeBSD.org/${ABI}/quarterly
I will change this line to latest and upgrade the packages again.


Nevertheless, I find it strange for a userspace application to cause a kernel panic, even when running as root. I believe this application only makes system calls into the kernel, and under normal circumstances that should not cause any kernel panic.
Comment 7 BB Lister 2022-05-30 05:51:29 UTC
I enabled sendfile once more (sendfile on;) and within 1 hour the server panicked with:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x580
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81086c80
stack pointer           = 0x28:0xfffffe00bc3dff50
frame pointer           = 0x28:0xfffffe00bc3dff50
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 67343 (nginx)
trap number             = 12
panic: page fault
cpuid = 2
time = 1653866395
KDB: stack backtrace:
#0 0xffffffff80c57535 at kdb_backtrace+0x65
#1 0xffffffff80c09f11 at vpanic+0x181
#2 0xffffffff80c09d83 at panic+0x43
#3 0xffffffff8108b1a7 at trap_fatal+0x387
#4 0xffffffff8108b1ff at trap_pfault+0x4f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061f08 at calltrap+0x8
#7 0xffffffff80c9c38f at m_pullup+0x1af
#8 0xffffffff8211f2bf at ipfw_chk+0x3fcf
#9 0xffffffff8212105c at ipfw_check_frame+0x13c
#10 0xffffffff80d422c7 at pfil_run_hooks+0x97
#11 0xffffffff80d23bf4 at ether_output_frame+0x94
#12 0xffffffff80d23b08 at ether_output+0x6b8
#13 0xffffffff80db3ca5 at ip_output_send+0x75
#14 0xffffffff80db3ac2 at ip_output+0x12b2
#15 0xffffffff80dc9ab4 at tcp_output+0x1b04
#16 0xffffffff80ddb189 at tcp_usr_send+0x229
#17 0xffffffff80c07e2a at vn_sendfile+0x197a




This time I had dumpdev enabled, and I got a vmcore.0 file in /var/crash.
This file seems to contain information about my system, so I cannot upload it, but I can send it by email to any developer.

I installed devel/gdb to use kgdb, but I got an internal error in kgdb, which then offered to create a core file.


# kgdb /boot/kernel/kernel /var/crash/vmcore.0
....
Reading symbols from /boot/kernel/kernel...
(No debugging symbols found in /boot/kernel/kernel)
/wrkdirs/usr/ports/devel/gdb/work-py38/gdb-12.1/gdb/thread.c:1328: internal-error: switch_to_thread: Assertion `thr != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)




/wrkdirs/usr/ports/devel/gdb/work-py38/gdb-12.1/gdb/thread.c:1328: internal-error: switch_to_thread: Assertion `thr != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n)
n
Command aborted.
(kgdb) bt
No thread selected.
(kgdb) info threads
No threads.
(kgdb)

I figured out that I should be using a different kernel file.

I downloaded the packaged kernel-dbg.txz for 13.0 and extracted it to /tmp

kgdb /tmp/kernel/kernel.debug /var/crash/vmcore.0

but now I got

Reading symbols from /tmp/kernel/kernel.debug...
Failed to open vmcore: not a minidump for this platform
(kgdb)


I assume this means that I would have to boot with kernel.debug in order to debug this, which is difficult at the moment.

In conclusion: whenever I enable sendfile on; in nginx, I get a kernel panic within 1 hour on FreeBSD 13.0-RELEASE-p11.
Comment 8 Jochen Neumeister freebsd_committer freebsd_triage 2022-05-30 07:29:57 UTC
I think I'm the wrong person. The kernel problems don't really have anything to do with NGINX.
I'm afraid I won't be able to help you there.
Comment 9 Li-Wen Hsu freebsd_committer freebsd_triage 2022-06-08 10:56:51 UTC
(In reply to BB Lister from comment #7)
Hi, I moved this ticket to base/kern in the hope that more people can join the debugging.
In the meantime, can you check the status with 13.1-RELEASE?
Comment 10 BB Lister 2022-06-14 09:38:05 UTC
I binary upgraded the machine to:

13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC amd64
and enabled
sendfile on;
in nginx.

The system has been stable for 50 hours, with no reboots. On FreeBSD 13.0 I had a reboot within hours, but on 13.1 the server seems stable. I will keep the sendfile option on and report anything unusual.