Bug 220830 - STABLE/11 (11.1-RC3, with ZFS bootstrap loader r1.1) kernel (GENERIC) panic; integer divide fault, trap number 18, current process 16 (pf purge)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Kristof Provost
Reported: 2017-07-18 13:58 UTC by Vinícius Zavam
Modified: 2017-08-25 08:36 UTC

Attachments
freebsd 11.1-rc3 (generic) with zfs bootstrap loader, kernel panic. pf purge. (21.67 KB, image/png)
2017-07-18 13:58 UTC, Vinícius Zavam
[GIF] freebsd 11.1-rc3 (generic) with zfs bootstrap loader, kernel panic. pf purge. (670.98 KB, image/gif)
2017-07-19 12:02 UTC, Vinícius Zavam

Description Vinícius Zavam freebsd_committer freebsd_triage 2017-07-18 13:58:44 UTC
Created attachment 184470
freebsd 11.1-rc3 (generic) with zfs bootstrap loader, kernel panic. pf purge.

how to reproduce?

  1) get a FreeBSD 11.1-RC3/amd64 install ISO;
  2) install it following its defaults;
        use "Auto ZFS" for disk partitions;
  3) boot into your new FreeBSD box;
  4) as root, run "sysrc -f /boot/loader.conf pf_load=YES";
  5) reboot.

it might be the case that previous revisions are also affected. first noticed this behavior on r320730. kernel panic reproduced in a virtual environment, and on bare metal.

  *) changing "boot options" to enable verbose (option number 6) is a workaround, and the machine boots without panic.

I'm building a DEBUG kernel so I can attach more details ASAP; if anyone else can build one, reproduce the panic, and add more data, feel free.
Comment 1 Olivier Cochard freebsd_committer freebsd_triage 2017-07-18 15:04:03 UTC
I've installed the official 11.1-RC3 (https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/11.1/FreeBSD-11.1-RC3-amd64-disc1.iso) into a VirtualBox machine (64-bit, 2 GB RAM, 16 GB disk) and followed your instructions, but I was unable to reproduce the problem (I rebooted 5 times).

Unrelated: why did you load pf at the /boot/loader.conf stage?
The rc script /etc/rc.d/pf already loads pf automatically.
Comment 2 Kristof Provost freebsd_committer freebsd_triage 2017-07-18 15:34:21 UTC
Same here, unable to reproduce with bhyve (amd64, Root on ZFS, 2 GB memory).
Comment 3 Vinícius Zavam freebsd_committer freebsd_triage 2017-07-19 09:34:12 UTC
kristof, olivier,

thank you for writing back (:

I was also unable to reproduce the panic with an 11.1-RC3 guest here on my workstation under bhyve, qemu, or even virtualbox. my host OS is FreeBSD 12.0-CURRENT/amd64 (r320640, 1200037).

the "main" hypervisor I used, and the one that gave me the kernel panic, is a linux-based OS with kvm: proxmox 5.0-23/af4267bf (debian 6.3.0-18, kernel 4.10.15-1-pve).

I built a DEBUG kernel and just tested it; 11.1-PRERELEASE (r320976+) worked just fine. the sad thing is that I completely forgot to remove pf from the KERNCONF after adding the DEBUG options. so I'm rebuilding it and will test again, following the very same procedure I reported before.

olivier,

the reason I am using loader.conf is this: with rc.conf, the rc.d/pf script doesn't load the module if there's no pf.conf available, or it loads the module quite "late". so I was just making sure that pf.ko would be loaded no matter what.
Comment 4 Paul Herman 2017-07-19 10:12:49 UTC
Hi,

I can reproduce this. This is a divide by zero in pf_purge_thread() in sys/netpfil/pf/pf.c: https://svnweb.freebsd.org/base/stable/11/sys/netpfil/pf/pf.c?annotate=316640#l1446 The V_pf_default_rule.timeout array hasn't been initialized yet.

This presumably happens because pf_load() is getting called before pfattach_vnet() (i.e. pf_load_vnet()). Does anyone know what determines the order, or how to enforce this type of "dependency"?

It smells like a race condition, which could be why not everyone can reproduce it, but on my hardware it's 100%. I also suspect this has to do with EARLY_AP_STARTUP, because I have yet to see the same panic with EARLY_AP_STARTUP disabled. Still testing.
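
To make the failure mode concrete, here is a minimal userspace sketch of the arithmetic involved. It is not the kernel code: the struct, the index constant, and purge_slice() are hypothetical stand-ins, and the divisor shape (interval times ten) is an assumption based on reading the purge loop at the linked line. The point is only that a timeout table still holding zeros makes the divisor zero, which on amd64 shows up as an integer divide fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-vnet default rule's timeout table. */
enum { MODEL_PFTM_INTERVAL = 0, MODEL_PFTM_MAX = 1 };

struct default_rule_model {
	unsigned int timeout[MODEL_PFTM_MAX];	/* all zero until attach runs */
};

/*
 * Number of hash rows one purge pass would scan, assuming the divisor
 * is the interval multiplied by ten.
 * Returns -1 instead of dividing when the interval is still zero.
 */
static long
purge_slice(unsigned long hashmask, const struct default_rule_model *r)
{
	unsigned int divisor;

	assert(r != NULL);
	divisor = r->timeout[MODEL_PFTM_INTERVAL] * 10;
	if (divisor == 0)
		return (-1);	/* an unguarded division here would trap */
	return ((long)(hashmask / divisor));
}
```

Calling purge_slice() on a zero-filled struct returns -1, while the same call after setting the interval to a non-zero value returns a normal row count. That matches the picture of a thread that is harmless once initialization has finished and fatal if it runs before.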

BTW, my kernel config is simple:

include         GENERIC

device         carp
device         pf
#nooptions     EARLY_AP_STARTUP
Comment 5 Vinícius Zavam freebsd_committer freebsd_triage 2017-07-19 12:02:57 UTC
Created attachment 184503
[GIF] freebsd 11.1-rc3 (generic) with zfs bootstrap loader, kernel panic. pf purge.
Comment 6 Kristof Provost freebsd_committer freebsd_triage 2017-07-19 18:42:44 UTC
(In reply to Paul Herman from comment #4)
I'm not sure how that would happen, but loos@ committed something that'll likely fix it to head a while back:

------------------------------------------------------------------------
r312943 | loos | 2017-01-29 03:17:52 +0100 (Sun, 29 Jan 2017) | 8 lines

Do not run the pf purge thread while the VNET variables are not
initialized, this can cause a divide by zero (if the VNET initialization
takes to long to complete).

Obtained from:  pfSense
MFC after:      2 weeks
Sponsored by:   Rubicon Communications, LLC (Netgate)

We probably want to MFC that one. Can you confirm that fixes the problem for you?
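
As a rough illustration of the approach the commit log describes (skip purge work until the per-vnet state is initialized), the following userspace sketch models one iteration of a periodic purge loop. It is not the text of r312943: vnet_model, its active flag, purge_stats, and purge_tick() are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of per-vnet pf state. */
struct vnet_model {
	bool active;		/* set once initialization has completed */
	unsigned int interval;	/* purge interval; zero until then */
};

/* Counters so the effect of the guard can be observed. */
struct purge_stats {
	unsigned int passes_skipped;
	unsigned int passes_run;
	unsigned long rows_scanned;
};

/*
 * One iteration of the purge loop. The guard comes first, so the
 * division below is never reached with a zero interval.
 */
static void
purge_tick(const struct vnet_model *v, unsigned long hashmask,
    struct purge_stats *st)
{
	assert(v != NULL && st != NULL);
	if (!v->active || v->interval == 0) {
		st->passes_skipped++;	/* not ready yet: try again next tick */
		return;
	}
	st->rows_scanned += hashmask / (v->interval * 10);
	st->passes_run++;
}
```

In this model, ticks that arrive before initialization only increment passes_skipped; once active is set and interval is non-zero, the same call does real work. Skipping rather than blocking keeps the loop simple, at the cost of a few wasted wakeups during early boot.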
Comment 7 Paul Herman 2017-07-20 07:54:14 UTC
Nice catch.  Yep, applied r312943 to pf.c, panic gone.
Comment 8 Vinícius Zavam freebsd_committer freebsd_triage 2017-07-20 08:55:03 UTC
I also confirm that the panic is gone. Thanks!

The machine I patched has the very same config and runs as a guest OS on the very same hypervisor as the one that crashed before.

BTW, I used the 'src.txz' shipped with 11.1-RC3 (r320976) to apply the patch (https://svnweb.freebsd.org/base/head/sys/netpfil/pf/pf.c?view=patch&r1=312943&r2=312942&pathrev=312943), compile the kernel and install the patched version of it and its modules.

# date
Thu Jul 20 08:41:35 UTC 2017

# uname -a
FreeBSD freebsd11 11.1-RC3 FreeBSD 11.1-RC3 #0: Thu Jul 20 08:39:04 UTC 2017     root@freebsd11:/usr/obj/usr/src/sys/GENERIC  amd64

# diff /boot/kernel.generic/pf.ko /boot/kernel/pf.ko
Files /boot/kernel.generic/pf.ko and /boot/kernel/pf.ko differ

# grep pf /boot/loader.conf /etc/rc.conf
/boot/loader.conf:pf_load="YES"

# uptime
 8:41AM  up 1 min, 1 users, load averages: 0.15, 0.08, 0.03
Comment 9 commit-hook freebsd_committer freebsd_triage 2017-07-20 17:15:59 UTC
A commit references this bug:

Author: kp
Date: Thu Jul 20 17:15:19 UTC 2017
New revision: 321296
URL: https://svnweb.freebsd.org/changeset/base/321296

Log:
  MFC r312943

  Do not run the pf purge thread while the VNET variables are not
  initialized, this can cause a divide by zero (if the VNET initialization
  takes to long to complete).

  PR:		220830

Changes:
_U  stable/11/
  stable/11/sys/netpfil/pf/pf.c
Comment 10 Xin LI freebsd_committer freebsd_triage 2017-07-31 06:19:43 UTC
Adding re@.

The fix is now in stable/11 -- should we do an errata for this one?
Comment 11 Kristof Provost freebsd_committer freebsd_triage 2017-07-31 07:58:33 UTC
(In reply to Xin LI from comment #10)
It's on my todo list. Real soon now(tm)!
Comment 12 eborisch+FreeBSD 2017-08-10 13:21:13 UTC
For posterity, seconding Paul's assessment that this started tripping with the introduction of EARLY_AP_STARTUP. On this system I hit it (during testing) 100% of the time with EAP, and never without.

https://lists.freebsd.org/pipermail/freebsd-stable/2017-June/087245.html
Comment 13 Kristof Provost freebsd_committer freebsd_triage 2017-08-10 13:23:03 UTC
The errata notice has been sent so this can be closed.
Comment 14 Kris 2017-08-25 00:47:30 UTC
I faced this same problem with FreeBSD 11.1 Release and the fix here to sys/netpfil/pf/pf.c also stopped the panics on boot.

Should it not also be added to Release 11.1?

https://www.freebsd.org/releases/11.1R/errata.html
Comment 15 Kristof Provost freebsd_committer freebsd_triage 2017-08-25 08:36:19 UTC
(In reply to Kris from comment #14)
There was an errata notice for this problem: https://www.freebsd.org/security/advisories/FreeBSD-EN-17:08.pf.asc

I'm not quite sure why it's not listed in the 11.1 release errata, other than perhaps that it was not known/fixed at that point.