Bug 241133 - page fault in ng_l2tp_seq_rack_timeout
Summary: page fault in ng_l2tp_seq_rack_timeout
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: Gleb Smirnoff
Keywords: crash
Depends on:
Reported: 2019-10-08 15:53 UTC by zivillian
Modified: 2021-09-10 18:28 UTC
CC List: 5 users

See Also:

Attachments:
patched l2tp module (31.46 KB, application/x-object)
2020-10-14 15:54 UTC, zivillian

Description zivillian 2019-10-08 15:53:46 UTC
I'm using mpd5 to run an L2TP server on FreeBSD. This crash currently happens at least once every 24 hours.

I first noticed this on the client side, in pfSense (https://redmine.pfsense.org/issues/9058).

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address      = 0x1c
fault code         = supervisor read data, page not present
instruction pointer        = 0x20:0xffffffff80b79f76
stack pointer              = 0x28:0xfffffe0094a7ca60
frame pointer              = 0x28:0xfffffe0094a7caa0
code segment               = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process            = 369 (ng_queue1)
trap number                = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b3d517 at kdb_backtrace+0x67
#1 0xffffffff80af6ab7 at vpanic+0x177
#2 0xffffffff80af6933 at panic+0x43
#3 0xffffffff80f7827f at trap_fatal+0x35f
#4 0xffffffff80f782d9 at trap_pfault+0x49
#5 0xffffffff80f77aa7 at trap+0x2c7
#6 0xffffffff80f5803c at calltrap+0x8
#7 0xffffffff82835bf4 at ng_l2tp_seq_rack_timeout+0x164
#8 0xffffffff8282868d at ng_apply_item+0xcd
#9 0xffffffff8282b084 at ngthread+0x1a4
#10 0xffffffff80aba033 at fork_exit+0x83
#11 0xffffffff80f58f6e at fork_trampoline+0xe
Uptime: 12h56m8s
Dumping 1124 out of 1999 MB:..2%..12%..22%..32%..42%..52%..62%..72%..82%..92%
Dump complete
Comment 1 Mark Johnston freebsd_committer 2020-09-24 18:09:09 UTC
Seems likely this was fixed by https://svnweb.freebsd.org/changeset/base/353027
Comment 2 commit-hook freebsd_committer 2020-09-25 18:56:08 UTC
A commit references this bug:

Author: markj
Date: Fri Sep 25 18:55:50 UTC 2020
New revision: 366167
URL: https://svnweb.freebsd.org/changeset/base/366167

  ng_l2tp: Fix callout synchronization in the rexmit timeout handler

  A received control packet may cause the transmit queue to be flushed, in
  which case ng_l2tp_seq_recv_nr() cancels the transmit timeout handler.
  The handler checks to see if it was cancelled before doing anything, but
  did so before acquiring the node lock, so a small race window could
  cause ng_l2tp_seq_rack_timeout() to attempt to flush an empty queue,
  ultimately causing a null pointer dereference.
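The ordering problem described in this commit message can be illustrated with a small userspace model. This is a hypothetical sketch, not the ng_l2tp code: struct seq_state, seq_recv_nr(), seq_rack_timeout() and the rack_cancelled flag are made-up names, the pthread mutex stands in for the node/seq lock, and the cancellation is modeled as a plain flag (the real handler inspects the callout's own state instead). No threads are started; the sequential calls only show why the "was I cancelled?" test has to happen after the lock is taken.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the race fixed in r366167.
 * None of these names are real ng_l2tp structures or functions.
 */
struct pkt {
	int seq;
};

struct seq_state {
	pthread_mutex_t mtx;   /* stands in for the node/seq lock */
	struct pkt *xwin[4];   /* transmit window; xwin[0] == NULL means empty */
	int rack_cancelled;    /* set by the receive path while holding mtx */
};

void
seq_init(struct seq_state *s, struct pkt *p)
{
	pthread_mutex_init(&s->mtx, NULL);
	for (int i = 0; i < 4; i++)
		s->xwin[i] = NULL;
	s->xwin[0] = p;
	s->rack_cancelled = 0;
}

/* Receive path: the peer acked everything, so flush the queue and cancel the timer. */
void
seq_recv_nr(struct seq_state *s)
{
	pthread_mutex_lock(&s->mtx);
	for (int i = 0; i < 4; i++)
		s->xwin[i] = NULL;
	s->rack_cancelled = 1;
	pthread_mutex_unlock(&s->mtx);
}

/*
 * Fixed timeout handler: the cancellation check is made only while the
 * lock is held, so seq_recv_nr() cannot flush the queue between the
 * check and the use of xwin[0]. Checking before locking (the old
 * behaviour) leaves a window in which xwin[0] may already be NULL.
 * Returns the sequence number it would retransmit, or -1 if it did nothing.
 */
int
seq_rack_timeout(struct seq_state *s)
{
	int seq = -1;

	pthread_mutex_lock(&s->mtx);
	if (!s->rack_cancelled) {
		assert(s->xwin[0] != NULL);
		seq = s->xwin[0]->seq;
	}
	pthread_mutex_unlock(&s->mtx);
	return seq;
}
```

With the check under the lock, a timeout that fires after the queue was flushed simply returns without touching the empty window.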

  PR:		241133
  Reviewed by:	bz, glebius, Lutz Donnerhacke
  MFC after:	3 days
  Sponsored by:	Rubicon Communications, LLC (Netgate)
  Differential Revision:	https://reviews.freebsd.org/D26548

Comment 3 zivillian 2020-09-25 19:25:23 UTC
It's great to see progress on this. Can someone point me to a how-to or tutorial that explains what I need to do to apply this fix to my environment? I've hit this bug multiple times during the last month and want to verify that it fixes the crash.

I'm currently running FreeBSD 12.1-RELEASE-p8 GENERIC.
Comment 4 Mark Johnston freebsd_committer 2020-09-25 20:18:48 UTC
(In reply to zivillian from comment #3)
You would have to either build a new copy of ng_l2tp.ko with the patches applied, or update to 12.2, which is still being finalized.  The most recent fix is not yet in the 12 branch, but I expect to merge it in a few days, so it'll appear in 12.2 (due to be released in about a month).

If you want to try building ng_l2tp.ko from 12.1 sources, the steps are:

$ svn checkout https://svn.freebsd.org/base/releng/12.1 /usr/src
$ cd /usr/src
$ svn merge -c 353027 ^/head .
$ svn merge -c 366167 ^/head .
$ cd sys/modules/l2tp
$ make
$ sudo make install

This will install ng_l2tp.ko into /boot/modules.  By default the system will use the copy in /boot/kernel, so either copy the new one over or ensure that ng_l2tp.ko gets loaded from /boot/modules (e.g., by kldload'ing it manually).
Comment 5 commit-hook freebsd_committer 2020-09-28 11:46:24 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 28 11:46:04 UTC 2020
New revision: 366220
URL: https://svnweb.freebsd.org/changeset/base/366220

  MFC r366167:
  ng_l2tp: Fix callout synchronization in the rexmit timeout handler

  PR:	241133

_U  stable/12/
Comment 6 commit-hook freebsd_committer 2020-09-28 11:48:27 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 28 11:48:20 UTC 2020
New revision: 366221
URL: https://svnweb.freebsd.org/changeset/base/366221

  MFC r366167:
  ng_l2tp: Fix callout synchronization in the rexmit timeout handler

  PR:	241133

_U  stable/11/
Comment 7 Mark Johnston freebsd_committer 2020-09-28 11:53:29 UTC
I will mark the bug as resolved for now.  Please re-open if you are still able to trigger ng_l2tp panics with r353027 and r366167 applied.
Comment 8 commit-hook freebsd_committer 2020-09-28 12:15:32 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 28 12:14:38 UTC 2020
New revision: 366223
URL: https://svnweb.freebsd.org/changeset/base/366223

  MFS r366220:
  MFC r366167:
  ng_l2tp: Fix callout synchronization in the rexmit timeout handler

  PR:		241133
  Approved by:	re (gjb)

_U  releng/12.2/
Comment 9 Mark Johnston freebsd_committer 2020-09-30 17:58:35 UTC
Related patch: https://reviews.freebsd.org/D26586
Comment 10 zivillian 2020-10-05 08:16:56 UTC
After applying r353027 and r366167, it crashed again after 5 days.
Comment 11 Mark Johnston freebsd_committer 2020-10-06 13:30:44 UTC
(In reply to zivillian from comment #10)
Thanks for testing.  Could you try applying this patch on top of the others? https://people.freebsd.org/~markj/patches/ng_l2tp_debug.diff
Comment 12 zivillian 2020-10-13 19:19:40 UTC
I've applied the patch, and it crashed again after 6 days. Please let me know if you need a dump or logs.
Comment 13 Mark Johnston freebsd_committer 2020-10-13 19:29:21 UTC
(In reply to zivillian from comment #12)
Argh.  Thanks for your patience.  Would you be willing to provide the vmcore and debug symbols?
Comment 14 zivillian 2020-10-13 20:11:59 UTC
My vmcore is 30 MB (gzipped), so I can't upload it here. Do you have a preferred file host? Also, I don't know where to find the debug symbols (I built the module with the commands from comment 4, but there are no obvious files).
Comment 15 Mark Johnston freebsd_committer 2020-10-13 20:38:08 UTC
(In reply to zivillian from comment #14)
I have no preference.  Feel free to mail me a link privately if you prefer.

Debug symbols should be under /usr/lib/debug/boot/*.  A tarball of that is required.
Comment 16 zivillian 2020-10-14 15:54:39 UTC
Created attachment 218744 [details]
patched l2tp module

/usr/lib/debug/boot only contains an empty kernel folder. Have I missed something?

I've attached the module, and file(1) says it's not stripped, but I'm unsure if this is enough.
The vmcore can be downloaded from https://we.tl/t-wPU1j9UOwm
Comment 17 YUAN RUI 2021-08-26 08:36:43 UTC
It looks like pfSense has added a check to the latest code.

Comment 18 Gleb Smirnoff freebsd_committer 2021-09-09 17:05:49 UTC
I have also seen that and got a patch in progress: https://reviews.freebsd.org/D31476
Comment 19 commit-hook freebsd_committer 2021-09-10 18:28:56 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=89042ff77668555e77c88549e6ba697088ee72f9

commit 89042ff77668555e77c88549e6ba697088ee72f9
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2021-08-06 23:04:31 +0000
Commit:     Gleb Smirnoff <glebius@FreeBSD.org>
CommitDate: 2021-09-10 18:27:19 +0000

    ng_l2tp: improve callout locking.

    Apparently e62e4b85942 wasn't enough to close the race between
    a queue being flushed by a packet and callout executing, because
    the callouts used without a lock aren't 100% bulletproof. To close
    the race use callout_init_mtx() for L2TP timers, and make sure that
    all calls to ng_callout()/ng_uncallout() are done under the seq lock.

    If used properly, a locked callout can be used transparently with
    old netgraph KPI of ng_callout/ng_uncallout which predates locked

    While here, utilize ng_uncallout_drain() instead of ng_uncallout()
    on the node shutdown.
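The idea behind switching to callout_init_mtx() can be sketched with a toy "locked callout". This is a hypothetical userspace model, loosely mirroring the guarantee documented in callout(9): the callout framework takes the associated lock before running the handler and re-checks whether the callout was stopped, so a stop performed while holding that lock always wins. The mini_callout_* functions and count_handler() are invented for illustration and are not the kernel's callout or netgraph API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a mutex-protected callout; not callout(9) itself. */
struct mini_callout {
	pthread_mutex_t *mtx;    /* lock associated at init time */
	int pending;             /* armed and not yet stopped or fired */
	void (*func)(void *);
	void *arg;
};

void
mini_callout_init_mtx(struct mini_callout *c, pthread_mutex_t *mtx)
{
	c->mtx = mtx;
	c->pending = 0;
	c->func = NULL;
	c->arg = NULL;
}

/* Caller must hold c->mtx, analogous to calling ng_callout() under the seq lock. */
void
mini_callout_reset(struct mini_callout *c, void (*func)(void *), void *arg)
{
	assert(func != NULL);
	c->func = func;
	c->arg = arg;
	c->pending = 1;
}

/* Caller must hold c->mtx; returns 1 if a pending call was cancelled. */
int
mini_callout_stop(struct mini_callout *c)
{
	int was_pending = c->pending;

	c->pending = 0;
	return was_pending;
}

/*
 * What the timer side does when the deadline expires: acquire the
 * associated lock first, then re-check pending. The handler runs with
 * the lock held, so it never needs its own "was I cancelled?" test.
 * Returns 1 if the handler ran.
 */
int
mini_callout_fire(struct mini_callout *c)
{
	int ran = 0;

	pthread_mutex_lock(c->mtx);
	if (c->pending) {
		c->pending = 0;
		c->func(c->arg);
		ran = 1;
	}
	pthread_mutex_unlock(c->mtx);
	return ran;
}

/* Example handler: counts how many times it was invoked. */
void
count_handler(void *arg)
{
	int *n = arg;

	(*n)++;
}
```

The design point is that the "still pending?" decision moves out of each handler and into the single place that already holds the lock, which is why the commit also requires every ng_callout()/ng_uncallout() call to be made under the seq lock.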

    PR:                     241133
    Reviewed by:            mjg, markj
    Differential Revision:  https://reviews.freebsd.org/D31476

 sys/netgraph/ng_l2tp.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)