Bug 192558 - [patch] hard-to-hit crash in bpf catchpacket()
Summary: [patch] hard-to-hit crash in bpf catchpacket()
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 9.2-STABLE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2014-08-10 18:56 UTC by Chris Torek
Modified: 2018-05-23 10:26 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Torek 2014-08-10 18:56:08 UTC
The symptom is simply a crash that looks like a kernel
	NULL pointer bug, with the program counter in bcopy():

	  Fatal trap 12: page fault while in kernel mode
	  cpuid = 8; apic id = 08
	  fault virtual address   = 0x0
	  fault code              = supervisor write data, page not present
	[snip]

	The stack trace in ddb looks like this (I stripped out
	the frames - I don't have a full dump, we have that
	disabled for various mostly-good reasons, so the raw
	frames are not really useful here):

	  bcopy() at bcopy+0x16
	  bpf_mtap() at bpf_mtap+0x1d0
	  ether_nh_input() at ether_nh_input+0x167
	  netisr_dispatch_src() at netisr_dispatch_src+0x5e
	  igb_rxeof() at igb_rxeof+0x56d
	  igb_msix_que() at igb_msix_que+0x101
	  ...

	although of course it will be some other rxeof() if you're
	on some other network device.

	The bpf_mtap() pc is actually in (inlined) catchpacket(),
	where we do:

		bpf_append_bytes(d, d->bd_sbuf, curlen, ...)

        where the bd_bufmode is BPF_BUFMODE_BUFFER (it must be, we
        have not used the sysctl that enables the zero copy stuff)
        so this becomes bpf_buffer_append_bytes() which becomes
        bcopy(), and it appears the whole mess of calls has been
        inlined.

	The crash clearly has a NULL d->bd_sbuf:

	  bcopy+0x16:     repe movsq      (%rsi),%es:(%rdi)

	  rsi  0xfffff82044dbd768
	  rdi                   0

        (%es is not very interesting).  This means d->bd_sbuf was
        NULL, which was a bit of a mystery, as the code path
        starts with a lock assertion:

		BPFD_LOCK_ASSERT(d);

	and then has this bit:

		if (d->bd_fbuf == NULL) {
			/*
			 * There's no room in the store buffer, and no
			 * prospect of room, so drop the packet.  Notify
			 * the
			 * buffer model.
			 */
			bpf_buffull(d);
			++d->bd_dcount;
			return;
		}

	before doing this:

		ROTATE_BUFFERS(d);

	(the ROTATE causes bd_fbuf to move into bd_sbuf).

        But then I realized that it did this extra step in between
        these "test fbuf for NULL" and "ROTATE" steps:

		while (d->bd_hbuf_in_use)
			mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
			    PRINET, "bd_hbuf", 0);

        So, if we assume that the hbuf (hold buffer) *is* in use,
        we'll sleep waiting for the user to finish with it and
        then wake us up.  While we're asleep, we'll give up
        d->bd_lock.

        It sure looks like someone else (some other bpf consumer
        using the same d->bd_* fields) snuck in while we were
        asleep and used up d->bd_fbuf, setting it to NULL.  Then
        we got the lock and did our own ROTATE_BUFFERS() and moved
        the NULL to d->bd_sbuf.

        If this analysis is correct, the fix is simply to wait for
        the hbuf to be available *before* checking d->bd_fbuf,
        i.e., move the while loop up.

Environment:
System: FreeBSD 9.2-STABLE #0: Sat Aug  9 20:01:00 PDT 2014: root@builder:/usr/obj/usr/src/sys/XXX amd64

	Note, this applies to 9.x and 10.x -- the code is the same
	in both.

How-To-Repeat:
Difficult.  I don't have an actual reproducer.  The crash
	happened just once and I did the analysis based on stack
	trace and panic, etc.

        Clearly you need to have two or more threads/procs using
        bpf, so that one of them can have hbuf in use (during a
        copyout()).  You also need to have lots of traffic flying
        on the Ethernet (so that the buffers fill).

	I know the system has dhcpd running, not sure what other things
	are running at the time that have promiscuous mode enabled on
	the igb device (but promisc is enabled).

Fix:
I'm not 100% convinced this is the fix, or the entire fix.
        But it seems correct and is at least a step to fixing.
        Here it is in patch form...

        BTW, I note that there is a similar loop for the zero copy
        code, waiting for hbuf.  It may need to repeat its test
        after mtx_sleeping.  I did not look closely at the zero
        copy path after I verified we were not using it.

        It might also be wise to recheck curlen-vs-totlen, etc.
        Maybe we should just start the whole routine over if we
        sleep, for each hbuf sleep case.  But I think with this
        change, the not-zero-copy code will always work, albeit
	possibly in sub-optimal ways in this rare case.

			---

bpf: check d->bd_fbuf after sleep, not before

The code in catchpacket() checks that there's a valid d->bd_fbuf
before doing a ROTATE_BUFFERS, which will move the sbuf (store
buffer) to the hbuf (hold buffer) and make the fbuf (free buffer)
become the sbuf.  OK so far, but *after* it verifies this fbuf,
it then mtx_sleep-waits for the hold buffer to be available.  If
the fbuf goes NULL during the wait when we drop the lock, there
will be no sbuf once we do the ROTATE_BUFFERS.

To fix this, simply check fbuf after possibly waiting for hbuf,
rather than before.


diff --git a/sys/net/bpf.c b/sys/net/bpf.c
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -2352,6 +2352,9 @@ catchpacket(struct bpf_d *d, u_char *pkt
 #endif
 		curlen = BPF_WORDALIGN(d->bd_slen);
 	if (curlen + totlen > d->bd_bufsize || !bpf_canwritebuf(d)) {
+		while (d->bd_hbuf_in_use)
+			mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
+			    PRINET, "bd_hbuf", 0);
 		if (d->bd_fbuf == NULL) {
 			/*
 			 * There's no room in the store buffer, and no
@@ -2362,9 +2365,6 @@ catchpacket(struct bpf_d *d, u_char *pkt
 			++d->bd_dcount;
 			return;
 		}
-		while (d->bd_hbuf_in_use)
-			mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
-			    PRINET, "bd_hbuf", 0);
 		ROTATE_BUFFERS(d);
 		do_wakeup = 1;
 		curlen = 0;

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
Comment 1 John Baldwin freebsd_committer freebsd_triage 2014-08-11 19:38:04 UTC
I think your diagnosis is correct, but I'm unsure of the fix.  In particular, I suspect we may want to avoid sleeping if the fbuf is full so that this "fails fast" to avoid blocking network traffic.  Also, when you are awakened, other threads might have already called catchpacket() (e.g. a multiq NIC), so you don't actually know that you should be in the current state.  I worry that you need to jump back to the preceding if, so something like:

    again:
        if (curlen + totlen > d->bd_bufsize || !bpf_canwritebuf(d)) {
            if (d->bd_fbuf == NULL) {
                ...
            }
            if (d->hd_buf_in_use) {
                mtx_sleep(...);
                goto again;
            }
            ...
        }


Arguably the earlier call to mtx_sleep() earlier in this function is also racy as you don't know that the condition is still true when you awake given a concurrent call to catchpacket() from another RX queue on the same NIC.  That is, I think that loop should also be altered, though it can probably avoid a goto:

        while (d->bd_fbuf == NULL && bpf_canfreebuf(d)) {
            if (d->bd_hbuf_in_use) {
                mtx_sleep(...);
                continue;
            }
            ...
        }

Possibly, the 'again' label needs to move all the way up to before this 'while' loop, and if so, the hrdlen/totlen/curlen assigments should be moved up before the first while.
Comment 2 Guy Helmer freebsd_committer 2014-08-13 14:03:19 UTC
(adding myself to CC list to stay abreast of discussion)
Comment 3 Eitan Adler freebsd_committer freebsd_triage 2018-05-23 10:26:58 UTC
batch change of PRs untouched in 2018 marked "in progress" back to open.