Bug 227116 - CURRENT doesn't boot with integer divide fault in uma_startup_count
Summary: CURRENT doesn't boot with integer divide fault in uma_startup_count
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Only Me
Assignee: Gleb Smirnoff
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-03-30 17:09 UTC by Daniel Kolesa
Modified: 2018-04-16 15:08 UTC (History)
5 users (show)

See Also:


Attachments
backtrace (702.69 KB, image/jpeg)
2018-03-30 17:09 UTC, Daniel Kolesa
no flags Details
backtrace on vanilla (565.92 KB, image/jpeg)
2018-03-30 17:28 UTC, Daniel Kolesa
no flags Details
patched kernel (567.73 KB, image/jpeg)
2018-03-31 18:47 UTC, Daniel Kolesa
no flags Details
different panic after patch (574.79 KB, image/jpeg)
2018-03-31 19:45 UTC, Daniel Kolesa
no flags Details
Add a case for a situation when zsize fits into slab space, but not after an alignment correction. (925 bytes, patch)
2018-03-31 20:09 UTC, Gleb Smirnoff
no flags Details | Diff
Add a case for a situation when zsize fits into slab space, but not after an alignment correction #2 (1.49 KB, patch)
2018-04-01 19:59 UTC, Gleb Smirnoff
no flags Details | Diff
successful boot (412.34 KB, image/jpeg)
2018-04-01 20:56 UTC, Daniel Kolesa
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Kolesa 2018-03-30 17:09:51 UTC
Created attachment 191980 [details]
backtrace

Hello,

trying to boot a recent 12-CURRENT snapshot memstick image results in kernel panic. The system is actually HardenedBSD, but the affected code suggests it's not relevant to HBSD-related changes. Attached are pictures with a backtrace; sadly I can't save a dump because it's an installer image. The pictured snapshot is from March 21, I tried a recent one from today as well.

This is on 2x Xeon E5-2683v3, so it's a NUMA configuration. The kernel in question has NUMA off and is effectively equivalent to GENERIC mins old FreeBSD compat and plus hardening.

I can possibly try later on a vanilla FreeBSD -CURRENT snapshot and see if I can reproduce it there as well.
Comment 1 Daniel Kolesa 2018-03-30 17:28:22 UTC
Created attachment 191983 [details]
backtrace on vanilla

Confirmed on 12-CURRENT vanilla memstick snapshot as well. 11-STABLE boots.
Comment 2 Daniel Kolesa 2018-03-30 17:41:01 UTC
cc'd relevant people who modified the affected code recently
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2018-03-30 22:21:31 UTC
Look up the source line of the trap (of course, on the FreeBSD sources).  I suspect that this would be the one of several hownmany()s in denominator.

How many vm domains do you have ?  If 4 domains, i.e. 'cluster on die' is turned on in bios, does it help with the boot to turn it off ?
Comment 4 Daniel Kolesa 2018-03-31 00:52:33 UTC
(In reply to Konstantin Belousov from comment #3)

I believe CoD is disabled right now, considering numactl on Linux shows two memory domains total (one per CPU).

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1

If it was on, it should've shown four, two per each CPU, right? This is my primary desktop/workstation computer, with mostly desktop style workloads (i.e. largely NUMA-unaware) so it's best to keep it disabled.

Should I try what happens when I enable it?
Comment 5 Daniel Kolesa 2018-03-31 00:55:45 UTC
Oh, and yes, I did look up the source of the error, and I tried disassembling the function to check out what's wrong, however I cannot make much sense of it; it seems like there is a divide by zero, but I don't understand why there would be a zero.

It seems to me like glebius's recent changes (in February) could've been the breakage, but I cannot figure out what's wrong with it. I'm unfamiliar with FreeBSD's internals and even less with kernel subsystems :)
Comment 6 Konstantin Belousov freebsd_committer freebsd_triage 2018-03-31 14:23:12 UTC
(In reply to Daniel Kolesa from comment #5)
Yes, cluster on die creates two numa domains per socket, then multiplied by the number of sockets.

Load the failing kernel into kgdb, then do 'list *uma_startup_count+0xe6', the expression is the symbolic address where the trap occured.  You might need to use kernel.full or put kernel.debug in the right place.
Comment 7 Daniel Kolesa 2018-03-31 14:32:06 UTC
(In reply to Konstantin Belousov from comment #6)

The problem really is that I have no working FreeBSD setup on this hardware, only an installer image; the filesystem in the image is read-only and therefore I cannot save a crash dump and I have no debug version of the kernel.

What I could possibly try is get a 11-STABLE installed on a flash drive, build a CURRENT kernel from a fresh source and try rebooting into that, saving a dump in the process, then boot into kernel.old (i.e. 11-STABLE kernel) and try kgdb from there. I was hoping to be able to avoid having to go through that though :) But if that is necessary, I'll try.
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2018-03-31 14:46:42 UTC
(In reply to Daniel Kolesa from comment #7)
You do not need the crash dump to get the information I asked about.  I suspect that you can even use linux gdb.
Comment 9 Daniel Kolesa 2018-03-31 14:48:55 UTC
Actually, I managed to obtain a kernel-dbg.txz of the same snapshot and kgdb it on my server, which runs HardenedBSD 11-STABLE.

This is the output:

(kgdb) list *uma_startup_count+0xe6
0xffffffff80e13296 is in uma_startup_count (/usr/src/sys/vm/uma_core.c:1827).
1822	
1823		bucket_init();
1824	
1825		booted = UMA_STARTUP;
1826	
1827	#ifdef UMA_DEBUG
1828		printf("UMA startup complete.\n");
1829	#endif
1830	}
1831	

The printed lines appear to be wrong. The actual line (1827) though appears correct, it's consistent with what I guessed from my disassembly.

1827 		pages += howmany(zones,
1828 		    UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));

Sadly, without a dump, it won't be possible inspect the memory, and I realized I cannot obtain a crash dump as the kernel panic happens immediately after it's loaded, before userspace and therefore before any swap/dump device could be set up.
Comment 10 Daniel Kolesa 2018-03-31 15:03:48 UTC
The incorrect lines seem to be coming from the 11-STABLE kernel running on the server, I guess this makes sense. But the 1827 line should still be correct...
Comment 11 Konstantin Belousov freebsd_committer freebsd_triage 2018-03-31 15:49:53 UTC
(In reply to Konstantin Belousov from comment #8)
In fact it is clear where the things get broken:

UMA_SLAB_SPACE is UMA_SLAB_SIZE - sizeof(struct uma_slab) == 3984.

You have 2 14-core CPUs, hyperthreaded.  This gives mp_maxid = 2 * 2 * 14 == 56.  Number of vm_domains is 2.
Then zsize, according to gdb, is 3984.

UMA_BOOT_ALIGN is 32, and roundup2(zsize, UMA_BOOT_ALIGN) is 4000.

So kernel tries to calculate howmany(zones, 3984 / 4000) == howmany(zones, 0).  If you look at the definition of howmany, you will see the division by the second arg.
Comment 12 Gleb Smirnoff freebsd_committer freebsd_triage 2018-03-31 15:56:32 UTC
Thanks Kostik. I was just preparing a patch to print out all variables.

So, the fix should be:

Index: uma_core.c
===================================================================
--- uma_core.c  (revision 331844)
+++ uma_core.c  (working copy)
@@ -1820,7 +1820,7 @@ uma_startup_count(int vm_zones)
 #endif
 
        /* Memory for the rest of startup zones, UMA and VM, ... */
-       if (zsize > UMA_SLAB_SIZE)
+       if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SIZE)
                pages += (zones + vm_zones) *
                    howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
        else
Comment 13 Daniel Kolesa 2018-03-31 15:59:06 UTC
(In reply to Konstantin Belousov from comment #11)

That's amusing, I just came to the same conclusion by isolating the code into a standalone .c program with stubbed out variables.

---
    /* Memory for the rest of startup zones, UMA and VM, ... */
    if (zsize > UMA_SLAB_SIZE) {
        printf("branch1\n");
        pages += (zones + vm_zones) *
            howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
    } else {
        printf("branch2\n");
        printf("roundup2 %d\n", roundup2(zsize, UMA_BOOT_ALIGN));
        pages += howmany(zones,
            UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));
    }
---

---
$ ./foo                
KSIZE 304
ZSIZE 3984
UMA_SLAB_SIZE 4096
sizeof uma_slab 112
UMA_SLAB_SPACE 3984
branch2
roundup2 4000
zsh: floating point exception  ./foo
---

So yeah, indeed, this seems to be the source of error.
Comment 14 Daniel Kolesa 2018-03-31 16:00:35 UTC
(In reply to Gleb Smirnoff from comment #12)

That's great. Looks like we've come to a conclusion pretty quickly after all :)
Comment 15 Gleb Smirnoff freebsd_committer freebsd_triage 2018-03-31 17:23:04 UTC
Can you please confirm that patch works for you?
Comment 16 Shawn Webb 2018-03-31 17:30:48 UTC
(In reply to Gleb Smirnoff from comment #15)
Daniel is relying on me to do custom builds for him to test. I'll perform a build this evening (US Eastern) for him to test with the patch.
Comment 17 Daniel Kolesa 2018-03-31 17:52:17 UTC
(In reply to Gleb Smirnoff from comment #15)

I can try, but if you could perhaps upload a kernel binary somewhere, that'd be great - otherwise I'll have to build on my HardenedBSD server, which has slow hardware.
Comment 18 Daniel Kolesa 2018-03-31 18:47:39 UTC
Created attachment 192014 [details]
patched kernel

I built the kernel on my server and replaced it in the installer image, but it still fails at the same place. Backtrace photo in attachment, kgdb here:

(kgdb) list *uma_startup_count+0xd6
0xffffffff80e3e7b6 is in uma_startup_count (/usr/src/sys/vm/uma_core.c:1827).
1822		/* Memory for the rest of startup zones, UMA and VM, ... */
1823		if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SIZE)
1824			pages += (zones + vm_zones) *
1825			    howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
1826		else
1827			pages += howmany(zones,
1828			    UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));
1829	
1830		/* ... and their kegs. Note that zone of zones allocates a keg! */
1831		pages += howmany(zones + 1,

As you can see from the source, it still fails in the very same place, and you can also see how I patched it. You can also see from the uname in the photo that it is indeed the patched kernel.
Comment 19 Daniel Kolesa 2018-03-31 18:50:11 UTC
Which now that I'm looking at it is actually pretty obvious, because UMA_SLAB_SIZE is 4096 (i.e. pagesize), and roundup2(zsize, UMA_BOOT_ALIGN) is 4000, which is bigger than zsize (3984) but still smaller than UMA_SLAB_SIZE.
Comment 20 Daniel Kolesa 2018-03-31 18:53:47 UTC
(In reply to Daniel Kolesa from comment #19)

Maybe, shouldn't the condition be

if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SPACE)

instead?
Comment 21 Daniel Kolesa 2018-03-31 19:45:49 UTC
Created attachment 192015 [details]
different panic after patch

I rebuilt the kernel once again with my patch this time (the updated condition, see comment above).

It didn't panic there, but it did panic in another place. Backtrace attached. Here's relevant kgdb listing:

0xffffffff80e3f015 is in keg_ctor (/usr/src/sys/vm/uma_core.c:1294).
1289			shsize = 0;
1290		else 
1291			shsize = sizeof(struct uma_slab);
1292	
1293		keg->uk_ipers = (slabsize - shsize) / rsize;
1294		KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE,
1295		    ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers));
1296	
1297		memused = keg->uk_ipers * rsize + shsize;
1298		wastedspace = slabsize - memused;


Looks like the assertion is failing this time.
Comment 22 Gleb Smirnoff freebsd_committer freebsd_triage 2018-03-31 20:03:01 UTC
I see the problem. The size of zone before alignment fits into single slab, but after alignment it doesn't. However, since we have only 1 item per slab, we don't need alignment correction at all. I'm working on a patch.
Comment 23 Daniel Kolesa 2018-03-31 20:06:19 UTC
(In reply to Gleb Smirnoff from comment #22)

Alright. Let me know when you have a patch so I can test; without being familiar with the source, this is not something I can patch myself here, and blindly patching wouldn't yield any reasonable results because it takes nearly 30 minutes for buildkernel to finish on my server.

So when there is a patch I'll try again and report back...
Comment 24 Gleb Smirnoff freebsd_committer freebsd_triage 2018-03-31 20:09:56 UTC
Created attachment 192017 [details]
Add a case for a situation when zsize fits into slab space, but not after an alignment correction.

Let's try this. All previous changes need to be reverted.
Comment 25 Daniel Kolesa 2018-03-31 20:15:04 UTC
(In reply to Gleb Smirnoff from comment #24)

Running a fresh kernel build now.
Comment 26 Daniel Kolesa 2018-03-31 20:55:56 UTC
(In reply to Gleb Smirnoff from comment #24)

This patch results in the same assertion failure (keg->uk_ipers 0) as my previous patch. Not posting a backtrace because it's the same.
Comment 27 Daniel Kolesa 2018-04-01 18:34:47 UTC
I'll try adding some debug logging tomorrow in order to try and figure out why the assertion is failing... unless anyone has a better idea.
Comment 28 Gleb Smirnoff freebsd_committer freebsd_triage 2018-04-01 19:59:46 UTC
Created attachment 192075 [details]
Add a case for a situation when zsize fits into slab space, but not after an alignment correction #2

Hmm, looks like there is not only a bug in pretty new function uma_startup_count() but also in old UMA core needs special handling for this edge case.

I have tested this patch emulating your amount of CPUs and domains. Please test it on a real hardware.
Comment 29 Daniel Kolesa 2018-04-01 20:03:27 UTC
(In reply to Gleb Smirnoff from comment #28)

Alright, thanks, I'll test this tomorrow. If that still fails I'll at least provide a bunch of logs with the values so things can be debugged better.
Comment 30 Daniel Kolesa 2018-04-01 20:56:25 UTC
Created attachment 192077 [details]
successful boot

(In reply to Gleb Smirnoff from comment #28)

Seems like there will be no need for that. I decided I wasn't going to wait until tomorrow and did a fresh build, and it seems it booted successfully all the way into the installer now :)

So unless there is a breakage on another hardware, this is probably okay to commit. Thanks for the fast solution :)
Comment 31 commit-hook freebsd_committer freebsd_triage 2018-04-02 05:12:22 UTC
A commit references this bug:

Author: glebius
Date: Mon Apr  2 05:12:00 UTC 2018
New revision: 331871
URL: https://svnweb.freebsd.org/changeset/base/331871

Log:
  Handle a special case when a slab can fit only one allocation,
  and zone has a large alignment. With alignment taken into
  account uk_rsize will be greater than space in a slab. However,
  since we have only one item per slab, it is always naturally
  aligned.

  Code that will panic before this change with 4k page:

  	z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
  	uma_zalloc(z, M_WAITOK);

  A practical scenario to hit the panic is a machine with 56 CPUs
  and 2 NUMA domains, which yields in zone size of 3984.

  PR:		227116
  MFC after:	2 weeks

Changes:
  head/sys/vm/uma_core.c
Comment 32 commit-hook freebsd_committer freebsd_triage 2018-04-02 05:15:28 UTC
A commit references this bug:

Author: glebius
Date: Mon Apr  2 05:14:32 UTC 2018
New revision: 331872
URL: https://svnweb.freebsd.org/changeset/base/331872

Log:
  In uma_startup_count() handle special case when zone will fit into
  single slab, but with alignment adjustment it won't. Again, when
  there is only one item in a slab alignment can be ignored. See
  previous revision of this file for more info.

  PR:		227116

Changes:
  head/sys/vm/uma_core.c
Comment 33 commit-hook freebsd_committer freebsd_triage 2018-04-16 15:08:09 UTC
A commit references this bug:

Author: glebius
Date: Mon Apr 16 15:07:20 UTC 2018
New revision: 332572
URL: https://svnweb.freebsd.org/changeset/base/332572

Log:
  Merge r331871:
    Handle a special case when a slab can fit only one allocation,
    and zone has a large alignment. With alignment taken into
    account uk_rsize will be greater than space in a slab. However,
    since we have only one item per slab, it is always naturally
    aligned.

    Code that will panic before this change with 4k page:

          z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
          uma_zalloc(z, M_WAITOK);

    A practical scenario to hit the panic is a machine with 56 CPUs
    and 2 NUMA domains, which yields in zone size of 3984 (on head).

  PR:		227116

Changes:
_U  stable/11/
  stable/11/sys/vm/uma_core.c