Bug 237269

Summary: panic in glabel (g_label_destroy) stop after resizing GPT partition
Product: Base System
Reporter: Andrew "RhodiumToad" Gierth <andrew>
Component: kern
Assignee: freebsd-geom (Nobody) <geom>
Status: Closed FIXED
Severity: Affects Some People
CC: admin, cem, markj, pawel.worach, yannk
Priority: ---
Keywords: crash, needs-qa
Version: CURRENT
Flags: koobs: mfc-stable12?, koobs: mfc-stable11?
Hardware: Any
OS: Any
Attachments:
patch against stable/12 (flags: none)

Description Andrew "RhodiumToad" Gierth 2019-04-13 21:03:59 UTC
Tested on stable/11 at r346167, but the backtrace below comes from an older stable/11 build (maybe 2 months back).

To reproduce:

md_unit=$(mdconfig -t swap -s 30MB)
geom part create -s GPT "$md_unit"
geom part add -s 10M -t linux-swap -l tst0 "$md_unit"
geom part resize -i 1 -s 20M "$md_unit"

# at this point "glabel status" shows two gpt/tst0 entries,
# one of which has no consumer; trying to correct this causes
# a panic:

glabel stop gpt/tst0
glabel stop gpt/tst0  # BOOM

Trace:

#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:315
#2  0xffffffff80468255 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:383
#3  0xffffffff80468621 in vpanic (fmt=<optimized out>, ap=0xfffffe022c5c25f0)
    at /usr/src/sys/kern/kern_shutdown.c:776
#4  0xffffffff80468463 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:707
#5  0xffffffff80722faf in trap_fatal (frame=0xfffffe022c5c27e0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:904
#6  0xffffffff80723009 in trap_pfault (frame=0xfffffe022c5c27e0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:744
#7  0xffffffff80722732 in trap (frame=0xfffffe022c5c27e0)
    at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  g_slice_spoiled (cp=0x0) at /usr/src/sys/geom/geom_slice.c:511
#10 0xffffffff80f826d0 in g_label_destroy (force=<optimized out>, 
    gp=<optimized out>) at /usr/src/sys/geom/label/g_label.c:267
#11 g_label_ctl_destroy (req=<optimized out>, mp=<optimized out>)
    at /usr/src/sys/geom/label/g_label.c:514
#12 g_label_config (req=0xfffffe0236bb67c0, 
    mp=0xffffffff80f84fd8 <g_label_class>, verb=<optimized out>)
    at /usr/src/sys/geom/label/g_label.c:545
#13 0xffffffff803f9700 in one_event () at /usr/src/sys/geom/geom_event.c:264
#14 g_run_events () at /usr/src/sys/geom/geom_event.c:286
#15 0xffffffff804374d5 in fork_exit (
    callout=0xffffffff803fb820 <g_event_procbody>, arg=0x0, 
    frame=0xfffffe022c5c29c0) at /usr/src/sys/kern/kern_fork.c:1072
#16 <signal handler called>

g_label_destroy clearly isn't expecting the case where the label has no consumer at all.
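
To illustrate the failure mode, here is a minimal userspace sketch in C. It is not the kernel code and not the eventual fix: the struct and function names (model_consumer, model_label_geom, model_spoil, model_label_destroy) are hypothetical stand-ins. It only shows the shape of the problem, a destroy path that hands its consumer pointer to a routine that dereferences it, and what a NULL guard in front of that call would look like.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a GEOM consumer. */
struct model_consumer {
	int spoiled;
};

/* Hypothetical stand-in for a label geom; consumer may be NULL. */
struct model_label_geom {
	const char *name;
	struct model_consumer *consumer;
};

/* Models a spoil routine that dereferences its argument unconditionally. */
static void
model_spoil(struct model_consumer *cp)
{
	cp->spoiled = 1;
}

/*
 * Destroy path with a guard.  Returns 0 if the consumer was spoiled,
 * 1 if the geom had no consumer and the spoil step was skipped.
 * Without the NULL check, the stale label would reach model_spoil(NULL).
 */
static int
model_label_destroy(struct model_label_geom *gp)
{
	if (gp->consumer == NULL) {
		printf("%s: no consumer, skipping spoil\n", gp->name);
		return (1);
	}
	model_spoil(gp->consumer);
	return (0);
}
```

In this model the second "glabel stop gpt/tst0" corresponds to calling model_label_destroy() on a geom whose consumer is NULL. Whether a guard like this is the right fix in the kernel, or whether the consumer-less label should never exist in the first place, is a separate question.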
Comment 1 Andrew "RhodiumToad" Gierth 2019-04-14 18:04:33 UTC
Also confirmed on stable/12 r346169

The resize of the partition provokes these messages, which I had previously missed:

g_access(958): provider gptid/0d17d86a-5edf-11e9-971a-00a0985beaef has error 6 set
g_access(958): provider gptid/0d17d86a-5edf-11e9-971a-00a0985beaef has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_dev_taste: make_dev_p() failed (gp->name=gpt/tst0, error=17)
g_dev_taste: make_dev_p() failed (gp->name=gptid/0d17d86a-5edf-11e9-971a-00a0985beaef, error=17)
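
For reference, the two error numbers in these messages are ordinary errno values: 6 is ENXIO and 17 is EEXIST. The small C sketch below decodes them; the helper names errno_name and print_error are made up for this example, and the strerror() text varies by platform.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Map the two error numbers seen in the messages to symbolic names. */
static const char *
errno_name(int error)
{
	switch (error) {
	case ENXIO:
		return ("ENXIO");
	case EEXIST:
		return ("EEXIST");
	default:
		return ("(other)");
	}
}

/* Print the number, its symbolic name, and the platform's description. */
static void
print_error(int error)
{
	printf("error %d = %s (%s)\n", error, errno_name(error),
	    strerror(error));
}
```

Calling print_error(6) and print_error(17) covers both messages: the providers carry ENXIO, and make_dev_p() fails because the device name already exists.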
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-04-14 18:13:58 UTC
On CURRENT I see:

g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_dev_taste: make_dev_p() failed (gp->name=gpt/tst0, error=17)
g_dev_taste: make_dev_p() failed (gp->name=gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9, error=17)
md0p1 resized

# glabel status
                                      Name  Status  Components
                                  gpt/tst0     N/A  N/A
gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9     N/A  N/A
                                  gpt/tst0     N/A  md0p1
gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9     N/A  md0p1

# glabel stop gpt/tst0
# glabel stop gpt/tst0
...
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code              = supervisor read data  , page not present
...
current process         = 13 (g_event)
...
--- trap 0xc, rip = 0xffffffff80af008d, rsp = 0xfffffe00004ff980, rbp = 0xfffffe00004ff990 ---
g_slice_spoiled() at g_slice_spoiled+0x1d/frame 0xfffffe00004ff990
g_label_config() at g_label_config+0x23f/frame 0xfffffe00004ffa10
g_ctl_req() at g_ctl_req+0x6e/frame 0xfffffe00004ffa30
g_run_events() at g_run_events+0xf5/frame 0xfffffe00004ffa70
fork_exit() at fork_exit+0x84/frame 0xfffffe00004ffab0

So the issue exists in head as well.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-04-14 18:15:12 UTC
> g_label_destroy clearly isn't expecting the case where the label has no consumer at all.

This assumption may be ok, generally; it looks like the root of the problem is that "gpart resize" fails in an unacceptable way (doesn't clean up after itself).
Comment 4 Conrad Meyer freebsd_committer freebsd_triage 2019-04-14 18:17:08 UTC
(My repro from comment #2 is on a somewhat old CURRENT, FWIW: r345283 + some local changes.  But I don't recall any gpart/glabel changes in recent history.)
Comment 5 Andrew "RhodiumToad" Gierth 2019-04-14 20:03:10 UTC
(In reply to Conrad Meyer from comment #3)

The resize does actually succeed, not fail.
Comment 6 Conrad Meyer freebsd_committer freebsd_triage 2019-04-14 23:46:10 UTC
(In reply to andrew from comment #5)
> The resize does actually succeed, not fail.

The failure is leaving behind the duplicate copy of "tst0."
Comment 7 Andrew "RhodiumToad" Gierth 2019-04-17 03:53:56 UTC
I think I have identified the sequence of events:

1. When we resize the partition, it spoils the attached LABEL consumers (correctly, I think, since those might depend on the partition size);

2. The spoiled LABELs are orphaned, orphaning the DEV geoms that are their sole consumers;

3. The DEV geoms destruct, detaching from the LABELs and removing the /dev entries;

4. BEFORE anything significant can happen, DEV re-tastes the now-orphaned LABEL geoms, getting an ENXIO from trying to open them, but for whatever reason attaches to them anyway, creating new /dev entries and attaching new consumers to the orphan LABELs;

5. Since the old LABELs now have attached consumers that aren't going to die, the withering process never completes, and the /dev entries remain attached to the orphaned LABEL geoms that now have no partition under them;

6. When DEV gets to taste the new providers for the new LABEL geoms for the resized partitions, it can't create the /dev entries for them because they already exist.

You'd think that step 4 wouldn't happen because a withering geom shouldn't be offered for tasting, but it turns out there is a code path where this happens, and strangely enough it's in g_resize_provider_event.

So there are several places here where questionable things are happening, even beyond the assumption that actually causes the crash (LABEL's assumption that it always has a consumer).
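
Steps 4 to 6 can be modelled in userspace C as below. This is a sketch under stated assumptions, not GEOM code: the types and functions (model_devfs, taste_geom, model_make_dev, model_taste) are hypothetical, /dev is reduced to a flat table of names, and the skip_withering flag stands for the rule that a withering geom is not offered for tasting.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_DEV 8

/* Hypothetical model of the /dev namespace: a flat table of names. */
struct model_devfs {
	const char *names[MODEL_MAX_DEV];
	int count;
};

/* Hypothetical model of a geom: only its name and withering state. */
struct taste_geom {
	const char *name;
	bool withering;
};

/* Create a /dev entry; fails with EEXIST if the name is already taken. */
static int
model_make_dev(struct model_devfs *fs, const char *name)
{
	for (int i = 0; i < fs->count; i++)
		if (strcmp(fs->names[i], name) == 0)
			return (EEXIST);
	if (fs->count == MODEL_MAX_DEV)
		return (ENOSPC);
	fs->names[fs->count++] = name;
	return (0);
}

/*
 * Taste a geom.  With skip_withering set, a withering geom is never
 * offered for tasting, so it cannot claim a /dev name (returns ENXIO).
 */
static int
model_taste(struct model_devfs *fs, const struct taste_geom *gp,
    bool skip_withering)
{
	if (skip_withering && gp->withering)
		return (ENXIO);
	return (model_make_dev(fs, gp->name));
}
```

With skip_withering false, tasting the old withering gpt/tst0 first claims the name and the new gpt/tst0 then gets EEXIST, which matches the error=17 messages. With skip_withering true, the old geom is rejected and the new one gets its /dev entry.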
Comment 8 Andrew "RhodiumToad" Gierth 2021-04-24 14:26:33 UTC
I can no longer reproduce this on 13.x; possibly it was fixed by https://reviews.freebsd.org/D26658?

(which afaict was not MFC'd despite the "MFC after" annotation)
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2022-03-16 16:05:12 UTC
(In reply to andrew from comment #8)
It was eventually MFCed into 13.0, for what that's worth.  It's not obvious to me that that change fixed the problem though.
Comment 10 Vladyslav V. Prodan 2022-07-05 00:49:12 UTC
This issue is reproducible in FreeBSD 12.3-STABLE.
Will there be a patch for 12.x?

# uname 
FreeBSD 12.3-STABLE #0 r372168M: Thu Jun 23 09:30:29 EEST 2022
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2022-07-05 14:41:43 UTC
Created attachment 235084
patch against stable/12

If it's reproducible and you can test custom kernels, please try the attached patch.
Comment 12 Andrew "RhodiumToad" Gierth 2023-05-23 13:30:42 UTC
I can no longer reproduce this on 13-stable, so I'm closing it.