Bug 237269 - panic in glabel (g_label_destroy) stop after resizing GPT partition
Summary: panic in glabel (g_label_destroy) stop after resizing GPT partition
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-geom mailing list
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2019-04-13 21:03 UTC by andrew
Modified: 2019-07-10 17:10 UTC
CC List: 2 users

See Also:
koobs: mfc-stable11?
koobs: mfc-stable12?


Description andrew 2019-04-13 21:03:59 UTC
Tested on stable/11 at r346167, but the backtrace below comes from an older stable/11 build (maybe 2 months back).

To reproduce:

md_unit=$(mdconfig -t swap -s 30MB)
geom part create -s GPT "$md_unit"
geom part add -s 10M -t linux-swap -l tst0 "$md_unit"
geom part resize -i 1 -s 20M "$md_unit"

# at this point "glabel status" shows two gpt/tst0 entries,
# one of which has no consumer; trying to correct this causes
# a panic:

glabel stop gpt/tst0
glabel stop gpt/tst0  # BOOM

Trace:

#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:315
#2  0xffffffff80468255 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:383
#3  0xffffffff80468621 in vpanic (fmt=<optimized out>, ap=0xfffffe022c5c25f0)
    at /usr/src/sys/kern/kern_shutdown.c:776
#4  0xffffffff80468463 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:707
#5  0xffffffff80722faf in trap_fatal (frame=0xfffffe022c5c27e0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:904
#6  0xffffffff80723009 in trap_pfault (frame=0xfffffe022c5c27e0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:744
#7  0xffffffff80722732 in trap (frame=0xfffffe022c5c27e0)
    at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  g_slice_spoiled (cp=0x0) at /usr/src/sys/geom/geom_slice.c:511
#10 0xffffffff80f826d0 in g_label_destroy (force=<optimized out>, 
    gp=<optimized out>) at /usr/src/sys/geom/label/g_label.c:267
#11 g_label_ctl_destroy (req=<optimized out>, mp=<optimized out>)
    at /usr/src/sys/geom/label/g_label.c:514
#12 g_label_config (req=0xfffffe0236bb67c0, 
    mp=0xffffffff80f84fd8 <g_label_class>, verb=<optimized out>)
    at /usr/src/sys/geom/label/g_label.c:545
#13 0xffffffff803f9700 in one_event () at /usr/src/sys/geom/geom_event.c:264
#14 g_run_events () at /usr/src/sys/geom/geom_event.c:286
#15 0xffffffff804374d5 in fork_exit (
    callout=0xffffffff803fb820 <g_event_procbody>, arg=0x0, 
    frame=0xfffffe022c5c29c0) at /usr/src/sys/kern/kern_fork.c:1072
#16 <signal handler called>

g_label_destroy clearly isn't expecting the case where the label has no consumer at all.
Comment 1 andrew 2019-04-14 18:04:33 UTC
Also confirmed on stable/12 r346169

The resize of the partition provokes these messages, which I had previously missed:

g_access(958): provider gptid/0d17d86a-5edf-11e9-971a-00a0985beaef has error 6 set
g_access(958): provider gptid/0d17d86a-5edf-11e9-971a-00a0985beaef has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_dev_taste: make_dev_p() failed (gp->name=gpt/tst0, error=17)
g_dev_taste: make_dev_p() failed (gp->name=gptid/0d17d86a-5edf-11e9-971a-00a0985beaef, error=17)
Comment 2 Conrad Meyer freebsd_committer 2019-04-14 18:13:58 UTC
On CURRENT I see:

g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_access(958): provider gpt/tst0 has error 6 set
g_dev_taste: make_dev_p() failed (gp->name=gpt/tst0, error=17)
g_dev_taste: make_dev_p() failed (gp->name=gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9, error=17)
md0p1 resized

# glabel status
                                      Name  Status  Components
                                  gpt/tst0     N/A  N/A
gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9     N/A  N/A
                                  gpt/tst0     N/A  md0p1
gptid/b2f6ac01-5ee0-11e9-b200-00a098d53cc9     N/A  md0p1

# glabel stop gpt/tst0
# glabel stop gpt/tst0
...
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code              = supervisor read data  , page not present
...
current process         = 13 (g_event)
...
--- trap 0xc, rip = 0xffffffff80af008d, rsp = 0xfffffe00004ff980, rbp = 0xfffffe00004ff990 ---
g_slice_spoiled() at g_slice_spoiled+0x1d/frame 0xfffffe00004ff990
g_label_config() at g_label_config+0x23f/frame 0xfffffe00004ffa10
g_ctl_req() at g_ctl_req+0x6e/frame 0xfffffe00004ffa30
g_run_events() at g_run_events+0xf5/frame 0xfffffe00004ffa70
fork_exit() at fork_exit+0x84/frame 0xfffffe00004ffab0

So the issue exists in head as well.
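
To check whether a system is in this state without going on to the second "glabel stop", the duplicate rows in the status output above can be picked out mechanically. The following is only a minimal sketch: dup_labels is a hypothetical helper name (not an existing tool), and it assumes the three-column "glabel status" layout shown above (Name, Status, Components).

```shell
#!/bin/sh
# dup_labels (hypothetical helper): read "glabel status" output on stdin
# and print each label name that appears on more than one row.
# Assumes whitespace-separated columns: Name, Status, Components.
dup_labels() {
    awk '
        # Skip the header row.
        $1 == "Name" && $2 == "Status" { next }
        # Print a name the second time it is seen, so each duplicate
        # is reported exactly once.
        NF >= 3 { if (++seen[$1] == 2) print $1 }
    '
}

# Intended use on an affected system (not run here):
#   glabel status | dup_labels
```

Fed the status table from this comment, it should report gpt/tst0 and the gptid label; on a healthy system it prints nothing.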
Comment 3 Conrad Meyer freebsd_committer 2019-04-14 18:15:12 UTC
> g_label_destroy clearly isn't expecting the case where the label has no consumer at all.

This assumption may be ok, generally; it looks like the root of the problem is that "gpart resize" fails in an unacceptable way (doesn't clean up after itself).
Comment 4 Conrad Meyer freebsd_committer 2019-04-14 18:17:08 UTC
(My repro from comment #2 is on a somewhat old CURRENT, FWIW: r345283 + some local changes.  But I don't recall any gpart/glabel changes in recent history.)
Comment 5 andrew 2019-04-14 20:03:10 UTC
(In reply to Conrad Meyer from comment #3)

The resize does actually succeed, not fail.
Comment 6 Conrad Meyer freebsd_committer 2019-04-14 23:46:10 UTC
(In reply to andrew from comment #5)
> The resize does actually succeed, not fail.

The failure is leaving behind the duplicate copy of "tst0."
Comment 7 andrew 2019-04-17 03:53:56 UTC
I think I have identified the sequence of events:

1. When we resize the partition, it spoils the attached LABEL consumers (correctly, I think, since those might depend on the partition size);

2. The spoiled LABELs are orphaned, orphaning the DEV geoms that are their sole consumers;

3. The DEV geoms destruct, detaching from the LABELs and removing the /dev entries;

4. BEFORE anything significant can happen, DEV re-tastes the now-orphaned LABEL geoms. It gets ENXIO (the "error 6" in the g_access messages) when trying to open them but, for whatever reason, attaches to them anyway, creating new /dev entries and attaching new consumers to the orphaned LABELs;

5. Since the old LABELs now have attached consumers that aren't going to die, the withering process never completes, and the /dev entries remain attached to the orphaned LABEL geoms that now have no partition under them;

6. When DEV gets to taste the new providers for the new LABEL geoms for the resized partitions, it can't create the /dev entries for them because they already exist (hence make_dev_p() failing with error 17, EEXIST).

You'd think that step 4 wouldn't happen, because a withering geom shouldn't be offered for tasting, but it turns out there is a code path where this happens, and strangely enough it is in g_resize_provider_event.

So there are several places here where questionable things are happening, even beyond the assumption that actually causes the crash (LABEL's assumption that it always has a consumer).
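
The orphaned LABEL geoms left behind in step 5 are presumably the rows that comment #2 shows with N/A in the Components column. As a rough sketch under that assumption, they can be listed from "glabel status" output before deciding what to stop. orphan_labels is a hypothetical helper name, not an existing tool, and reading N/A as "no consumer" is an inference from that output rather than something confirmed in the code.

```shell
#!/bin/sh
# orphan_labels (hypothetical helper): read "glabel status" output on
# stdin and print the names of labels whose Components column is "N/A".
# ASSUMPTION: "N/A" in Components means the label geom has no consumer.
orphan_labels() {
    awk '
        # Skip the header row.
        $1 == "Name" && $2 == "Status" { next }
        # Rows are expected to have exactly three fields:
        # Name, Status, Components.
        NF == 3 && $3 == "N/A" { print $1 }
    '
}

# Intended use on an affected system (not run here):
#   glabel status | orphan_labels
```

Any name it prints is one where "glabel stop" may reach the no-consumer path in g_label_destroy, so it is safer to avoid stopping those on a machine that matters.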