Bug 211028 - [GEOM][Hyper-V] gpart can't detect the new free space after the disk capacity changes
Summary: [GEOM][Hyper-V] gpart can't detect the new free space after the disk capacity...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-12 09:32 UTC by Dexuan Cui
Modified: 2016-09-30 03:49 UTC (History)
7 users (show)

See Also:


Attachments
The patch can work, but it may be a little ugly... (1.24 KB, patch)
2016-07-15 13:38 UTC, Dexuan Cui
no flags Details | Diff
Increase debugging in geom_disk (1.49 KB, patch)
2016-07-24 10:53 UTC, Andrey V. Elsukov
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dexuan Cui 2016-07-12 09:32:25 UTC
When testing Hyper-V's Online VHDX Resize feature (refer to Bug 211000), I noticed an issue: after resizing “da1”, if there was a disk read before "camcontrol reprobe da1", gpart couldn't detect the new “free space", though it could detect the new disk capacity.

To reproduce the bug, we need to run FreeBSD 12-CURRENT guest on Hyper-V (we need to make sure the 2 patches mentioned by https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211000#c4 are used to avoid the known issues) and do the following

1. This is before the disk capacity change:
[root@decui-b11 ~]# gpart show  da1
=>      63  83886017  da1  MBR  (40G)
        63  83886017       - free -  (40G)

2. Expand the disk capacity from 40GB to 50GB by Hyper-V management tool.

3. 
[root@decui-b11 ~]# dd if=/dev/da1 of=/dev/null bs=512 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000715 secs (715864 bytes/sec)
[root@decui-b11 ~]# gpart show  da1
=>      63  83886017  da1  MBR  (50G)
        63  83886017       - free -  (40G)

We can see the 'free' space can't grow to 50GB.

4. This can't help:
[root@decui-b11 ~]# camcontrol reprobe da1
[root@decui-b11 ~]# gpart show  da1
=>      63  83886017  da1  MBR  (50G)
        63  83886017       - free -  (40G)
Comment 1 Dexuan Cui 2016-07-12 09:34:13 UTC
BTW, dmesg shows:

This corresponds to the "dd":
(da1:storvsc1:0:0:0): storvsc scsi_status = 2
(da1:storvsc1:0:0:0): Capacity data has changed
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (6) [0 b2 0 4 1]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b0 0 3c 0]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b1 0 3c 0]

This corresponds to the "reprobe":
(probe0:storvsc1:0:0:0): storvsc has passed inquiry response (36) validation
(probe0:storvsc1:0:0:0): storvsc scsi_status = 2
(probe0:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 0 0 8 0]
(probe0:storvsc1:0:0:0): storvsc skips the validation for short inquiry (6) [0 83 0 30 1]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (6) [0 b2 0 4 1]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b0 0 3c 0]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b1 0 3c 0]
Comment 2 Dexuan Cui 2016-07-12 09:45:52 UTC
(In reply to Dexuan Cui from comment #0)
In comment 0, if I skip step 3, that is, I don't run the "dd" and only run "camcontrol reprobe da1", "gpart show  da1" can detect the new "free" space correctly.


BTW, when I get the bug, there is a workaround:
dd if=/dev/da1 of=/dev/da1 bs=512 count=0

This means, when da1 is opened for writing, somehow geom will be forced to recalculate the free space.
Comment 3 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-12 12:03:00 UTC
So, disk provider has changed its size, I assume `camcontrol reprobe` has initiated g_disk_resize event.
How it should work in my understanding:
1. `camcontrol reprobe` triggers calling of disk_resize()
2. disk_resize() calls g_disk_resize() event
3. GPART class receives resize event, g_part_resize() should be called
4. GPART_MBR's method g_part_mbr_resize() should be called
5. "GEOM_PART: da1 was automatically resized. Use `gpart commit da1` to save changes or `gpart undo da1` to revert them." message should be printed in the console.

If you don't see the message from p.5 - something doesn't work as expected.
You can use the following dtrace command to debug:

# kldload dtraceall
# dtrace -n 'dtrace -n 'fbt::*disk_resize:entry fbt::g_part_*resize:entry {stack(5);}'
Comment 4 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-12 12:04:43 UTC
# kldload dtraceall
# dtrace -n 'fbt::*disk_resize:entry fbt::g_part_*resize:entry {stack(5);}'
Comment 5 Dexuan Cui 2016-07-13 11:49:58 UTC
(In reply to Andrey V. Elsukov from comment #3)
Thank you Andrey for the detailed instructions!

g_resize_provider() returns at Line 671 and at this line, both 'size' and 'pp->mediasize' are the new disk capacity! This is pretty strange, because at this time pp->mediasize is supposed to be the old disk capacity. I'm trying to find out why this happens.

 661 void
 662 g_resize_provider(struct g_provider *pp, off_t size)
 663 {
 664         struct g_hh00 *hh;
 665
 666         G_VALID_PROVIDER(pp);
 667         if (pp->flags & G_PF_WITHER)
 668                 return;
 669
 670         if (size == pp->mediasize)
 671                 return;
 672
 673         hh = g_malloc(sizeof *hh, M_WAITOK | M_ZERO);
 674         hh->pp = pp;
 675         hh->size = size;
 676         g_post_event(g_resize_provider_event, hh, M_WAITOK, NULL);
 677 }
Comment 6 Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-07-13 13:01:49 UTC
It looks like the da(4) driver correctly noticed the updated capacity:

(da1:storvsc1:0:0:0): Capacity data has changed

So "camcontrol reprobe" shouldn't really do anything.
Comment 7 Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-07-13 13:04:57 UTC
Also - correct me if I'm wrong, but even if you don't do reprobe, you still see the new size of da1 - 50G, so GEOM already "sees" the updated size, it's just that g_part(4) doesn't update the "free" size.  Does that make sense?
Comment 8 Dexuan Cui 2016-07-13 13:16:24 UTC
(In reply to Edward Tomasz Napierala from comment #7)
Hi Edword,
Yes, exactly.

After the disk capacity change, if I do "camcontrol reprobe da1", gpart can see  both the capacity & free space are updated.

After the disk capacity change, if first I open /dev/da1 for reading (e.g. dd if=/dev/da1 of=tmp bs=512 count=1), da driver notices "Capacity data has changed", and gpart can see the new capacity, but the "free" space shown by gpart is still the old. Next, "camcontrol reprobe da1" can't help and the "free" space shown by gpart is still the old. Next, opening da1 for writing (e.g. dd if=/dev/da1 of=/dev/da1 bs=512 count=0) can make gpart show the new "free" space.
Comment 9 Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-07-13 13:47:28 UTC
Hm.  Okay.  The read dd works, because the da(4) driver gets notified about the capacity change on the first read command sent to the disk; basically the disk responds with Unit Attention indicating this.  The write dd works, because it forces GEOM retaste.  So it still fits the theory that GEOM itself is fine, and it's just g_part(4) not handling it properly.
Comment 10 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-13 13:52:59 UTC
(In reply to Edward Tomasz Napierala from comment #9)
> Hm.  Okay.  The read dd works, because the da(4) driver gets notified about
> the capacity change on the first read command sent to the disk; basically
> the disk responds with Unit Attention indicating this.  The write dd works,
> because it forces GEOM retaste.  So it still fits the theory that GEOM
> itself is fine, and it's just g_part(4) not handling it properly.

As I said, tracing via dtrace or geom's internal capability should show that g_part receives resize event. In previous messages I didn't see this.
Comment 11 Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-07-13 14:02:10 UTC
Hm, you're right.

Dexuan - the g_resize_provider() you're seeing, is that after "camcontrol reprobe", or after the first read?  Can you bump GEOM debugflags and show what happens on the first read, when da(4) is supposed to notice the updated capacity?
Comment 12 Dexuan Cui 2016-07-13 14:17:33 UTC
(In reply to Edward Tomasz Napierala from comment #11)
Thank you both for the analysis!

The g_resize_provider() is after the first read, and before the "camcontrol reprobe". I'm going to use try debugflags and dtrace.
Comment 13 Dexuan Cui 2016-07-13 14:32:20 UTC
(In reply to Edward Tomasz Napierala from comment #11)
With kern.geom.debugflags=253 (log everything except G_T_BIO)  and debug.bootverbose=1, after the disk capacity was changed from 180GB to 200GB, I got the below when running "dd if=/dev/da1 of=tmp bs=512 count=1":

g_dev_open(da1, 1, 8192, 0xfffff80004433500)
g_access(0xfffff80003faba00(da1), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xfffff80003d19c00(da1)
g_disk_access(da1, 1, 0, 0)
(da1:storvsc1:0:0:0): storvsc scsi_status = 2
(da1:storvsc1:0:0:0): Capacity data has changed
g_post_event_x(0xffffffff80996f40, 0xfffff800043fd800, 1, 0)
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (6) [0 b2 0 4 1]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b0 0 3c 0]
(da1:storvsc1:0:0:0): storvsc skips the validation for short inquiry (5) [0 b1 0 3c 0]
g_dev_close(da1, 131073, 8192, 0xfffff80004433500)
g_access(0xfffff80003faba00(da1), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xfffff80003d19c00(da1)
g_disk_access(da1, -1, 0, 0)

Next, gpart can't show the new free space:
[root@decui-b11 ~/bsd.git/sys]# gpart show da1
=>       63  377487297  da1  MBR  (200G)
         63  377487297       - free -  (180G)
Comment 14 Dexuan Cui 2016-07-13 14:50:43 UTC
(In reply to Dexuan Cui from comment #13)

When I got the kernel messages in comment 13, dtrace showed this:

[root@decui-b11 ~]# dtrace -n 'fbt::*disk_resize:entry fbt::g_part_*resize:entry {stack(5);}'
dtrace: description 'fbt::*disk_resize:entry fbt::g_part_*resize:entry ' matched 7 probes
CPU     ID                    FUNCTION:NAME
  0  36534                disk_resize:entry
  0  13496              g_disk_resize:entry

and g_part_resize() was not called.
Comment 15 Dexuan Cui 2016-07-15 13:36:30 UTC
Hi all,
I know why g_disk_resize() can't work here when I open da1 for reading after a disk capacity change:

When /dev/da1 is opened for reading (or writing) every time, amd64_syscall() -> kern_openat() -> ... -> g_dev_open() -> g_access() -> g_disk_access() -> daopen() -> dareprobe() sends the DA_STATE_PROBE_RC16 command; if the disk capacity was changed, in g_disk_access(), we update provide's mediasize ***immediately***: pp->mediasize = dp->d_mediasize;

Later, in disk_resize(), g_disk_resize() -> g_resize_provider(): because of disk's size == pp->mediasize, g_resize_provider() doesn't post the g_resize_provider_event event at all. This is why "camcontrol reprobe da1" can't help.

BTW, opening da1 for writing can work because the provider is destroyed and re-created, as you mentioned.

I'll be proposing a patch.
Comment 16 Dexuan Cui 2016-07-15 13:38:48 UTC
Created attachment 172554 [details]
The patch can work, but it may be a little ugly...
Comment 17 Dexuan Cui 2016-07-15 13:44:08 UTC
(In reply to Dexuan Cui from comment #16)
Please review the attached patch. It may be a little ugly since I'm pretty new to GEOM code... Please suggest how we should fix the issue in a better way.
Comment 18 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-15 15:20:13 UTC
(In reply to Dexuan Cui from comment #17)
> (In reply to Dexuan Cui from comment #16)
> Please review the attached patch. It may be a little ugly since I'm pretty
> new to GEOM code... Please suggest how we should fix the issue in a better
> way.

GEOM serves not only the disks, you can break g_provider_resize() for other types of providers.
Can you test this patch instead?
Index: geom_disk.c
===================================================================
--- geom_disk.c	(revision 302840)
+++ geom_disk.c	(working copy)
@@ -126,7 +126,6 @@ g_disk_access(struct g_provider *pp, int r, int w,
 			if (error != 0)
 				return (error);
 		}
-		pp->mediasize = dp->d_mediasize;
 		pp->sectorsize = dp->d_sectorsize;
 		if (dp->d_maxsize == 0) {
 			printf("WARNING: Disk drive %s%d has no d_maxsize\n",
@@ -143,6 +142,7 @@ g_disk_access(struct g_provider *pp, int r, int w,
 		pp->stripeoffset = dp->d_stripeoffset;
 		pp->stripesize = dp->d_stripesize;
 		dp->d_flags |= DISKFLAG_OPEN;
+		g_resize_provider(pp, dp->d_mediasize);
 	} else if ((pp->acr + pp->acw + pp->ace) > 0 && (r + w + e) == 0) {
 		if (dp->d_close != NULL) {
 			error = dp->d_close(dp);
Comment 19 Dexuan Cui 2016-07-15 16:48:46 UTC
(In reply to Andrey V. Elsukov from comment #18)
Your patch is good and it works per my tests. 
Thank you very much, Andrey!

Can you please commit it to the CURRENT branch and can we consider merging it to stable/11 too? I think this small patch should be riskless.
Comment 20 Dexuan Cui 2016-07-19 01:50:44 UTC
(In reply to Dexuan Cui from comment #19)

The 11-beta3 build will begin on July 22.
I hope we can have Andrey's patch committed before the date so that it can get more tests. I think we definitely want to make sure the bug is fixed in the final 11 release.
Comment 21 commit-hook freebsd_committer freebsd_triage 2016-07-19 05:37:19 UTC
A commit references this bug:

Author: ae
Date: Tue Jul 19 05:36:21 UTC 2016
New revision: 303019
URL: https://svnweb.freebsd.org/changeset/base/303019

Log:
  Use g_resize_provider() to change the size of GEOM_DISK provider,
  when it is being opened. This should fix the possible loss of a resize
  event when disk capacity changed.

  PR:		211028
  Reported by:	Dexuan Cui <decui at microsoft dot com>
  MFC after:	3 weeks

Changes:
  head/sys/geom/geom_disk.c
Comment 22 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-19 05:38:35 UTC
I think we need a bit more time to test this, so I don't think it will be in 11.0-release.
Comment 23 Dexuan Cui 2016-07-19 06:24:28 UTC
(In reply to Andrey V. Elsukov from comment #22)
Got it. Thank you, Andrey!
Comment 24 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 04:48:43 UTC
We're seeing new panics on machines in the freebsd.org cluster that appear to be related to this.  Sometimes it boots fine, other times it panics like this:


Trying to mount root from zfs:zroot []...
GEOM_PART: da0 was automatically resized.
  Use `gpart commit da0` to save changes or `gpart undo da0` to revert them.
GEOM_PART: integrity check failed (da0, GPT)

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80740005
stack pointer           = 0x28:0xfffffe01f119db10
frame pointer           = 0x28:0xfffffe01f119db30
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (g_event)
[ thread pid 13 tid 100019 ]
Stopped at      g_part_resize+0x35:     testb   $0x8,0x48(%rbx)

When it boots, the partition info for da0 looks like this:

root@tiny.nyi:/home/peter # gpart show
=>       40  143374576  da0  GPT  (68G)
         40        216    1  freebsd-boot  (108K)
        256    8388608    2  freebsd-swap  (4.0G)
    8388864  134985752    3  freebsd-zfs  (64G)
Comment 25 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 04:52:38 UTC
In case this matters it is exploding on actual SCSI disk systems.
Comment 26 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 06:27:59 UTC
On closer examination, the GPT headers on these drives have strange things going on.  It does look like there's a 6 sector overallocation:

GEOM_PART: da1 was automatically resized.
  Use `gpart commit da1` to save changes or `gpart undo da1` to revert them.
GEOM_PART: partition 3 has end offset beyond last LBA: 143374615 > 143374610
GEOM_PART: integrity check failed (da1, GPT)

However..  I am puzzled as to why this is only sometimes detected.  As I said in #24 above, sometimes the machine boots fine without a peep.  All 6 drives in the machine in question are in this state.
Comment 27 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-24 08:23:26 UTC
(In reply to Peter Wemm from comment #26)
> On closer examination, the GPT headers on these drives have strange things
> going on.  It does look like there's a 6 sector overallocation:
> GEOM_PART: da1 was automatically resized.
>   Use `gpart commit da1` to save changes or `gpart undo da1` to revert them.
> GEOM_PART: partition 3 has end offset beyond last LBA: 143374615 > 143374610
> GEOM_PART: integrity check failed (da1, GPT)
> 
> However..  I am puzzled as to why this is only sometimes detected.  As I
> said in #24 above, sometimes the machine boots fine without a peep.  All 6
> drives in the machine in question are in this state.

My guess is the following, it looks like disk reported that its size decreased. This leads to resize and integrity check in geom_part. Since integrity check has been failed, geom is going to be withered. In the meantime disk reported again about changed size, this event handled before geom has fully destroyed, but its softc now will be NULL, so this leads to panic what you see. Probably the following patch will fix this panic, but I'm not sure that there will not another panic due to disappeared GPT.

Index: geom_subr.c
===================================================================
--- geom_subr.c	(revision 303255)
+++ geom_subr.c	(working copy)
@@ -636,7 +636,7 @@ g_resize_provider_event(void *arg, int flag)
 	
 	LIST_FOREACH_SAFE(cp, &pp->consumers, consumers, cp2) {
 		gp = cp->geom;
-		if (gp->resize != NULL)
+		if ((gp->flags & G_GEOM_WITHER) == 0 && gp->resize != NULL)
 			gp->resize(cp);
 	}
Comment 28 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 09:25:26 UTC
There's no size decreases.  The sizes match what is in the initial probe messages.

What's happening is that the disk is mis-partitioned like this:

Disk size: 143374650 sectors
sector 0: PMBR
sector 1-39: GPT
40-143374615: partition space inside gpt
143374610-143374649 backup GPT

Note the overlap, da0-da5 are all like this.

dmesg says:
da1 at mpt0 bus 0 scbus0 target 1 lun 0
da1: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da1: Serial Number A107P49025H2
da1: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da1: Command Queueing enabled
da1: 70007MB (143374650 512 byte sectors)

There is no size change.  The problem is that instead of copying the media size, you're introducing a new resize event at open time that wasn't there before - even when there's no resize.  g_part is now calling its resize code paths and trying to make sense.

GEOM_PART: partition 3 has end offset beyond last LBA: 143374615 > 143374610
GEOM_PART: integrity check failed (da1, GPT)

This matches the bad partitioning and the GPT should have been rejected the very first time around.  The problem is that it is only being detected perhaps 1 in 10 times.

Aside from the panic, the second problem is the inconsistency.  It should be attempting to repair all 6 drives, but isn't. Sometimes it doesn't detect any of the bad disks and boots successfully.  Is the event getting lost?  I think you're racing with initialization / tasting somehow and I suspect that's a bigger problem.
Comment 29 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-24 10:30:08 UTC
(In reply to Peter Wemm from comment #28)
> There is no size change.  The problem is that instead of copying the media
> size, you're introducing a new resize event at open time that wasn't there
> before - even when there's no resize.  g_part is now calling its resize code
> paths and trying to make sense.

When geom_disk object is created its size is initialized. If its size is the same, resize event will not be invoked. So, there is definitely present size change. At least sometimes disks reported different size.

> GEOM_PART: partition 3 has end offset beyond last LBA: 143374615 > 143374610
> GEOM_PART: integrity check failed (da1, GPT)
> 
> This matches the bad partitioning and the GPT should have been rejected the
> very first time around.  The problem is that it is only being detected
> perhaps 1 in 10 times.
> 
> Aside from the panic, the second problem is the inconsistency.  It should be
> attempting to repair all 6 drives, but isn't. Sometimes it doesn't detect
> any of the bad disks and boots successfully.  Is the event getting lost?  I
> think you're racing with initialization / tasting somehow and I suspect
> that's a bigger problem.

The metadata on the disk cannot be sometimes correct and sometimes not.
Comment 30 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-24 10:53:35 UTC
Created attachment 172924 [details]
Increase debugging in geom_disk

Can you apply this patch and enable bootverbose mode?
Comment 31 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 20:13:22 UTC
The first few boot attempts looked like this:
...
(da0:mpt0:0:0:0): UNMAPPED
(da1:mpt0:0:1:0): da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da0: Serial Number A107P49025GU
da0: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da0: Command Queueing enabled
da0: 70007MB (143374650 512 byte sectors)
da0: Delete methods: <NONE(*),ZERO>
ses0: Nfans 0 Npwr 0 Nslots 6 Lck 0 Ntherm 0 Nspkrs 0 Ntstats 0
UNMAPPED
(da2:mpt0:0:2:0): UNMAPPED
(da3:mpt0:0:3:0): UNMAPPED
(da4:mpt0:0:4:0): UNMAPPED
(da5:mpt0:0:5:GEOM: new disk da0 (73407820800 bytes)
GEOM: new disk da1 (0 bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GEOM: new disk da2 (0 bytes)
da3 at mpt0 bus 0 scbus0 target 3 lun 0
da3: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da3: Serial Number A107P49025H7
da3: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da3: Command Queueing enabled
da3: 70007MB (143374650 512 byte sectors)
da3: Delete methods: <NONE(*),ZERO>
da2 at mpt0 bus 0 scbus0 target 2 lun 0
da2: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da2: Serial Number A107P49025GB
da2: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da2: Command Queueing enabled
da2: 70007MB (143374650 512 byte sectors)
da2: Delete methods: <NONE(*),ZERO>
da1 at mpt0 bus 0 scbus0 target 1 lun 0
da1: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da1: Serial Number A107P49025H2
da1: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da1: Command Queueing enabled
da1: 70007MB (143374650 512 byte sectors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
da1: Delete methods: <NONE(*),ZERO>
da4 at mpt0 bus 0 scbus0 target 4 lun 0
da4: <MAXTOR ATLAS15K_73SCA DT60> Fixed Direct Access SCSI-3 device
da4: Serial Number C80666ZK
da4: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da4: Command Queueing enabled
da4: 70007MB (143374650 512 byte sectors)
da4: Delete methods: <NONE(*),ZERO>
GEOM: new disk da3 (73407820800 bytes)
GEOM: new disk da4 (73407820800 bytes)
0): UNMAPPED
pass0 at mpt0 bus 0 scbus0 target 0 lun 0
da5 at mpt0 bus 0 scbus0 target 5 lun 0
da5: <FUJITSU MAS3735NC 5B08> Fixed Direct Access SCSI-3 device
da5: Serial Number A107P49025GP
da5: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da5: Command Queueing enabled
da5: 70007MB (143374650 512 byte sectors)
da5: Delete methods: <NONE(*),ZERO>
...

Note the "GEOM new disk, 0 bytes" happens before the probe messages.  Presumably the nature of the original panic we were seeing was that was trying to resize an uninitialized disk.

It no longer detects the corrupt GPT but no longer panics either.

The machine in question is slow to boot, but I'll try a few more iterations.
Comment 32 Peter Wemm freebsd_committer freebsd_triage 2016-07-24 20:28:27 UTC
It fairly consistently prints one or two 'GEOM new disk .. 0 bytes' before the disk probe messages but has not crashed.  It is usually different disks each time.  And it has not ever detected the corrupt GPT data since the patch.
Comment 33 Peter Wemm freebsd_committer freebsd_triage 2016-07-25 00:51:40 UTC
With the patch from #30 the machine in question hasn't had any more hiccups. It is behaving in a bug-compatible way with the way it was before the commit that started the panics.
Comment 34 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-07-25 08:52:54 UTC
(In reply to Peter Wemm from comment #31)
> Note the "GEOM new disk, 0 bytes" happens before the probe messages. 
> Presumably the nature of the original panic we were seeing was that was
> trying to resize an uninitialized disk.

No, at this time there are no GEOM_PART objects yet, so it is nothing to resize.

> It no longer detects the corrupt GPT but no longer panics either.
> 
> The machine in question is slow to boot, but I'll try a few more iterations.

Thanks, I will commit the change to g_resize_provider_event() and wait additional time before MFC.
Comment 35 commit-hook freebsd_committer freebsd_triage 2016-07-25 09:12:42 UTC
A commit references this bug:

Author: ae
Date: Mon Jul 25 09:12:08 UTC 2016
New revision: 303288
URL: https://svnweb.freebsd.org/changeset/base/303288

Log:
  Do not invoke resize method if geom is being withered.

  PR:		211028
  MFC after:	2 weeks

Changes:
  head/sys/geom/geom_subr.c
Comment 36 commit-hook freebsd_committer freebsd_triage 2016-08-01 20:55:43 UTC
A commit references this bug:

Author: ae
Date: Mon Aug  1 20:54:54 UTC 2016
New revision: 303637
URL: https://svnweb.freebsd.org/changeset/base/303637

Log:
  Do not invoke resize event if initial disk size is zero. Some disks
  report the size only after first opening.  And due to the events are
  asynchronous, some consumers can receive this event too late and
  this confuses them. This partially restores previous behaviour, and
  at the same time this should fix the problem, when already opened
  provider loses resize event.

  PR:		211028
  MFC after:	3 weeks

Changes:
  head/sys/geom/geom_disk.c
Comment 37 commit-hook freebsd_committer freebsd_triage 2016-09-30 03:46:16 UTC
A commit references this bug:

Author: ae
Date: Fri Sep 30 03:45:41 UTC 2016
New revision: 306476
URL: https://svnweb.freebsd.org/changeset/base/306476

Log:
  MFC r303019:
    Use g_resize_provider() to change the size of GEOM_DISK provider,
    when it is being opened. This should fix the possible loss of a resize
    event when disk capacity changed.

  MFC r303288:
    Do not invoke resize method if geom is being withered.

  MFC r303637:
    Do not invoke resize event if initial disk size is zero. Some disks
    report the size only after first opening.  And due to the events are
    asynchronous, some consumers can receive this event too late and
    this confuses them. This partially restores previous behaviour, and
    at the same time this should fix the problem, when already opened
    provider loses resize event.

  PR:		211028

Changes:
_U  stable/11/
  stable/11/sys/geom/geom_disk.c
  stable/11/sys/geom/geom_subr.c
Comment 38 Andrey V. Elsukov freebsd_committer freebsd_triage 2016-09-30 03:49:02 UTC
Fixed in head/ and stable/11. Thanks!