Created attachment 234514 [details]
sg_ses -p7 output
Getting the following panic on an HPE system with an HPE enclosure:
panic: make_dev_alias_v: bad si_name (error=22, si_name=enc@n....../type@0/slot@1/elmdesc@{"Name":"DriveBay1"}/pass4)
db_trace_self_wrapper()
vpanic()
panic()
make_dev_alias_v()
make_dev_alias_p()
make_dev_physpath_alias()
pass_add_physpath()
taskqueue_run_locked()
taskqueue_thread_loop()
fork_exit()
fork_trampoline()
Output of `sg_ses -p7` is attached (taken on illumos, as I am unable to completely boot FreeBSD on this HPE system due to iLO issues, even if I disable smartpqi driver attach).
@Yuri Can you add:
- uname -a output
- /var/run/dmesg.boot output (as attachment)
- pciconf -lv output (as attachment)
(In reply to Kubilay Kocak from comment #1) I am unable to boot FreeBSD yet on this system even if I disable smartpqi adapter (see also bug 263008) -- mounting root from virtual CD in iLO fails with error 19. I am using FreeBSD-14.0-CURRENT-amd64-20220603-326a8d3e085-255938-disc1.iso currently.
Created attachment 234605 [details]
Don't panic if the physpath cannot create a valid device alias
Those are some pretty weird element descriptors, but they're technically legal. We definitely shouldn't panic in a case like this. Can you please test the attached patch?
The change looks good to my eye... Reviewed by: imp
I was thinking towards sanitizing the string, but this is good too.
(In reply to Alexander Motin from comment #5) I think we can do both. Not panicking is the priority.
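To make the "sanitizing the string" idea concrete, here is a minimal userspace sketch in C. It is not the kernel code and not part of the attached patch. The helper names `is_safe_alias_char()` and `sanitize_alias()` are hypothetical, and the set of "safe" bytes is an assumption chosen for illustration, since the exact devfs naming rules are not spelled out in this thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the FreeBSD tree): decide whether a byte
 * may appear in an alias name. The allowed set is an assumption made
 * for this sketch only.
 */
static int
is_safe_alias_char(unsigned char c)
{
	return (isalnum(c) || c == '@' || c == '/' || c == '.' ||
	    c == '-' || c == '_');
}

/*
 * Copy src into dst (at most dstlen - 1 bytes plus the terminator),
 * replacing every byte outside the assumed safe set with '_'.
 * Returns the number of bytes that were replaced.
 */
static size_t
sanitize_alias(char *dst, size_t dstlen, const char *src)
{
	size_t i, replaced = 0;

	if (dstlen == 0)
		return (0);
	for (i = 0; src[i] != '\0' && i + 1 < dstlen; i++) {
		unsigned char c = (unsigned char)src[i];

		if (is_safe_alias_char(c)) {
			dst[i] = (char)c;
		} else {
			dst[i] = '_';
			replaced++;
		}
	}
	dst[i] = '\0';
	return (replaced);
}
```

With this assumed character set, the descriptor component from the panic message, `elmdesc@{"Name":"DriveBay1"}`, would become `elmdesc@__Name___DriveBay1__`. One trade-off of sanitizing is that two different descriptors can map to the same alias, which is one reason not panicking on failure is still needed.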
Hi, I'm sitting here in the same kind of boat... Applying this patch helps a bit, but now it crashes a bit later (when ZFS is waking up(?)):
GEOM: da66: using the secondary instead -- recovery strongly advised.
GEOM: da67: the primary GPT table is corrupt or invalid.
GEOM: da67: using the secondary instead -- recovery strongly advised.
GEOM: da68: the primary GPT table is corrupt or invalid.
GEOM: da68: using the secondary instead -- recovery strongly advised.
GEOM: da69: the primary GPT table is corrupt or invalid.
GEOM: da69: using the secondary instead -- recovery strongly advised.
GEOM: da70: the primary GPT table is corrupt or invalid.
GEOM: da70: using the secondary instead -- recovery strongly advised.
GEOM: da71: the primary GPT table is corrupt or invalid.
GEOM: da71: using the secondary instead -- recovery strongly advised.
GEOM_MIRROR: Device mirror/swap launched (2/2).
Dual Console: Serial Primary, Video Secondary
Setting hostuuid: 6c209ded-1cfb-11e6-a61e-1402ec6b98f8.
Setting hostid: 0x89351389.
Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 10
fault virtual address = 0x48
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8148ff8c
stack pointer = 0x28:0xfffffe03bf785b80
frame pointer = 0x28:0xfffffe03bf785c90
code segment = base rx0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 5 (vdev_open_6)
trap number = 12
panic: page fault
cpuid = 8
time = 1655964748
KDB: stack backtrace:
#0 0xffffffff8080bac5 at kdb_backtrace+0x65
#1 0xffffffff807bfc3f at vpanic+0x17f
#2 0xffffffff807bfab3 at panic+0x43
#3 0xffffffff80b67c35 at trap_fatal+0x385
#4 0xffffffff80b67c8f at trap_pfault+0x4f
#5 0xffffffff80b3f7f8 at calltrap+0x8
#6 0xffffffff81491f8e at vdev_attach_ok+0x6e
#7 0xffffffff81491b1e at vdev_geom_open_by_guids+0x16e
#8 0xffffffff81490cda at vdev_geom_open+0x30a
#9 0xffffffff8157bb60 at vdev_open+0xf0
#10 0xffffffff815824ee at vdev_open_child+0x1e
#11 0xffffffff81487a0f at taskq_run+0x1f
#12 0xffffffff808200a1 at taskqueue_run_locked+0x181
#13 0xffffffff808213b2 at taskqueue_thread_loop+0xc2
#14 0xffffffff8077e82e at fork_exit+0x7e
#15 0xffffffff80b4086e at fork_trampoline+0xe
This is FreeBSD-13.1 stable with the patch applied.
Here is another data point: I have two of these machines; the other is running FreeBSD-12.2 and is running fine. This (crashing) machine got upgraded (I think via FreeBSD-12.3) to FreeBSD-13.0 first, then crashed, but I never had the time to look into it (I suspected too-old firmware somewhere). Then I also upgraded it to FreeBSD-13.1: same crash. So I suspect something bad "happened" between FreeBSD-12 and FreeBSD-13, which is causing (this) trouble. And in this case it's not an HPE D3610, but an HP(E) D3600. (<HP D3600 1.72> at scbus6 target 33 lun 0 (ses6,pass82))
(In reply to sebo from comment #7)
Could you please tell me what line number it's crashing at? You can determine that by doing
# kgdb
(kgdb) l *(vdev_attach_ok+0x6e)
Here it is:
(kgdb) l *(vdev_attach_ok+0x6e)
0xffffffff8194ff8e is in vdev_attach_ok (/usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:664).
659         ZFS_LOG(1, "Unable to attach tasting instance to %s.",
660             pp->name);
661         return (NO_MATCH);
662     }
663     g_topology_unlock();
664     nlabels = vdev_geom_read_config(cp, &config);
665     g_topology_lock();
666     vdev_geom_detach(cp, B_TRUE);
667     if (nlabels == 0) {
668         ZFS_LOG(1, "Unable to read config from %s.", pp->name);
(kgdb)
Hmm, that isn't as helpful as I'd hoped. Too much code is getting inlined. Are you able to get a core dump?
So... this is messy, much messier than I thought. I have to do some more backtracking (tomorrow). Because my swap space is a little bit too small for a core dump (by a factor of 4 or so), I thought I would just reduce the optimization level and disable inlining:
# grep COPT /etc/make.conf
COPTFLAGS= -Og -pipe -fno-inline-functions
and also put this "into some other files". The system is booting up now, like there are no issues. zpool is up, the ZFS filesystem is mounted and healthy...
So two and a half things come to my mind now:
- just applying the patches should have worked (and fixed the problem), but I forgot to run make clean (which also doesn't make sense, because then I wouldn't have come a big step further with the boot process)
- we are hitting a compiler optimization bug, either a) caused by an "-O2" somewhere, or b) a function-inlining bug
- it's something else...
I'll have another look at it tomorrow, to see whether I can trigger the problem by changing some COPTFLAGS values to more "aggressive" ones. (For the moment I'm a bit confused by all this... ;-))
Yeah, that's messy. I'm going to go ahead and commit the patch for the original issue, because that seems to be solved now. If you can reproduce the ZFS panic, then please open a new bug for that one.
A commit in branch main references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=5f438dd3acba47e54e63b13bfff31a49bcc6ddea
commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-06-23 17:19:20 +0000
    ses: don't panic if disk elements have really weird descriptors
    SES allows element descriptors to contain characters like spaces and quotes that devfs does not allow to appear in device aliases. Since SES element descriptors are outside of the kernel's control, we should gracefully handle a failure to create a device physical path alias.
    PR: 264513
    Reported by: Yuri <yuri@aetern.org>
    Reviewed by: imp, mav
    Sponsored by: Axcient
    MFC after: 2 weeks
 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)
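The "gracefully handle a failure" approach from the commit message above can be illustrated with a small userspace model in C. This is only a sketch and not the committed change. The function names `alias_name_ok()`, `try_make_alias()`, and `add_physpath()` are hypothetical, and rejecting only spaces and double quotes follows the examples given in the commit message rather than the complete devfs rules.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical validity check: reject empty names and names containing
 * a space or a double quote. This is an assumption for illustration,
 * not the full set of rules devfs applies.
 */
static int
alias_name_ok(const char *name)
{
	return (name[0] != '\0' && strpbrk(name, " \"") == NULL);
}

/*
 * Return 0 on success or EINVAL for a bad name, instead of aborting.
 * A real implementation would create the alias where noted.
 */
static int
try_make_alias(const char *name)
{
	if (!alias_name_ok(name))
		return (EINVAL);
	/* Alias creation would happen here. */
	return (0);
}

/*
 * Caller side: report the failure and carry on. The device then simply
 * has no physical-path alias; nothing else depends on it in this model.
 */
static int
add_physpath(const char *physpath)
{
	int error = try_make_alias(physpath);

	if (error != 0)
		fprintf(stderr, "physpath alias skipped (error=%d): %s\n",
		    error, physpath);
	return (error);
}
```

In this model, a path containing `elmdesc@{"Name":"DriveBay1"}` makes `add_physpath()` return `EINVAL` (the same error=22 seen in the original panic message) and the caller continues, whereas a path without spaces or quotes returns 0.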
Sorry for the delay, I was looking into other issues breaking the boot on this HPE system. I can confirm that this fix (along with the smartpqi and usb quirk ones) allows booting 14-CURRENT without issues, thanks!
A commit in branch stable/13 references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=b3c8ab9ff9081748277906a348aff9d331c09092
commit b3c8ab9ff9081748277906a348aff9d331c09092
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-07-24 15:41:18 +0000
    ses: don't panic if disk elements have really weird descriptors
    SES allows element descriptors to contain characters like spaces and quotes that devfs does not allow to appear in device aliases. Since SES element descriptors are outside of the kernel's control, we should gracefully handle a failure to create a device physical path alias.
    PR: 264513
    Reported by: Yuri <yuri@aetern.org>
    Reviewed by: imp, mav
    Sponsored by: Axcient
    (cherry picked from commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea)
 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)
A commit in branch stable/12 references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=3c69525933e1ae5f0e7149ef30320bf4e64d9628
commit 3c69525933e1ae5f0e7149ef30320bf4e64d9628
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-08-20 02:51:58 +0000
    ses: don't panic if disk elements have really weird descriptors
    SES allows element descriptors to contain characters like spaces and quotes that devfs does not allow to appear in device aliases. Since SES element descriptors are outside of the kernel's control, we should gracefully handle a failure to create a device physical path alias.
    PR: 264513
    Reported by: Yuri <yuri@aetern.org>
    Reviewed by: imp, mav
    Sponsored by: Axcient
    (cherry picked from commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea)
 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)
*** Bug 266183 has been marked as a duplicate of this bug. ***