264513 – cam: panic: panic: make_dev_alias_v: bad si_name (error=22, on HPE JBOD

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	Alan Somers

URL:
Keywords:	crash, needs-qa

Duplicates (1):	266183
Depends on:
Blocks:

Reported:	2022-06-07 03:40 UTC by Yuri
Modified:	2022-10-24 09:42 UTC
CC List:	5 users

See Also:

Flags:	koobs: maintainer-feedback? (mav) asomers: mfc-stable13+

Attachments
sg_ses -p7 output (5.78 KB, text/plain) 2022-06-07 03:40 UTC, Yuri
Don't panic if the physpath cannot create a valid device alias (1.88 KB, patch) 2022-06-10 22:50 UTC, Alan Somers

Description Yuri 2022-06-07 03:40:51 UTC

Created attachment 234514
sg_ses -p7 output

I am getting the following panic on an HPE system with an HPE enclosure:

panic: make_dev_alias_v: bad si_name (error=22,
si_name=enc@n....../type@0/slot@1/elmdesc@{"Name":"DriveBay1"}/pass4)

db_trace_self_wrapper()
vpanic()
panic()
make_dev_alias_v()
make_dev_alias_p()
make_dev_physpath_alias()
pass_add_physpath()
taskqueue_run_locked()
taskqueue_thread_loop()
fork_exit()
fork_trampoline()

The output of `sg_ses -p7` is attached. It was taken on illumos because I cannot fully boot FreeBSD on this HPE system due to iLO issues, even with the smartpqi driver attach disabled.
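
For readers unfamiliar with the failure: error=22 is EINVAL, and the rejected name contains double quotes taken from the enclosure's element descriptor. The following is a minimal userland sketch of that kind of name check, assuming the rule is "no whitespace and no double quotes". The function name check_alias_name is hypothetical, and this is not the kernel's actual code.

```c
#include <ctype.h>
#include <errno.h>

/*
 * Simplified userland model of a devfs-style name check (hypothetical
 * helper, not the kernel's code): reject empty names and names that
 * contain whitespace or double quotes.
 */
int
check_alias_name(const char *name)
{
	if (name[0] == '\0')
		return (EINVAL);
	for (const char *p = name; *p != '\0'; p++) {
		if (isspace((unsigned char)*p) || *p == '"')
			return (EINVAL);
	}
	return (0);
}
```

Under this model, a path component such as elmdesc@{"Name":"DriveBay1"} is rejected with EINVAL, while a component without quotes or spaces is accepted.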

Comment 1 Kubilay Kocak freebsd_committer

2022-06-07 23:06:14 UTC

@Yuri Can you add:

- uname -a output
- /var/run/dmesg.boot output (as attachment)
- pciconf -lv output (as attachment)

Comment 2 Yuri 2022-06-08 04:59:11 UTC

(In reply to Kubilay Kocak from comment #1)
I cannot boot FreeBSD on this system yet, even with the smartpqi adapter disabled (see also bug 263008): mounting root from the virtual CD in iLO fails with error 19.

I am using FreeBSD-14.0-CURRENT-amd64-20220603-326a8d3e085-255938-disc1.iso currently.

Comment 3 Alan Somers freebsd_committer

2022-06-10 22:50:26 UTC

Created attachment 234605
Don't panic if the physpath cannot create a valid device alias

Those are some pretty weird element descriptors, but they're technically legal.  We definitely shouldn't panic in a case like this.  Can you please test the attached patch?

Comment 4 Warner Losh freebsd_committer

2022-06-11 03:50:24 UTC

The change looks good to my eye...

Reviewed by: imp

Comment 5 Alexander Motin freebsd_committer

2022-06-11 13:20:34 UTC

I was thinking towards sanitizing the string, but this is good too.

Comment 6 Warner Losh freebsd_committer

2022-06-11 15:57:02 UTC

(In reply to Alexander Motin from comment #5)
I think we can do both. Not panicking is the priority.
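
The sanitizing alternative mentioned in comment 5 could look roughly like the sketch below. It is a userland illustration only: sanitize_alias_name is a hypothetical helper, and replacing offending characters with '_' is an assumption, not something proposed in this thread.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical sanitizer: copy src into dst, replacing whitespace and
 * double quotes with '_'.  The result is always NUL-terminated when
 * dstlen > 0 and is truncated if dst is too small.
 * Returns the number of characters replaced.
 */
size_t
sanitize_alias_name(char *dst, size_t dstlen, const char *src)
{
	size_t i, replaced = 0;

	if (dstlen == 0)
		return (0);
	for (i = 0; i + 1 < dstlen && src[i] != '\0'; i++) {
		if (isspace((unsigned char)src[i]) || src[i] == '"') {
			dst[i] = '_';
			replaced++;
		} else {
			dst[i] = src[i];
		}
	}
	dst[i] = '\0';
	return (replaced);
}
```

One trade-off of this approach is that two distinct descriptors can map to the same sanitized name, so a caller would still need to handle a failed alias creation.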

Comment 7 sebo 2022-06-23 05:06:36 UTC

Hi,

I'm in the same boat. Applying this patch helps, but the system now crashes a bit later (possibly when ZFS starts up):

GEOM: da66: using the secondary instead -- recovery strongly advised.
GEOM: da67: the primary GPT table is corrupt or invalid.
GEOM: da67: using the secondary instead -- recovery strongly advised.
GEOM: da68: the primary GPT table is corrupt or invalid.
GEOM: da68: using the secondary instead -- recovery strongly advised.
GEOM: da69: the primary GPT table is corrupt or invalid.
GEOM: da69: using the secondary instead -- recovery strongly advised.
GEOM: da70: the primary GPT table is corrupt or invalid.
GEOM: da70: using the secondary instead -- recovery strongly advised.
GEOM: da71: the primary GPT table is corrupt or invalid.
GEOM: da71: using the secondary instead -- recovery strongly advised.
GEOM_MIRROR: Device mirror/swap launched (2/2).
Dual Console: Serial Primary, Video Secondary
Setting hostuuid: 6c209ded-1cfb-11e6-a61e-1402ec6b98f8.
Setting hostid: 0x89351389.


Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 10
fault virtual address	= 0x48
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8148ff8c
stack pointer	        = 0x28:0xfffffe03bf785b80
frame pointer	        = 0x28:0xfffffe03bf785c90
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 5 (vdev_open_6)
trap number		= 12
panic: page fault
cpuid = 8
time = 1655964748
KDB: stack backtrace:
#0 0xffffffff8080bac5 at kdb_backtrace+0x65
#1 0xffffffff807bfc3f at vpanic+0x17f
#2 0xffffffff807bfab3 at panic+0x43
#3 0xffffffff80b67c35 at trap_fatal+0x385
#4 0xffffffff80b67c8f at trap_pfault+0x4f
#5 0xffffffff80b3f7f8 at calltrap+0x8
#6 0xffffffff81491f8e at vdev_attach_ok+0x6e
#7 0xffffffff81491b1e at vdev_geom_open_by_guids+0x16e
#8 0xffffffff81490cda at vdev_geom_open+0x30a
#9 0xffffffff8157bb60 at vdev_open+0xf0
#10 0xffffffff815824ee at vdev_open_child+0x1e
#11 0xffffffff81487a0f at taskq_run+0x1f
#12 0xffffffff808200a1 at taskqueue_run_locked+0x181
#13 0xffffffff808213b2 at taskqueue_thread_loop+0xc2
#14 0xffffffff8077e82e at fork_exit+0x7e
#15 0xffffffff80b4086e at fork_trampoline+0xe

This is FreeBSD-13.1 stable with the applied patch.
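
A note on reading the trap above: a fault virtual address as small as 0x48 is commonly what a member access through a NULL structure pointer looks like, since the faulting address is then just the member's offset. The sketch below illustrates the arithmetic with a made-up structure. Which structure and member are actually involved in this crash is not established in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, chosen only so that `field` lands at offset 0x48. */
struct example {
	uint8_t  pad[0x48];
	uint64_t field;
};

/*
 * Compute the address that an access to base->field would touch,
 * without dereferencing anything.
 */
uintptr_t
field_address(uintptr_t base)
{
	return (base + offsetof(struct example, field));
}
```

With a base of 0 (a NULL pointer), field_address returns 0x48, matching the shape of the reported fault address.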

Comment 8 sebo 2022-06-23 12:07:07 UTC

Here is another data point:

I have two of these machines. The other one runs FreeBSD-12.2 without problems.
This (crashing) machine was first upgraded (I think via FreeBSD-12.3) to FreeBSD-13.0, after which it crashed, but I never had the time to look into it (I suspected outdated firmware somewhere).
I then upgraded it to FreeBSD-13.1, with the same crash.

So I suspect something changed between FreeBSD-12 and FreeBSD-13 that is causing this trouble.

Also, in this case the enclosure is not an HPE D3610 but an HP(E) D3600.

(<HP D3600 1.72>                    at scbus6 target 33 lun 0 (ses6,pass82))

Comment 9 Alan Somers freebsd_committer

2022-06-23 13:20:11 UTC

(In reply to sebo from comment #7)
Could you please tell me what line number it's crashing at?  You can determine that by running:
# kgdb
(kgdb) l *(vdev_attach_ok+0x6e)

Comment 10 sebo 2022-06-23 13:36:45 UTC

Here it is:

(kgdb) l *(vdev_attach_ok+0x6e)
0xffffffff8194ff8e is in vdev_attach_ok (/usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:664).
659			ZFS_LOG(1, "Unable to attach tasting instance to %s.",
660			    pp->name);
661			return (NO_MATCH);
662		}
663		g_topology_unlock();
664		nlabels = vdev_geom_read_config(cp, &config);
665		g_topology_lock();
666		vdev_geom_detach(cp, B_TRUE);
667		if (nlabels == 0) {
668			ZFS_LOG(1, "Unable to read config from %s.", pp->name);
(kgdb)

Comment 11 Alan Somers freebsd_committer

2022-06-23 15:12:34 UTC

Hmm, that isn't as helpful as I'd hoped.  Too much code is getting inlined.  Are you able to get a core dump?

Comment 12 sebo 2022-06-23 16:23:22 UTC

So, this is messy, much messier than I thought.
I have to do some more backtracking (tomorrow).

My swap space is too small for a core dump (by a factor of 4 or so), so I decided to reduce the optimization level and disable inlining instead:

# grep COPT /etc/make.conf
COPTFLAGS= -Og -pipe  -fno-inline-functions

I also put this "into some other files".

The system now boots as if there were no issues.
The zpool is up, and the ZFS filesystem is mounted and healthy.

So two and a half things come to mind:

 - Just applying the patch should have worked (and fixed the problem), but I forgot to run make clean (which doesn't make sense either, because then the boot process wouldn't have gotten so much further)

 - We are hitting a compiler optimization bug
     a) caused by a "-O2" somewhere
     b) a function-inlining bug

 - It's something else

I'll take another look tomorrow to see whether I can trigger
the problem by changing COPTFLAGS to more "aggressive" values.
(For the moment I'm a bit confused by all this...;-))

Comment 13 Alan Somers freebsd_committer

2022-06-23 17:18:42 UTC

Yeah, that's messy.  I'm going to go ahead and commit the patch for the original issue, because that seems to be solved now.  If you can reproduce the ZFS panic, then please open a new bug for that one.

Comment 14 commit-hook freebsd_committer

2022-06-23 17:21:38 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5f438dd3acba47e54e63b13bfff31a49bcc6ddea

commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-06-23 17:19:20 +0000

    ses: don't panic if disk elements have really weird descriptors

    SES allows element descriptors to contain characters like spaces and
    quotes that devfs does not allow to appear in device aliases.  Since SES
    element descriptors are outside of the kernel's control, we should
    gracefully handle a failure to create a device physical path alias.

    PR:             264513
    Reported by:    Yuri <yuri@aetern.org>
    Reviewed by:    imp, mav
    Sponsored by:   Axcient
    MFC after:      2 weeks

 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)
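
As a rough illustration of the "gracefully handle a failure" idea in the commit message, the userland sketch below reports an alias-creation error and carries on instead of aborting. All names in it (try_create_alias, add_physpath_alias) are hypothetical stand-ins, not the functions touched by the commit. For kernel code, make_dev(9) documents a MAKEDEV_CHECKNAME flag that makes device creation return an error for an invalid name rather than panic. Whether the committed change relies on exactly that flag is not stated in this thread.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for alias creation: fails with EINVAL on bad names. */
int
try_create_alias(const char *name)
{
	for (const char *p = name; *p != '\0'; p++) {
		if (isspace((unsigned char)*p) || *p == '"')
			return (EINVAL);
	}
	return (0);
}

/*
 * Hypothetical caller: the device stays usable under its primary name
 * even when the physical-path alias cannot be created.
 * Returns 1 if the alias was created, 0 if it was skipped.
 */
int
add_physpath_alias(const char *devname, const char *physpath)
{
	int error;

	error = try_create_alias(physpath);
	if (error != 0) {
		fprintf(stderr, "%s: skipping physpath alias (error=%d)\n",
		    devname, error);
		return (0);
	}
	return (1);
}
```

The design point is that the alias is a convenience: losing it costs a physical-path name under /dev, whereas panicking costs the whole boot.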

Comment 15 Yuri 2022-06-26 09:39:10 UTC

Sorry for the delay; I was looking into other issues breaking the boot on this HPE system.  I can confirm that this fix (along with the smartpqi and USB quirk fixes) allows 14-CURRENT to boot without issues, thanks!

Comment 16 commit-hook freebsd_committer

2022-07-24 15:41:50 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b3c8ab9ff9081748277906a348aff9d331c09092

commit b3c8ab9ff9081748277906a348aff9d331c09092
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-07-24 15:41:18 +0000

    ses: don't panic if disk elements have really weird descriptors

    SES allows element descriptors to contain characters like spaces and
    quotes that devfs does not allow to appear in device aliases.  Since SES
    element descriptors are outside of the kernel's control, we should
    gracefully handle a failure to create a device physical path alias.

    PR:             264513
    Reported by:    Yuri <yuri@aetern.org>
    Reviewed by:    imp, mav
    Sponsored by:   Axcient

    (cherry picked from commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea)

 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

Comment 17 commit-hook freebsd_committer

2022-08-20 02:57:42 UTC

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3c69525933e1ae5f0e7149ef30320bf4e64d9628

commit 3c69525933e1ae5f0e7149ef30320bf4e64d9628
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2022-06-10 22:44:59 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-08-20 02:51:58 +0000

    ses: don't panic if disk elements have really weird descriptors

    SES allows element descriptors to contain characters like spaces and
    quotes that devfs does not allow to appear in device aliases.  Since SES
    element descriptors are outside of the kernel's control, we should
    gracefully handle a failure to create a device physical path alias.

    PR:             264513
    Reported by:    Yuri <yuri@aetern.org>
    Reviewed by:    imp, mav
    Sponsored by:   Axcient

    (cherry picked from commit 5f438dd3acba47e54e63b13bfff31a49bcc6ddea)

 sys/cam/scsi/scsi_pass.c | 5 +++--
 sys/geom/geom_dev.c      | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

Comment 18 Palle Girgensohn freebsd_committer

2022-10-24 09:42:46 UTC

*** Bug 266183 has been marked as a duplicate of this bug. ***