Bug 263853

Summary: geom(4): raid/md_jmicron.c: jmicron RAID taste code can panic if conf is garbage
Product: Base System Reporter: Robert Morris <rtm>
Component: kernAssignee: freebsd-geom (Nobody) <geom>
Status: Open ---    
Severity: Affects Some People CC: emaste, fs, markj, mav
Priority: --- Keywords: crash, needs-qa
Version: UnspecifiedFlags: koobs: maintainer-feedback? (markj)
koobs: mfc-stable13?
koobs: mfc-stable12?
Hardware: Any   
OS: Any   
Attachments:
Description Flags
disk image that causes jmicron taste code to panic none

Description Robert Morris 2022-05-07 20:55:59 UTC
Created attachment 233800 [details]
disk image that causes jmicron taste code to panic

During tasting, if the last sector of a newly attached drive looke
enough like a jmicron_raid_conf but meta->disks[] contains no disks,
and meta->spare[] contains one disk, g_raid_md_jmicron_new_disk() may
call g_raid_md_jmicron_start() anyway, leading to a panic.

I've attached a demo disk image, which causes this code in
g_raid_md_jmicron_new_disk() to execute with disks_present = 1 (the
disk being tasted), total_disks = zero (from meta->disks[]), and
total_spare() = one (from meta->spare[]):

          /* If we collected all needed disks - start array. */
          if (mdi->mdio_disks_present == mdi->mdio_total_disks +
              jmicron_meta_total_spare(mdi->mdio_meta))
            g_raid_md_jmicron_start(sc);

Mounting the demo disk image:

# uname -a
FreeBSD  14.0-CURRENT FreeBSD 14.0-CURRENT #219 main-n250919-29f81bc20825-dirty: Sat May  7 16:30:27 EDT 2022     rtm@xxx:/usr/obj/usr/rtm/symbsd/src/riscv.riscv64/sys/RTM  riscv
# mdconfig -f taste8a.img
GEOM_RAID: JMicron-0000000: Array JMicron-0000000 created.
GEOM_RAID: JMicron-0000000: No transformation module found for ��.
GEOM_RAID: JMicron-0000000: Volume �� state changed from STARTING to UNSUPPORTED                                                                            
panic: No disk at position 0!
cpuid = 0
time = 1651920087
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x16e
panic() at panic+0x2a
g_raid_md_jmicron_start_disk() at g_raid_md_jmicron_start_disk+0x350
g_raid_md_jmicron_start() at g_raid_md_jmicron_start+0x1c2
g_raid_md_jmicron_new_disk() at g_raid_md_jmicron_new_disk+0x110
g_raid_md_taste_jmicron() at g_raid_md_taste_jmicron+0x39a
G_RAID_MD_TASTE() at G_RAID_MD_TASTE+0x5a
g_raid_taste() at g_raid_taste+0x15c
g_new_provider_event() at g_new_provider_event+0xb8
one_event() at one_event+0x106
g_run_events() at g_run_events+0x8a
g_event_procbody() at g_event_procbody+0x56
fork_exit() at fork_exit+0x80
fork_trampoline() at fork_trampoline+0xa
KDB: enter: panic
[ thread pid 13 tid 100017 ]
Stopped at      breakpoint+0xa: c.ldsp  s0,0(sp)
db>
Comment 1 Ed Maste freebsd_committer freebsd_triage 2022-05-09 16:45:43 UTC
Not sure what to do about this - we can easily add a check for this specific case but should really review all of this for robustness. Adding mav for commentary; any sense of how commonly jmicron RAID metadata is used?
Comment 2 Alexander Motin freebsd_committer freebsd_triage 2022-05-09 17:04:17 UTC
I believe software RAIDs are generally not widely used nowadays with ZFS used more and more often.  And JMicron is obviously not the leader of the market, comparing to Intel and AMD, or even Marvell SATA RAIDs (the last work on level of firmware, but take bigger part of the SATA market).  So if somebody make a nice workaround for the specific case -- I'd take a look, otherwise I am not sure what can we do about it, I have bigger fish to fry to look deep on this.
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-15 02:23:46 UTC
^Triage: Mark may have ideas given recent src 9e9ba9c73de9
Comment 4 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-15 02:24:07 UTC
^Triage: Uh, base 9e9ba9c73de9
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2022-05-19 15:07:02 UTC
Looks like there are several problems here.  First, the tasting code assumes that disk IDs are non-zero, since jmicron_meta_total_disks() terminates its search when an array entry is zero.  disk_ids seem to be assigned using arc4random(), so most of the time they'll be non-zero. :)

jmicron_meta_find_disk() skips over zeroed entries though, so it's not consistent with total_disks().

I see another bug in the jmicron taste routine, in this line:

  spare = (disk_pos == -2) ? 1 : 0;

It's impossible for disk_pos to equal -2 here, I suspect it should be -3.

I'm not sure how best to fix all of this.  Something like this fixes the test case, but I'm sure it's incomplete, and could potentially break existing setups, though I suspect that's unlikely:

diff --git a/sys/geom/raid/md_jmicron.c b/sys/geom/raid/md_jmicron.c
index 939e05f78017..faa7b1cbb40e 100644
--- a/sys/geom/raid/md_jmicron.c
+++ b/sys/geom/raid/md_jmicron.c
@@ -249,6 +249,8 @@ jmicron_meta_find_disk(struct jmicron_raid_conf *meta, uint32_t id)
        int pos;
 
        id &= JMICRON_DISK_MASK;
+       if (id == 0)
+               return (-1);
        for (pos = 0; pos < JMICRON_MAX_DISKS; pos++) {
                if ((meta->disks[pos] & JMICRON_DISK_MASK) == id)
                        return (pos);