Bug 253954

Summary: kernel: g_access(958): provider da8 has error 6 set
Product: Base System
Component: kern
Version: 12.1-STABLE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Reporter: Willem Vermeylen <wvermeylen>
Assignee: freebsd-bugs (Nobody) <bugs>
CC: farrokhi, hps, jnaughto, tomek

Description Willem Vermeylen 2021-03-02 11:52:22 UTC
Hey,

I am encountering a (possible) bug in FreeBSD that we have been trying to solve for the past few years without success.

When a disk fails and a new disk is inserted into our server, it is not detected correctly by the system, leaving the following state in geom:

```
Geom name: da8
Providers:
1. Name: da8
   Mediasize: 8001563222016 (7.3T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   wither: (null)
```

Oddly enough, when we reboot the system the disk is fine and is fully detected as normal.

While the disk is in this state, we are left with a system that can't do anything with it. We can't turn on the disk indicator through the SAS controller, we can't add it to ZFS, and trying to reset it with camcontrol gives no results either.
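
For reference, a sketch of the kind of commands we try against the stuck disk (da8 is the real device, but the exact invocations here are only illustrative):

```
# What CAM currently sees
camcontrol devlist
# Force a rescan of all buses
camcontrol rescan all
# Try to light the locate LED through the enclosure (SES)
sesutil locate da8 on
```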

We do see the following message when we try to add the disk to ZFS anyway:

```
kernel: g_access(958): provider da8 has error 6 set
```

To me it almost looks like the system somehow holds on to the old disk and therefore can't fully detect the new one until you reboot and thereby flush everything.
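
For what it's worth, error 6 on FreeBSD is ENXIO ("Device not configured"), which can be confirmed against the errno headers:

```
# error 6 == ENXIO ("Device not configured")
grep -w ENXIO /usr/include/sys/errno.h
```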

From what I remember, we started seeing this bug after we upgraded from 10.3 to 11.2, and it has continued on 12.1 and now 12.2.
Comment 1 Willem Vermeylen 2021-03-02 12:04:18 UTC
Some info I forgot:

The SAS controller we use:

```
r0@pci0:1:0:0:	class=0x010700 card=0x30e01000 chip=0x00971000 rev=0x02 hdr=0x00
    vendor     = 'Broadcom / LSI'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
```
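
The listing above looks like pciconf -lv output; for reference, something like the following pulls out just the storage controller entry (the grep pattern is only illustrative):

```
pciconf -lv | grep -B4 -i 'subclass.*SAS'
```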

Upgrading the controller has made no difference for this issue so far.
Comment 2 Tomasz "CeDeROM" CEDRO 2021-03-24 16:32:01 UTC
Hello world :-)

I have just encountered this problem using an external USB DVD recorder. Here is the dmesg output:

ugen0.13: <Hitachi-LG Data Storage Inc Portable Super Multi Drive> at usbus0
umass0 on uhub1
umass0: <6238--Storage> on usbus0
umass0:  8070i (ATAPI) over Bulk-Only; quirks = 0x0100
umass0:2:0: Attached to scbus2
cd0 at umass-sim0 bus 0 scbus2 target 0 lun 0
cd0: <HL-DT-ST DVDRAM GP57EB40 PF00> Removable CD-ROM SCSI device
cd0: 40.000MB/s transfers
cd0: 0MB (1 0 byte sectors)
cd0: quirks=0x10<10_BYTE_ONLY>
GEOM_PART: integrity check failed (cd0, MBR)
GEOM_PART: integrity check failed (iso9660/Kali%20Live, MBR)
ugen0.13: <Hitachi-LG Data Storage Inc Portable Super Multi Drive> at usbus0 (disconnected)
umass0: at uhub1, port 1, addr 16 (disconnected)
cd0 at umass-sim0 bus 0 scbus2 target 0 lun 0
cd0: <HL-DT-ST DVDRAM GP57EB40 PF00>  detached
g_access(961): provider cd0 has error 6 set
g_access(961): provider cd0 has error 6 set
g_access(961): provider cd0 has error 6 set


It may be caused by GEOM, but usbconfig also hangs and no new devices can be attached.
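
For reference, this is the sort of usbconfig invocation I mean (the ugen address is the one from the dmesg above; illustrative only):

usbconfig -d ugen0.13 dump_device_desc
usbconfig -d ugen0.13 reset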

Any hints welcome :-)
Tomek
Comment 3 Willem Vermeylen 2021-05-04 08:14:05 UTC
Any help from the devs on this matter would be appreciated.
Comment 4 jnaughto 2022-06-13 20:20:06 UTC
Any update on this bug? I just experienced the exact same issue. I have 8 disks (all SATA) connected to a FreeBSD 12.3 system. The ZFS pool is set up as a raidz3. I got in today and found one drive was "REMOVED":

# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                     STATE     READ WRITE CKSUM
        pool                     DEGRADED     0     0     0
          raidz3-0               DEGRADED     0     0     0
            ada0                 ONLINE       0     0     0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0
            ada3                 ONLINE       0     0     0
            ada4                 ONLINE       0     0     0
            8936423309855741075  REMOVED      0     0     0  was /dev/ada5
            ada6                 ONLINE       0     0     0
            ada7                 ONLINE       0     0     0

I assumed that the drive had died, so I pulled it. I put a new drive in place and attempted to replace it:

# zpool replace pool 8936423309855741075 ada5
cannot replace 8936423309855741075 with ada5: no such pool or dataset


It seems that the old drive is somehow still remembered by the system. I dug through the logs and found the following occurring when the new drive is inserted into the system:

Jun 13 13:03:15 server kernel: cam_periph_alloc: attempt to re-allocate valid device ada5 rejected flags 0x118 refcount 1
Jun 13 13:03:15 server kernel: adaasync: Unable to attach to new device due to status 0x6
Jun 13 13:04:23 server kernel: g_access(961): provider ada5 has error 6 set

I did a reboot without the new drive in place. After the reboot the pool status looked somewhat different:

# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        pool                      DEGRADED     0     0     0
          raidz3-0                DEGRADED     0     0     0
            ada0                  ONLINE       0     0     0
            ada1                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            8936423309855741075   FAULTED      0     0     0  was /dev/ada5
            ada5                  ONLINE       0     0     0
            diskid/DISK-Z1W4HPXX  ONLINE       0     0     0

errors: No known data errors

I assumed this was because there was one less drive attached and the system assigned new adaX values to each drive. At this point, when I inserted the new drive, it appeared as ada9, so I re-issued the zpool replace command, this time with ada9. It did take about 3 minutes before the zpool replace command responded (which really concerned me). The server has quite a few users accessing the filesystem, though, so I figured that as long as the new drive was resilvering I would be fine...
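
The replace command I re-issued was, roughly:

# zpool replace pool 8936423309855741075 ada9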

I do a weekly scrub of the pool and I believe the error crept up after the scrub. At 11am today the logs showed the following:


Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0 more tries remain
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: Timeout on slot 5 port 0
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: is 00000000 cs 00000060 ss 00000000 rs 00000060 tfd c0 serr 00000000 cmd 0004c517
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0 more tries remain
Jun 13 11:31:08 172.16.20.66 kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)

At 11:39 I believe the following log entries are of note:

Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Unconditionally Re-queue Request
Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Error 5, Periph was invalidated
Jun 13 11:39:45 172.16.20.66 ZFS[92964]: vdev state changed, pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:45 172.16.20.66 ZFS[92966]: vdev is removed, pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:46 172.16.20.66 kernel: g_access(961): provider ada5 has error 6 set
Jun 13 11:39:47 reactor syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 kernel: ZFS WARNING: Unable to attach to ada5.

Any idea what the issue was?