Hey, I am encountering a (possible) bug in FreeBSD that we have been trying to solve for the past few years with no success. When a disk fails and a new disk is inserted into our server, it is not detected correctly by the system, leaving the following in geom:

```
Geom name: da8
Providers:
1. Name: da8
   Mediasize: 8001563222016 (7.3T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   wither: (null)
```

Weirdly enough, when we reboot the system the disk is fine and is fully detected as normal. While the disk is in this state we are left with a system that can't do anything with it: we can't turn on the disk indicator through the SAS controller, we can't add it to ZFS, and trying to reset it with camcontrol gives no results either. We do see the following pop up when we try to add the disk to ZFS anyway:

```
kernel: g_access(958): provider da8 has error 6 set
```

To me it almost looks like the system somehow holds on to the old disk and therefore can't fully detect the new one until a reboot flushes everything. From what I remember we started to see this bug after we upgraded from 10.3 to 11.2, and it has continued on 12.1 and now 12.2.
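For completeness, this is roughly what we poke at while the disk is stuck (a sketch only; da8 is the device from the output above, and none of it brings the disk back for us without a reboot):

```
# GEOM still lists a provider for da8, but every open fails;
# error 6 is ENXIO ("Device not configured").
geom disk list da8

# CAM still shows a periph for the old disk on the same target:
camcontrol devlist -v | grep da8

# Rescanning the buses does not clear the stale provider for us,
# and neither does resetting the disk's bus:target:lun with camcontrol reset:
camcontrol rescan all
```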
Some info I forgot: the SAS controller we use:

```
r0@pci0:1:0:0: class=0x010700 card=0x30e01000 chip=0x00971000 rev=0x02 hdr=0x00
    vendor   = 'Broadcom / LSI'
    device   = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class    = mass storage
    subclass = SAS
```

Upgrading the controller has made no difference in this issue so far.
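For anyone comparing setups, the output above is pciconf output; the firmware revision the driver reports at attach can be checked roughly like this (assuming the SAS3008 attaches via the mpr(4) driver, as it does here):

```
# Full pciconf entry for the HBA:
pciconf -lv | grep -B2 -A4 SAS3008

# Firmware and driver revision as printed by mpr(4) at attach time:
dmesg | grep '^mpr0:'
```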
Hello world :-) I have just encountered this problem using an external USB DVD recorder. Here is the dmesg output:

```
ugen0.13: <Hitachi-LG Data Storage Inc Portable Super Multi Drive> at usbus0
umass0 on uhub1
umass0: <6238--Storage> on usbus0
umass0: 8070i (ATAPI) over Bulk-Only; quirks = 0x0100
umass0:2:0: Attached to scbus2
cd0 at umass-sim0 bus 0 scbus2 target 0 lun 0
cd0: <HL-DT-ST DVDRAM GP57EB40 PF00> Removable CD-ROM SCSI device
cd0: 40.000MB/s transfers
cd0: 0MB (1 0 byte sectors)
cd0: quirks=0x10<10_BYTE_ONLY>
GEOM_PART: integrity check failed (cd0, MBR)
GEOM_PART: integrity check failed (iso9660/Kali%20Live, MBR)
ugen0.13: <Hitachi-LG Data Storage Inc Portable Super Multi Drive> at usbus0 (disconnected)
umass0: at uhub1, port 1, addr 16 (disconnected)
cd0 at umass-sim0 bus 0 scbus2 target 0 lun 0
cd0: <HL-DT-ST DVDRAM GP57EB40 PF00> detached
g_access(961): provider cd0 has error 6 set
g_access(961): provider cd0 has error 6 set
g_access(961): provider cd0 has error 6 set
```

It may be caused by GEOM, but usbconfig also hangs and no new devices can be attached. Any hints welcome :-)

Tomek
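In case it is useful, the obvious recovery attempts (a sketch, using the ugen0.13 / scbus2 addresses from the dmesg above) get stuck as well, since usbconfig itself hangs:

```
# Listing the USB devices already hangs at this point:
usbconfig

# Re-enumerating the recorder would normally be the next step:
usbconfig -d ugen0.13 reset

# ...as would rescanning the SCSI bus the cd0 periph lived on:
camcontrol rescan 2
```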
Any help from the devs on this matter would be appreciated.
Any update on this bug? I just experienced the exact same issue. I have 8 disks (all SATA) connected to a FreeBSD 12.3 system. The ZFS pool is set up as a raidz3. I got in today and found one drive was "REMOVED":

```
# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                     STATE     READ WRITE CKSUM
        pool                     DEGRADED     0     0     0
          raidz3-0               DEGRADED     0     0     0
            ada0                 ONLINE       0     0     0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0
            ada3                 ONLINE       0     0     0
            ada4                 ONLINE       0     0     0
            8936423309855741075  REMOVED      0     0     0  was /dev/ada5
            ada6                 ONLINE       0     0     0
            ada7                 ONLINE       0     0     0
```

I assumed that the drive had died and pulled it. I put a new drive in place and attempted to replace it:

```
# zpool replace pool 8936423309855741075 ada5
cannot replace 8936423309855741075 with ada5: no such pool or dataset
```

It seems that the old drive is somehow still remembered by the system. I dug through the logs and found the following occurring when the new drive is inserted into the system:

```
Jun 13 13:03:15 server kernel: cam_periph_alloc: attempt to re-allocate valid device ada5 rejected flags 0x118 refcount 1
Jun 13 13:03:15 server kernel: adaasync: Unable to attach to new device due to status 0x6
Jun 13 13:04:23 server kernel: g_access(961): provider ada5 has error 6 set
```

I did a reboot without the new drive in place. On reboot the output for the pool looked somewhat different:

```
# zpool status pool
  pool: pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0 days 02:32:26 with 0 errors on Sat Jun 11 05:32:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        pool                      DEGRADED     0     0     0
          raidz3-0                DEGRADED     0     0     0
            ada0                  ONLINE       0     0     0
            ada1                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            8936423309855741075   FAULTED      0     0     0  was /dev/ada5
            ada5                  ONLINE       0     0     0
            diskid/DISK-Z1W4HPXX  ONLINE       0     0     0

errors: No known data errors
```

I assumed this was because there was one less drive attached and the system assigned new adaX names to each drive. When I inserted the new drive at this point, it appeared as ada9, so I re-issued the zpool replace command, now with ada9. It did take about 3 minutes before the zpool replace command returned (which really concerned me), but the server has quite a few users accessing the filesystem, so I figured that as long as the new drive was resilvering I would be fine...

I do a weekly scrub of the pool and I believe the error crept up after the scrub. At 11am today the logs showed the following:

```
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jun 13 11:29:15 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0 more tries remain
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: Timeout on slot 5 port 0
Jun 13 11:30:35 172.16.20.66 kernel: ahcich5: is 00000000 cs 00000060 ss 00000000 rs 00000060 tfd c0 serr 00000000 cmd 0004c517
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jun 13 11:30:35 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Retrying command, 0 more tries remain
Jun 13 11:31:08 172.16.20.66 kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
```

At 11:39 I believe the following log entries are of note:

```
Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): CAM status: Unconditionally Re-queue Request
Jun 13 11:39:45 172.16.20.66 kernel: (ada5:ahcich5:0:0:0): Error 5, Periph was invalidated
Jun 13 11:39:45 172.16.20.66 ZFS[92964]: vdev state changed, pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:45 172.16.20.66 ZFS[92966]: vdev is removed, pool_guid=$5100646062824685774 vdev_guid=$8936423309855741075
Jun 13 11:39:46 172.16.20.66 kernel: g_access(961): provider ada5 has error 6 set
Jun 13 11:39:47 reactor syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 syslogd: last message repeated 1 times
Jun 13 11:39:47 172.16.20.66 kernel: ZFS WARNING: Unable to attach to ada5.
```

Any idea what the issue was?
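For reference, the replacement after the reboot boils down to something like this (ada9 is the name the new drive got, as described above; the labelclear line is only a precaution I have seen suggested for replacement disks that carry stale metadata, and it destroys whatever is on that disk):

```
# Optional: wipe any old ZFS label from the replacement disk (destroys its contents!):
zpool labelclear -f /dev/ada9

# Replace the faulted vdev, referenced by its GUID, with the new disk:
zpool replace pool 8936423309855741075 ada9

# Watch the resilver:
zpool status pool
```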