Bug 164132

Summary: [xpt] "xptioctl: pass device not in the kernel" noise when hot swapping drives in 9.x
Product: Base System Reporter: Enji Cooper <ngie>
Component: kernAssignee: Warner Losh <imp>
Status: Open ---    
Severity: Affects Only Me CC: allanjude, beastie, imp, mav, p-fbsd-bugs, serzh, sfourman, trasz
Priority: Normal    
Version: 13.0-STABLE   
Hardware: Any   
OS: Any   

Description Enji Cooper freebsd_committer freebsd_triage 2012-01-15 03:40:03 UTC
Was doing hot swap drive testing with a mps-enabled chassis and SAS drives and when I pulled a SAS drive xpt(4) spit out a slew of errors from the following lines of code (from .../sys/cam/cam_xpt.c):

 766                         /*
 767                          * It is unfortunate that this is even necessary,
 768                          * but there are many, many clueless users out ther     e.
 769                          * If this is true, the user is looking for the
 770                          * passthrough driver, but doesn't have one in his
 771                          * kernel.
 772                          */
 773                         if (base_periph_found == 1) {
 774                                 printf("xptioctl: pass driver is not in the      "
 775                                        "kernel\n");
 776                                 printf("xptioctl: put \"device pass\" in "
 777                                        "your kernel config file\n");
 778                         }

Even though the pass device was in the kernel.

The check should probably be refined to check for pass devices.

How-To-Repeat: 1. Take a Dell/Supermicro machine with an mps controller and at least 2 devices; setup something other than a RAID0 config with either UFS or ZFS (the issue was reproduced with ZFSv28 and the zfsd patches) so you have enough disks to toy with without accidentally taking down the box.
2. Install 9.0-RELEASE [with the zfsd patches if desired].
3. Add some devd rules to query/manipulate the drive enclosure state via getencstat/setencstat (see .../share/examples/ses/setencstat).
Comment 1 G. Paul Ziemba 2014-12-11 06:23:04 UTC
Seeing this on 10.1-STABLE amd64 r274721. Every two seconds, these two lines are added to /var/log/messages. I don't know what's triggering it, but it blows out my logs in about six hours. Oh, I _have_ device pass in my kernel.
Comment 2 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:52:32 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Comment 3 Allan Jude freebsd_committer freebsd_triage 2020-10-15 19:04:14 UTC
I've seen this a few times now, in a case where a disk was swapped, but somehow did not get a pass(4) allocated.

<ATA ST10000NM0016-1T SND0>        at scbus1 target 8 lun 0 (pass12,da11)
<ATA ST10000NM0016-1T SND0>        at scbus1 target 9 lun 0 (da12)
<ATA ST10000NM0016-1T SND0>        at scbus1 target 10 lun 0 (pass14,da13)
Comment 4 Sergei Masharov 2021-06-11 07:36:15 UTC
the same problem on 13.0-RELEASE-p1

the pass device have appeared, but ada device is not. the only fix is to reboot the system.
Comment 5 Sergei Masharov 2021-06-11 07:41:57 UTC
# camcontrol devlist
<GJ0250EAGSQ HPG2>                 at scbus2 target 0 lun 0 (ada0,pass0)
<WDC WD1000FYPS-01ZKB0 02.01B01>   at scbus3 target 0 lun 0 (pass1,ada1)
<Hitachi HTS542525K9A300 BBFOC3MP>  at scbus4 target 0 lun 0 (ada2,pass2)
<WDC WD10EACS-00ZJB0 01.01B01>     at scbus5 target 0 lun 0 (ada3,pass3)
# ls /dev/pass*
/dev/pass0      /dev/pass1      /dev/pass2      /dev/pass3
# ls /dev/ada?
/dev/ada0       /dev/ada2       /dev/ada3

# grep kern.crit messages
Jun 11 00:06:56 <kern.crit> proxy kernel: ada1 at ahcich1 bus 0 scbus3 target 0 lun 0
Jun 11 00:06:56 <kern.crit> proxy kernel: ada1: <WDC WD1000FYPS-01ZKB0 02.01B01> s/n WD-WCASJ1759184 detached
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
Jun 11 00:07:05 <kern.crit> proxy kernel:
Jun 11 00:07:17 <kern.crit> proxy kernel: xptioctl: pass driver is not in the kernel
Jun 11 00:07:17 <kern.crit> proxy kernel: xptioctl: put "device pass" in your kernel config file
Comment 6 Sergei Masharov 2021-06-11 07:47:21 UTC
I think it is caused by running camcontrol devlist command just several seconds after device removal. I have one glitchy drive, that sometimes disappears and to return it to the system I run camcontrol reset/rescan. I have wrote a devd script that should do it automaticly and have faced this problem. maybe it happened because of low delay. I have tried to inrease delay in the script, but still no luck. It have happened to me several times and now I can easily reproduce this behavior.
Comment 7 Warner Losh freebsd_committer freebsd_triage 2021-06-11 14:51:05 UTC
So is there a way that I can recreate this? This sounds bizarre and no scripts that would allow me to recreate it are attached to this bug.

Also, it would be ideal if this could be triggered using lsiutil disabling and re-enabling the phy.

Absent that, I have no clue what's going on and making any progress on this will have to wait for camcorder so I can see the sequence of events.

(bumped the version to 13.0-stable since we have recent sighting there)
Comment 8 Warner Losh freebsd_committer freebsd_triage 2021-06-11 14:53:31 UTC
Also, this may well be 2 bugs because it happens to both ada and da devices.
Comment 9 Sergei Masharov 2021-06-11 15:17:30 UTC
I think this problem caused by the quick device detach/attach. I can now easily reproduce it on my system with SATA disks, but maybe I should enable some kind of debug while reproducing it?
Comment 10 Warner Losh freebsd_committer freebsd_triage 2021-06-11 17:12:11 UTC
I have a fleet of thousands of systems that has a few hundred at any given time that have bouncing devices and I don't see this with them. So there must be something else needed to recreate it.
Comment 11 sfourman 2022-09-28 19:26:31 UTC
I am running a SuperMicro system with over 800 disks on 13.1-RELEASE-p1 and somehow I got here, I was just starting to investigate if we needed to replace a disk.


da786 at mps2 bus 0 scbus10 target 216 lun 0
da786: <SEAGATE ST6000NM0105 K002>  s/n ZAD0EDKD0000C715N6ND detached
da804 at mps2 bus 0 scbus10 target 204 lun 0
da804: <SEAGATE ST6000NM0105 K002> Fixed Direct Access SPC-4 SCSI device
da804: Serial Number ZAD0EDKD0000C715N6ND
da804: 600.000MB/s transfers
da804: Command Queueing enabled
da804: 5723166MB (1465130646 4096 byte sectors)
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1337 Command timeout on target 210(0x00e0) 60000 set, 60.9778032 elapsed
mps2: Sending abort to target 210 for SMID 1337
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1337 Aborting command 0xfffffe0355b0e498
mps2: Finished abort recovery for target 210
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1633 Command timeout on target 210(0x00e0) 60000 set, 60.851374433 elapsed
mps2: Sending abort to target 210 for SMID 1633
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1633 Aborting command 0xfffffe0355b27258
mps2: Finished abort recovery for target 210
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 725 Command timeout on target 210(0x00e0) 60000 set, 60.598754866 elapsed
mps2: Sending abort to target 210 for SMID 725
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 725 Aborting command 0xfffffe0355adae38
mps2: Finished abort recovery for target 210
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1216 Command timeout on target 210(0x00e0) 60000 set, 60.616971935 elapsed
mps2: Sending abort to target 210 for SMID 1216
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1216 Aborting command 0xfffffe0355b04200
mps2: Finished abort recovery for target 210
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 583 Command timeout on target 210(0x00e0) 60000 set, 60.75966146 elapsed
mps2: Sending abort to target 210 for SMID 583
        (probe202:mps2:0:210:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 583 Aborting command 0xfffffe0355acef68
mps2: Finished abort recovery for target 210
        (probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1026 Command timeout on target 210(0x00e0) 60000 set, 60.256923432 elapsed
mps2: Sending abort to target 210 for SMID 1026
        (probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1026 Aborting command 0xfffffe0355af42b0
mps2: Finished abort recovery for target 210
(probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe202:mps2:0:210:0): CAM status: Command timeout
(probe202:mps2:0:210:0): Retrying command, 3 more tries remain
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
        (probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 443 Command timeout on target 210(0x00e0) 60000 set, 60.611647969 elapsed
mps2: Sending abort to target 210 for SMID 443
        (probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 443 Aborting command 0xfffffe0355ac3348
mps2: Finished abort recovery for target 210
(probe202:mps2:0:210:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe202:mps2:0:210:0): CAM status: Command timeout
(probe202:mps2:0:210:0): Retrying command, 2 more tries remain