Bug 164132

Summary: [xpt] "xptioctl: pass device not in the kernel" noise when hot swapping drives in 9.x
Product: Base System Reporter: Enji Cooper <ngie>
Component: kernAssignee: Warner Losh <imp>
Status: Open ---    
Severity: Affects Only Me CC: allanjude, beastie, imp, mav, p-fbsd-bugs, serzh, trasz
Priority: Normal    
Version: 13.0-STABLE   
Hardware: Any   
OS: Any   

Description Enji Cooper freebsd_committer 2012-01-15 03:40:03 UTC
Was doing hot swap drive testing with a mps-enabled chassis and SAS drives and when I pulled a SAS drive xpt(4) spit out a slew of errors from the following lines of code (from .../sys/cam/cam_xpt.c):

 766                         /*
 767                          * It is unfortunate that this is even necessary,
 768                          * but there are many, many clueless users out ther     e.
 769                          * If this is true, the user is looking for the
 770                          * passthrough driver, but doesn't have one in his
 771                          * kernel.
 772                          */
 773                         if (base_periph_found == 1) {
 774                                 printf("xptioctl: pass driver is not in the      "
 775                                        "kernel\n");
 776                                 printf("xptioctl: put \"device pass\" in "
 777                                        "your kernel config file\n");
 778                         }

Even though the pass device was in the kernel.

The check should probably be refined to check for pass devices.

How-To-Repeat: 1. Take a Dell/Supermicro machine with an mps controller and at least 2 devices; setup something other than a RAID0 config with either UFS or ZFS (the issue was reproduced with ZFSv28 and the zfsd patches) so you have enough disks to toy with without accidentally taking down the box.
2. Install 9.0-RELEASE [with the zfsd patches if desired].
3. Add some devd rules to query/manipulate the drive enclosure state via getencstat/setencstat (see .../share/examples/ses/setencstat).
Comment 1 G. Paul Ziemba 2014-12-11 06:23:04 UTC
Seeing this on 10.1-STABLE amd64 r274721. Every two seconds, these two lines are added to /var/log/messages. I don't know what's triggering it, but it blows out my logs in about six hours. Oh, I _have_ device pass in my kernel.
Comment 2 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:52:32 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Comment 3 Allan Jude freebsd_committer 2020-10-15 19:04:14 UTC
I've seen this a few times now, in a case where a disk was swapped, but somehow did not get a pass(4) allocated.

<ATA ST10000NM0016-1T SND0>        at scbus1 target 8 lun 0 (pass12,da11)
<ATA ST10000NM0016-1T SND0>        at scbus1 target 9 lun 0 (da12)
<ATA ST10000NM0016-1T SND0>        at scbus1 target 10 lun 0 (pass14,da13)
Comment 4 Sergei Masharov 2021-06-11 07:36:15 UTC
the same problem on 13.0-RELEASE-p1

the pass device have appeared, but ada device is not. the only fix is to reboot the system.
Comment 5 Sergei Masharov 2021-06-11 07:41:57 UTC
# camcontrol devlist
<GJ0250EAGSQ HPG2>                 at scbus2 target 0 lun 0 (ada0,pass0)
<WDC WD1000FYPS-01ZKB0 02.01B01>   at scbus3 target 0 lun 0 (pass1,ada1)
<Hitachi HTS542525K9A300 BBFOC3MP>  at scbus4 target 0 lun 0 (ada2,pass2)
<WDC WD10EACS-00ZJB0 01.01B01>     at scbus5 target 0 lun 0 (ada3,pass3)
# ls /dev/pass*
/dev/pass0      /dev/pass1      /dev/pass2      /dev/pass3
# ls /dev/ada?
/dev/ada0       /dev/ada2       /dev/ada3

# grep kern.crit messages
Jun 11 00:06:56 <kern.crit> proxy kernel: ada1 at ahcich1 bus 0 scbus3 target 0 lun 0
Jun 11 00:06:56 <kern.crit> proxy kernel: ada1: <WDC WD1000FYPS-01ZKB0 02.01B01> s/n WD-WCASJ1759184 detached
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Jun 11 00:07:01 <kern.crit> proxy kernel: (aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
Jun 11 00:07:05 <kern.crit> proxy kernel:
Jun 11 00:07:17 <kern.crit> proxy kernel: xptioctl: pass driver is not in the kernel
Jun 11 00:07:17 <kern.crit> proxy kernel: xptioctl: put "device pass" in your kernel config file
Comment 6 Sergei Masharov 2021-06-11 07:47:21 UTC
I think it is caused by running camcontrol devlist command just several seconds after device removal. I have one glitchy drive, that sometimes disappears and to return it to the system I run camcontrol reset/rescan. I have wrote a devd script that should do it automaticly and have faced this problem. maybe it happened because of low delay. I have tried to inrease delay in the script, but still no luck. It have happened to me several times and now I can easily reproduce this behavior.
Comment 7 Warner Losh freebsd_committer 2021-06-11 14:51:05 UTC
So is there a way that I can recreate this? This sounds bizarre and no scripts that would allow me to recreate it are attached to this bug.

Also, it would be ideal if this could be triggered using lsiutil disabling and re-enabling the phy.

Absent that, I have no clue what's going on and making any progress on this will have to wait for camcorder so I can see the sequence of events.

(bumped the version to 13.0-stable since we have recent sighting there)
Comment 8 Warner Losh freebsd_committer 2021-06-11 14:53:31 UTC
Also, this may well be 2 bugs because it happens to both ada and da devices.
Comment 9 Sergei Masharov 2021-06-11 15:17:30 UTC
I think this problem caused by the quick device detach/attach. I can now easily reproduce it on my system with SATA disks, but maybe I should enable some kind of debug while reproducing it?
Comment 10 Warner Losh freebsd_committer 2021-06-11 17:12:11 UTC
I have a fleet of thousands of systems that has a few hundred at any given time that have bouncing devices and I don't see this with them. So there must be something else needed to recreate it.