Created attachment 190466 [details] output of camcontrol devlist before USB drive is inserted Beginning with 12.0-CURRENT VM image: FreeBSD-12.0-CURRENT-amd64-20180118-r328126.vmdk.xz and continuing with 12.0-CURRENT VM images: FreeBSD-12.0-CURRENT-amd64-20180125-r328383.vmdk.xz FreeBSD-12.0-CURRENT-amd64-20180131-r328637.vmdk.xz FreeBSD-12.0-CURRENT-amd64-20180208-r329009.vmdk.xz when a USB flash drive inserted into a USB 3.0 port is accessed (in this case via gpart show), the console hangs for 10-12 minutes then emits the error messages seen in the attachment. The UFS filesystem on the USB flash drive cannot be mounted. If the USB flash drive is connected via the attached USB 2.0 controller, everything is good. This problem is not manifested in any 10.4-STABLE or 11.1-STABLE VM images. The host system is CentOS EL7 7.1708. VirtualBox version is 5.2.6. The USB 3.0 controller uses a VIA chipset. System is for test purposes only, so it is easy to try anything that might help resolve this problem. Thanks.
Created attachment 190467 [details] output of camcontrol devlist after USB drive is inserted
Created attachment 190468 [details] Output of time gpart show da0
Created attachment 190469 [details] Output of camcontrol devlist after failure
freebsd-stable mailing list also has this, indicating an MFC has now caused this problem in -STABLE: https://lists.freebsd.org/pipermail/freebsd-stable/2018-February/088393.html "VM image for 11.1-STABLE began exhibiting same dislike for USB 3.0 ports as problem reported in PR 225794 for 12.0-CURRENT. Prior to FreeBSD-11.1-STABLE-amd64-20180215-r329320.vmdk.xz no such problem was observed." CAM status 0x44 refers to done_ccb->ccb_h.status per sys/cam/scsi/scsi_da.c, function dadone(). done_ccb is declared as "union ccb *done_ccb". Field ccb_h is likely declared as "struct ccb_hdr ccb_h" (this is commonplace). Field status is therefore "u_int32_t status" per sys/cam/scsi/cam_ccb.h, which should correlate with the cam_status enum per sys/cam/scsi/cam.h. I had to write a small program to dump all the enum names and values because the ordering and mid-struct value assignments result in hard-to-read code. Text strings come directly from cam_status_table[] per sys/cam/cam.c, except for QFRZN, which comes from the .h file itself. 0x04 = CAM_REQ_CMP_ERR = CCB request completed with an error 0x40 = CAM_DEV_QFRZN = The DEV queue is frozen w/this err Suggest involving avg@, hselasky@, and emaste@ You will probably be asked to try older commits to try and narrow down what exact commit broke things. stable/11 commits: http://www.freshbsd.org/search?branch=RELENG_11&project=freebsd HEAD commits: http://www.freshbsd.org/?branch=HEAD&project=freebsd
Based on Jeremy's assessment, I looked for revisions from 4 weeks ago that were MFC'd this week. I came up with r327996 which was MFC'd to 11.1-STABLE as r329316. I built 11.1-STABLE as of r329310 and the problem appears to be resolved. All of this is without any understanding of the intent of the original (r327996) fix.
Could you please run this command after attaching the device and before running the gpart command? # camcontrol debug -c -p -I -P 3:0:0 Where 3:0:0 is taken from bus, target, lun reported by camcontrol devlist. I hope that this will enable debugging messages that will tell us more about what is going on.
Created attachment 190827 [details] Output of gpart show da0 after camcontrol debug -c -p -I -P 3:0:0 Output of gpart show da0 after camcontrol devlist debug -c -p -I -P 3:0:0 I had to extract this text from dmesg -a due to size exceeding VirtualBox screenshot limit.
(In reply to David.Boyd49 from comment #7) This is valuable information, thank you! Could you please install sg3_utils package? And then run the following commands (after attaching the USB device): sg_modes -p 63 -vvvvv /dev/pass2 sg_modes -p 63 -m 192 -vvvvv /dev/pass2 sg_modes -p 63 -6 -vvvvv /dev/pass2 It might be helpful to run the same commands (adjusted for the device name) on the host to see if VirtualBox causes any trouble (unlikely).
And for completeness: sg_modes -a -vvvvv /dev/pass2
Created attachment 190845 [details] Output of sg3_utils commands I ran the commands requested. The output looks truncated by error conditions. Maybe that is significant in itself.
Created attachment 190847 [details] Output of sg3_utils commands on CentOS EL 7.1708 host The VirtualBox host is CentOS EL 7.1708. I ran the sg commands there as well.
(In reply to David.Boyd49 from comment #10) Are you sure that you used the correct pass device? You used pass3 while previously the USB device had pass2. Just double-checking... It's very strange that even inquiry command failed and mode sense command has not been even tried.
(In reply to David.Boyd49 from comment #11) The host results look good.
Created attachment 190860 [details] Verify camcontrol devlist output for pass3 device Andriy, I verified pass3 as the USB device (see screenshot). I followed up on your suggestion that this might be VirtualBox. I tested with CentOS and openSUSE guest. Neither had any problem mounting and accessing the USB flash drive but both had similar errors trying to run sg_modes commands. A comment made in January (and overlooked since then) indicated that this occurred in a FreeBSD guest running FreeBSD-12.0-CURRENT-amd64-20180118-r328126.vmdk.xz where the host was using a Via chipset (VL805 to be exact). I will test this as soon as possible on a host that has a NEC chipset (D72020x) to see if that actually makes any difference. I'll let you know.
(In reply to David.Boyd49 from comment #14) Perhaps the device genuinely does not support INQUIRY command... Maybe camcontrol would be a better option in this case. Could you please try this? # camcontrol modepage /dev/pass3 -m 63 # camcontrol modepage /dev/pass3 -l
Andriy, I will run the camcontrol modepage commands as soon as possible. In order to "get something done", I carried the USB flash drive to a production VirtualBox server that has an NEC D720201 chipset for USB 3.0 ports. There I ran the following command "sg_modes -a -vvvvv /dev/passX" against the latest FreeBSD 12.0-CURRENT image. I did not have any issues with the command output being truncated due to errors. I also ran the command "sg_modes -a -vvvvv /dev/sdX" on several Linux-based VM's without incident. These were LinuxMint 18.3 and CentOS EL 6.9 systems. The test machine (where the problem was originally observed) has a VIA VL805 chipset. On that machine I ran the command "sg_modes -a -vvvvv /dev/sdX" on several Linux-based VM's. Each of them displayed error(s) (similar to 12.0-CURRENT) trying to run the command. These were LinuxMint 18.3, openSUSE Leap 42.3, and CentOS EL 7.1708 systems.
Created attachment 190901 [details] Output of camcontrol modepage commands Andriy, I ran the camcontrol modepage commands as you requested. I didn't see any significant output. I will run the same commands on the VirtualBox guest where the NEC chipset is used to see if there is any difference.
Created attachment 190908 [details] Output of camcontrol modepage commands on system with NEC chipset Same USB drive Same FreeBSD 12.0-CURRENT version Different USB chipset (NEC D720201)
(In reply to David.Boyd49 from comment #18) So, it seems that MODE SENSE (10) command completely wedges the device (or maybe the controller) and it stops accepting any subsequent commands. I wonder if it would work better with MODE SENSE (6) or if any form of MODE SENSE is broken. Are you set up for compiling and testing kernel code patches? Or would you prefer that I send you pre-built kernels? But before that, could you please obtain a log from a working Linux guest with the problematic controller? I am not sure how to get the log that I want as I am not very experienced with Linux, but it should look something like this: [ 3902.887849] usb 4-6: new SuperSpeed USB device number 6 using xhci_hcd [ 3902.904584] usb 4-6: New USB device found, idVendor=152d, idProduct=0567 [ 3902.904586] usb 4-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 3902.904587] usb 4-6: Product: JMS567 [ 3902.904588] usb 4-6: Manufacturer: JMicron [ 3902.904588] usb 4-6: SerialNumber: DB123456789B1D [ 3902.906862] scsi host13: uas [ 3902.907595] scsi 13:0:0:0: Direct-Access Samsung SSD 850 PRO 256G 0117 PQ: 0 ANSI: 6 [ 3902.908426] sd 13:0:0:0: Attached scsi generic sg4 type 0 [ 3902.908574] sd 13:0:0:0: [sde] 500118192 512-byte logical blocks: (256 GB/238 GiB) [ 3902.908575] sd 13:0:0:0: [sde] 4096-byte physical blocks [ 3902.909089] sd 13:0:0:0: [sde] Write Protect is off [ 3902.909091] sd 13:0:0:0: [sde] Mode Sense: 53 00 10 08 [ 3902.909368] sd 13:0:0:0: [sde] Disabling FUA [ 3902.909370] sd 13:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 3902.910769] sde: sde1 [ 3902.912446] sd 13:0:0:0: [sde] Attached SCSI disk I guess that this requires setting kernel log level to debug before attaching the device and then extracting the logs from the appropriate place (dmesg?, systemd journal?).
Andriy, I can build kernels with patches. I will try using usbmon to get the information you want. I may not be able to do this until tomorrow.
Created attachment 191060 [details] USB attach on CentOS guest dmesg output showing USB attach on CentOS EL 7.1708 VirtualBox guest
Created attachment 191063 [details] USB attach on CentOS guest (with data) USB attach on CentOS EL 7.1708 VirtualBox guest (with actual data)
(In reply to David.Boyd49 from comment #22) Thank you! I am working on a patch. Meanwhile, could you please test setting kenv kern.cam.da.0.minimum_cmd_size=6 on the command line before plugging in the USB device for the first time? Alternatively, you can add kern.cam.da.0.minimum_cmd_size=6 to /boot/loader.conf and reboot. This assumes that the disk attaches as "da0" as it did before.
Created attachment 191120 [details] Output: USB attach, gpart show after kern.cam.da.0.minimum_cmd_size=6 Andriy, I haven't had much/any luck getting the USB flash drive to attach and be available after kenv kern.cam.da.0.minimum_cmd_size=6. This was again VM image FreeBSD-12.0-CURRENT-amd64-20180226-r330034.vmdk.xz I have tommorrow free, so just let me know what you want me to do. Thanks.
Created attachment 191128 [details] Use MODE SENSE (6) for USB Could you please test this patch? Also, while testing could you please execute the camcontrol debug before the gpart command? I've just now realized another strange thing. Apparently, MODE SENSE works fine the first time when the disk is probed, but it causes the trouble the second time when the disk gets opened.
Created attachment 191171 [details] Log of errors generated by gpart show da0 after your patch was applied Log of errors generated by gpart show da0 after your patch was applied
Created attachment 191172 [details] Console input/output after your patch was applied Console output after your patch was applied ... just to verify that I did things correctly.
Created attachment 191173 [details] Correct console output after your patch was applied Please ignore the duplicate message log output ... this is the actual console output.
Created attachment 191215 [details] do not try to use any 16-byte commands (In reply to David.Boyd49 from comment #26) Disappointing. Okay, let's try to attack the issue from a different angle.
Created attachment 191268 [details] Output after additional patch was applied Andriy, Here is the debug output after the second patch was applied. I don't know what I'm looking at, but I didn't see much difference. I'm ready to do whatever you need.
(In reply to David.Boyd49 from comment #30) So, that was not it. Another thing to try. Could you please try to comment out the call to dareprobe() in daopen() ?
Andriy, No joy. After commenting the dareprobe() (assuming that I did it properly), when I enter gpart show da0 the system hangs forever with no debug output. Please let me know how you want me to proceed.
(In reply to David.Boyd49 from comment #32) Bummer. I am really out of ideas now. I think that we have tried every quirk that Linux may use and nothing works... One idea is trying the previous patches in combinations... I would like to see SCSI debug messages that are produced during the original probing. The problem with getting them is that the only way to get them is to enable debug messages for all devices. And that means noise from ada0 and ada1. Maybe you could the following: 1. stop syslogd, so that the kernel log is not written to disk causing the positive feedback loop 1a. alternatively, boot to single user mode and do things there 2. camcontrol debug -c -p -I -P all 3. plug the device 4. wait until CAM messages settle down 5. camcontrol debug off 6. check the captured messages in dmesg output
(In reply to David.Boyd49 from comment #32) Oh, before you consider comment #33... have you also commented out cam_periph_sleep("dareprobe") ? If not, then that would explain the hang (endless wait for something that's not happening).
Andriy, I did not comment the cam_periph_sleep("dareprobe") statement. I will do that as soon as possible. Which, if any, of the preceding patches would you like me to include in the build?
(In reply to David.Boyd49 from comment #35) Let's try first without any extra patches.
Created attachment 191533 [details] camcontrol debug after local patch Andriy, This is the output after the local patch was applied to scsi_da.c. I issued camcontrol debug -c -p -I -P all in single-user mode. Possibly of interest: when I rebooted the system into multi-user mode, the USB flash drive was detected and usable (fsck, gpart show, mount). I don't know what that means, but I thought I would let you know.
(In reply to David.Boyd49 from comment #37) That's interesting and puzzling. When you rebooted to multi-user, did you still have the patched kernel? If so, then maybe when you booted the hardware (or VirtualBox) was already wedged? Because single-user vs multi-user mode should not have any effect on CAM and USB. So, I'd like to double-check with you if the patch helps. And, if yes, could you please repeat the single user mode test again?
Andriy, Yes, the patch disabling dareprobe(s) was/is applied to the kernel in both single-user and multi-user mode. In single-user mode the error condition in the dmesg output only appears if the device is attached after the "camcontrol debug -c -p -I -P all" statement is issued. Just booting into single-user mode, but without issuing the debug command, allows access to the USB flash drive (gpart show da0, mount/unmount, fsck, etc.). As stated, with the patch disabling dareprobe(s), booting into multi-user mode allows access as above.
What revision of FreeBSD were the latest tests done by? I've fixed some locking bugs ending with r331435... dareprobe suggests this may help, but I'm unsure.
(In reply to Warner Losh from comment #40) Warner, it is unlikely that it had to do with locking, but not impossible, of course. If you look at attachment 191533 [details] you can see that INQUIRY ccb was failing for some unknown reason, but that happened only when CAM debugging (via camcontrol debug ... all) was enabled. Quite strange.
(In reply to David.Boyd49 from comment #39) David, thank you for the clarification. I am quite at loss about why dareprobe plus WP-detection would cause the problems that you experienced. I hoped to get to the bottom of the issue, but the puzzling problem with CAM debug makes that very hard. I guess I'll just work on a patch that would add an ability to disable WP-detection. And then you'll be able to just set a tunable or sysctl to get that. I do not have any better idea. Not for working around the problem, not for debugging it.
Andriy, More and more I am becoming convinced that this is a problem in VirtualBox. I think that your patch (r327996) uncovered this bug in the way that vbox maps real USB controller(s) to virtual USB controller(s). In an attempt to prove this point, I ordered several Renesas (NEC) USB 3.0 controllers (uPD720201 chipset) to replace the VIA USB 3.0 controllers (VL805 chipset). Some of the motherboard being used have Renesas (NEC) USB 3.0 controller(s) built-in. All other motherboards do not have built-in USB 3.0 controller(s) and use an add-on PCI-e card to add 2, 4, 5 or 7 ports. I am using "sg_modes -a -vvvvv <device>" as a control. In all cases where FreeBSD (and various Linux flavors) were running on the "bare metal", no problems with either controller type were noticed. When FreeBSD (and the Linux systems) is running as a VirtualBox guest, no issues are seen when the Renesas USB 3.0 controller is used **** Continued on next comment. ****
Hi, Try also to exclude the timing of the SCSI commands as a source of USB errors. When you are using bare metal, the commands execute more quickly. Sometimes, it might happen that the USB firmware in the peer has blind eye, which cause it to hang or lose the transferred command. The following sysctl might be useful testing this. hw.usb.umass.throttle: 0 Further there is: /usr/src/tools/tools/usbtest/ Which can be used to test mass storage device to quickly reveal known issues. --HPS
**** Continued from previous comment. **** When the VIA USB 3.0 controllers are in play, the FreeBSD (and Linux) guests show the problems that we have been chasing. But, again, only when running as VirtualBox guest(s). VirtualBox presents the both the Renesas and VIA USB 3.0 controller(s) as an Intel 7 Series/C210 xHCI USB Controller. Based on our testing and observation this presentation is not consistent across the Renesas and VIA types. I have no problem with being told that I am wrong. If you can agree with my conclusions, I want to open a Bug Report with VirtualBox. If there is anyone in the FreeBSD community who has a connection with the folks at VirtualBox, I would be most appreciative for any assistance that they may offer. Andriy, it might be useful to include a few words describing what you were addressing with the patch that uncovered this problem. If you agree with my conclusions, please feel free to close this PR. David.
(In reply to David.Boyd49 from comment #45) David, sorry that this fell of my plate, so to speak. Thank you for all the testing and analysis. It indeed appears to be a problem with VirtualBox. It would be a good idea to file a bug for it. The essence of scsi_da change in r327996 is that the driver now issues MODE SENSE command for the all pages page (0x3f). I am adding a tunable that disables the new functionality, so that the problem can be worked around.
A commit references this bug: Author: avg Date: Mon Dec 17 16:01:38 UTC 2018 New revision: 342169 URL: https://svnweb.freebsd.org/changeset/base/342169 Log: add a knob that disables detection of write protected disks It has been reported that on some systems (with real hardware passed through to a virtual machine) the WP detection causes USB disk probing failures. While here, also fix the selection of the next state in the case of malloc failure in DA_STATE_PROBE_WP. It was DA_STATE_PROBE_RC unconditionally even when it should have been DA_STATE_PROBE_RC16. PR: 225794 Reported by: David Boyd <David.Boyd49@twc.com> MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D18496 Changes: head/sys/cam/scsi/scsi_da.c
A commit references this bug: Author: avg Date: Fri Jan 11 11:42:26 UTC 2019 New revision: 342938 URL: https://svnweb.freebsd.org/changeset/base/342938 Log: MFC r342169: add a knob that disables detection of write protected disks PR: 225794 Changes: _U stable/12/ stable/12/sys/cam/scsi/scsi_da.c
A commit references this bug: Author: avg Date: Fri Jan 11 11:43:37 UTC 2019 New revision: 342939 URL: https://svnweb.freebsd.org/changeset/base/342939 Log: MFC r342169: add a knob that disables detection of write protected disks PR: 225794 Changes: _U stable/11/ stable/11/sys/cam/scsi/scsi_da.c