Bug 244356

Summary: Writing to a USB 3.0 stick is very slow
Product: Base System
Reporter: Sebastien Boisvert <seb>
Component: usb
Assignee: freebsd-usb (Nobody) <usb>
Status: Open
Severity: Affects Only Me
CC: gljennjohn, grahamperrin, hselasky, icarus.m+freebsdbugzilla, imp, john, maurizio1018, nreilly, olce, rob2g2-freebsd, seb, vidwer+fbsdbugs
Priority: ---
Keywords: needs-qa, performance
Version: 12.1-RELEASE
Hardware: amd64
OS: Any
Attachments:
Output of usbdump for the USB 2.0 stick (which works fine)
Output of usbdump for the USB 3.0 stick (which is 10 times slower than the USB 2.0 stick)
Output of dmesg with hw.usb.umass.debug=-1
Output of dmesg with hw.usb.umass.debug=-1 with 2 Kingston DataTraveler 2.0
dmesg with umass debugging and no USB devices plugged in
dmesg with umass debugging and Kingston DataTraveler 3.0 plugged in (USB 3.0 device in a USB 3.0 port)
For comment #26: SD_64G_ufs_SU_usr_cp.txz
For comment #26: SD_64G_exFAT_usr_cp.txz
For comment #26: SD_64G_no_unplug_ufs_SU_usr_cp.txz
For comment #26: SD_64G_ufs_SU_dd.txz
For comment #26: KT_32G_ufs_SU_dd.txz
For comment #26: SD_64G_2nd_hub_ufs_SU_dd.txz
For comment #26: SD_128G_ufs_SU_usr_cp_1.txz
For comment #26: SD_128G_ufs_SU_usr_cp_2.txz
Stick SD_64G: Output of `usbtest`
Stick SD_64G: 2nd output of `usbtest`
Stick SD_64G: 3rd output of (multiple) `usbtest`
Stick KT_32G: 1st round of outputs from `usbtest`
Stick SD_128G: 1st round of outputs from `usbtest`
Output of usbtest
Kingston_DataTraveler_100_G3_32_GB_in_USB_2.0_port.dd+iostat
Kingston_DataTraveler_100_G3_32_GB_in_USB_2.0_port.usbtest
Kingston_DataTraveler_100_G3_32_GB_in_USB_3.0_port.dd+iostat
Kingston_DataTraveler_100_G3_32_GB_in_USB_3.0_port.usbtest
Kingston DataTraveler G4 32GB dd on fat32
Kingston DataTraveler G4 32GB dd on UFS 1/3
Kingston DataTraveler G4 32GB dd on UFS 2/3
Kingston DataTraveler G4 32GB dd on UFS 3/3
Kingston DataTraveler G4 32GB iostat on FAT32
Kingston DataTraveler G4 32GB iostat on UFS 1/3
Kingston DataTraveler G4 32GB dd on UFS 2/3
Kingston DataTraveler G4 32GB iostat on UFS 2/3
Kingston DataTraveler G4 32GB iostat on UFS 3/3
dd + iostat on UFS - DataTraveler G4 32GB
dd + iostat on UFS+SU after zeroing the stick - DataTraveler G4 32GB

Description Sebastien Boisvert 2020-02-24 01:20:31 UTC
Hello,

Writing to a USB 3.0 stick is very slow, while USB 2.0 sticks work fine.
A USB 3.0 stick plugged into a USB 2.0 port exhibits the same problem
as when it is plugged into a USB 3.0 port.

My FreeBSD installation is on a ZFS mirror of 2 identical 7.2 GiB USB 2.0 sticks.

Output of 'usbconfig' for these 2 storage devices:

ugen1.2: <Kingston DataTraveler 2.0> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)
ugen0.3: <Kingston DataTraveler 2.0> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)

Output of gpart:

root@atlantis:~ # gpart show da0
=>      40  15131560  da0  GPT  (7.2G)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-swap  (2.0G)
   4196352  10934272    3  freebsd-zfs  (5.2G)
  15130624       976       - free -  (488K)

root@atlantis:~ # gpart show da1
=>      40  15131560  da1  GPT  (7.2G)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-swap  (2.0G)
   4196352  10934272    3  freebsd-zfs  (5.2G)
  15130624       976       - free -  (488K)


The directory /usr contains 620 MB of files.
Command:
# du -sh /usr/
620M	/usr/


It takes 4:31.94 (4 minutes, 32 seconds) to copy /usr to /root.
Command: time rsync -av /usr /root/

sent 1,458,070,436 bytes  received 952,593 bytes  5,354,212.95 bytes/sec
total size is 1,454,334,520  speedup is 1.00
13.724u 17.240s 4:31.94 11.3%	433+221k 59891+57303io 0pf+0w

I have a 32 GB USB 3.0 stick. The output of 'usbconfig' is:

ugen0.5: <Kingston DataTraveler 3.0> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (126mA)

The output of 'gpart':

root@atlantis:~ # gpart show da2
=>      40  60604336  da2  GPT  (29G)
        40      2008       - free -  (1.0M)
      2048  60600320    1  freebsd-ufs  (29G)
  60602368      2008       - free -  (1.0M)

There is a UFS file system on the USB 3.0 device.
It is mounted at /mnt/ufs.

It takes 31:34.57 to copy /usr to /mnt/ufs.
Command: time rsync -av /usr/ /mnt/ufs/

sent 1,458,070,404 bytes  received 952,532 bytes  770,136.15 bytes/sec
total size is 1,454,334,520  speedup is 1.00
13.128u 12.211s 31:34.57 1.3%	435+222k 7678+9104io 0pf+0w

Using 'truss', I can see that many calls to select() take a long time
to return.

Command: 
 truss -D rsync -av /usr /mnt/ufs/ >& /root/rsync-truss-ufs-usb-3.log

root@atlantis:~ # sort -k1,1rn /root/rsync-truss-ufs-usb-3.log | head -n 20
60.003431924 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 0 (0x0)
49.165728175 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
42.159206222 select(5,{ },{ 4 },{ },{ 60.000000 }) = 1 (0x1)
41.122681159 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
39.143298378 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
38.106784375 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
38.080823335 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
33.890768013 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
32.247561086 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
29.721087531 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
28.045373886 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
27.097321654 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
26.624814418 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
26.183747112 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
23.386795589 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
22.540706824 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
22.448082140 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
21.807824638 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
21.662166794 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)
17.910288782 select(6,{ 5 },{ 4 },{ 5 },{ 60.000000 }) = 1 (0x1)


Using the 'procsystime' dtrace script, I can see that the elapsed time is
dominated by calls to select().

root@atlantis:~ # /usr/local/share/dtrace-toolkit/procsystime
Tracing... Hit Ctrl-C to end...
dtrace: 357 dynamic variable drops with non-empty dirty list
^C

Elapsed Times for all processes,

         SYSCALL          TIME (ns)
       sigreturn               4899
            open               5961
         fstatfs               6843
           close               8286
          openat              13816
           fstat              17242
       sigaction              22701
            mmap              30375
         fstatat              39743
   getdirentries              40936
     sigprocmask              99694
           write             633885
        __sysctl             873677
           ioctl            1227846
            read            1428213
        _umtx_op         9038575168
          select        17245685291


The 'hotkernel' dtrace script shows that the CPU cores are mostly idle.

root@atlantis:~ # /usr/local/share/dtrace-toolkit/hotkernel 
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                                                COUNT   PCNT
kernel`cpu_activeclock                                      1   0.0%
kernel`acpi_cpu_idle                                        1   0.0%
kernel`usbd_in_polling_mode                                 1   0.0%
kernel`xhci_generic_done_sub                                1   0.0%
kernel`critical_exit_preempt                                1   0.0%
kernel`softdep_sync_buf                                     1   0.0%
kernel`rdtsc                                                1   0.0%
kernel`cpu_idle_acpi                                        2   0.0%
kernel`ithread_loop                                         2   0.0%
kernel`usbd_get_page                                        2   0.0%
kernel`xhci_setup_generic_chain_sub                         3   0.0%
kernel`em_update_stats_counters                             3   0.0%
kernel`tsc_get_timecount_low_mfence                         4   0.0%
kernel`cpu_search_highest                                   7   0.0%
kernel`xhci_interrupt_poll                                  8   0.0%
kernel`xhci_interrupt                                      16   0.0%
kernel`spinlock_exit                                       36   0.0%
kernel`cpu_idle                                          2888   3.4%
kernel`acpi_cpu_c1                                      81118  96.5%


The 'disklatency' dtrace script shows high disk latency.

root@atlantis:~ # /usr/share/dtrace/disklatency
Tracing... Hit Ctrl-C to end.
^C

  da                                                        2
           value  ------------- Distribution ------------- count    
           76000 |                                         0        
        >= 80000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        

  da                                                        2          2131761
  da                                                        2          2121442
  da                                                        2            10420


ZFS is completely unusable on this USB 3.0 stick.
It took the whole night just to install FreeBSD on ZFS using the installer.


Hardware
--------

- Processor: AMD Ryzen 5 2600 Six-Core Processor
- Socket: AM4
- Memory: Kingston 8GB DDR4-2400 ECC Unbuffered DIMM CL17 1Rx8 1.2V Micron E-die (Server Premier)
- Motherboard: ASRock B450 Gaming-ITX/ac
- Chipset: AMD Promontory B450
- Storage (operating system): 2 x Kingston DataTraveler 2.0 7.2 GiB
- Storage: Kingston DataTraveler 3.0 PHISON USB3
Comment 1 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 08:07:55 UTC
You might want to check the latency using usbdump:

usbdump -i usbus1 -f 2 -s 65536

To see exactly what is going on.

--HPS
Comment 2 Sebastien Boisvert 2020-02-24 12:26:02 UTC
Created attachment 211891 [details]
Output of usbdump for the USB 2.0 stick (which works fine)
Comment 3 Sebastien Boisvert 2020-02-24 12:27:21 UTC
Created attachment 211892 [details]
Output of usbdump for the USB 3.0 stick (which is 10x times slower than the USB 2.0 stick)
Comment 4 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 12:38:38 UTC
It looks like the SCSI/CAM layer is probing something every 3 seconds and doesn't get the expected result. It does not look like a device problem.

Can you try to set this before plugging in the device:

sysctl hw.usb.umass.debug=-1

And then collect messages in dmesg?

--HPS
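
For reference, the full capture sequence could look like this (a sketch; restoring the value 0 assumes that is the default for hw.usb.umass.debug):

# enable verbose umass logging before attaching the device
sysctl hw.usb.umass.debug=-1
# ... plug in the stick, reproduce the slow transfer ...
# save the kernel messages, then restore normal logging
dmesg > /root/umass-debug.txt
sysctl hw.usb.umass.debug=0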
Comment 5 Sebastien Boisvert 2020-02-24 12:45:09 UTC
(In reply to Hans Petter Selasky from comment #1)

Hello Hans Petter Selasky,

Note that one USB 2.0 stick is on bus usbus1 (as device ugen1.2), while
the other USB 2.0 stick is on bus usbus0 (as device ugen0.3). The USB 3.0 stick is on bus usbus0 (as device ugen0.5).

I ran 'usbdump' with those commands:

# usbdump -i usbus1 -f 2 -s 65536 > /root/Kingston_DataTraveler_2.0-usbus1-device-2-latency.txt

# usbdump -i usbus0 -f 5 -s 65536 > /root/Kingston_DataTraveler_3.0-usbus0-device-5-latency.txt

Here is the information you requested:

Kingston_DataTraveler_2.0-usbus1-device-2-latency.txt.gz
https://bugs.freebsd.org/bugzilla/attachment.cgi?id=211891

Kingston_DataTraveler_3.0-usbus0-device-5-latency.txt.gz
https://bugs.freebsd.org/bugzilla/attachment.cgi?id=211892
Comment 6 Sebastien Boisvert 2020-02-24 22:41:25 UTC
Created attachment 211911 [details]
Output of dmesg with hw.usb.umass.debug=-1
Comment 7 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 23:01:12 UTC
Is this output the same for both USB 2.0 and USB 3.0?

Warner:

Why is the data transfer 8K for a sense length of 252 bytes?

ccb->csio.dxfer_len = 8192, ccb->csio.sense_len = 252

umass1:umass_cam_action: 6:0:0:XPT_SCSI_IO: cmd: 0x2a, flags: 0x80, 10b cmd/8192b data/252b sense

--HPS
Comment 8 Sebastien Boisvert 2020-02-24 23:06:50 UTC
Created attachment 211912 [details]
Output of dmesg with hw.usb.umass.debug=-1 with 2 Kingston DataTraveler 2.0

USB devices plugged in (no USB 3.0):

ugen1.1: <0x1022 XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1: <0x1022 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <vendor 0x8087 product 0x0aa7> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
ugen1.2: <Kingston DataTraveler 2.0> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)
ugen0.3: <Kingston DataTraveler 2.0> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)
ugen0.4: <Chicony USB Keyboard> at usbus0, cfg=0 md=HOST spd=LOW (1.5Mbps) pwr=ON (100mA)
Comment 9 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 23:26:42 UTC
ugen1.2: <Kingston DataTraveler 2.0> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)
ugen0.3: <Kingston DataTraveler 2.0> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (100mA)

In the trace I see umass0 and umass1. Can you point out which is which?

There are no errors in the attached traces, nor any timeouts. Everything appears to work as expected.

Warner: Nevermind. It is SCSI command 0x2a .
Comment 10 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 23:27:09 UTC
In the trace both devices are enumerating as USB 2.0 ???
Comment 11 Hans Petter Selasky freebsd_committer freebsd_triage 2020-02-24 23:30:46 UTC
One question: Can you re-do the test while disabling the swap partition?

--HPS
Comment 12 Sebastien Boisvert 2020-02-25 00:52:33 UTC
Hello Hans,

I have 2 spare Seagate IronWolf NAS 4 TB SATA3.
(I am waiting for a case and a power supply to arrive by mail in order
to complete the build.)

To simplify things, I installed FreeBSD 12.1-RELEASE on an IronWolf SATA3 disk.

This way, the swap partition on the USB 2.0 stick will not interfere.

I will give you 2 dmesg outputs:

- dmesg without any USB device being attached.
- dmesg with only the <Kingston DataTraveler 3.0> 32 GB USB stick plugged in.
Comment 13 Sebastien Boisvert 2020-02-25 00:55:56 UTC
Created attachment 211914 [details]
dmesg with umass debugging and no USB devices plugged in

### Output of 'usbconfig'
ugen1.1: <0x1022 XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1: <0x1022 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <vendor 0x8087 product 0x0aa7> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
Comment 14 Sebastien Boisvert 2020-02-25 00:59:48 UTC
Created attachment 211915 [details]
dmesg with umass debugging and Kingston DataTraveler 3.0 plugged in (USB 3.0 device in a USB 3.0 port)

### Output of 'usbconfig'
ugen1.1: <0x1022 XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1: <0x1022 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <vendor 0x8087 product 0x0aa7> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
ugen0.3: <Kingston DataTraveler 3.0> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (126mA)
Comment 15 Sebastien Boisvert 2020-02-27 02:07:43 UTC
I tested the <Kingston DataTraveler 3.0> USB stick on Ubuntu 18.04 (Linux)
on the same system (AMD Ryzen 5 2600, ASRock B450 Gaming-ITX/ac) to make sure
that the problem is not due to the hardware.

The command

    time rsync -av /lib /mnt/ext4-on-usb-3.0/

achieves a throughput of 66 MB/s.

So the hardware is not the problem.
Comment 16 vidwer+fbsdbugs 2020-03-01 15:45:59 UTC
I'll be following this.

With the USB DA device from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235897, dd and rsync transfers to UFS and to ZFS seem to stall after a while.
A transfer of a large generated file (dd if=/dev/urandom of=foo bs=1M count=1048576) runs at 120-something MB/s, stalls, drops to 10-something MB/s, stalls again, and drops to KB/s speeds. After interrupting rsync, I ran 'time sync', which took 'real 2m15.632s'.

Because of the dd observation, perhaps the 'disk cache' can't keep up with the amount of writes?

rsync to UFS on USB 2.0 storage:
sent 1,074,155,802 bytes  received 76 bytes  9,633,685.00 bytes/sec

rsync to UFS on USB 3.0 storage:
sent 1,074,155,802 bytes  received 76 bytes  2,438,492.35 bytes/sec

I can test patches on 13-CURRENT.
Comment 17 Sebastien Boisvert 2020-03-01 15:57:19 UTC
I ended up using 2 x Kingston DataTraveler 2.0 sticks for my installation of FreeBSD 12.1-RELEASE (ZFS mirror).

I also purchased 2 x USB 3.0 Sandisk sticks instead of Kingston to see if it's a vendor problem. I have not yet received those.

I do have an older Kingston USB 3.0 stick, and it is also unusable like the 2 newer Kingston USB 3.0 sticks.

In an experiment, I tried to install OmniOS r151030a on the same 2 x Kingston DataTraveler 3.0. Like with FreeBSD 12.1-RELEASE, OmniOS r151030a takes forever to install itself on USB 3.0 sticks (ZFS mirror).

In another experiment, I installed OmniOS r151030a on the 2 x Kingston DataTraveler USB 2.0 sticks. The installation went fine, but then OmniOS hangs when booting off those USB 2.0 sticks.

So, I am happy that FreeBSD 12.1-RELEASE works with USB 2.0. USB 3.0 would be great though.
Comment 18 Hans Petter Selasky freebsd_committer freebsd_triage 2020-03-01 15:58:13 UTC
Hi,

FreeBSD doesn't have a disk cache.

Try to run usbdump at the same time as testing the disk using dd.

--HPS
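
A sketch of running both at once, reusing the usbdump parameters from comment #1 (the bus and device numbers must match the stick's ugen address, here assumed to be ugen0.5):

# terminal 1: capture bus traffic for the stick
usbdump -i usbus0 -f 5 -s 65536 > /root/usb3-stick-usbdump.txt
# terminal 2: generate a steady write load on the stick's filesystem
dd if=/dev/zero of=/mnt/ufs/zero.bin bs=1M count=1024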
Comment 19 Hans Petter Selasky freebsd_committer freebsd_triage 2020-03-01 15:59:23 UTC
Also try:

/usr/src/tools/tools/usbtest

Make sure your memory stick is empty before running usbtest.

--HPS
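
For reference, a sketch of building and starting it from a checked-out source tree (the binary's location may vary with the make(1) configuration, e.g., it can land under /usr/obj):

cd /usr/src/tools/tools/usbtest
make
./usbtest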
Comment 20 Sebastien Boisvert 2020-03-01 16:33:50 UTC
(In reply to Hans Petter Selasky from comment #18)

> FreeBSD doesn't have a disk cache.

I don't know much about FreeBSD so maybe I am wrong, but:

In FreeBSD, the ZFS ARC (adaptive replacement cache) is a disk cache for read operations for ZFS.

So, for read operations, if I have ZFS on the USB stick, then there is a disk cache for read operations through the ZFS ARC, right?

Although there is no disk cache for write operations, FreeBSD probably
has a queue (or queues) in the block subsystem for AHCI I/O operations (with SATA 3, for instance). Such a logical queue can be used to coalesce and/or order I/O commands before draining the queue to submit them to the device queue.

I don't know whether USB 3.0 sticks are single-queue or multi-queue.
I know that SATA has 1 queue with a depth of 32, whereas NVMe supports up to 64K queues, each with a depth of 64K.

I know that Linux has a page cache (unified across virtual memory, block devices, and so on).
Do you think that USB 3.0 works well on Linux because the page cache is used for write operations?

I mean, if userspace does not call "sync", then the data can sit dirty in kernel space without being written back to the device. That usually makes I/O appear faster.

The documentation of "select" says:

    "to see if some of their descriptors are ready for reading,
     are ready for writing"
     --> https://www.freebsd.org/cgi/man.cgi?select

Here, select() takes 60 seconds to conclude that none of the descriptors are ready. That delay is clearly abnormal.

How can I trace what is being called between select_ENTER and select_EXIT in FreeBSD?
Comment 21 Hans Petter Selasky freebsd_committer freebsd_triage 2020-03-01 16:38:28 UTC
Cache only for files, not disk writes/reads.

--HPS
Comment 22 Sebastien Boisvert 2020-03-01 17:12:06 UTC
(In reply to Hans Petter Selasky from comment #21)

OK got it.

How can I trace what is being called between select_ENTER and select_EXIT in FreeBSD?
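
One way to approach this question with DTrace directly (a minimal sketch; the process name "rsync" is an illustrative assumption, any copying program works):

# distribution of time spent inside select(2) by rsync
dtrace -n 'syscall::select:entry /execname == "rsync"/ { self->ts = timestamp; }
    syscall::select:return /self->ts/ { @["select latency (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'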
Comment 23 Sebastien Boisvert 2020-03-13 11:13:00 UTC
Hello,

I acquired new USB flash drives to do additional tests.

I used the same USB 3.1 Gen1 port on my computer case for all tested USB drives.
I created one GPT partition and formatted a UFS file system on it.
The benchmark is the following command that copies /usr (size: 848M):

time rsync -av /usr /mnt/USB_FLASH_DRIVE/

------------------------------------------------------------------------
Product: Kingston DataTraveler SE9 G2 USB Flash Drive 
Part Number: DTSE9G2/32GB
Interface: USB 3.0

time:
19.606u 19.922s 1:29:45.25 0.7% 433+221k 5552+228702io 0pf+0w

rsync:
sent 2,069,862,956 bytes  received 780,403 bytes  384,484.89 bytes/sec
total size is 2,066,495,106  speedup is 1.00


------------------------------------------------------------------------
Product: SanDisk Ultra Fit USB 3.1 Flash Drive 
Part Number: SDCZ430-032G-G46
Interface: USB 3.1

time:
20.155u 22.475s 19:16.30 3.6%	431+220k 19641+353403io 0pf+0w

rsync:
sent 2,069,862,956 bytes  received 780,443 bytes  1,790,439.60 bytes/sec
total size is 2,066,495,106  speedup is 1.00


------------------------------------------------------------------------
Product: SanDisk Cruzer Pop 32GB USB 2.0 Flash Drive, Tribal
Part Number: SDCZ53B-032G-B35
Interface: USB 2.0

time:
19.108u 20.545s 1:31:48.85 0.7% 438+224k 14194+228758io 0pf+0w

rsync:

sent 2,069,862,956 bytes  received 780,431 bytes  375,831.45 bytes/sec
total size is 2,066,495,106  speedup is 1.00
Comment 24 Sebastien Boisvert 2020-05-03 22:50:36 UTC
Update:

I removed USB sticks from my NAS.

FreeBSD 12.1 now runs on a ZFS mirror VDEV on my NVMe storage device.

root@atlantis:~ # usbconfig
ugen1.1: <0x1022 XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1: <0x1022 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <vendor 0x8087 product 0x0aa7> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)


root@atlantis:~ # nvmecontrol devlist
 nvme0: THNSN51T02DU7 TOSHIBA
    nvme0ns1 (976762MB)


root@atlantis:~ # gpart show -l -p nvd0
=>        40  2000409184    nvd0  GPT  (954G)
          40        2008          - free -  (1.0M)
        2048        1024  nvd0p1  gptboot0  (512K)
        3072        1024          - free -  (512K)
        4096    33554432  nvd0p2  swap0  (16G)
    33558528    67108864  nvd0p3  zfs0  (32G)
   100667392        1024  nvd0p4  gptboot1  (512K)
   100668416        1024          - free -  (512K)
   100669440    33554432  nvd0p5  swap1  (16G)
   134223872    67108864  nvd0p6  zfs1  (32G)
   201332736   134217728  nvd0p7  write-cache  (64G)
   335550464  1664858112  nvd0p8  read-cache  (794G)
  2000408576         648          - free -  (324K)
Comment 25 Olivier Certner freebsd_committer freebsd_triage 2020-05-05 10:26:38 UTC
Please don't close this yet, as I'm going to add new information in the next days, having similar problems.
Comment 26 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:45:43 UTC
I did some tests yesterday on a very recent machine (MB Asus X570) with 3 different USB sticks:
- SanDisk Extreme GO USB 3.1 64GB (abbrev: SD_64G)
- SanDisk Extreme GO USB 3.1 128GB (abbrev: SD_128G)
- Kingston DataTraveler 100 G3 32GB (abbrev: KT_32G)
They confirm and extend what Sébastien has already reported.

*********************
* Executive Summary *
*********************

A. For tests performed yesterday and whose reports are below:
- Transfers to USB 3.0 mass storage appear to stall very frequently, for at least 2 or 3s at a time, and frequently for several such runs in a row. Periods of non-zero bandwidth may become as short as 2 or 3s.
- There is an initial grace period during which bandwidth is steady and at arguably normal levels. It lasted between 10 and 20min. But once the stalls start, bandwidth is never steady again, even when changing USB hubs, USB sticks, filesystems and method of writing (for the precise extent of the variations, please see the reports below). The exact conditions for the grace period to manifest are not clear yet; I suspect a reboot should re-enable it, but the machine was never rebooted during the tests.
- Using FreeBSD 12.1-STABLE at r355888, unmodified.

B. For related experience during the last 2 years:
- I observed similar overall behavior on very old laptops (> 10 years old) that only have USB 2.0 hubs, with some set of USB 3.0 sticks (including KT_32G for sure).
- Same on another old desktop machine with USB 3.0 hubs (MB Asus P5P43TD).
- Same also with other USB 3.0 sticks (from yet another manufacturer, Corsair).
- Using continuously updated FreeBSD versions (10 to 12; but never CURRENT up to now).
- No such behavior is exhibited on Linux (the USB key is ejected properly, and the data on it, checked after replugging, is found correct).

C. Wrap-up and follow-up:

Currently, I thus have 5 USB 3.0 sticks, none of which can sustain acceptable steady bandwidth when transferring data to filesystems, except probably for some limited time after machine boot-up. Unless I'm particularly unlucky, I suspect the extent of the problem is such that a significant number of USB 3.0 sticks may actually not work with decent bandwidth for any non-trivial duration at all. Having positive reports might help in narrowing the problem down.

I really want this problem to be fixed, because I don't see myself buying more and more USB sticks in the hope that one is finally going to work, or having to change motherboards or laptops just because of this, especially when it seems the hardware really can work as expected (which might need quirks, workarounds, etc.).

If knowledgeable people could kindly look into the reports and attachments, it would be much appreciated. I'm more than willing to run more tests based on feedback. I'm probably still too new to USB and CAM to make fast debugging progress on my own, and I may not have the proper hardware background, so I would appreciate any guidance as to what to test and which code to try tinkering with.


*****************************************
* Hardware/software, common test traits *
*****************************************

All sticks are MBR-partitioned, with a single 4K-aligned partition occupying all available space. FS type varies and is mentioned in the reports below.

FreeBSD version is 12.1-STABLE at r355888.

For most tests, I grabbed iostat statistics every second (iostat -w 1). Inside the iostat output, you'll want to look at da0 (last columns) for the current USB stick's numbers. The /usr partition is on an M.2 NVMe SSD, appearing as nvd0 (first disk columns). Other active disks during the tests were ada3 and ada4, with unrelated traffic (none of their FSes were used for the tests below). All hard disks on these machines are in ZFS pools.

Tests are reported in chronological order. The machine was never rebooted in the course of these tests.


***********
* Reports *
***********


I. Initial tests with SD_64G

For these initial tests, I did not put in place instrumentation as advised earlier in this PR, but I wish I had, since I observed a different behavior (the grace period), which is why I'm reporting about them.

First, I used a freshly created exFAT partition. exFAT was created with package `exfat-utils` and then mounted thanks to package `fusefs-exfat`; both are at version 1.3.0.

I copied /usr/local/bin (~661M) using `cp -a`. This completed without errors, at a rate of ~66MB/s, a rate that was steady according to iostat.

Then, I replaced the exFAT partition by a standard UFS+SU one (default, so 32K blocks, 8K frags), and did the same test. Got same result.

Finally, I reformatted the partition with UFS (without SU) and did the test again. Got steady albeit lower performance (around 50-55MB/s, if I recall correctly).

So far so good.


II. Iostat-instrumented tests with SD_64G

Done on a freshly created UFS+SU partition. This time I launched a copy of the whole `usr` (reported to be 46GB by `du`). I stopped it after about 35GB had been copied. The corresponding attachment is: SD_64G_ufs_SU_usr_cp.txz.

Just to repeat some information given in introduction in order for the context to be very clear: `usr`'s files are read from the VM cache (previous tests) or a ZFS partition on the NVMe SSD.

As can be seen from the iostat file, transfer bandwidth to the stick varies, being below 10MB/s at the start, and then most of the time at several tens of MB/s, with ranges of up to 2 minutes where the bandwidth is mostly between 30 and 60 MB/s. The transfer proceeds well for approximately ten and a half minutes (again, one line per second in the iostat file).

But then it starts to exhibit a very curious pattern, where bandwidth alternately drops to 0 for 2 to 10s (except for one (small) transaction every 2 or 3s), and then rises to reasonable values (tens of MB/s) with very high TPS. Traffic never goes back to the first phase's pattern.

I stopped the iostat collection relatively quickly afterwards, but let the copy proceed, and stopped it after ~35GB had been transferred. In total, the transfer took a little more than 3 hours (see the trace in the attachment), which gives a little more than 3MB/s for the whole run.

I relaunched this test, but this time replacing the UFS+SU partition with an exFAT one, hoping that the initial correct behavior would return (see I.). I did this without unplugging/replugging the stick after the UFS+SU test. The corresponding attachment is: SD_64G_exFAT_usr_cp.txz. It contains only the iostat trace. Behavior is similar to the second phase of the previous run. There was no "normal" phase (with decent steady bandwidth) first.

Finally, I reran the test on a fresh UFS+SU partition, with again the same result. I collected only a couple of seconds of iostat. The transfer stopped after 8 hours with the device full, meaning an overall transfer rate of around ~2.25MB/s. The corresponding attachment is: SD_64G_no_unplug_ufs_SU_usr_cp.txz.


III. USB and umass instrumented tests, with variations

From this point on, I systematically set the sysctl `hw.usb.umass.debug` to -1 and ran usbdump as prescribed by HPS earlier in the PR. I also grabbed iostat output, except for A.

A. Using a different stick: SD_128G

Corresponding attachment: SD_128G_ufs_SU_usr_cp.txz.

This is a bigger stick from the same manufacturer. The test is the "same" as for SD_64G: I ran the `cp -a` test to a fresh UFS+SU MBR partition covering the whole stick. The result was essentially the same as in previous tests, with stalls and writes alternating.

B. Using `dd if=/dev/zero` instead of `cp -a /usr`

Corresponding attachment: SD_64G_ufs_SU_dd.txz.

I used this to rule out any influence of the source filesystem (ZFS on NVMe SSD), as well as some metadata operations on the destination FS. `dd` writes to a fresh new partition, into a single file.

For this one, I collected only 2 minutes of iostat, but stalls are clearly visible, and the problem seems worse: there are several runs of 2-to-3s stalls delimited by a single transaction before bandwidth can really take off, and then for no more than 2 or 3s. The full test ran for 6 minutes, and kernel logs as well as usbdump output are provided in the attachment.

C. Using `dd if=/dev/zero`, yet another stick: KT_32G, plugged into the rear USB hub

Corresponding attachment: KT_32G_ufs_SU_dd.txz.

The plugs I had used up to now were the front-panel ones on the case, and using them the KT_32G stick is not recognized as plugged in. So I switched to using the plugs soldered to the motherboard.

This is a smaller stick from another manufacturer. Test was done as for B., and ran 5 minutes. Overall same results.

D. Using `dd if=/dev/zero`, original stick, plugged on rear USB hub

Corresponding attachment: SD_64G_2nd_hub_ufs_SU_dd.txz.

This test was done just to ensure that the results were not influenced by the hub. Indeed, KT_32G not being recognized on the first hub could leave the impression that something was wrong with those plugs.

But results were again the same. Traces inside the attachment as usual.
Comment 27 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:47:20 UTC
Created attachment 214426 [details]
For comment #26: SD_64G_ufs_SU_usr_cp.txz
Comment 28 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:48:21 UTC
Created attachment 214427 [details]
For comment #26: SD_64G_exFAT_usr_cp.txz
Comment 29 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:49:11 UTC
Created attachment 214428 [details]
For comment #26: SD_64G_no_unplug_ufs_SU_usr_cp.txz
Comment 30 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:50:33 UTC
Created attachment 214429 [details]
For comment #26: SD_64G_ufs_SU_dd.txz
Comment 31 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:51:04 UTC
Created attachment 214430 [details]
For comment #26: KT_32G_ufs_SU_dd.txz
Comment 32 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:51:46 UTC
Created attachment 214431 [details]
For comment #26: SD_64G_2nd_hub_ufs_SU_dd.txz
Comment 33 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:54:38 UTC
Created attachment 214432 [details]
For comment #26: SD_128G_ufs_SU_usr_cp_1.txz

This attachment was split in 2 to avoid running into the 1 MB attachment limit.
Comment 34 Olivier Certner freebsd_committer freebsd_triage 2020-05-12 21:55:32 UTC
Created attachment 214433 [details]
For comment #26: SD_128G_ufs_SU_usr_cp_2.txz

Second part of initial SD_128G_ufs_SU_usr_cp.txz.
Comment 35 Sebastien Boisvert 2020-05-13 00:11:40 UTC
Clearly, the problem is select() taking too much time to return.

How can I trace what is being called between select_ENTER and select_EXIT in FreeBSD?
Comment 36 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 16:29:10 UTC
(In reply to Sebastien Boisvert from comment #35)

That `select` takes a while before returning is a symptom you saw, but `select` itself is not the problem's cause. Also, this symptom depends on the way the data is copied. For example, I do not see it in the same form with `dd`: absolutely all the top lines in truss' output for `dd` are `write` lines; there are no `select` lines at all. However, they show the same kind of variable, and long, latencies that you see with `select`. Using `rsync`, I indeed see `select` calls as the top lines, as you do, so I can apparently "reproduce" the behavior you see.

As for getting stack traces (yes, this is not exactly what you asked for), I tried to use hwpmc(4) (see pmcstat(8) and pmccontrol(8)), but unfortunately no performance counter class seems to be available for my machine. I saw that there are files for Ryzen processors in the source tree, though, so I may be able to figure out how to teach libpmc to use them (but maybe some code is really missing). You could try on your machine (try typing `pmcstat -L` and `pmccontrol -L` first, after having loaded `hwpmc.ko`), but given that it has a Ryzen 5, you'll likely get the same (non-)result as I did.

Another option is to use `procstat -k -k -p <pid>`. For example, here is what I typically get when running this against the `dd` process:
  PID    TID COMM                TDNAME              KSTACK
78175 104627 dd                  -                   mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 getpbuf+0xb8 cluster_wbuild+0x745 cluster_write+0x12d ffs_write+0x3dd VOP_WRITE_APV+0xec vn_write+0x260 vn_io_fault_doio+0x43 vn_io_fault1+0x16a vn_io_fault+0x18f dofilewrite+0xb0 sys_write+0xc1 amd64_syscall+0x362 fast_syscall_common+0x101
78175 104627 dd                  -                   mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 waitrunningbufspace+0x87 bufwrite+0x243 cluster_wbuild+0x54b cluster_write+0x12d ffs_write+0x3dd VOP_WRITE_APV+0xec vn_write+0x260 vn_io_fault_doio+0x43 vn_io_fault1+0x16a vn_io_fault+0x18f dofilewrite+0xb0 sys_write+0xc1 amd64_syscall+0x362 fast_syscall_common+0x101
So I don't think we are progressing much here: the process is blocked on getting buffers, because (educated guess) all buffers are full while the stick is stalled.

But I fear that things are going to be much more complicated than just getting call stacks at several points, because several subsystems are involved, and CAM is essentially said to be a hard-to-understand state machine (I haven't tried to figure it out yet, though). Indeed, I've tried running `procstat` on the `usb` and `cam` threads when `iostat` shows that traffic is happening, but didn't get anything meaningful (i.e., the same output as when the stick is idle). For the record, the respective outputs:
  PID    TID COMM                TDNAME              KSTACK
   32 100170 usb                 usbus0              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100171 usb                 usbus0              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100172 usb                 usbus0              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100173 usb                 usbus0              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100174 usb                 usbus0              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100176 usb                 usbus1              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100177 usb                 usbus1              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100178 usb                 usbus1              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100179 usb                 usbus1              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100180 usb                 usbus1              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100214 usb                 usbus2              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100215 usb                 usbus2              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100216 usb                 usbus2              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100217 usb                 usbus2              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
   32 100218 usb                 usbus2              mi_switch+0xdd sleepq_wait+0x2c _cv_wait+0x151 usb_process+0x137 fork_exit+0x82 fork_trampoline+0xe
  PID    TID COMM                TDNAME              KSTACK
   31 100162 cam                 doneq0              mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_done_td+0x9b fork_exit+0x82 fork_trampoline+0xe
   31 100163 cam                 doneq1              mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_done_td+0x9b fork_exit+0x82 fork_trampoline+0xe
   31 100164 cam                 doneq2              mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_done_td+0x9b fork_exit+0x82 fork_trampoline+0xe
   31 100165 cam                 doneq3              mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_done_td+0x9b fork_exit+0x82 fork_trampoline+0xe
   31 100166 cam                 doneq4              mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_done_td+0x9b fork_exit+0x82 fork_trampoline+0xe
   31 100341 cam                 scanner             mi_switch+0xdd sleepq_wait+0x2c _sleep+0x246 xpt_scanner_thread+0x99 fork_exit+0x82 fork_trampoline+0xe

DTrace itself could be used to produce call trees, and may actually be the best chance of obtaining useful ones for the `usb` and `cam` threads, by triggering data collection inside certain functions. I'm trying to figure out how to do this, but I won't have much time until next week, so don't expect quick news from me.

From the elements we have, I think that HPS (comment #4) is probably on the right track. The first thing now would be to determine what exactly CAM is waiting for (and in which state it is). And for this, I bet the help of a CAM guru would save us a lot of time.
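
A hedged starting point for that is DTrace's stable io provider (the same provider the disklatency script relies on), timing each block I/O from dispatch to completion, per device:

# per-device block-I/O latency, keyed on the bio pointer (arg0)
dtrace -n 'io:::start { ts[arg0] = timestamp; }
    io:::done /ts[arg0]/ { @[args[1]->device_name] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'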
Comment 37 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 16:44:19 UTC
Created attachment 214454 [details]
Stick SD_64G: Output of `usbtest`

As asked by HPS previously.
Comment 38 Hans Petter Selasky freebsd_committer freebsd_triage 2020-05-13 17:30:36 UTC
usbtest shows that the device is not USB compliant. Try to disable this test before running again:

        7) Toggle try invalid SCSI command: <YES>

Make sure the stick is re-plugged.

--HPS
Comment 39 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 18:23:05 UTC
Created attachment 214458 [details]
Stick SD_64G: 2nd output of `usbtest`

(In reply to Hans Petter Selasky from comment #38)

Done. In fact, I did two tests, unplugging and replugging the stick each time.

The first test is what you asked for; the result seems pretty similar, except for an additional timeout.

So I tried a second one, with Read and Write and 8-block I/Os. Apparently the same results, except that some number after "ERROR: Data transfer" differs for "LBA limit = 32767".

Are you confident that the stick is not USB compliant? Could it be a problem with `usbtest`? What could I test next?
Comment 40 Hans Petter Selasky freebsd_committer freebsd_triage 2020-05-13 18:51:04 UTC
        8) Toggle try invalid wrapper block: <YES>

Turn off all negative testing.

If you cannot get usbtest to read data reliably from your device, I wouldn't use it for anything real.

--HPS
Comment 41 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 20:30:42 UTC
Created attachment 214464 [details]
Stick SD_64G: 3rd output of (multiple) `usbtest`

(In reply to Hans Petter Selasky from comment #40)

I tried various combinations. The first that seemed to work was indeed with negative tests off, including the last HBA access. Unfortunately, I ran it again immediately and it failed, so the choice of options alone does not guarantee success. I managed to get one read-only test to work correctly. Also, one Write Once, Read Only run, just before the last test, seemed to have worked partially.

Next, I'm going to test one of the other sticks I mentioned earlier, KT_32G, from another manufacturer, to see if I get such inconsistent behavior.

While I can understand your advice, I will not easily accept that so many different sticks on so many boards can fail in apparently the same way, especially given that:
1. As far as I remember, they work very well with other OSes (Windows, Linux).
2. Also, even when writing to them with FreeBSD, however inconvenient the stalls make it, I never experienced any form of data corruption.
3. They are from different mainstream vendors with generally good reputations. (Yes, I know what reputation is worth, and that hardware manufacturers love not to follow standards exactly, even when the standards are arguably ambiguous.)
4. They were relatively expensive, at least not toy crap sometimes distributed at events. (Obviously, this is not a strong argument in itself, but adds to the rest.)

Now, you're free to challenge that, in which case I'll be happy to prove additional points 1 & 2 by doing serious tests on different OSes, and also checksumming copied files, so that we can swap bad remembrance and impressions for hard facts.

The apparently random nature of the results with `usbtest` so far, assuming that I ran the tests correctly (I don't see how I could not have), could indicate that initialization at plug-in time is not performed correctly. That could be caused by a race of some sort at initialization, or some expected interaction simply missing, causing wild controller behavior on the stick. Yes, this is speculative, but maybe it will ring a bell if, e.g., you have stumbled on such problems in the past. Even if the general USB and umass code is flawlessly implemented and follows the standard, lots of manufacturers may not completely abide by it, and apparently other OSes are somewhat aware of this (again, to be confirmed with hard facts). Some months ago, I tried every possible USB quirk with `usbconfig`, but could not obtain any reliable improvement. I'm open to suggestions to redo some tests and log the results.

Finally, I noticed several USB backports to 12 after r355888. Do you think that any of them could influence the results? I'm in particular wondering about r358873. I plan to test the latest stable/12, but this probably won't happen before next week.
Comment 42 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 23:00:47 UTC
Created attachment 214466 [details]
Stick KT_32G: 1st round of outputs from `usbtest`

With stick KT_32G, I get reliable results. I've only exercised reads for now.

In short, I've disabled all negative tests. I also activated Sense on error, although this seems to have no effect (more tests would be needed to prove it for sure). If I leave "Random" for the I/O size, the test appears to stall from the start, fails after tens of seconds, and the device is disconnected then reconnected. If I choose "Increasing", "1 block" or "1024 blocks", the tests complete successfully.

In addition, there is a troubling fact. KT_32G is not recognized on the case's front plugs, as reported earlier, but now one of the MB's hubs doesn't recognize it anymore either (it did before; I tested the stick plugged there, see SD_64G_usbtest_3.txt). The machine has not been rebooted at all since I started testing and collecting data.
Comment 43 Olivier Certner freebsd_committer freebsd_triage 2020-05-13 23:23:36 UTC
Created attachment 214468 [details]
Stick SD_128G: 1st round of outputs from `usbtest`

With stick SD_128G, read tests work, but no tests involving writes seem to work. I used the case's front plugs.
Comment 44 Sebastien Boisvert 2020-05-13 23:39:36 UTC
(In reply to Olivier Certner from comment #41)


Hi Olivier and Hans,

I agree with Olivier that it is unlikely that so many USB sticks from so many
manufacturers end up failing in mostly the same way, yet work fine on
other operating systems.

I tried all my USB sticks with my mobo, which has an AMD Promontory B450 chipset.
All these USB sticks work very slowly on the USB 3.0 bus in FreeBSD 12.1,
but on the same system they work well on Ubuntu 18.04.

As you may know, I retired my USB sticks from production on my server.
I now boot on ZFS using a NVMe device.

Let me know if I can run some tests on my machine.

I think that one important test would be to try a USB 3.0 stick on an old machine
that has USB 3.0 support.

My oldest machine is a Lenovo Thinkpad T430s.

I will try that later: use rsync to transfer /usr from FreeBSD 12.1 root to a USB 3.0 device.

Thanks guys
Comment 45 Olivier Certner freebsd_committer freebsd_triage 2020-05-15 14:45:05 UTC
Hi Sébastien and Hans,

Contrary to what I recalled (or something changed since then), SD_64G and SD_128G indeed seem bad. I did some tests to verify the behavior on other OSes, and both sticks exhibit the same transfer pattern on recent Linux and Mac OS X machines, i.e., usually 2 or 3s of decent bandwidth, followed by 2 or 3s without transfer plus some "probe" transaction, often repeated. So I'm simply going to RMA them.

The only difference with other OSes is that the apparent bandwidth is somewhat higher, though still low (USB 2.0 speeds or less). KB/t appears to be frequently near 1024, where it is around 128 on FreeBSD (but then with higher tps); the overhead of smaller transactions may explain the difference.

Will test KT_32G later to see how it fares.

Hans, thanks for `usbtest`.

A few questions: In `usbtest`, does <Random> for the I/O size have any alignment constraint on the size (like multiples of some power of 2, or only powers of 2)? And what about <Increasing>? In umass, is it possible to have transactions longer than 128KB?

Thanks.
Comment 46 Hans Petter Selasky freebsd_committer freebsd_triage 2020-05-15 20:50:21 UTC
Hi,

> Hans, thanks for `usbtest`.

You are welcome!

> constraint on the size (like multiples of some power of 2, or only powers of 2)? 

I don't remember. You'll need to check the code.

You'll get best performance with USB 2.0 using 64Kbyte transactions, and for USB 3.0, 1 Mbyte and above.

--HPS
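
A quick, destructive way to compare those two transaction sizes against the raw device (a sketch; this overwrites the stick, and da0 is a placeholder for the stick's device; note that on 12.x the kernel splits raw-device I/O at MAXPHYS, 128 KiB by default, so bs=1m is still issued as 128 KiB transactions):

# 1 GiB written as 64 KiB requests, then as 1 MiB requests
dd if=/dev/zero of=/dev/da0 bs=64k count=16384
dd if=/dev/zero of=/dev/da0 bs=1m count=1024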
Comment 47 Maurizio 2020-05-18 09:49:20 UTC
Created attachment 214596 [details]
Output of usbtest
Comment 48 Maurizio 2020-05-18 09:54:37 UTC
The previous comment's attachment is the output of usbtest on a slow Kingston DataTraveler G4 32GB.
Comment 49 Olivier Certner freebsd_committer freebsd_triage 2020-05-18 13:39:22 UTC
(In reply to Hans Petter Selasky from comment #46)

Going to check the code of `usbtest` for what <Random> does exactly.

> You'll get best performance with USB 2.0 using 64Kbyte transactions, and for USB 3.0, 1 Mbyte and above.

Thanks for the advice. It seems, however, that I cannot trigger transactions greater than 128KiB whatever method I use (`dd` to the disk device, `dd` to a single file, `cp`), and apparently some utilities lead to very small, suboptimal transaction sizes (e.g., 4KiB when using `hd`). Do you know of something simple that could be changed to raise the transaction size? I fear this is related to the old MAXPHYS debate, in which case it won't be simple...
Comment 50 Hans Petter Selasky freebsd_committer freebsd_triage 2020-05-18 13:44:30 UTC
It depends on the speed which the USB device is enumerated at.

64Kbyte is the legacy default for USB 2.0 devices. Some devices break if you try more, even if SCSI supports it.

--HPS
Comment 51 Olivier Certner freebsd_committer freebsd_triage 2020-05-18 14:05:19 UTC
(In reply to Hans Petter Selasky from comment #50)

Indeed, I observed the 64KiB limit for USB 2.0. I was talking about USB 3.0 sticks, for which I cannot seem to get transactions bigger than 128KiB.

Both Linux and Mac OS X, according to `iostat`, seem to use 1MiB transactions on these, and seem to get a little more bandwidth for sequential access (to a file, not to the disk directly, so FS clustering comes into play, and probably several other things). Unfortunately, I did not keep logs of these tests (on SD_64G and SD_128G, see comment #45), and I don't recall the exact bandwidth advantage. I'm doing a new round of tests with KT_32G, and at least on Mac OS X, I get comparable performance (a bit lower, in fact) despite the different transaction size. Anyway, the speed differences here are second-order to what prompted the opening of this bug, and possibly not even relevant to non-sequential access.
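
The cap can be observed directly (a sketch; da0 is a placeholder): request 1 MiB writes and watch iostat's KB/t column, which on 12.x stays near 128 because physio requests are split at MAXPHYS (128 KiB by default):

# terminal 1: write with a 1 MiB block size
dd if=/dev/zero of=/dev/da0 bs=1m count=1024
# terminal 2: KB/t should sit near 128 despite the 1 MiB block size
iostat -x -w 1 da0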
Comment 52 Olivier Certner freebsd_committer freebsd_triage 2020-05-18 14:35:12 UTC
(In reply to Maurizio from comment #48)

Thanks Maurizio. I get roughly the same output as yours with my KT_32G (Kingston DataTraveler 100 G3, showing up as <Kingston DataTraveler 3.0> in kernel logs), *provided* that I set the I/O size to anything but "Random" (I need to investigate this).

In this case, the read speed reported by `usbtest` is fairly stable at 110MiB/s, whereas the write speed is quite bumpy and ranges from 5 to 40MiB/s. To me, your trace essentially shows the same performance profile. The write speed in particular is not lightning fast, but could be considered acceptable.

What is very strange, however, is that an `iostat` recording of a `dd` to exFAT shows the transfer again proceeding at decent speed for ~2s, then stalling for ~2s, with the pattern going on and on. And I observe this on Linux and Mac OS X as well.

Hans, do you consider this pattern normal? I have never observed it with USB 2.0 sticks so far.

Maurizio, if you have time, could you try the `dd` (with, e.g., parameters `if=/dev/zero of=<file_on_stick> bs=1M count=1024`) test above and capture the output of `iostat -w 1 <stick_device>` during the test?
Comment 53 Olivier Certner freebsd_committer freebsd_triage 2020-05-18 15:07:21 UTC
(In reply to Sebastien Boisvert from comment #44)

Hi Sébastien,

And first of all, I wanted to say this earlier, and maybe now it matters even more: I did not want to hijack your initial bug report in any way, so I hope you don't feel that way. Given your initial reports and my experiments, I thought that we were seeing the exact same problem (at least, the same symptoms). Since then, more testing has produced a much more contrasted picture, at least on my side. As said in comment #45, SD_64G and SD_128G seem not to work under `usbtest` (the first at all, the second when writing), so they may just be bad, and I'm trying to RMA them (initial request sent). The situation for KT_32G seems more complex. It reliably passes `usbtest`, *except* when the I/O size is set to "Random" (the default), in which case no transfer happens and the stick simply disconnects from the bus. Plus, `iostat` reports a pattern that I find quite weird. If you're OK with it, I propose that you do some tests with your USB 3 sticks (see below), to dig deeper and find out to what extent we are indeed seeing the same behaviors. If you're annoyed by my follow-ups, I will happily open another PR and put the information there.

So, it would be great if you could run the following tests on your USB 3 sticks, and in particular the two models of Kingston.
1. `usbtest` umass: at least write-only, read-only, and read+write, with I/O size Random and Increasing.
2. `dd` test (see comment #52), with `iostat -w 1 <stick_device>` running elsewhere (and reporting both outputs).

Thanks and sorry again if I bothered you.
Comment 54 Hans Petter Selasky freebsd_committer freebsd_triage 2020-05-18 15:13:41 UTC
> Hans, do you consider this pattern as normal? I never observed such one with
USB 2.0 sticks so far.

It might be that the device has an internal cache and at some point needs to erase the flash to be able to write new data.

--HPS
Comment 55 Gary Jennejohn 2020-05-18 18:12:19 UTC
(In reply to Hans Petter Selasky from comment #54)
Yes, this could be the case. I have a Lexar USB3 64GiB stick with only 13GiB free. When I write to it I see very slow transfers with short bursts of about 40MiB/s, repeatedly. Transfers were much faster when the stick contained less data. My assumption is that the controller in the stick is busy with internal bookkeeping much more frequently, which slows down the transfers considerably.
Comment 56 Sebastien Boisvert 2020-05-18 19:12:38 UTC
(In reply to Olivier Certner from comment #53)

Hi Olivier,

First, I think that your comments in this bug report are great.
I don't feel bothered at all. :-)

I will run the suggested tests on my Ryzen server.
Comment 57 Sebastien Boisvert 2020-05-18 20:03:29 UTC
Created attachment 214609 [details]
Kingston_DataTraveler_100_G3_32_GB_in_USB_2.0_port.dd+iostat
Comment 58 Sebastien Boisvert 2020-05-18 20:06:33 UTC
Created attachment 214611 [details]
Kingston_DataTraveler_100_G3_32_GB_in_USB_2.0_port.usbtest
Comment 59 Sebastien Boisvert 2020-05-18 20:12:15 UTC
Created attachment 214612 [details]
Kingston_DataTraveler_100_G3_32_GB_in_USB_3.0_port.dd+iostat
Comment 60 Sebastien Boisvert 2020-05-18 20:14:49 UTC
Created attachment 214613 [details]
Kingston_DataTraveler_100_G3_32_GB_in_USB_3.0_port.usbtest
Comment 61 Sebastien Boisvert 2020-05-18 20:19:41 UTC
I posted the following results:

- Kingston_DataTraveler_100_G3_32_GB
  - in_USB_2.0_port
    - usbtest     https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214611
    - dd+iostat   https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214609
  - in_USB_3.0_port
    - usbtest     https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214613
    - dd+iostat   https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214612
Comment 62 Olivier Certner freebsd_committer freebsd_triage 2020-05-19 08:41:54 UTC
(In reply to Sebastien Boisvert from comment #61)

Hi Sébastien,

`usbtest` on your side seems to fail very quickly. Could you re-run the test, this time disabling the negative tests, i.e., in the Mass Storage menu, make sure that 7 and 8 are set to NO?

For `iostat`, there is no useful output in the traces. You need to put the GEOM name directly on the command line, e.g., replace "/dev/da0" with "da0" (check that the output header line indeed has "da0" in it).

Thanks.
Comment 63 Maurizio 2020-05-19 09:35:30 UTC
Created attachment 214626 [details]
Kingston DataTraveler G4 32GB dd on fat32
Comment 64 Maurizio 2020-05-19 09:36:33 UTC
Created attachment 214627 [details]
Kingston DataTraveler G4 32GB dd on UFS 1/3
Comment 65 Maurizio 2020-05-19 09:37:01 UTC
Created attachment 214628 [details]
Kingston DataTraveler G4 32GB dd on UFS 2/3
Comment 66 Maurizio 2020-05-19 09:37:38 UTC
Created attachment 214629 [details]
Kingston DataTraveler G4 32GB dd on UFS 3/3
Comment 67 Maurizio 2020-05-19 09:38:40 UTC
Created attachment 214630 [details]
Kingston DataTraveler G4 32GB iostat on FAT32
Comment 68 Maurizio 2020-05-19 09:39:24 UTC
Created attachment 214631 [details]
Kingston DataTraveler G4 32GB iostat on UFS 1/3
Comment 69 Maurizio 2020-05-19 09:40:02 UTC
Created attachment 214632 [details]
Kingston DataTraveler G4 32GB dd on UFS 2/3
Comment 70 Maurizio 2020-05-19 09:41:39 UTC
Created attachment 214633 [details]
Kingston DataTraveler G4 32GB iostat on UFS 2/3
Comment 71 Maurizio 2020-05-19 09:42:14 UTC
Created attachment 214634 [details]
Kingston DataTraveler G4 32GB iostat on UFS 3/3
Comment 72 Maurizio 2020-05-19 14:18:33 UTC
(In reply to Olivier Certner from comment #52)
The stick, a Kingston DataTraveler G4 32GB, works well if formatted as FAT32, but badly if formatted as UFS, and performance degrades over time. I have written this script for testing the stick:

#!/bin/sh
set -x
STICK_DEVICE=/dev/da0
FILE_ON_STICK=/media/5ec390d1148aa331/zero.bin
COUNT=1024
tmux new-window "script /tmp/dd.log dd if=/dev/zero of=$FILE_ON_STICK bs=1M count=$COUNT; sleep 5; killall iostat"
tmux new-window "script /tmp/iostast.log iostat -w 1 $STICK_DEVICE"
set +x

where FILE_ON_STICK depends on the stick mount point.

I have run the script once, with good performance, with the stick formatted as FAT32; the log outputs are dd_fat32.log (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214626) and iostat_fat32.log (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214630). The dd instruction takes 49 sec.

After formatting the stick as UFS I have run the script 3 times; the dd instruction takes 57, 141 and 189 seconds. The logs are, in order:

dd_ufs.log (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214627 ),
iostat_ufs.log  (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214631),

dd_usf2.log  (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214628),
iostat_ufs2.log (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214633)

dd_ufs3.log and 
iostat_ufs3.log (https://bugs.freebsd.org/bugzilla/attachment.cgi?id=214634)

Running the script with the stick formatted as UFS, the laptop used for the test hangs a little.
Comment 73 Olivier Certner freebsd_committer freebsd_triage 2020-05-21 13:25:32 UTC
(In reply to Maurizio from comment #72)

Hi Maurizio,

Thanks for the news. However, the interesting `iostat` output is actually missing: you have to use the GEOM (provider) name as the argument (i.e., use `iostat -w 1 da0` instead of `iostat -w 1 /dev/da0`).

I've not tried FAT myself; I will, out of curiosity.

I usually see the degradation you are describing as well. I'll give more details in a subsequent comment.
Comment 74 Maurizio 2020-05-22 07:56:27 UTC
Created attachment 214747 [details]
dd + iostat on UFS - DataTraveler G4 32GB

(In reply to Olivier Certner from comment #73)

Hi Olivier,

I have updated the script with the GEOM name:

#!/bin/sh
set -x
STICK_GEOM_NAME=da0
FILE_ON_STICK=/media/5ec77cd141597fbc/zero.bin
COUNT=1024
tmux new-window "script /tmp/dd.log dd if=/dev/zero of=$FILE_ON_STICK bs=1M count=$COUNT; sleep 5; killall iostat"
tmux new-window "script /tmp/iostast.log iostat -w 1 $STICK_GEOM_NAME"
set +x

I have reformatted the stick as UFS and run the test 3 times.
Please let me know if you need more tests.
Comment 75 Olivier Certner freebsd_committer freebsd_triage 2020-05-22 15:47:21 UTC
I've tested SD_128G again, more thoroughly this time.

When copying '/usr' to a UFS partition, I get lots of stalls, and long ones (~30s). Copying to an exFAT partition gives a similar iostat profile, with the difference that stalls amount to ~74% of the total test duration, versus ~83% for the UFS test. Also, practically all exFAT accesses are 128KiB transactions, whereas for UFS the median and average transaction sizes are around 70KiB. Not counting stalls, average bandwidth is approximately 48MiB/s for exFAT and 25MiB/s for UFS (12.3MiB/s vs. 4.3MiB/s including stalls). For these two metrics, the ratios between the two FSes are close (1.82 and 1.89). A priori, I would have expected the second ratio to be lower than the first, because smaller transactions should be compensated in part, I thought, by more transactions per second. But TPS is not that different between the two tests (median: 440 vs. 466; average: 381 vs. 566; stalls not counted). More precisely, for transactions averaging close to 128KiB, TPS is around 220 for UFS, whereas it is around 224 for exFAT (so no significant difference), whereas bandwidth degrades a lot for smaller transactions, because there TPS does not compensate at all (the computed equivalent bandwidth-preserving TPS for 128KiB transactions becomes ~34 on average). The worse performance of UFS cannot be attributed to possible stick "degradation", since the UFS test was actually performed *before* the exFAT one.

Following up on Hans's and Gary's responses, I was wondering how the controller could know which blocks of the stick are actually used or not. For the record, I tried using `camcontrol sanitize` and `camcontrol security`, but neither worked (I don't know whether it's because umass SCSI's XPT doesn't allow/implement them, because they are explicitly not supported in the UAS standard, or because the sticks don't implement them). So what else is possible? Some ideas: The controller could notice when sectors full of zeros are written, keep track of that, and free the underlying storage. Or, which I find dubious, the controller could interpret (ex)FAT metadata and thus know which blocks are actually used. Without going that far, it could interpret writes to the MBR (or GPT) area to know which partitions are currently allocated (but semantically this is more dangerous, so even more dubious, although simpler to implement).
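
[Editorial note: the exact invocations were not recorded; assuming device da0, they would have been roughly of this form, per camcontrol(8):]

camcontrol sanitize da0 -a block   # failed on these sticks
camcontrol security da0            # query the (ATA) security state; also failed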

From additional experiments, I observe behavior compatible with the first possibility (or some variant of it), and can also rule out the third. For the first, I simply `dd`'d the whole stick with zeros. After that, a similar UFS `cp` experiment shows an average bandwidth of 7.8MiB/s, almost twice the 4.3MiB/s obtained above. Transfer is stalled "only" for 71% of the whole test duration. This also rules out the third possibility, since I had done other UFS tests before the one reported above, and no performance jump was observed then.

What I'm working on now is somehow making FreeBSD access the stick with bigger transactions, because the results reported at the start of this comment are compatible with the stick handling small transactions badly, and because macOS uses 1MiB transactions (so manufacturers may optimize for that; Linux does as well). Also, flash memory is structured in blocks of several pages, and erasure can only happen at block granularity, with current blocks usually larger than 128KiB (I'm not a specialist, I just read Wikipedia), so a partial write may trigger full erasures and rewrites, degrading performance.

I did a first experiment with the UFS block size increased to 64KiB (which happens to be the maximum value UFS currently accepts), and thus the fragment size to 8KiB, which I expected would increase the minimum (and average) transaction size. And indeed, results are a bit better. I get 9.8MiB/s on average (vs. 7.8MiB/s with the default block size as reported above), and transfers are stalled "only" 56% of the time. Indeed, the minimum transaction size observed is now 8KiB instead of 4KiB, but to my surprise, average and median transaction sizes actually decrease a bit (51.5 and 25.6KiB vs. 49 and 22.8KiB; stalls omitted). So it seems that it's the minimum transaction size that really matters, and somehow increasing it reduces stalls, which could indeed hint at limited/poor caching when writing to non-full blocks. Still, I observe transaction sizes that can be any multiple of 8KiB up to 64KiB. I ran the same test a second time, after a full `dd if=/dev/zero` on the whole stick, and got worse results (8.3MiB/s on average, stalls ~63% of the time, 49.4KiB and 25.7KiB for average and median transaction sizes), which are not much of a clear improvement over UFS with the default block size. So the 64KiB block size tests don't prove decisive to me at this point.
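
[Editorial note: in newfs(8) terms, that experiment looks like this; the partition name da0p1 is an assumption:]

newfs -U -b 65536 -f 8192 /dev/da0p1   # soft updates, 64KiB blocks, 8KiB fragments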

MAXPHYS limits us to 128KiB, but without touching it I thought I could at least try to have all transactions be 128KiB, as happens with exFAT. So I turned to GEOM_NOP and then GEOM_CACHE, thinking they could do the job out of the box, but in fact they can't. GEOM_NOP lets one set the block size, but then every I/O request's size must be a multiple of this block size, so raising the block size simply makes newfs fail. GEOM_CACHE is closer to what I would like, but it only caches reads, not writes; plus, while testing it, it deadlocked the whole GEOM subsystem. So I'm now understanding, fixing and extending it, in the hope that I can finally force 128KiB transactions (at the price of read-modify-write) and run more tests... All this because I want to see whether the speed of UFS on these sticks can be significantly improved with software tuning, to work around potentially limited hardware.
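
[Editorial note: a sketch of the GEOM_NOP attempt described above; the provider name is an assumption, flags per gnop(8):]

gnop create -S 65536 /dev/da0p1   # exposes /dev/da0p1.nop with a 64KiB "sector" size
newfs -U /dev/da0p1.nop           # fails: UFS issues I/O smaller than 64KiB
gnop destroy /dev/da0p1.nop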

For the RMA, the guys from SanDisk want proof that copying is indeed slow when using exFAT. Copying '/usr' to an exFAT partition happened at 12.3MiB/s (see the first paragraph), not much better than the latest UFS tests, but this was before `dd`ing the whole stick with zeros. So I guess I now have to do this experiment again, to see how much I can get from exFAT. Not sure if the SanDisk guys will consider ~12MiB/s slow enough to warrant a replacement. I could couple the argument with the "sticks don't respect standards" angle (prior `usbtest`), but I doubt it will add much weight. Anyway, I find it amazing that they can market write speeds up to 100MiB/s when we can reach only ~1/10th of that with (not so) small files.

Maurizio, Sébastien, if you're interested, you can try to `dd if=/dev/zero` your sticks (see the sketch below) and then retry your experiments (with UFS+SU), to see if you get any significant speed improvement. Also, it may be important to transfer a large quantity of data, to see whether degradation/more frequent stalls indeed happen. Ideally, try to fill your sticks almost completely (which means very long tests). In fact, I usually stopped UFS tests after around ~18GiB had been transferred (around 30min duration). I realize this may have biased most results so far, so I'll try some bigger transfers later on. Anyway, arguably I haven't done enough tests for the results to be statistically significant, so more testing may change current interpretations. Assuming the behavior of SD_128G can indeed be understood, of course.
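
[Editorial note: concretely, assuming the stick is da0; adjust to your device:]

dd if=/dev/zero of=/dev/da0 bs=1m   # zero the whole stick (destroys all data)
# then repartition, newfs -U, mount, and re-run the dd test from comment #52
# with `iostat -w 1 da0` running alongside.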

I've kept all the raw iostat files used to produce the statistics in this post, but have not attached them to the PR because there are several of them again and I'm not sure that would be useful at this point. Feel free to ask if you want to see them.
Comment 76 Gary Jennejohn 2020-05-23 10:11:55 UTC
(In reply to Olivier Certner from comment #75)
Flash is complicated.  There's SLC (one bit per cell, fast but expensive), MLC (two bits per cell), TLC (three bits per cell) and QLC (four bits per cell, high storage density but slow).
The more bits per cell the more complicated it is for the controller to write data to the flash.
I tested with a 128GB USB3 stick from Medion.  Probably not the fastest stick on the market.  I suspect, but can't prove, that this stick uses TLC flash.
I must point out that I was using this stick for backups in a Linux project with the ext4 filesystem.  The stick was pretty full.
I then decided to reformat it with UFS.  Note that after formatting there were still lots of dirty sectors left over from using it for Linux backups.
I mounted the stick and used dd to copy a 3GiB file with bs=128k to the UFS file system.  The average transfer rate was 3.95MiB/s.
I then unmounted the stick to get rid of the file contents in the buffer cache and remounted it.
Then I used dd with bs=128k to copy the file to /dev/null.  The average transfer rate was 104.6MiB/s.
So, I still suspect that the controller was forced to do lots of bookkeeping to copy data to the flash.
Since flash is initialized with all 1s when manufactured, and setting a 1 to a 0 is easy for the controller while setting a 0 back to a 1 is much more time consuming, an interesting test would be to fill all sectors with 1s and then do a transfer test. Unfortunately, there's no /dev/one available, so this would be rather challenging to do with a high-capacity stick.
There may be a command which can be sent to the controller to erase all sectors at once (set all bits to 1), but I'm not sure about that.
Comment 77 vidwer+fbsdbugs 2020-05-23 19:51:42 UTC
(In reply to Gary Jennejohn from comment #76)
> Unfortunately, there's no /dev/one available, so this would be rather challenging to do with a high-capacity stick.

There's another way: "yes 1 | tr -d '\n' | dd of=foo bs=$((1024 * 1024)) count=$((1024 * 1024))", which works in bash and /bin/sh. Not with /bin/csh, however.
(via https://www.google.com/search?q=dd+repeat+input -> https://stackoverflow.com/questions/20714264/use-dd-to-write-specific-file-recursively )
Comment 78 Gary Jennejohn 2020-05-24 07:53:41 UTC
(In reply to vidwer+fbsdbugs from comment #77)
This is an interesting idea.  Unfortunately, "yes 1" produces an ASCII '1' = 0x31. It would be feasible to write a bit of C code which writes out an array of 0xff (all 1s) to a file and then use dd to overwrite the USB stick with that file. That would be fairly simple.  If the file were exactly 1 million bytes in size, then it would be simple to figure out what count should be, since manufacturers size things in GB, which is 1000 times one million.  So, for example, with a 128GB stick the command would be dd if=file of=/dev/daX count=128000.
Comment 79 Gary Jennejohn 2020-05-24 11:00:57 UTC
(In reply to Gary Jennejohn from comment #78)
Well, using dd cannot work because the transfer stops as soon as EOF on the input file is reached.  So, the only way to do it is to write some simple C code to write directly to /dev/daX.  I did that and am now filling the 128GB USB stick with 0xFF at about 2.6MiB/sec.  The speed is limited because I'm only writing 1024 bytes at a time with O_SYNC set in open().  Once that finishes I'll do some testing.
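
[Editorial note: a shell-only alternative sketch, not what Gary actually ran; the device name da0 is an assumption:]

dd if=/dev/zero bs=1m count=1 | tr '\0' '\377' > /tmp/ones.bin   # 1MiB of 0xFF bytes
while cat /tmp/ones.bin; do :; done | dd of=/dev/da0 bs=1m
# cat exits non-zero (SIGPIPE) once dd stops at the end of the device, ending the loop.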
Comment 80 vidwer+fbsdbugs 2020-05-24 11:10:03 UTC
Right. As a newbie, I couldn't find where in sys/ the device special file/node for /dev/zero is created.
Comment 81 Olivier Certner freebsd_committer freebsd_triage 2020-05-24 11:23:07 UTC
(In reply to Gary Jennejohn from comment #79)

I didn't report it, but when `dd`ing files to the FS, I was in fact using `yes` as the input, precisely to avoid writing only zeros.

You can use `dd`; just look up "iflag=fullblock" in the man page. Then substitute `yes` with a repeated echo, which should accept a character given as hex, if I remember correctly.
Comment 82 Gary Jennejohn 2020-05-24 11:38:53 UTC
(In reply to vidwer+fbsdbugs from comment #80)
It's in /usr/src/sys/dev/null/null.c.  But /dev/zero uses a zero_region which is pre-filled with 0s for efficiency.  Not really a good idea to hack around in the kernel without lots of experience IMHO.
Comment 83 Gary Jennejohn 2020-05-24 11:42:58 UTC
(In reply to Olivier Certner from comment #81)
Yeah, that would be one approach.  But I've been developing C code for at least 40 years and it was easier for me to simply write a little program to do it.  It really was quite trivial.
Comment 84 Maurizio 2020-05-24 17:25:53 UTC
Created attachment 214817 [details]
dd + iostat on UFS+SU after zeroing the stick - DataTraveler G4 32GB

(In reply to Olivier Certner from comment #75)

Hi Olivier,

After running 'dd if=/dev/zero of=/dev/da0 bs=1m' I ran the test from comment #74 again. The result is better, but note that in my previous tests the stick was formatted without SU.
Comment 85 Olivier Certner freebsd_committer freebsd_triage 2020-05-25 09:40:52 UTC
(In reply to Gary Jennejohn from comment #76)

Yes. You're comparing write performance with read performance, and indeed there is a significant difference. Personally, I'm more concerned with write performance, and especially to a UFS filesystem. I have always assumed that read performance would be better than write performance; that's why I'm concentrating on writes. Another reason is that, in my use case, the stick is most of the time entirely overwritten when written to, whereas when read (much more often), it is only read partially (usually a small to very small part). And the problem is that write performance is quite low, meaning that it can take 8 hours to fill a 64GB stick at the initial performance I reported (~2.5MiB/s). I remember getting better performance than that out of older USB 2.0 sticks (even not so small ones, e.g., 16GB; maybe a test would be in order).

Indeed, I read that the erased state of flash is to have all bits set to 1. But this describes the internal mechanism, not necessarily the externally visible one. Usually, new sticks come formatted, so the "natural" state is probably not observable directly, even on a just-bought stick. 0 is so universally used as the absence of data that I would not be surprised if some controller logic actually reversed bits for storage, so that 0-filled sectors are in fact stored as 1-filled ones. Provided, of course, that they are stored at all (such uniformly filled sectors may be handled differently by the firmware, I don't know). The fact that zeroing the whole stick before access indeed increases performance could be a consequence of that.

I'm quite interested in the result of your tests, i.e., the bandwidth comparison before and after having overwritten the stick with 0xFF. If bit-reversal is indeed present, and there is no sophisticated management mechanism for "empty" cells, then your test could actually show no performance improvement. If it does, then something else is going on. In any case, we'll have learned something.
Comment 86 Olivier Certner freebsd_committer freebsd_triage 2020-05-25 10:53:33 UTC
I did a new test on SD_128G with exFAT, after zeroing the whole stick, using `cp -af /usr`. I now get 23.1MiB/s globally, whereas in the previous test I reported 12.3MiB/s. Average bandwidth when not stalled is a bit better (53.3MiB/s vs. 48MiB/s previously). Stall duration was ~57% of the total test duration (vs ~74% before).

So, this:
1. Confirms that zeroing the stick before writing indeed improves performance a lot: Bandwidth almost doubles for exFAT, which was also the case for UFS (see comment #75, 4.3 => 7.8MiB/s).
2. Confirms the wide gap between exFAT and UFS performance (almost 3x better bandwidth for exFAT) when compared in a (more) similar setting.
Comment 87 Gary Jennejohn 2020-05-25 11:23:31 UTC
(In reply to Olivier Certner from comment #86)
I only used dd for my testing.  A file system adds lots of overhead and does not give a clear picture of the raw performance.
I used a Kingston DT Rubber 3.0ge 16GB on both FreeBSD and Linux.  I used it because it's easy to fill up such a small stick.
Unfortunately, the FreeBSD XHCI driver does not correctly attach the stick (probably a bug) whereas USB2 has no problems with it.  Under Linux XHCI is happy to use it.
This is a rather old stick (9 or 10 years) and Kingston claims that it can only write at a maximum of 30MB/s.  A test report I saw showed that the stick can actually get up to about 40MB/s.
For the dd test I wrote two different approximately 16GB files to the stick one after the other.  I tried both bs=128k and bs=256k, but that had no effect on the transfer rates under FreeBSD or Linux.
In any case, whether the stick was filled with 0xFF or all sectors were already dirty didn't change the write transfer rates.  Under FreeBSD (USB2) I saw about 33MB/s and under Linux (USB3) I saw about 40MB/s for each of the files.
I also used the 128GB (USB3) stick for the same dd tests under FreeBSD and also saw on the order of 40MB/s.  This stick probably was not full.
So, I guess the poor write performance isn't caused by the controller in the stick.  It's more likely that the filesystem overhead is the culprit, especially with UFS when the stick is getting full or when it's writing small files into fragments.
Comment 88 Olivier Certner freebsd_committer freebsd_triage 2020-05-25 14:26:58 UTC
(In reply to Gary Jennejohn from comment #87)

Thanks for the tests. I expected that a pre-fill with 0xFF would not change the result, and you seem to confirm that. Also, you're seeing better performance under Linux, which may be due to the transaction size used there (1MiB), but it could also be that USB 2 is saturating on FreeBSD and that this is in fact the dominant effect. I did not have any XHCI problem on my side with any of the USB 3 sticks. I don't think I made a thorough raw `dd` test under Linux, only a `dd` to exFAT, and I didn't see much difference (mentioned in comment #51; the FS influence would be low, since all writes were clustered, with only 128KiB transactions actually issued to the stick); at least I don't remember noticing the ~20% difference you are reporting. I will do some raw `dd` testing under Linux (and maybe macOS) in order to confirm this with my sticks (SD_128G at least).

There are several systems involved in the overall stick performance. By not testing the FS, you indeed focus on the hardware part, but I would argue that there is a possibility that it is not all of the story. By putting the FS into the equation, I'm trying to analyze what happens in real-world usage (which Sébastien's initial report was implicitly about, I think, and which matches my use case as well), and also the influence of the controller's performance at issuing read/write (or read/erasure) sequences and how it caches these and full flash block accesses. UFS indeed uses fragments, and thus issues smaller transactions than a full FS block, although the effect is mitigated somewhat by clustering: this is what I think you call FS overhead. I'm just trying to see if I can obtain better performance by arranging for this read-modify-write process to happen in software (through GEOM_CACHE), which will tell me whether there is some controller inefficiency or caching problem in the real-world equation; if there is, I'll get better performance with this setup (while somehow conserving space, though this is not really important to me at this point).

For a while I've been contemplating going down another, maybe simpler, route to raise the minimum transaction size: getting rid of fragments in UFS. It seems this is possible by setting the fragment size equal to the block size at FS creation (see the sketch below). So this is the test I'm performing right now: seeing whether performance is much improved with 32k-only blocks (and whether the minimum transaction size actually jumps to 32k).
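
[Editorial note: a minimal sketch of that newfs invocation; the partition name is an assumption:]

newfs -U -b 32768 -f 32768 /dev/da0p1   # fragment size == block size: no sub-block writes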
Comment 89 Olivier Certner freebsd_committer freebsd_triage 2020-05-26 10:03:18 UTC
The last test on SD_128G was done with UFS+SU, block size and fragment size both set to 64k. Unfortunately, I got lots of (apparently transient) write errors in `dmesg`, so I'm not sure whether the results are really representative. Either the stick is failing, or the controller still doesn't like the pattern of accesses presented to it.

Each transaction's size is at least 64k, sometimes 128k. Perhaps strangely, errors are reported mostly for 128k-aligned addresses (~94% of the time). This might be because the first write was to the upper 64k part of a 128k block. And since I'm not seeing these errors with exFAT, this would hint at a flash block size of precisely 128k. I did some more statistics on clusters of errors, but nothing especially interesting jumped out at me.

For the record, global bandwidth is again bad (4.8MiB/s; 28MiB/s not counting stalls), and stalls account for ~83% of the total test duration. Around 84GiB were copied. If I compute the same statistics after around ~18GiB have been copied, to make them more comparable with previous tests, I get 17.4MiB/s overall bandwidth, which is far better than the best bandwidth reported for UFS so far (around 2x). There is a steep bandwidth decrease after about 14GiB have been transferred. Prior to this point bandwidth is >= 20MiB/s, sometimes growing up to 50MiB/s. After this point, bandwidth is always below 10MiB/s, in general less than half of that (but with some spikes to 10MiB/s now and then).

The bandwidth decrease is much less clear with exFAT. It happens after a lot more data has been transferred (>= 68GiB, going from ~60MiB/s to ~30MiB/s), is not as dramatic (bandwidth can sometimes recover to up to 75% of the initial value), and may not be comparable, since I saw that for some reason files get overwritten (strange, I should figure this out).

Now going to try simple `yes | dd` tests on Linux and FreeBSD, after full stick zeroing.
Comment 90 Olivier Certner freebsd_committer freebsd_triage 2020-06-02 10:08:16 UTC
More tests with SD_128G. Stick is dd-zeroed entirely before each test.

1. Did the previous test again (UFS+SU, 64k blocks and fragments), to confirm it and get a feel for possible variation. Results are worse in the sense that the bandwidth dip happened much earlier, after less than 240MiB had been transferred (less than 11min after start). Bandwidth is approximately the same (4.4MiB/s overall, 27.8MiB/s without stalls). Got a lot more errors in dmesg.

2. Did the exFAT test again. Bandwidth is 18.7MiB/s (50.3MiB/s not counting stalls). This is a bit worse than the previously reported results (comment #86), but still compatible (and still much better than UFS, even with 64k blocks). More importantly, I see absolutely no dmesg errors at all. So the stick probably is not "failing"; it just doesn't handle 64k writes well, whereas 128k is OK.

3. Did a `yes | dd` over the whole stick on a very recent Linux laptop: got 129MiB/s! (The command is sketched below.) It seems I did not do a whole-stick `yes | dd` on FreeBSD, only one for about ~9GiB, for which I had seen 62.6MiB/s. So in fact the difference is huge, contrary to what I thought, and much bigger than that reported by Gary (comment #87).

I will do a full `yes | dd` on FreeBSD to see.
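
[Editorial note: for reference, the raw test is simply the following, device name assumed; `yes` supplies non-zero data so the controller cannot special-case all-zero sectors:]

yes | dd of=/dev/da0 bs=1m   # note: reads from a pipe may be short, so record sizes can vary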

Did not have time to make any progress on GEOM_CACHE, but I should be able to this week.
Comment 91 Olivier Certner freebsd_committer freebsd_triage 2020-08-01 22:37:27 UTC
Just a small word to say that I was busy with other things, and will be for the next month at least.

Also, I performed some more tests, which did not seem to give more insight than what we already have.

Finally, I studied GEOM_CACHE's code, and have only started modifying it. With the deadlock I mentioned in an earlier comment in mind, I thoroughly reviewed the code but did not find any direct bug there. The code makes some coherent assumptions about how the GEOM framework operates. It might be that some of these assumptions are wrong, but for now I have no particular reason to think so. I could not reproduce the deadlock by torture-testing CURRENT with GEOM_CACHE on top of an md-backed memory disk.
Comment 92 blaasdf 2023-01-17 11:35:40 UTC
Just wanted to add another data point as I _suspect_ I was hit by this very issue.

I'm using a USB 3.0 disk enclosure, an ICY BOX IB-AC703-U3, which the vendor claims can reach 6GB/s. I put a 1TB Patriot P200 2.5" SATA SSD in the enclosure. The disk's spec says 530MB/s read, 460MB/s write. On a Linux system I can reliably get 440MB/s write.

On a system installed with:
`FreeBSD 12.2-RELEASE-p10 b26f74b5984(HEAD) TRUENAS`
I get varying behaviour. The motherboard I use has USB 2.0 and USB 3.0 ports. No other USB devices are connected. I only connect the drive to USB 3.0 ports. I connect the drive after the machine boots fully. About 90% of the time I get the following message in dmesg:
```
ugen0.2: <USB 3.0 Device USB 3.0 Device> at usbus0
umass0 on uhub1
umass0: <USB 3.0 Device USB 3.0 Device, class 0/0, rev 2.10/3.01, addr 10> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x8100
(... redacted...)
da0: <IB-AC703 -U3 0301> Fixed Direct Access SPC-4 SCSI device
(... redacted...)
da0: 40.000MB/s transfers
(... redacted...)
```
and indeed the speed is < 40MB/s. 500GB of files of varying sizes were written at 20MB/s.


About 10% of the time I get the following message in dmesg:
```
ugen0.2: <USB 3.0 Device USB 3.0 Device> at usbus0
umass0 on uhub1
umass0: <USB 3.0 Device USB 3.0 Device, class 0/0, rev 3.00/3.01, addr 11> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x8100
(... redacted...)
da0: <IB-AC703 -U3 0301> Fixed Direct Access SPC-4 SCSI device
(... redacted...)
da0: 400.000MB/s transfers
(... redacted...)
```
and the speed is < 400MB/s. The same 500GB of files were written at 160MB/s and the bottleneck was on the source (i.e. the source was at max read capacity).

Notice that in the slow case the message says `rev 2.10/3.01`, whereas in the fast case it says `rev 3.00/3.01`.

There's nothing additional that I do between disconnecting and reconnecting the drive. I just do it multiple times until I see the right message.
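
[Editorial note: for what it's worth, the negotiated speed can also be checked without writing anything, assuming the ugen0.2 address from the dmesg output above:]
```
usbconfig | grep ugen0.2
# slow case shows "spd=HIGH (480Mbps)", fast case "spd=SUPER (5.0Gbps)"
```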

Let me know if I can provide any further diagnostics.
Comment 93 Hans Petter Selasky freebsd_committer freebsd_triage 2023-01-17 11:58:19 UTC
Hi,

If you don't see 400 MByte/s, maybe the disk has a problem enumerating at SuperSpeed (USB 3.x) and falls back to HighSpeed (USB 2.0, hence the ~40 MByte/s) instead.

Some more data points for so-called flash disks:

The highest write speed is obtained when a flash disk is low-level formatted before use. Otherwise you risk that flash areas need to be erased before they can be written, which takes additional time.

--HPS
Comment 94 blaasdf 2023-01-17 14:32:50 UTC
> Otherwise you risk that flash areas need to be erased before they can be written, which takes additional time.

Are you talking about write amplification? That's a very good point; I totally forgot about it, and no, I didn't wipe the drive before each attempt.

I suspect, though, that the problem occurs when attaching the drive; that's what made me think the problem is with the USB driver. What I mean is that once I connect the disk and see the message:
```
da0: 40.000MB/s transfers
```
the disk won't write at rates higher than 40MB/s. So I thought that during the init sequence the USB driver "negotiates the protocol version to use", and that's what's used afterwards.

I might be totally wrong here and this might be unrelated. Just wanted to share what I found in case it pushes things forward ;)
Comment 95 Hans Petter Selasky freebsd_committer freebsd_triage 2023-01-17 14:58:33 UTC
(In reply to blaasdf from comment #94)

> Are you talking about write amplification? That's a very good point; I totally forgot about it, and no, I didn't wipe the drive before each attempt.

What does "write amplification" mean?

There are two problems - yes:

1) Getting the disk to attach at the right speed.
2) Making sure the disk is low-level formatted before heavy write usage.

--HPS
Comment 96 blaasdf 2023-01-17 16:04:36 UTC
> What does "write amplification" mean?

"Because flash memory must be erased before it can be rewritten, with much coarser granularity of the erase operation when compared to the write operation,[a] the process to perform these operations results in moving (or rewriting) user data and metadata more than once. Thus, rewriting some data requires an already-used-portion of flash to be read, updated, and written to a new location, together with initially erasing the new location if it was previously used at some point in time. Due to the way flash works, much larger portions of flash must be erased and rewritten than actually required by the amount of new data. "

src: https://en.wikipedia.org/wiki/Write_amplification