Bug 277992 - mpr and possible trim issues
Summary: mpr and possible trim issues
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-27 16:00 UTC by mike
Modified: 2024-04-27 18:41 UTC
3 users

See Also:


Attachments
charting write throughput (90 second pauses between tests) (478.83 KB, image/png)
2024-04-08 12:51 UTC, mike
no flags
write performance of the tests (500.07 KB, image/png)
2024-04-27 18:41 UTC, mike
no flags

Description mike 2024-03-27 16:00:06 UTC
The thread https://lists.freebsd.org/archives/freebsd-hardware/2024-March/000094.html has most of the details. 

In summary, a set of WD Blue SA510 SSDs with the latest firmware as of March 2024 will eventually start throwing errors and detach from the controller when I copy and then destroy a zfs dataset with several million files.  It sort of feels like a TRIM issue, but I am not sure.  Moving the disks to the onboard SATA controller does not recreate the issue.

If I start with a low-level trim (trim -f /dev/daX), create a raidz1 zfs pool from four 1 TB WD disks, import a dataset of about 280GB (compressed) that contains many files (20+ million), do a zfs send original pool | zfs recv copy-of-pool, then zfs destroy copy-of-pool and repeat about 4 or 5 times, the drives in the pool will start throwing errors.

If I do a hard trim of the disks, I can start from scratch and again get 4 or 5 cycles before the errors.  Hence, it sort of feels like a broken TRIM issue?
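For reference, the cycle looks roughly like this (a minimal sh sketch; the pool name, dataset name, and device list are placeholders rather than the exact ones from this setup):

# low-level trim of each member disk, then build the pool
for d in da2 da3 da4 da5; do trim -f /dev/$d; done
zpool create tank raidz1 da2 da3 da4 da5
# ...populate tank/src with the ~280GB (compressed), 20+ million file dataset...
# then copy and destroy it repeatedly
for i in 1 2 3 4 5; do
    zfs snapshot tank/src@copy
    zfs send tank/src@copy | zfs recv tank/copy
    zfs destroy -r tank/copy
    zfs destroy tank/src@copy
done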

I tried with autotrim on and off, and a manual zpool trim <pool> between the zfs send | zfs recv tests, to no avail. When the disks are on the mpr controller I get errors such as:
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ae 28 00 00 08 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 0c cb 3f 00 00 00 e8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ad 28 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ac 28 00 00 f8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 40 07 df 88 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 3f 48 72 08 00 01 00 00
(da6:mpr0:0:16:0): CAM status: SCSI Status Error
(da6:mpr0:0:16:0): SCSI status: Check Condition
(da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, 
or bus device reset occurred)
(da6:mpr0:0:16:0): Retrying command (per sense data)
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2036 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 637 loginfo 
31110f00
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1242 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 979 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1243 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2091 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1612 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2093 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 152 loginfo 
31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2132 loginfo 
31110f00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 43 17 dc 88 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 43 00 00 00 50 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f6 80 00 00 68 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f5 80 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 12 28 00 00 f8 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 0f b0 00 00 88 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 02 96 7e 80 00 00 10 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 6f 5b 8d 68 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
(da5:mpr0:0:15:0): CAM status: SCSI Status Error
(da5:mpr0:0:15:0): SCSI status: Check Condition
(da5:mpr0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, 
or bus device reset occurred)
(da5:mpr0:0:15:0): Retrying command (per sense data)

The same tests with Samsung disks work without issue, or at least I was not able to recreate the error.

# mprutil show adapter
mpr0 Adapter:
       Board Name: INSPUR 3008IT
   Board Assembly: INSPUR
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.12.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (8.0 GB/sec)
        IOC Speed: Full
      Temperature: 56 C


I originally ran into this problem with the same series of LSI adapter, but it was not in IT mode and instead was using the mrsas driver.  

When on the ATA controller the disks are DSM_TRIM. When on MPR, they are ATA_TRIM.
Comment 1 mike 2024-03-27 16:01:10 UTC
Note, 
zfs send original pool | zfs recv copy-of-pool
should be
zfs send original dataset | zfs recv copy-of-dataset
Comment 2 Warner Losh freebsd_committer freebsd_triage 2024-04-05 21:28:12 UTC
"Power on reset" sense code means that the drive dropped off the bus (typically). Usually if you can eliminate flaky power (either bad connectors, or inadequate power to cope with maximum power draw), then you are left with "commands sent to the drive freaked it out". I always bet at least a nickel on 'TRIM'.

If you think that this is TRIM related, then you can try disabling TRIM at the da device level by changing the delete method to 'none':
kern.cam.da.0.delete_method: NONE

It may also be too large a TRIM; you can see what delete_max is, and try reducing it (I usually go by 1/2 when looking for problems like this). I've never had to do this, but it is tunable, and 'too large' is a known issue, or used to be years ago:
kern.cam.da.0.delete_max: 1048576
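For example (a rough sketch only; da0 is a placeholder unit, and the current values will differ per drive):

# look at the current settings
sysctl kern.cam.da.0.delete_method kern.cam.da.0.delete_max
# disable BIO_DELETE for testing
sysctl kern.cam.da.0.delete_method=NONE
# or keep TRIM but halve the maximum delete size
cur=$(sysctl -n kern.cam.da.0.delete_max)
sysctl kern.cam.da.0.delete_max=$((cur / 2))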

You can leave the trim type at its normal setting to trim everything, and then turn it off on the drive to do the zfs send / receive. For testing purposes, trim doesn't matter in terms of drive life (it only matters when you write to the drive day in / day out and keep some parts of the drive unused for more than, say, a few hours or a day).

It may also be that weird commands are being sent by something like smartd, etc. If you can reproduce this, then we may be able to adapt a prototype dtrace script I have into a tcpdump-like tool. We can use it to dump the last N commands when we get a unit attention.

Now, having said all this, I think that this sense code might be mishandled right now. Regardless of how we got here, I think that this should pause the I/O and reset the parameters we've set in the drive so its state matches what the driver expects (though I think only the write-through cache and similar settings might matter).
Comment 3 mike 2024-04-06 01:02:16 UTC
(In reply to Warner Losh from comment #2)

Thanks for looking, Warner!  The disks are in the same server, either in the SAS enclosure/expander or off the motherboard. I am pretty sure power is OK, as the Samsungs are fine in the same server. The variable seems to be moving the WDs from the mpr controller to the motherboard SATA controllers.

I will try disabling TRIM as well as adjusting the TRIM max size. Note, I can put the hardware in the netperf cluster if you want to have a direct look at it too. I will take a look next week when I am back at the office.
Comment 4 mike 2024-04-07 20:42:01 UTC
With max trim set to 1/4 of the default value, I still get random errors. In this case, I set all 4 drives to:
 sysctl -w kern.cam.da.7.delete_max=4294901760
kern.cam.da.7.delete_max: 17179607040 -> 4294901760



Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 99 08 00 00 40 00 length 32768 SMID 844 Command timeout on target 49(0x000d), 60000 set, 60.788719903 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel: mpr0: Sending abort to target 49 for SMID 844
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 99 08 00 00 40 00 length 32768 SMID 844 Aborting command 0xfffffe02068b87a0
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): READ(10). CDB: 28 00 0b 07 eb d0 00 00 c8 00 length 102400 SMID 1453 Command timeout on target 49(0x000d), 60000 set, 60.933970425 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 40 a5 23 c8 00 00 10 00 length 8192 SMID 443 Command timeout on target 49(0x000d), 60000 set, 60.1019647685 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 98 e8 00 00 20 00 length 16384 SMID 113 Command timeout on target 49(0x000d), 60000 set, 60.1101031222 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 97 e8 00 01 00 00 length 131072 SMID 1395 Command timeout on target 49(0x000d), 60000 set, 60.1184073991 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 96 e8 00 01 00 00 length 131072 SMID 1364 Command timeout on target 49(0x000d), 60000 set, 60.1266419429 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 4c 35 79 68 00 00 08 00 length 4096 SMID 1323 Command timeout on target 49(0x000d), 60000 set, 60.1337927805 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 4c 35 79 20 00 00 48 00 length 36864 SMID 1506 Command timeout on target 49(0x000d), 60000 set, 60.1422928220 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 4c 35 79 10 00 00 08 00 length 4096 SMID 343 Command timeout on target 49(0x000d), 60000 set, 60.1504915373 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:51:57 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 40 a5 23 d8 00 00 10 00 length 8192 SMID 830 Command timeout on target 49(0x000d), 60000 set, 60.1586143647 elapsed
Apr  7 13:51:57 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:00 r-14mfitest kernel: mpr0: mprsas_prepare_remove: Sending reset for target ID 49
Apr  7 13:52:00 r-14mfitest kernel:     (da3:mpr0:0:49:0): READ(10). CDB: 28 00 0c 05 97 18 00 00 30 00 length 24576 SMID 1035 Command timeout on target 49(0x000d), 60000 set, 60.52537889 elapsed
Apr  7 13:52:00 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:00 r-14mfitest kernel:     (da3:mpr0:0:49:0): READ(10). CDB: 28 00 0b 07 ee 38 00 00 08 00 length 4096 SMID 615 Command timeout on target 49(0x000d), 60000 set, 60.133221431 elapsed
Apr  7 13:52:00 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:00 r-14mfitest kernel:     (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 4c 35 79 70 00 01 00 00 length 131072 SMID 1049 Command timeout on target 49(0x000d), 60000 set, 60.214511381 elapsed
Apr  7 13:52:00 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:00 r-14mfitest kernel:     (pass3:mpr0:0:49:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1789 Command timeout on target 49(0x000d), 60000 set, 60.158235152 elapsed
Apr  7 13:52:00 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1323 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 830 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: (da3:mpr0:0:49:0): Invalidating pack
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1453 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 443 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1364 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 343 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 113 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1395 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1506 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1789 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1035 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 615 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Controller reported scsi ioc terminated tgt 49 SMID 1049 loginfo 31140000 departing
Apr  7 13:52:01 r-14mfitest kernel: mpr0: No pending commands: starting remove_device for target 49 handle 0x000d
Apr  7 13:52:01 r-14mfitest kernel: mpr0: clearing target 49 handle 0x000d
Apr  7 13:52:01 r-14mfitest kernel: mpr0: At enclosure level 0, slot 3, connector name (    )
Apr  7 13:52:01 r-14mfitest kernel: mpr0: Finished abort recovery for target 49
Apr  7 13:52:01 r-14mfitest kernel: (da3:mpr0:0:49:0): WRITE(10). CDB: 2a 00 3f ab 99 08 00 00 40 00 
Apr  7 13:52:01 r-14mfitest kernel: (da3:mpr0:0:49:0): CAM status: Command timeout
Apr  7 13:52:01 r-14mfitest kernel: (da3:mpr0:0:49:0): Retrying command, 3 more tries remain
Apr  7 13:52:01 r-14mfitest kernel: da3 at mpr0 bus 0 scbus0 target 49 lun 0
Apr  7 13:52:01 r-14mfitest kernel: da3: <ATA WD Blue SA510 2. 6100>  s/n 240406800922 detached
Apr  7 13:52:03 r-14mfitest kernel: (da3:mpr0:0:49:0): Periph destroyed

If I do a power cycle of the box, the disk comes back and resilvers just fine.  No SMART errors.  When I move these disks off the mpr controller and onto the same server's onboard SATA controllers, I am not able to provoke these errors running the test for a good 12-14hrs.  Put them back on the mpr controller, and if I do a trim -f of the disks and start clean, I seem to be able to get a good 4 cycles before the errors.  Hence I thought it had something to do with trim.



I can't seem to disable trim via that sysctl:
sysctl -w kern.cam.da.4.delete_method=NONE
kern.cam.da.4.delete_method: ATA_TRIM -> ATA_TRIM
I can, however, do:
 sysctl -w kern.cam.da.4.delete_method=DISABLE
kern.cam.da.4.delete_method: ATA_TRIM -> DISABLE


It seems to work a little better with it set to DISABLE. After the disk crash and reboot, I managed to get 4 iterations error free. However, it really slows down on the last loop: it normally takes about 20min to run, but by the 4th time it's an hour.

Apr  7 14:12:45 r-14mfitest LOOP[4704]: starting
Apr  7 14:40:14 r-14mfitest LOOP[16194]: ending
Apr  7 14:41:45 r-14mfitest LOOP[16835]: starting
Apr  7 15:05:24 r-14mfitest LOOP[26742]: ending
Apr  7 15:06:55 r-14mfitest LOOP[27383]: starting
Apr  7 15:31:42 r-14mfitest LOOP[37775]: ending
Apr  7 15:33:12 r-14mfitest LOOP[38413]: starting
Apr  7 16:25:19 r-14mfitest LOOP[60263]: ending

I am going to let it keep running to see if it errors out at a later time.
Comment 5 mike 2024-04-08 12:51:03 UTC
Created attachment 249831 [details]
charting write throughput (90 second pauses between tests)

Good news / bad news. The test continued through the night without causing any errors; however, it is getting progressively slower.  With the disks fresh it would take about 20min; now the same test takes about 2hrs.

I let it run overnight and each iteration got longer and longer:
Apr  8 00:10:45 r-14mfitest LOOP[57086]: ending job ran (1:21:6)
Apr  8 00:12:16 r-14mfitest LOOP[57737]: starting
Apr  8 01:41:51 r-14mfitest LOOP[95255]: ending job ran (1:29:35)
Apr  8 01:43:21 r-14mfitest LOOP[95888]: starting
Apr  8 03:57:36 r-14mfitest LOOP[52511]: ending job ran (2:14:15)
Apr  8 03:59:06 r-14mfitest LOOP[53144]: starting
Apr  8 05:57:41 r-14mfitest LOOP[2987]: ending job ran (1:58:35)

the 3am one was probably an extra 15min due to various daily cron jobs running.
But you can see from the attached graph the initial tests that ran with decent write speed, then the crash of the disk. I then disabled TRIM and tried again, but it gets slower and slower.

I have a feeling that if I do a trim -f da[each disk]p1 one by one and resilver, writes will be fast again for a bit.
Comment 6 mike 2024-04-08 22:24:24 UTC
OK, as suspected.  I stopped the test, reset the TRIM method to ATA_TRIM, offlined the disks one by one, did a trim -f /dev/da[disk]p1 on each, and resilvered the pool.  After that, I got normal performance for 3.5 iterations and then it is back to being super slow on writes:
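The per-disk sequence was roughly the following (a sketch; da4p1 stands in for each member disk, quirk-test is the pool name used later in this report, and whether zpool online or zpool replace is needed afterwards depends on whether the vdev labels survive the TRIM):

zpool offline quirk-test da4p1
trim -f /dev/da4p1                  # wipes the partition, vdev labels included
zpool replace quirk-test da4p1      # re-add the blank device and resilver
zpool wait -t resilver quirk-test   # wait before moving on to the next disk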

Apr  8 14:39:47 r-14mfitest LOOP[21836]: starting
Apr  8 15:00:59 r-14mfitest LOOP[30713]: ending job ran 1272 seconds (0:21:12)
Apr  8 15:07:29 r-14mfitest LOOP[33450]: starting
Apr  8 15:28:54 r-14mfitest LOOP[42451]: ending job ran 1285 seconds (0:21:25)
Apr  8 15:31:55 r-14mfitest LOOP[43719]: starting
Apr  8 15:53:22 r-14mfitest LOOP[52716]: ending job ran 1287 seconds (0:21:27)
Apr  8 15:56:23 r-14mfitest LOOP[53996]: starting
Apr  8 16:18:03 r-14mfitest LOOP[63083]: ending job ran 1300 seconds (0:21:40)
Apr  8 16:21:04 r-14mfitest LOOP[64344]: starting
Apr  8 16:55:39 r-14mfitest LOOP[78807]: ending job ran 2075 seconds (0:34:35)
Apr  8 16:58:40 r-14mfitest LOOP[80073]: starting
Apr  8 17:56:56 r-14mfitest LOOP[4539]: ending job ran 3496 seconds (0:58:16)
Apr  8 17:59:57 r-14mfitest LOOP[5796]: starting
Comment 7 mike 2024-04-09 12:15:08 UTC
I let the job continue overnight and it continues to get slower on writes, i.e. after doing a trim -f /dev/da[disk in pool]p1 one by one, the tests start out fast (~ 300MB of files in the zfs dataset) for 4 iterations.
# zfs get used,referenced,compressratio quirk-test/bull1
NAME              PROPERTY       VALUE  SOURCE
quirk-test/bull1  used           371G   -
quirk-test/bull1  referenced     369G   -
quirk-test/bull1  compressratio  3.86x  -
The disks are 1TB each.

Apr  8 15:00:59 r-14mfitest LOOP[30713]: ending job ran 1272 seconds (0:21:12)
Apr  8 15:28:54 r-14mfitest LOOP[42451]: ending job ran 1285 seconds (0:21:25)
Apr  8 15:53:22 r-14mfitest LOOP[52716]: ending job ran 1287 seconds (0:21:27)
Apr  8 16:18:03 r-14mfitest LOOP[63083]: ending job ran 1300 seconds (0:21:40)
Apr  8 16:55:39 r-14mfitest LOOP[78807]: ending job ran 2075 seconds (0:34:35)
Apr  8 17:56:56 r-14mfitest LOOP[4539]: ending job ran 3496 seconds (0:58:16)
Apr  8 19:09:44 r-14mfitest LOOP[35043]: ending job ran 4187 seconds (1:9:47)
Apr  8 20:27:37 r-14mfitest LOOP[67633]: ending job ran 4493 seconds (1:14:53)
Apr  8 21:47:30 r-14mfitest LOOP[1103]: ending job ran 4613 seconds (1:16:53)
Apr  8 23:11:06 r-14mfitest LOOP[36135]: ending job ran 4836 seconds (1:20:36)
Apr  9 00:51:15 r-14mfitest LOOP[78070]: ending job ran 5829 seconds (1:37:9)
Apr  9 02:46:06 r-14mfitest LOOP[26187]: ending job ran 6710 seconds (1:51:50)
Apr  9 04:57:16 r-14mfitest LOOP[81635]: ending job ran 7690 seconds (2:8:10)**


** the daily cron jobs run at 3am, so that one will be slower

This morning, I rebooted the box, so the trim-related sysctls are back at their defaults. I then forced 6 TRIMs on the pool (zpool trim -w quirk-test) and it sort of seems to make a difference? But not as much as doing a full trim -f.

Apr  9 08:02:32 r-14mfitest LOOP[53380]: ending job ran 4802 seconds (1:20:2)

so instead of a 2hr run it took 1h20m
Comment 9 mike 2024-04-27 18:36:38 UTC
I decided to try the same tests on the exact same hardware, but booting TrueNAS SCALE, to see if the problem persists.  If I do a manual trim between the zfs send | zfs recv runs, the performance seems fairly consistent and there are no crashes/resets of the drives in the pool on Linux (6.6.20-production+truenas).

I am not a Linux person, so it is hard to say if there are quirks applied for these disks on Linux.

root@truenas[/var/log]# hdparm -I /dev/sda | grep -i tri
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read data after TRIM
root@truenas[/var/log]# 

If I don't do the manual TRIM between send|recv (i.e. zpool trim -w pool), I get the same pattern as on FreeBSD after a one-time manual trim -f /dev/da[x] on each disk: 3 full-speed loops, and after that it is super slow until a proper trim is done.  On FreeBSD I do this to the raidz1 pool by running trim -f /dev/da[1-4] one disk at a time and resilvering.

So it does seem to point to TRIM issued via zfs (be that manual or autotrim) somehow being broken with this drive on FreeBSD, whether via the mpr driver or via the ATA driver.
Comment 10 mike 2024-04-27 18:41:08 UTC
Created attachment 250268 [details]
write performance of the tests

Starting at 16:00 on the graph, I ran a series of zfs send | zfs recv of the zfs dataset followed by a zfs destroy of the copy. I pause for 120 seconds, then initiate a manual trim (zpool trim -w pool), pause another 120 seconds, and then start again.  As you can see, the write throughput is consistent.  Then at 08:00 I restart the test, but this time I don't do a trim after the zfs destroy.  It manages about 3 iterations (the dataset is about 1/3rd of the pool's capacity) and then the writes become brutally slow, which matches the behaviour on FreeBSD.  If I then do a zpool trim -w pool, performance is back to normal.
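The first phase (with the manual trim) is roughly the following loop (a sketch; the pool and dataset names are placeholders):

# repeat: copy the dataset, destroy the copy, pause, manual TRIM, pause
while :; do
    zfs snapshot pool/src@t
    zfs send pool/src@t | zfs recv pool/copy
    zfs destroy -r pool/copy
    zfs destroy pool/src@t
    sleep 120
    zpool trim -w pool      # -w waits for the trim to complete
    sleep 120
done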