Assume the following situation: a gmirror with two SSD components (ada0, ada1), with UFS on top of the gmirror. One component (e.g. ada1) fails and has to be replaced:

gmirror forget gm0
(replace the broken drive with a new one)
gmirror insert gm0 ada1
gmirror rebuild gm0 ada1   (if autosync was disabled)

Wait until synchronization is done. What do we have now? ada0 is fine, but ada1 has been fully dd'ed; from the drive's point of view, every block of ada1 has been written. Now we start writing a large volume of data to the filesystem and see (systat -vmstat) that ada1 performs poorly (low write speed, high busy percentage). When we remove a large file from the filesystem, TRIM is sent to ada1 and it starts performing better (on the next big write: high write speed, lower busy percentage).

Now the question: is there a way to send TRIM commands to ada1 after the rebuild, according to the UFS state, so that the drive knows which regions should be considered unused? The only way I have found is to create a large file and then delete it (see the sketch below), but that is not good because of the low overall performance while the file is being created.
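Roughly, that workaround looks like this (a sketch; the mount point and file name are placeholders, and the filesystem must have the TRIM flag enabled, e.g. via "tunefs -t enable", for the deletion to be turned into TRIM requests):

  # Fill the free space with a throwaway file, then delete it so that UFS
  # frees (and TRIMs) those blocks; dd stops with "No space left on device".
  dd if=/dev/zero of=/mnt/trimfill bs=1m
  rm /mnt/trimfill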
To answer your question directly, no, there is currently no way to ask UFS to TRIM unused filesystem blocks. Of course, gmirror has no way to know whether a given block is used or not, so the problem cannot be solved there. Creating a large file and deleting it is probably the only reasonable workaround for now.
"To answer your question directly, no, there is currently no way to ask UFS to TRIM unused filesystem blocks. Of course, gmirror has no way to know whether a given block is used or not, so the problem cannot be solved there. Creating a large file and deleting it is probably the only reasonable workaround for now." Doesn't fsck_ufs -E do this?
(In reply to Warner Losh from comment #2) Ah, I didn't know about that one. I was thinking about an online method, but fsck_ffs -E might be sufficient for the submitter's purposes.
Looks like fsck_ufs -E should be run on an unmounted fs. Am I right?
(In reply to freebsd from comment #4) Yes, or a read-only FS.
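Something like this (a sketch using the device and mount point names from this report; adjust to your layout):

  umount /home
  fsck_ffs -E /dev/mirror/gm0    # or by fstab mount point: fsck_ffs -E /home
  mount /home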
Thanks! I will try it.
2 drives: KINGSTON SKC400S371T
Size: 1TB, Used: 403G, Inodes: 10M

It took about 90 minutes to complete fsck_ufs -E /home. Phases 1-4 took about 5 minutes; phase 5 took the rest of the time.

According to "systat -vmstat", on both drives:
  tps: mostly 111
  MB/s: from 0.5 to 2000
  %busy: 98-99

Not good results.
That's how it works - it needs to issue a _lot_ of TRIM requests. Take a look at "gstat -d" when it's busy.
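For example (the -f filter regex here is just an assumption for this particular setup):

  # -d adds BIO_DELETE (TRIM) statistics: the d/s, kBps and ms/d columns.
  gstat -d -f 'ada[01]|gm0'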
Yes, in "gstat -d" I saw the same 111 tps in the "d/s" column.
(In reply to freebsd from comment #9) What's the MB/s for the TRIMs? There's a sweet spot for TRIM request size that varies from drive to drive. fsck_ffs -E just trims each of the free blocks in sequence, which means a lot of TRIM requests. It also does it synchronously, IIRC, which reduces performance (a little on SSDs, a lot on NVMe).
OK. I think it would be a good thing to be able to TRIM UFS online. I need UFS because of the SUIDDIR option; as it turns out, ZFS does not have this feature. As an alternative, on another server I set up a ZFS mirror plus a zvol with UFS on top of it.
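For the record, that alternative setup looks roughly like this (a sketch; the pool name, zvol size and disk names are assumptions):

  zpool create tank mirror ada0 ada1
  zfs create -V 500g tank/ufsvol
  newfs -U /dev/zvol/tank/ufsvol
  # Mounting with suiddir requires a kernel built with "options SUIDDIR".
  mount -o suiddir /dev/zvol/tank/ufsvol /home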
(In reply to Warner Losh from comment #10) It would be trivial to aggregate trims in fsck_ffs -E. Is there a limit on the size of a trim request?
Unfortunately, the maximum size for a trim or discard is a property of the device. I don't know of a convenient way to read it. You could try just making it bigger and bigger until it gets an error...
(In reply to Kirk McKusick from comment #12) We'll be plumbing this information up via GEOM soon. However, there's currently no way to query it. The different protocols have different maximums, and devices have different sweet spots. The I/O scheduler might be able to measure the latter through different techniques, with a fallback to sysctls for setting the maximum size lower or higher for troublesome or especially good devices... It's not there yet. A bigger issue is that we have no async interface to TRIM. This isn't a big deal for SATA drives, as most of them can't do NCQ TRIM (and our support for NCQ TRIM is limited to ahci), but for NVMe drives it can make a huge difference...
(In reply to Warner Losh from comment #14) It sounds like the application should just send down the biggest trim that it has and then let the GEOM layer or perhaps the device layer divide it up into smaller pieces as needed. Or would it work better to have the application ask for a maximum size and then limit its trim requests to that size?
For now it would be better to send down larger trims and let the lower layers decide...
I just looked at the implementation of the fsck_ffs -E option, and it already aggregates the free blocks into the largest trim requests that it is able to create. It would be interesting to try running fsck_ffs with both the -E and -d options (-d produces debug output, including a listing of the size and location of every trim command). On an empty filesystem it should yield one trim per cylinder group, each close to the size of the cylinder group (typically about 500MB). I don't have an unused SSD to try this out on, but I am curious whether one of you can try this experiment. Unless I am misreading the code, I think that we are doing the best that we can.
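A sketch of that experiment, using the device name from this report (note that newfs destroys the existing filesystem, so only do this on a scratch device):

  newfs /dev/mirror/gm0             # create an empty filesystem for the test
  fsck_ffs -E -d /dev/mirror/gm0    # -d prints the location and length of every trim issued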
Created attachment 193690: fsck_ffs -E -d /dev/mirror/gm0 on empty fs
Created attachment 193691: fsck_ffs -E -d /dev/mirror/gm0 on fs with 1GB data
(In reply to freebsd from comment #19) Thanks very much for running those tests. Is the running time at the end correct, that is, did both tests run in about 4 minutes? If so, then the 90 minutes you saw in your original run was because the free space was much more scattered, requiring many more trim commands.
(In reply to Kirk McKusick from comment #20) Yes, both tests ran in about that time. After the weekend I will try tests with more data in the filesystem and with a gmirror rebuild.
(In reply to freebsd from comment #21) Tests on fuller filesystems will be interesting. At this point I doubt that there are any further optimizations I can make to improve the running time (it is dominated by the time to execute the trim commands), but some possible improvements might become evident.
Created attachment 193837: fsck_ffs -E -d /dev/mirror/gm0 on fs with 200GB data
Created attachment 193838: fsck_ffs -E -d /dev/mirror/gm0 on fs with 390GB data after gmirror rebuild

This test was done in the following way. The FS (mounted on /mnt) contained 200G of data.

gmirror remove gm0 ada1
gmirror insert gm0 ada1
dd if=/dev/zero of=/mnt/1.tst bs=10m count=102400

In another terminal I ran systat -vmstat. After a few minutes ada1 became 99% busy (ada0 was at 33%), and the write speed dropped from 255 MB/s (a SATA2 controller is used on this machine) to 85 MB/s. dd was stopped when /mnt/1.tst reached 190G. And finally:

umount /mnt
fsck_ffs -E -d /dev/mirror/gm0

I could not reproduce the original problem: fsck_ufs never ran for more than 6 minutes.
After extensive testing (thanks freebsd at ihead.ru) we have determined that the fsck_ffs -E functionality is working correctly and as efficiently as possible.