Bug 228359 - rebuild gmirror + ufs + ssd +trim
Summary: rebuild gmirror + ufs + ssd +trim
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-19 11:15 UTC by Alexey Koscheev
Modified: 2018-05-30 17:09 UTC
CC List: 6 users

See Also:


Attachments
fsck_ffs -E -d /dev/mirror/gm0 on empty fs (101.90 KB, text/plain), 2018-05-25 15:28 UTC, Alexey Koscheev
fsck_ffs -E -d /dev/mirror/gm0 on fs with 1gb data (104.06 KB, text/plain), 2018-05-25 15:29 UTC, Alexey Koscheev
fsck_ffs -E -d /dev/mirror/gm0 on fs with 200gb data (397.81 KB, text/plain), 2018-05-30 12:04 UTC, Alexey Koscheev
fsck_ffs -E -d /dev/mirror/gm0 on fs with 390gb data after gmirror rebuild (414.52 KB, text/plain), 2018-05-30 12:19 UTC, Alexey Koscheev

Description Alexey Koscheev 2018-05-19 11:15:11 UTC
Consider the following situation.

We have a gmirror with two SSD components (ada0, ada1), with UFS on top of the mirror.
One component (say, ada1) fails and has to be replaced:
gmirror forget gm0
replace the broken drive with a new one
gmirror insert gm0 ada1
gmirror rebuild gm0 ada1 (if autosync was disabled)

Wait until synchronization completes.
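Synchronization progress can be checked with, for example:

gmirror status gm0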

Now, what do we have?
ada0 is fine, but ada1 has been completely rewritten by the rebuild (effectively dd'ed); from the drive's point of view, every block of ada1 is now written.
When we then write a large volume of data to the filesystem, systat -vmstat shows that ada1 performs poorly (low write speed, high busy percentage).
When we remove a large file from the filesystem, TRIM commands are sent to ada1 and it starts performing better (on the next big write: higher write speed, lower busy percentage).

Now the question:

Is there a way to send TRIM commands to ada1 after the rebuild, based on the UFS allocation state, so the drive knows which regions should be considered currently unused?

The only way I have found is to create a large file and then delete it, but that is not a good solution, because overall performance is low while the file is being created.
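A rough sketch of that workaround (the file name and mount point are just examples, and it assumes the filesystem already has TRIM enabled, e.g. via tunefs -t enable):

# fill most of the free space with a temporary file, then delete it;
# the delete makes UFS issue TRIMs for the blocks it frees
dd if=/dev/zero of=/home/trimfill.tmp bs=1m
rm /home/trimfill.tmp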
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2018-05-20 18:44:10 UTC
To answer your question directly, no, there is currently no way to ask UFS to TRIM unused filesystem blocks. Of course, gmirror has no way to know whether a given block is used or not, so the problem cannot be solved there. Creating a large file and deleting it is probably the only reasonable workaround for now.
Comment 2 Warner Losh freebsd_committer freebsd_triage 2018-05-20 18:46:47 UTC
"To answer your question directly, no, there is currently no way to ask UFS to TRIM unused filesystem blocks. Of course, gmirror has no way to know whether a given block is used or not, so the problem cannot be solved there. Creating a large file and deleting it is probably the only reasonable workaround for now."

Doesn't fsck_ufs -E do this?
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2018-05-20 18:54:36 UTC
(In reply to Warner Losh from comment #2)
Ah, I didn't know about that one. I was thinking about an online method, but fsck_ffs -E might be sufficient for the submitter's purposes.
Comment 4 Alexey Koscheev 2018-05-20 19:14:50 UTC
It looks like fsck_ufs -E should be run on an unmounted filesystem. Am I right?
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2018-05-20 19:26:01 UTC
(In reply to freebsd from comment #4)
Yes, or a read-only FS.
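For example, an offline run might look like this (assuming /dev/mirror/gm0 is the mirror and /home its mount point, as in the later comments):

umount /home
fsck_ffs -E /dev/mirror/gm0
mount /home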
Comment 6 Alexey Koscheev 2018-05-20 19:37:09 UTC
Thanks! I will try it.
Comment 7 Alexey Koscheev 2018-05-21 08:39:19 UTC
2 drives: KINGSTON SKC400S371T
Size: 1TB
Used: 403G
Inodes: 10M

It took about 90 minutes to complete
fsck_ufs -E /home

Phases 1-4 took about 5 minutes; phase 5 took the rest of the time.

According to "systat -vmstat" on both drives:
tps mainly 111
MB/s from 0.5 to 2000
%busy - 98-99%

Not good results.
Comment 8 Edward Tomasz Napierala freebsd_committer freebsd_triage 2018-05-21 10:26:16 UTC
That's how it works - it needs to issue a _lot_ of TRIM requests.  Take a look at "gstat -d" when it's busy.
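For example (the -f filter regular expression is only illustrative; -d adds the delete/TRIM columns):

gstat -d -f '^ada[01]$'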
Comment 9 Alexey Koscheev 2018-05-21 12:29:42 UTC
Yes, in "gstat -d" I saw the same 111 tps in the "d/s" column.
Comment 10 Warner Losh freebsd_committer freebsd_triage 2018-05-21 15:43:18 UTC
(In reply to freebsd from comment #9)

What's the MB/s for the TRIMs?

There's a sweet spot for TRIM requests that varies from drive to drive. fsck_ffs -E just trims each of the blocks in sequence, which is a lot of TRIM requests. It also does it synchronously, IIRC, which reduces performance (a little on SSDs, a lot on NVMe).
Comment 11 Alexey Koscheev 2018-05-21 20:51:02 UTC
OK. I think it would be a good thing to be able to TRIM UFS online.

I need UFS because of the SUIDDIR option.

It turns out ZFS doesn't have this feature.

As an alternative, on another server I set up a ZFS mirror plus a zvol with UFS on top.
Comment 12 Kirk McKusick freebsd_committer freebsd_triage 2018-05-21 21:16:24 UTC
(In reply to Warner Losh from comment #10)
It would be trivial to aggregate TRIMs in fsck_ffs -E. Is there a limit on the size of a TRIM request?
Comment 13 Julian Elischer freebsd_committer freebsd_triage 2018-05-22 14:23:39 UTC
Unfortunately the maximum size for a TRIM or discard is a property of the device, and I don't know of a convenient way to read it.
You could try just making the request bigger and bigger until it gets an error.
Comment 14 Warner Losh freebsd_committer freebsd_triage 2018-05-22 14:49:56 UTC
(In reply to Kirk McKusick from comment #12)

We'll be plumbing this information up via GEOM soon; however, there's currently no way to query it. The different protocols have different maximums, and devices have different sweet spots. The I/O scheduler might be able to measure the latter through different techniques, with a fallback to sysctls to set the maximum size for troublesome or awesome devices (down or up respectively)... It's not there yet. A bigger issue is that we have no async interface to TRIM. This isn't a big deal for SATA drives, as most of them can't do NCQ TRIM (and our support for NCQ TRIM is limited to ahci), but for NVMe drives it can make a huge difference...
Comment 15 Kirk McKusick freebsd_committer freebsd_triage 2018-05-22 17:33:18 UTC
(In reply to Warner Losh from comment #14)
It sounds like the application should just send down the biggest trim that it has and then let the GEOM layer or perhaps the device layer divide it up into smaller pieces as needed. Or would it work better to have the application ask for a maximum size and then limit its trim requests to that size?
Comment 16 Warner Losh freebsd_committer freebsd_triage 2018-05-22 19:41:17 UTC
For now, sending down larger trims would be better, and let the lower layers decide...
Comment 17 Kirk McKusick freebsd_committer freebsd_triage 2018-05-22 23:32:46 UTC
I just looked at the implementation of the fsck_ffs -E option, and it already aggregates the free blocks into the largest TRIM requests it is able to create. It would be interesting to try running fsck_ffs with the -E and -d options (-d causes debug output, including a listing of the size and location of every TRIM command). On an empty filesystem it should yield one TRIM per cylinder group, close to the size of the cylinder group (typically about 500MB). I don't have an unused SSD to try this out on, but I am curious whether one of you can try this experiment.

Unless I am misreading the code, I think we are doing the best that we can.
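A sketch of that experiment, assuming the device name from this report and an arbitrary log file path:

# unmount first, then run fsck with TRIM (-E) and debug output (-d)
umount /dev/mirror/gm0
fsck_ffs -E -d /dev/mirror/gm0 > /var/tmp/fsck_trim.log 2>&1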
Comment 18 Alexey Koscheev 2018-05-25 15:28:17 UTC
Created attachment 193690 [details]
fsck_ffs -E -d /dev/mirror/gm0 on empty fs
Comment 19 Alexey Koscheev 2018-05-25 15:29:21 UTC
Created attachment 193691 [details]
fsck_ffs -E -d /dev/mirror/gm0 on fs with 1gb data
Comment 20 Kirk McKusick freebsd_committer freebsd_triage 2018-05-25 16:35:25 UTC
(In reply to freebsd from comment #19)
Thanks very much for running those tests. Is the running time at the end correct (that is, did both tests run in about 4 minutes)? If so, then the 90 minutes you saw in your original run was because the free space was much more scattered, requiring many more TRIM commands.
Comment 21 Alexey Koscheev 2018-05-25 17:43:38 UTC
(In reply to Kirk McKusick from comment #20)
Yes, both tests ran in about that time.

After the weekend I will try to run tests with more data in the filesystem and with a gmirror rebuild.
Comment 22 Kirk McKusick freebsd_committer freebsd_triage 2018-05-25 20:43:40 UTC
(In reply to freebsd from comment #21)
Tests on fuller filesystems will be interesting. At this point I doubt there are any further optimizations I can make to improve the running time (it is dominated by the time needed to execute the TRIM commands), but some possible improvements might become evident.
Comment 23 Alexey Koscheev 2018-05-30 12:04:40 UTC
Created attachment 193837 [details]
fsck_ffs -E -d /dev/mirror/gm0 on fs with 200gb data
Comment 24 Alexey Koscheev 2018-05-30 12:19:40 UTC
Created attachment 193838 [details]
fsck_ffs -E -d /dev/mirror/gm0 on fs with 390gb data after gmirror rebuild

This test was done in the following way:

FS (mounted on /mnt) contained 200G of data.
gmirror remove gm0 ada1
gmirror insert gm0 ada1
dd if=/dev/zero of=/mnt/1.tst bs=10m count=102400
In another terminal I ran systat -vmstat.
After a few minutes ada1 became 99% busy (ada0 was at 33%), and write throughput dropped from 255 MB/s (a SATA2 controller is used on this machine) to 85 MB/s.

dd was stopped when /mnt/1.tst reached 190G.

And finally
umount /mnt
fsck_ffs -E -d /dev/mirror/gm0

I could not reproduce the problem of fsck_ufs running for more than 6 minutes.
Comment 25 Kirk McKusick freebsd_committer freebsd_triage 2018-05-30 17:09:20 UTC
After extensive testing (thanks to freebsd at ihead.ru) we have determined that the fsck_ffs -E functionality is working correctly and as efficiently as possible.