Created attachment 170388 [details]
throughput graphs for two bonnie++ runs
On a test system with 10 Intel P3500 NVMEs I have found that TRIM
activity can cause a severe I/O stall. After running several bonnie++
tests, the ZFS file system has been almost unusable for 15 minutes (yes,
HOW TO REPRODUCE:
- Create a ZFS pool, in this case, a raidz2 pool with the 10 NVMEs.
- Create a dataset without compression (we want to test actual I/O
- Run bonnie++. As bonnie++ can quickly saturate a single CPU core and
hence it's unable to generate enough bandwidth for this setup, I run
four bonnie++ processes concurrently. In order to demonstrate this
issue, each bonnie++ performs two runs. So,
( bonnie++ -s 512g -x 2 -f) & # four times.
Graphs included. Made with devilator (an Orca compatible data collector)
pulling data from devstat(9). The disk is just one out of 10 (the other
9 graphs are identical, as expected).
The first run of four bonnie++ processes runs without flaws. On graph
1 (TwoBonniesTput) we have the first bonnie++ from the start of the
graph to around 08:30 (the green line is the "Intelligent reading"
phase, and a second bonnie++ starting right after it.
Bonnie++ does several tests, beginning with a write test (blue line
showing around 230 MBps, from the start to 07:40), followed by a
read/write test (from 07:40 to 08:15 on the graphs), showing
read/write/delete activity and finally a read test (green line showing
250 MBps from 08:15 to 08:30 more or less). After bonnie++ ends, the
files it created are deleted. In this particular test, four concurrent
bonnie++ processes created four files of 512 GB each, a total of 2 TB.
After the first run, the disks show the TRIM activity going on at a rate of around
200 MB/s. It seems quite slow, since a test I did at home on an OCZ Vertex4 SSD
(albeit, a single one, not a pool) gave a peak of 2 GB/s. But I understand that
the ada driver is coalescing TRIM requests, while the nvd driver doesn't.
The trouble is: the second bonnie++ process is started right after the first one,
and, THERE IS ALMOST NO WRITE ACTIVITY FOR 15 MINUTES. The writing activity is
just frozen, and it doesn't pick up until about 08:45, stalling again, although
for a shorter time, around 08:50.
On exhibit 2, "TwoBonniesTimes", it can be seen that the write latency during the stall
is zero, which means (unless I am wrong) that no write commands are actually reaching
During the stalls the ZFS system was unresponsive. Any commands such as a simple
"zfs list" were painfully slow, taking even some minutes to complete.
I understand that a heavy TRIM activity must have an impact, but in this case it's
causing a complete starvation for the rest of the ZFS I/O activity which is clearly
wrong. This behavior could cause a severe problem, por example, when destroying a large
snapshot. In this case, the system is deleting 2 TB of data.
ATTEMPTS TO MITIGATE IT:
The first thing I tried was to reduce the priority of the TRIM operations in the I/O
with no visible effect.
After reading the article describing the ZFS I/O scheduler I suspected that the trim
activity might be activating the write throttle. So I just disabled it.
But it didn't help either. The writing processes still got stuck, but on dp->dp_s rather
There are two problems here. It seems that the nvd driver doesn't coalesce trim
requests. On the other hand, ZFS is dumping a lot of trim requests assuming that the
lower layer will coalesce them.
I don't think it's a good idea to make such an assumption blindly in ZFS. On the other
hand, I think that there should be some throttling mechanism applied to trim requests.
Created attachment 170389 [details]
I/O request times for two bonnie++ runs
I believe that this issue still on FreeBSD-12-CURRENT with dmesg log:
nvme0: async event occurred (log page id=0x2)
nvme0: temperature above threshold
and iostat -x 5
extended device statistics
device r/s w/s kr/s kw/s ms/r ms/w ms/o ms/t qlen %b
nvd0 0 1 0.6 153.6 2384 3287 0 2664 80 99
I am glad to test any patch for this in my test environment.
FreeBSD david-al.localdomain 12.0-CURRENT FreeBSD 12.0-CURRENT #11 5ea1f67b4ad(master)-dirty: Fri Nov 24 09:14:41 NZDT 2017 email@example.com:/tank/cross/obj/12.0-amd64.amd64/sys/INITZ-NODEBUG amd64
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03)
nvme0: <Generic NVMe Device> mem 0xdf200000-0xdf203fff irq 16 at device 0.0 on pci1
nvd0: <INTEL SSDPEKKW512G7> NVMe namespace
I'm going to post this as a separate issue, but will add it here since it is related. We were seeing non-NVMe SSDs on ZFS timing out on TRIM operations and have been able to greatly reduce the incidence by setting (for all drives)
sysctl kern.cam.da.0.delete_max=536870912 based on the advice here:
nvd and nda have the opposite problem that da/ada have. ada/da collapse as many BIO_DELETE commands down into a single TRIM to the device. That's why limiting helps since huge trims take a long time.
nvd/nda, however, has the opposite problem. They do no trim collapsing at all, so flood the device with TRIM requests that starve read/write requests.
Seeing the same on 11.0-RELEASE-p6 using a Samsung 960 EVO NVME. I've stopped my (extremely slow) copy to the drive and this background traffic continues.
tty nvd0 da0 da1 cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 150 32.00 456 14.25 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 493 32.00 455 14.22 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 151 32.00 454 14.18 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 164 32.00 455 14.21 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
I've seen this on another system with Intel 750 PCIe cards but if I remember right it seemed to handle the background trim better. It's been a while since I tested it though. Both systems are test boxes so I can test any changes needed and can move them to whatever base system versions are needed.
I see the exact same problem with TRIM, Samsung 960 EVO, and UFS.
Part of the problem is consumer controller (960 EVO) doing a bad job with TRIM.
Part of the problem is nvd does not coalesces TRIMs. Part of the problem is cam_iosched separates out and prioritizes TRIM over all other IO; see cam_iosched_next_bio(). (Separating out is useful for coalescing, but we don't actually coalesce yet.)
I observed iostat showing non-zero queue depths and long IO latencies with TRIM enabled (960 EVO). With TRIM disabled, qdepth was at most ever 1 and IO latency fell drastically under the same workload.
So is some immediate relief as simple as reordering the trim option in cam_iosched_next_bio to prioritize normal I/O over trim? At least for now?
Only if you are using nda, which most people aren't. It doesn't exist in 10 or 11 (well, until last week). It's not turned on by default in 12.
So talking about cam_ioschded and nvd in the same breath misunderstands the problem. nvd doesn't collapse trims and that's the problem there. It doesn't use CAM at all so no tweaks to cam_iosched is going to help because it's completely unused. nda, on the other hand, has the collapsing trim issue and the priority inversion issue cem points out. I've been tweaking the code to reduce trim priority as well as collapse trims, but that won't help nvd. I have no plans on doing anything with nvd.
So if you are using nvd, the whole problem is lack of trim collapsing. If you are using nda, that problem is compounded by the I/O scheduler. I should have a fix for nda by the end of next week.
Is it possible the following commit addresses this issue?
(In reply to ncrogers from comment #9)
I don't think it will. At least if this person's issues with TRIM are the same as the ones I've investigated for Netflix, the issue is too many trims starving other I/O by taking up I/O slots and/or triggering performance issues in the driver (or both).
It may help, but I don't think the FLUSH is the issues with TRIM.
(In reply to Warner Losh from comment #10)
Interesting. Does changing vfs.zfs.vdev.trim_max_pending or other sysctls improve this problem? I guess I am looking for some advice as I am also experiencing issues with some NVMe + ZFS systems hanging under 12.0-RELEASE.
Just another data point:
We are seeing strange behaviour with some PCIe SSDs - Intel DC P3520 which randomly/after a while seems to hang (go dead) - so dead that they disappear from the PCI bus and requires a power off / wait until caps have drained / power on to revive... But this happens on 11.2 - and some other folks here have seen similar problems on Linux. So I suspect this is not really OS-related.
We've also had problems with the SATA version (Intel S3520) where they occasionally freeze up on us (and go "silent" on the SATA/SAS port). Often if we wait long enough (days/a week or so) they will revive themselfs and start working again...
(In reply to Peter Eriksson from comment #12)
But that would suggest a firmware problem? In my experience with NVMEs the situation improved a lot with a firmware update. Although of course the lack of trmi collapse is a problem.
I still think that maybe discarding TRIM operations in serious congestion situations is a lesser evil. After all, what are people doing when facing this issue? Disabling TRIMs. It's better to do some TRIMs than none at all.
Certainly you cannot discard read/write operations like you would discard packets on a network in order to enforce a bandwidth limit. Buffering them doesn't help either when applications need to commit data to storage.