Bug 197164 - Zpool with L2ARC hangs whole system
Summary: Zpool with L2ARC hangs whole system
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
Reported: 2015-01-29 07:38 UTC by Karli Sjöberg
Modified: 2015-06-24 20:00 UTC


Attachments
Graphite - System Overview (321.31 KB, image/png)
2015-01-29 07:38 UTC, Karli Sjöberg
FreeBSD 10.1-STABLE #0 r277949 (304.67 KB, image/png)
2015-02-09 08:22 UTC, Karli Sjöberg

Description Karli Sjöberg 2015-01-29 07:38:50 UTC
Created attachment 152328
Graphite - System Overview

Hi!

At present we have 4 ZFS storage systems that _were_ configured with SSDs as cache devices. After varying periods of time, depending on the amount of RAM and load, they become unresponsive.

Initially you can still ping them and switch VTs at the console, but nothing prints when you type, all services are gone, and so on. After a while they stop responding to ping as well. After a reboot all is good again for a while, until the process repeats itself.

Now I have found out exactly what's causing it: L2ARC! After just removing the cache drive(s), the systems run rock-solid again, but performance is severely degraded. The caching in ZFS really does wonders to offload the "slow" rotating disks, and we'd very much like to be able to re-add the cache drives to our pools.

This is similar to:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

But this might be a different issue. Since the OP couldn't experiment with systems that were in production, that case couldn't really get any further, but this one can! We have a virtual machine set up exactly like our "real" storage systems, but miniaturized in performance and capacity. It's upgraded to 10.1-RELEASE with these patches applied:
https://svnweb.freebsd.org/base?view=revision&revision=272875

With a script that loops copying files from my desktop to the VM and then back again, I have been able to reliably hang the system just by re-adding the cache to the pool; take a look at the attached screenshot. It shows the system overview of this virtual storage server, where I was running my script overnight and added the cache to the pool at around 9 AM. See what happens with the ARC? That's the problem. The system then went unresponsive around 3-4 PM.
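
For anyone wanting to reproduce this, a copy loop of the kind described above could look roughly like the following. This is only a sketch, not the script actually used: the use of scp, the function names, and all paths and host names are assumptions for illustration.

```python
import subprocess
import time


def round_trip_commands(local_dir, remote, scratch_dir):
    """Return the two scp commands for one round trip: push, then pull back.

    scp is an assumption; the original script's transfer method is not stated.
    """
    return [
        ["scp", "-r", local_dir, remote],    # desktop -> storage VM
        ["scp", "-r", remote, scratch_dir],  # storage VM -> desktop
    ]


def run_loop(local_dir, remote, scratch_dir, iterations=None,
             run=subprocess.run, pause=0.0):
    """Repeat the round trip; iterations=None loops until interrupted.

    Returns the number of completed round trips.
    """
    done = 0
    while iterations is None or done < iterations:
        for cmd in round_trip_commands(local_dir, remote, scratch_dir):
            # check=True stops the loop as soon as a transfer fails,
            # e.g. because the storage host stopped responding.
            run(cmd, check=True)
        done += 1
        time.sleep(pause)
    return done
```

Calling `run_loop("testdata", "storage-vm:/tank/test", "copyback")` (hypothetical names) would keep pushing and pulling until a transfer fails or the loop is interrupted.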

Thanks in advance!
Karli Sjöberg
Comment 1 Steven Hartland freebsd_committer freebsd_triage 2015-01-29 09:36:06 UTC
There have been a number of fixes since 10.1-RELEASE which affect L2ARC.

One that springs to mind is r274172, which was MFC'ed to stable as r275492.

There is also r275609.

I would recommend trying these in turn to see if they fix your issue.

If still no joy try a full stable/10.
Comment 2 Karli Sjöberg 2015-02-09 08:22:52 UTC
Created attachment 152791
FreeBSD 10.1-STABLE #0 r277949
Comment 3 Karli Sjöberg 2015-02-09 08:24:45 UTC
I never tried any more patches; it was easier just to go up to STABLE. Now at:
FreeBSD 10.1-STABLE #0 r277949

Now with the same script copying files, the system stays perfectly stable after more than 12 hours. Graphs attached. For us this issue seems fixed, thank you!
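
If you want to watch the ARC during such a test without a graphing setup, the counters can be sampled from sysctl. The sketch below assumes the `kstat.zfs.misc.arcstats` sysctl tree and the usual `name: value` output format of `sysctl` on FreeBSD; the function names are made up for illustration, and this is not how the attached graphs were produced.

```python
import subprocess

PREFIX = "kstat.zfs.misc.arcstats."


def parse_arcstats(text):
    """Turn `sysctl kstat.zfs.misc.arcstats` output into {counter: int}.

    Lines outside the arcstats tree or with non-integer values are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not name.startswith(PREFIX):
            continue
        try:
            stats[name[len(PREFIX):]] = int(value.strip())
        except ValueError:
            continue
    return stats


def read_arcstats(run=subprocess.run):
    """Run sysctl once and return the parsed counters."""
    out = run(["sysctl", "kstat.zfs.misc.arcstats"],
              capture_output=True, text=True, check=True)
    return parse_arcstats(out.stdout)
```

Logging, say, `read_arcstats().get("size")` and `read_arcstats().get("l2_hdr_size")` once a minute should be enough to see whether the ARC behaves differently after the cache device is added.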

/K
Comment 4 Xin LI freebsd_committer freebsd_triage 2015-06-24 20:00:54 UTC
For the record, in case someone looks up this bug: this is fixed in FreeBSD-EN-15:07.zfs (https://www.freebsd.org/security/advisories/FreeBSD-EN-15:07.zfs.asc).