Bug 194654 - 10.1-RC3 hangs with simultaneous writes
Summary: 10.1-RC3 hangs with simultaneous writes
Status: Closed Unable to Reproduce
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 10.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-28 09:36 UTC by Shane
Modified: 2015-07-08 09:19 UTC
CC List: 1 user

See Also:


Attachments
ASUS P8H61-M LE/USB3 dmesg.boot (12.02 KB, text/plain)
2014-10-28 09:36 UTC, Shane
forced stress script (151 bytes, application/x-shellscript)
2014-10-28 09:39 UTC, Shane
disk writing test (149 bytes, application/x-shellscript)
2014-11-26 16:37 UTC, Shane

Description Shane 2014-10-28 09:36:51 UTC
Created attachment 148728 [details]
ASUS P8H61-M LE/USB3 dmesg.boot

After almost 3 years of running 9.x on this machine with only manual reboots to install updates I find that I am unable to keep 10.1 running for more than a day.

After some trial and error I have found that two simultaneous writes to disk cause a memory issue that ends with a hang once wired memory usage hits 7GB (8GB installed); at that stage the reset button is the only thing that still works.

Under light usage this appears to build up over the course of a day, while forced heavy usage causes a hang within a minute. Running two copies of the attached test.sh while watching top looks normal for a while; then, within a minute, the wired memory amount can jump from 3GB to 7GB in about 6 seconds. Running two instances of the script simultaneously, one on a USB memstick and one on the zfs filesystem, or both on the zfs filesystem, produces the same problem.
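For context, a minimal sketch of the kind of write loop the attached test.sh performs (the actual attachment is not reproduced here; the file name, block size and count below are illustrative assumptions):

#!/bin/sh
# Hypothetical stand-in for the attached stress script: repeatedly write a
# large file so that two concurrent copies keep the pool busy.
while true; do
    dd if=/dev/zero of=stress.$$ bs=1m count=2048
    rm -f stress.$$
done

Two copies of the script were run at once, in the directories of interest, while watching the wired memory figure in top.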

I have tested this by booting into single user mode, mounting the filesystem and using screen to run top and two scripts at the same time with the same result.

Machine is an ASUS P8H61-M LE/USB3 with a Core i5 and 8GB RAM, with 3x 2TB Seagate drives in raidz. I think the pool was created with 9.2 (possibly with 9.1), and the pool and bootcode were updated after installing 10.1, just before RC2 was tagged.

FreeBSD leader.local 10.1-RC3 FreeBSD 10.1-RC3 #16 r273471: Thu Oct 23 06:32:33 ACDT 2014     root@leader.local:/usr/obj/usr/src/sys/GENERIC  amd64

In single user mode the following modules were loaded -

kernel
zfs.ko
opensolaris.ko
geom_eli.ko
crypto.ko
geom_journal.ko
geom_mirror.ko
geom_uzip.ko
aio.ko
coretemp.ko
sem.ko
smb.ko
smbus.ko
libiconv.ko
libmchain.ko
cd9660_iconv.ko
ext2fs.ko
msdosfs_iconv.ko
udf.ko
udf_iconv.ko
tmpfs.ko
nvidia.ko
dtraceall.ko
profile.ko
cyclic.ko
dtrace.ko
systrace_freebsd32.ko
systrace.ko
sdt.ko
lockstat.ko
fasttrap.ko
fbt.ko
dtnfscl.ko
dtmalloc.ko
Comment 1 Shane 2014-10-28 09:39:00 UTC
Created attachment 148729 [details]
forced stress script
Comment 2 Shane 2014-10-30 22:27:03 UTC
This is limited to the zpool I have 10.1 installed on. I can boot from 10.1-RC3-amd64-disc1.iso and import the zpool to repeat the issue.
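For reference, a hedged sketch of how the pool can be imported from the disc1 live environment for this test; the -f flag and the /mnt altroot are assumptions, not taken from the original report:

# From the 10.1-RC3 disc1 live shell: load ZFS and import the pool under a
# temporary mount point, then rerun two copies of the test script on it.
kldload zfs
zpool import -f -R /mnt zrpleader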

I also have a single-disk zpool (an external USB3 drive) that is a version 28 pool; I can import this and have no issue.

Properties of the zpool are -

zrpleader  size                           5.41T                          -
zrpleader  capacity                       86%                            -
zrpleader  altroot                        -                              default
zrpleader  health                         ONLINE                         -
zrpleader  guid                           7653467844531205029            default
zrpleader  version                        -                              default
zrpleader  bootfs                         zrpleader                      local
zrpleader  delegation                     on                             default
zrpleader  autoreplace                    off                            default
zrpleader  cachefile                      -                              default
zrpleader  failmode                       wait                           default
zrpleader  listsnapshots                  off                            default
zrpleader  autoexpand                     off                            default
zrpleader  dedupditto                     0                              default
zrpleader  dedupratio                     1.00x                          -
zrpleader  free                           761G                           -
zrpleader  allocated                      4.66T                          -
zrpleader  readonly                       off                            -
zrpleader  comment                        -                              default
zrpleader  expandsize                     0                              -
zrpleader  freeing                        0                              default
zrpleader  fragmentation                  28%                            -
zrpleader  leaked                         0                              default
zrpleader  feature@async_destroy          enabled                        local
zrpleader  feature@empty_bpobj            active                         local
zrpleader  feature@lz4_compress           active                         local
zrpleader  feature@multi_vdev_crash_dump  enabled                        local
zrpleader  feature@spacemap_histogram     active                         local
zrpleader  feature@enabled_txg            active                         local
zrpleader  feature@hole_birth             active                         local
zrpleader  feature@extensible_dataset     enabled                        local
zrpleader  feature@embedded_data          active                         local
zrpleader  feature@bookmarks              enabled                        local
zrpleader  feature@filesystem_limits      enabled                        local
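(The listing above corresponds to the output of the standard pool property query, e.g.:)

# List all pool-level properties; the feature@ entries at the end show which
# post-v28 feature flags are enabled or active on the pool.
zpool get all zrpleader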
Comment 3 Shane 2014-11-02 03:50:55 UTC
I've now updated to RC4 - while there has been an improvement, the issue is not totally resolved.

FreeBSD leader.local 10.1-RC4 FreeBSD 10.1-RC4 #19 r273922: Sat Nov  1 16:36:48 ACDT 2014     root@leader.local:/usr/obj/usr/src/sys/GENERIC  amd64

Compression would appear to be a factor. I disabled compression and installed the new world; in single-user mode wired memory increased more slowly and, with swap enabled, rose to 6.8G and stayed there for an hour of uptime. At 26 minutes uptime the wired amount jumped from 3.8G to 6.5G. Without swap enabled, processes were terminated after the drop in free RAM.

With compression enabled the wired amount jumped from 421M to 6.7G in 4 seconds at 3 minutes uptime.
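(A hedged sketch of how compression was likely toggled for these tests; the dataset name and the lz4 value are assumptions based on the pool's active lz4_compress feature:)

# Disable compression on the root dataset before one round of testing, then
# re-enable it (lz4 assumed) for the next round.
zfs set compression=off zrpleader
zfs set compression=lz4 zrpleader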

Back in multi-user mode I was able to run two copies of the script for several hours. I saw the wired amount rise over 7G a few times, and while the system slowed down it remained responsive. Unfortunately the damage was done and the machine was of limited use: I could start simple things like man and ls, but ps, top and su failed to start. This extended to X apps; I could start an xterm instance but not gnome-terminal, firefox or chrome, leaving me to hit the reset button.
Comment 4 Shane 2014-11-26 16:34:26 UTC
I think my troubles may be related to zfs and the arc_max setting may play a part.

Booting into single-user mode and running two instances of my test script, I get varying results with different arc_max settings (applied as a loader tunable, as sketched below):

vfs.zfs.arc_max=2G     - test hangs after about 10 minutes
vfs.zfs.arc_max=2560M  - tests still running after an hour
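A minimal sketch of how such a limit is set, assuming it was applied in /boot/loader.conf (the usual place for this tunable):

# /boot/loader.conf -- cap the ZFS ARC before boot; the value was varied
# between test runs (2G vs 2560M above).
vfs.zfs.arc_max="2560M"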

I have a second machine (Pentium E2140 with 1GB RAM) set up with 3 disks in raidz. I have been unable to recreate this issue on that machine. After a clean 10.1 install didn't reproduce the issue, I went back to 9.1 and created the zpool, wrote some test data, upgraded to 9.2, wrote some data, enabled compression, wrote some more data, and upgraded to 10.1, and it still didn't break. Either something in my zpool is amiss, or the amount of RAM makes the difference.
Comment 5 Shane 2014-11-26 16:37:08 UTC
Created attachment 149896 [details]
disk writing test

Reduced the count values to reduce disk space used during tests.
Comment 6 Marcus von Appen freebsd_committer freebsd_triage 2015-02-18 11:54:20 UTC
Updated 10.1-BETA and 10.1-RC versioned bugs to 10.1-STABLE.
Comment 7 Shane 2015-02-22 04:21:12 UTC
I have been running 10-STABLE for a while, and since updating to r278305 on 7th Feb I have not seen this issue in two weeks under normal load.

While simultaneously running two copies of the sample script shows that wired RAM is slow to be released, it does get released without any issue. Under my normal load I have not noticed the problem.
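(A small hedged sketch of how the wired figure can be watched outside of top, using standard sysctls; the one-second interval is arbitrary:)

# Print wired memory in MB once per second (wired page count * page size).
pagesize=$(sysctl -n hw.pagesize)
while true; do
    wired=$(sysctl -n vm.stats.vm.v_wire_count)
    echo $((wired * pagesize / 1048576))M
    sleep 1
done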
Comment 8 Shane 2015-04-01 05:56:30 UTC
While wired memory accumulates more slowly, the issue is still present.
Comment 9 Glen Barber freebsd_committer freebsd_triage 2015-07-07 15:49:31 UTC
Can you verify if the issue persists on 10.2-PRERELEASE?
Comment 10 Shane 2015-07-08 09:19:31 UTC
I am currently running 10.2-PRERELEASE #13 r285123

Testing in single-user mode I still see the wired allocation jump from 1500M to 7300M within a few seconds, but there are still a few hundred MB left free.

While I think the sudden wired allocation could be improved, I don't see it locking up in this situation.