I have 64 servers running on Supermicro blades. Most of them are running 10.1-RELEASE. We are in the process of updating them to 10.2-RELEASE.
On all updated systems, the /var partition locks up after 20 to 50 days. Services on the system start to fail and logins block. Over connections that were already open, I can still access all other partitions just fine, but accessing anything on /var blocks the process permanently.
There are no messages on the console. We have to force a power-off to restart the system.
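For anyone trying to confirm the hang remotely, a minimal check that /var still answers may help. This sketch is not part of the original report: `path_responds` is a hypothetical helper, and it assumes a Python 3 interpreter installed outside /var. It runs `os.stat` on the path in a child process, so a hung filesystem blocks only the child, and reports failure if the child has not exited within the timeout.

```python
import subprocess
import sys
import time


def path_responds(path, timeout=5.0):
    """Return True if stat(path) succeeds in a child process within `timeout` seconds.

    The stat runs in a separate interpreter so that a hung filesystem blocks
    only the child, never the caller. On timeout the child is deliberately
    left behind rather than waited on, because a process stuck on a
    filesystem lock may not exit even when killed.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = child.poll()  # None while the child is still running
        if code is not None:
            return code == 0  # non-zero means stat raised (e.g. missing path)
        time.sleep(0.1)
    return False  # child still blocked: treat the path as hung


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/var"
    print(f"{target}: {'ok' if path_responds(target) else 'not responding'}")
```

Run from a session that was opened before the hang, it should print "not responding" for /var while the other partitions still report "ok".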
All partitions were using SU+J (soft updates with journaling). I've disabled journaling on all partitions, but the lockups keep coming.
The systems run different services (MySQL, Jenkins, Apache).
The same hardware worked, and still works, just fine on 10.1-RELEASE and earlier releases.
Hostname   dd.mm.yyyy  hh:mm  Actions taken
amnesix    07.09.2015  10:30  Reboot
miraculix  29.09.2015  03:00  Reboot
amnesix    18.10.2015  05:00  Reboot
amnesix    29.11.2015  16:00  Reboot. Disabled journaling
olympia    12.12.2015  11:00  Reboot. Disabled journaling
devzope    17.12.2015  04:00  Reboot. Disabled journaling
miraculix  22.12.2015  01:20  Reboot. Disabled journaling
olympia    30.12.2015  19:00  Reboot
amnesix    09.01.2016  02:00  Reboot. fsck -f on all partitions
delphi     28.01.2016  05:40  Reboot. Disabled journaling. fsck -f
devzope    30.01.2016  04:50  Reboot. fsck -f
devzope    01.02.2016  01:45  Reboot. fsck -f
miraculix  02.02.2016  06:30  Reboot. fsck -f
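The gaps between the recorded reboots can be checked with a little date arithmetic. The sketch below is only that arithmetic over the rows of the table above, using calendar dates (the hh:mm column is ignored); `intervals_in_days` and `LOCKUPS` are illustrative names. It measures time between recorded forced reboots per host, not uptime since a clean boot.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the lockup table: (hostname, "dd.mm.yyyy").
LOCKUPS = [
    ("amnesix", "07.09.2015"),
    ("miraculix", "29.09.2015"),
    ("amnesix", "18.10.2015"),
    ("amnesix", "29.11.2015"),
    ("olympia", "12.12.2015"),
    ("devzope", "17.12.2015"),
    ("miraculix", "22.12.2015"),
    ("olympia", "30.12.2015"),
    ("amnesix", "09.01.2016"),
    ("delphi", "28.01.2016"),
    ("devzope", "30.01.2016"),
    ("devzope", "01.02.2016"),
    ("miraculix", "02.02.2016"),
]


def intervals_in_days(rows):
    """Return {hostname: [days between consecutive recorded lockups]}."""
    by_host = defaultdict(list)
    for host, stamp in rows:
        by_host[host].append(datetime.strptime(stamp, "%d.%m.%Y").date())
    result = {}
    for host, dates in by_host.items():
        dates.sort()
        # Pair each lockup with the next one on the same host.
        result[host] = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return result


if __name__ == "__main__":
    for host, gaps in intervals_in_days(LOCKUPS).items():
        print(f"{host:10} {gaps}")
```

For these rows it yields gaps of 41 to 44 days for most pairs, plus 2 days (devzope), 18 days (olympia) and 84 days (miraculix); delphi has only one entry, so no gap.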
The symptoms are always the same. After the forced power-off and reboot, the RAID resyncs, but the system works just fine until the next lockup.
I've attached the dmesg.boot of one of the servers. They are all identical.
Created attachment 166418
dmesg.boot of one of the servers
We have had three more lockups since reporting this bug.
This is a real showstopper.
If needed, I can provide access to one of the machines for testing.
We've also been seeing this issue since migrating to 10.3-RELEASE.
Oct 3 16:00:44 storage4 kernel: vputx: negative ref count
Oct 3 16:00:44 storage4 kernel: 0xfffff801b881d760: tag zfs, type VDIR
Oct 3 16:00:44 storage4 kernel: usecount 0, writecount 0, refcount 0 mountedhere 0
Oct 3 16:00:44 storage4 kernel: flags (VI_FREE)
Oct 3 16:00:44 storage4 kernel: VI_LOCKed lock type zfs: EXCL by thread 0xfffff8016083e000 (pid 1833, zfs, tid 102301)
We haven't seen this issue in 10.1 either.
For us, too, this is a serious bug that forces a server restart.