Bug 206836

Summary: Random Lockups of partition /var
Product: Base System
Reporter: schmidt
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New
Severity: Affects Some People
CC: amd64, daniel.genis, schmidt
Priority: ---
Version: 10.2-RELEASE
Hardware: amd64
OS: Any

Attachments:
Description                        Flags
dmesg.boot of one of the servers   none

Description schmidt 2016-02-02 07:26:43 UTC
I have 64 servers running on Supermicro blades. Most of them run 10.1-RELEASE, and we are in the process of updating them to 10.2-RELEASE.

On all updated systems the /var partition locks up after 20 to 50 days. Services on the system start to fail and new logins block. Over connections that were already open I can access all other partitions just fine, but any access to /var blocks the calling process, which never returns.
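
A minimal sketch of how such a hang could be detected from a monitoring job without the monitor itself getting stuck. It does the filesystem access in a child process and gives up after a timeout. The function name `probe_path`, the choice of `stat()` as the probe, and the 10 second default are assumptions made for illustration, not something from the affected systems. It also assumes the Python interpreter itself can still start without touching /var.

```python
import subprocess
import sys


def probe_path(path, timeout_s=10.0):
    """Return "ok", "error" or "timeout" for a stat() of `path`.

    The stat() runs in a child process so that a blocked filesystem
    stalls only the child, not the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Deliberately no kill-and-wait here: a child blocked inside the
        # kernel may never exit, and waiting for it would hang us too.
        return "timeout"
    return "ok" if returncode == 0 else "error"
```

Calling `probe_path("/var/log")` from a cron job on a partition that still works and logging any "timeout" result there would at least record when the lockup began. Which path to probe is a guess; a file that is not cached may be a better indicator than a directory.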

There are no messages on the console. We have to force a power-off to restart the system.

All partitions were SU+Journaling. I've deactivated journaling on all partitions, but the lockups still occur.

The systems run different services (MySQL, Jenkins, Apache).

The same hardware worked, and still works, just fine on 10.1-RELEASE and earlier releases.

Hostname   dd.mm.yyyy hh:mm Actions taken 
amnesix     7. 9.2015 10:30 Reboot
miraculix  29. 9.2015 03:00 Reboot
amnesix    18.10.2015 05:00 Reboot
amnesix    29.11.2015 16:00 Reboot. Disabled journaling
olympia    12.12.2015 11:00 Reboot. Disabled journaling
devzope    17.12.2015 04:00 Reboot. Disabled journaling
miraculix  22.12.2015 01:20 Reboot. Disabled journaling
olympia    30.12.2015 19:00 Reboot
amnesix     9. 1.2016 02:00 Reboot. fsck -f on all partitions
delphi     28. 1.2016 05:40 Reboot. Disabled journaling. fsck -f
devzope    30. 1.2016 04:50 Reboot. fsck -f
devzope     1. 2.2016 01:45 Reboot. fsck -f
miraculix   2. 2.2016 06:30 Reboot. fsck -f
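
To check the "20 to 50 days" estimate against the table above, here is a small sketch that computes the number of days between consecutive lockups on each host. It uses only the dates from the table and ignores the time of day. The names `LOCKUP_TABLE`, `ROW` and `gaps_by_host` are made up for this example. The gap between two lockups only approximates the uptime before a lockup, on the assumption that each host ran continuously after the listed reboot.

```python
import re
from collections import defaultdict
from datetime import date

# Rows copied from the table above (hostname, dd.mm.yyyy, hh:mm, action).
LOCKUP_TABLE = """\
amnesix     7. 9.2015 10:30 Reboot
miraculix  29. 9.2015 03:00 Reboot
amnesix    18.10.2015 05:00 Reboot
amnesix    29.11.2015 16:00 Reboot. Disabled journaling
olympia    12.12.2015 11:00 Reboot. Disabled journaling
devzope    17.12.2015 04:00 Reboot. Disabled journaling
miraculix  22.12.2015 01:20 Reboot. Disabled journaling
olympia    30.12.2015 19:00 Reboot
amnesix     9. 1.2016 02:00 Reboot. fsck -f on all partitions
delphi     28. 1.2016 05:40 Reboot. Disabled journaling. fsck -f
devzope    30. 1.2016 04:50 Reboot. fsck -f
devzope     1. 2.2016 01:45 Reboot. fsck -f
miraculix   2. 2.2016 06:30 Reboot. fsck -f
"""

# hostname, day, month, year; the time of day is matched but not captured.
ROW = re.compile(r"^(\S+)\s+(\d{1,2})\.\s*(\d{1,2})\.(\d{4})\s+\d{2}:\d{2}\s")


def gaps_by_host(table):
    """Map each hostname to the whole-day gaps between its lockups."""
    dates = defaultdict(list)
    for line in table.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip anything that is not a table row
        host = match.group(1)
        day, month, year = (int(match.group(i)) for i in (2, 3, 4))
        dates[host].append(date(year, month, day))
    gaps = {}
    for host, days in dates.items():
        days.sort()
        gaps[host] = [(later - earlier).days
                      for earlier, later in zip(days, days[1:])]
    return gaps


for host, gaps in sorted(gaps_by_host(LOCKUP_TABLE).items()):
    print(host, gaps)
```

For the rows above this gives gaps of 41, 42 and 41 days for amnesix, 84 and 42 days for miraculix, 18 days for olympia, and 44 and 2 days for devzope. delphi has only one entry, so no gap.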

The symptoms are always the same. After the power-down and reboot the RAID resyncs, but the system works just fine until the next lockup.

I've attached the dmesg.boot of one of the servers. They are all identical.
Comment 1 schmidt 2016-02-02 07:30:04 UTC
Created attachment 166418
dmesg.boot of one of the servers
Comment 2 schmidt 2016-02-18 09:41:01 UTC
Had three more lockups since reporting this bug. 

This is a real showstopper. 

If needed, I can provide access to one of the machines for testing.

Comment 3 Daniel G 2016-10-03 14:44:14 UTC

We have also seen this issue since migrating to 10.3-RELEASE.

Oct  3 16:00:44 storage4 kernel: vputx: negative ref count
Oct  3 16:00:44 storage4 kernel: 0xfffff801b881d760: tag zfs, type VDIR
Oct  3 16:00:44 storage4 kernel: usecount 0, writecount 0, refcount 0 mountedhere 0
Oct  3 16:00:44 storage4 kernel: flags (VI_FREE)
Oct  3 16:00:44 storage4 kernel: VI_LOCKed    lock type zfs: EXCL by thread 0xfffff8016083e000 (pid 1833, zfs, tid 102301)

We did not see this issue on 10.1 either.
For us this is also a serious bug, since it forces a server restart.