Bug 221618 - Random System freeze after update to 11.1
Summary: Random System freeze after update to 11.1
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-08-19 00:48 UTC by kayasaman
Modified: 2019-01-27 11:50 UTC
CC List: 2 users

See Also:


Description kayasaman 2017-08-19 00:48:26 UTC
Hi,

I'm experiencing complete freezes on two of my systems, which started happening after updating from 10.3-RELEASE to 11.1-RELEASE.

Both systems have the same motherboard and memory; however, the disk and system configurations are quite different.

System 1 has ZFS on root, is an iSCSI target, and additionally runs 2x jails. It is configured with 2x Intel igb NICs with 2x VLANs over lagg.

System 2 has UFS on root and serves several drives, configured as zpools, over NFS. This system also runs lagg over the 2x Intel igb NICs but has no VLANs.


I'm not sure what more information I could provide, as there are no kernel dumps because the systems become completely unresponsive. I have run 'top' on the systems and saw no high CPU load or memory usage at the time of the freeze.
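
Since no kernel dump is available, one way to keep some evidence is to record system state at short intervals and force each record to disk, so the last entry before a freeze survives the reset. The following is only a minimal sketch, assuming Python 3.7 or later is installed on the affected machine; the function names, the record format, and the choice of command are illustrative and not part of any FreeBSD tooling.

```python
import os
import subprocess
import time


def capture(cmd):
    """Run a command and return its stdout as text (empty string on failure)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


def append_snapshot(path, text, timestamp=None):
    """Append one timestamped record and force it to disk before returning."""
    if timestamp is None:
        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as log:
        log.write("=== " + timestamp + " ===\n")
        log.write(text.rstrip("\n") + "\n")
        log.flush()
        # fsync so the record is on stable storage even if the box locks up next.
        os.fsync(log.fileno())


def watch(path, cmd, interval=10.0, iterations=None):
    """Record a snapshot every `interval` seconds; loop forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        append_snapshot(path, capture(cmd))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

A call such as `watch("/var/tmp/freeze-watch.log", ["vmstat"])` would then leave a trail up to the last interval before a freeze; both the log path and the command are placeholders, and any command that prints one report and exits could be substituted.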


Hopefully someone can help fix this issue?? :-)



The dmesg output of the systems is as follows:

Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
VT(vga): resolution 640x480
CPU: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz (2000.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x30678  Family=0x6  Model=0x37  Stepping=8
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x41d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8102825984 (7727 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <SUPERM SMCI--MB>
WARNING: L1 data cache covers less APIC IDs than a core
0 < 1
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 128/32 (20170303/tbfadt-748)
ioapic0 <Version 2.0> irqs 0-86 on motherboard
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 2000049576 Hz quality 1000
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff80f5b220, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <SUPERM SMCI--MB> on motherboard
acpi0: Power Button (fixed)
unknown: I/O range not supported
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff irq 8 on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: _OSC returned error 0x10
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0xe080-0xe087 mem 0x90000000-0x903fffff,0x80000000-0x8fffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
ahci0: <AHCI SATA controller> port 0xe070-0xe077,0xe060-0xe063,0xe050-0xe057,0xe040-0xe043,0xe020-0xe03f mem 0x90a06000-0x90a067ff irq 19 at device 19.0 on pci0
ahci0: AHCI v1.30 with 2 3Gbps ports, Port Multiplier not supported
ahcich1: <AHCI channel> at channel 1 on ahci0
pci0: <encrypt/decrypt> at device 26.0 (no driver attached)
hdac0: <Intel BayTrail HDA Controller> mem 0x90a00000-0x90a03fff irq 22 at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pcib1: [GIANT-LOCKED]
pcib2: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
pcib2: [GIANT-LOCKED]
pci1: <ACPI PCI bus> on pcib2
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd000-0xd01f mem 0x90900000-0x9097ffff,0x90980000-0x90983fff irq 18 at device 0.0 on pci1
igb0: Using MSIX interrupts with 5 vectors
igb0: Ethernet address: 0c:c4:7a:b0:5f:30
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
pcib3: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pcib3: [GIANT-LOCKED]
pci2: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> mem 0x90800000-0x90803fff irq 19 at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib4
pcib5: <PCI-PCI bridge> irq 16 at device 1.0 on pci3
pci4: <PCI bus> on pcib5
igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xc000-0xc01f mem 0x90700000-0x9077ffff,0x90780000-0x90783fff irq 16 at device 0.0 on pci4
igb1: Using MSIX interrupts with 5 vectors
igb1: Ethernet address: 0c:c4:7a:b0:5f:31
igb1: Bound queue 0 to cpu 0
igb1: Bound queue 1 to cpu 1
igb1: Bound queue 2 to cpu 2
igb1: Bound queue 3 to cpu 3
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
pcib6: <PCI-PCI bridge> irq 17 at device 2.0 on pci3
pci5: <PCI bus> on pcib6
pcib7: <PCI-PCI bridge> irq 18 at device 3.0 on pci3
pci6: <PCI bus> on pcib7
ahci1: <Marvell 88SE9230 AHCI SATA controller> port 0xb050-0xb057,0xb040-0xb043,0xb030-0xb037,0xb020-0xb023,0xb000-0xb01f mem 0x90610000-0x906107ff irq 18 at device 0.0 on pci6
ahci1: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci1: quirks=0x900<NOBSYRES,ALTSIG>
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich4: <AHCI channel> at channel 2 on ahci1
ahcich5: <AHCI channel> at channel 3 on ahci1
ahcich6: <AHCI channel> at channel 4 on ahci1
ahcich7: <AHCI channel> at channel 5 on ahci1
ahcich8: <AHCI channel> at channel 6 on ahci1
ahcich9: <AHCI channel> at channel 7 on ahci1
ehci0: <Intel BayTrail USB 2.0 controller> mem 0x90a05000-0x90a053ff irq 23 at device 29.0 on pci0
usbus0: EHCI version 1.0
usbus0 on ehci0
usbus0: 480Mbps High Speed USB v2.0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
uart0: <16950 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart2: <16950 or compatible> port 0x3e0-0x3e7 irq 3 on acpi0
uart3: <16950 or compatible> port 0x3e8-0x3ef irq 4 on acpi0
uart4: <16950 or compatible> port 0x2e0-0x2e7 irq 3 on acpi0
orm0: <ISA Option ROM> at iomem 0xd2000-0xd2fff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: <Enhanced floppy controller> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
est2: <Enhanced SpeedStep Frequency Control> on cpu2
est3: <Enhanced SpeedStep Frequency Control> on cpu3
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
tcp_init: WARNING: TCB hash size not a power of 2, clipped from 32000 to 32768.
nvme cam probe device init
hdacc0: <Realtek ALC888 HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek ALC888 Audio Function Group> at nid 1 on hdacc0
pcm0: <Realtek ALC888 (Front Analog)> at nid 27 and 25 on hdaa0
pcm1: <Realtek ALC888 (Internal Digital)> at nid 17 on hdaa0
hdacc1: <Intel (0x2882) HDA CODEC> at cad 2 on hdac0
hdaa1: <Intel (0x2882) Audio Function Group> at nid 1 on hdacc1
hdaa1: hdaa_audio_as_parse: Duplicate pin 0 (5) in association 1! Disabling association.
pcm2: <Intel (0x2882) (HDMI/DP 8ch)> at nid 6 on hdaa1
ugen0.1: <Intel EHCI root HUB> at usbus0
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0
ada0 at ahcich1 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG MZ7LM240HMHQ-00005 GXT5204Q> ACS-2 ATA SATA 3.x device
ada0: Serial Number S2TWNX0J300288
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 228936MB (468862128 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>
ada1 at ahcich2 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD101KRYZ-01JPDB0 01.01H01> ACS-2 ATA SATA 3.x device
ada1: Serial Number 7JHLZY0C
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 9537536MB (19532873728 512 byte sectors)
ada2 at ahcich3 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD101KRYZ-01JPDB0 01.01H01> ACS-2 ATA SATA 3.x device
ada2: Serial Number 7JGJPUSC
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 9537536MB (19532873728 512 byte sectors)
pass3 at ahcich9 bus 0 scbus8 target 0 lun 0
pass3: <Marvell Console 1.01> Removable Processor SCSI device
pass3: Serial Number HKDP221516WL
pass3: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)
Trying to mount root from zfs:zroot/ROOT/default []...
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
uhub0: 8 ports with 8 removable, self powered
ugen0.2: <vendor 0x8087 product 0x07e6> at usbus0
uhub1 on uhub0
uhub1: <vendor 0x8087 product 0x07e6, class 9/0, rev 2.00/0.14, addr 2> on usbus0
Root mount waiting for: usbus0
uhub1: 4 ports with 4 removable, self powered
ugen0.3: <vendor 0x0409 product 0x005a> at usbus0
uhub2 on uhub1
uhub2: <vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 3> on usbus0
Root mount waiting for: usbus0
uhub2: 4 ports with 4 removable, self powered
lagg0: link state changed to DOWN
igb0: link state changed to UP
lagg0: link state changed to UP
lagg0.192: link state changed to UP
lagg0.300: link state changed to UP
igb1: link state changed to UP
Comment 1 kayasaman 2017-08-19 15:10:51 UTC
I've done some extra testing by disabling all the tunables in /boot/loader.conf and /etc/sysctl.conf, which didn't have any effect. System 2 is still freezing :-(

I also took out the lagg and currently just have 1x NIC hooked up to see if perhaps the lagg is the issue, as there was a mention of updated code in the bugtracker reports. I'm currently waiting to see the results of this, as I've just changed the setup; though on the first reboot the system wasn't up for even 40 minutes before it became completely unresponsive.

Of the two systems, the 'freeze' issue seems more prominent on System 2 than on System 1, meaning that something on the 2nd machine is triggering the bug/issue more than on the 1st system... but what? Lagg? NFS? ZFS?

Currently 'top' output looks fine, though NFS seems a little high, between 2-7% of CPU:



last pid:  9346;  load averages:  0.12,  0.18,  0.14    up 0+00:15:12  16:10:00
37 processes:  1 running, 36 sleeping
CPU:     % user,     % nice,     % system,     % interrupt,     % idle
Mem: 44M Active, 34M Inact, 6646M Wired, 15M Buf, 1061M Free
ARC: 6246M Total, 865M MFU, 5338M MRU, 2051K Anon, 13M Header, 29M Other
     6178M Compressed, 6448M Uncompressed, 1.04:1 Ratio
Swap: 2327M Total, 2327M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  673 root        128  20    0  8328K  4008K rpcsvc  3   0:46   5.08% nfsd
  699 root          1  20    0 72120K 16404K select  3   0:01   0.00% snmpd
 2783 root          1  20    0 20160K  3628K select  1   0:01   0.00% top
  847 munin         1  20    0 46300K 14320K select  0   0:00   0.00% perl
  678 root          1  20    0 12524K  3176K rpcsvc  1   0:00   0.00% rpc.lockd
  735 root          1  20    0 20568K 12476K select  3   0:00   0.00% ntpd
 9319 root          1  21    0 62480K  7840K select  0   0:00   0.00% sshd
 1863 root          1  20    0 19660K  3808K pause   3   0:00   0.00% csh
  671 root          1  20    0 10376K  2996K select  0   0:00   0.00% nfsd
  545 root          1  20    0 10492K  2436K select  0   0:00   0.00% syslogd
  959 root          1  24    0 43764K  3056K wait    2   0:00   0.00% login
  839 root          1  20    0 42472K  8748K kqread  1   0:00   0.00% master
 9324 root          1  20    0 19660K  3812K pause   0   0:00   0.00% csh
  841 postfix       1  20    0 44584K  8824K kqread  2   0:00   0.00% qmgr
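
The memory figures in the 'top' header above can be put in proportion with a few lines of Python. This is only a rough sketch: the parser and its names are hypothetical, it handles only the K/M/G suffixes seen here, and it assumes that the ZFS ARC is accounted for within wired memory on this system.

```python
import re

# Multipliers to convert top's size suffixes into megabytes as top reports them.
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_top_sizes(line):
    """Parse a top summary line such as 'Mem: 44M Active, 34M Inact' into {label: MB}."""
    _, _, body = line.partition(":")
    sizes = {}
    for amount, unit, label in re.findall(r"(\d+)([KMG])\s+(\w+)", body):
        sizes[label] = int(amount) * _UNIT_TO_MB[unit]
    return sizes


# The two summary lines are copied from the top output in this comment.
mem = parse_top_sizes("Mem: 44M Active, 34M Inact, 6646M Wired, 15M Buf, 1061M Free")
arc = parse_top_sizes("ARC: 6246M Total, 865M MFU, 5338M MRU, 2051K Anon, 13M Header, 29M Other")

# Ratio of the ARC total to wired memory (meaningful only under the assumption above).
arc_share_of_wired = arc["Total"] / mem["Wired"]
print(f"ARC total is {arc_share_of_wired:.0%} of wired memory")
```

Under that assumption, nearly all of the wired memory in this snapshot is ARC rather than some other kernel consumer, which fits the observation that memory usage did not look unusual.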
Comment 2 mlavkin 2018-03-12 07:30:52 UTC
(In reply to kayasaman from comment #1)
Hi, did you fix your problem?