Created attachment 169760 [details]
Image of screen showing panic.

I have been battling a bug for about 4 months now. It has spanned 10.2-RELEASE, 10.3-RELEASE, and 11-CURRENT. If I could get some assistance in how to trace this problem, I would gladly provide information.

Known good state: FreeBSD 10.2 through 11-CURRENT works flawlessly.
Bad state: Install py27-salt and run the minion and/or master.
Note: Tested only on my hardware, but reproducible.

Steps to reproduce (run in IDE compatibility mode or on hardware that cannot do AHCI):
1. Install FreeBSD 10.2 through 11-CURRENT.
2. Install py27-salt.
3. Run 'salt machine test.ping' -- the machine reports back.
4. Run 'salt machine state.apply test=True' -- everything reports properly.
5. Run 'salt machine state.apply' -- expected behavior is a report of success; observed behavior is a kernel panic.

I get a very limited window in which I can give the machine any input. At this point I cannot boot past starting jails, since the salt minion loads before that. I cannot verify who this impacts or its extent, as I have found no data on this specific to salt. I can replicate it and reproduce it across multiple machines. Changing one machine from AHCI to IDE induces the panic as well; running in AHCI mode, that machine has been stable for weeks without issue. Sadly, the two machines causing the headache have no allocation for this due to their age. I have tried 2 different SSDs and 2 different spinners. The secondary machine did not have py27-salt installed and went months without a panic; install py27-salt and it starts panicking.
In testing, I have used 10.2, 10.3, 11-CURRENT, and now 10.0. 10.0 does not exhibit this behavior. The hardware I am having trouble with is a pre-AHCI ICH7. Configuration and setup are identical, provisioned with Ansible. I have tried loader.conf settings to no avail:

hint.ahci.0.msi="0"
hint.ahcich.0.sata_rev="0"
That's interesting. I would be curious to know whether the machine panics at a consistent point in the state run if you do a `salt-call -l debug state.apply`. If it does, that may point to some underlying OS command salt is using as the trigger.
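For example (just a sketch, with an arbitrary log path), teeing the output to a file makes it easy to compare where the output stops between runs; note that if the box panics, the tail of the file may not make it to disk:

    # Capture the debug run for comparison between attempts (path is arbitrary).
    salt-call -l debug state.apply 2>&1 | tee /var/tmp/salt-debug.log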
It seems that when running 'salt-call -l debug state.apply' there is no real issue: it executes the command and does not cause any problem. However, once states are applied, if I reboot and the minion restarts, it panics just after loading the minion. Running 'salt machine state.apply' still shows the same behavior. Running salt-call I get enormous output; a paste is below. If you look at line 9, "Please install 'virt-what'", that is the last line the minion prints with debugging turned on before the panic when running 'salt machine state.apply'.

[DEBUG ] Reading configuration from /usr/local/etc/salt/minion
[DEBUG ] Including configuration from '/usr/local/etc/salt/minion.d/_schedule.conf'
[DEBUG ] Reading configuration from /usr/local/etc/salt/minion.d/_schedule.conf
[DEBUG ] Configuration file path: /usr/local/etc/salt/minion
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG ] Reading configuration from /usr/local/etc/salt/minion
[DEBUG ] Including configuration from '/usr/local/etc/salt/minion.d/_schedule.conf'
[DEBUG ] Reading configuration from /usr/local/etc/salt/minion.d/_schedule.conf
[DEBUG ] Please install 'virt-what' to improve results of the 'virtual' grain.
[DEBUG ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (1950)
[DEBUG ] Setting zmq_reconnect_ivl to '1950ms'
[DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
[INFO ] Determining pillar cache
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'clear')
[DEBUG ] Decrypting the current master AES key
[DEBUG ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[DEBUG ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[DEBUG ] LazyLoaded jinja.render
[DEBUG ] LazyLoaded yaml.render
[DEBUG ] LazyLoaded state.apply
[DEBUG ] LazyLoaded grains.get
[DEBUG ] LazyLoaded saltutil.is_running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[INFO ] Determining pillar cache
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[INFO ] Loading fresh modules for state activity
[DEBUG ] LazyLoaded jinja.render
[DEBUG ] LazyLoaded yaml.render
[DEBUG ] In saltenv 'prod', looking at rel_path u'top.sls' to resolve u'salt://top.sls'
[DEBUG ] In saltenv 'prod', ** considering ** path u'/var/cache/salt/minion/files/prod/top.sls' to resolve u'salt://top.sls'
[INFO ] Fetching file from saltenv 'prod', ** skipped ** latest already in cache u'salt://top.sls'
[DEBUG ] compile template: /var/cache/salt/minion/files/prod/top.sls
[DEBUG ] Jinja search path: ['/var/cache/salt/minion/files/prod']
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/prod/top.sls' using 'jinja' renderer: 0.00688982009888
[DEBUG ] Rendered data from file: /var/cache/salt/minion/files/prod/top.sls:
Two interesting notes:

1. Installing on a USB thumb drive uses da, not ata, and there is no panic on that hardware. I believe I will run like this for now. I still retain copies of FreeBSD 10.0, 10.3, and 11-CURRENT on disk drives for testing purposes.

2. The same behavior is exhibited on my dual-socket Westmere when the BIOS is set to IDE instead of AHCI. I should have noted in the initial report that I also saw it on that more modern hardware. Thinking about this makes me suspect it is a broader problem with ata and salt.
I have an Asus P8B-X motherboard with 16G of memory and an Intel 530 Series 120G SSD as the boot drive. We are running FreeBSD 10.3-STABLE #3 r300822. I believe I have this same issue.

I was using the salt minion to update this machine and it hung (several times). Prior to the hang I would always see the console message:

ata2: already running!

The machine would continue to run for a short period of time and then become unresponsive. A few minutes later I might get a panic; otherwise it would just stay hung.

I was able to pull apart the minion and found it was camcontrol causing the issue. I created a script that loops over the command:

camcontrol identify ada0

and with it I am able to make the system either hang or panic regularly.

Using the kernel debugger I was able to glean this information:

spin lock 0xffffffff816f17e0 (sleepq chain) held by 0xfffff800g
panic: spin lock held too long
cpuid = 6
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe046562e800
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe046562e8b0
vpanic() at vpanic+0x126/frame 0xfffffe046562e8f0
panic() at panic+0x43/frame 0xfffffe046562e950
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x287/frame 0xfffffe046562e9c0
wakeup() at wakeup+0xf/frame 0xfffffe046562e9e0
vnlru_proc() at vnlru_proc+0x157/frame 0xfffffe046562ea70
fork_exit() at fork_exit+0x9a/frame 0xfffffe046562eab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe046562eab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 20 tid 100078 ]
Stopped at kdb_enter+0x3e: movq $0,kdb_why

If salt is disabled and I don't run camcontrol commands, this machine is stable. We have several other machines with different motherboards using the same kernel that don't have this problem.
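For reference, a minimal sketch of the kind of loop I mean (the device name ada0 and the sleep interval are just examples, not the exact script):

    #!/bin/sh
    # Repeatedly issue an ATA IDENTIFY via camcontrol(8); on the affected
    # hardware this is enough to hang or panic the box fairly quickly.
    while true; do
        camcontrol identify ada0 > /dev/null
        sleep 1
    done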