Bug 209099

Summary: ata2: already running! panic: bad link elm 0xfffff80003b7e6a0 prev->next != elm
Product: Base System Reporter: Greg Furstenwerth <furstenwerth>
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Only Me CC: amd64, daryl, junovitch, pikachu17997
Priority: ---    
Version: CURRENT   
Hardware: amd64   
OS: Any   
Description Flags
Image of Screen showing panic. none

Description Greg Furstenwerth 2016-04-27 16:17:49 UTC
Created attachment 169760 [details]
Image of Screen showing panic.

I have been battling a bug for about 4 months now. This has spanned 10.2-release, 10.3-release, 11-current. If I could get some assistance here in how to trace this problem I would gladly provide information.

Known Good State: FreeBSD 10.2-11-current, works flawlessly.
Bad State: Install py27-salt, run minions and or master. 

Note: tested only on my hardware, but reproducible. 

Steps to reproduce, run in ide compatibility mode or hardware that cannot do AHCI.
Install FreeBSD 10.2-11-Current
Install py27-salt
run 'salt machine test.ping'
Machine will then report.
run 'salt machine state.apply test=True'
Everything reports properly
run 'salt machine state.apply' 

Expected behavior is report back of success, observed behavior is kernel panic.

I get a very limited window that I can do any input to the machine. At this time I cannot boot past starting jails as salt minion loads previously. I cannot verify who this impacts, I cannot verify the extent of it as I have read zero data regarding this specific to salt. I can replicate this, I can reproduce it across multiple machines. 

Changing one machine from AHCI to IDE will induce this as well, that machine is now stable for weeks without issue. Sadly the two machines that are causing the headache have no allocation for this due to age. I have tried 2 different SSD's and 2 different spinners. The secondary machine does not have py27-salt installed, months without panic. Install py27-salt and it starts panicking.
Comment 1 Greg Furstenwerth 2016-04-28 22:59:10 UTC
In testing, I have used 10.2, 10.3, 11-current and now 10.0. 

10.0, does not exhibit this behavior. The hardware I am having trouble with is PreAHCI ich7. Configuration, setup, is identical provisioned with ansible.

I have tried loader.conf settings to no avail.

Comment 2 Jason Unovitch freebsd_committer 2016-04-29 00:13:16 UTC
That's interesting.  I would be curious to know if the machine panics at a consistent point in the state run if you did a `salt-call -l debug state.apply`.  If it's a consistent point in the state run it may point to some underlying OS commands salt is using as the trigger.
Comment 3 Greg Furstenwerth 2016-04-29 13:58:19 UTC
It seems that when running 'salt-call -l debug state.apply' there is no real issue. It executes the command and does not create any problem. Once states are applied if I reboot and the minion restarts, it panic's just after loading the minion. Yet running 'salt machine state.apply' still does same behavior. Running salt-call I get enormous output.

However, the paste here. If you take a look at line 9, "please install 'virt-what'" when running 'salt machine state.apply' this is the last line printed with debugging turned on the minion before panic. 

[DEBUG   ] Reading configuration from /usr/local/etc/salt/minion
[DEBUG   ] Including configuration from '/usr/local/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /usr/local/etc/salt/minion.d/_schedule.conf
[DEBUG   ] Configuration file path: /usr/local/etc/salt/minion
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG   ] Reading configuration from /usr/local/etc/salt/minion
[DEBUG   ] Including configuration from '/usr/local/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /usr/local/etc/salt/minion.d/_schedule.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[DEBUG   ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (1950)
[DEBUG   ] Setting zmq_reconnect_ivl to '1950ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[INFO    ] Determining pillar cache
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'clear')
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] LazyLoaded state.apply
[DEBUG   ] LazyLoaded grains.get
[DEBUG   ] LazyLoaded saltutil.is_running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[INFO    ] Determining pillar cache
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/usr/local/etc/salt/pki/minion', 'machine.lan', 'tcp://salt.master.lan:4506')
[DEBUG   ] Loaded minion key: /usr/local/etc/salt/pki/minion/minion.pem
[INFO    ] Loading fresh modules for state activity
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] In saltenv 'prod', looking at rel_path u'top.sls' to resolve u'salt://top.sls'
[DEBUG   ] In saltenv 'prod', ** considering ** path u'/var/cache/salt/minion/files/prod/top.sls' to resolve u'salt://top.sls'
[INFO    ] Fetching file from saltenv 'prod', ** skipped ** latest already in cache u'salt://top.sls'
[DEBUG   ] compile template: /var/cache/salt/minion/files/prod/top.sls
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/prod']
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/prod/top.sls' using 'jinja' renderer: 0.00688982009888
[DEBUG   ] Rendered data from file: /var/cache/salt/minion/files/prod/top.sls:
Comment 4 Greg Furstenwerth 2016-04-30 16:24:50 UTC
Two interesting notes,

1: Installing on USB Thumb drive uses da not ata. No panic on the hardware. I believe I will run like this for now. I still retain copies of FreeBSD 10.0, 10.3 and 11-CURRENT on disk drives for testing purposes.

2: The same behavior is exhibited on my dual socket westmere, setting the bios to IDE instead of AHCI. I should have noted it was more modern hardware that I saw it on in the initial report. Thinking of this makes me think its possibly a broader problem with  ata and salt.
Comment 5 dksayers 2016-06-08 06:32:40 UTC
I have an Asus P8B-X motherboard with 16G memory and an Intel 530 Series
120G SSD as the boot drive. We are running FreeBSD 10.3-STABLE #3 r300822.

I believe I have this same issue. I was using salt minion to update this
machine and it hung (several times). Prior to the hang I would always see the
console message:

ata2: already running!

The machine would continue to run for a short period of time and then become
unresponsive. A few minutes later I might get a panic otherwise it would just 
be in a hung state.

I was able to pull apart the minion and found it was camcontrol causing the
issue. I create a script that looped over the command:

camcontrol identify ada0 

and I am able to cause the system to either hang or panic regularly.
Using the kernel debugger I was able to glean this information:

spin lock 0xffffffff816f17e0 (sleepq chain) held by 0xfffff800g
panic: spin lock held too long                                                  
cpuid = 6                                                                       
KDB: stack backtrace:                                                           
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe046562e800  
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe046562e8b0                  
vpanic() at vpanic+0x126/frame 0xfffffe046562e8f0                               
panic() at panic+0x43/frame 0xfffffe046562e950                                  
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x287/frame 0xfffffe046562e9c0 
wakeup() at wakeup+0xf/frame 0xfffffe046562e9e0                                 
vnlru_proc() at vnlru_proc+0x157/frame 0xfffffe046562ea70                       
fork_exit() at fork_exit+0x9a/frame 0xfffffe046562eab0                          
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe046562eab0               
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---                                       
KDB: enter: panic                                                               
[ thread pid 20 tid 100078 ]                                                    
Stopped at      kdb_enter+0x3e: movq    $0,kdb_why

If salt is disabled and I dont run camcontrol commands, this machine is stable.
We have several other machines with different mother boards using the same
kernel that dont have this problem.
Comment 6 Sunny007 2019-06-16 16:03:21 UTC