When a firewire device is powered on or off, the firewire bus is reset. A firewire bus reset hogs the CPU for too long, more than 0.1 second, causing data to be lost. Disk and Ethernet activity (and probably all i/o) stop. Stopping Ethernet for too long causes data to be lost. Userland processes also don't run, but they have to expect that, it is the devices not getting serviced that I'm concerned about. How-To-Repeat: Have some i/o going. fwcontrol -r observe that all i/o stops for 1/10 second
I found the source of this problem. When a firewire bus resets, the firewire driver prints a few lines to the console, using printf(9) and device_printf(9). I suspect that these are running at splfw aka splimp, locking out other i/o. Commenting out the *printf() calls fixes the problem, but that isn't a good solution. Would changing the *printf() calls to log(9) calls be safe? ("safe" meaning other i/o doesn't get locked out) Ah, for the good old days when 19200 baud seemed fast...
Dieter wrote: > I found the source of this problem. When a firewire bus resets, > the firewire driver prints a few lines to the console, > using printf(9) and device_printf(9). I suspect that these are > running at splfw aka splimp, locking out other i/o. > > Commenting out the *printf() calls fixes the problem, but that > isn't a good solution. > > Would changing the *printf() calls to log(9) calls be safe? > ("safe" meaning other i/o doesn't get locked out) > > Ah, for the good old days when 19200 baud seemed fast... > _______________________________________________ > Which one are you looking at Dieter? Sean
Hi, I believe no side effect is anticipated the message. "side effect" includes locking i/o. ---- Katsushi Kobayashi On 2008/12/15, at 4:30, Dieter wrote: > I found the source of this problem. When a firewire bus resets, > the firewire driver prints a few lines to the console, > using printf(9) and device_printf(9). I suspect that these are > running at splfw aka splimp, locking out other i/o. > > Commenting out the *printf() calls fixes the problem, but that > isn't a good solution. > > Would changing the *printf() calls to log(9) calls be safe? > ("safe" meaning other i/o doesn't get locked out) > > Ah, for the good old days when 19200 baud seemed fast... > _______________________________________________ > freebsd-firewire@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-firewire > To unsubscribe, send any mail to "freebsd-firewire-unsubscribe@freebsd.org"
> I found the source of this problem. When a firewire bus resets, > the firewire driver prints a few lines to the console, > using printf(9) and device_printf(9). I suspect that these are > running at splfw aka splimp, locking out other i/o. This can't be the case. There's no SPL involved at all. Maybe removing the printfs causes an interrupt to be serviced faster, resulting in what appears to be better performance... > Commenting out the *printf() calls fixes the problem, but that > isn't a good solution. > > Would changing the *printf() calls to log(9) calls be safe? > ("safe" meaning other i/o doesn't get locked out) > > Ah, for the good old days when 19200 baud seemed fast... > _______________________________________________ > freebsd-drivers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-drivers > To unsubscribe, send any mail to "freebsd-drivers-unsubscribe@freebsd.org" > >
Sean> Which file in dev/firewire are you looking at? fwohci.c and firewire.c examples: printf("non CYCLEMASTER mode\n"); device_printf(fc->dev, "Initiate bus reset\n"); ------------------- Warner> This can't be the case. There's no SPL involved at all. Maybe Warner> removing the printfs causes an interrupt to be serviced faster, Warner> resulting in what appears to be better performance... With the printfs, Ethernet is not getting serviced. This is repeatable and easily reproduced. Without the printfs, it seems ok. If it isn't spl, what is locking out Ethernet?
In message: <200812170329.DAA17848@sopwith.solgatos.com> Dieter <freebsd@sopwith.solgatos.com> writes: : Sean> Which file in dev/firewire are you looking at? : : fwohci.c and firewire.c : : examples: : : printf("non CYCLEMASTER mode\n"); : : device_printf(fc->dev, "Initiate bus reset\n"); : : ------------------- : : Warner> This can't be the case. There's no SPL involved at all. Maybe : Warner> removing the printfs causes an interrupt to be serviced faster, : Warner> resulting in what appears to be better performance... : : With the printfs, Ethernet is not getting serviced. This : is repeatable and easily reproduced. Without the printfs, : it seems ok. : : If it isn't spl, what is locking out Ethernet? interrupt storm? In old-spl-locked drivers, often times the interrupt would be masked while certain operations happened. In new mutex-locked freebsd, there's no way to block the interrupts, so sometimes old code gets into an interrupt storm, which starves other things. Not sure why printf would trigger this, however... Warner
> > : Warner> This can't be the case. There's no SPL involved at all. Maybe > > : Warner> removing the printfs causes an interrupt to be serviced faster, > > : Warner> resulting in what appears to be better performance... > > : > > : With the printfs, Ethernet is not getting serviced. This > > : is repeatable and easily reproduced. Without the printfs, > > : it seems ok. > > : > > : If it isn't spl, what is locking out Ethernet? > > > > interrupt storm? In old-spl-locked drivers, often times the interrupt > > would be masked while certain operations happened. In new > > mutex-locked freebsd, there's no way to block the interrupts, so > > sometimes old code gets into an interrupt storm, which starves other > > things. Not sure why printf would trigger this, however... I tried logging in on the console and "cat /etc/termcap" does not create a problem. Systat -vmstat reports that IRQ 5 (console) interrupts go from 0 to 120. CPU: 0.0%Sys 0.6%Intr 0.0%User 0.0%Nice 99.4%Idle The normal firewire bus reset printfs generate 890 IRQ 5 interrupts with the shell firing off "fwcontrol -u 1 -r" in a loop. CPU goes from idle to 100% busy: 98.1%Sys 1.7%Intr 0.3%User 0.0%Nice 0.0%Idle "fwcontrol -u 0 -r" (the onboard firewire controller) gives the same 890 IRQ 5 interrupts, but CPU usage is different: 79.2%Sys 20.8%Intr 0.0%User 0.0%Nice 0.0%Idle A couple of other IRQs go up to 30-40 range, nothing huge. A single "fwcontrol -u 0 -r" generates 14 interrupts on IRQ 5 and 4 interrupts on 3 other IRQs. A single "fwcontrol -u 1 -r" generates 14 interrupts on IRQ 5 and 3 interrupts on 2 other IRQs. yawn ------ With the bus reset printfs commented out, I get: -u 0 80.0%Sys 4.8%Intr 13.9%User 0.0%Nice 1.2%Idle 3300 interrupts on the onboard firewire controller -u 1 84.8%Sys 1.1%Intr 14.1%User 0.0%Nice 0.0%Idle 3900 interrupts on the PCI card firewire controller After awhile the console says: fwohci1: phy read failed(1). i = 100 retry = 0 fwohci1: phy read failed(1). i = 100 retry = 1 fwohci1: phy read failed(1). i = 100 retry = 2 fwohci1: phy read failed(1). i = 100 retry = 3 ... and the system becomes unresponsive. Reset button required. :-( Observations: Kernel printfs to console appear to take more interrupts and more CPU than userland output to console. Seems odd that the onboard controller uses so much CPU for interrupts in the with-printf case, but there wasn't a large number on interrupts. The without-printf case had a LOT more interrupts from the fw controller. Resetting the firewire bus in an endless loop is not recommended. I'm not an expert on interrupt storms, but none of this looks like an interrupt storm to me. (Assuming systat is trustworthy.) Looks like the kernel printf takes a *lot* of CPU, and somehow locks out Ethernet.
M. Warner Losh wrote: > In message: <200812170329.DAA17848@sopwith.solgatos.com> > Dieter <freebsd@sopwith.solgatos.com> writes: > : Sean> Which file in dev/firewire are you looking at? > : > : fwohci.c and firewire.c > : > : examples: > : > : printf("non CYCLEMASTER mode\n"); > : > : device_printf(fc->dev, "Initiate bus reset\n"); > : > : ------------------- > : > : Warner> This can't be the case. There's no SPL involved at all. Maybe > : Warner> removing the printfs causes an interrupt to be serviced faster, > : Warner> resulting in what appears to be better performance... > : > : With the printfs, Ethernet is not getting serviced. This > : is repeatable and easily reproduced. Without the printfs, > : it seems ok. > : > : If it isn't spl, what is locking out Ethernet? > > interrupt storm? In old-spl-locked drivers, often times the interrupt > would be masked while certain operations happened. In new > mutex-locked freebsd, there's no way to block the interrupts, so > sometimes old code gets into an interrupt storm, which starves other > things. Not sure why printf would trigger this, however... > > Warner > _______________________________________________ > freebsd-drivers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-drivers > To unsubscribe, send any mail to "freebsd-drivers-unsubscribe@freebsd.org" > Let me take a stab at this one this week. There's got to be something more going on than a printf() causing hell. It could be a mutex being held causing nightmares. The firewire code is my current project, I'll look at this issue now. Sean
Here are the results of two more experiments: I tried to verify what spl the firewire code is at when calling printf(9) by adding the following block into fwohci.c just before one of the printfs. { #include <sys/types.h> #include <sys/systm.h> intrmask_t debug_spl; /* __uint32_t */ intrmask_t debug_spl_high; intrmask_t debug_spl_tty; intrmask_t debug_spl_fw; intrmask_t debug_spl_0; intrmask_t debug_spl_net; debug_spl = splhigh(); debug_spl_high = spltty(); debug_spl_tty = splfw(); debug_spl_fw = splnet(); debug_spl_net = splhigh(); spl0(); /* void */ debug_spl_0 = splhigh(); splx(debug_spl); printf("fwohci_intr_core(): spl = 0x%x\n splhigh=0x%x spltty=0x%x splfw=0x%x splnet=0x%x spl0=0x%x\n", debug_spl, debug_spl_high, debug_spl_tty, debug_spl_fw, debug_spl_net, debug_spl_0); } But my results appear bogus: fwohci_intr_core(): spl = 0x0 splhigh=0x0 spltty=0x0 splfw=0x0 splnet=0x0 spl0=0x0 I have examined my code and the spl(9) man page several times but I can't find what is wrong. Any clues? -------------------------------------------------------- To isolate the effects of printf(9) from the firewire bus reset, I picked a trivial system call (chown(2)) and added some printf(9) calls. Calling chown several times and monitoring with systat -vmstat gives: 1098 interrupts on the console IRQ 93.1%Sys 6.7%Intr 0.2%User 0.0%Nice 0.0%Idle This did NOT interfere with Ethernet. So printf(9) interferes with Ethernet when called from the firewire driver, but not when called from a vanilla system call. -------------------------------------------------------- sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A, console sio0: [FILTER] What does [FILTER] mean? I don't see an explanation on the sio man page.
In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes: > I setup my system to execute a bus reset every 1 second, simultaneously, > I started copying a large > file onto the system see if anything would happen. > > I saw no change to copy speed reported by SCP during the entire > transaction. I also see no change > to the system load while this is occurring. Does your system have a RS-232 console or VGA console? The problem might be specific to RS-232. > This seems indicative of "something else" going on. Any idea what this "something else" might be? Does anyone have a clue why my spl calls are returning 0 ?
Dieter wrote: > Here are the results of two more experiments: > > I tried to verify what spl the firewire code is at when calling > printf(9) by adding the following block into fwohci.c just > before one of the printfs. > > { > #include <sys/types.h> > #include <sys/systm.h> > intrmask_t debug_spl; /* __uint32_t */ > intrmask_t debug_spl_high; > intrmask_t debug_spl_tty; > intrmask_t debug_spl_fw; > intrmask_t debug_spl_0; > intrmask_t debug_spl_net; > > debug_spl = splhigh(); > debug_spl_high = spltty(); > debug_spl_tty = splfw(); > debug_spl_fw = splnet(); > debug_spl_net = splhigh(); > spl0(); /* void */ > debug_spl_0 = splhigh(); > splx(debug_spl); > > printf("fwohci_intr_core(): spl = 0x%x\n splhigh=0x%x spltty=0x%x splfw=0x%x splnet=0x%x spl0=0x%x\n", > debug_spl, debug_spl_high, debug_spl_tty, debug_spl_fw, debug_spl_net, debug_spl_0); > } > > But my results appear bogus: > > fwohci_intr_core(): spl = 0x0 > splhigh=0x0 spltty=0x0 splfw=0x0 splnet=0x0 spl0=0x0 > > I have examined my code and the spl(9) man page several times > but I can't find what is wrong. Any clues? > > -------------------------------------------------------- > > To isolate the effects of printf(9) from the firewire bus reset, > I picked a trivial system call (chown(2)) and added some printf(9) > calls. > > Calling chown several times and monitoring with systat -vmstat gives: > 1098 interrupts on the console IRQ > 93.1%Sys 6.7%Intr 0.2%User 0.0%Nice 0.0%Idle > > This did NOT interfere with Ethernet. > > So printf(9) interferes with Ethernet when called from the > firewire driver, but not when called from a vanilla system call. > > I setup my system to execute a bus reset every 1 second, simultaneously, I started copying a large file onto the system see if anything would happen. I saw no change to copy speed reported by SCP during the entire transaction. I also see no change to the system load while this is occurring. This seems indicative of "something else" going on. > -------------------------------------------------------- > > sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 > sio0: type 16550A, console > sio0: [FILTER] > > What does [FILTER] mean? I don't see an explanation on the sio man page. > Did you mean to ask a different list? Because I have no idea. :)
Dieter wrote: > In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes: > > >> I setup my system to execute a bus reset every 1 second, simultaneously, >> I started copying a large >> file onto the system see if anything would happen. >> >> I saw no change to copy speed reported by SCP during the entire >> transaction. I also see no change >> to the system load while this is occurring. >> > > Does your system have a RS-232 console or VGA console? > The problem might be specific to RS-232. > > I have both. I disabled my serial console to test, no change when using the VGA console only. >> This seems indicative of "something else" going on. >> > > Any idea what this "something else" might be? > > Does anyone have a clue why my spl calls are returning 0 ? > > I confirmed that spl's are complete no-ops since rel 5. So, you want to ignore them as they are just markers now where locking should be implemented. I went back in your original submission and I can see a significant drop in I/O when resetting the F/W bus. If I have the following running: dd if=/dev/zero of=/dev/ad6 bs=64k Then I see about 64MB/S normally. When I reset any firewire bus, I see the throughput drop as reported by iostat by a significant amount(2-4 MB/s). I am assuming that the firewire driver is holding a very heavy lock here that is cascading down to the printf(). At least, that would be my ASSumption. :) I will continue to poke around. You are not crazy Dieter. Sean
Sean Bruno wrote: > Dieter wrote: >> In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes: >> >> >>> I setup my system to execute a bus reset every 1 second, >>> simultaneously, I started copying a large >>> file onto the system see if anything would happen. >>> >>> I saw no change to copy speed reported by SCP during the entire >>> transaction. I also see no change >>> to the system load while this is occurring. >>> >> >> Does your system have a RS-232 console or VGA console? >> The problem might be specific to RS-232. >> >> > I have both. I disabled my serial console to test, no change when using > the VGA console only. > >>> This seems indicative of "something else" going on. >>> >> >> Any idea what this "something else" might be? >> >> Does anyone have a clue why my spl calls are returning 0 ? >> >> > I confirmed that spl's are complete no-ops since rel 5. So, you want > to ignore > them as they are just markers now where locking should be implemented. > > I went back in your original submission and I can see a significant > drop in I/O > when resetting the F/W bus. If I have the following running: > dd if=/dev/zero of=/dev/ad6 bs=64k > > Then I see about 64MB/S normally. When I reset any firewire bus, I > see the throughput > drop as reported by iostat by a significant amount(2-4 MB/s). > > I am assuming that the firewire driver is holding a very heavy lock > here that is cascading > down to the printf(). At least, that would be my ASSumption. :) > > I will continue to poke around. You are not crazy Dieter. > > Sean > > Hrm ... Looking over the log messages that are generated during a bus reset, there's a mish-mash of printf() and device_printf() calls. I'm not sure what the impact of that is to real behavior, but /var/log/messages has a tendency to get garbled like this: Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, CYCLEMASTER mode Dec 22 16:00:18 home-test kernel: firewi Dec 22 16:00:18 home-test kernel: re1: Dec 22 16:00:18 home-test kernel: 1 n Dec 22 16:00:18 home-test kernel: odes Dec 22 16:00:18 home-test kernel: , ma Dec 22 16:00:18 home-test kernel: xhop Dec 22 16:00:18 home-test kernel: <= Dec 22 16:00:18 home-test kernel: 0, c Dec 22 16:00:18 home-test kernel: able Dec 22 16:00:18 home-test kernel: IRM Dec 22 16:00:18 home-test kernel: = 0 Dec 22 16:00:18 home-test kernel: (me Dec 22 16:00:18 home-test kernel: ) Dec 22 16:00:18 home-test kernel: f Dec 22 16:00:18 home-test kernel: irew Dec 22 16:00:18 home-test kernel: ire1 Dec 22 16:00:18 home-test kernel: : bu Dec 22 16:00:18 home-test kernel: s ma Dec 22 16:00:18 home-test kernel: nage Dec 22 16:00:18 home-test kernel: r 0 Dec 22 16:00:18 home-test kernel: (me) Dec 22 16:00:18 home-test kernel: Let me try to eliminate this behaviour first. Sean
> > I confirmed that spl's are complete no-ops since rel 5. So, you want > > to ignore > > them as they are just markers now where locking should be implemented. I hunted down the spl code, and you're right. Wow, I wonder how drivers still using spl calls work at all? > is to real behavior, but /var/log/messages has a tendency to get garbled > like this: > > Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset > Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset > Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, > CYCLEMASTER mode > Dec 22 16:00:18 home-test kernel: firewi > Dec 22 16:00:18 home-test kernel: re1: > Dec 22 16:00:18 home-test kernel: 1 n > Dec 22 16:00:18 home-test kernel: odes > Dec 22 16:00:18 home-test kernel: , ma > Dec 22 16:00:18 home-test kernel: xhop > Dec 22 16:00:18 home-test kernel: <= > Dec 22 16:00:18 home-test kernel: 0, c > Dec 22 16:00:18 home-test kernel: able Do the lines get folded on the console, or only in /var/log/messages?
> >>> I confirmed that spl's are complete no-ops since rel 5. So, you want > >>> to ignore > >>> them as they are just markers now where locking should be implemented. > >>> > > > > I hunted down the spl code, and you're right. Wow, I wonder how drivers > > still using spl calls work at all? > > > > > I believe that the spl() calls are just left there as a hint where > locking should be. > > As far as I understand, we need to pay attention to the mutex locks. I'll rephrase my question. In the old days, locking was done with spl. The new way is with mutex. But with the spl calls being replaced with noops, and as far as I can tell the driver is not using mutex, there doesn't appear to be any locking. So the driver can step on itself. > >> is to real behavior, but /var/log/messages has a tendency to get garbled > >> like this: > >> > >> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset > >> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset > >> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, > >> CYCLEMASTER mode > >> Dec 22 16:00:18 home-test kernel: firewi > >> Dec 22 16:00:18 home-test kernel: re1: > >> Dec 22 16:00:18 home-test kernel: 1 n > >> Dec 22 16:00:18 home-test kernel: odes > >> Dec 22 16:00:18 home-test kernel: , ma > >> Dec 22 16:00:18 home-test kernel: xhop > >> Dec 22 16:00:18 home-test kernel: <= > >> Dec 22 16:00:18 home-test kernel: 0, c > >> Dec 22 16:00:18 home-test kernel: able > >> > > > > Do the lines get folded on the console, or only in /var/log/messages? > > > > As far as I can see, the console messages are fine. It's only the > messages that get > garbled. Perhaps an artifact of syslogd?
Dieter wrote: >>> I confirmed that spl's are complete no-ops since rel 5. So, you want >>> to ignore >>> them as they are just markers now where locking should be implemented. >>> > > I hunted down the spl code, and you're right. Wow, I wonder how drivers > still using spl calls work at all? > > I believe that the spl() calls are just left there as a hint where locking should be. As far as I understand, we need to pay attention to the mutex locks. >> is to real behavior, but /var/log/messages has a tendency to get garbled >> like this: >> >> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset >> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset >> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, >> CYCLEMASTER mode >> Dec 22 16:00:18 home-test kernel: firewi >> Dec 22 16:00:18 home-test kernel: re1: >> Dec 22 16:00:18 home-test kernel: 1 n >> Dec 22 16:00:18 home-test kernel: odes >> Dec 22 16:00:18 home-test kernel: , ma >> Dec 22 16:00:18 home-test kernel: xhop >> Dec 22 16:00:18 home-test kernel: <= >> Dec 22 16:00:18 home-test kernel: 0, c >> Dec 22 16:00:18 home-test kernel: able >> > > Do the lines get folded on the console, or only in /var/log/messages? > As far as I can see, the console messages are fine. It's only the messages that get garbled. Sean
> > I'll rephrase my question. In the old days, locking was done with spl. > > The new way is with mutex. But with the spl calls being replaced with > > noops, and as far as I can tell the driver is not using mutex, there > > doesn't appear to be any locking. So the driver can step on itself. > > > Well, there is locking around a couple of mutex's via FW_GLOCK(). Ah, I wasn't grepping for the right string. So there *is* mutex locking. Although the lingering spl calls are still troubling.
Dieter wrote: >>> >>> >>> >> I believe that the spl() calls are just left there as a hint where >> locking should be. >> >> As far as I understand, we need to pay attention to the mutex locks. >> > > I'll rephrase my question. In the old days, locking was done with spl. > The new way is with mutex. But with the spl calls being replaced with > noops, and as far as I can tell the driver is not using mutex, there > doesn't appear to be any locking. So the driver can step on itself. > > Well, there is locking around a couple of mutex's via FW_GLOCK(). It appears that the locking is not robust, and that is one of the issues that I am looking into right now. > >>>> is to real behavior, but /var/log/messages has a tendency to get garbled >>>> like this: >>>> >>>> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset >>>> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset >>>> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, >>>> CYCLEMASTER mode >>>> Dec 22 16:00:18 home-test kernel: firewi >>>> Dec 22 16:00:18 home-test kernel: re1: >>>> Dec 22 16:00:18 home-test kernel: 1 n >>>> Dec 22 16:00:18 home-test kernel: odes >>>> Dec 22 16:00:18 home-test kernel: , ma >>>> Dec 22 16:00:18 home-test kernel: xhop >>>> Dec 22 16:00:18 home-test kernel: <= >>>> Dec 22 16:00:18 home-test kernel: 0, c >>>> Dec 22 16:00:18 home-test kernel: able >>>> >>>> >>> Do the lines get folded on the console, or only in /var/log/messages? >>> >>> >> As far as I can see, the console messages are fine. It's only the >> messages that get >> garbled. >> > > Perhaps an artifact of syslogd? > I doubt it. I'm working on a patch that improves the locking a bit and does some other "gross" things to try and keep things from flying apart. I've SEEN this behaviour in my implementations with sbp_targ and couldn't pin it down. Scott Long gave me a couple of pointers this evening, but I'm still working on locking down the taskqueues and some of the callback_handlers. There are some bad things going on specifically during initialization that are pre-empting normal operation. Sean
In message <1231181797.21260.8.camel@localhost.localdomain>, Sean Bruno writes: > @@ -1388,6 +1387,7 @@ > struct fw_device *fwdev; > > s = splfw(); > + FW_GLOCK(fc); > fc->status = FWBUSEXPLORE; > > /* Invalidate all devices, just after bus reset. */ > @@ -1396,6 +1396,7 @@ > fwdev->status = FWDEVINVAL; > fwdev->rcnt = 0; > } > + FW_GUNLOCK(fc); > splx(s); > > wakeup((void *)fc); If the (null these days) spl calls have been replaced with FW_GLOCK/FW_GUNLOCK calls, can the spl calls be removed now? > @@ -1922,6 +1924,8 @@ > u_int i; > struct firewire_comm *fc = (struct firewire_comm *)sc; > > + FW_GLOCK(fc); > + FW_GUNLOCK(fc); > if (stat & OHCI_INT_DMA_IR) { > irstat = atomic_readandclear_int(&sc->irstat); > for(i = 0; i < fc->nisodma ; i++){ This needs to be reviewed by someone who understands what is protected by getting a lock and then immediately releasing it. I have no clue. > @@ -1969,8 +1973,8 @@ > OWRITE(sc, OHCI_LNKCTLCLR, OHCI_CNTL_CYCTIMER); > #endif > OWRITE(sc, FWOHCI_INTMASKCLR, OHCI_INT_CYC_LOST); > - device_printf(fc->dev, "too many cycle lost, " > - "no cycle master presents?\n"); > + device_printf(fc->dev, "%s: too many cycle lost, " > + "no cycle master presents?\n", __func__); > } Should this be "too many cycles lost, " and "no cycle master present?\n" > uint32_t fun; > > + FW_GLOCK(fc); > device_printf(fc->dev, "Initiate bus reset\n"); > sc = (struct fwohci_softc *)fc; > > @@ -2487,6 +2495,7 @@ > fun |= FW_PHY_ISBR | FW_PHY_RHB; > fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun); > #endif > + FW_GUNLOCK(fc); > } Does the lock need to protect the printf?
Well ... this has really sent me down the rabbit hole the last couple of days. There is a need to audit all locking in the firewire stack right now and I have started that task. Essentially, threads, callouts, interrupts and task queues are all jumping around causing context to be switched from one thread to another. It's kind of bad in there and I need to sort it out. I'm working with a stable/7 tree, so I've started a patch for it. This patch has quite a few printf -> device_printf changes in it, so try to ignore thost for now. The meat of the patch is a judicious implementation of FW_GLOCK() in certain code areas. Note that sometimes the code is just trying to get the lock and then drops it immediately. This is not very optimal, but it does the trick. I'm still seeing a high level of broken log messages in /var/log/messages but this may help the issue you were seeing. Give it a spin. Sean
On Mon, 2009-01-05 at 17:52 +0000, Dieter wrote: > In message <1231181797.21260.8.camel@localhost.localdomain>, Sean Bruno writes: > > > @@ -1388,6 +1387,7 @@ > > struct fw_device *fwdev; > > > > s = splfw(); > > + FW_GLOCK(fc); > > fc->status = FWBUSEXPLORE; > > > > /* Invalidate all devices, just after bus reset. */ > > @@ -1396,6 +1396,7 @@ > > fwdev->status = FWDEVINVAL; > > fwdev->rcnt = 0; > > } > > + FW_GUNLOCK(fc); > > splx(s); > > > > wakeup((void *)fc); > > If the (null these days) spl calls have been replaced with FW_GLOCK/FW_GUNLOCK > calls, can the spl calls be removed now? > They most definitely can be removed. However, they are a roadmap to where locks SHOULD be. I'm using them as guideposts through the code to see if I can figure out what's missing. > > @@ -1922,6 +1924,8 @@ > > u_int i; > > struct firewire_comm *fc = (struct firewire_comm *)sc; > > > > + FW_GLOCK(fc); > > + FW_GUNLOCK(fc); > > if (stat & OHCI_INT_DMA_IR) { > > irstat = atomic_readandclear_int(&sc->irstat); > > for(i = 0; i < fc->nisodma ; i++){ > > This needs to be reviewed by someone who understands what is protected by > getting a lock and then immediately releasing it. I have no clue. > Indeed, it's gross. I only sent this patch over as a "test" of sorts to validate that the issue I'm looking into(locking in the firewire driver) is actually something related to what you reported in this issue. > > @@ -1969,8 +1973,8 @@ > > OWRITE(sc, OHCI_LNKCTLCLR, OHCI_CNTL_CYCTIMER); > > #endif > > OWRITE(sc, FWOHCI_INTMASKCLR, OHCI_INT_CYC_LOST); > > - device_printf(fc->dev, "too many cycle lost, " > > - "no cycle master presents?\n"); > > + device_printf(fc->dev, "%s: too many cycle lost, " > > + "no cycle master presents?\n", __func__); > > } > > Should this be "too many cycles lost, " and "no cycle master present?\n" > Who knows. I haven't reviewed this specific chunk of code to understand what it really means as per the FW specs. > > uint32_t fun; > > > > + FW_GLOCK(fc); > > device_printf(fc->dev, "Initiate bus reset\n"); > > sc = (struct fwohci_softc *)fc; > > > > @@ -2487,6 +2495,7 @@ > > fun |= FW_PHY_ISBR | FW_PHY_RHB; > > fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun); > > #endif > > + FW_GUNLOCK(fc); > > } > > Does the lock need to protect the printf? These are gross, overpowered, way to heavy handed locks that I'm playing with. I need to prevent pre-emption of certain events while they are in progress. One of these events is the firewire's assertion of "bus reset" on the firewire device. I see the h/w interrupt firing before this code can actually complete, causing the driver to be confused on occasion. Thanks for surveying this code with me, it helps to see what other people's eyes can pick up. I hope to have this in working order soon-ish. Sean
In message <1231222803.21260.17.camel@localhost.localdomain>, Sean Bruno writes: > > > uint32_t fun; > > > > > > + FW_GLOCK(fc); > > > device_printf(fc->dev, "Initiate bus reset\n"); > > > sc = (struct fwohci_softc *)fc; > > > > > > @@ -2487,6 +2495,7 @@ > > > fun |= FW_PHY_ISBR | FW_PHY_RHB; > > > fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun); > > > #endif > > > + FW_GUNLOCK(fc); > > > } > > > > Does the lock need to protect the printf? > > These are gross, overpowered, way to heavy handed locks that I'm playing > with. I need to prevent pre-emption of certain events while they are in > progress. One of these events is the firewire's assertion of "bus > reset" on the firewire device. I see the h/w interrupt firing before > this code can actually complete, causing the driver to be confused on > occasion. I understand the basic concept of locking, or at least I did many many years ago when spls really were levels. Since then they changed spls to not be levels (at least that's what the man page says) and later replaced with mutex. I assume these changes are at least in part to better support SMP. What I don't understand is things like getting a lock and immediately releasing it, which appears to me to protect nothing. Or why the printf needs to be inside the locked section of code. I thought the goal was to hold a lock for as short a time as possible, and it is not clear to me why the printf needs to be protected.
On Tue, 2009-01-06 at 09:33 +0000, Dieter wrote: > In message <1231222803.21260.17.camel@localhost.localdomain>, Sean Bruno writes: > > > > > uint32_t fun; > > > > > > > > + FW_GLOCK(fc); > > > > device_printf(fc->dev, "Initiate bus reset\n"); > > > > sc = (struct fwohci_softc *)fc; > > > > > > > > @@ -2487,6 +2495,7 @@ > > > > fun |= FW_PHY_ISBR | FW_PHY_RHB; > > > > fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun); > > > > #endif > > > > + FW_GUNLOCK(fc); > > > > } > > > > > > Does the lock need to protect the printf? > > > > These are gross, overpowered, way to heavy handed locks that I'm playing > > with. I need to prevent pre-emption of certain events while they are in > > progress. One of these events is the firewire's assertion of "bus > > reset" on the firewire device. I see the h/w interrupt firing before > > this code can actually complete, causing the driver to be confused on > > occasion. > > I understand the basic concept of locking, or at least I did many many years > ago when spls really were levels. Since then they changed spls to not be levels > (at least that's what the man page says) and later replaced with mutex. I assume > these changes are at least in part to better support SMP. > > What I don't understand is things like getting a lock and immediately releasing > it, which appears to me to protect nothing. Or why the printf needs to be inside > the locked section of code. I thought the goal was to hold a lock for as short a > time as possible, and it is not clear to me why the printf needs to be protected. The printf DOES NOT need protection at all. You are absolutely correct. The lock should be moved lower in the function for sure. This is just my attempt to see if I am looking at the same problem you are reporting. Conceptually, I am trying to keep the h/w interrupt which fires during a bus reset from preempting this function and executing before this function is finished. The acquisition of the lock and release keeps the interrupt handler from executing before the setting of bus reset is finished. I hope to fine tune the locks after I have confirmed operational goodness. Sean
Sean, I tried your patch, but bad news: a bus reset still tromps on Ethernet.
Responsible Changed From-To: freebsd-bugs->freebsd-firewire Over to maintainer(s). This looks like it may be some bad interaction between the firewire stack and using a serial console. To submitter: It may be worth while switching to uart(4) rather than sio(4) for your serial ports and seeing if that makes any difference (as I don't think uart(4) uses the Giant lock).
In message <200902022113.n12LDfEI059994@freefall.freebsd.org>, gavin@freebsd.org writes: > This looks like it may be some bad > interaction between the firewire stack and using a serial > console. To submitter: It may be worth while switching to > uart(4) rather than sio(4) for your serial ports and seeing > if that makes any difference (as I don't think uart(4) uses > the Giant lock). The box is running 7.1 now. I commented out sio in the config file and built a new kernel. normal kernel: freebsd # dmesg | grep ^sio sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A, console sio0: [FILTER] sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 sio1: type 16550A sio1: [FILTER] freebsd # dmesg | grep -i ^uart freebsd # dmesg | grep -i 1655 sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A, console sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 sio1: type 16550A freebsd # dmesg | grep -i console sio0: type 16550A, console sc0: <System console> at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x100> freebsd # dmesg | grep -i giant ohci0: [GIANT-LOCKED] ehci0: [GIANT-LOCKED] ohci1: [GIANT-LOCKED] ohci2: [GIANT-LOCKED] ehci1: [GIANT-LOCKED] ppc0: [GIANT-LOCKED] atkbd0: [GIANT-LOCKED] With sio commented out: OK boot kernel_uart /boot/kernel_uart/kernel text=0x3f8090 data=0x5b678+0x38cc8 syms=[0x8+0x6ee08+0x 8+0x609a9] / And nothing else appears on the console. freebsd # dmesg | grep -i ^sio freebsd # dmesg | grep -i ^uart uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0 uart0: [FILTER] uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0 uart1: [FILTER] freebsd # dmesg | grep -i 1655 uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0 uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0 freebsd # dmesg | grep -i console sc0: <System console> at flags 0x100 on isa0 sc0: VGA <16 virtual consoles, flags=0x300> echo hello >> /dev/console doesn't show up. Nothing shows up on console. And looking at files in /var/log I found: getty[789]: open /dev/ttyd0: No such file or directory getty[790]: open /dev/ttyd1: No such file or directory Looks to me like commenting out sio doesn't work so well on my box. Is that the wrong way to switch from sio to uart? Is there something else I need to change instead, or in addition?
On Wed, 2009-03-04 at 21:37 +0000, Dieter wrote: > In message <200902022113.n12LDfEI059994@freefall.freebsd.org>, gavin@freebsd.org writes: > > > This looks like it may be some bad > > interaction between the firewire stack and using a serial > > console. To submitter: It may be worth while switching to > > uart(4) rather than sio(4) for your serial ports and seeing > > if that makes any difference (as I don't think uart(4) uses > > the Giant lock). > > The box is running 7.1 now. I commented out sio in the config file > and built a new kernel. > > normal kernel: > > freebsd # dmesg | grep ^sio > sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 > sio0: type 16550A, console > sio0: [FILTER] > sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 > sio1: type 16550A > sio1: [FILTER] > freebsd # dmesg | grep -i ^uart > freebsd # dmesg | grep -i 1655 > sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 > sio0: type 16550A, console > sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 > sio1: type 16550A > freebsd # dmesg | grep -i console > sio0: type 16550A, console > sc0: <System console> at flags 0x100 on isa0 > sc0: VGA <16 virtual consoles, flags=0x100> > freebsd # dmesg | grep -i giant > ohci0: [GIANT-LOCKED] > ehci0: [GIANT-LOCKED] > ohci1: [GIANT-LOCKED] > ohci2: [GIANT-LOCKED] > ehci1: [GIANT-LOCKED] > ppc0: [GIANT-LOCKED] > atkbd0: [GIANT-LOCKED] > > With sio commented out: > > OK boot kernel_uart > /boot/kernel_uart/kernel text=0x3f8090 data=0x5b678+0x38cc8 syms=[0x8+0x6ee08+0x > 8+0x609a9] > / > > And nothing else appears on the console. > > > freebsd # dmesg | grep -i ^sio > freebsd # dmesg | grep -i ^uart > uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0 > uart0: [FILTER] > uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0 > uart1: [FILTER] > freebsd # dmesg | grep -i 1655 > uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0 > uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0 > freebsd # dmesg | grep -i console > sc0: <System console> at flags 0x100 on isa0 > sc0: VGA <16 virtual consoles, flags=0x300> > > echo hello >> /dev/console > doesn't show up. Nothing shows up on console. > > And looking at files in /var/log I found: > > getty[789]: open /dev/ttyd0: No such file or directory > getty[790]: open /dev/ttyd1: No such file or directory > > > Looks to me like commenting out sio doesn't work so well on my box. > Is that the wrong way to switch from sio to uart? > Is there something else I need to change instead, or in addition? You'll also need to add the uart hints to your device.hints file, if you haven't already, and update /etc/ttys. hint.uart.0.at="isa" hint.uart.0.port="0x3F8" hint.uart.0.flags="0x10" hint.uart.0.irq="4" /etc/ttys: you may find you need to change "ttyd0" to "ttyu0". Gavin
> > > This looks like it may be some bad > > > interaction between the firewire stack and using a serial > > > console. To submitter: It may be worth while switching to > > > uart(4) rather than sio(4) for your serial ports and seeing > > > if that makes any difference (as I don't think uart(4) uses > > > the Giant lock). > > Looks to me like commenting out sio doesn't work so well on my box. > > Is that the wrong way to switch from sio to uart? > > Is there something else I need to change instead, or in addition? > > You'll also need to add the uart hints to your device.hints file, if you > haven't already, and update /etc/ttys. > > hint.uart.0.at="isa" > hint.uart.0.port="0x3F8" > hint.uart.0.flags="0x10" > hint.uart.0.irq="4" > > /etc/ttys: you may find you need to change "ttyd0" to "ttyu0". > > Gavin Thanks, with those changes it works with uart now. The bad news is that switching from sio to uart didn't fix the CPU hogging problem.
> Looks like the kernel printf takes a *lot* of CPU, and > somehow locks out Ethernet. Looking through man pages working on an unrelated problem, I found: man timed says: Messages printed by the kernel on the system console occur with inter- rupts disabled. This means that the clock stops while they are printing. A machine with many disk or network hardware problems and consequent mes- sages cannot keep good time by itself. Each message typically causes the clock to lose a dozen milliseconds. If a kernel printf disables Ethernet interrupts for 12 milliseconds, that would create the problem.
New data: USB causes the same problem that firewire does. FreeBSD 8.0 amd64 nfe0: <NVIDIA nForce4 CK804 MCP9 Networking Adapter> port 0xb400-0xb407=20 mem 0xfebf9000-0xfebf9fff irq 23 at device 10.0 on pci0 bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev.=20 0x4101> mem 0xfe4f0000-0xfe4fffff irq 19 at device 0.0 on pci5 fwohci1: <NEC uPD72871/2> mem 0xfdeff000-0xfdefffff irq 19 at device=20 8.0 on pci2 ehci0: <NVIDIA nForce4 USB 2.0 controller> mem 0xfebfe000-0xfebfe0ff=20 irq 22 at device 2.1 on pci0 Plugging in a USB_to_RS-232 bridge generates: ugen0.2: <Prolific Technology Inc.> at usbus0 uplcom0: <Prolific Technology Inc. USB-Serial Controller D, class 0/0,=20 rev 1.10/ 4.00, addr 2> on usbus0 And networking is locked out for too long, and data is lost. (The node=20 sending data via Ethernet has too small a transmit buffer, but is a closed source=20 closed hardware black box and cannot be fixed. Once data is lost it is lost forever,=20 it cannot be recreated, so this is very bad.) The FreeBSD box receiving data has a=20 very large receive buffer, but if another device driver locks out networking it=20 doesn't help. Changing the printf(9)s to log(9)s fixes the problem, but this is not a=20 good workaround, there are LOTS of printfs in the kernel, I keep hitting new ones. =20 Printf(9) alone isn't the problem, adding printfs to chown(2) does not cause the problem, but=20 printfs from device drivers do. (usb, firewire, ...) My theory is there is a locking conflict. Here is lock profiling data for USB_RS-232 bridge being plugged in=20 printfs locking out networking: debug.lock.prof.enable: 0 debug.lock.prof.reset: 0 debug.lock.prof.stats: max wait_max total wait_total count avg wait_avg=20 cnt_hold cnt_lock name 251240 0 499982 0 303 1650 0 0= =20 0 /usr/src/sys/dev/usb/usb_request.c:322 (sx:0123456789ABCDEF -=20 USB device SX lock) 104255 0 110429 0 2 55214 0 0= =20 0 /usr/src/sys/dev/usb/usb_device.c:2459 (sx:123456789ABCDEF - USB= =20 config SX lock) 104217 0 29670949 0 47087 630 0 0= =20 0 /usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx) =20 <---------<<<< 101962 0 103089 0 27 3818 0 0= =20 0 /usr/src/sys/kern/kern_condvar.c:145 (sleep mutex:Giant) 51071 0 81088 0 3 27029 0 0= =20 0 /usr/src/sys/kern/kern_cons.c:422 (spin mutex:cnputs_mtx) 23049 0 31351 0 2 15675 0 0= =20 0 /usr/src/sys/kern/vfs_syscalls.c:3508 (lockmgr:ufs) 12408 0 978544 0 971 1007 0 0= =20 0 /usr/src/sys/kern/vfs_vnops.c:604 (lockmgr:ufs) 11466 0 744372 0 11567 64 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:2559 (lockmgr:bufwait) 11141 0 501155 0 3850 130 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:1835 (lockmgr:bufwait) 6115 0 61720 0 26 2373 0 0= =20 0 /usr/src/sys/dev/usb/usb_request.c:322 (sx:123456789ABCDEF - USB= =20 device SX lock) 5551 0 16749 0 19 881 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:syncer) 3043 0 3749 0 6 624 0 0= =20 0 /usr/src/sys/ufs/ffs/ffs_vfsops.c:1321 (sleep mutex:struct mount= =20 mtx) 2333 0 9901 0 36 275 0 0= =20 0 /usr/src/sys/kern/vfs_mount.c:2231 (sleep mutex:struct mount mtx) 2055 0 9462 0 100 94 0 0= =20 0 /usr/src/sys/kern/subr_hints.c:117 (sleep mutex:kernel=20 environment) 1649 0 79272 0 3846 20 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:2963 (sleep mutex:vm object) 1636 0 36196 0 15515 2 0 0= =20 0 /usr/src/sys/vm/vm_page.c:1052 (sleep mutex:vm page queue free=20 mutex) 1205 0 55601 0 15411 3 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:2545 (sleep mutex:bufobj interlock) 1076 0 1076 0 1 1076 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:ufs) 1069 0 31117 0 30 1037 0 0= =20 0 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:nfe0) 848 0 3752 0 1359 2 0 0= =20 0 /usr/src/sys/ufs/ffs/ffs_vfsops.c:1297 (sleep mutex:vnode=20 interlock) 754 0 10016 0 3846 2 0 0= =20 0 /usr/src/sys/ufs/ffs/ffs_softdep.c:4274 (sleep mutex:Softdep=20 Lock) 739 0 2463 0 4 615 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:devfs) 540 0 3370 0 1365 2 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:3179 (sleep mutex:vnode interlock) 525 0 80852 0 160 505 0 0= =20 0 /usr/src/sys/dev/uart/uart_cpu.h:92 (spin mutex:uart_hwmtx) 512 0 2065 0 23 89 0 0= =20 0 /usr/src/sys/kern/vfs_default.c:599 (lockmgr:bufwait) 487 0 97058 0 481 201 0 0= =20 0 /usr/src/sys/vm/vm_pager.c:311 (lockmgr:bufwait) 467 0 2069 0 12 172 0 0= =20 0 /usr/src/sys/dev/usb/usb_busdma.c:547 (sleep mutex:uplcom) 455 0 2978 0 18 165 0 0= =20 0 /usr/src/sys/dev/usb/usb_transfer.c:286 (sleep mutex:uplcom) 423 0 727 0 2 363 0 0= =20 0 /usr/src/sys/vm/uma_core.c:1565 (sleep mutex:UMA lock) 367 0 230166 0 47087 4 0 0= =20 0 /usr/src/sys/netinet/tcp_usrreq.c:708 (rw:tcpinp) 358 0 1387 0 11 126 0 0= =20 0 /usr/src/sys/kern/kern_synch.c:241 (sleep mutex:Giant) 339 0 72085 0 46119 1 0 0= =20 0 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:so_rcv) 332 0 225434 0 93206 2 0 0= =20 0 /usr/src/sys/kern/uipc_socket.c:1441 (sleep mutex:so_rcv) 332 0 19529 0 11551 1 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1927 (sleep mutex:bufobj interlock) 331 0 162523 0 94887 1 0 0= =20 0 /usr/src/sys/netinet/tcp_output.c:280 (sleep mutex:so_snd) 330 0 73512 0 47087 1 0 0= =20 0 /usr/src/sys/kern/uipc_socket.c:1843 (sleep mutex:so_rcv) 329 0 88069 0 47771 1 0 0= =20 0 /usr/src/sys/kern/uipc_socket.c:1687 (sleep mutex:so_rcv) 329 0 348 0 12 29 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1955 (sleep mutex:Syncer mtx) 320 0 66185 0 3353 19 0 0= =20 0 /usr/src/sys/kern/vfs_cluster.c:868 (lockmgr:bufwait) 303 0 7410 0 4360 1 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:3944 (sleep mutex:bufobj interlock) debug.lock.prof.rejected: 0 debug.lock.prof.skipcount: 0 debug.lock.prof.skipspin: 0 Here is lock profiling data for firewire bus reset printfs locking out=20 Ethernet: max wait_max total wait_total count avg wait_avg=20 cnt_hold cnt_lock name 373345 0 397495 0 643 618 0 0= =20 0 /usr/src/sys/kern/vfs_lookup.c:497 (lockmgr:ufs) 373325 0 390815 0 3140 124 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:2218 (sleep mutex:vnode interlock) 359514 0 32199808 0 45582 706 0 0= =20 0 /usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx) =20 <---------<<<< 63827 320 746196 740 70067 10 0 0= =20 4 /usr/src/sys/dev/firewire/fwohci.c:2107 (sleep mutex:firewire) 49927 0 165929 0 8 20741 0 0= =20 0 /usr/src/sys/kern/vfs_syscalls.c:3508 (lockmgr:ufs) 33358 0 1707461 0 1106 1543 0 0= =20 0 /usr/src/sys/kern/vfs_vnops.c:604 (lockmgr:ufs) 30294 0 401224 0 2641 151 0 0= =20 0 /usr/src/sys/vm/vm_map.c:3526 (sx:user map) 28432 0 934681 0 74 12630 0 0= =20 0 /usr/src/sys/kern/kern_cons.c:422 (spin mutex:cnputs_mtx) 28296 0 66892 0 2641 25 0 0= =20 0 /usr/src/sys/vm/vm_fault.c:297 (sleep mutex:vm object) 25020 0 26667 0 45 592 0 0= =20 0 /usr/src/sys/kern/kern_exit.c:825 (sx:proctree) 19027 0 1225496 0 11937 102 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:2559 (lockmgr:bufwait) 18887 0 906297 0 4140 218 0 0= =20 0 /usr/src/sys/kern/vfs_bio.c:1835 (lockmgr:bufwait) 12101 0 23459 0 1675 14 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:2342 (sleep mutex:vnode interlock) 7448 0 170841 0 2229 76 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:2083 (lockmgr:ufs) 7342 0 21792 0 2169 10 0 0= =20 0 /usr/src/sys/kern/vfs_cache.c:390 (rw:Name Cache) 4648 0 10831 0 11 984 0 0= =20 0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:syncer) 3536 0 35352 0 9218 3 0 0= =20 0 /usr/src/sys/amd64/amd64/pmap.c:3989 (sleep mutex:pmap) 3039 0 34283 0 2641 12 0 0= =20 0 /usr/src/sys/vm/vm_fault.c:937 (sleep mutex:vm object) 2825 0 53949 0 3495 15 0 0= =20 0 /usr/src/sys/vm/vm_fault.c:1010 (sleep mutex:vm object) 2286 0 7169 0 547 13 0 0= =20 0 /usr/src/sys/kern/kern_sig.c:983 (sleep mutex:process lock) 2083 4010 9037 5380 2143 4 2 0= =20 2 /usr/src/sys/kern/vfs_subr.c:2118 (sleep mutex:vnode interlock) 2083 57 2089 57 4 522 14 0= =20 1 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:pipe mutex) 2022 0 181449 0 515 352 0 0= =20 0 /usr/src/sys/vm/vm_pager.c:311 (lockmgr:bufwait) 1997 0 3397 0 2 1698 0 0= =20 0 /usr/src/sys/kern/kern_sysctl.c:1513 (sx:sysctl mem) 1994 0 4890 0 117 41 0 0= =20 0 /usr/src/sys/kern/kern_sysctl.c:1521 (sx:sysctl lock) 1971 0 4156 0 13 319 0 0= =20 0 /usr/src/sys/kern/kern_sysctl.c:1417 (sleep mutex:Giant) 1965 0 31838 0 3471 9 0 0= =20 0 /usr/src/sys/vm/vm_object.c:482 (sleep mutex:vm object) 1926 0 17695 0 83 213 0 0= =20 0 /usr/src/sys/vm/vm_object.c:541 (sleep mutex:vm object) There is at least one line in common that looks networking related,=20 /usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx) int sblock(struct sockbuf *sb, int flags) { KASSERT((flags & SBL_VALID) =3D=3D flags, ("sblock: flags invalid (0x%x)", flags)); if (flags & SBL_WAIT) { if ((sb->sb_flags & SB_NOINTR) || (flags & SBL_NOINTR)) { sx_xlock(&sb->sb_sx); return (0); } return (sx_xlock_sig(&sb->sb_sx)); } else { if (sx_try_xlock(&sb->sb_sx) =3D=3D 0) return (EWOULDBLOCK); return (0); } } It is not clear to me what this code is doing. (comments might help) It is not clear to me why networking and a USB_to_RS-232 bridge would have a lock conflict. It is not clear to me why networking and firewire would have a lock conflict. How do we fix this?
For bugs matching the following criteria: Status: In Progress Changed: (is less than) 2014-06-01 Reset to default assignee and clear in-progress tags. Mail being skipped
Created attachment 257900 [details] firewire diff ^Triage: rebase patch.
The firewire driver is deprecated and slated to be removed before 16.0, and there hasn't been any activity here for a long time. The discussion suggests that there were generic SMP stability problems at play here too, which should no longer be relevant. So, I think there isn't much value in keeping this report open.