Bug 118093 - [firewire] firewire bus reset hogs CPU, causing data to be lost
Summary: [firewire] firewire bus reset hogs CPU, causing data to be lost
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-16 22:20 UTC by Dieter
Modified: 2018-01-03 05:16 UTC (History)
0 users

See Also:


Attachments
firewire.diff (8.94 KB, patch)
2009-01-05 18:23 UTC, sean.bruno
no flags Details | Diff
firewire.diff (8.94 KB, patch)
2009-01-05 18:56 UTC, sean.bruno
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dieter 2007-11-16 22:20:01 UTC
When a firewire device is powered on or off, the firewire bus is reset.
A firewire bus reset hogs the CPU for too long, more than 0.1 second, causing
data to be lost.  Disk and Ethernet activity (and probably all i/o) stop.
Stopping Ethernet for too long causes data to be lost.  Userland processes
also don't run, but they have to expect that, it is the devices not getting
serviced that I'm concerned about.

How-To-Repeat: Have some i/o going.
fwcontrol -r
observe that all i/o stops for 1/10 second
Comment 1 Dieter 2008-12-14 19:30:04 UTC
I found the source of this problem.  When a firewire bus resets,
the firewire driver prints a few lines to the console,
using printf(9) and device_printf(9).  I suspect that these are
running at splfw aka splimp, locking out other i/o.

Commenting out the *printf() calls fixes the problem, but that
isn't a good solution.

Would changing the *printf() calls to log(9) calls be safe?
("safe" meaning other i/o doesn't get locked out)

Ah, for the good old days when 19200 baud seemed fast...
Comment 2 Sean Bruno 2008-12-16 05:14:32 UTC
Dieter wrote:
> I found the source of this problem.  When a firewire bus resets,
> the firewire driver prints a few lines to the console,
> using printf(9) and device_printf(9).  I suspect that these are
> running at splfw aka splimp, locking out other i/o.
>
> Commenting out the *printf() calls fixes the problem, but that
> isn't a good solution.
>
> Would changing the *printf() calls to log(9) calls be safe?
> ("safe" meaning other i/o doesn't get locked out)
>
> Ah, for the good old days when 19200 baud seemed fast...
> _______________________________________________
>   
Which one are you looking at Dieter?

Sean
Comment 3 ikob 2008-12-16 05:17:10 UTC
Hi, 

I believe no side effect is anticipated the message.
"side effect" includes locking i/o.

----
Katsushi Kobayashi




On 2008/12/15, at 4:30, Dieter wrote:

> I found the source of this problem.  When a firewire bus resets,
> the firewire driver prints a few lines to the console,
> using printf(9) and device_printf(9).  I suspect that these are
> running at splfw aka splimp, locking out other i/o.
>
> Commenting out the *printf() calls fixes the problem, but that
> isn't a good solution.
>
> Would changing the *printf() calls to log(9) calls be safe?
> ("safe" meaning other i/o doesn't get locked out)
>
> Ah, for the good old days when 19200 baud seemed fast...
> _______________________________________________
> freebsd-firewire@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-firewire
> To unsubscribe, send any mail to "freebsd-firewire-unsubscribe@freebsd.org"
Comment 4 M. Warner Losh 2008-12-16 06:01:30 UTC
> I found the source of this problem.  When a firewire bus resets,
> the firewire driver prints a few lines to the console,
> using printf(9) and device_printf(9).  I suspect that these are
> running at splfw aka splimp, locking out other i/o.

This can't be the case.  There's no SPL involved at all.  Maybe
removing the printfs causes an interrupt to be serviced faster,
resulting in what appears to be better performance...

> Commenting out the *printf() calls fixes the problem, but that
> isn't a good solution.
> 
> Would changing the *printf() calls to log(9) calls be safe?
> ("safe" meaning other i/o doesn't get locked out)
> 
> Ah, for the good old days when 19200 baud seemed fast...
> _______________________________________________
> freebsd-drivers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-drivers
> To unsubscribe, send any mail to "freebsd-drivers-unsubscribe@freebsd.org"
> 
>
Comment 5 Dieter 2008-12-16 19:29:13 UTC
Sean>  Which file in dev/firewire are you looking at?

fwohci.c  and  firewire.c

examples:

printf("non CYCLEMASTER mode\n");

device_printf(fc->dev, "Initiate bus reset\n");

-------------------

Warner>  This can't be the case.  There's no SPL involved at all.  Maybe
Warner>  removing the printfs causes an interrupt to be serviced faster,
Warner>  resulting in what appears to be better performance...

With the printfs, Ethernet is not getting serviced.  This
is repeatable and easily reproduced.   Without the printfs,
it seems ok.

If it isn't spl, what is locking out Ethernet?
Comment 6 M. Warner Losh 2008-12-17 06:59:41 UTC
In message: <200812170329.DAA17848@sopwith.solgatos.com>
            Dieter <freebsd@sopwith.solgatos.com> writes:
: Sean>  Which file in dev/firewire are you looking at?
: 
: fwohci.c  and  firewire.c
: 
: examples:
: 
: printf("non CYCLEMASTER mode\n");
: 
: device_printf(fc->dev, "Initiate bus reset\n");
: 
: -------------------
: 
: Warner>  This can't be the case.  There's no SPL involved at all.  Maybe
: Warner>  removing the printfs causes an interrupt to be serviced faster,
: Warner>  resulting in what appears to be better performance...
: 
: With the printfs, Ethernet is not getting serviced.  This
: is repeatable and easily reproduced.   Without the printfs,
: it seems ok.
: 
: If it isn't spl, what is locking out Ethernet?

interrupt storm?  In old-spl-locked drivers, often times the interrupt
would be masked while certain operations happened.  In new
mutex-locked freebsd, there's no way to block the interrupts, so
sometimes old code gets into an interrupt storm, which starves other
things.  Not sure why printf would trigger this, however...

Warner
Comment 7 Dieter 2008-12-18 00:51:41 UTC
> > : Warner>  This can't be the case.  There's no SPL involved at all.  Maybe
> > : Warner>  removing the printfs causes an interrupt to be serviced faster,
> > : Warner>  resulting in what appears to be better performance...
> > : 
> > : With the printfs, Ethernet is not getting serviced.  This
> > : is repeatable and easily reproduced.   Without the printfs,
> > : it seems ok.
> > : 
> > : If it isn't spl, what is locking out Ethernet?
> >
> > interrupt storm?  In old-spl-locked drivers, often times the interrupt
> > would be masked while certain operations happened.  In new
> > mutex-locked freebsd, there's no way to block the interrupts, so
> > sometimes old code gets into an interrupt storm, which starves other
> > things.  Not sure why printf would trigger this, however...

I tried logging in on the console and "cat /etc/termcap" does
not create a problem.  Systat -vmstat reports that IRQ 5 (console)
interrupts go from 0 to 120.  CPU:
	0.0%Sys   0.6%Intr  0.0%User  0.0%Nice 99.4%Idle

The normal firewire bus reset printfs generate 890 IRQ 5 interrupts
with the shell firing off "fwcontrol -u 1 -r" in a loop.
CPU goes from idle to 100% busy:
	98.1%Sys   1.7%Intr  0.3%User  0.0%Nice  0.0%Idle

"fwcontrol -u 0 -r" (the onboard firewire controller) gives
the same 890 IRQ 5 interrupts, but CPU usage is different:
	79.2%Sys  20.8%Intr  0.0%User  0.0%Nice  0.0%Idle

A couple of other IRQs go up to 30-40 range, nothing huge.

A single "fwcontrol -u 0 -r" generates 14 interrupts on IRQ 5
and 4 interrupts on 3 other IRQs.

A single "fwcontrol -u 1 -r" generates 14 interrupts on IRQ 5
and 3 interrupts on 2 other IRQs.  yawn

------

With the bus reset printfs commented out, I get:

-u 0

   80.0%Sys   4.8%Intr 13.9%User  0.0%Nice  1.2%Idle

   3300 interrupts on the onboard firewire controller

-u 1

   84.8%Sys   1.1%Intr 14.1%User  0.0%Nice  0.0%Idle

   3900 interrupts on the PCI card firewire controller

   After awhile the console says:

   fwohci1: phy read failed(1). i = 100 retry = 0
   fwohci1: phy read failed(1). i = 100 retry = 1
   fwohci1: phy read failed(1). i = 100 retry = 2
   fwohci1: phy read failed(1). i = 100 retry = 3
   ...

   and the system becomes unresponsive.  Reset button required.  :-(

Observations:

Kernel printfs to console appear to take more interrupts and
more CPU than userland output to console.

Seems odd that the onboard controller uses so much CPU for
interrupts in the with-printf case, but there wasn't a
large number on interrupts.  The without-printf case had
a LOT more interrupts from the fw controller.

Resetting the firewire bus in an endless loop is not recommended.

I'm not an expert on interrupt storms, but none of this looks
like an interrupt storm to me.  (Assuming systat is trustworthy.)

Looks like the kernel printf takes a *lot* of CPU, and
somehow locks out Ethernet.
Comment 8 Sean Bruno 2008-12-18 01:10:53 UTC
M. Warner Losh wrote:
> In message: <200812170329.DAA17848@sopwith.solgatos.com>
>             Dieter <freebsd@sopwith.solgatos.com> writes:
> : Sean>  Which file in dev/firewire are you looking at?
> : 
> : fwohci.c  and  firewire.c
> : 
> : examples:
> : 
> : printf("non CYCLEMASTER mode\n");
> : 
> : device_printf(fc->dev, "Initiate bus reset\n");
> : 
> : -------------------
> : 
> : Warner>  This can't be the case.  There's no SPL involved at all.  Maybe
> : Warner>  removing the printfs causes an interrupt to be serviced faster,
> : Warner>  resulting in what appears to be better performance...
> : 
> : With the printfs, Ethernet is not getting serviced.  This
> : is repeatable and easily reproduced.   Without the printfs,
> : it seems ok.
> : 
> : If it isn't spl, what is locking out Ethernet?
>
> interrupt storm?  In old-spl-locked drivers, often times the interrupt
> would be masked while certain operations happened.  In new
> mutex-locked freebsd, there's no way to block the interrupts, so
> sometimes old code gets into an interrupt storm, which starves other
> things.  Not sure why printf would trigger this, however...
>
> Warner
> _______________________________________________
> freebsd-drivers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-drivers
> To unsubscribe, send any mail to "freebsd-drivers-unsubscribe@freebsd.org"
>   
Let me take a stab at this one this week.  There's got to be something 
more going on
than a printf() causing hell.  It could be a mutex being held causing 
nightmares.

The firewire code is my current project, I'll look at this issue now.

Sean
Comment 9 Dieter 2008-12-19 11:02:21 UTC
Here are the results of two more experiments:

I tried to verify what spl the firewire code is at when calling
printf(9) by adding the following block into fwohci.c just
before one of the printfs.

{
#include <sys/types.h>
#include <sys/systm.h>
  intrmask_t debug_spl;  /* __uint32_t */
  intrmask_t debug_spl_high;
  intrmask_t debug_spl_tty;
  intrmask_t debug_spl_fw;
  intrmask_t debug_spl_0;
  intrmask_t debug_spl_net;

  debug_spl = splhigh();
  debug_spl_high = spltty();
  debug_spl_tty = splfw();
  debug_spl_fw = splnet();
  debug_spl_net = splhigh();
  spl0();  /* void */
  debug_spl_0 = splhigh();
  splx(debug_spl);

  printf("fwohci_intr_core(): spl = 0x%x\n splhigh=0x%x spltty=0x%x splfw=0x%x splnet=0x%x spl0=0x%x\n",
         debug_spl, debug_spl_high, debug_spl_tty, debug_spl_fw, debug_spl_net, debug_spl_0);
}

But my results appear bogus:

fwohci_intr_core(): spl = 0x0
 splhigh=0x0 spltty=0x0 splfw=0x0 splnet=0x0 spl0=0x0

I have examined my code and the spl(9) man page several times
but I can't find what is wrong.  Any clues?

--------------------------------------------------------

To isolate the effects of printf(9) from the firewire bus reset,
I picked a trivial system call (chown(2)) and added some printf(9)
calls.

Calling chown several times and monitoring with systat -vmstat gives:
  1098 interrupts on the console IRQ
  93.1%Sys   6.7%Intr  0.2%User  0.0%Nice  0.0%Idle

This did NOT interfere with Ethernet.

So printf(9) interferes with Ethernet when called from the
firewire driver, but not when called from a vanilla system call.

--------------------------------------------------------

sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
sio0: [FILTER]

What does [FILTER] mean?  I don't see an explanation on the sio man page.
Comment 10 Dieter 2008-12-20 21:24:23 UTC
In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes:

> I setup my system to execute a bus reset every 1 second, simultaneously, 
> I started copying a large
> file onto the system see if anything would happen.
> 
> I saw no change to copy speed reported by SCP during the entire 
> transaction.  I also see no change
> to the system load while this is occurring.

Does your system have a RS-232 console or VGA console?
The problem might be specific to RS-232.

> This seems indicative of "something else" going on.

Any idea what this "something else" might be?

Does anyone have a clue why my spl calls are returning 0 ?
Comment 11 Sean Bruno 2008-12-21 02:49:21 UTC
Dieter wrote:
> Here are the results of two more experiments:
>
> I tried to verify what spl the firewire code is at when calling
> printf(9) by adding the following block into fwohci.c just
> before one of the printfs.
>
> {
> #include <sys/types.h>
> #include <sys/systm.h>
>   intrmask_t debug_spl;  /* __uint32_t */
>   intrmask_t debug_spl_high;
>   intrmask_t debug_spl_tty;
>   intrmask_t debug_spl_fw;
>   intrmask_t debug_spl_0;
>   intrmask_t debug_spl_net;
>
>   debug_spl = splhigh();
>   debug_spl_high = spltty();
>   debug_spl_tty = splfw();
>   debug_spl_fw = splnet();
>   debug_spl_net = splhigh();
>   spl0();  /* void */
>   debug_spl_0 = splhigh();
>   splx(debug_spl);
>
>   printf("fwohci_intr_core(): spl = 0x%x\n splhigh=0x%x spltty=0x%x splfw=0x%x splnet=0x%x spl0=0x%x\n",
>          debug_spl, debug_spl_high, debug_spl_tty, debug_spl_fw, debug_spl_net, debug_spl_0);
> }
>
> But my results appear bogus:
>
> fwohci_intr_core(): spl = 0x0
>  splhigh=0x0 spltty=0x0 splfw=0x0 splnet=0x0 spl0=0x0
>
> I have examined my code and the spl(9) man page several times
> but I can't find what is wrong.  Any clues?
>
> --------------------------------------------------------
>
> To isolate the effects of printf(9) from the firewire bus reset,
> I picked a trivial system call (chown(2)) and added some printf(9)
> calls.
>
> Calling chown several times and monitoring with systat -vmstat gives:
>   1098 interrupts on the console IRQ
>   93.1%Sys   6.7%Intr  0.2%User  0.0%Nice  0.0%Idle
>
> This did NOT interfere with Ethernet.
>
> So printf(9) interferes with Ethernet when called from the
> firewire driver, but not when called from a vanilla system call.
>
>   
I setup my system to execute a bus reset every 1 second, simultaneously, 
I started copying a large
file onto the system see if anything would happen.

I saw no change to copy speed reported by SCP during the entire 
transaction.  I also see no change
to the system load while this is occurring.

This seems indicative of "something else" going on.
> --------------------------------------------------------
>
> sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
> sio0: type 16550A, console
> sio0: [FILTER]
>
> What does [FILTER] mean?  I don't see an explanation on the sio man page.
>   

Did you mean to ask a different list?  Because I have no idea.  :)
Comment 12 Sean Bruno 2008-12-23 03:21:49 UTC
Dieter wrote:
> In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes:
>
>   
>> I setup my system to execute a bus reset every 1 second, simultaneously, 
>> I started copying a large
>> file onto the system see if anything would happen.
>>
>> I saw no change to copy speed reported by SCP during the entire 
>> transaction.  I also see no change
>> to the system load while this is occurring.
>>     
>
> Does your system have a RS-232 console or VGA console?
> The problem might be specific to RS-232.
>
>   
I have both.  I disabled my serial console to test, no change when using
the VGA console only.

>> This seems indicative of "something else" going on.
>>     
>
> Any idea what this "something else" might be?
>
> Does anyone have a clue why my spl calls are returning 0 ?
>
>   
I confirmed that spl's are complete no-ops since rel 5.  So, you want to 
ignore
them as they are just markers now where locking should be implemented.

I went back in your original submission and I can see a significant drop 
in I/O
when resetting the F/W bus.   If I have the following running:
dd if=/dev/zero of=/dev/ad6 bs=64k

Then I see about 64MB/S normally.  When I reset any firewire bus, I see 
the throughput
drop as reported by iostat by a significant amount(2-4 MB/s).

I am assuming that the firewire driver is holding a very heavy lock here 
that is cascading
down to the printf().  At least, that would be my ASSumption.  :)

I will continue to poke around.  You are not crazy Dieter.

Sean
Comment 13 Sean Bruno 2008-12-23 06:01:55 UTC
Sean Bruno wrote:
> Dieter wrote:
>> In message <494DAEB1.1070909@miralink.com>, Sean Bruno writes:
>>
>>  
>>> I setup my system to execute a bus reset every 1 second, 
>>> simultaneously, I started copying a large
>>> file onto the system see if anything would happen.
>>>
>>> I saw no change to copy speed reported by SCP during the entire 
>>> transaction.  I also see no change
>>> to the system load while this is occurring.
>>>     
>>
>> Does your system have a RS-232 console or VGA console?
>> The problem might be specific to RS-232.
>>
>>   
> I have both.  I disabled my serial console to test, no change when using
> the VGA console only.
>
>>> This seems indicative of "something else" going on.
>>>     
>>
>> Any idea what this "something else" might be?
>>
>> Does anyone have a clue why my spl calls are returning 0 ?
>>
>>   
> I confirmed that spl's are complete no-ops since rel 5.  So, you want 
> to ignore
> them as they are just markers now where locking should be implemented.
>
> I went back in your original submission and I can see a significant 
> drop in I/O
> when resetting the F/W bus.   If I have the following running:
> dd if=/dev/zero of=/dev/ad6 bs=64k
>
> Then I see about 64MB/S normally.  When I reset any firewire bus, I 
> see the throughput
> drop as reported by iostat by a significant amount(2-4 MB/s).
>
> I am assuming that the firewire driver is holding a very heavy lock 
> here that is cascading
> down to the printf().  At least, that would be my ASSumption.  :)
>
> I will continue to poke around.  You are not crazy Dieter.
>
> Sean
>
>
Hrm ... Looking over the log messages that are generated during a bus 
reset, there's
a mish-mash of printf() and device_printf() calls.  I'm not sure what 
the impact of that
is to real behavior, but /var/log/messages has a tendency to get garbled 
like this:

Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset
Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset
Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, 
CYCLEMASTER mode
Dec 22 16:00:18 home-test kernel: firewi
Dec 22 16:00:18 home-test kernel: re1:
Dec 22 16:00:18 home-test kernel: 1 n
Dec 22 16:00:18 home-test kernel: odes
Dec 22 16:00:18 home-test kernel: , ma
Dec 22 16:00:18 home-test kernel: xhop
Dec 22 16:00:18 home-test kernel: <=
Dec 22 16:00:18 home-test kernel: 0, c
Dec 22 16:00:18 home-test kernel: able
Dec 22 16:00:18 home-test kernel: IRM
Dec 22 16:00:18 home-test kernel: = 0
Dec 22 16:00:18 home-test kernel: (me
Dec 22 16:00:18 home-test kernel: )
Dec 22 16:00:18 home-test kernel: f
Dec 22 16:00:18 home-test kernel: irew
Dec 22 16:00:18 home-test kernel: ire1
Dec 22 16:00:18 home-test kernel: : bu
Dec 22 16:00:18 home-test kernel: s ma
Dec 22 16:00:18 home-test kernel: nage
Dec 22 16:00:18 home-test kernel: r 0
Dec 22 16:00:18 home-test kernel: (me)
Dec 22 16:00:18 home-test kernel:


Let me try to eliminate this behaviour first.

Sean
Comment 14 Dieter 2008-12-25 22:23:09 UTC
> > I confirmed that spl's are complete no-ops since rel 5.  So, you want 
> > to ignore
> > them as they are just markers now where locking should be implemented.

I hunted down the spl code, and you're right.  Wow, I wonder how drivers
still using spl calls work at all?

> is to real behavior, but /var/log/messages has a tendency to get garbled 
> like this:
> 
> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset
> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset
> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, 
> CYCLEMASTER mode
> Dec 22 16:00:18 home-test kernel: firewi
> Dec 22 16:00:18 home-test kernel: re1:
> Dec 22 16:00:18 home-test kernel: 1 n
> Dec 22 16:00:18 home-test kernel: odes
> Dec 22 16:00:18 home-test kernel: , ma
> Dec 22 16:00:18 home-test kernel: xhop
> Dec 22 16:00:18 home-test kernel: <=
> Dec 22 16:00:18 home-test kernel: 0, c
> Dec 22 16:00:18 home-test kernel: able

Do the lines get folded on the console, or only in /var/log/messages?
Comment 15 Dieter 2008-12-28 21:49:00 UTC
> >>> I confirmed that spl's are complete no-ops since rel 5.  So, you want 
> >>> to ignore
> >>> them as they are just markers now where locking should be implemented.
> >>>       
> >
> > I hunted down the spl code, and you're right.  Wow, I wonder how drivers
> > still using spl calls work at all?
> >
> >   
> I believe that the spl() calls are just left there as a hint where 
> locking should be.
> 
> As far as I understand, we need to pay attention to the mutex locks.

I'll rephrase my question.  In the old days, locking was done with spl.
The new way is with mutex.  But with the spl calls being replaced with
noops, and as far as I can tell the driver is not using mutex, there
doesn't appear to be any locking.  So the driver can step on itself.


> >> is to real behavior, but /var/log/messages has a tendency to get garbled 
> >> like this:
> >>
> >> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset
> >> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset
> >> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, 
> >> CYCLEMASTER mode
> >> Dec 22 16:00:18 home-test kernel: firewi
> >> Dec 22 16:00:18 home-test kernel: re1:
> >> Dec 22 16:00:18 home-test kernel: 1 n
> >> Dec 22 16:00:18 home-test kernel: odes
> >> Dec 22 16:00:18 home-test kernel: , ma
> >> Dec 22 16:00:18 home-test kernel: xhop
> >> Dec 22 16:00:18 home-test kernel: <=
> >> Dec 22 16:00:18 home-test kernel: 0, c
> >> Dec 22 16:00:18 home-test kernel: able
> >>     
> >
> > Do the lines get folded on the console, or only in /var/log/messages?
> >   
> 
> As far as I can see, the console messages are fine.  It's only the 
> messages that get
> garbled.

Perhaps an artifact of syslogd?
Comment 16 Sean Bruno 2008-12-28 22:15:36 UTC
Dieter wrote:
>>> I confirmed that spl's are complete no-ops since rel 5.  So, you want 
>>> to ignore
>>> them as they are just markers now where locking should be implemented.
>>>       
>
> I hunted down the spl code, and you're right.  Wow, I wonder how drivers
> still using spl calls work at all?
>
>   
I believe that the spl() calls are just left there as a hint where 
locking should be.

As far as I understand, we need to pay attention to the mutex locks.
>> is to real behavior, but /var/log/messages has a tendency to get garbled 
>> like this:
>>
>> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset
>> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset
>> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, 
>> CYCLEMASTER mode
>> Dec 22 16:00:18 home-test kernel: firewi
>> Dec 22 16:00:18 home-test kernel: re1:
>> Dec 22 16:00:18 home-test kernel: 1 n
>> Dec 22 16:00:18 home-test kernel: odes
>> Dec 22 16:00:18 home-test kernel: , ma
>> Dec 22 16:00:18 home-test kernel: xhop
>> Dec 22 16:00:18 home-test kernel: <=
>> Dec 22 16:00:18 home-test kernel: 0, c
>> Dec 22 16:00:18 home-test kernel: able
>>     
>
> Do the lines get folded on the console, or only in /var/log/messages?
>   

As far as I can see, the console messages are fine.  It's only the 
messages that get
garbled.

Sean
Comment 17 Dieter 2008-12-28 23:30:55 UTC
> > I'll rephrase my question.  In the old days, locking was done with spl.
> > The new way is with mutex.  But with the spl calls being replaced with
> > noops, and as far as I can tell the driver is not using mutex, there
> > doesn't appear to be any locking.  So the driver can step on itself.
> 
> 
> Well, there is locking around a couple of mutex's via FW_GLOCK().

Ah, I wasn't grepping for the right string.  So there *is* mutex
locking.  Although the lingering spl calls are still troubling.
Comment 18 Sean Bruno 2008-12-29 06:22:26 UTC
Dieter wrote:
>>>
>>>   
>>>       
>> I believe that the spl() calls are just left there as a hint where 
>> locking should be.
>>
>> As far as I understand, we need to pay attention to the mutex locks.
>>     
>
> I'll rephrase my question.  In the old days, locking was done with spl.
> The new way is with mutex.  But with the spl calls being replaced with
> noops, and as far as I can tell the driver is not using mutex, there
> doesn't appear to be any locking.  So the driver can step on itself.
>
>   
Well, there is locking around a couple of mutex's via FW_GLOCK().

It appears that the locking is not robust, and that is one of the issues 
that I am
looking into right now.
>   
>>>> is to real behavior, but /var/log/messages has a tendency to get garbled 
>>>> like this:
>>>>
>>>> Dec 22 16:00:18 home-test kernel: fwohci1: Initiate bus reset
>>>> Dec 22 16:00:18 home-test kernel: fwohci1: BUS reset
>>>> Dec 22 16:00:18 home-test kernel: fwohci1: node_id=0xc800ffc0, gen=8, 
>>>> CYCLEMASTER mode
>>>> Dec 22 16:00:18 home-test kernel: firewi
>>>> Dec 22 16:00:18 home-test kernel: re1:
>>>> Dec 22 16:00:18 home-test kernel: 1 n
>>>> Dec 22 16:00:18 home-test kernel: odes
>>>> Dec 22 16:00:18 home-test kernel: , ma
>>>> Dec 22 16:00:18 home-test kernel: xhop
>>>> Dec 22 16:00:18 home-test kernel: <=
>>>> Dec 22 16:00:18 home-test kernel: 0, c
>>>> Dec 22 16:00:18 home-test kernel: able
>>>>     
>>>>         
>>> Do the lines get folded on the console, or only in /var/log/messages?
>>>   
>>>       
>> As far as I can see, the console messages are fine.  It's only the 
>> messages that get
>> garbled.
>>     
>
> Perhaps an artifact of syslogd?
>   
I doubt it.  I'm working on a patch that improves the locking a bit and 
does some
other "gross" things to try and keep things from flying apart.

I've SEEN this behaviour in my implementations with sbp_targ and 
couldn't pin it down.

Scott Long gave me a couple of pointers this evening, but I'm still 
working on locking
down the taskqueues and some of the callback_handlers.  There are some 
bad things going
on specifically during initialization that are pre-empting normal operation.

Sean
Comment 19 Dieter 2009-01-05 17:52:29 UTC
In message <1231181797.21260.8.camel@localhost.localdomain>, Sean Bruno writes:

> @@ -1388,6 +1387,7 @@
>  	struct fw_device *fwdev;
>  
>  	s = splfw();
> +	FW_GLOCK(fc);
>  	fc->status = FWBUSEXPLORE;
>  
>  	/* Invalidate all devices, just after bus reset. */
> @@ -1396,6 +1396,7 @@
>  			fwdev->status = FWDEVINVAL;
>  			fwdev->rcnt = 0;
>  		}
> +	FW_GUNLOCK(fc);
>  	splx(s);
>  
>  	wakeup((void *)fc);

If the (null these days) spl calls have been replaced with FW_GLOCK/FW_GUNLOCK
calls, can the spl calls be removed now?

> @@ -1922,6 +1924,8 @@
>  	u_int i;
>  	struct firewire_comm *fc = (struct firewire_comm *)sc;
>  
> +	FW_GLOCK(fc);
> +	FW_GUNLOCK(fc);
>  	if (stat & OHCI_INT_DMA_IR) {
>  		irstat = atomic_readandclear_int(&sc->irstat);
>  		for(i = 0; i < fc->nisodma ; i++){

This needs to be reviewed by someone who understands what is protected by
getting a lock and then immediately releasing it.  I have no clue.

> @@ -1969,8 +1973,8 @@
>  			OWRITE(sc, OHCI_LNKCTLCLR, OHCI_CNTL_CYCTIMER);
>  #endif
>  			OWRITE(sc, FWOHCI_INTMASKCLR,  OHCI_INT_CYC_LOST);
> -			device_printf(fc->dev, "too many cycle lost, "
> -			 "no cycle master presents?\n");
> +			device_printf(fc->dev, "%s: too many cycle lost, "
> +			 "no cycle master presents?\n", __func__);
>  		}

Should this be "too many cycles lost, " and  "no cycle master present?\n"

>  	uint32_t fun;
>  
> +	FW_GLOCK(fc);
>  	device_printf(fc->dev, "Initiate bus reset\n");
>  	sc = (struct fwohci_softc *)fc;
>  
> @@ -2487,6 +2495,7 @@
>  	fun |= FW_PHY_ISBR | FW_PHY_RHB;
>  	fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun);
>  #endif
> +	FW_GUNLOCK(fc);
>  }

Does the lock need to protect the printf?
Comment 20 sean.bruno 2009-01-05 18:23:56 UTC
Well ... this has really sent me down the rabbit hole the last couple of
days.

There is a need to audit all locking in the firewire stack right now and
I have started that task.

Essentially, threads, callouts, interrupts and task queues are all
jumping around causing context to be switched from one thread to
another.  It's kind of bad in there and I need to sort it out.

I'm working with a stable/7 tree, so I've started a patch for it.  This
patch has quite a few printf -> device_printf changes in it, so try to 
ignore thost for now.

The meat of the patch is a judicious implementation of FW_GLOCK() in
certain code areas.  Note that sometimes the code is just trying to get
the lock and then drops it immediately.  This is not very optimal, but
it does the trick.

I'm still seeing a high level of broken log messages
in /var/log/messages but this may help the issue you were seeing.  Give
it a spin.

Sean
Comment 21 sean.bruno 2009-01-05 18:56:37 UTC
Well ... this has really sent me down the rabbit hole the last couple of
days.

There is a need to audit all locking in the firewire stack right now and
I have started that task.

Essentially, threads, callouts, interrupts and task queues are all
jumping around causing context to be switched from one thread to
another.  It's kind of bad in there and I need to sort it out.

I'm working with a stable/7 tree, so I've started a patch for it.  This
patch has quite a few printf -> device_printf changes in it, so try to 
ignore thost for now.

The meat of the patch is a judicious implementation of FW_GLOCK() in
certain code areas.  Note that sometimes the code is just trying to get
the lock and then drops it immediately.  This is not very optimal, but
it does the trick.

I'm still seeing a high level of broken log messages
in /var/log/messages but this may help the issue you were seeing.  Give
it a spin.

Sean
Comment 22 sean.bruno 2009-01-06 06:20:03 UTC
On Mon, 2009-01-05 at 17:52 +0000, Dieter wrote:
> In message <1231181797.21260.8.camel@localhost.localdomain>, Sean Bruno writes:
> 
> > @@ -1388,6 +1387,7 @@
> >  	struct fw_device *fwdev;
> >  
> >  	s = splfw();
> > +	FW_GLOCK(fc);
> >  	fc->status = FWBUSEXPLORE;
> >  
> >  	/* Invalidate all devices, just after bus reset. */
> > @@ -1396,6 +1396,7 @@
> >  			fwdev->status = FWDEVINVAL;
> >  			fwdev->rcnt = 0;
> >  		}
> > +	FW_GUNLOCK(fc);
> >  	splx(s);
> >  
> >  	wakeup((void *)fc);
> 
> If the (null these days) spl calls have been replaced with FW_GLOCK/FW_GUNLOCK
> calls, can the spl calls be removed now?
> 
They most definitely can be removed.  However, they are a roadmap to
where locks SHOULD be.  I'm using them as guideposts through the code to
see if I can figure out what's missing.

> > @@ -1922,6 +1924,8 @@
> >  	u_int i;
> >  	struct firewire_comm *fc = (struct firewire_comm *)sc;
> >  
> > +	FW_GLOCK(fc);
> > +	FW_GUNLOCK(fc);
> >  	if (stat & OHCI_INT_DMA_IR) {
> >  		irstat = atomic_readandclear_int(&sc->irstat);
> >  		for(i = 0; i < fc->nisodma ; i++){
> 
> This needs to be reviewed by someone who understands what is protected by
> getting a lock and then immediately releasing it.  I have no clue.
> 
Indeed, it's gross.  I only sent this patch over as a "test" of sorts to
validate that the issue I'm looking into(locking in the firewire driver)
is actually something related to what you reported in this issue.

> > @@ -1969,8 +1973,8 @@
> >  			OWRITE(sc, OHCI_LNKCTLCLR, OHCI_CNTL_CYCTIMER);
> >  #endif
> >  			OWRITE(sc, FWOHCI_INTMASKCLR,  OHCI_INT_CYC_LOST);
> > -			device_printf(fc->dev, "too many cycle lost, "
> > -			 "no cycle master presents?\n");
> > +			device_printf(fc->dev, "%s: too many cycle lost, "
> > +			 "no cycle master presents?\n", __func__);
> >  		}
> 
> Should this be "too many cycles lost, " and  "no cycle master present?\n"
> 
Who knows.  I haven't reviewed this specific chunk of code to understand
what it really means as per the FW specs.
> >  	uint32_t fun;
> >  
> > +	FW_GLOCK(fc);
> >  	device_printf(fc->dev, "Initiate bus reset\n");
> >  	sc = (struct fwohci_softc *)fc;
> >  
> > @@ -2487,6 +2495,7 @@
> >  	fun |= FW_PHY_ISBR | FW_PHY_RHB;
> >  	fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun);
> >  #endif
> > +	FW_GUNLOCK(fc);
> >  }
> 
> Does the lock need to protect the printf?

These are gross, overpowered, way to heavy handed locks that I'm playing
with.  I need to prevent pre-emption of certain events while they are in
progress.  One of these events is the firewire's assertion of "bus
reset" on the firewire device.  I see the h/w interrupt firing before
this code can actually complete, causing the driver to be confused on
occasion.


Thanks for surveying this code with me, it helps to see what other
people's eyes can pick up.  I hope to have this in working order
soon-ish.

Sean
Comment 23 Dieter 2009-01-06 09:33:21 UTC
In message <1231222803.21260.17.camel@localhost.localdomain>, Sean Bruno writes:

> > >  	uint32_t fun;
> > >  
> > > +	FW_GLOCK(fc);
> > >  	device_printf(fc->dev, "Initiate bus reset\n");
> > >  	sc = (struct fwohci_softc *)fc;
> > >  
> > > @@ -2487,6 +2495,7 @@
> > >  	fun |= FW_PHY_ISBR | FW_PHY_RHB;
> > >  	fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun);
> > >  #endif
> > > +	FW_GUNLOCK(fc);
> > >  }
> > 
> > Does the lock need to protect the printf?
> 
> These are gross, overpowered, way to heavy handed locks that I'm playing
> with.  I need to prevent pre-emption of certain events while they are in
> progress.  One of these events is the firewire's assertion of "bus
> reset" on the firewire device.  I see the h/w interrupt firing before
> this code can actually complete, causing the driver to be confused on
> occasion.

I understand the basic concept of locking, or at least I did many many years
ago when spls really were levels.  Since then they changed spls to not be levels
(at least that's what the man page says) and later replaced with mutex.  I assume
these changes are at least in part to better support SMP.

What I don't understand is things like getting a lock and immediately releasing
it, which appears to me to protect nothing.  Or why the printf needs to be inside
the locked section of code.  I thought the goal was to hold a lock for as short a
time as possible, and it is not clear to me why the printf needs to be protected.
Comment 24 sean.bruno 2009-01-10 18:54:37 UTC
On Tue, 2009-01-06 at 09:33 +0000, Dieter wrote:
> In message <1231222803.21260.17.camel@localhost.localdomain>, Sean Bruno writes:
> 
> > > >  	uint32_t fun;
> > > >  
> > > > +	FW_GLOCK(fc);
> > > >  	device_printf(fc->dev, "Initiate bus reset\n");
> > > >  	sc = (struct fwohci_softc *)fc;
> > > >  
> > > > @@ -2487,6 +2495,7 @@
> > > >  	fun |= FW_PHY_ISBR | FW_PHY_RHB;
> > > >  	fun = fwphy_wrdata(sc, FW_PHY_ISBR_REG, fun);
> > > >  #endif
> > > > +	FW_GUNLOCK(fc);
> > > >  }
> > > 
> > > Does the lock need to protect the printf?
> > 
> > These are gross, overpowered, way to heavy handed locks that I'm playing
> > with.  I need to prevent pre-emption of certain events while they are in
> > progress.  One of these events is the firewire's assertion of "bus
> > reset" on the firewire device.  I see the h/w interrupt firing before
> > this code can actually complete, causing the driver to be confused on
> > occasion.
> 
> I understand the basic concept of locking, or at least I did many many years
> ago when spls really were levels.  Since then they changed spls to not be levels
> (at least that's what the man page says) and later replaced with mutex.  I assume
> these changes are at least in part to better support SMP.
> 
> What I don't understand is things like getting a lock and immediately releasing
> it, which appears to me to protect nothing.  Or why the printf needs to be inside
> the locked section of code.  I thought the goal was to hold a lock for as short a
> time as possible, and it is not clear to me why the printf needs to be protected.

The printf DOES NOT need protection at all.  You are absolutely correct.

The lock should be moved lower in the function for sure.  This is just
my attempt to see if I am looking at the same problem you are reporting.
Conceptually, I am trying to keep the h/w interrupt which fires during a
bus reset from preempting this function and executing before this
function is finished.

The acquisition of the lock and release keeps the interrupt handler from
executing before the setting of bus reset is finished.  I hope to fine
tune the locks after I have confirmed operational goodness.

Sean
Comment 25 Dieter 2009-01-12 15:04:29 UTC
Sean,

	I tried your patch, but bad news: a bus reset still
tromps on Ethernet.
Comment 26 Gavin Atkinson freebsd_committer freebsd_triage 2009-02-02 21:11:18 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-firewire

Over to maintainer(s).  This looks like it may be some bad 
interaction between the firewire stack and using a serial 
console.  To submitter: It may be worth while switching to 
uart(4) rather than sio(4) for your serial ports and seeing 
if that makes any difference (as I don't think uart(4) uses 
the Giant lock).
Comment 27 Dieter 2009-03-04 21:37:31 UTC
In message <200902022113.n12LDfEI059994@freefall.freebsd.org>, gavin@freebsd.org writes:

> This looks like it may be some bad
> interaction between the firewire stack and using a serial
> console.  To submitter: It may be worth while switching to
> uart(4) rather than sio(4) for your serial ports and seeing
> if that makes any difference (as I don't think uart(4) uses
> the Giant lock).

The box is running 7.1 now.  I commented out sio in the config file
and built a new kernel.

normal kernel:

 freebsd # dmesg | grep ^sio
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
sio0: [FILTER]
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
 freebsd # dmesg | grep -i ^uart
 freebsd # dmesg | grep -i 1655 
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
 freebsd # dmesg | grep -i console
sio0: type 16550A, console
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
 freebsd # dmesg | grep -i giant
ohci0: [GIANT-LOCKED]
ehci0: [GIANT-LOCKED]
ohci1: [GIANT-LOCKED]
ohci2: [GIANT-LOCKED]
ehci1: [GIANT-LOCKED]
ppc0: [GIANT-LOCKED]
atkbd0: [GIANT-LOCKED]

With sio commented out:

OK boot kernel_uart
/boot/kernel_uart/kernel text=0x3f8090 data=0x5b678+0x38cc8 syms=[0x8+0x6ee08+0x
8+0x609a9]
/

And nothing else appears on the console.


 freebsd # dmesg | grep -i ^sio
 freebsd # dmesg | grep -i ^uart
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
uart0: [FILTER]
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
uart1: [FILTER]
 freebsd # dmesg | grep -i 1655
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
 freebsd # dmesg | grep -i console
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>

echo hello >> /dev/console
doesn't show up.  Nothing shows up on console.

And looking at files in /var/log I found:

getty[789]: open /dev/ttyd0: No such file or directory
getty[790]: open /dev/ttyd1: No such file or directory


Looks to me like commenting out sio doesn't work so well on my box.
Is that the wrong way to switch from sio to uart?
Is there something else I need to change instead, or in addition?
Comment 28 Gavin Atkinson freebsd_committer freebsd_triage 2009-03-06 15:53:43 UTC
On Wed, 2009-03-04 at 21:37 +0000, Dieter wrote:
> In message <200902022113.n12LDfEI059994@freefall.freebsd.org>, gavin@freebsd.org writes:
> 
> > This looks like it may be some bad
> > interaction between the firewire stack and using a serial
> > console.  To submitter: It may be worth while switching to
> > uart(4) rather than sio(4) for your serial ports and seeing
> > if that makes any difference (as I don't think uart(4) uses
> > the Giant lock).
> 
> The box is running 7.1 now.  I commented out sio in the config file
> and built a new kernel.
> 
> normal kernel:
> 
>  freebsd # dmesg | grep ^sio
> sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
> sio0: type 16550A, console
> sio0: [FILTER]
> sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
> sio1: type 16550A
> sio1: [FILTER]
>  freebsd # dmesg | grep -i ^uart
>  freebsd # dmesg | grep -i 1655 
> sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
> sio0: type 16550A, console
> sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
> sio1: type 16550A
>  freebsd # dmesg | grep -i console
> sio0: type 16550A, console
> sc0: <System console> at flags 0x100 on isa0
> sc0: VGA <16 virtual consoles, flags=0x100>
>  freebsd # dmesg | grep -i giant
> ohci0: [GIANT-LOCKED]
> ehci0: [GIANT-LOCKED]
> ohci1: [GIANT-LOCKED]
> ohci2: [GIANT-LOCKED]
> ehci1: [GIANT-LOCKED]
> ppc0: [GIANT-LOCKED]
> atkbd0: [GIANT-LOCKED]
> 
> With sio commented out:
> 
> OK boot kernel_uart
> /boot/kernel_uart/kernel text=0x3f8090 data=0x5b678+0x38cc8 syms=[0x8+0x6ee08+0x
> 8+0x609a9]
> /
> 
> And nothing else appears on the console.
> 
> 
>  freebsd # dmesg | grep -i ^sio
>  freebsd # dmesg | grep -i ^uart
> uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
> uart0: [FILTER]
> uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
> uart1: [FILTER]
>  freebsd # dmesg | grep -i 1655
> uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
> uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
>  freebsd # dmesg | grep -i console
> sc0: <System console> at flags 0x100 on isa0
> sc0: VGA <16 virtual consoles, flags=0x300>
> 
> echo hello >> /dev/console
> doesn't show up.  Nothing shows up on console.
> 
> And looking at files in /var/log I found:
> 
> getty[789]: open /dev/ttyd0: No such file or directory
> getty[790]: open /dev/ttyd1: No such file or directory
> 
> 
> Looks to me like commenting out sio doesn't work so well on my box.
> Is that the wrong way to switch from sio to uart?
> Is there something else I need to change instead, or in addition?

You'll also need to add the uart hints to your device.hints file, if you
haven't already, and update /etc/ttys.

hint.uart.0.at="isa"
hint.uart.0.port="0x3F8"
hint.uart.0.flags="0x10"
hint.uart.0.irq="4"

/etc/ttys: you may find you need to change "ttyd0" to "ttyu0".

Gavin
Comment 29 Dieter 2009-05-10 11:20:29 UTC
>  > > This looks like it may be some bad
>  > > interaction between the firewire stack and using a serial
>  > > console.  To submitter: It may be worth while switching to
>  > > uart(4) rather than sio(4) for your serial ports and seeing
>  > > if that makes any difference (as I don't think uart(4) uses
>  > > the Giant lock).

>  > Looks to me like commenting out sio doesn't work so well on my box.
>  > Is that the wrong way to switch from sio to uart?
>  > Is there something else I need to change instead, or in addition?
>  
>  You'll also need to add the uart hints to your device.hints file, if you
>  haven't already, and update /etc/ttys.
>  
>  hint.uart.0.at="isa"
>  hint.uart.0.port="0x3F8"
>  hint.uart.0.flags="0x10"
>  hint.uart.0.irq="4"
>  
>  /etc/ttys: you may find you need to change "ttyd0" to "ttyu0".
>  
>  Gavin

Thanks, with those changes it works with uart now.

The bad news is that switching from sio to uart didn't fix the
CPU hogging problem.
Comment 30 Dieter 2010-01-26 18:56:30 UTC
> Looks like the kernel printf takes a *lot* of CPU, and
> somehow locks out Ethernet.

Looking through man pages working on an unrelated problem, I found:

man timed says:

     Messages printed by the kernel on the system console occur with inter-
     rupts disabled.  This means that the clock stops while they are printing.
     A machine with many disk or network hardware problems and consequent mes-
     sages cannot keep good time by itself.  Each message typically causes the
     clock to lose a dozen milliseconds.

If a kernel printf disables Ethernet interrupts for 12 milliseconds,
that would create the problem.
Comment 31 dieterbsd 2011-02-03 21:53:32 UTC
New data: USB causes the same problem that firewire does.

FreeBSD 8.0
amd64
nfe0: <NVIDIA nForce4 CK804 MCP9 Networking Adapter> port 0xb400-0xb407=20
mem 0xfebf9000-0xfebf9fff irq 23 at device 10.0 on pci0
bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev.=20
0x4101> mem 0xfe4f0000-0xfe4fffff irq 19 at device 0.0 on pci5
fwohci1: <NEC uPD72871/2> mem 0xfdeff000-0xfdefffff irq 19 at device=20
8.0 on pci2
ehci0: <NVIDIA nForce4 USB 2.0 controller> mem 0xfebfe000-0xfebfe0ff=20
irq 22 at device 2.1 on pci0

Plugging in a USB_to_RS-232 bridge generates:

ugen0.2: <Prolific Technology Inc.> at usbus0
uplcom0: <Prolific Technology Inc. USB-Serial Controller D, class 0/0,=20
rev 1.10/
4.00, addr 2> on usbus0

And networking is locked out for too long, and data is lost.  (The node=20
sending data
via Ethernet has too small a transmit buffer, but is a closed source=20
closed hardware
black box and cannot be fixed.  Once data is lost it is lost forever,=20
it cannot be
recreated, so this is very bad.)  The FreeBSD box receiving data has a=20
very large
receive buffer, but if another device driver locks out networking it=20
doesn't help.

Changing the printf(9)s to log(9)s fixes the problem, but this is not a=20
good workaround,
there are LOTS of printfs in the kernel, I keep hitting new ones. =20
Printf(9) alone isn't
the problem, adding printfs to chown(2) does not cause the problem, but=20
printfs from
device drivers do. (usb, firewire, ...)

My theory is there is a locking conflict.

Here is lock profiling data for USB_RS-232 bridge being plugged in=20
printfs locking out networking:

debug.lock.prof.enable: 0
debug.lock.prof.reset: 0
debug.lock.prof.stats:
      max  wait_max       total  wait_total       count    avg wait_avg=20
cnt_hold cnt_lock name
   251240         0      499982           0         303   1650      0  0=
=20
      0 /usr/src/sys/dev/usb/usb_request.c:322 (sx:0123456789ABCDEF -=20
USB device SX lock)
   104255         0      110429           0           2  55214      0  0=
=20
      0 /usr/src/sys/dev/usb/usb_device.c:2459 (sx:123456789ABCDEF - USB=
=20
config SX lock)

   104217         0    29670949           0       47087    630      0  0=
=20
      0 /usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx)  =20
<---------<<<<

   101962         0      103089           0          27   3818      0  0=
=20
     0 /usr/src/sys/kern/kern_condvar.c:145 (sleep mutex:Giant)
    51071         0       81088           0           3  27029      0  0=
=20
     0 /usr/src/sys/kern/kern_cons.c:422 (spin mutex:cnputs_mtx)
    23049         0       31351           0           2  15675      0  0=
=20
     0 /usr/src/sys/kern/vfs_syscalls.c:3508 (lockmgr:ufs)
    12408         0      978544           0         971   1007      0  0=
=20
     0 /usr/src/sys/kern/vfs_vnops.c:604 (lockmgr:ufs)
    11466         0      744372           0       11567     64      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:2559 (lockmgr:bufwait)
    11141         0      501155           0        3850    130      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:1835 (lockmgr:bufwait)
     6115         0       61720           0          26   2373      0  0=
=20
      0 /usr/src/sys/dev/usb/usb_request.c:322 (sx:123456789ABCDEF - USB=
=20
device SX lock)
     5551         0       16749           0          19    881      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:syncer)
     3043         0        3749           0           6    624      0  0=
=20
      0 /usr/src/sys/ufs/ffs/ffs_vfsops.c:1321 (sleep mutex:struct mount=
=20
mtx)
     2333         0        9901           0          36    275      0  0=
=20
     0 /usr/src/sys/kern/vfs_mount.c:2231 (sleep mutex:struct mount mtx)
     2055         0        9462           0         100     94      0  0=
=20
      0 /usr/src/sys/kern/subr_hints.c:117 (sleep mutex:kernel=20
environment)
     1649         0       79272           0        3846     20      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:2963 (sleep mutex:vm object)
     1636         0       36196           0       15515      2      0  0=
=20
      0 /usr/src/sys/vm/vm_page.c:1052 (sleep mutex:vm page queue free=20
mutex)
     1205         0       55601           0       15411      3      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:2545 (sleep mutex:bufobj interlock)
     1076         0        1076           0           1   1076      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:ufs)
     1069         0       31117           0          30   1037      0  0=
=20
     0 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:nfe0)
      848         0        3752           0        1359      2      0  0=
=20
      0 /usr/src/sys/ufs/ffs/ffs_vfsops.c:1297 (sleep mutex:vnode=20
interlock)
      754         0       10016           0        3846      2      0  0=
=20
      0 /usr/src/sys/ufs/ffs/ffs_softdep.c:4274 (sleep mutex:Softdep=20
Lock)
      739         0        2463           0           4    615      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:devfs)
      540         0        3370           0        1365      2      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:3179 (sleep mutex:vnode interlock)
      525         0       80852           0         160    505      0  0=
=20
     0 /usr/src/sys/dev/uart/uart_cpu.h:92 (spin mutex:uart_hwmtx)
      512         0        2065           0          23     89      0  0=
=20
     0 /usr/src/sys/kern/vfs_default.c:599 (lockmgr:bufwait)
      487         0       97058           0         481    201      0  0=
=20
     0 /usr/src/sys/vm/vm_pager.c:311 (lockmgr:bufwait)
      467         0        2069           0          12    172      0  0=
=20
     0 /usr/src/sys/dev/usb/usb_busdma.c:547 (sleep mutex:uplcom)
      455         0        2978           0          18    165      0  0=
=20
     0 /usr/src/sys/dev/usb/usb_transfer.c:286 (sleep mutex:uplcom)
      423         0         727           0           2    363      0  0=
=20
     0 /usr/src/sys/vm/uma_core.c:1565 (sleep mutex:UMA lock)
      367         0      230166           0       47087      4      0  0=
=20
     0 /usr/src/sys/netinet/tcp_usrreq.c:708 (rw:tcpinp)
      358         0        1387           0          11    126      0  0=
=20
     0 /usr/src/sys/kern/kern_synch.c:241 (sleep mutex:Giant)
      339         0       72085           0       46119      1      0  0=
=20
     0 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:so_rcv)
      332         0      225434           0       93206      2      0  0=
=20
     0 /usr/src/sys/kern/uipc_socket.c:1441 (sleep mutex:so_rcv)
      332         0       19529           0       11551      1      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1927 (sleep mutex:bufobj interlock)
      331         0      162523           0       94887      1      0  0=
=20
     0 /usr/src/sys/netinet/tcp_output.c:280 (sleep mutex:so_snd)
      330         0       73512           0       47087      1      0  0=
=20
     0 /usr/src/sys/kern/uipc_socket.c:1843 (sleep mutex:so_rcv)
      329         0       88069           0       47771      1      0  0=
=20
     0 /usr/src/sys/kern/uipc_socket.c:1687 (sleep mutex:so_rcv)
      329         0         348           0          12     29      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1955 (sleep mutex:Syncer mtx)
      320         0       66185           0        3353     19      0  0=
=20
     0 /usr/src/sys/kern/vfs_cluster.c:868 (lockmgr:bufwait)
      303         0        7410           0        4360      1      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:3944 (sleep mutex:bufobj interlock)
debug.lock.prof.rejected: 0
debug.lock.prof.skipcount: 0
debug.lock.prof.skipspin: 0

Here is lock profiling data for firewire bus reset printfs locking out=20
Ethernet:

      max  wait_max       total  wait_total       count    avg wait_avg=20
cnt_hold cnt_lock name
   373345         0      397495           0         643    618      0  0=
=20
     0 /usr/src/sys/kern/vfs_lookup.c:497 (lockmgr:ufs)
   373325         0      390815           0        3140    124      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:2218 (sleep mutex:vnode interlock)

   359514         0    32199808           0       45582    706      0  0=
=20
      0 /usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx)  =20
<---------<<<<

    63827       320      746196         740       70067     10      0  0=
=20
     4 /usr/src/sys/dev/firewire/fwohci.c:2107 (sleep mutex:firewire)
    49927         0      165929           0           8  20741      0  0=
=20
     0 /usr/src/sys/kern/vfs_syscalls.c:3508 (lockmgr:ufs)
    33358         0     1707461           0        1106   1543      0  0=
=20
     0 /usr/src/sys/kern/vfs_vnops.c:604 (lockmgr:ufs)
    30294         0      401224           0        2641    151      0  0=
=20
     0 /usr/src/sys/vm/vm_map.c:3526 (sx:user map)
    28432         0      934681           0          74  12630      0  0=
=20
     0 /usr/src/sys/kern/kern_cons.c:422 (spin mutex:cnputs_mtx)
    28296         0       66892           0        2641     25      0  0=
=20
     0 /usr/src/sys/vm/vm_fault.c:297 (sleep mutex:vm object)
    25020         0       26667           0          45    592      0  0=
=20
     0 /usr/src/sys/kern/kern_exit.c:825 (sx:proctree)
    19027         0     1225496           0       11937    102      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:2559 (lockmgr:bufwait)
    18887         0      906297           0        4140    218      0  0=
=20
     0 /usr/src/sys/kern/vfs_bio.c:1835 (lockmgr:bufwait)
    12101         0       23459           0        1675     14      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:2342 (sleep mutex:vnode interlock)
     7448         0      170841           0        2229     76      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:2083 (lockmgr:ufs)
     7342         0       21792           0        2169     10      0  0=
=20
     0 /usr/src/sys/kern/vfs_cache.c:390 (rw:Name Cache)
     4648         0       10831           0          11    984      0  0=
=20
     0 /usr/src/sys/kern/vfs_subr.c:1693 (lockmgr:syncer)
     3536         0       35352           0        9218      3      0  0=
=20
     0 /usr/src/sys/amd64/amd64/pmap.c:3989 (sleep mutex:pmap)
     3039         0       34283           0        2641     12      0  0=
=20
     0 /usr/src/sys/vm/vm_fault.c:937 (sleep mutex:vm object)
     2825         0       53949           0        3495     15      0  0=
=20
     0 /usr/src/sys/vm/vm_fault.c:1010 (sleep mutex:vm object)
     2286         0        7169           0         547     13      0  0=
=20
     0 /usr/src/sys/kern/kern_sig.c:983 (sleep mutex:process lock)
     2083      4010        9037        5380        2143      4      2  0=
=20
     2 /usr/src/sys/kern/vfs_subr.c:2118 (sleep mutex:vnode interlock)
     2083        57        2089          57           4    522     14  0=
=20
     1 /usr/src/sys/kern/kern_mutex.c:147 (sleep mutex:pipe mutex)
     2022         0      181449           0         515    352      0  0=
=20
     0 /usr/src/sys/vm/vm_pager.c:311 (lockmgr:bufwait)
     1997         0        3397           0           2   1698      0  0=
=20
     0 /usr/src/sys/kern/kern_sysctl.c:1513 (sx:sysctl mem)
     1994         0        4890           0         117     41      0  0=
=20
     0 /usr/src/sys/kern/kern_sysctl.c:1521 (sx:sysctl lock)
     1971         0        4156           0          13    319      0  0=
=20
     0 /usr/src/sys/kern/kern_sysctl.c:1417 (sleep mutex:Giant)
     1965         0       31838           0        3471      9      0  0=
=20
     0 /usr/src/sys/vm/vm_object.c:482 (sleep mutex:vm object)
     1926         0       17695           0          83    213      0  0=
=20
     0 /usr/src/sys/vm/vm_object.c:541 (sleep mutex:vm object)


There is at least one line in common that looks networking related,=20
/usr/src/sys/kern/uipc_sockbuf.c:148 (sx:so_rcv_sx)

int
sblock(struct sockbuf *sb, int flags)
{

        KASSERT((flags & SBL_VALID) =3D=3D flags,
            ("sblock: flags invalid (0x%x)", flags));

        if (flags & SBL_WAIT) {
                if ((sb->sb_flags & SB_NOINTR) ||
                    (flags & SBL_NOINTR)) {
                        sx_xlock(&sb->sb_sx);
                        return (0);
                }
                return (sx_xlock_sig(&sb->sb_sx));
        } else {
                if (sx_try_xlock(&sb->sb_sx) =3D=3D 0)
                        return (EWOULDBLOCK);
                return (0);
        }
}

It is not clear to me what this code is doing. (comments might help)
It is not clear to me why networking and a USB_to_RS-232 bridge would
have a lock conflict. It is not clear to me why networking and firewire
would have a lock conflict.

How do we fix this?
Comment 32 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:59:23 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped