Bug 242101 - sdhci-bcm0-slot0 controller timeouts with r354850
Summary: sdhci-bcm0-slot0 controller timeouts with r354850
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: arm (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Only Me
Assignee: Kyle Evans
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-20 09:49 UTC by rz-rpi03
Modified: 2019-11-21 15:29 UTC (History)
2 users (show)

See Also:


Attachments
Serial console output from first occurence and verification. (8.83 KB, text/plain)
2019-11-20 09:49 UTC, rz-rpi03
no flags Details
Panic and backtrace from first try at 2019-11-20 (3.72 KB, text/plain)
2019-11-20 20:55 UTC, rz-rpi03
no flags Details
Controller timeout messages from 3rd try at 2019-11-20 (10.07 KB, text/plain)
2019-11-20 20:56 UTC, rz-rpi03
no flags Details
git(1) diff against base/ (2.10 KB, patch)
2019-11-21 01:18 UTC, Kyle Evans
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description rz-rpi03 2019-11-20 09:49:05 UTC
Created attachment 209272 [details]
Serial console output from first occurence and verification.

Hello,

after updating from base r354795 to base r354850 I get sdhci-bcm0-slot
controller timeouts while doing zsf send on a RPi 3 B+.
The timeouts do not happen when a r354795 or older kernel is used.

I think this shows up because of the changes which happend to bcm2835_sdhci
between r354795 and r354850.

Ralf
Comment 1 Kyle Evans freebsd_committer freebsd_triage 2019-11-20 12:34:35 UTC
Hi,

Take- almost certainly my fault. Can you please try updating past r354876 and let me know if it persists?

Thanks!
Comment 2 Kyle Evans freebsd_committer freebsd_triage 2019-11-20 12:38:41 UTC
Also, if you scroll up a little bit when you hit this, do you see any "DMA in use" messages coming from it?
Comment 3 rz-rpi03 2019-11-20 13:17:23 UTC
(In reply to Kyle Evans from comment #2)
> Also, if you scroll up a little bit when you hit this, do you see any "DMA in
> use" messages coming from it?

In the console log from the second test (r354850), which is complete from the
start of "zsf send" up to the reboot, there is no such message.
I will report what happend when past r354876 release got tested.
Comment 4 Kyle Evans freebsd_committer freebsd_triage 2019-11-20 14:14:21 UTC
(In reply to iz-rpi03 from comment #3)

Hmm... depending on how things work out, you could be hitting the same issue that I ran into in testing earlier versions (before any of the named revisions) of the driver on the RPi4, https://reviews.freebsd.org/D22430 -- namely, sdhci interrupt handler gets invoked to deal with some other interrupt and acks the data interrupt meant for us and we stall. The board I hit that on died, though, so I've not been able to work out what interrupt could've been coming in that caused it.
Comment 5 rz-rpi03 2019-11-20 20:43:51 UTC
(In reply to Kyle Evans from comment #1 and comment #2)

With a r354850 userland and a r354886 kernel the results are - different.
I repated the "zfs send" three times, each with a different result.

1st try:
  Fresh booted system. "zfs send" results in a kernel panic.

2nd try:
  Fresh booted system. "zfs send" finishes successfully without any message.

3rd try:
  Doing "zfs send" again. Got controller timeouts; system stays responsible;
  zpool got degraded because of too many errors on mmcsd0
  There are no "DMA in use" messages.

I will attach the panic backtrace and those controller messages separately.
Comment 6 Kyle Evans freebsd_committer freebsd_triage 2019-11-20 20:49:09 UTC
(In reply to iz-rpi03 from comment #5)

Can you also please try the patch @ https://reviews.freebsd.org/D22430.diff to make sure that doesn't alleviate the condition before I back out the candidate commits?
Comment 7 rz-rpi03 2019-11-20 20:55:28 UTC
Created attachment 209289 [details]
Panic and backtrace from first try at 2019-11-20
Comment 8 rz-rpi03 2019-11-20 20:56:34 UTC
Created attachment 209290 [details]
Controller timeout messages from 3rd try at 2019-11-20
Comment 9 rz-rpi03 2019-11-20 23:04:29 UTC
(In reply to Kyle Evans from comment #6)
> Can you also please try the patch @ https://reviews.freebsd.org/D22430.diff

Sure. r354886 kernel with D22430.diff applied:

...
23:20:15    164M   rootpool/ROOT/default@2019-11-19-13:53:43-0
sdhci_bcm0-slot0: Controller timeout
sdhci_bcm0-slot0: ============== REGISTER DUMP ==============
sdhci_bcm0-slot0: Sys addr: 0x014ec200 | Version:  0x00009902
sdhci_bcm0-slot0: Blk size: 0x00000200 | Blk cnt:  0x0000001f
sdhci_bcm0-slot0: Argument: 0x00e1cde0 | Trn mode: 0x00000026
sdhci_bcm0-slot0: Present:  0x01ff0506 | Host ctl: 0x00000003
sdhci_bcm0-slot0: Power:    0x0000000f | Blk gap:  0x00000000
sdhci_bcm0-slot0: Wake-up:  0x00000000 | Clock:    0x00000207
sdhci_bcm0-slot0: Timeout:  0x0000000e | Int stat: 0x00000000
sdhci_bcm0-slot0: Int enab: 0x01ff00bb | Sig enab: 0x01ff00bb
sdhci_bcm0-slot0: AC12 err: 0x00000000 | Host ctl2:0x00000000
sdhci_bcm0-slot0: Caps:     0x00000000 | Caps2:    0x00000000
sdhci_bcm0-slot0: Max curr: 0x00000001 | ADMA err: 0x00000000
sdhci_bcm0-slot0: ADMA addr:0x00000000 | Slot int: 0x00000000
sdhci_bcm0-slot0: ===========================================
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
164M   rootpool/ROOT/default@2019-11-19-13:53:43-0
23:20:17    166M   rootpool/ROOT/default@2019-11-19-13:53:43-0
...
23:20:24    185M   rootpool/ROOT/default@2019-11-19-13:53:43-0
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
mmcsd0: Error indicated: 1 Timeout
23:20:25    188M   rootpool/ROOT/default@2019-11-19-13:53:43-0
23:20:26    191M   rootpool/ROOT/default@2019-11-19-13:53:43-0

From here on the kernel does not respond any more. Neither to USB nor
serial console. I have to toggle the power.

I support your idea from comment #4, that there is something unexpected
happening during interrrupt handling. Especially during heavy activity
on USB. Which includes network traffic on the Raspberry PIs of course.

I suspect those "smsc0: warning: Failed to read register 0x114" and
"smsc0: warning: MII is busy" messages, from e.g. bug #240201, have the
same cause as the timeouts here.
I extended the smsc_warn_printf messages to give more information
than "MII is busy" when it happend, but still have no idea about the cause.
Meanwhile it happend extremely rare, but always together with heavy network
and storage activity and on one data path. I can not actually remember
if it was smsc_miibus_read- or -writereg.
Comment 10 Kyle Evans freebsd_committer freebsd_triage 2019-11-21 01:18:52 UTC
Created attachment 209302 [details]
git(1) diff against base/

This patch backs out r354823, which is the main change that happened in the range you posted (and combined with a similar report to -current@, which is also just after this. It's build/boot tested (but I can't stress test at the moment) and may still need to be combined with the diff from reviews.f.o for good measure.
Comment 11 rz-rpi03 2019-11-21 15:24:22 UTC
(In reply to Kyle Evans from comment #10)
Just finished three "zfs send" in sequence with world and kernel at r254935
(the D22430.diff was undone as well).

No controller timeouts or panics happend. Everything looks fine now.
Thank you.

Please tell me if I should test anything else.
Comment 12 Kyle Evans freebsd_committer freebsd_triage 2019-11-21 15:29:28 UTC
(In reply to iz-rpi03 from comment #11)

Nope, I think we're good! Sorry for the hassle! Closing as "FIXED" with no further action required since the breakage won't get MFC'd -- do feel free to re-open as needed.