Created attachment 209272 [details] Serial console output from first occurence and verification. Hello, after updating from base r354795 to base r354850 I get sdhci-bcm0-slot controller timeouts while doing zsf send on a RPi 3 B+. The timeouts do not happen when a r354795 or older kernel is used. I think this shows up because of the changes which happend to bcm2835_sdhci between r354795 and r354850. Ralf
Hi, Take- almost certainly my fault. Can you please try updating past r354876 and let me know if it persists? Thanks!
Also, if you scroll up a little bit when you hit this, do you see any "DMA in use" messages coming from it?
(In reply to Kyle Evans from comment #2) > Also, if you scroll up a little bit when you hit this, do you see any "DMA in > use" messages coming from it? In the console log from the second test (r354850), which is complete from the start of "zsf send" up to the reboot, there is no such message. I will report what happend when past r354876 release got tested.
(In reply to iz-rpi03 from comment #3) Hmm... depending on how things work out, you could be hitting the same issue that I ran into in testing earlier versions (before any of the named revisions) of the driver on the RPi4, https://reviews.freebsd.org/D22430 -- namely, sdhci interrupt handler gets invoked to deal with some other interrupt and acks the data interrupt meant for us and we stall. The board I hit that on died, though, so I've not been able to work out what interrupt could've been coming in that caused it.
(In reply to Kyle Evans from comment #1 and comment #2) With a r354850 userland and a r354886 kernel the results are - different. I repated the "zfs send" three times, each with a different result. 1st try: Fresh booted system. "zfs send" results in a kernel panic. 2nd try: Fresh booted system. "zfs send" finishes successfully without any message. 3rd try: Doing "zfs send" again. Got controller timeouts; system stays responsible; zpool got degraded because of too many errors on mmcsd0 There are no "DMA in use" messages. I will attach the panic backtrace and those controller messages separately.
(In reply to iz-rpi03 from comment #5) Can you also please try the patch @ https://reviews.freebsd.org/D22430.diff to make sure that doesn't alleviate the condition before I back out the candidate commits?
Created attachment 209289 [details] Panic and backtrace from first try at 2019-11-20
Created attachment 209290 [details] Controller timeout messages from 3rd try at 2019-11-20
(In reply to Kyle Evans from comment #6) > Can you also please try the patch @ https://reviews.freebsd.org/D22430.diff Sure. r354886 kernel with D22430.diff applied: ... 23:20:15 164M rootpool/ROOT/default@2019-11-19-13:53:43-0 sdhci_bcm0-slot0: Controller timeout sdhci_bcm0-slot0: ============== REGISTER DUMP ============== sdhci_bcm0-slot0: Sys addr: 0x014ec200 | Version: 0x00009902 sdhci_bcm0-slot0: Blk size: 0x00000200 | Blk cnt: 0x0000001f sdhci_bcm0-slot0: Argument: 0x00e1cde0 | Trn mode: 0x00000026 sdhci_bcm0-slot0: Present: 0x01ff0506 | Host ctl: 0x00000003 sdhci_bcm0-slot0: Power: 0x0000000f | Blk gap: 0x00000000 sdhci_bcm0-slot0: Wake-up: 0x00000000 | Clock: 0x00000207 sdhci_bcm0-slot0: Timeout: 0x0000000e | Int stat: 0x00000000 sdhci_bcm0-slot0: Int enab: 0x01ff00bb | Sig enab: 0x01ff00bb sdhci_bcm0-slot0: AC12 err: 0x00000000 | Host ctl2:0x00000000 sdhci_bcm0-slot0: Caps: 0x00000000 | Caps2: 0x00000000 sdhci_bcm0-slot0: Max curr: 0x00000001 | ADMA err: 0x00000000 sdhci_bcm0-slot0: ADMA addr:0x00000000 | Slot int: 0x00000000 sdhci_bcm0-slot0: =========================================== mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout 164M rootpool/ROOT/default@2019-11-19-13:53:43-0 23:20:17 166M rootpool/ROOT/default@2019-11-19-13:53:43-0 ... 23:20:24 185M rootpool/ROOT/default@2019-11-19-13:53:43-0 mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout mmcsd0: Error indicated: 1 Timeout 23:20:25 188M rootpool/ROOT/default@2019-11-19-13:53:43-0 23:20:26 191M rootpool/ROOT/default@2019-11-19-13:53:43-0 From here on the kernel does not respond any more. Neither to USB nor serial console. I have to toggle the power. I support your idea from comment #4, that there is something unexpected happening during interrrupt handling. Especially during heavy activity on USB. Which includes network traffic on the Raspberry PIs of course. I suspect those "smsc0: warning: Failed to read register 0x114" and "smsc0: warning: MII is busy" messages, from e.g. bug #240201, have the same cause as the timeouts here. I extended the smsc_warn_printf messages to give more information than "MII is busy" when it happend, but still have no idea about the cause. Meanwhile it happend extremely rare, but always together with heavy network and storage activity and on one data path. I can not actually remember if it was smsc_miibus_read- or -writereg.
Created attachment 209302 [details] git(1) diff against base/ This patch backs out r354823, which is the main change that happened in the range you posted (and combined with a similar report to -current@, which is also just after this. It's build/boot tested (but I can't stress test at the moment) and may still need to be combined with the diff from reviews.f.o for good measure.
(In reply to Kyle Evans from comment #10) Just finished three "zfs send" in sequence with world and kernel at r254935 (the D22430.diff was undone as well). No controller timeouts or panics happend. Everything looks fine now. Thank you. Please tell me if I should test anything else.
(In reply to iz-rpi03 from comment #11) Nope, I think we're good! Sorry for the hassle! Closing as "FIXED" with no further action required since the breakage won't get MFC'd -- do feel free to re-open as needed.