Bug 237261 - Boot stuck in endless ATAPI_IDENTIFY attempts
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: i386 Any
Importance: --- Affects Only Me
Assignee: freebsd-scsi (Nobody)
Reported: 2019-04-13 14:50 UTC by Mikhail Teterin
Modified: 2019-05-06 23:50 UTC

Attachments
dmesg of 8.2 booting on the same machine (7.36 KB, text/plain)
2019-04-13 14:50 UTC, Mikhail Teterin
Verbose dmesg.boot from 8.2 (27.33 KB, text/plain)
2019-04-20 07:38 UTC, Mikhail Teterin

Description Mikhail Teterin freebsd_committer freebsd_triage 2019-04-13 14:50:42 UTC
Created attachment 203644
dmesg of 8.2 booting on the same machine

This problem affects both 11.2 and 12.0 on my old laptop. The machine boots fine into 8.2 (dmesg attached).

I may be misreading the boot-messages, but it looks like it identifies 3 storage devices:

1. An SSD (ada0)
2. A CD/DVD (cd0)
3. Sony's "memory stick" reader -- with no media inserted

The boot reports:

GEOM: new disk cd0
GEOM: new disk ada0

and then goes into an infinite cycle of (retyped by hand):

(aprobe0:ata1:0:1:0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Retrying command
...
(aprobe0:ata1:0:1:0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Retries exhausted
...

It then resets ata1 and tries again... And again...
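
The prefix on the lines above can be picked apart mechanically. The following is a minimal sketch that assumes the usual CAM path layout (periph:sim:bus:target:lun) and that the first ACB byte is the ATA command register value; `parse_probe_line` is a hypothetical helper, not part of any FreeBSD tool. Under those assumptions, the failing command is addressed to target 1 on ata1, and 0xa1 is the ATA IDENTIFY PACKET DEVICE opcode.

```python
import re

# Assumed layout of the CAM path prefix: (periph:sim:bus:target:lun).
PATH_RE = re.compile(
    r"\(([a-z]+)(\d+):([a-z]+)(\d+):(\d+):(\d+):([0-9a-f]+)\)"
)
# The ACB dump is a run of space-separated hex bytes.
ACB_RE = re.compile(r"ACB:((?: [0-9a-f]{2})+)")


def parse_probe_line(line):
    """Split one retyped console line into its path fields (hypothetical helper)."""
    m = PATH_RE.match(line)
    if m is None:
        return None
    periph, punit, sim, sunit, bus, target, lun = m.groups()
    info = {
        "periph": periph + punit,
        "sim": sim + sunit,
        "bus": int(bus),
        "target": int(target),   # on a PATA channel: 0 = master, 1 = slave
        "lun": int(lun, 16),
    }
    acb = ACB_RE.search(line)
    if acb:
        # Assumption: the first dumped byte is the ATA command.
        info["command"] = int(acb.group(1).split()[0], 16)
    return info


line = "(aprobe0:ata1:0:1:0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00 00 00 00 00 00"
info = parse_probe_line(line)
print(info)
```

If that reading is right, every timeout in the loop is for the slave position on the channel, not for the CD/DVD drive itself.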

Because 8.2 continues to boot fine, I do not believe anything is wrong with the hardware.

Bug #202712 is similar, but there the hang is on SETFEATURES SET TRANSFER MODE -- my laptop can't get through the ATAPI_IDENTIFY...
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-04-13 15:01:23 UTC
Thank you for the report.  Can you please attach dmesg of 12.0 boot (ideally with -v), up to the point of the infinite cycle?
Comment 2 Mikhail Teterin freebsd_committer freebsd_triage 2019-04-13 16:22:25 UTC
(In reply to Conrad Meyer from comment #1)
> Can you please attach dmesg of 12.0 boot

Cannot -- it never finishes booting. I retyped the most relevant part by hand... I suppose I can take pictures of -- or video-record -- the boot and attach that.

21st century, eh?
Comment 3 Mikhail Teterin freebsd_committer freebsd_triage 2019-04-13 18:02:12 UTC
(In reply to Conrad Meyer from comment #1)
Ok, the video is almost 90 MB, so I don't want to upload it. But you can view it here:

https://oc.virtual-estates.net:8443/index.php/s/wmTpNXT9tjPDQyX
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2019-04-19 06:21:56 UTC
(In reply to Mikhail Teterin from comment #2)
I wonder how long you waited at most?
I suspect that after some quite long time the system would eventually continue to boot. The ATAPI_IDENTIFY timeout is quite large and there are a number of retries on a couple of levels.

The problem seems to be that, for some reason, the system "detects" a phantom ATAPI slave on the same channel as the CD-ROM device (devices=0x30000 -- this mask contains two devices on the channel).
Maybe the older code had a way to check that it is a phantom device or maybe it just failed the phantom much faster.

Could you please attach a verbose dmesg from FreeBSD 8 ?
Comment 5 Mikhail Teterin freebsd_committer freebsd_triage 2019-04-20 07:38:49 UTC
Created attachment 203818
Verbose dmesg.boot from 8.2

> I wonder how long you waited at most?

Left it trying overnight -- after about 12 hours it was still at it...

Verbose dmesg attached.
Comment 6 Andriy Gapon freebsd_committer freebsd_triage 2019-04-21 08:13:15 UTC
(In reply to Mikhail Teterin from comment #5)
Thank you!
So, we can see that the old stack also sees the phantom device, tries to identify it and fails.  But I guess that that happens rather quickly (?) and, certainly, there are no endless retries.  That seems to be the main difference between the old code and the new one.

I would try to draw the attention of CAM experts like Scott Long, Alexander Motin or Kenneth Merry to this bug.  I'll re-assign this bug to scsi@ as well.

Here are the relevant bits from the log:
ata1: <ATA channel 1> on atapci0
ata1: reset tp1 mask=03 ostat0=50 ostat1=00
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: reset tp2 stat0=00 stat1=00 devices=0x30000
ata1: Identifying devices: 00030000
ata1: New devices: 00030000
ata1: reiniting channel ..
ata1: reset tp1 mask=03 ostat0=00 ostat1=00
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: reset tp2 stat0=00 stat1=00 devices=0x30000
ata1: reinit done ..
unknown: FAILURE - ATAPI_IDENTIFY timed out LBA=0
ata1: reiniting channel ..
ata1: reset tp1 mask=03 ostat0=00 ostat1=00
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: reset tp2 stat0=00 stat1=00 devices=0x30000
ata1: reinit done ..
unknown: FAILURE - ATAPI_IDENTIFY timed out LBA=0

And then success for the real device:
ata1-master: pio=PIO4 wdma=WDMA2 udma=UDMA33 cable=40 wire
acd0: setting UDMA33
acd0: <UJDA755 DVD/CDRW/1.00> CDRW drive at ata1 as master
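
To make the devices=0x30000 reading concrete, here is a small sketch that decodes the mask and checks the signature bytes from the log. The bit values are assumed to match the legacy ata(4) driver's ATA_ATA_* / ATA_ATAPI_* flags and should be checked against the 8.x headers; `decode_devices` and `is_atapi_signature` are hypothetical helpers. The 0x14/0xeb pair is the standard ATAPI signature, and the log shows it for both stat0 and stat1, which would explain why a slave is "seen" even though only one drive is attached.

```python
# Bit values assumed from the legacy ata(4) driver; illustrative only.
ATA_ATA_MASTER = 0x00000001
ATA_ATA_SLAVE = 0x00000002
ATA_ATAPI_MASTER = 0x00010000
ATA_ATAPI_SLAVE = 0x00020000

FLAGS = [
    (ATA_ATA_MASTER, "ATA master"),
    (ATA_ATA_SLAVE, "ATA slave"),
    (ATA_ATAPI_MASTER, "ATAPI master"),
    (ATA_ATAPI_SLAVE, "ATAPI slave"),
]


def decode_devices(mask):
    """Return the names of the device flags set in a channel's devices mask."""
    return [name for bit, name in FLAGS if mask & bit]


def is_atapi_signature(lsb, msb):
    """True if the post-reset signature bytes are the ATAPI pair 0x14/0xeb."""
    return (lsb, msb) == (0x14, 0xEB)


devices = decode_devices(0x30000)
print(devices)
print(is_atapi_signature(0x14, 0xEB))
```

Under these assumptions the mask names both an ATAPI master (the real UJDA755) and an ATAPI slave (the phantom).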
Comment 7 Mikhail Teterin freebsd_committer freebsd_triage 2019-05-06 02:26:54 UTC
(In reply to Andriy Gapon from comment #6)
Ok, so why does it never give up with the new code? It reports "Retries exhausted", but comes right back to the same exhaustion again and again...
Comment 8 Alexander Motin freebsd_committer freebsd_triage 2019-05-06 21:11:50 UTC
I suppose it is not really a command retry, but a restart of the probe process, triggered by an ATA bus reset, which is itself triggered by the ATAPI_IDENTIFY command timeout for the phantom CDROM.  That should probably be worked around, but honestly I have no big wish to workaround PATA hardware issue in year 2019.
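
The cycle described above can be modelled in a few lines. This is a toy simulation, not the CAM code: the retry count of 4, the `restart_limit` parameter and the `give_up_after` cutoff are all hypothetical, and the model simply assumes that every ATAPI_IDENTIFY to the phantom device times out and that each exhausted retry budget causes a bus reset that restarts the probe with a fresh budget.

```python
def identify_attempts(retries_per_probe, restart_limit=None, give_up_after=10_000):
    """Count IDENTIFY attempts against a device that always times out.

    Returns the number of attempts made before the probe gives up, or None
    if the simulation cutoff is reached first (a stand-in for "never").
    """
    attempts = 0
    restarts = 0
    while attempts < give_up_after:
        # First try plus all retries; every one of them times out.
        attempts += retries_per_probe + 1
        # "Retries exhausted": the timeout leads to a bus reset, and the
        # reset restarts the probe with a fresh per-command retry counter.
        if restart_limit is not None and restarts >= restart_limit:
            return attempts  # bounded variant: declare the device absent
        restarts += 1
    return None


unbounded = identify_attempts(4)
bounded = identify_attempts(4, restart_limit=2)
print(unbounded, bounded)
```

In this model the per-command retry limit never ends the loop on its own; only a separate limit on probe restarts does, which is one possible shape for the workaround.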
Comment 9 Mikhail T. 2019-05-06 23:50:21 UTC
PATA or not, an endless loop in device detection like this is a bug in its own right, isn't it? There has got to be a limit on the number of iterations. What if it were a real device - with a broken controller?

> I have no big wish to workaround PATA hardware issue in year 2019.

A small wish, maybe? Things got broken in 9.0, it seems - much earlier than 2019...