Bug 264016 - kernel crashes on boot: panic: aprobedone: invalid action state 0xfd
Summary: kernel crashes on boot: panic: aprobedone: invalid action state 0xfd
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.0-STABLE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Alexander Motin
URL:
Keywords: crash, needs-qa
Depends on:
Blocks: 14.0r
  Show dependency treegraph
 
Reported: 2022-05-16 12:38 UTC by Sergei Masharov
Modified: 2023-12-12 12:36 UTC (History)
3 users (show)

See Also:


Attachments
console screenshot (378.87 KB, image/png)
2022-05-18 11:13 UTC, Sergei Masharov
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sergei Masharov 2022-05-16 12:38:04 UTC
new kernel from 13.1-RELEASE crashes on boot, the last message on console is:
panic: aprobedone: invalid action state 0xfd

I think it is related to sata controller, 13.0-RELEASE-p11 works fine
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-16 23:45:40 UTC
@Sergei 

- From what version was the 13.1 upgrade?
- What was the upgrade method?
- Can you boot verbose and attach a screenshot
Comment 2 Sergei Masharov 2022-05-18 11:13:55 UTC
Created attachment 234018 [details]
console screenshot
Comment 3 Sergei Masharov 2022-05-18 11:16:38 UTC
upgrade was done from 13.0-RELEASE via "freebsd-update -r 13.1-RELEASE upgrade"

problematic controller is Asmedia 106x

I was able to catch screenshot only by 120fps video recording, the system is reset despite of verbose settings and kern.panic_reboot_wait_time=-1
 in /boot.loader.conf
Comment 4 Sergei Masharov 2022-05-20 14:01:38 UTC
I have disconnected all disks from problematic SATA controllers and system was booted without any problems.

After that I have connected one disk and system also boot, but with several errors in the console and it took couple minutes, several seconds between each "Root mount waiting for: CAM" message

the disk on the problematic controller is not working, I do not see it at the dmesg at all. "camcontrol reset" doesn't help.


Trying to mount root from zfs:mc []...
uhub0: 4 ports with 4 removable, self powered
(aprobe3:ahcich11:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe3:ahcich11:0:0:0): CAM status: Uncorrectable parity/CRC error
(aprobe3:ahcich11:0:0:0): Error 5, Retry was blocked
uhub1: 8 ports with 8 removable, self powered
(aprobe3:ahcich11:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe3:ahcich11:0:0:0): CAM status: Uncorrectable parity/CRC error
(aprobe3:ahcich11:0:0:0): Error 5, Retry was blocked
ugen2.2: <vendor 0x0409 product 0x005a> at usbus2
uhub3 on uhub1
uhub3: <vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 1> on usbus2
uhub2: 22 ports with 22 removable, self powered
uhub3: 2 ports with 0 removable, self powered
Root mount waiting for: usbus0 CAM usbus2
ugen2.3: <AVerMedia A805> at usbus2
ukbd0 on uhub3
ukbd0: <AVerMedia A805, class 0/0, rev 2.00/2.00, addr 2> on usbus2
kbd2 at ukbd0
ugen0.2: <Logitech USB Receiver> at usbus0
ukbd1 on uhub2
ukbd1: <Logitech USB Receiver, class 0/0, rev 2.00/12.01, addr 1> on usbus0
kbd3 at ukbd1
ugen2.4: <vendor 0x0471 product 0x082d> at usbus2
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ahcich11: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ahcich11: Poll timeout on slot 9 port 0
ahcich11: is 08000000 cs 00000200 ss 00000000 rs 00000200 tfd 00 serr 01880000 cmd 0004c917
(aprobe3:ahcich11:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe3:ahcich11:0:0:0): CAM status: Command timeout
(aprobe3:ahcich11:0:0:0): Error 5, Retries exhausted
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
Comment 5 Sergei Masharov 2022-08-10 09:51:48 UTC
The problem remains in 13.1-RELEASE-p1 and absent in 13.0-RELEASE-p12
Comment 6 Sergei Masharov 2022-10-14 09:07:24 UTC
The problematic controller is ASMedia ASM1062

I have tried to update firmware on the controller, but the problem persist.

It looks like the problem is also appeared in the 12.3-RELEASE with the same symptoms.

What was changed in the two latest minor releases between 12.2-12.3 and 13.0-13.1 with SATA drivers?
Comment 7 Sergei Masharov 2022-12-05 10:38:33 UTC
12.4-RELEASE has the same problem as 12.3-RELEASE, but doesn't hang as 13.1-RELEASE after I connect the drive to the problematic controller, just fills the logs and console with these messages several times per second:

Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Dec  5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec  5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
Comment 8 Sergei Masharov 2023-04-02 12:28:23 UTC
The problem still exist in 13.2-RC6, I am stuck at 13.0-RELEASE branch without an ability to upgrade even packages, because part of them are not working due to 13.1-RELEASE requirements
Comment 9 Sergei Masharov 2023-06-22 13:23:17 UTC
I have problematic devices, slightly different, bit the problem is the same:

ahci2@pci0:9:0:0:       class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060
    vendor     = 'ASMedia Technology Inc.'
    device     = 'ASM1062 Serial ATA Controller'
    class      = mass storage
    subclass   = SATA

ahci3@pci0:11:0:0:      class=0x010601 rev=0x01 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060
    vendor     = 'ASMedia Technology Inc.'
    device     = 'ASM1062 Serial ATA Controller'
    class      = mass storage
    subclass   = SATA
Comment 10 Sergei Masharov 2023-09-11 22:19:58 UTC
The problem still exist in 13.2-RELEASE and 14.0-BETA1
Comment 11 Sergei Masharov 2023-09-11 22:24:20 UTC
If I connect any device to any of mentioned controllers, the system just goes backscreen and hangs just second after that. The same happens during boot at ahci detection stage with at least one connected device to the problematic controllers
Comment 12 Xin LI freebsd_committer freebsd_triage 2023-09-12 16:15:49 UTC
It appears that the sys/dev/ahci and sys/cam/ata portion of difference between releng/12.2 and releng/12.3 were quite minimal and none of these appear to be likely culprit.  Assigning to maintainer in case there are something that I have missed.
Comment 13 Xin LI freebsd_committer freebsd_triage 2023-09-12 16:17:10 UTC
Move to 14.0 queue as this may be affecting 14.0 too.
Comment 14 Warner Losh freebsd_committer freebsd_triage 2023-10-17 16:21:10 UTC
Why do you have devices connected to your system that can't even respond to an ata identify? I'm guessing that these drives work fine on older systems and it's not a 'failure in place' scenario that has started causing problems.... right?

Invalid action state of 0xfd is very odd. It sounds like something is corrupting the softc of the xpt probe device, or we have a lifetime issue where we're getting callbacks to a zombie device we've already started to tear down or have torn down.

So there's likely a bug down the error paths of ata discovery in some obscure cases maybe?

But, you get a CRC error on 13.2 but not on 13.1, which suggests to me that it's a cabling issue. And/or workarounds for controller errata got changed between 13.1 and 13.2, most likely via a changed that was MFC'd from -current, which explains why 14.0... Or we're doing something new that needs a quirk for these older controllers. If you happen to have better cables handy, and it isn't a huge hassle, that might be the easy path here (though I'm speculating: this could be a bug in the setup of the sata controller so that we don't wind up with a good PHY connection to the drive, leading to the CRC errors even with good cables). If it's hard, then the effort might not be worth the small possibility of it fixing things.

I'm not sure this is a show-stopper for 14.0, though. but I'll leave that up to Alexander..
Comment 15 Warner Losh freebsd_committer freebsd_triage 2023-10-17 16:48:28 UTC
So between releng/13.0 and releng/13.1 we added some ASMedia quirks, but they were for newer versions.
        {0x06121b21, 0x00, "ASMedia ASM1062",   AHCI_Q_NOCCS|AHCI_Q_NOAUX},
as still there from before, which should match your controller.

But I see nothing that would have changed the handling of quirks. Just some resource handling improvements on errors for enclosures... The only real change is:

-   if (!(enc->r_memc = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
-       &rid, RF_ACTIVE))) {
-           mtx_destroy(&enc->mtx);
-           return (ENXIO);
-   }
-   enc->capsem = ATA_INL(enc->r_memc, 0);
-   rid = 1;
-   if (!(enc->r_memt = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
-       &rid, RF_ACTIVE))) {
-           error = ENXIO;
-           goto err0;
+ if ((enc->r_memc = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
+     &rid, RF_ACTIVE)) != NULL) {
+         enc->capsem = ATA_INL(enc->r_memc, 0);
+         rid = 1;
+         if (!(enc->r_memt = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
+             &rid, RF_ACTIVE))) {
+                 error = ENXIO;
+                 goto err0;
+         }
+ } else {
+         enc->capsem = AHCI_EM_XMT | AHCI_EM_SMB | AHCI_EM_LED;
+         enc->r_memt = NULL;
        }

Where we hard-code the enclosure caps if we can't allocate the resource. Before we were setting it to maybe some random value or 0xffffffff. this change might affect things? This is in ahciem.c, and you could drop a printf in after we set it to see what it is on both 13.0 and 13.1.

There were also a couple of changes in cam_xpt.c dealing with finding subdevices that I don't think would matter (but might).

There's a bunch of changes to ctl and mmc, but those don't matter for this problem. Some minor changes to nda that don't matter here. New entries in the SCSI opcode table. Likewise with scsi disks, tapes, passthru, xpt scanning and enclosures.

Some changes to cd reporting (which I don't think are in play).

So I'm puzzled.

Maybe you could do a bisect between the stable/13 13.0 branchpoint and the stable/13 13.1 branchpoint? It is kinda a big ask, but would be the fastest way to discover where things went wrong and would be a big help in tracking things down.
4c44dbde5491516eba8725dc51d39c1dcc817472 is the 13.0 branch point
8824cbace389c440394bb9ea6c127d0f8f85538b is the 13.1 branch point.

Git says there's 13000 chanages, which would be 15 bisect compiles if all goes well (more if you land on one of the rare broken parts of the branch).
Comment 16 Sergei Masharov 2023-11-05 07:28:08 UTC
Thanks for the investigation, I am ready to test several different commits, but I do not know how to build the kernel from the specific build.

Could you please point me to the instruction how to do this?
Comment 17 Sergei Masharov 2023-11-05 07:31:32 UTC
build from the specific commit
Comment 18 Sergei Masharov 2023-11-08 14:09:30 UTC
I have built the kernel from the https://cgit.freebsd.org/src/snapshot/src-8824cbace389c440394bb9ea6c127d0f8f85538b.tar.gz and it is not working with the same symptoms
Comment 19 Sergei Masharov 2023-11-08 18:12:15 UTC
kernel built from https://cgit.freebsd.org/src/snapshot/src-4c44dbde5491516eba8725dc51d39c1dcc817472.tar.gz works fine
Comment 20 Sergei Masharov 2023-11-08 20:41:37 UTC
I also tried to replace src/sys/dev/ahci/ahciem.c in 13.2-RELEASE from the 4c44dbde5491516eba8725dc51d39c1dcc817472 but it didn't work :-(

so the problem is somewhere else
Comment 21 Sergei Masharov 2023-12-11 18:03:07 UTC
I have tried to do bisect, but after entering the bad and good commits bisect selects me the commit with no kernel source, what am I doing wrong?

root@:~/1/src # git switch stable/13
Updating files: 100% (46442/46442), done.
branch 'stable/13' set up to track 'origin/stable/13'.
Switched to a new branch 'stable/13'
root@:~/1/src # git bisect start
status: waiting for both good and bad commits
root@:~/1/src # git bisect good 4c44dbde5491516eba8725dc51d39c1dcc817472
status: waiting for bad commit, 1 good commit known
root@:~/1/src # git bisect bad 8824cbace389c440394bb9ea6c127d0f8f85538b
Bisecting: 6284 revisions left to test after this (roughly 13 steps)
[53245fe337d1c79e28a4b3837bc589b126bc17da] Note that keys must be loaded for 'zpool remove'
Comment 22 Warner Losh freebsd_committer freebsd_triage 2023-12-11 18:05:24 UTC
(In reply to Sergei Masharov from comment #21)

You need git bisect --first-parent because of the way ZFS is imported you don't want to land on one of those versions (which isn't a full source tree, but instead just the source from OpenZFS).
Comment 23 Sergei Masharov 2023-12-12 09:46:16 UTC
I figured it out:
git bisect start -- sys/dev/ahci/

and I think I have found the problematic commit:
[3bec9180c9e70852a8ad0a56461a7ad0dd1abd92] Add IDs for ASMedia ASM116x PCIe 3.0 AHCI controllers.

I will try 14.0 kernel without these extra lines and if it will work, I will also try with extra arguments like with other models above
Comment 24 Sergei Masharov 2023-12-12 12:36:37 UTC
I have finally found the problematic commit: 22b9f618e873140127570e629c19e25d13edc568

also I have found, that less than a year ago it can be turned off by setting hw.pci.enable_mps_tune 

with hw.pci.enable_mps_tune=0 my system works fine under the 14.0-RELEASE-p2

I think this should be set to 0 by default in all releases (12,13,14) and only who have a problems with devices should set it to the true