new kernel from 13.1-RELEASE crashes on boot, the last message on console is: panic: aprobedone: invalid action state 0xfd I think it is related to sata controller, 13.0-RELEASE-p11 works fine
@Sergei - From what version was the 13.1 upgrade? - What was the upgrade method? - Can you boot verbose and attach a screenshot
Created attachment 234018 [details] console screenshot
upgrade was done from 13.0-RELEASE via "freebsd-update -r 13.1-RELEASE upgrade" problematic controller is Asmedia 106x I was able to catch screenshot only by 120fps video recording, the system is reset despite of verbose settings and kern.panic_reboot_wait_time=-1 in /boot.loader.conf
I have disconnected all disks from problematic SATA controllers and system was booted without any problems. After that I have connected one disk and system also boot, but with several errors in the console and it took couple minutes, several seconds between each "Root mount waiting for: CAM" message the disk on the problematic controller is not working, I do not see it at the dmesg at all. "camcontrol reset" doesn't help. Trying to mount root from zfs:mc []... uhub0: 4 ports with 4 removable, self powered (aprobe3:ahcich11:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (aprobe3:ahcich11:0:0:0): CAM status: Uncorrectable parity/CRC error (aprobe3:ahcich11:0:0:0): Error 5, Retry was blocked uhub1: 8 ports with 8 removable, self powered (aprobe3:ahcich11:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (aprobe3:ahcich11:0:0:0): CAM status: Uncorrectable parity/CRC error (aprobe3:ahcich11:0:0:0): Error 5, Retry was blocked ugen2.2: <vendor 0x0409 product 0x005a> at usbus2 uhub3 on uhub1 uhub3: <vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 1> on usbus2 uhub2: 22 ports with 22 removable, self powered uhub3: 2 ports with 0 removable, self powered Root mount waiting for: usbus0 CAM usbus2 ugen2.3: <AVerMedia A805> at usbus2 ukbd0 on uhub3 ukbd0: <AVerMedia A805, class 0/0, rev 2.00/2.00, addr 2> on usbus2 kbd2 at ukbd0 ugen0.2: <Logitech USB Receiver> at usbus0 ukbd1 on uhub2 ukbd1: <Logitech USB Receiver, class 0/0, rev 2.00/12.01, addr 1> on usbus0 kbd3 at ukbd1 ugen2.4: <vendor 0x0471 product 0x082d> at usbus2 Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM ahcich11: AHCI reset: device not ready after 31000ms (tfd = 00000080) Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM Root mount waiting for: CAM ahcich11: Poll timeout on slot 9 port 0 ahcich11: is 08000000 cs 00000200 ss 00000000 rs 00000200 tfd 00 serr 01880000 cmd 0004c917 (aprobe3:ahcich11:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe3:ahcich11:0:0:0): CAM status: Command timeout (aprobe3:ahcich11:0:0:0): Error 5, Retries exhausted ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
The problem remains in 13.1-RELEASE-p1 and absent in 13.0-RELEASE-p12
The problematic controller is ASMedia ASM1062 I have tried to update firmware on the controller, but the problem persist. It looks like the problem is also appeared in the 12.3-RELEASE with the same symptoms. What was changed in the two latest minor releases between 12.2-12.3 and 13.0-13.1 with SATA drivers?
12.4-RELEASE has the same problem as 12.3-RELEASE, but doesn't hang as 13.1-RELEASE after I connect the drive to the problematic controller, just fills the logs and console with these messages several times per second: Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:50 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:51 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked Dec 5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): CAM status: Uncorrectable parity/CRC error Dec 5 13:32:52 z120-i386 kernel: (aprobe0:ahcich10:0:0:0): Error 5, Retry was blocked
The problem still exist in 13.2-RC6, I am stuck at 13.0-RELEASE branch without an ability to upgrade even packages, because part of them are not working due to 13.1-RELEASE requirements
I have problematic devices, slightly different, bit the problem is the same: ahci2@pci0:9:0:0: class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060 vendor = 'ASMedia Technology Inc.' device = 'ASM1062 Serial ATA Controller' class = mass storage subclass = SATA ahci3@pci0:11:0:0: class=0x010601 rev=0x01 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060 vendor = 'ASMedia Technology Inc.' device = 'ASM1062 Serial ATA Controller' class = mass storage subclass = SATA
The problem still exist in 13.2-RELEASE and 14.0-BETA1
If I connect any device to any of mentioned controllers, the system just goes backscreen and hangs just second after that. The same happens during boot at ahci detection stage with at least one connected device to the problematic controllers
It appears that the sys/dev/ahci and sys/cam/ata portion of difference between releng/12.2 and releng/12.3 were quite minimal and none of these appear to be likely culprit. Assigning to maintainer in case there are something that I have missed.
Move to 14.0 queue as this may be affecting 14.0 too.
Why do you have devices connected to your system that can't even respond to an ata identify? I'm guessing that these drives work fine on older systems and it's not a 'failure in place' scenario that has started causing problems.... right? Invalid action state of 0xfd is very odd. It sounds like something is corrupting the softc of the xpt probe device, or we have a lifetime issue where we're getting callbacks to a zombie device we've already started to tear down or have torn down. So there's likely a bug down the error paths of ata discovery in some obscure cases maybe? But, you get a CRC error on 13.2 but not on 13.1, which suggests to me that it's a cabling issue. And/or workarounds for controller errata got changed between 13.1 and 13.2, most likely via a changed that was MFC'd from -current, which explains why 14.0... Or we're doing something new that needs a quirk for these older controllers. If you happen to have better cables handy, and it isn't a huge hassle, that might be the easy path here (though I'm speculating: this could be a bug in the setup of the sata controller so that we don't wind up with a good PHY connection to the drive, leading to the CRC errors even with good cables). If it's hard, then the effort might not be worth the small possibility of it fixing things. I'm not sure this is a show-stopper for 14.0, though. but I'll leave that up to Alexander..
So between releng/13.0 and releng/13.1 we added some ASMedia quirks, but they were for newer versions. {0x06121b21, 0x00, "ASMedia ASM1062", AHCI_Q_NOCCS|AHCI_Q_NOAUX}, as still there from before, which should match your controller. But I see nothing that would have changed the handling of quirks. Just some resource handling improvements on errors for enclosures... The only real change is: - if (!(enc->r_memc = bus_alloc_resource_any(dev, SYS_RES_MEMORY, - &rid, RF_ACTIVE))) { - mtx_destroy(&enc->mtx); - return (ENXIO); - } - enc->capsem = ATA_INL(enc->r_memc, 0); - rid = 1; - if (!(enc->r_memt = bus_alloc_resource_any(dev, SYS_RES_MEMORY, - &rid, RF_ACTIVE))) { - error = ENXIO; - goto err0; + if ((enc->r_memc = bus_alloc_resource_any(dev, SYS_RES_MEMORY, + &rid, RF_ACTIVE)) != NULL) { + enc->capsem = ATA_INL(enc->r_memc, 0); + rid = 1; + if (!(enc->r_memt = bus_alloc_resource_any(dev, SYS_RES_MEMORY, + &rid, RF_ACTIVE))) { + error = ENXIO; + goto err0; + } + } else { + enc->capsem = AHCI_EM_XMT | AHCI_EM_SMB | AHCI_EM_LED; + enc->r_memt = NULL; } Where we hard-code the enclosure caps if we can't allocate the resource. Before we were setting it to maybe some random value or 0xffffffff. this change might affect things? This is in ahciem.c, and you could drop a printf in after we set it to see what it is on both 13.0 and 13.1. There were also a couple of changes in cam_xpt.c dealing with finding subdevices that I don't think would matter (but might). There's a bunch of changes to ctl and mmc, but those don't matter for this problem. Some minor changes to nda that don't matter here. New entries in the SCSI opcode table. Likewise with scsi disks, tapes, passthru, xpt scanning and enclosures. Some changes to cd reporting (which I don't think are in play). So I'm puzzled. Maybe you could do a bisect between the stable/13 13.0 branchpoint and the stable/13 13.1 branchpoint? It is kinda a big ask, but would be the fastest way to discover where things went wrong and would be a big help in tracking things down. 4c44dbde5491516eba8725dc51d39c1dcc817472 is the 13.0 branch point 8824cbace389c440394bb9ea6c127d0f8f85538b is the 13.1 branch point. Git says there's 13000 chanages, which would be 15 bisect compiles if all goes well (more if you land on one of the rare broken parts of the branch).
Thanks for the investigation, I am ready to test several different commits, but I do not know how to build the kernel from the specific build. Could you please point me to the instruction how to do this?
build from the specific commit
I have built the kernel from the https://cgit.freebsd.org/src/snapshot/src-8824cbace389c440394bb9ea6c127d0f8f85538b.tar.gz and it is not working with the same symptoms
kernel built from https://cgit.freebsd.org/src/snapshot/src-4c44dbde5491516eba8725dc51d39c1dcc817472.tar.gz works fine
I also tried to replace src/sys/dev/ahci/ahciem.c in 13.2-RELEASE from the 4c44dbde5491516eba8725dc51d39c1dcc817472 but it didn't work :-( so the problem is somewhere else
I have tried to do bisect, but after entering the bad and good commits bisect selects me the commit with no kernel source, what am I doing wrong? root@:~/1/src # git switch stable/13 Updating files: 100% (46442/46442), done. branch 'stable/13' set up to track 'origin/stable/13'. Switched to a new branch 'stable/13' root@:~/1/src # git bisect start status: waiting for both good and bad commits root@:~/1/src # git bisect good 4c44dbde5491516eba8725dc51d39c1dcc817472 status: waiting for bad commit, 1 good commit known root@:~/1/src # git bisect bad 8824cbace389c440394bb9ea6c127d0f8f85538b Bisecting: 6284 revisions left to test after this (roughly 13 steps) [53245fe337d1c79e28a4b3837bc589b126bc17da] Note that keys must be loaded for 'zpool remove'
(In reply to Sergei Masharov from comment #21) You need git bisect --first-parent because of the way ZFS is imported you don't want to land on one of those versions (which isn't a full source tree, but instead just the source from OpenZFS).
I figured it out: git bisect start -- sys/dev/ahci/ and I think I have found the problematic commit: [3bec9180c9e70852a8ad0a56461a7ad0dd1abd92] Add IDs for ASMedia ASM116x PCIe 3.0 AHCI controllers. I will try 14.0 kernel without these extra lines and if it will work, I will also try with extra arguments like with other models above
I have finally found the problematic commit: 22b9f618e873140127570e629c19e25d13edc568 also I have found, that less than a year ago it can be turned off by setting hw.pci.enable_mps_tune with hw.pci.enable_mps_tune=0 my system works fine under the 14.0-RELEASE-p2 I think this should be set to 0 by default in all releases (12,13,14) and only who have a problems with devices should set it to the true