Bug 255072 - boot (legacy): no progress beyond 'BIOS DRIVE D: is disk1'
Summary: boot (legacy): no progress beyond 'BIOS DRIVE D: is disk1'
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: standards
Version: Unspecified
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-standards (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-15 02:54 UTC by Graham Perrin
Modified: 2022-01-23 19:15 UTC (History)
5 users

See Also:


Attachments
Photograph of the bug. (117.28 KB, image/jpeg)
2021-04-15 02:54 UTC, Graham Perrin
no flags

Description Graham Perrin 2021-04-15 02:54:16 UTC
Created attachment 224121 [details]
Photograph of the bug.

FreeBSD-13.0-RELEASE-amd64-memstick.img

HP ProBook 440 G7
<https://support.hp.com/gb-en/document/c06474914>

Comparative test results: 
<https://gist.github.com/grahamperrin/5eca8231fa7e6a94a1f55991bcd7f3c4#freebsd-130-release-amd64-memstickimg>

> below BIOS DRIVE D: is disk1, a flickering cursor.
Comment 1 Graham Perrin 2021-04-15 03:12:32 UTC
> HP ProBook 440 G7

Adjacent bug 255073 for the same hardware not booting in UEFI mode.
Comment 2 spell 2021-04-19 20:25:06 UTC
I have an analogous case.

FreeBSD-13.0-RELEASE-i386-mini-memstick.img
HP EliteBook 2570p
BIOS Mode Legacy
For a moment these strings:

Consoles: internal video/keyboard
BIOS Drive C: disk0
BIOS Drive D: disk1

are displayed and then the notebook reboots.

FreeBSD-12.2-RELEASE-i386-mini-memstick.img - the same behavior (except that the displayed screen has an additional line at the start about the BTX loader).

FreeBSD-11.2-RELEASE-i386-mini-memstick.img loads successfully.
Comment 3 Graham Perrin 2021-04-20 12:07:33 UTC
Thank you, 

(In reply to spell from comment #2)

> … 
> FreeBSD-13.0-RELEASE-i386-mini-memstick.img
> HP EliteBook 2570p
> BIOS Mode Legacy
> For a moment these strings:

> … are displayed and then the notebook reboots.

In the moment(s) before the reboot, is a flickering cursor visible? 


----

With the HP ProBook 440 G7, which does not automatically reboot, the flickering is _very_ rapid and barely perceptible. 


<https://h20195.www2.hp.com/v2/getpdf.aspx/c06424517.pdf> QuickSpecs
<https://support.hp.com/gb-en/document/c06474914>         specifications
<https://support.hp.com/gb-en/product/hp-probook-440-g7-notebook-pc/29090063>

----

HP EliteBook 2570p

<https://support.hp.com/gb-en/document/c03412731> specifications
<https://support.hp.com/gb-en/product/hp-elitebook-2570p-notebook-pc/5259393/>

> Intel HD Graphics 4000
Comment 4 spell 2021-04-20 12:37:44 UTC
(In reply to Graham Perrin from comment #3)

> In the moment(s) before the reboot, is a flickering cursor visible?

Not at all. It is a very short moment; I barely managed to read those three lines. My camera also can't catch it.

Thank you too for the PR.
Comment 5 Toomas Soome freebsd_committer 2021-04-20 12:40:26 UTC
(In reply to spell from comment #4)


Latest BIOS?
Comment 6 spell 2021-04-20 12:54:44 UTC
(In reply to Toomas Soome from comment #5)

> Latest BIOS?

I believe so; I've updated it recently at a service center.
Do you need more info about the version?
Comment 7 Toomas Soome freebsd_committer 2021-04-20 13:05:20 UTC
(In reply to spell from comment #6)

We got disk names, meaning the biosdisk.c probe functions did OK (more or less). After biosdisk, the zfs probe is run, and most likely it is causing the system to hang, because there are no other messages.

The BIOS version does not help too much (I do not have the hardware anyhow). Normally in such cases we start with elimination and inserting diagnostic printouts.
Comment 8 Toomas Soome freebsd_committer 2021-04-20 14:01:38 UTC
(In reply to Toomas Soome from comment #7)

What we should do next is to investigate why exactly we get stuck, and this would require building a debug loader. I can do this for you.

Before that, I'd like you to test whether you can get the boot: prompt - when the system is starting, at the very first spinner, press space. You should get the boot: prompt; there you can enter status, or ?/ or ?/boot, to list directory contents. You can also enter a file name, like /boot/loader, to start the next boot phase.
Comment 9 spell 2021-04-20 14:40:54 UTC
(In reply to Toomas Soome from comment #8)

I don't see any spinner, but by tapping Space at the right moment I do enter the boot: prompt and can run /boot/loader or whatever.
Please build a loader with the elimination/debug printouts.
Comment 10 Graham Perrin 2021-08-13 08:37:26 UTC
(In reply to Graham Perrin from comment #0)

> … Photograph of the bug. …

Compare with the photograph at bug 257722 comment 3
Comment 11 Graham Perrin 2021-12-28 03:16:10 UTC
(In reply to Graham Perrin from bug 260735 comment 4, where there was some CSM)

> … There's a little more, which I should keep separate …

As keyword 'uefi' applies to bug 255073, should any keyword apply to this bug 255072 for legacy boot?

For what it's worth: in this case I treat legacy as distinct from CSM.
Comment 12 spell 2021-12-29 16:35:06 UTC
(In reply to Toomas Soome from comment #7)

In my case the image boots successfully when I choose IDE instead of AHCI in BIOS settings.
This works for both 13.0-RELEASE and 12.3-RELEASE.

Can you please investigate this direction?
Thank you.
Comment 13 spell 2022-01-02 22:08:57 UTC
(In reply to spell from comment #12)
It turns out that the same loader (from 12.3-RELEASE) installed on the HDD loads successfully in AHCI mode.
Comment 14 spell 2022-01-03 08:55:51 UTC
(In reply to Toomas Soome from comment #7)
> We got disk names, meaning the biosdisk.c probe functions did ok (more or less).

It eventually turned out that they did not.
Trying to boot in all possible combinations has shown that the boot process crashes exactly when all three of these conditions are met:

1) Flash drive is inserted into USB port.
2) AHCI mode is chosen in BIOS settings.
3) The loader sees the Flash device (as drive D) (this occurs when "USB legacy support" is chosen in BIOS settings).

In all other cases the loader boots successfully, whether it runs from the flash drive or the HDD.
Comment 15 Toomas Soome freebsd_committer 2022-01-03 09:02:00 UTC
(In reply to spell from comment #14)

Ok, does the same happen with UEFI boot (assuming this system does support UEFI)?

Otherwise, we would need to build boot loader with debug printouts to see what exactly is going on there.
Comment 16 spell 2022-01-03 11:08:29 UTC
(In reply to Toomas Soome from comment #15)

When UEFI:
A flash drive with the 12.3-RELEASE-i386 image does not appear in the BIOS boot menu at all, so no boot occurs.
A flash drive with the 12.3-RELEASE-amd64 image is visible in the BIOS boot menu and does not crash when booting (but after the loader's menu the video becomes corrupt; a mess of dots is displayed).
Comment 17 Toomas Soome freebsd_committer 2022-01-03 11:20:04 UTC
(In reply to spell from comment #16)

Ok, the i386 image is 32-bit (I guess), so it won't work with 64-bit UEFI.

GFX mixup after kernel is loaded and started is another issue, perhaps fixed in 13/current, but that needs to be tested.

But this did prove the problem is related only to the BIOS - it does smell like we get some bad value while attempting to identify the properties (sector and device size) of that USB flash stick. You could check whether you have the latest BIOS version, too - it may fix it.

Otherwise - when exactly does it crash? Can you get to the boot: prompt (press space when you see the first spinner), or do you get the crash before you even get to the loader itself?
Comment 18 spell 2022-01-03 12:10:09 UTC
(In reply to Toomas Soome from comment #17)
> it does smell like we do get some bad value while attempting to identify properties (sector and device size) of that usb flash stick

Yes, but only in AHCI mode. So who fails here - the BIOS or the loader?

> You could test if you have latest BIOS version, too - it may fix it.

My BIOS version seems to be the latest one.
What can I test exactly?

> when exactly does it crash - can you get to boot: propmpt (press space when you see first spinner) or you do get crash before you even get to loader itself?

When I enter boot2 prompt, I choose default loader and get:

Consoles: internal video/keyboard
BIOS Drive C: disk0
BIOS Drive D: disk1

and right here the notebook reboots.

If I choose another loader (I've added a /boot11 directory with the loader from 11.2-RELEASE to this flash drive), then no crash occurs.
Comment 19 Toomas Soome freebsd_committer 2022-01-03 13:17:16 UTC
(In reply to spell from comment #18)

Who fails depends on the nature of the actual error. Assuming the better part of the machines can boot, it points towards the BIOS, but without knowing the exact error mechanics, we cannot exclude some corner case in the loader code.

The disk list you see is produced in bd_init() from stand/i386/libi386/biosdisk.c, so the crash has to happen in bd_int13probe(), and that usually means something bad happened either in bd_get_diskinfo_ext() or bd_get_diskinfo_std(). In any case, adding a few printf() calls there would allow us to identify where exactly, and what values are causing the crash. Unfortunately, this has to be done on your system, where the crash is happening.
Comment 20 spell 2022-01-08 00:38:35 UTC
(In reply to Toomas Soome from comment #19)
> adding few printf() there would allow us to identify where exactly, and what are the values there causing the crash.

I've added tons of printf()s and breakpoints throughout that whole stack of functions and finally reached as far as bd_edd_io().
That is exactly what fails.
I've added a printf() with all arguments at the beginning of bd_edd_io() and don't see obvious differences between the argument sets that work and the argument sets that crash the function.
Please help me further.

Another thing I've observed: although the crashes look similar every time, they do not always occur after exactly the Nth invocation of bd_edd_io() - more precisely, not after an exact value of the bcache_ops variable (which is incremented in bcache_strategy()).
Two adjacent boots with no code or hardware modifications can give different (but close) bcache_ops values right before the crashes.
Comment 21 Warner Losh freebsd_committer 2022-01-08 00:57:11 UTC
So is it a read or a write? And is this the first such I/O or not?
And can you force it to use bd_chs_io() instead to see if that helps (though if the geometry isn't quite right, CHS mode will be an epic fail later in the boot process)?
Comment 22 spell 2022-01-08 09:24:30 UTC
(In reply to Warner Losh from comment #21)

> So is it a read or a write?

It is always read.

Replacing bd_edd_io() with bd_chs_io() didn't help.

> is this the first such I/O or not?

No. I've added my own counters to bd_edd_io() and bd_chs_io() and see that the crash may occur e.g. upon the 10th or 26th invocation of either of these two functions (counting only reads of the flash drive, not the HDD).
Comment 23 Toomas Soome freebsd_committer 2022-01-08 11:34:28 UTC
(In reply to spell from comment #22)


Could you post the disk properties - the actual ones you see from an OS tool like gpart, and what you get from probing in biosdisk.c (sector size, number of sectors; I guess it is detecting EDD).

The disk IO in the early loader is about detecting the partition type and reading the partition table - what type of partitioning is used on that disk? In case of GPT, we read the disk start *and* the disk end to be sure there is no corruption.

Secondly, there is disk IO from the time we attempt to discover zfs pools; that will read every candidate partition's start and end (the pool config has 4 copies).

After that we have hopefully established our boot file system and will start to read loader files.

Usually, when there is a problem with disk IO, we see a failure while detecting partitioning or while probing for zfs pools.

So, what to look for: certainly the sector number of the read - whether we fit inside the disk. Reading past the disk end can crash many BIOS systems.

A second possible issue is if the disk read reads more than we have buffer space - memory corruption. A possible way to test this guess would be to read 1 sector at a time. We use low-memory buffer space for real-mode INT13 calls and that memory area is 16k, so a single-sector read will (hopefully) not trash past that buffer end...
Comment 24 spell 2022-01-08 14:22:05 UTC
(In reply to Toomas Soome from comment #23)
gpart show /dev/da0
=>      1  2002941  da0  MBR  (978M)
        1     1600    1  efi  (800K)
     1601   803216    2  freebsd  [active]  (392M)
   804817  1198125       - free -  (585M)

This is 12.3-RELEASE-amd64 image.

disk_ioctl() returns the same 2002941 sectors and sector size 512.

According to my printf() info, probing the disks appears OK.
The crash occurs at the zfs probing stage, in the last iteration of the loop:

        for (i = 0; devsw[i] != NULL; i++)

in loader's main.c, when i is 5 and devsw[i]->dv_name is zfs.

This is my printout with printf()'s in this cycle:

BTX loader 1.00  BTX version is 1.02
Consoles: internal video/keyboard
main.c: dv_name: fd dv_type=5
main.c: dv_name: cd dv_type=3
main.c: dv_name: disk dv_type=1
BIOS drive C: is disk0
BIOS drive D: is disk1
main.c: dv_name: net dv_type=2
main.c: dv_name: vdisk dv_type=1
main.c: dv_name: zfs dv_type=4

The zfs probe first probes the HDD, where it is always OK, and then probes the flash drive and crashes on it (if AHCI mode is set).

>Possible way to test this guess would be to read 1 sector at a time.
How to do this?
Comment 25 Toomas Soome freebsd_committer 2022-01-08 17:40:54 UTC
(In reply to spell from comment #24)


For 1-sector reads: bd_realstrategy() allocates the bounce buffer with:

bio_size = min(BIO_BUFFER_SIZE, size);

use 512 for BIO_BUFFER_SIZE.

It would be good to get the sector number and size for last read, however.

The curious thing is, you have GPT, with a freebsd partition (the zfs probe does check it), but after the freebsd partition there is still free space, so we should not get past the disk end - except if the zfs probe is trying "whole disk" first and we got a wrong disk size from INT 13.
Comment 26 spell 2022-01-14 01:23:12 UTC
(In reply to spell from comment #20)
> though the crashes look everytime similar but occur not always exactly after the N's invocation of bd_edd_io()

By chance, the reason for this behavior has been revealed.
The exact moment of the crash depends on how quickly I walk through all my breakpoints.
If I do it slowly enough (one Enter press per second, giving one bd_edd_io() per second) I can even pass the whole zfs probe stage and proceed further.
Otherwise the crash occurs earlier or later.

So the problem is not in the geometry or layout, right?
Comment 27 spell 2022-01-14 02:49:50 UTC
I can't repeat the experiment, so it was probably a temporary coincidence.
So that is still an open question.

(In reply to Toomas Soome from comment #25)
> bio_size = min(BIO_BUFFER_SIZE, size);
> use 512 for BIO_BUFFER_SIZE.

This helps.
(1024 does not.)
Comment 28 spell 2022-01-15 11:48:28 UTC
(In reply to Toomas Soome from comment #25)
> It would be good to get the sector number and size for last read, however.

They differ because the crash occurs at different moments.
The last two crashes occurred at sector numbers (the dblk variable) 1953 and 2001640.
Read size in both cases is 4096.

> bio_size = min(BIO_BUFFER_SIZE, size);
> use 512 for BIO_BUFFER_SIZE.
Thank you for the hint; this has led me to discover that the buffer pointer does matter somehow.

I've replaced BIO_BUFFER_SIZE with V86_IO_BUFFER_SIZE, commented out the bio_alloc() and bio_free() calls, and used a plain "bbuf = bio_buffer;" instead (since no LIFO queue of bio_alloc()/bio_free() calls is present here).

Such a loader still crashes as usual, but when I simply replace "bbuf = bio_buffer;" with "bbuf = PTOV(V86_IO_BUFFER);", the crash does not occur.

Please suggest what to do next.
Comment 29 spell 2022-01-15 13:27:01 UTC
Seems, this may be useful.

The bio_buffer variable on my notebook has address 0x5a6b4, and PTOV(V86_IO_BUFFER) equals 0xffffe000.

The loader's smap command also gives:

SMAP type=01 base=0000000000000000 len=000000000009dc00
SMAP type=02 base=00000000ffb00000 len=0000000000500000

So bio_buffer resides in a usable memory block (type=01) and PTOV(V86_IO_BUFFER) is in a reserved (type=02) memory block.
Comment 30 Toomas Soome freebsd_committer 2022-01-15 13:54:31 UTC
(In reply to spell from comment #28)


Just remind me, what version of FreeBSD is this - current?

the bbuf assignment test is suggesting we do get some sort of buffer overrun there.

ok, V86_IO_BUFFER is at 0x8000 with size 0x1000 (4KB); BIO_BUFFER_SIZE is 0x4000 (16KB), and that buffer is allocated from the bss segment (see bio_buffer[BIO_BUFFER_SIZE] in bio.c).

so, both areas should be safe - in low memory and therefore usable by BIOS INT calls.

Now the catch there is, the btx (our V86 mode "kernel") is at 0x9000, and the loader is at 0xA000 (code start, followed by the data and bss segments and then the stack). So, if the INT writes past 0x8000 + 0x1000, it will corrupt BTX; if the INT writes past the end of bio_buffer, it will corrupt the next variable in BSS.

So, if you are using IO size 512, then both buffer spaces should be just fine. If the INT call actually uses more of that memory, then we may be in trouble. I guess the only way to detect how much buffer memory was actually used is to store a known value into the entire buffer and test how much of the buffer has changed. With no buffer overrun, we would expect exactly the IO size to be changed...
Comment 31 Toomas Soome freebsd_committer 2022-01-15 14:02:00 UTC
(In reply to spell from comment #29)

PTOV and VTOP translate a physical address to virtual and vice versa; physical 0xA000 is virtual 0x0.

So virtual 0xffffe000 is physical 0x00008000
Comment 32 spell 2022-01-15 15:00:15 UTC
(In reply to Toomas Soome from comment #30)
> just remind me, what version of freebsd is this, current?
12.3. Initially I started with 13.0 and noticed that visually its loader crashes the same way as 12.2's does (and later, as 12.3's does), so I've stuck with 12.3.

> So virtual 0xffffe000 is physical 0x00008000
Got it, thank you.

> So, if the INT will write past 0x8000 + 0x1000, it will corrupt BTX;
This never happens in my experiments (or goes unnoticed). Using V86_IO_BUFFER is always successful.

> if INT will write past end of bio_buffer, it will corrupt next variable in BSS.
If there is no buffer overrun when using V86_IO_BUFFER (which is 4K large), how can it happen when using bio_buffer (which is 16K large), if all other conditions are the same?

Also, I am trying to decipher the symptom that the crash does not occur at the same point of the loader run. It seems that the bio_buffer area is somehow used by the BIOS concurrently with its use by v86int() (just a reminder - the loader crashes only when AHCI mode is set in BIOS settings), or the INT runs somehow differently depending on IDE/AHCI mode.
Comment 33 Toomas Soome freebsd_committer 2022-01-15 17:13:17 UTC
(In reply to spell from comment #32)

Hm. So, enforcing the IO size to 1 sector (512B) does not help, but using the buffer at 0x8000 does? That is interesting.

Btw, did you see the comment in bd_io()? It is about a ProLiant and a large disk, but it *may* explain the randomness factor...

I still wonder if we could determine the size of the corruption - note, we can increase the buffer area in BSS for test purposes.
Comment 34 spell 2022-01-15 21:33:34 UTC
(In reply to Toomas Soome from comment #33)
> So, enforcing IO size to 1 sector (512B) does not help, but using buffer at 0x8000 does?
Enforcing the IO size to 512 bytes does help. That is why I paid attention to buffer location variations in the first place.

> I still wonder if we could determine the size of corruption - note, we can increase the buffer area in BSS for test purposes.
Increasing bio_buffer size to BIO_BUFFER_SIZE*4 didn't help.

I am trying to work through bd_io_workaround() and the comment about it, and to detect a possible buffer overrun...
Comment 35 spell 2022-01-19 10:58:31 UTC
(In reply to Toomas Soome from comment #30)
>If the INT call will actually use more of that memory, then we may be in 
>trouble. I guess the only way to detect how much buffer memory was actually 
>used, can be detected by storing know value into entire buffer, and test 
>how big are it is where the buffer is changed.

I can't implement this test because bd_edd_io() does not return (the crash occurs inside it), so I can't check the buffer state after the crashing INT.
Is there any way to look inside the INT itself?

>did you see comment in bd_io()?
>It is about proliant and large disk
Can you please explain what buffer overrun happens on that ProLiant, how bd_io_workaround() solves the problem, and whether it just alleviates the buffer overrun (as the comment says) or totally excludes it?
Comment 36 spell 2022-01-23 19:15:57 UTC
It seems I've caught it.

The crash occurs inside bd_edd_io(), which calls the BTX-owned int 31h, which in turn calls the BIOS-owned int 13h, and it turns out this last int is the one that fails.
The reason this is so difficult to catch is that it crashes randomly. With no obvious differences in environment it may succeed or crash, approximately 99/1.

The 11.2 loader crashes too, though very rarely.
The rule is: the more int 13h calls during the loader run, the greater the crash chance.
By default the 11.2 loader does not enter the zfs probing stage and so issues only two or three int 13h requests per disk. With zfs probing (which is on by default in 12.3) the count of these requests is about a hundred, so the chance of a crash is much bigger.

I didn't try all the int 13h functions, but at least CMD_READ_LBA, CMD_READ_CHS, and one of CMD_CHECK_EDD/CMD_EXT_PARAM lead to the crash.

These are my tests that prove this statement:

12.3 loader: I've added a for(i=0; i<100; i++) loop with identical bd_edd_io() calls right after the original bd_edd_io() call, and the crash occurs inside this batch of calls (each time at a different i value).

11.2 loader: The same with bd_int13probe(). The loader crashes every time on this or that i value.