Bug 242341 - GEOM / GEOM_PART: silent discard MBR modification
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-geom mailing list
Reported: 2019-12-01 17:17 UTC by Tomasz "CeDeROM" CEDRO
Modified: 2019-12-09 16:22 UTC

Description Tomasz "CeDeROM" CEDRO 2019-12-01 17:17:03 UTC
Hello world,

I am unable to clone a pendrive with `dd if=/dev/da0 of=/dev/da1` because GEOM_PART considers the disk broken and silently prevents _all_ of the data from being written (it discards the MBR). Thus:

1. OS / GEOM is hiding things from the operator. It does not write the bytes
as instructed in order to fix the disk. Instead, it considers the disk invalid and
silently discards _only_some_of_the_data_ with no clear error or warning
indication. Unacceptable!!!

2. OS / GEOM lies to the operator. It returns a SUCCESS code while
_some_ data goes to /dev/null. Unacceptable!!!

3. If the disk is _considered_ broken, then access should be _fully_
blocked. But how am I supposed to fix it when the OS silently blocks an
essential part of the fix? Who allows writing over a corrupted disk
anyway? This looks like the typical Linux way, and we don't really like Linux for that, right?

4. OS / GEOM is broken and incoherent in this area, which makes the system
unreliable / not trustworthy. Please fix this.


Solution proposal:

1. Do not hide actions and never lie to the user. Do what the superuser wants.

2. When a disk is _considered_ broken, access should be blocked by returning an I/O error, with a DMESG message and instructions on how to use a sysctl that allows writing to the _whole_ disk in order to fix it.

3. Use clear _warnings_ but perform actions, or use _errors_ and prevent actions.

Thank you :-)
Tomek


ugen0.8: <Kingston DataTraveler 3.0> at usbus0
umass1 on uhub0
umass1: <Kingston DataTraveler 3.0, class 0/0, rev 3.20/1.10, addr 7> on usbus0
umass1:  SCSI over Bulk-Only; quirks = 0x8100
umass1:3:1: Attached to scbus3
da1 at umass-sim1 bus 1 scbus3 target 0 lun 0
da1: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SPC-4 SCSI device
da1: Serial Number BLAHBLAH
da1: 400.000MB/s transfers
da1: 118272MB (242221056 512 byte sectors)
da1: quirks=0x2<NO_6_BYTE>
GEOM_PART: integrity check failed (da1, MBR) <--- here is the only problem indication
Comment 1 Tomasz "CeDeROM" CEDRO 2019-12-01 17:19:33 UTC
Also, in this problematic situation, where GEOM_PART silently discarded writes to the MBR:

1. I was unable to DD into the MBR from a file.

2. I was able to DD into the same MBR from /dev/{zero,random,md0}.
Comment 2 andrew 2019-12-02 03:10:01 UTC
Writing to da1 (and not to some partition) does not go through GEOM_PART. Why do you think that is the problem?

A failed integrity check does not prevent writing to the device; it should only prevent GEOM_PART from creating geoms for the individual partitions.
Comment 3 Tomasz "CeDeROM" CEDRO 2019-12-08 17:11:51 UTC
Hello Andrew! :-)

Thank you for your reply. There is a discussion on the mailing lists (questions and usb), where some folks say this is a pendrive problem, while others suggest that an invalid MBR may be the reason writes to the MBR are discarded.

I think this is the best place to ask questions and solve the issue directly with the authors :-)

1. I would like to clarify why I was unable to write to the MBR (first 512 bytes) of da1 while there was no problem with writing to the rest of the disk?

2. GEOM_PART: integrity check failed (da1, MBR) <- could that block / discard writes to MBR silently?

3. Why did `dd if=/dev/da0 of=/dev/da1` copy the whole disk except the MBR?

4. Why was I unable to write to the MBR from a file, while it was possible to write the same data from a file through md0?

This is the first time such a situation has happened to me in FreeBSD in many, many years. I work a lot with disk copies / images, so a reliable and trustworthy DD is a must-have for me.
Comment 4 Andrey V. Elsukov freebsd_committer 2019-12-09 11:09:35 UTC
Hello,

if you want to get answers, you need to provide more useful info.
First of all, reports like "I did something and it didn't work" aren't useful.
You should provide the state of your system before you made any changes, then what you want to do and what result you expect, and then what result you actually got.

How to obtain this info:
gpart show
gpart list
mount
dmesg | grep GEOM

Include all the commands you ran and their output. Then we will be able to say where the problem is.
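
The commands listed above can be captured into a single file for attaching to the bug. This is a minimal sketch, assuming a POSIX sh. The helper name run_logged and the log path are made up for this example, and gpart(8) is only present on FreeBSD, so on other systems the log simply records that the command was not found.

```shell
#!/bin/sh
# Collect the diagnostics requested above into one log file.
# LOG and run_logged are hypothetical names chosen for this sketch.
LOG="${LOG:-${TMPDIR:-/tmp}/geom-report.txt}"
: > "$LOG"

run_logged() {
    # Record the command line, then its output (stdout and stderr).
    printf '### %s\n' "$1" >> "$LOG"
    sh -c "$1" >> "$LOG" 2>&1 || printf '(exit status %s)\n' "$?" >> "$LOG"
    printf '\n' >> "$LOG"
}

run_logged 'gpart show'
run_logged 'gpart list'
run_logged 'mount'
run_logged 'dmesg | grep GEOM'

echo "report written to $LOG"
```

A failing command does not stop the collection; its exit status is written to the log instead, so the report still shows what was attempted.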
Comment 5 Andrey V. Elsukov freebsd_committer 2019-12-09 11:22:38 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #3)

> 1. I would like to clarify why I was unable to write to the MBR (first 512 bytes) of da1 while there was no problem with writing to the rest of the disk?

How did you know that? Did you compare the first 512 bytes on both disks after copying?

> 2. GEOM_PART: integrity check failed (da1, MBR) <- could that block / discard writes to MBR silently?

This usually means that the partition table's metadata is wrong. If it is wrong and kern.geom.part.check_integrity is set, then GEOM_PART just ignores the partition table.

> 3. Why did `dd if=/dev/da0 of=/dev/da1` copy the whole disk except the MBR?

How did you know that? Did you compare the first 512 bytes on both disks after copying? I assume you have disks of different sizes, and GEOM_PART just rejects the MBR on the copied disk because its integrity check has failed.

> 4. Why was I unable to write to the MBR from a file, while it was possible to write the same data from a file through md0?

Again, how did you know that?
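
The comparison asked about above (the first 512 bytes of both disks) can be sketched as follows. This is only an illustration under stated assumptions: same_first_sector is a hypothetical helper name, and the demonstration runs on regular files standing in for /dev/da0 and /dev/da1 so that it is safe to try.

```shell
#!/bin/sh
# same_first_sector is a hypothetical helper name for this sketch.
# It copies the first 512 bytes of each argument to temporary files
# and compares them with cmp(1). Exit status 0 means identical.
same_first_sector() {
    a=$(mktemp "${TMPDIR:-/tmp}/sec.XXXXXX") || return 2
    b=$(mktemp "${TMPDIR:-/tmp}/sec.XXXXXX") || return 2
    dd if="$1" of="$a" bs=512 count=1 2>/dev/null
    dd if="$2" of="$b" bs=512 count=1 2>/dev/null
    cmp -s "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Demonstration on regular files standing in for the two pendrives.
src=$(mktemp "${TMPDIR:-/tmp}/src.XXXXXX")
dst=$(mktemp "${TMPDIR:-/tmp}/dst.XXXXXX")
dd if=/dev/zero of="$src" bs=512 count=4 2>/dev/null
dd if=/dev/zero of="$dst" bs=512 count=4 2>/dev/null
# Change one byte inside sector 0 of the source so the sectors differ.
printf 'X' | dd of="$src" bs=1 seek=10 conv=notrunc 2>/dev/null

if same_first_sector "$src" "$dst"; then
    echo "first sector identical"
else
    echo "first sector differs"
fi
```

With real devices the two arguments would be the device nodes, and the exit status gives a yes/no answer that does not depend on reading a dump by eye.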
Comment 6 Tomasz "CeDeROM" CEDRO 2019-12-09 15:27:13 UTC
(In reply to Andrey V. Elsukov from comment #5)

Hello Andrey and thank you for your reply :-)

= ACTIONS =

1. dd if=/dev/da0 of=/dev/da1 bs=512 count=1 does not copy anything.
2. dd if=/dev/zero of=/dev/da1 bs=512 count=1 zeroes the mbr.
3. dd if=/dev/random of=/dev/da1 bs=512 count=1 randoms the mbr.
4. dd if=/dev/da0 of=da0.mbr bs=512 count=1; dd if=da0.mbr of=/dev/da1 bs=512
count=1 does NOT copy the mbr.
5. mdconfig -a -t vnode -f da0.mbr; dd if=/dev/md0 of=/dev/da1 bs=512
count=1 does put the data into the mbr, but da1 still seems to have no
partitions!!!
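
The save, write and verify sequence from the list above can be sketched as one helper that fails loudly on a mismatch. This is a sketch under stated assumptions, not the commands that were actually run: copy_sector0_verified is a hypothetical name, and the demonstration uses regular files in place of /dev/da0 and /dev/da1.

```shell
#!/bin/sh
# copy_sector0_verified is a hypothetical helper for this sketch.
# It saves sector 0 of SRC to a file, writes that file to DST,
# reads sector 0 of DST back, and reports a mismatch on stderr.
# conv=notrunc keeps a regular-file target from being truncated.
copy_sector0_verified() {
    saved=$(mktemp "${TMPDIR:-/tmp}/mbr.XXXXXX") || return 2
    back=$(mktemp "${TMPDIR:-/tmp}/mbr.XXXXXX") || return 2
    dd if="$1" of="$saved" bs=512 count=1 2>/dev/null &&
        dd if="$saved" of="$2" bs=512 count=1 conv=notrunc 2>/dev/null &&
        dd if="$2" of="$back" bs=512 count=1 2>/dev/null &&
        cmp -s "$saved" "$back"
    rc=$?
    rm -f "$saved" "$back"
    [ "$rc" -eq 0 ] || echo "sector 0 of $2 does not match $1" >&2
    return "$rc"
}

# Demonstration: a 4-sector source and an 8-sector, zero-filled target.
src=$(mktemp "${TMPDIR:-/tmp}/src.XXXXXX")
dst=$(mktemp "${TMPDIR:-/tmp}/dst.XXXXXX")
dd if=/dev/zero of="$src" bs=512 count=4 2>/dev/null
printf 'MBRDATA' | dd of="$src" bs=1 conv=notrunc 2>/dev/null
dd if=/dev/zero of="$dst" bs=512 count=8 2>/dev/null

copy_sector0_verified "$src" "$dst" && echo "sector 0 written and verified"
```

The point of the design is that success is decided by the read-back comparison rather than by the exit status of dd alone.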



= ANSWERS =

>> 1. I would like to clarify why I was unable to write to the MBR
>> (first 512 bytes) of da1 while there was no problem with
>> writing to the rest of the disk?
> How did you know that? Did you compare the first 512 bytes
> on both disks after copying?

Yes, exactly! I am using `less -f` rather than `bvi` (which stalls the OS by loading 128GB into RAM lol) on large files, just to sneak a peek at the contents.

After the initial failures with DD I did ZERO the whole disk, just to make sure it is all clean. Zeros look like "monkeys" (@ signs), and you can see with the bare eye where there are zeros and where there are not. 512 bytes fit nicely on one screen :-)


>> 2. GEOM_PART: integrity check failed (da1, MBR) 
>> <- could that block / discard writes to MBR silently?
> This usually means that the partition table's metadata is wrong.
> If it is wrong and kern.geom.part.check_integrity is set,
> then GEOM_PART just ignores the partition table.

root@0xCFMX4:~ # sysctl kern.geom.part.check_integrity
kern.geom.part.check_integrity: 1

Can you please explain in detail what exactly "ignores the partition table" means? I guess this is where the problem is.

If the "partition table is ignored", then why is it ignored silently, so that writes to the MBR go to /dev/null and reads from the MBR do not update the disk layout, while it is still possible to write to the rest of the disk?

Why is the whole disk not locked against writing in a clearly distinguishable manner, such as an error return code from DD and error messages in DMESG?

If the MBR is corrupt, then there is a high chance that all partitions are corrupt and out-of-bounds writes may happen, so writing data to such a disk could irreversibly destroy the data stored on it.

This "ignore" behavior makes the system incoherent and unpredictable. Here is my proposal for a change:

1. Either we allow writing to the WHOLE disk AND the MBR, but signal the MBR corruption with a _warning_ in DMESG when kern.geom.part.check_integrity=1.

2. Or we signal the MBR corruption as an _error_ in DMESG and lock the whole disk against writing with a clear error message (i.e. DD writes return a clear failure, and at the same time DMESG shows an error message like "/DEV/XXX disk is write locked because MBR is corrupt. Clear kern.geom.part.check_integrity if you want to write to that disk anyway") when kern.geom.part.check_integrity=1.

3. I know there are some additional flags that may allow writing over an "ignored" disk. They are not set by default and it is unclear where to find them. They are definitely not a fix for the current "silent write ignore surprise", which persists until you set them. I guess kern.geom.part.check_integrity is good enough to control the situation: either kern.geom.part.check_integrity=1 and, when the MBR or GPT is corrupt, the WHOLE disk is clearly locked against writing (as explained in point 2 above), or kern.geom.part.check_integrity=0 and we can write over the WHOLE disk even if the MBR or GPT seems corrupted, because it is not verified.


I think number 2 would be the safest choice..?



>> 3. Why did `dd if=/dev/da0 of=/dev/da1` copy the whole disk except the MBR?
> How did you know that? Did you compare the first 512 bytes on both disks
> after copying? I assume you have disks of different sizes, and GEOM_PART
> just rejects the MBR on the copied disk because its integrity check has failed.

Exactly, I verified them after the write. No write to the MBR occurred, but the rest of the contents was written. There was no clear error indicating that the MBR was not written. This should be signaled with an error, and writing to the disk should be prohibited as explained above.

So you confirm that the write to the MBR was silently discarded and the whole situation is indeed managed by GEOM_PART. This is exactly the problem I am reporting. Thank you :-)


>> 4. Why was I unable to write to the MBR from a file, while it was possible
>> to write the same data from a file through md0?
> Again, how did you know that?

Verification after write :-)



= SYSTEM DETAILS =

ugen0.7: <Kingston DataTraveler 3.0> at usbus0
umass0 on uhub0
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 3.20/1.10, addr 6> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x8100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
da0: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SPC-4 SCSI device
da0: Serial Number BLAHBLAH
da0: 400.000MB/s transfers
da0: 118368MB (242417664 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>

ugen0.8: <Kingston DataTraveler 3.0> at usbus0
umass1 on uhub0
umass1: <Kingston DataTraveler 3.0, class 0/0, rev 3.20/1.10, addr 7> on usbus0
umass1:  SCSI over Bulk-Only; quirks = 0x8100
umass1:3:1: Attached to scbus3
da1 at umass-sim1 bus 1 scbus3 target 0 lun 0
da1: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SPC-4 SCSI device
da1: Serial Number BLAHBLAH
da1: 400.000MB/s transfers
da1: 118272MB (242221056 512 byte sectors)
da1: quirks=0x2<NO_6_BYTE>
GEOM_PART: integrity check failed (da1, MBR)

FreeBSD 0xCFMX4 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC  amd64
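
The two device listings above report different sector counts, which is what the assumption in comment 5 about different disk sizes refers to. The arithmetic can be sketched as below. The sector counts are taken from the dmesg output above; the partition end is purely hypothetical, since the real partition table is not shown anywhere in this report, and whether this is why the integrity check failed remains unconfirmed.

```shell
#!/bin/sh
# Sector counts as reported by the kernel for da0 and da1 above.
da0_sectors=242417664
da1_sectors=242221056
sector_size=512

diff_sectors=$((da0_sectors - da1_sectors))
diff_mib=$((diff_sectors * sector_size / 1024 / 1024))
echo "da1 is smaller than da0 by $diff_sectors sectors ($diff_mib MiB)"

# Hypothetical partition: assume one that ends on the last sector of da0.
# The real table is not part of this report.
part_end=$((da0_sectors - 1))
if [ "$part_end" -ge "$da1_sectors" ]; then
    echo "a partition ending at LBA $part_end would not fit on da1"
fi
```

The difference works out to 196608 sectors, or 96 MiB, which matches the 118368MB and 118272MB figures in the listings.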
Comment 7 Andrey V. Elsukov freebsd_committer 2019-12-09 15:51:28 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #6)

And again, you didn't provide any useful info.

> 1. dd if=/dev/da0 of=/dev/da1 bs=512 count=1 does not copy anything.

Please use the script(1) command if you are unable to copy from your terminal. It should be something like this:
# dd if=/dev/da0 of=/dev/da1 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.001958 secs (261495 bytes/sec)

You can use hexdump(1) command to check the result of your copy, e.g.
# dd if=/dev/da1 count=1 | hexdump -vC
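
One thing that can be checked mechanically in such a dump is whether the sector ends with the bytes 0x55 0xAA, the boot signature that terminates an MBR sector. The sketch below is an illustration only: has_boot_signature is a hypothetical helper, od(1) is used instead of hexdump(1) purely for portability, and the check is far weaker than the GEOM_PART integrity check, which it does not attempt to reproduce.

```shell
#!/bin/sh
# has_boot_signature is a hypothetical helper for this sketch.
# It reads the first 512-byte sector of the given file or device and
# succeeds when its last two bytes are 0x55 0xAA.
has_boot_signature() {
    sig=$(dd if="$1" bs=512 count=1 2>/dev/null | tail -c 2 | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Demonstration on a 512-byte regular file standing in for /dev/da1.
img=$(mktemp "${TMPDIR:-/tmp}/sector.XXXXXX")
dd if=/dev/zero of="$img" bs=510 count=1 2>/dev/null
printf '\125\252' >> "$img"        # octal escapes for 0x55 0xAA

if has_boot_signature "$img"; then
    echo "boot signature present"
else
    echo "boot signature missing"
fi
```

The whole sector is read in one 512-byte request and trimmed afterwards, because raw disk devices generally expect sector-sized reads.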

> 2. dd if=/dev/zero of=/dev/da1 bs=512 count=1 zeroes the mbr.

This looks impossible: if the first command doesn't work, this one won't work either.

> Can you please explain in detail what exactly "ignores the partition table"
> means? I guess this is where the problem is.

There is no problem; it is all described in the gpart(8) man page. "Ignores" means that the GEOM_PART class will not create a geom object of type MBR, and no partitions will appear.

> If the "partition table is ignored", then why is it ignored silently, so
> that writes to the MBR go to /dev/null and reads from the MBR do not update
> the disk layout, while it is still possible to write to the rest of the disk?
> Why is the whole disk not locked against writing in a clearly
> distinguishable manner, such as an error return code from DD and error
> messages in DMESG?

I don't think that something like this happens in reality :)

> So you confirm that the write to the MBR was silently discarded and the whole
> situation is indeed managed by GEOM_PART. This is exactly the problem I
> am reporting. Thank you :-)

You still misunderstand how all of this works. There are different GEOM objects, and when the MBR is ignored, GEOM_PART discards nothing, since there is no related GEOM object. This is why I asked you to show the output of the `gpart show` and `gpart list` commands.
Comment 8 Tomasz "CeDeROM" CEDRO 2019-12-09 16:22:21 UTC
Okay, I am already using this pendrive with some data on it, and I don't have the other one anymore. This was supposed to be a quick swap, and it failed with a surprise.

I will try to recreate the situation, capture logs with script / tee / hexdump as requested, and report back. I will probably have to buy another pendrive, or I will use an SD card if I have one. Maybe the first sectors would be enough :-)

I was also surprised when dd if=/dev/da0 of=/dev/da1 put data on the pendrive but no MBR. I tell you :-) :-)

ps/2: Do you know https://asciinema.org/? Is it okay for documenting stuff? :-)