Running dumpfs on a UFS2 file system in a block device encrypted via geli leads to the following error message when attempting to list free fragments: dumpfs: /dev/md0.eli: cylinder group checks failed "dumpfs -s" works without issues. "dumpfs -f" immediately triggers the error. I'm running FreeBSD 13.2-RELEASE. Here are instructions that should easily reproduce the issue. I'm using md here only for easy reproducibility. The error occurs on actual hard drives as well: # truncate -s 1G image # mdconfig image md0 # geli init -s 4096 -e aes -l 256 /dev/md0 Enter new passphrase: Reenter new passphrase: Metadata backup for provider /dev/md0 can be found in /var/backups/md0.eli and can be restored with the following command: # geli restore /var/backups/md0.eli /dev/md0 # geli attach /dev/md0 Enter passphrase: # newfs /dev/md0.eli /dev/md0.eli: 1024.0MB (2097144 sectors) block size 32768, fragment size 4096 using 4 cylinder groups of 256.00MB, 8192 blks, 32768 inodes. super-block backups (for fsck_ffs -b #) at: 192, 524480, 1048768, 1573056 # dumpfs /dev/md0.eli magic 19540119 (UFS2) last mounted time Thu Jan 1 01:00:00 1970 last modified time Thu Jun 1 16:39:30 2023 superblock location 65536 id [ 6478ada2 7c30d5d9 ] ncg 4 size 262143 blocks 253862 bsize 32768 shift 15 mask 0xffff8000 fsize 4096 shift 12 mask 0xfffff000 frag 8 shift 3 fsbtodb 3 minfree 8% optim time symlinklen 120 maxbsize 32768 maxbpg 4096 maxcontig 32 contigsumsize 16 nbfree 31729 ndir 2 nifree 131068 nffree 28 bpg 8192 fpg 65536 ipg 32768 unrefs 0 nindir 4096 inopb 128 maxfilesize 2252349704110079 sbsize 4096 cgsize 16384 csaddr 2088 cssize 4096 sblkno 24 cblkno 32 iblkno 40 dblkno 2088 cgrotor 0 fmod 0 ronly 0 clean 1 metaspace 2616 avgfpdir 64 avgfilesize 16384 flags check hashes superblock cylinder-groups inodes fsmnt volname swuid 0 providersize 262143 cs[].cs_(nbfree,ndir,nifree,nffree): (7928,2,32764,21) (7934,0,32768,0) (7934,0,32768,0) (7933,0,32768,7) blocks in last group 8191 dumpfs: /dev/md0.eli: cylinder group checks failed I think it is this check that triggers the error: https://github.com/freebsd/freebsd-src/blob/525ecfdad597980ea4cd59238e24c8530dbcd31d/lib/libufs/cgroup.c#LL248C15-L248C15 fsck cannot find any issues: # fsck /dev/md0.eli ** /dev/md0.eli ** Last Mounted on ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups 2 files, 2 used, 253860 free (28 frags, 31729 blocks, 0.0% fragmentation) ***** FILE SYSTEM IS CLEAN *****
Couldn't reproduce (using your exact commands) on either 13.2-stable from a few weeks back, or a clean 13.2-release install.
I started experimenting on an entirely different system with a fresh installation and there seems to be another key factor to reproduce this: geli needs to use software crypto and not null crypto. To reproduce it on a system that supports AESNI I need to specify "-e camellia-cbc". In summary: * never reproducible with "-e null" * reproducible with "-e aes" and "-e aes-cbc" only on systems without AESNI * always reproducible with "-e camellia-cbc" So this might not even be an issue with dumpfs but with the software crypto in geli.
I am able to reproduce this in the Live-CD system of FreeBSD-13.2-RELEASE-amd64-disc1.iso. However, I cannot reproduce it in the Live-CD system of FreeBSD-14.0-CURRENT-amd64-20230601-4f2cc73f34eb-263302-disc1.iso. When I do the "geli init" and the "newfs" in FreeBSD current and then attach the disk in FreeBSD 13.2 and do the "dumpfs", the issue still occurs. So it seems to be unrelated to "geli init" and "newfs".
(In reply to Michael Laß from comment #2) Reproduced on 13.2-release and 13.2-stable by using hint.aesni.0.disabled="1" in loader.conf to disable AESNI. It looks like one 16-byte chunk is corrupted.
(In reply to Andrew "RhodiumToad" Gierth from comment #4) Looks like it might be related to page alignment? dumpfs isn't using an aligned buffer, and it's the 16 bytes immediately before a page boundary that come out corrupt. Almost everything else will likely be using aligned pages, which might explain why a bug would go unnoticed.
^Triage, partly at the suggestion of RhodiumToad in IRC: * kern * severity * assignment to a list that's relevant to (GELI) encryption Key phrase: data corruption
(In reply to Michael Laß from comment #3) I was wondering why I could not reproduce this in 14 Current. The dumpfs issue is hidden by changes between 1297a704d99110644af1ae32840cee1b9523cb56 (still reproducible) and b21582ee03ec1394e08173e276df311979856e54 (not reproducible anymore). Most likely it comes down to: bf24d17fda75cc89a92248715d52a73f23adc89c: Have dumpfs(8) ignore superblock check-hash failures. However, this seems to only hide the issue in dumpfs and I assume that the actual data corruption issue in geli is still present in 14 Current. However, I'm not sure how to verify that.
This just got much more serious: in trying to reproduce the bug more easily, I found that it is possible to panic the system with it; if the alignment of the user-provided buffer is not merely non-page-aligned, but also not word-aligned (for some wordsize, I don't yet know what), then: panic: general protection fault KDB: stack backtrace: [...] #3 0xffffffff807b9467 at trap_fatal+0x387 #4 0xffffffff8079378e at calltrap+0x8 #5 0xffffffff80696c53 at swcr_encdec+0x653 #6 0xffffffff8069635a at swcr_process+0x3a #7 0xffffffff806942bd at crypto_dispatch+0x15d #8 0xffffffff8124bab8 at g_eli_crypto_run+0x158 #9 0xffffffff81244c88 at g_eli_worker+0x308
(In reply to Andrew "RhodiumToad" Gierth from comment #8) Correction: I can get crashes even for buffers with 16-byte alignment. Seems to be related to crypto_cursor_segment? I will post more info shortly.
Created attachment 242577 [details] Test program I've attached the test program I'm using to reproduce. Usage: md=$(mdconfig -s 1M -t swap) geli onetime -s 4096 $md # this will panic with software crypto, or if there are data discrepancies it will report them: ./a.out /dev/$md 1 The code where it panics has changed in 14-current so it's quite possible that the bug doesn't exist there. This program should make it easy to test.
(In reply to Andrew "RhodiumToad" Gierth from comment #10) That should be ./a.out /dev/${md}.eli 1 of course.
I tested your code on 14-current and while there it does not seem to cause crashes, it still detects errors as soon as I specify strides smaller than 16.
(In reply to Andrew "RhodiumToad" Gierth from comment #9) I think there's a bug in the handling of the end-of-cursor condition in crypto_cursor_segment(). Some cursor types handle it properly (e.g., mbufs), others do not (e.g., vm_pages, used by GELI). This might be unrelated to the crash, but I think it's a real bug. I'll try the repro shortly.
Created attachment 242627 [details] address the panic This patch fixes the crash that I can reproduce on stable/13 using the test program. It does not appear to fix the dumpfs failure. On main, I cannot trigger a panic, but the test program reports discrepancies (which it does not do with a patched stable/13). I cannot reproduce the dumpfs error on main. I'm quite sure that the patch is required on main as well, i.e., the panic is not specific to stable/13.
(In reply to Mark Johnston from comment #14) Tried the patch on stable/13; it does avoid the crash, but the discrepancies still happen; the logic around spanning a page boundary is still clearly wrong. (The discrepancy is consistently the 16 bytes immediately before the page boundary.)
Created attachment 242665 [details] address the corruption This fixes the verification failures as well as the originally reported problem with dumpfs.
(In reply to Mark Johnston from comment #16) Great! That patch looks good to me, considering that src is computed from *cc->cc_vmpage and cc->cc_offset and hence the manual advance needs to be done for dst. It looks like the code causing the panic was introduced in FreeBSD 13.0 (9c0e3d3a534c3e3e7f6bfce0a150ed2a0841685a and a017868e281874261a560ba1e3069b4e14b7483e) and the code causing the data corruption was introduced in 13.0 as well (e6f6d0c9bcbf7942c390f65062054ec4784ce5b8). So I guess both patches would need to be applied to main and stable/13? I tested both reproducers (dumpfs and Andrew's code) on 12.4 and indeed there were no crashes and no errors reported. One question remains for me: Is there a chance that a mounted file system inside a GELI-encrypted block device would be affected from this and hence be corrupted or is this code path only triggered by dumpfs?
(In reply to Michael Laß from comment #17) Right. The patches will be committed to main and MFCed to stable/13. As Andrew's test case demonstrates, there may be other programs which are affected. It depends on the alignment of the buffer used to read or write to a GELI block device. Data read or written through a filesystem is unlikely to be affected, as filesystems will generally align their buffers. Note also that the bug is not triggered if aesni(4) is loaded (default in main, enabled automatically by bsdinstall for ZFS-on-GELI).
(In reply to Mark Johnston from comment #18) I would think that the patches should also be treated as errata for supported 13.x releases?
(In reply to Mark Johnston from comment #16) I also tested this on stable/13; looks good to me.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9c0467929abaab97f45fc07507b6f30c80211239 commit 9c0467929abaab97f45fc07507b6f30c80211239 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:11:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-12 16:52:24 +0000 geli tests: Add a regression test for PR 271766 This test case catches both of the bugs reported there. PR: 271766 Reviewed by: imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40469 tests/sys/geom/class/eli/Makefile | 3 + tests/sys/geom/class/eli/misc_test.sh | 38 +++++++- tests/sys/geom/class/eli/unaligned_io.c (new) | 131 ++++++++++++++++++++++++++ 3 files changed, 170 insertions(+), 2 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9f7fdd8c1ab153104275e59b49b2d567cec95256 commit 9f7fdd8c1ab153104275e59b49b2d567cec95256 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:09:54 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-12 16:52:24 +0000 crypto: Advance the correct pointer in crypto_cursor_copydata() PR: 271766 Reported by: Michael Laß <bevan@bi-co.net> MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40468 sys/opencrypto/criov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=718d4a1d5643c2faf409001320c3fd64aae57638 commit 718d4a1d5643c2faf409001320c3fd64aae57638 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:09:34 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-12 16:52:24 +0000 opencrypto: Handle end-of-cursor conditions in crypto_cursor_segment() Some consumers, e.g., swcr_encdec(), may call crypto_cursor_segment() after having advanced the cursor to the end of the buffer. In this case I believe the right behaviour is to return NULL and a length of 0. When this occurs with a CRYPTO_BUF_VMPAGE buffer, the cc_vmpage pointer will point past the end of the page pointer array, so crypto_cursor_segment() ends up dereferencing a random pointer before the function returns a length of 0. The uio-backed cursor has a similar problem. Address this by keeping track of the residual buffer length and returning immediately once the length is zero. PR: 271766 Reported by: Andrew "RhodiumToad" Gierth <andrew@tao11.riddles.org.uk> Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40428 sys/opencrypto/criov.c | 34 +++++++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 7 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=0d2318bbb4ccd89f1e698b93dc06448b1fdffb74 commit 0d2318bbb4ccd89f1e698b93dc06448b1fdffb74 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:09:54 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-19 12:57:08 +0000 crypto: Advance the correct pointer in crypto_cursor_copydata() PR: 271766 Reported by: Michael Laß <bevan@bi-co.net> MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40468 (cherry picked from commit 9f7fdd8c1ab153104275e59b49b2d567cec95256) sys/opencrypto/criov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=c4aae5668c69df4205ab3df382056a71a0e23bb7 commit c4aae5668c69df4205ab3df382056a71a0e23bb7 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:11:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-19 12:57:17 +0000 geli tests: Add a regression test for PR 271766 This test case catches both of the bugs reported there. PR: 271766 Reviewed by: imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40469 (cherry picked from commit 9c0467929abaab97f45fc07507b6f30c80211239) tests/sys/geom/class/eli/Makefile | 3 + tests/sys/geom/class/eli/misc_test.sh | 38 +++++++- tests/sys/geom/class/eli/unaligned_io.c (new) | 131 ++++++++++++++++++++++++++ 3 files changed, 170 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=299a7961f47d84f4bcb19ca6756ae61cc2d5d756 commit 299a7961f47d84f4bcb19ca6756ae61cc2d5d756 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-06-12 16:09:34 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-06-19 12:56:56 +0000 opencrypto: Handle end-of-cursor conditions in crypto_cursor_segment() Some consumers, e.g., swcr_encdec(), may call crypto_cursor_segment() after having advanced the cursor to the end of the buffer. In this case I believe the right behaviour is to return NULL and a length of 0. When this occurs with a CRYPTO_BUF_VMPAGE buffer, the cc_vmpage pointer will point past the end of the page pointer array, so crypto_cursor_segment() ends up dereferencing a random pointer before the function returns a length of 0. The uio-backed cursor has a similar problem. Address this by keeping track of the residual buffer length and returning immediately once the length is zero. PR: 271766 Reported by: Andrew "RhodiumToad" Gierth <andrew@tao11.riddles.org.uk> Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D40428 (cherry picked from commit 718d4a1d5643c2faf409001320c3fd64aae57638) sys/opencrypto/criov.c | 34 +++++++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 7 deletions(-)