|Summary:||/sbin/dump crashes on larger filesystem|
|Product:||Base System||Reporter:||Stefan Thurner <thurners>|
|Component:||bin||Assignee:||freebsd-bugs (Nobody) <bugs>|
|Severity:||Affects Some People||CC:||Antonio.Trindade, ant_mail, bitbucket63-it, bob, emaste, imp, john, kailockwood, locnar, mckusick|
Description Stefan Thurner 2020-02-27 13:44:33 UTC
When running dump to create a backup it crashes. This happens only on a 4Tb filesystem (/dev/label/home, see below). dump -0 -aL -f test.0 /dev/label/home DUMP: Date of this level 0 dump: Thu Feb 27 11:33:46 2020 DUMP: Date of last level 0 dump: the epoch DUMP: Dumping snapshot of /dev/label/home (/usr/home) to test.0 DUMP: mapping (Pass I) [regular files] DUMP: mapping (Pass II) [directories] DUMP: estimated 472120814 tape blocks. DUMP: dumping (Pass III) [directories] DUMP: dumping (Pass IV) [regular files] DUMP: 5.26% done, finished in 1:30 at Thu Feb 27 13:08:56 2020 DUMP: 10.48% done, finished in 1:25 at Thu Feb 27 13:09:16 2020 DUMP: 15.44% done, finished in 1:22 at Thu Feb 27 13:11:00 2020 DUMP: 20.04% done, finished in 1:19 at Thu Feb 27 13:13:36 2020 DUMP: 26.92% done, finished in 1:07 at Thu Feb 27 13:06:43 2020 DUMP: 33.80% done, finished in 0:58 at Thu Feb 27 13:02:36 2020 DUMP: 40.61% done, finished in 0:51 at Thu Feb 27 13:00:00 2020 DUMP: 47.29% done, finished in 0:44 at Thu Feb 27 12:58:25 2020 DUMP: 54.04% done, finished in 0:38 at Thu Feb 27 12:57:06 2020 Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759. DUMP: Child 60825 returns LOB status 206 df -h Filesystem Size Used Avail Capacity Mounted on /dev/label/root 1.9G 391M 1.4G 21% / devfs 1.0K 1.0K 0B 100% /dev /dev/label/var 3.9G 1.2G 2.4G 33% /var /dev/label/tmp 3.9G 116K 3.6G 0% /tmp /dev/label/usr 40G 13G 24G 35% /usr /dev/label/home 3.5T 451G 2.8T 14% /usr/home /dev/label/bkup 3.5T 920G 2.3T 28% /media/bkup Regards, Stefan
Comment 1 Stefan Thurner 2020-03-02 06:32:20 UTC
(In reply to Stefan Thurner from comment #0) In reference to bug #228807 I've compiled dump from base r331095. That's the version prior to changes of tape.c and traverse.c. Now dump works like intended. Regards, Stefan
Comment 2 Hampton Finger 2020-04-27 21:56:28 UTC
I hit this today on a ProxMox vm running FreeBSD 12.1-RELEASE-p4 system on a 351G file system in its own partition. The hypervisor is an AMD FX chip. I got the exact error reported here about 35% into the file system. It worked on 20 April 2020 and EVERY Sunday before that, but stopped working as of 26 April 2020. Exact Command: /sbin/dump -0u -L -b 64 -a -f /archive/agamemnon/usr.dump /usr /archive is nfs mounted root@agamemnon:/sbin # file dump dump: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 12.1, FreeBSD-style, stripped as it is an upgrade from source, I figured the md5sum is useless I have rebooted, and updated the system from source. I will be returning to the previous dump if my current test fails. I loaded telegraf on the system this week as the only delta to my configuration. I have disabled telegraf and am running a test dump now.
Comment 3 Hampton Finger 2020-04-29 01:26:26 UTC
Created attachment 213898 [details] Core Dump of /sbin/dump from FreeBSD 12.1-RELEASE-p4 r360175 The file system is UFS, it is /usr, it fails and dumps at 40% everytime. Just happened out of the blue. fsck says the file system is fine.
Comment 4 John Marshall 2020-06-09 01:36:17 UTC
"Me too" FreeBSD rwsrv08.gfn.riverwillow.net.au 12.1-RELEASE-p5 FreeBSD 12.1-RELEASE-p5 #0 r361272: Thu May 21 06:07:33 AEST 2020 firstname.lastname@example.org:/build/obj/john/kits/src/amd64.amd64/sys/RWSRV08 amd64 + dump -0aLu -C 256 -f /ubackup/nfsdata.dmp /dev/gpt/NFSDATA DUMP: Date of this level 0 dump: Tue Jun 9 00:28:51 2020 DUMP: Date of last level 0 dump: the epoch DUMP: Dumping snapshot of /dev/gpt/NFSDATA (/nfsdata) to /ubackup/nfsdata.dmp DUMP: mapping (Pass I) [regular files] DUMP: Cache 256 MB, blocksize = 65536 DUMP: mapping (Pass II) [directories] DUMP: estimated 27202416 tape blocks. DUMP: dumping (Pass III) [directories] DUMP: dumping (Pass IV) [regular files] DUMP: 36.85% done, finished in 0:08 at Tue Jun 9 00:42:26 2020 DUMP: 74.56% done, finished in 0:03 at Tue Jun 9 00:42:16 2020 Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, fil e /kits/src/sbin/dump/traverse.c, line 759. DUMP: Child 11854 returns LOB status 206 + dump -0aLu -C 256 -f /ubackup/rw2.dmp /dev/gpt/RW2 DUMP: Date of this level 0 dump: Tue Jun 9 01:39:07 2020 DUMP: Date of last level 0 dump: the epoch DUMP: Dumping snapshot of /dev/gpt/RW2 (/rw2) to /ubackup/rw2.dmp DUMP: mapping (Pass I) [regular files] DUMP: Cache 256 MB, blocksize = 65536 DUMP: mapping (Pass II) [directories] DUMP: estimated 18888847 tape blocks. DUMP: dumping (Pass III) [directories] DUMP: dumping (Pass IV) [regular files] Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, fil e /kits/src/sbin/dump/traverse.c, line 759. DUMP: Child 14199 returns LOB status 206 rwsrv08> df -ht ufs /nfsdata /rw2 /ubackup Filesystem Size Used Avail Capacity Mounted on /dev/gpt/NFSDATA 62G 26G 31G 46% /nfsdata /dev/gpt/RW2 22G 18G 1.7G 91% /rw2 /dev/da0p8 310G 197G 88G 69% /ubackup
Comment 5 kailockwood 2020-07-19 08:42:18 UTC
I would also like to report having this issue on a 12.1 system. I am attempting to the dump a large singular root partition to an external USB drive dump -C16 -b64 -0uanL -h0 -f /backup/root.dump / and at some point the dump fails with Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759. DUMP: Child 45872 returns LOB status 206
Comment 6 Bob Eager 2020-09-27 12:30:55 UTC
I am having this issue on 12.1-RELEASE-p10. It previously happened on 12.1-RELEASE-p3. The problem only appeared about three weeks ago, and there had been no updates to the system at that time. It was on 12.1-RELEASE-P3, and the update to 12.1-RELEASE-p10 did not fix it. The file system was 320GB in size. It didn't need to be that big, so it was reduced to 50GB with the same contents copied over via rsync. The problem is still there. The message is subtly different, although that may not be relevant: 04:00:49 DUMP: Date of this level 9 dump: Fri Sep 25 04:01:02 2020 04:00:49 DUMP: Date of last level 0 dump: Sat Sep 12 12:19:28 2020 04:00:49 DUMP: Dumping snapshot of /dev/mirror/dxhome (/data/export/home) to standard output 04:00:49 DUMP: mapping (Pass I) [regular files] 04:00:49 DUMP: mapping (Pass II) [directories] 04:00:49 DUMP: estimated 330684 tape blocks. 04:00:49 DUMP: dumping (Pass III) [directories] 04:00:49 DUMP: dumping (Pass IV) [regular files] 04:00:49 Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759. 04:00:49 DUMP: Child 54548 returns LOB status 206 This is a showstopper.
Comment 7 bitbucket63-it 2021-02-06 23:58:19 UTC
This happened to me as well after I added a bunch of large files. I went searching for answers and eventually noticed that the assertion in question in traverse.c was different than the other assertions implemented as part of bug #228807, comment #5 implemented in base r334969. Specifically it tested for "<" TP_NINDIR instead of "<=" TP_NINDIR where other asserts implemented a <= test. On my file system "spcl.c_count + blks" reached exactly 512 at the point of failure, which is the value of TP_NINDIR. Perhaps the assert should have tested for <= rather than <. I changed the line in traverse.c and now dump seems to be working again on the file system with lots of indirect inodes. I've uploaded a diff. FYI, I'm the equivalent of a passenger trying to land the plane after the flight crew passes out, so YMMV.
Comment 8 bitbucket63-it 2021-02-07 00:00:11 UTC
Created attachment 222230 [details] Patch to traverse.c This fixes one thing but IDK if it breaks others. Review is needed.
Comment 9 António Trindade 2021-02-08 14:53:50 UTC
*** Bug 253182 has been marked as a duplicate of this bug. ***
Comment 10 Hampton Finger 2021-02-08 17:10:09 UTC
I have made the change, compiled a new /sbin/dump, and dump now works in my use case. I will continue testing, but it looks like this is fixed with the diff supplied.
Comment 11 ant2 2021-04-30 15:32:49 UTC
I would notice that this bug must be considered as critical. I've upgraded servers from FreeBSD 10 to 12-stable in January 2021. A few month later i've discovered that all my offline backups are corrupted because of this bug. All backup scripts runs and all backup files are in place. But large backup files has less size than they have to because dump exits somewhere in the middle of the backup process. The patch is a cure but it still not in the source. Than more people will try to use they corrupted backup for restore than more attention wil be paid to this bug.
Comment 12 commit-hook 2021-05-17 23:31:57 UTC
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=efe145a7453e4208f032816ce3f80e9fb6b0e4ee commit efe145a7453e4208f032816ce3f80e9fb6b0e4ee Author: Kirk McKusick <mckusick@FreeBSD.org> AuthorDate: 2021-05-17 23:33:59 +0000 Commit: Kirk McKusick <mckusick@FreeBSD.org> CommitDate: 2021-05-17 23:34:53 +0000 Correct assert added to dump program. The dump program was exiting with the message: Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759. The problem arose when dumping external attributes. This assertion was added in this commit with no review by someone with expertise in the dump program: commit 2d518c6518cdb256ff6f2c463e6b115d89c104c3 Author: Warner Losh <imp@FreeBSD.org> AuthorDate: Mon Jun 11 19:32:36 2018 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: Mon Jun 11 19:32:36 2018 +0000 Add asserts to prevent overflows of c_addr. It is clearly wrong as the statement immediately above it in the code which is deciding if the data will fit is: if (spcl.c_count + blks > TP_NINDIR) return (0); As is pointed out in the bug report, the assert should be: (spcl.c_count + blks <= TP_NINDIR) This commit corrects the assert. I am sorry that it took so long to be brought to my attention and get fixed. Reported by: Hampton Finger PR: 244470 MFC after: 3 days Sponsored by: Netflix sbin/dump/traverse.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 13 Kirk McKusick 2021-05-22 20:48:38 UTC
This fix has been MFC'ed to 13-stable and 12-stable
Comment 15 Kirk McKusick 2021-06-02 19:49:20 UTC
(In reply to ant2 from comment #14) Sorry it took so long to come to my attention. Thanks for your patience.