Bug 244470

Summary: /sbin/dump crashes on larger filesystem
Product: Base System Reporter: Stefan Thurner <thurners>
Component: binAssignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Some People CC: Antonio.Trindade, ant_mail, bitbucket63-it, bob, emaste, imp, john, kailockwood, locnar, mckusick
Priority: ---    
Version: 12.1-RELEASE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=228807
Attachments:
Description Flags
Core Dump of /sbin/dump from FreeBSD 12.1-RELEASE-p4 r360175
none
Patch to traverse.c none

Description Stefan Thurner 2020-02-27 13:44:33 UTC
When running dump to create a backup it crashes. This happens only on a 4Tb filesystem (/dev/label/home, see below).

dump -0 -aL -f test.0 /dev/label/home
  DUMP: Date of this level 0 dump: Thu Feb 27 11:33:46 2020
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/label/home (/usr/home) to test.0
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 472120814 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 5.26% done, finished in 1:30 at Thu Feb 27 13:08:56 2020
  DUMP: 10.48% done, finished in 1:25 at Thu Feb 27 13:09:16 2020
  DUMP: 15.44% done, finished in 1:22 at Thu Feb 27 13:11:00 2020
  DUMP: 20.04% done, finished in 1:19 at Thu Feb 27 13:13:36 2020
  DUMP: 26.92% done, finished in 1:07 at Thu Feb 27 13:06:43 2020
  DUMP: 33.80% done, finished in 0:58 at Thu Feb 27 13:02:36 2020
  DUMP: 40.61% done, finished in 0:51 at Thu Feb 27 13:00:00 2020
  DUMP: 47.29% done, finished in 0:44 at Thu Feb 27 12:58:25 2020
  DUMP: 54.04% done, finished in 0:38 at Thu Feb 27 12:57:06 2020
Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759.
  DUMP: Child 60825 returns LOB status 206

df -h
Filesystem         Size    Used   Avail Capacity  Mounted on
/dev/label/root    1.9G    391M    1.4G    21%    /
devfs              1.0K    1.0K      0B   100%    /dev
/dev/label/var     3.9G    1.2G    2.4G    33%    /var
/dev/label/tmp     3.9G    116K    3.6G     0%    /tmp
/dev/label/usr      40G     13G     24G    35%    /usr
/dev/label/home    3.5T    451G    2.8T    14%    /usr/home
/dev/label/bkup    3.5T    920G    2.3T    28%    /media/bkup

Regards,
Stefan
Comment 1 Stefan Thurner 2020-03-02 06:32:20 UTC
(In reply to Stefan Thurner from comment #0)

In reference to bug #228807 I've compiled dump from base r331095. That's the version prior to changes of tape.c and traverse.c.

Now dump works like intended.

Regards,
Stefan
Comment 2 Hampton Finger 2020-04-27 21:56:28 UTC
I hit this today on a ProxMox vm running FreeBSD 12.1-RELEASE-p4 system on a 351G file system in its own partition.  The hypervisor is an AMD FX chip.  I got the exact error reported here about 35% into the file system.  It worked on 20 April 2020 and EVERY Sunday before that, but stopped working as of 26 April 2020.

Exact Command: /sbin/dump -0u -L -b 64 -a -f /archive/agamemnon/usr.dump /usr

/archive is nfs mounted

root@agamemnon:/sbin # file dump
dump: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 12.1, FreeBSD-style, stripped

as it is an upgrade from source, I figured the md5sum is useless

I have rebooted, and updated the system from source.

I will be returning to the previous dump if my current test fails.  I loaded telegraf on the system this week as the only delta to my configuration.  I have disabled telegraf and am running a test dump now.
Comment 3 Hampton Finger 2020-04-29 01:26:26 UTC
Created attachment 213898 [details]
Core Dump of /sbin/dump from FreeBSD 12.1-RELEASE-p4 r360175

The file system is UFS, it is /usr, it fails and dumps at 40% everytime.  Just happened out of the blue.  fsck says the file system is fine.
Comment 4 John Marshall 2020-06-09 01:36:17 UTC
"Me too"

FreeBSD rwsrv08.gfn.riverwillow.net.au 12.1-RELEASE-p5 FreeBSD 12.1-RELEASE-p5 #0 r361272: Thu May 21 06:07:33 AEST 2020     john@rwsrv08.gfn.riverwillow.net.au:/build/obj/john/kits/src/amd64.amd64/sys/RWSRV08  amd64

+ dump -0aLu -C 256 -f /ubackup/nfsdata.dmp /dev/gpt/NFSDATA
  DUMP: Date of this level 0 dump: Tue Jun  9 00:28:51 2020
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/gpt/NFSDATA (/nfsdata) to /ubackup/nfsdata.dmp
  DUMP: mapping (Pass I) [regular files]
  DUMP: Cache 256 MB, blocksize = 65536
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 27202416 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 36.85% done, finished in 0:08 at Tue Jun  9 00:42:26 2020
  DUMP: 74.56% done, finished in 0:03 at Tue Jun  9 00:42:16 2020
Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, fil
e /kits/src/sbin/dump/traverse.c, line 759.
  DUMP: Child 11854 returns LOB status 206

+ dump -0aLu -C 256 -f /ubackup/rw2.dmp /dev/gpt/RW2
  DUMP: Date of this level 0 dump: Tue Jun  9 01:39:07 2020
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/gpt/RW2 (/rw2) to /ubackup/rw2.dmp
  DUMP: mapping (Pass I) [regular files]
  DUMP: Cache 256 MB, blocksize = 65536
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 18888847 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, fil
e /kits/src/sbin/dump/traverse.c, line 759.
  DUMP: Child 14199 returns LOB status 206

rwsrv08> df -ht ufs /nfsdata /rw2 /ubackup
Filesystem          Size    Used   Avail Capacity  Mounted on
/dev/gpt/NFSDATA     62G     26G     31G    46%    /nfsdata
/dev/gpt/RW2         22G     18G    1.7G    91%    /rw2
/dev/da0p8          310G    197G     88G    69%    /ubackup
Comment 5 kailockwood 2020-07-19 08:42:18 UTC
I would also like to report having this issue on a 12.1 system. I am attempting to the dump a large singular root partition to an external USB drive

dump -C16 -b64 -0uanL -h0 -f /backup/root.dump /

and at some point the dump fails with 

Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759.
  DUMP: Child 45872 returns LOB status 206
Comment 6 Bob Eager 2020-09-27 12:30:55 UTC
I am having this issue on 12.1-RELEASE-p10. It previously happened on 12.1-RELEASE-p3.

The problem only appeared about three weeks ago, and there had been no updates to the system at that time. It was on 12.1-RELEASE-P3, and the update to 12.1-RELEASE-p10 did not fix it.

The file system was 320GB in size. It didn't need to be that big, so it was reduced to 50GB with the same contents copied over via rsync. The problem is still there.

The message is subtly different, although that may not be relevant:

04:00:49   DUMP: Date of this level 9 dump: Fri Sep 25 04:01:02 2020                                                                                                                                                                                                           
04:00:49   DUMP: Date of last level 0 dump: Sat Sep 12 12:19:28 2020                                                                                                                                                                                                           
04:00:49   DUMP: Dumping snapshot of /dev/mirror/dxhome (/data/export/home) to standard output                                                                                                                                                                                 
04:00:49   DUMP: mapping (Pass I) [regular files]                                                                                                                                                                                                                              
04:00:49   DUMP: mapping (Pass II) [directories]                                                                                                                                                                                                                               
04:00:49   DUMP: estimated 330684 tape blocks.                                                                                                                                                                                                                                 
04:00:49   DUMP: dumping (Pass III) [directories]                                                                                                                                                                                                                              
04:00:49   DUMP: dumping (Pass IV) [regular files]                                                                                                                                                                                                                             
04:00:49 Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759.                                                                                                                                            
04:00:49   DUMP: Child 54548 returns LOB status 206      

This is a showstopper.
Comment 7 bitbucket63-it 2021-02-06 23:58:19 UTC
This happened to me as well after I added a bunch of large files.

I went searching for answers and eventually noticed that the assertion in question in traverse.c was different than the other assertions implemented as part of bug #228807, comment #5 implemented in base r334969. Specifically it tested for "<" TP_NINDIR instead of "<=" TP_NINDIR where other asserts implemented a <= test.

On my file system "spcl.c_count + blks" reached exactly 512 at the point of failure, which is the value of TP_NINDIR. Perhaps the assert should have tested for <= rather than <. 

I changed the line in traverse.c and now dump seems to be working again on the file system with lots of indirect inodes. I've uploaded a diff.

FYI, I'm the equivalent of a passenger trying to land the plane after the flight crew passes out, so YMMV.
Comment 8 bitbucket63-it 2021-02-07 00:00:11 UTC
Created attachment 222230 [details]
Patch to traverse.c

This fixes one thing but IDK if it breaks others. Review is needed.
Comment 9 António Trindade 2021-02-08 14:53:50 UTC
*** Bug 253182 has been marked as a duplicate of this bug. ***
Comment 10 Hampton Finger 2021-02-08 17:10:09 UTC
I have made the change, compiled a new /sbin/dump, and dump now works in my use case.

I will continue testing, but it looks like this is fixed with the diff supplied.
Comment 11 ant2 2021-04-30 15:32:49 UTC
I would notice that this bug must be considered as critical.

I've upgraded servers from FreeBSD 10 to 12-stable in January 2021. A few month later i've discovered that all my offline backups are corrupted because of this bug. 

All backup scripts runs and all backup files are in place. But large backup files has less size than they have to because dump exits somewhere in the middle of the backup process.

The patch is a cure but it still not in the source.

Than more people will try to use they corrupted backup for restore than more attention wil be paid to this bug.
Comment 12 commit-hook freebsd_committer 2021-05-17 23:31:57 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=efe145a7453e4208f032816ce3f80e9fb6b0e4ee

commit efe145a7453e4208f032816ce3f80e9fb6b0e4ee
Author:     Kirk McKusick <mckusick@FreeBSD.org>
AuthorDate: 2021-05-17 23:33:59 +0000
Commit:     Kirk McKusick <mckusick@FreeBSD.org>
CommitDate: 2021-05-17 23:34:53 +0000

    Correct assert added to dump program.

    The dump program was exiting with the message:

    Assertion failed: (spcl.c_count + blks < TP_NINDIR), function appendextdata, file /usr/src/sbin/dump/traverse.c, line 759.

    The problem arose when dumping external attributes.

    This assertion was added in this commit with no review by someone
    with expertise in the dump program:

        commit 2d518c6518cdb256ff6f2c463e6b115d89c104c3
        Author:     Warner Losh <imp@FreeBSD.org>
        AuthorDate: Mon Jun 11 19:32:36 2018 +0000
        Commit:     Warner Losh <imp@FreeBSD.org>
        CommitDate: Mon Jun 11 19:32:36 2018 +0000

            Add asserts to prevent overflows of c_addr.

    It is clearly wrong as the statement immediately above it in the
    code which is deciding if the data will fit is:

            if (spcl.c_count + blks > TP_NINDIR)
                    return (0);

    As is pointed out in the bug report, the assert should be:

            (spcl.c_count + blks <= TP_NINDIR)

    This commit corrects the assert. I am sorry that it took so long to
    be brought to my attention and get fixed.

    Reported by:  Hampton Finger
    PR:           244470
    MFC after:    3 days
    Sponsored by: Netflix

 sbin/dump/traverse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 13 Kirk McKusick freebsd_committer 2021-05-22 20:48:38 UTC
This fix has been MFC'ed to 13-stable and 12-stable
Comment 14 ant2 2021-06-02 16:32:16 UTC
(In reply to Kirk McKusick from comment #13)

Thank you!
Comment 15 Kirk McKusick freebsd_committer 2021-06-02 19:49:20 UTC
(In reply to ant2 from comment #14)
Sorry it took so long to come to my attention.
Thanks for your patience.
Comment 16 John Marshall 2021-06-03 02:16:22 UTC
(In reply to Kirk McKusick from comment #15)
…and thanks from the rest of us too! Upgraded a server to Stable-13 yesterday and now have full dumps of ALL my volumes at last.