Bug 157244 - dump(1): dump/restore: unknown tape header type -230747966
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 8.0-RELEASE
Hardware: Any
OS: Any
Importance: Normal (Affects Only Me)
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-22 04:30 UTC by Gene Stark
Modified: 2017-12-31 22:32 UTC
CC List: 0 users

See Also:


Description Gene Stark 2011-05-22 04:30:10 UTC
I made an 18 GB dump of a (gvinum) filesystem using the command
"dump 0f - /dev/gvinum/A > A.dump".  There were no problems reported during
the dump, and the volume fsck'ed clean beforehand.  I newfs'ed the volume and
attempted to restore via "restore rf - < /A.dump", which failed with the
error message: unknown tape header type -230747966.  This was quite irritating:
I have grown to trust dump/restore over many years, and because of the dump's
size I had already destroyed the original volume without first trying to read
through the dump file with restore.

I spent substantial time analyzing the dump to try to determine the failure
mode.  It turns out that header blocks actually occur out of order in
the dump file, as shown by comparing the actual offset of each header block
in the dump file with the block's spcl.c_tapea field.
Once the problem started (during a large file near the beginning of the
dump), the difference between the actual offset (in units of TP_BSIZE)
and the claimed offset in the spcl.c_tapea field was either -10, 0, or 20.
That is, sometimes the header blocks came earlier than expected, sometimes
they came on time, and sometimes they came later; there were only those
few possibilities.
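
For reference, the check is simple enough to sketch.  The following is a
minimal stand-alone version of it (not my original analysis program): it
reads a dump stream from stdin in TP_BSIZE units and, for every block whose
magic number marks it as a header, prints the difference between its actual
block offset and its claimed spcl.c_tapea.  It assumes FreeBSD's
<protocols/dumprestore.h> and a dump written on the same machine (native
byte order); depending on whether c_tapea is 0- or 1-based in a given dump,
a constant offset may need to be subtracted from the reported differences.

/* hdrscan.c -- compare each header's claimed tape address with its
 * actual offset in the dump stream.
 * Build: cc -o hdrscan hdrscan.c
 * Run:   ./hdrscan < A.dump
 */
#include <protocols/dumprestore.h>
#include <stdio.h>

int
main(void)
{
    union u_spcl u;         /* one TP_BSIZE tape block */
    long blkno = 0;         /* actual offset, in TP_BSIZE units */

    while (fread(&u, TP_BSIZE, 1, stdin) == 1) {
        /* Heuristic: a data block can collide with the magic
         * number by pure chance. */
        if (u.s_spcl.c_magic == NFS_MAGIC ||
            u.s_spcl.c_magic == OFS_MAGIC) {
            long diff = blkno - (long)u.s_spcl.c_tapea;
            if (diff != 0)
                printf("blk %ld: c_tapea %ld (off by %ld)\n",
                    blkno, (long)u.s_spcl.c_tapea, diff);
        }
        blkno++;
    }
    return (0);
}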

I wrote a program to read the dump records and reorder them so that the
headers were emitted at their claimed offsets.  This was done by queueing
the headers and data blocks separately, emitting headers when they were due,
and emitting data blocks otherwise.  This program could then verify that
the correct number of data blocks were present to match the information in
the headers.  However, when I pipe the reordered block stream into restore,
there are still some issues.  For one thing, there is no way to verify the
order of the data blocks, and it seems that reordering might also have
occurred on those.  I have other copies of some of the large files that
were in the dump, and I will attempt to determine how the data blocks
have been reordered, but I haven't done that yet.
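
In outline, the reordering logic is roughly the following (a simplified
sketch of my program, with fixed-size queues, no overflow handling, native
byte order, and header detection by magic number alone; it treats c_tapea
as a 0-based block offset and assumes the headers themselves remain in
increasing c_tapea order relative to one another):

/* reorder.c -- buffer header and data blocks in separate FIFOs,
 * emit a header when the output position reaches its claimed
 * spcl.c_tapea, and emit data blocks otherwise.
 * Build: cc -o reorder reorder.c
 * Run:   ./reorder < A.dump > A.reordered
 */
#include <protocols/dumprestore.h>
#include <stdio.h>

#define QMAX 8192                          /* blocks buffered per queue */

static union u_spcl hq[QMAX], dq[QMAX];    /* header and data FIFOs */
static int hh, ht, dh, dt;                 /* head/tail indices */

static int
is_header(const union u_spcl *u)
{
    /* Same heuristic as before: trust the magic number. */
    return (u->s_spcl.c_magic == NFS_MAGIC ||
        u->s_spcl.c_magic == OFS_MAGIC);
}

int
main(void)
{
    union u_spcl u;
    long outblk = 0;                       /* blocks written so far */
    int eof = 0;

    for (;;) {
        /* Read ahead while either queue is empty. */
        while (!eof && (hh == ht || dh == dt)) {
            if (fread(&u, TP_BSIZE, 1, stdin) != 1) {
                eof = 1;
                break;
            }
            if (is_header(&u)) {
                hq[ht] = u;
                ht = (ht + 1) % QMAX;
            } else {
                dq[dt] = u;
                dt = (dt + 1) % QMAX;
            }
        }
        if (hh != ht && (long)hq[hh].s_spcl.c_tapea <= outblk) {
            fwrite(&hq[hh], TP_BSIZE, 1, stdout);  /* header is due */
            hh = (hh + 1) % QMAX;
        } else if (dh != dt) {
            fwrite(&dq[dh], TP_BSIZE, 1, stdout);  /* fill with data */
            dh = (dh + 1) % QMAX;
        } else if (hh != ht) {
            fwrite(&hq[hh], TP_BSIZE, 1, stdout);  /* drain headers */
            hh = (hh + 1) % QMAX;
        } else
            break;                         /* both queues empty: done */
        outblk++;
    }
    return (0);
}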

I was at a loss to explain how this kind of reordering could have occurred,
until I read some of the source to dump and saw that it is using multiple
processes to write the dump file.  I am running on a 2-core system (4 logical
CPUs with hyperthreading).  I strongly suspect a concurrency issue in the way
the dump tape is written; otherwise I don't see how the header blocks could
have been reordered in the way I observed.

How-To-Repeat: Although I have (unfortunately) already destroyed the original filesystem,
I was able to repeat the behavior on another filesystem using the following
command:

home# dump 0f - /mail | restore rfN -
  DUMP: Date of this level 0 dump: Sat May 21 22:57:40 2011
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/gvinum/mail_new (/mail) to standard output
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 11291623 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
unknown tape header type 1781888358
abort? [yn] y
dump core? [yn] n
  DUMP: Broken pipe
  DUMP: The ENTIRE dump is aborted.

This problem really needs to be looked into, because it is a disaster to
create an apparently successful dump with the idea of doing a simple
filesystem volume rebuild, and then find out that the restore fails.

Reordering the dump stream to put the header blocks back in their
proper positions helps quite a bit, but I have not been able to recover
my data so far, because the data blocks are apparently reordered as well.
If there is a systematic mechanism to the reordering, I might still be
able to recover; but if it is a concurrency/synchronization issue, it
may well be hopeless.
Comment 1 Gene Stark 2011-05-22 13:33:23 UTC
I wrote a program to compare the blocks in another copy of one of the
large files in the dump with the version extracted by restore after
applying my header reordering program.  The program reads each of the
two files in blocks of TP_BSIZE bytes and computes the SHA1 hash of
each block, storing the resulting <hash, offset> pairs in a hash map
per file.  It then unions the key sets of the two hash maps into a
single master list of block hashes, traverses that list to construct
a map <offset, <offset0, offset1>> giving the correspondence between
the blocks of the two files, and finally prints the contents of that
map in increasing order of offset, showing the differences between
the two files.  Here is the initial part of the result (offsets are
in TP_BSIZE blocks):

Lectures.zip.bad: 52469795 bytes
Lectures.zip.good: 52469795 bytes
  bad    good   diff
11612   11622   10
11613   11623   10
11614   11624   10
11615   11625   10
11616   11626   10
11617   11627   10
11618   11628   10
11619   11629   10
11620   11630   10
11621   11631   10
11622   11632   10
11623   11633   10
11624   11634   10
11625   11635   10
11626   11636   10
11627   11637   10
11628   11638   10
11629   11639   10
11630   11640   10
11631   11641   10
11632   11612   -20
11633   11613   -20
11634   11614   -20
11635   11615   -20
11636   11616   -20
11637   11617   -20
11638   11618   -20
11639   11619   -20
11640   11620   -20
11641   11621   -20
11642   11652   10
11643   11653   10
11644   11654   10
11645   11655   10
11646   11656   10
11647   11657   10
11648   11658   10
11649   11659   10
11650   11660   10
11651   11661   10
11652   11662   10
11653   11663   10
11654   11664   10
11655   11665   10
11656   11666   10
11657   11667   10
11658   11668   10
11659   11669   10
11660   11670   10
11661   11671   10
11662   11642   -20
11663   11643   -20
11664   11644   -20
11665   11645   -20
11666   11646   -20
11667   11647   -20
11668   11648   -20
11669   11649   -20
11670   11650   -20
11671   11651   -20
11672   11682   10
11673   11683   10

The pattern repeats this way for *almost* the entire file.
There are sets of 20 blocks that occur 10 blocks ahead of the
corresponding blocks in the other file, and then a set of 10
blocks that occur 20 blocks behind the corresponding blocks
in the other file.  There are occasional difference values of 9 and 19,
which I don't have a ready explanation for, except that my header
reordering relied on the magic number to identify the header blocks,
so it is possible that a few data blocks were misidentified as headers.
At the end of the files there are a few blocks that do not correspond;
these are probably due to alignment at the end, which caused some of
the last data blocks to be used as the first blocks of the next file
in the dump.
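
For completeness, a stripped-down sketch of the comparison program follows.
It is one-directional (it maps each block of the first file to its position
in the second) rather than the key-set union described above, and it assumes
unique block contents; duplicate blocks, e.g. all-zero ones, will match
arbitrarily.  It uses the SHA1 routines from FreeBSD's libmd, so build
with -lmd.

/* blkcmp.c -- hash each TP_BSIZE block of two files with SHA1 and
 * print, for each block of file1 found at a different offset in
 * file2, the two offsets and their difference.
 * Build: cc -o blkcmp blkcmp.c -lmd
 * Run:   ./blkcmp Lectures.zip.bad Lectures.zip.good
 */
#include <protocols/dumprestore.h>     /* for TP_BSIZE */
#include <sha.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DLEN 20                        /* SHA1 digest length, bytes */

struct ent {
    unsigned char d[DLEN];             /* SHA1 of one block */
    long off;                          /* block offset in the file */
};

/* Hash a whole file block by block; a trailing partial block is
 * ignored.  Returns a malloc'ed array and its length via *np. */
static struct ent *
hashfile(const char *path, long *np)
{
    FILE *fp;
    struct ent *v = NULL;
    unsigned char buf[TP_BSIZE];
    long n = 0, cap = 0;
    SHA1_CTX ctx;

    if ((fp = fopen(path, "r")) == NULL)
        err(1, "%s", path);
    while (fread(buf, TP_BSIZE, 1, fp) == 1) {
        if (n == cap &&
            (v = realloc(v, (cap = cap ? 2 * cap : 1024) *
            sizeof(*v))) == NULL)
            err(1, "realloc");
        SHA1_Init(&ctx);
        SHA1_Update(&ctx, buf, TP_BSIZE);
        SHA1_Final(v[n].d, &ctx);
        v[n].off = n;
        n++;
    }
    fclose(fp);
    *np = n;
    return (v);
}

static int
entcmp(const void *a, const void *b)
{
    return (memcmp(((const struct ent *)a)->d,
        ((const struct ent *)b)->d, DLEN));
}

int
main(int argc, char **argv)
{
    struct ent *a, *b, *m;
    long na, nb, i;

    if (argc != 3)
        errx(1, "usage: blkcmp file1 file2");
    a = hashfile(argv[1], &na);
    b = hashfile(argv[2], &nb);
    qsort(b, nb, sizeof(*b), entcmp);  /* enable binary search */
    for (i = 0; i < na; i++) {
        m = bsearch(&a[i], b, nb, sizeof(*b), entcmp);
        if (m == NULL)
            printf("%ld\t-\tno match\n", a[i].off);
        else if (m->off != a[i].off)
            printf("%ld\t%ld\t%ld\n", a[i].off, m->off,
                m->off - a[i].off);
    }
    return (0);
}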

To test my suspicion that it is a concurrency issue in dump,
I recompiled dump after setting #define SLAVES 1 in tape.c
(rather than the value 3 it had before).  I was then able to
complete two rounds of "dump 0f - /mail | restore rfN -"
without any errors, whereas if I use /sbin/dump it fails
very quickly, as shown in the original PR.
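
For anyone who wants to apply the same workaround, the change I made was
just this one definition in sbin/dump/tape.c (its exact location in the
file may differ between releases):

#define SLAVES 1    /* was 3: run a single slave process so output
                     * blocks cannot be interleaved out of order */

The cost is presumably the overlap between filesystem reads and output
writes that the extra slave processes provide, so dumps will likely be
slower, but that seems a small price until the underlying race is fixed.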

I am not familiar with the locking features, etc. being used in
dump, so I don't know if I will be able to go farther than this
with a reasonable expenditure of time.  However, I strongly
suggest that the "concurrency modifications" in dump be turned
off (perhaps by setting SLAVES to 1 as I did) until somebody
can get to the bottom of this.  If this is happening to me,
then I suspect there are *massive* numbers of bad dumps out there
that people think are actually good.  It will really be a rude
awakening when people try to read them back.  Since the data
blocks don't contain any tape address information, it is not
possible to recover them.
Comment 2 Eitan Adler 2017-12-31 08:00:54 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped