Bug 17499

Summary: Can't revive VINUM RAID5
Product: Base System
Reporter: cjohnson <cjohnson>
Component: kern
Assignee: Greg Lehey <grog>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.0-RELEASE   
Hardware: Any   
OS: Any   

Description cjohnson 2000-03-20 06:40:01 UTC
After replacing a bad drive in my RAID5, I've been unable to get the subdisk
to revive.

After reinstalling the disk drive I used "fdisk -I da1" to put a slice
map on the drive, and when "disklabel -w -r da1s1 auto" failed I used sysinstall
to put a disklabel on the drive.  Then I used "disklabel -e da1s1" to add an
"e" partition of type vinum (da1s1e).

With that done, I unmounted all my vinum drives and did "vinum stop"
followed by "vinum start".  Once vinum discovered the new drive, I used
"vinum create" to set "drive name drv3 device /dev/da1s1e".  At this point
drv3 came up and I did "start myraid.p0.s2" to start the revive.

After about 30 minutes vinum reported:
	 can't revive myraid.p0.s2: Invalid argument
When I tried to restart the revive with another "start myraid.p0.s2", the
start locked up trying to get a "vlock" or "vrlock" (according to ctrl-t).
I was unable to interrupt vinum with ctrl-C, to suspend it with ctrl-Z, or
to kill it with a SIGKILL.  After a shutdown -r plus pressing the reset
button, I did a low-level format on the drive and reinstalled the slice and
partition maps.  I started vinum again and had the same failure mode.

This time a "vinum stop" and "vinum start" got things running after the
failure.  The drive was then tested with:
	dd if=/dev/zero of=/dev/da1 bs=1024k

This generated NO errors.

I reinstalled the slice/partitions and restarted vinum, then started the
revive.  Same failure at roughly the same point.  I tried again with a
different block size and had the same failure at roughly the same point.

After reading the source for a while I checked for debug support, and when I
found it was turned on, I enabled all the debug flags except "big drive"
and "drop into the debugger": "debug 635(?)", then "start -w myraid.p0.s2".
After verifying that it was logging I aborted the revive, set debug 0,
and started reviving again.  When we were close I stopped the revive,
set the debug flags, and started the revive once more.

This looks correct: dev 91.1 is /dev/vinum/myraid, the offset looks good,
and the length is correct also.  The 13.131104 looks "ok" but I don't know
if it is "right".  The offset looks good, the sd numbers are good.  We read
three "good" drives and write to the "reviving" drive.  Everything cool here.

Mar 19 19:18:21 galahad /kernel: Read dev 91.1, offset 0x96c780, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:21 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324280,
	devoffset 0x324389, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324280,
	devoffset 0x324389, length 65536
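
[Not from the original report: the pattern above -- three reads from the
surviving subdisks followed by one write to the reviving subdisk -- is
ordinary RAID-5 reconstruction, where the missing block is rebuilt as the
XOR of the corresponding blocks on the survivors.  A minimal sketch of that
principle, not Vinum's actual code:]

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>

	/*
	 * Illustrative only: rebuild one missing RAID-5 block as the XOR of
	 * the corresponding blocks on the surviving subdisks (data + parity).
	 */
	static void
	rebuild_block(uint8_t *revive, uint8_t *const surv[], int nsurv,
	    size_t len)
	{
		memset(revive, 0, len);
		for (int sd = 0; sd < nsurv; sd++)
			for (size_t i = 0; i < len; i++)
				revive[i] ^= surv[sd][i];
	}

	int
	main(void)
	{
		uint8_t d1[4] = {1, 2, 3, 4}, d2[4] = {5, 6, 7, 8};
		uint8_t lost[4] = {9, 10, 11, 12};	/* block on the dead drive */
		uint8_t parity[4], rebuilt[4];
		uint8_t *surv[3] = {d1, d2, parity};
		int i;

		/* Parity as it would have been written on the intact plex. */
		for (i = 0; i < 4; i++)
			parity[i] = d1[i] ^ d2[i] ^ lost[i];

		rebuild_block(rebuilt, surv, 3, sizeof(rebuilt));
		printf("rebuilt matches lost block: %s\n",
		    memcmp(rebuilt, lost, sizeof(lost)) == 0 ? "yes" : "no");
		return (0);
	}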


The next block looks good too.  The offset goes up by 64K, which is right.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324300,
	devoffset 0x324409, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324300,
	devoffset 0x324409, length 65536

And again, everything looks good.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ca80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324380,
	devoffset 0x324489, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324380,
	devoffset 0x324489, length 65536

And again

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce00, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324400,
	devoffset 0x324509, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324400,
	devoffset 0x324509, length 65536

One more good block.

Mar 19 19:18:22 galahad /kernel: Request: 0xc1025c80
Mar 19 19:18:22 galahad /kernel: Read dev 91.1, offset 0x96ce80, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131096, sd 1, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x324480,
	devoffset 0x324589, length 65536
Mar 19 19:18:22 galahad /kernel: Write dev 13.131084, sd 3, offset 0x324480,
	devoffset 0x324589, length 65536

Where is the "Request: .*" log message?  Where is the 91.1 read request?
0x96cf00 looks like the next block to fix in the PLEX, but why are we reading
it for the SUBDISK?  And where did this LENGTH come from?!

Mar 19 19:18:22 galahad /kernel: Read dev 13.131104, sd 2, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131084, sd 3, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:18:22 galahad /kernel: Read dev 13.131124, sd 4, offset 0x96cf00,
	devoffset 0x96d009, length 2146172928
Mar 19 19:17:53 galahad vinum[963]: reviving myraid.p0.s2
Mar 19 19:18:15 galahad vinum[963]: can't revive myraid.p0.s2: Invalid
	argument
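
[Not from the original report, just an arithmetic observation: 2146172928
is 0x7FEC0000, only 0x140000 short of 2^31, which looks more like a signed
32-bit length computation gone wrong than a plausible transfer size.  A
trivial check of that arithmetic, nothing to do with Vinum's code:]

	#include <stdio.h>
	#include <inttypes.h>

	int
	main(void)
	{
		uint32_t len = 2146172928U;	/* the bogus length from the log */

		printf("len        = 0x%" PRIx32 "\n", len);
		printf("2^31 - len = 0x%" PRIx32 "\n",
		    (uint32_t)(UINT32_C(0x80000000) - len));
		return (0);
	}

(This prints len = 0x7fec0000 and 2^31 - len = 0x140000.)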

How-To-Repeat: 
	I don't know if this will repeat anywhere else, but I can offer
	ssh access to a vouched-for player who wants to test on galahad.
Comment 1 Sheldon Hearn 2000-03-20 10:15:31 UTC
Responsible Changed
From-To: freebsd-bugs->grog

Over to the vinum maintainer. 

Comment 2 cjohnson 2000-04-02 20:19:14 UTC
Because of delays I tried something a bit different.  I dumped the disk to
tape and then ran "init myraid.p0".  The initialize has run to completion
with no errors.  I'll know if the filesystem reloads in 10 or so hours....

    Chris
Comment 3 Greg Lehey 2001-05-03 09:48:06 UTC
State Changed
From-To: open->closed

Problem fixed some time in the past.