119868 – [geom_gpt] [patch] 7.0 kernel panic with corrupt GPT label

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	7.0-PRERELEASE
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	Marcel Moolenaar

URL:
Keywords:

Depends on:
Blocks:

Reported:	2008-01-21 19:20 UTC by Johan A. van Zanten
Modified:	2008-11-22 17:55 UTC
CC List:	0 users

See Also:

Description Johan A. van Zanten 2008-01-21 19:20:02 UTC


This is a brand new installation.

The panic also occurs with the GENERIC kernel.

Boot device is a SCSI disk on a separate controller.

Before the drive is set up for use as a ZFS device, the kernel identifies
it as:
kernel: twed0: <Unit 1, JBOD, Normal> on twe0
kernel: twed0: 152627MB (312581808 sectors)
kernel: GEOM_LABEL: Label for provider twed0p1 is msdosfs/EFI.

This is a Western Digital WD1600JS SATA drive, connected to a 3ware
8002-LP card (2-port SATA, PCI).

Last time the drive was used in a different computer and OS, it was in
good working order.

 After the device is set up for use with ZFS (via the "zpool create ..."
command), the kernel panics at the next boot when it begins to scan the
attached disks (just after the "Waiting 5 seconds for SCSI devices to
settle" message).

 What's interesting is that a different SATA drive on the same port of the
same card does not cause the panic. The "good" drive is a Western Digital
WD360 (SATA, 36 GB).

boot-time kernel output:

twe1: 152627MB (312581808 sectors)
GEOM: new disk twed0
GEOM: new disk twed1


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 01
fault virtual address   = 0x3f80
fault code              = supervisors read, page not present
instruction pointer	= 0x20:0xc06d0e2c
stack pointer		= 0x28:0xe2fb4b60
frame pointer		= 0x28:0xe2fb4c58
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2 (g_event)
trap number		= 12
panic: page fault
Uptime: 1s
Cannot dump. No dumpdevice defined.
Automatic reboot in 15 seconds - press a key on the console to abort

Fix: 

Disconnect the drive. (Not much of a workaround. :)

How-To-Repeat: 
Do "zpool create poolname $dev".

Reboot the machine.

Comment 1 Remko Lodder freebsd_committer 2008-01-21 19:30:15 UTC

Johan A. van Zanten wrote:
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 01
> fault virtual address   = 0x3f80
> fault code              = supervisors read, page not present
> instruction pointer	= 0x20:0xc06d0e2c
> stack pointer		= 0x28:0xe2fb4b60
> frame pointer		= 0x28:0xe2fb4c58
> code segment		= base rx0, limit 0xfffff, type 0x1b
> 			= DPL 0, pres 1, def32 1, gran 1
> processor eflags	= interrupt enabled, resume, IOPL = 0
> current process		= 2 (g_event)
> trap number		= 12
> panic: page fault
> Uptime: 1s
> Cannot dump. No dumpdevice defined.
> Automatic reboot in 15 seconds - press a key on the console to abort
> 

Hello,

Please set a dump device; see
http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html 
for more information. We need this to be able to see what is going on 
and what is going nuts. Without this we will not be able (imo) to 
resolve your problem.

Thanks for taking the time to report this though and using FreeBSD!

Cheers
remko

-- 
/"\   Best regards,                      | remko@FreeBSD.org
\ /   Remko Lodder                       | remko@EFnet
  X    http://www.evilcoder.org/          |
/ \   ASCII Ribbon Campaign              | Against HTML Mail and News

Comment 2 Johan A. van Zanten 2008-01-23 04:12:06 UTC

Remko Lodder <remko@FreeBSD.org> wrote:
> Johan A. van Zanten wrote:
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 01
> > fault virtual address   = 0x3f80
> > fault code              = supervisors read, page not present
> > instruction pointer	= 0x20:0xc06d0e2c
> > stack pointer		= 0x28:0xe2fb4b60
> > frame pointer		= 0x28:0xe2fb4c58
> > code segment		= base rx0, limit 0xfffff, type 0x1b
> > 			= DPL 0, pres 1, def32 1, gran 1
> > processor eflags	= interrupt enabled, resume, IOPL = 0
> > current process		= 2 (g_event)
> > trap number		= 12
> > panic: page fault
> > Uptime: 1s
> > Cannot dump. No dumpdevice defined.
> > Automatic reboot in 15 seconds - press a key on the console to abort
> > 
> 
> Hello,
> 
> Please set a dump device; see
> http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html 
> for more information. We need this to be able to see what is going on 
> and what is going nuts. Without this we will not be able (imo) to 
> resolve your problem.

Can you give an example of the syntax for specifying the dump device in
the kernel config?

 The crash seems to be happening before dumpon is run. According to the
web page you cite:

  Alternatively, the dump device can be hard-coded via the dump clause in
  the config(5) line of a kernel configuration file. This approach is
  deprecated and should be used only if a kernel is crashing before
  dumpon(8) can be executed.

 But i cannot find any example of the syntax for the "dump" clause in
 /usr/src/sys/conf or in config(5).

Thanks, johan

Comment 3 Harald Hanche-Olsen 2008-09-30 16:49:17 UTC

The original reporter seems to have given up on this. I have seen
something very similar, and thought I could provide some more
information.

I now have three disks all in an unusable state, causing FreeBSD to
panic upon seeing these disks. Common to all is that they contained
ZFS pools that were online when the computer crashed, possibly for
unrelated reasons. Upon reboot, the computer would panic when noticing
the disk; in fact, immediately after printing the standard message
giving the device name and disk type on the console.

ZFS may however be incidental to the problem: The panic happens even
if I don't have zfs.ko loaded when the problem disk is plugged in.
I wonder if it could be related to kern/127115 somehow?

I cannot get a dump unfortunately - the console says "Dumping xxx MB"
and hangs if I have activated kernel dumps (using dumpon) before
triggering the panic.

So I compiled a debug kernel and obtained a backtrace using ddb
instead. Here is output, copied by hand from a photo of the screen:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x3f80
fault code              = supervisor read data, page not present
[...]
current process         = 2 (g_event)
[thread pid 2 tid 100007 ]
Stopped at    bcmp+0x8:    repe cmpsq  (%rsi),%es:(%rdi)
db> trace
Tracing pid 2 tid 100007 td 0xffffff0001129000
bcmp() at bcmp+0x8
g_part_taste() at g_part_taste+0x252
g_new_provider_event() at g_new_provider_event+0x75
g_run_events() at g_run_events+0x1b8
g_event_procbody() at g_event_procbody+0x57
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffb3600d30, rbp = 0 ---


I am really not very familiar with ddb. Let me know if you wish me to
dig deeper, but then I need a pointer as to what to look for.

- Harald

Comment 4 Harald Hanche-Olsen 2008-09-30 17:18:00 UTC

For what it's worth, assuming it is the partition table that has
gotten screwed up somehow, here are the first 34 sectors of the
disk that caused the panic described in my previous mail:

  http://www.math.ntnu.no/~hanche/tmp/baddisk.bin

(Created by attaching the disk to a mac and running dd bs=512 count=34
on the device file. Not sure if binary attachments are OK here.)

I forgot to mention that this is on 7.0-STABLE/amd64 as of 19 August
(7.0-STABLE #3). But I also see the problem on 7.0-RELEASE/i386.

- Harald

Comment 5 Harald Hanche-Olsen 2008-09-30 19:41:23 UTC

I just had my biggest "duh" moment in a veeery long time.
The above two "contributions" to this PR can probably be ignored.

For the curious: I intended to do

#; gpt create -f da2
#; gpt add -t 6a898cc3-1dd2-11b2-99a6-080020736631 da2
#; zpool create poolname da2p1

but apparently, I created the pool on da2 instead, partially
overwriting the GPT. And I managed to do this (count 'em) no less than
THREE times!

Like I said, DUH, and my apologies for the noise.

Maybe we could turn the noise into a feature request: Perhaps zpool
should be smart enough to recognize that the user is about to shoot
his own foot and refuse to cooperate?

- Harald

Comment 6 Johan A. van Zanten 2008-10-01 15:04:50 UTC

Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
> The original reporter seems to have given up on this. I have seen
> something very similar, and thought I could provide some more
> information.

Thanks for helping.  The problem for me is that the panic occurred very
early in the boot process, before the dump device is normally configured,
and no one on the freebsd-help list, nor anyone reading these bug reports
seemed to know or care enough to help me get a dump device configured
earlier.  I spent some time going through the source, trying to figure out
a way to do this, but the time required for me to do this task exceeded
the amount of time i had to spend on it.

 -johan

Comment 7 Harald Hanche-Olsen 2008-10-01 16:18:31 UTC

+ Johan A. van Zanten <johan@giantfoo.org>:

> The problem for me is that the panic occurred very early in the boot
> process, before the dump device is normally configured, and no one
> on the freebsd-help list, nor anyone reading these bug reports
> seemed to know or care enough to help me get a dump device
> configured earlier.

Well, the handbook gives a method that it says is "deprecated"

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

(specifying a dump device in the kernel config), but these lines from
/usr/src/usr.sbin/config/config.y

System_spec:
        CONFIG System_id System_parameter_list
          = { errx(1, "%s:%d: root/dump/swap specifications obsolete",
              yyfile, yyline);}

make me think that the handbook itself is obsolete at this point, and
the "deprecated" method is no longer available.

If you still have the disk and wish to resurrect it, you can try my
method: I booted from an Ubuntu CD and erased the EFI partition table
using dd if=/dev/zero bs=512 count=1 seek=1 of=/dev/disk/by-id/...
(making VERY sure I did not clobber the wrong disk).

- Harald

Comment 8 Johan A. van Zanten 2008-10-01 16:33:37 UTC

Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
> + Johan A. van Zanten <johan@giantfoo.org>:
> 
> > The problem for me is that the panic occurred very early in the boot
> > process, before the dump device is normally configured, and no one
> > on the freebsd-help list, nor anyone reading these bug reports
> > seemed to know or care enough to help me get a dump device
> > configured earlier.
> 
> Well, the handbook gives a method that it says is "deprecated"
> 
> http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

Yes, i think i tried this and it did not work. :(

 -johan

Comment 9 Jaakko Heinonen 2008-10-02 08:39:52 UTC

Hi,

On 2008-09-30, Harald Hanche-Olsen wrote:
>  For the curious: I intended to do
>  
>  #; gpt create -f da2
>  #; gpt add -t 6a898cc3-1dd2-11b2-99a6-080020736631 da2
>  #; zpool create poolname da2p1
>  
>  but apparently, I created the pool on da2 instead, partially
>  overwriting the GPT.

This PR is a duplicate of kern/127115. The bug is not in zfs code but in
the gpart GPT code. It's possible that a corrupted GPT partition table
causes a panic in g_part_gpt_read().

These conditions must be true after reading the tables in
g_part_gpt_read() to cause the panic:

table->state[GPT_ELT_PRIHDR] == GPT_STATE_OK
pritbl == NULL
table->state[GPT_ELT_SECTBL] == GPT_STATE_OK

The panic happens at line 661 in g_part_gpt.c (r183533) when tbl is NULL.

Here is a proposed fix:

%%%
Index: sys/geom/part/g_part_gpt.c
===================================================================
--- sys/geom/part/g_part_gpt.c	(revision 183533)
+++ sys/geom/part/g_part_gpt.c	(working copy)
@@ -631,7 +631,7 @@ g_part_gpt_read(struct g_part_table *bas
 			table->state[GPT_ELT_PRIHDR] = GPT_STATE_INVALID;
 	}
 
-	if (table->state[GPT_ELT_PRIHDR] != GPT_STATE_OK) {
+	if (table->state[GPT_ELT_PRITBL] != GPT_STATE_OK) {
 		printf("GEOM: %s: the primary GPT table is corrupt or "
 		    "invalid.\n", pp->name);
 		printf("GEOM: %s: using the secondary instead -- recovery "
@@ -641,7 +641,7 @@ g_part_gpt_read(struct g_part_table *bas
 		if (pritbl != NULL)
 			g_free(pritbl);
 	} else {
-		if (table->state[GPT_ELT_SECHDR] != GPT_STATE_OK) {
+		if (table->state[GPT_ELT_SECTBL] != GPT_STATE_OK) {
 			printf("GEOM: %s: the secondary GPT table is corrupt "
 			    "or invalid.\n", pp->name);
 			printf("GEOM: %s: using the primary only -- recovery "
%%%

With the patch applied, this is what I get with the corrupted GPT table:

GEOM: ad0: the primary GPT table is corrupt or invalid.
GEOM: ad0: using the secondary instead -- recovery strongly advised.

-- 
Jaakko

Comment 10 Harald Hanche-Olsen 2008-10-02 10:26:39 UTC

+ Jaakko Heinonen <jh@saunalahti.fi>:

> This PR is a duplicate of kern/127115.

Like I suspected (see my earlier mail).

Unfortunately I cannot test your fix, since I have repaired my three
damaged disks.

- Harald

Comment 11 Kenneth Vestergaard Schmidt 2008-10-02 16:47:15 UTC

Hi.

I can confirm this fix works on -CURRENT as of yesterday - geom_gpt
recognizes the corrupted table, and skips it.

-- 
Kenneth Vestergaard Schmidt

Comment 12 Mark Linimon freebsd_committer 2008-10-19 14:15:19 UTC

State Changed
From-To: open->analyzed

Patch has been submitted and has been confirmed as fixing the problem.

Comment 13 Mark Linimon freebsd_committer 2008-10-19 14:15:19 UTC

Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Comment 14 Gavin Atkinson freebsd_committer 2008-11-06 11:39:24 UTC

Responsible Changed
From-To: freebsd-fs->freebsd-geom

Jaakko Heinonen points out that this is actually a bug with geom_gpt 
and not ZFS.  The PR contains a patch, confirmed to fix the issue.

Comment 15 Marcel Moolenaar freebsd_committer 2008-11-06 16:52:40 UTC

State Changed
From-To: analyzed->patched

Fix committed in -CURRENT. MFC to happen in a week. 
Thanks for the analysis and patch.

Comment 16 Marcel Moolenaar freebsd_committer 2008-11-06 16:52:40 UTC

Responsible Changed
From-To: freebsd-geom->marcel

Fix committed in -CURRENT. MFC to happen in a week. 
Thanks for the analysis and patch.

Comment 17 Marcel Moolenaar freebsd_committer 2008-11-22 17:55:00 UTC

State Changed
From-To: patched->closed

Fix committed to 7-STABLE.