Bug 117158

Summary: [zfs] [panic] zpool scrub causes panic if geli vdevs detach on last close
Product: Base System Reporter: Fabian Keil <fk>
Component: kern Assignee: freebsd-fs (Nobody) <fs>
Status: Closed Overcome By Events    
Severity: Affects Only Me CC: woodsb02
Priority: Normal    
Version: 7.0-CURRENT   
Hardware: Any   
OS: Any   

Description Fabian Keil 2007-10-13 20:40:00 UTC
With a zpool based on two of the geli vdevs
configured with the /etc/rc.conf lines:

   geli_devices="ad0s1 ad0s3f ad0s2"
   geli_ad0s1_flags="-k /root/ad0s1.key"
   geli_ad0s2_flags="-k /root/ad0s2.key"
   geli_ad0s3f_flags="-k /root/ad0s3f.key"

zpool scrub causes the following panic:

Unread portion of the kernel message buffer:
GEOM_ELI: Detached ad0s2.eli on last close.
GEOM_LABEL: Label for provider ad0s2 is msdosfs/ÒA.Û,{(#0.
panic: Function g_eli_orphan_spoil_assert() called for ad0s3f.eli.
KDB: enter: panic
panic: from debugger
Uptime: 5m27s
Physical memory: 1014 MB
Dumping 120 MB: 105 89 73 57 41 25 9

#0  doadump () at pcpu.h:195
195     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt   
#0  doadump () at pcpu.h:195
#1  0xc05db8f3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xc05dbb1c in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:563
#3  0xc04a40c7 in db_panic (addr=Could not find the frame base for "db_panic".
) at /usr/src/sys/ddb/db_command.c:433
#4  0xc04a4825 in db_command_loop () at /usr/src/sys/ddb/db_command.c:401
#5  0xc04a6255 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:222
#6  0xc06014a4 in kdb_trap (type=3, code=0, tf=0xf4bcac1c) at /usr/src/sys/kern/subr_kdb.c:502
#7  0xc0838d2b in trap (frame=0xf4bcac1c) at /usr/src/sys/i386/i386/trap.c:621
#8  0xc082120b in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#9  0xc0601602 in kdb_enter (msg=0xc0889693 "panic") at cpufunc.h:60
#10 0xc05dbb05 in panic (fmt=0xc3e9a0e8 "Function %s() called for %s.") at /usr/src/sys/kern/kern_shutdown.c:547
#11 0xc3e92d65 in ?? ()
#12 0xc3e9a0e8 in ?? ()
#13 0xc3e99d84 in ?? ()
#14 0xc3e0d890 in ?? ()
#15 0xf4bcacac in ?? ()
#16 0xc058eff5 in g_spoil_event (arg=0xc3c92940, flag=-945983104) at /usr/src/sys/geom/geom_subr.c:903
(kgdb) f 16
#16 0xc058eff5 in g_spoil_event (arg=0xc3c92940, flag=-945983104) at /usr/src/sys/geom/geom_subr.c:903
903                     cp->geom->spoiled(cp);
(kgdb) l
898                     if (!cp->spoiled)
899                             continue;
900                     cp->spoiled = 0;
901                     if (cp->geom->spoiled == NULL)
902                             continue;
903                     cp->geom->spoiled(cp);
904                     g_topology_assert();
905             }
906     }
907

Problem first reported in:
http://lists.freebsd.org/pipermail/freebsd-current/2007-October/078105.html

Fix: 

Quoting Pawel Jakub Dawidek's response to my initial report:

|GELI's detach-on-last-close mechanism is a general purpose mechanism, it
|may not work correctly with ZFS, because ZFS sometimes closes and reopen
|providers, which will make GELI to detach. In other words you shouldn't
|configure detach-on-last-close for ZFS components. It shouldn't panic
|still.

Adding geli_autodetach="NO" to /etc/rc.conf indeed prevents the panic.

I previously wasn't aware that this option exists,
so maybe it should be mentioned in geli(8).
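
Put together with the settings quoted in the description, the workaround amounts to an /etc/rc.conf fragment along these lines. This is only a sketch assembled from the lines in this report; the device names and key paths are the reporter's, and nothing else is assumed to be set.

```shell
# /etc/rc.conf fragment (sketch): geli providers used as ZFS vdevs.
geli_devices="ad0s1 ad0s3f ad0s2"
geli_ad0s1_flags="-k /root/ad0s1.key"
geli_ad0s2_flags="-k /root/ad0s2.key"
geli_ad0s3f_flags="-k /root/ad0s3f.key"

# Workaround from this report: do not mark the providers for
# detach on last close, since ZFS may close and reopen them.
geli_autodetach="NO"
```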
How-To-Repeat: Create a zpool with more than one geli vdev
that detaches on last close and run zpool scrub.

Actually I am not sure if a zpool with only one
such vdev is guaranteed to work, but at least
for me the problem only started to show when I
added the second one.
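
The steps above could be tried on a scratch machine with memory disks instead of real slices, roughly as sketched below. Everything specific here is an assumption made for illustration: the md unit numbers, the 128m size, the key paths under /tmp, the pool name "test", and the use of `geli detach -l` to mark the providers for detach on last close after the pool exists. The helper names `run` and `repro` are hypothetical. By default the script only prints the commands; it executes them only when RUN=1 is set, and that should be done as root on a disposable system, since the original report ends in a panic.

```shell
#!/bin/sh
# Hypothetical reproduction sketch, not taken from the original report.

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

repro() {
    for unit in 0 1; do
        # Swap-backed memory disk as a stand-in for a disk slice.
        run mdconfig -a -t swap -s 128m -u "$unit"
        # Key file only, no passphrase (-P on init, -p on attach).
        run dd if=/dev/random of="/tmp/md${unit}.key" bs=64 count=1
        run geli init -P -K "/tmp/md${unit}.key" "/dev/md${unit}"
        run geli attach -p -k "/tmp/md${unit}.key" "/dev/md${unit}"
    done
    run zpool create test md0.eli md1.eli
    # Mark both providers for detach on last close, then scrub.
    run geli detach -l md0.eli md1.eli
    run zpool scrub test
}

repro
```

Whether this reaches the scrub, panics, or simply loses the providers is expected to depend on the FreeBSD version in use.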
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2009-05-28 23:20:42 UTC
Responsible Changed
From-To: pjd->freebsd-fs

With pjd's permission, reassign ZFS-related PRs to freebsd-fs.
Comment 2 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2014-06-01 06:41:13 UTC
Responsible Changed
From-To: freebsd-bugs->pjd

I'll take this one.
Comment 3 Ben Woods freebsd_committer freebsd_triage 2015-03-13 15:19:27 UTC
This bug bit me today - it is quite unnerving to watch your zpool go offline due to unavailable devices, all from initiating a simple scrub!

However, I can confirm that a reboot recovered the zpool successfully, and adding the following to /etc/rc.conf prevented it from happening again during the next zpool scrub:
geli_autodetach="NO"

Note: I rebooted once more after adding that to /etc/rc.conf and before initiating the next zpool scrub, in case that setting is only read at boot time or when the geli provider is attached. I am not sure whether that was required.
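
One way to check whether an attached provider is still marked for detach on last close, without rebooting, might be to look at the flags reported by `geli list`. The helper below is a sketch under two assumptions: that the output contains a "Geom name:" line followed by a "Flags:" line for each provider, and that the detach-on-last-close marking shows up as a flag containing the word DETACH. The function name `autodetach_providers` is hypothetical.

```shell
# Reads `geli list` output on stdin and prints the name of every geom
# whose Flags line mentions DETACH (assumed flag naming, see above).
autodetach_providers() {
    awk '
        /^[[:space:]]*Geom name:/ { name = $3 }
        /^[[:space:]]*Flags:/ && /DETACH/ { print name }
    '
}

# Usage on a live system:
#   geli list | autodetach_providers
```

If the helper prints nothing for the pool's providers, they are presumably not marked for automatic detach.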
Comment 4 Ben Woods freebsd_committer freebsd_triage 2015-03-13 15:20:25 UTC
I should add that I am running FreeBSD 10.1 amd64 with generic kernel/world.
Comment 5 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:43:38 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Comment 6 Andriy Gapon freebsd_committer freebsd_triage 2018-05-29 09:46:18 UTC
Is this still reproducible?
There have been a number of fixes related to geom spoiling of partitions and labels.
Comment 7 Fabian Keil 2019-11-18 20:45:42 UTC
Looks like this is no longer an issue on more recent systems as ZFS
now fails to import pools on geli devices that detach on last close:

# geli attach -d /dev/md0
Enter passphrase: 
GEOM_ELI: Device md0.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: software
# geli attach -d /dev/md1
Enter passphrase: 
GEOM_ELI: Device md1.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: software
# zpool import
   pool: test
     id: 10011777910752807569
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        test        ONLINE
          md0.eli   ONLINE
          md1.eli   ONLINE
# zpool import test
GEOM_ELI: Device md0.eli destroyed.
GEOM_ELI: Detached md0.eli on last close.
GEOM_ELI: Device md1.eli destroyed.
GEOM_ELI: Detached md1.eli on last close.
g_access(944): provider md0.eli has error 6 set
g_access(944): provider md1.eli has error 6 set
cannot import 'test': one or more devices is currently unavailable