Bug 125149 - [nfs] [panic] changing into .zfs dir from nfs client causes panic
Summary: [nfs] [panic] changing into .zfs dir from nfs client causes panic
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 7.0-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Pawel Jakub Dawidek
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-07-01 15:20 UTC by Weldon Godfrey
Modified: 2012-04-16 07:32 UTC

See Also:


Description Weldon Godfrey 2008-07-01 15:20:02 UTC
As soon as I tried (for the 1st time) to 'cd' into .zfs from a Red Hat NFS client, the FreeBSD server panicked and rebooted.  As soon as it comes up fully, it panics again and reboots.  I had to disable ZFS to stop the endless rebooting.  There are about 7 snapshots on the file system, but the system panicked on cd .zfs.

If you need access to the system, I can arrange that.  It is not in production yet.

If you need me to try to create a crash dump or anything else, please let me know.  The panic messages aren't hitting syslog.  The panic message mentions nfsd.  If needed, I'll re-enable ZFS and write down as much as I can from the panic message.

The ZFS file system was configured as 1 pool, 1 volume (tank/mail).

It is using 24 300GB SAS drives configured as:

zpool create tank mirror da0 da12 mirror da1 da13 mirror da2 da14 mirror da3 da15 mirror da4 da16 mirror da5 da17 mirror da6 da18 mirror da7 da19 mirror da8 da20 mirror da9 da21 mirror da10 da22 spare da11 da23
da0-da11 (enclosure 0) and da12-da23 (enclosure 1) are IBM EXP3000's behind a 3ware 9690


store1# more loader.conf
vm.kmem_size_max="16106127360"
vm.kmem_size="1073741824"
vfs.zfs.cache_flush_disable="1"
kern.maxvnodes=800000
vfs.zfs.prefetch_disable=1

zfs exports file:
store1# more exports 
# !!! DO NOT EDIT THIS FILE MANUALLY !!!

/var/mail       -maproot=root -network 192.168.2.0 -mask 255.255.255.0 

I am using the ULE scheduler.

dumpdev is set to AUTO; however, only minfree exists in /var/crash.

store1# zdb
tank
    version=6
    name='tank'
    state=0
    txg=391
    pool_guid=9188286166961335303
    hostid=1607525555
    hostname='store1.mail.ena.net'
    vdev_tree
        type='root'
        id=0
        guid=9188286166961335303
        children[0]
                type='mirror'
                id=0
                guid=6940539032091406049
                metaslab_array=27
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=12177063734546800829
                        path='/dev/da0'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=17756148780680423243
                        path='/dev/da12'
                        whole_disk=0
        children[1]
                type='mirror'
                id=1
                guid=15553657878513052422
                metaslab_array=25
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=159312058462001267
                        path='/dev/da1'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=8427428225122586042
                        path='/dev/da13'
                        whole_disk=0
        children[2]
                type='mirror'
                id=2
                guid=8094557295401289097
                metaslab_array=24
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=3973142902767367128
                        path='/dev/da2'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=5475429582146651394
                        path='/dev/da14'
                        whole_disk=0
        children[3]
                type='mirror'
                id=3
                guid=8422371545889157332
                metaslab_array=23
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=7876869405715517022
                        path='/dev/da3'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=2311208437246479700
                        path='/dev/da15'
                        whole_disk=0
        children[4]
                type='mirror'
                id=4
                guid=13043784695933281991
                metaslab_array=22
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=2625736407033884883
                        path='/dev/da4'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=12139830734620603195
                        path='/dev/da16'
                        whole_disk=0
        children[5]
                type='mirror'
                id=5
                guid=8537975538107565110
                metaslab_array=21
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=10811496881972791559
                        path='/dev/da5'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=12467851920062622083
                        path='/dev/da17'
                        whole_disk=0
        children[6]
                type='mirror'
                id=6
                guid=6984776714311523782
                metaslab_array=20
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=17893231162439421521
                        path='/dev/da6'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=7007733400839455331
                        path='/dev/da18'
                        whole_disk=0
        children[7]
                type='mirror'
                id=7
                guid=1900649043355843336
                metaslab_array=19
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=4593823921763348600
                        path='/dev/da7'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=3227568170452807619
                        path='/dev/da19'
                        whole_disk=0
        children[8]
                type='mirror'
                id=8
                guid=4496327987998401292
                metaslab_array=18
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=9797945788221343409
                        path='/dev/da8'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=15910496011831212127
                        path='/dev/da20'
                        whole_disk=0
        children[9]
                type='mirror'
                id=9
                guid=7070511984720364207
                metaslab_array=17
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=9502463708836649545
                        path='/dev/da9'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=7131953916078442743
                        path='/dev/da21'
                        whole_disk=0
        children[10]
                type='mirror'
                id=10
                guid=13441563341814041750
                metaslab_array=15
                metaslab_shift=31
                ashift=9
                asize=294983827456
                children[0]
                        type='disk'
                        id=0
                        guid=3403318897555406651
                        path='/dev/da10'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=2334569660784776332
                        path='/dev/da22'
                        whole_disk=0
store1#

How-To-Repeat: with multiple snapshots taken on the file system,
go to an NFS client with the file system mounted and run:
cd .zfs
Comment 1 Weldon Godfrey 2008-07-01 15:23:18 UTC
I rechecked my buffer for where it started.  Please note that the panic
actually started when:

cd .zfs
ls

so the panic started when I tried to stat the .zfs directory.


Weldon

-----Original Message-----
From: FreeBSD-gnats-submit@FreeBSD.org
[mailto:FreeBSD-gnats-submit@FreeBSD.org]
Sent: Tuesday, July 01, 2008 9:20 AM
To: Weldon Godfrey
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs
client causes endless panic loop

Thank you very much for your problem report.
It has the internal identification `kern/125149'.
The individual assigned to look at your
report is: freebsd-bugs.

You can access the state of your problem report at any time
via this link:

http://www.freebsd.org/cgi/query-pr.cgi?pr=125149

>Category:       kern
>Responsible:    freebsd-bugs
>Synopsis:       [zfs][nfs] changing into .zfs dir from nfs client
causes endless panic loop
>Arrival-Date:   Tue Jul 01 14:20:02 UTC 2008
Comment 2 Weldon Godfrey 2008-07-01 15:38:38 UTC
Also add that the endless rebooting stops if the client that cd'ed into
.zfs cd's back out (so it sounds like the problem occurs when the NFS
client stats the .zfs directory).

Weldon

Comment 3 Volker Werth freebsd_committer freebsd_triage 2008-10-01 22:02:14 UTC
State Changed
From-To: open->feedback

Weldon, 

are you still able to reproduce this issue? If so, please provide 
the backtrace from a dump.
Comment 4 Volker Werth freebsd_committer freebsd_triage 2008-10-05 18:05:22 UTC
Attach submitted debugging information to the PR.

-------- Original Message --------
Subject: RE: kern/125149: [zfs][nfs] changing into .zfs dir from nfs
client causes endless panic loop
Date: Fri, 3 Oct 2008 08:58:42 -0500
From: Weldon Godfrey <wgodfrey@ena.com>
To: Volker Werth <vwe@freebsd.org>
CC: <freebsd-bugs@freebsd.org>
References: <200810012106.m91L6jq2007417@freefall.freebsd.org>
<A7B0A9F02975A74A845FE85D0B95B8FA0A1107A6@misex01.ena.com>
<48E535D8.4030101@freebsd.org>


No problem, here is the result.  Thanks!
Weldon


store1# kgdb /usr/obj/usr/src/sys/GENERIC/kernel.debug vmcore.27
[GDB will not be able to debug user-mode threads:
/usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 05
fault virtual address   = 0x108
fault code              = supervisor write data, page not present
instruction pointer     = 0x8:0xffffffff804f06fa
stack pointer           = 0x10:0xffffffffdf761590
frame pointer           = 0x10:0x4
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 807 (nfsd)
trap number             = 12
panic: page fault
cpuid = 5
Uptime: 1m19s
Physical memory: 16367 MB
Dumping 891 MB: 876 860 844 828 812 796 780 764 748 732 716 700 684 668
652 636 620 604 588 572 556 540 524 508 492 476 460 444 428 412 396 380
364 348 332 316 300 284 268 252 236 220 204 188 172 156 140 124 108 92
76 60 44 28 12

#0  doadump () at pcpu.h:194
194     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) frame 9
#9  0xffffffff8060670d in nfsrv_readdirplus (nfsd=0xffffff000584f100,
slp=0xffffff0005725900,
    td=0xffffff00059a0340, mrq=0xffffffffdf761af0) at
/usr/src/sys/nfsserver/nfs_serv.c:3613
3613            vput(nvp);
(kgdb) list
3608                    nfsm_reply(NFSX_V3POSTOPATTR);
3609                    nfsm_srvpostop_attr(getret, &at);
3610                    error = 0;
3611                    goto nfsmout;
3612            }
3613            vput(nvp);
3614            nvp = NULL;
3615
3616            dirlen = len = NFSX_V3POSTOPATTR + NFSX_V3COOKIEVERF +
3617                2 * NFSX_UNSIGNED;
(kgdb) p *vp
$1 = {v_type = VDIR, v_tag = 0xffffffffdf8a7647 "zfs", v_op =
0xffffffffdf8ab4e0, v_data = 0xffffff0005958d00,
  v_mount = 0xffffff0005908978, v_nmntvnodes = {tqe_next =
0xffffff0005aed1f0, tqe_prev = 0xffffff0005a117e8},
  v_un = {vu_mount = 0x0, vu_socket = 0x0, vu_cdev = 0x0, vu_fifoinfo =
0x0}, v_hashlist = {le_next = 0x0,
    le_prev = 0x0}, v_hash = 0, v_cache_src = {lh_first = 0x0},
v_cache_dst = {tqh_first = 0x0,
    tqh_last = 0xffffff0005aed440}, v_dd = 0x0, v_cstart = 0, v_lasta =
0, v_lastw = 0, v_clen = 0, v_lock = {
    lk_object = {lo_name = 0xffffffffdf8a7647 "zfs", lo_type =
0xffffffffdf8a7647 "zfs", lo_flags = 70844416,
      lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness =
0x0}}, lk_interlock = 0xffffffff80a49ed0,
    lk_flags = 128, lk_sharecount = 0, lk_waitcount = 0,
lk_exclusivecount = 0, lk_prio = 80, lk_timo = 51,
    lk_lockholder = 0xffffffffffffffff, lk_newlock = 0x0}, v_interlock =
{lock_object = {
      lo_name = 0xffffffff807ee47a "vnode interlock", lo_type =
0xffffffff807ee47a "vnode interlock",
      lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next =
0x0}, lod_witness = 0x0}}, mtx_lock = 4,
    mtx_recurse = 0}, v_vnlock = 0xffffff0005aed478, v_holdcnt = 2,
v_usecount = 2, v_iflag = 0, v_vflag = 0,
  v_writecount = 0, v_freelist = {tqe_next = 0x0, tqe_prev = 0x0},
v_bufobj = {bo_mtx = 0xffffff0005aed4c8,
    bo_clean = {bv_hd = {tqh_first = 0x0, tqh_last =
0xffffff0005aed538}, bv_root = 0x0, bv_cnt = 0}, bo_dirty = {
      bv_hd = {tqh_first = 0x0, tqh_last = 0xffffff0005aed558}, bv_root
= 0x0, bv_cnt = 0}, bo_numoutput = 0,
    bo_flag = 0, bo_ops = 0xffffffff809cc320, bo_bsize = 0, bo_object =
0x0, bo_synclist = {le_next = 0x0,
      le_prev = 0x0}, bo_private = 0xffffff0005aed3e0, __bo_vnode =
0xffffff0005aed3e0}, v_pollinfo = 0x0,
  v_label = 0x0}
(kgdb) p *dp
$2 = {d_fileno = 1, d_reclen = 12, d_type = 4 '\004', d_namlen = 1 '\001',
  d_name =
".\000\000\000\001\000\000\000\f\000\004\002..\000\000\002\000\000\000\024\000\004\bsnapshot\000\000\000\000\000\000\000\000@s'\n\000ÿÿÿ\004\000\000\000\003\000\000\000\022\000\000\000\000\000\000\000|D~\200ÿÿÿÿ|D~\200ÿÿÿÿ\000\000:\002",
'\0' <repeats 12 times>, "\006", '\0' <repeats 32 times>,
"à\224\005\000ÿÿÿ\000à\224\005\000ÿÿÿ\000à\224\005\000ÿÿÿ\000\000\000\000\000\000\000\000\030Ö\224\005\000ÿÿÿ",
'\0' <repeats 87 times>}
(kgdb) frame 8
#8  0xffffffff804f06fa in vput (vp=0x0) at atomic.h:142
142     atomic.h: No such file or directory.
        in atomic.h
(kgdb) list
137     in atomic.h
(kgdb)

Weldon


-----Original Message-----
From: Volker Werth [mailto:vwe@freebsd.org]
Sent: Thursday, October 02, 2008 3:58 PM
To: Weldon Godfrey
Cc: freebsd-bugs@freebsd.org
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs
client causes endless panic loop

On 10/02/08 21:05, Weldon Godfrey wrote:
> Yes, I can replicate statting .zfs dir from NFS client causes FreeBSD to
> panic and reboot, this time from CentOS 5.0 box.  ...
> 
> 
> Replicate:
> 
> [root@asmtp2 ~]# df
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
>                       60817412   2814548  54863692   5% /
> /dev/sda1               101086     28729     67138  30% /boot
> tmpfs                  2008628         0   2008628   0% /dev/shm
> 192.168.2.22:/vol/enamail
>                      1286702144 1032758816 253943328  81%
> /var/spool/mail
> 192.168.2.21:/vol/exports/gaggle
>                      400959408 144327584 256631824  36%
> /var/spool/mail/archive/gaggle
> 192.168.2.36:/export/store1-1
>                      1413955712   4619136 1409336576   1%
> /var/spool/mail/store1-1
> [root@asmtp2 ~]# 
> [root@asmtp2 ~]# 
> [root@asmtp2 ~]# cd /var/spool/mail/store1-1
> [root@asmtp2 store1-1]# ls
> 1  2  3  4  5  6  7  8  9  crap
> [root@asmtp2 store1-1]# cd .zfs
> [root@asmtp2 .zfs]# ls
> (FreeBSD ZFS server panics here)
> 
> Weldon
> 
> Backtrace:
> 
> store1# kgdb /usr/obj/usr/src/sys/GENERIC/kernel.debug vmcore.27
> [GDB will not be able to debug user-mode threads:
> /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you
> are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for
> details.
> This GDB was configured as "amd64-marcel-freebsd".
> 
> Unread portion of the kernel message buffer:
> 
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 5; apic id = 05
> fault virtual address   = 0x108
> fault code              = supervisor write data, page not present
> instruction pointer     = 0x8:0xffffffff804f06fa
> stack pointer           = 0x10:0xffffffffdf761590
> frame pointer           = 0x10:0x4
> code segment            = base rx0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 807 (nfsd)
> trap number             = 12
> panic: page fault
> cpuid = 5
> Uptime: 1m19s
> Physical memory: 16367 MB
> Dumping 891 MB: 876 860 844 828 812 796 780 764 748 732 716 700 684 668
> 652 636 620 604 588 572 556 540 524 508 492 476 460 444 428 412 396 380
> 364 348 332 316 300 284 268 252 236 220 204 188 172 156 140 124 108 92
> 76 60 44 28 12
> 
> #0  doadump () at pcpu.h:194
> 194     pcpu.h: No such file or directory.
>         in pcpu.h
> (kgdb) vt
> Undefined command: "vt".  Try "help".
> (kgdb) bt
> #0  doadump () at pcpu.h:194
> #1  0x0000000000000004 in ?? ()
> #2  0xffffffff80477699 in boot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:409
> #3  0xffffffff80477a9d in panic (fmt=0x104 <Address 0x104 out of
> bounds>) at /usr/src/sys/kern/kern_shutdown.c:563
> #4  0xffffffff8072ed24 in trap_fatal (frame=0xffffff00059a0340,
> eva=18446742974291977320)
>     at /usr/src/sys/amd64/amd64/trap.c:724
> #5  0xffffffff8072f0f5 in trap_pfault (frame=0xffffffffdf7614e0,
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:641
> #6  0xffffffff8072fa38 in trap (frame=0xffffffffdf7614e0) at
> /usr/src/sys/amd64/amd64/trap.c:410
> #7  0xffffffff807156ae in calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:169
> #8  0xffffffff804f06fa in vput (vp=0x0) at atomic.h:142
> #9  0xffffffff8060670d in nfsrv_readdirplus (nfsd=0xffffff000584f100,
> slp=0xffffff0005725900, 
>     td=0xffffff00059a0340, mrq=0xffffffffdf761af0) at
> /usr/src/sys/nfsserver/nfs_serv.c:3613
> #10 0xffffffff80615a5d in nfssvc (td=Variable "td" is not available.
> ) at /usr/src/sys/nfsserver/nfs_syscalls.c:461
> #11 0xffffffff8072f377 in syscall (frame=0xffffffffdf761c70) at
> /usr/src/sys/amd64/amd64/trap.c:852
> #12 0xffffffff807158bb in Xfast_syscall () at
> /usr/src/sys/amd64/amd64/exception.S:290
> #13 0x000000080068746c in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> 
> 

Weldon,

can you please try the following from kgdb and send the output:

(kgdb) frame 9
(kgdb) list
(kgdb) p *vp
(kgdb) p *dp
(kgdb) frame 8
(kgdb) list

Please keep the core dump as we might need to check some variable values
later.

I think the problem is the NULL pointer to vput. A maintainer needs to
check how nvp can get a NULL pointer (judging by assuming my fresh
codebase is not too different from yours).

Thanks

Volker
Comment 5 Volker Werth freebsd_committer freebsd_triage 2008-10-05 18:24:30 UTC
State Changed
From-To: feedback->open


Over to maintainer(s). 


Comment 6 Volker Werth freebsd_committer freebsd_triage 2008-10-05 18:24:30 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs


Over to maintainer(s).
Comment 7 Jaakko Heinonen 2008-10-07 16:36:30 UTC
Hi,

On 2008-10-02, Volker Werth wrote:
> > #8  0xffffffff804f06fa in vput (vp=0x0) at atomic.h:142
> > #9  0xffffffff8060670d in nfsrv_readdirplus (nfsd=0xffffff000584f100,
> > slp=0xffffff0005725900, 
> >     td=0xffffff00059a0340, mrq=0xffffffffdf761af0) at
> > /usr/src/sys/nfsserver/nfs_serv.c:3613
> > #10 0xffffffff80615a5d in nfssvc (td=Variable "td" is not available.
> > ) at /usr/src/sys/nfsserver/nfs_syscalls.c:461
> > #11 0xffffffff8072f377 in syscall (frame=0xffffffffdf761c70) at
> > /usr/src/sys/amd64/amd64/trap.c:852
> > #12 0xffffffff807158bb in Xfast_syscall () at
> > /usr/src/sys/amd64/amd64/exception.S:290
> > #13 0x000000080068746c in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
> 
> I think the problem is the NULL pointer to vput. A maintainer needs to
> check how nvp can get a NULL pointer (judging by assuming my fresh
> codebase is not too different from yours).

The bug is reproducible with NFS clients that use readdirplus. The
FreeBSD client doesn't use readdirplus by default, but you can enable it
with the -l mount option. Here are the steps to reproduce the panic with
a FreeBSD NFS client:

- nfs export a zfs file system
- on client mount the file system with -l mount option and list the
  zfs control directory
# mount_nfs -l x.x.x.x:/tank /mnt
# ls /mnt/.zfs

I see two bugs here:

1) nfsrv_readdirplus() doesn't check the VFS_VGET() error status
   properly.  It only checks for EOPNOTSUPP; other errors are ignored.
   This is the immediate cause of the panic, and in theory it could
   happen for other file systems too.  In this case VFS_VGET() returns
   EINVAL, resulting in a NULL nvp.
2) zfs VFS_VGET() returns EINVAL for .zfs control directory entries.
   Looking at zfs_vget(), it tries to find a corresponding znode to
   fulfill the request.  However, control directory entries don't have
   backing znodes.

Here is a patch which fixes 1).  The patch prevents the system from
panicking, but a fix for 2) is needed to make readdirplus work with the
.zfs directory.

%%%
Index: sys/nfsserver/nfs_serv.c
===================================================================
--- sys/nfsserver/nfs_serv.c	(revision 183511)
+++ sys/nfsserver/nfs_serv.c	(working copy)
@@ -3597,9 +3597,12 @@ again:
 	 * Probe one of the directory entries to see if the filesystem
 	 * supports VGET.
 	 */
-	if (VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp) ==
-	    EOPNOTSUPP) {
-		error = NFSERR_NOTSUPP;
+	error = VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp);
+	if (error) {
+		if (error == EOPNOTSUPP)
+			error = NFSERR_NOTSUPP;
+		else
+			error = NFSERR_SERVERFAULT;
 		vrele(vp);
 		vp = NULL;
 		free((caddr_t)cookies, M_TEMP);
%%%

And here's an attempt to add support for .zfs control directory entries
(bug 2) in zfs_vget().  The patch is very experimental and only works
for snapshots which are already active (mounted).

%%%
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 		VN_RELE(ZTOV(zp));
 		err = EINVAL;
 	}
-	if (err != 0)
-		*vpp = NULL;
-	else {
+	if (err != 0) {
+		/* try .zfs control directory */
+		err = zfsctl_vget(vfsp, ino, flags, vpp);
+	} else {
 		*vpp = ZTOV(zp);
 		vn_lock(*vpp, flags);
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(working copy)
@@ -1047,6 +1047,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
 	return (error);
 }
 
+int
+zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+	vnode_t *dvp, *vp;
+	zfsctl_snapdir_t *sdp;
+	zfsctl_node_t *zcp;
+	zfs_snapentry_t *sep;
+	int error;
+
+	*vpp = NULL;
+
+	ASSERT(zfsvfs->z_ctldir != NULL);
+	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
+	    NULL, 0, NULL, kcred);
+	if (error != 0)
+		return (error);
+
+	if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) {
+		if (nodeid == ZFSCTL_INO_SNAPDIR)
+			*vpp = dvp;
+		else {
+			VN_RELE(dvp);
+			*vpp = zfsvfs->z_ctldir;
+			VN_HOLD(*vpp);
+		}
+		/* XXX: LK_RETRY? */
+		vn_lock(*vpp, flags | LK_RETRY);
+		return (0);
+	}
+		
+	sdp = dvp->v_data;
+
+	mutex_enter(&sdp->sd_lock);
+	sep = avl_first(&sdp->sd_snaps);
+	while (sep != NULL) {
+		vp = sep->se_root;
+		zcp = vp->v_data;
+		if (zcp->zc_id == nodeid)
+			break;
+
+		sep = AVL_NEXT(&sdp->sd_snaps, sep);
+	}
+
+	if (sep != NULL) {
+		VN_HOLD(vp);
+		*vpp = vp;
+		vn_lock(*vpp, flags);
+	} else
+		error = EINVAL;
+
+	mutex_exit(&sdp->sd_lock);
+
+	VN_RELE(dvp);
+
+	return (error);
+}
 /*
  * Unmount any snapshots for the given filesystem.  This is called from
  * zfs_umount() - if we have a ctldir, then go through and unmount all the
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(working copy)
@@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha
     int flags, vnode_t *rdir, cred_t *cr);
 
 int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp);
+int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp);
 
 #define	ZFSCTL_INO_ROOT		0x1
 #define	ZFSCTL_INO_SNAPDIR	0x2
%%%

-- 
Jaakko
Comment 8 Weldon Godfrey 2008-10-08 22:06:50 UTC
Thanks! I will apply these patches tomorrow.

Weldon

-----Original Message-----
From: Jaakko Heinonen [mailto:jh@saunalahti.fi]=20
Sent: Tuesday, October 07, 2008 10:37 AM
To: Volker Werth
Cc: Weldon Godfrey; bug-followup@freebsd.org
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs
clientcauses endless panic loop


Hi,

On 2008-10-02, Volker Werth wrote:
> > #8  0xffffffff804f06fa in vput (vp=3D0x0) at atomic.h:142
> > #9  0xffffffff8060670d in nfsrv_readdirplus
(nfsd=3D0xffffff000584f100,
> > slp=3D0xffffff0005725900,=20
> >     td=3D0xffffff00059a0340, mrq=3D0xffffffffdf761af0) at
> > /usr/src/sys/nfsserver/nfs_serv.c:3613
> > #10 0xffffffff80615a5d in nfssvc (td=3DVariable "td" is not =
available.
> > ) at /usr/src/sys/nfsserver/nfs_syscalls.c:461
> > #11 0xffffffff8072f377 in syscall (frame=3D0xffffffffdf761c70) at
> > /usr/src/sys/amd64/amd64/trap.c:852
> > #12 0xffffffff807158bb in Xfast_syscall () at
> > /usr/src/sys/amd64/amd64/exception.S:290
> > #13 0x000000080068746c in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
>=20
> I think the problem is the NULL pointer to vput. A maintainer needs to
> check how nvp can get a NULL pointer (judging by assuming my fresh
> codebase is not too different from yours).

The bug is reproducible with nfs clients using readdirplus. FreeBSD
client doesn't use readdirplus by default but you can enable it with -l
mount option. Here are steps to reproduce the panic with FreeBSD nfs
client:

- nfs export a zfs file system
- on client mount the file system with -l mount option and list the
  zfs control directory
# mount_nfs -l x.x.x.x:/tank /mnt
# ls /mnt/.zfs

I see two bugs here:

1) nfsrv_readdirplus() doesn't check VFS_VGET() error status properly.
   It only checks for EOPNOTSUPP but other errors are ignored. This is
the
   final reason for the panic and in theory it could happen for other
   file systems too. In this case VFS_VGET() returns EINVAL and results
   NULL nvp.
2) zfs VFS_VGET() returns EINVAL for .zfs control directory entries.
   Looking at zfs_vget() it tries find corresponding znode to fulfill
   the request. However control directory entries don't have backing
   znodes.

Here is a patch which fixes 1). The patch prevents system from panicing
but a fix for 2) is needed to make readdirplus work with .zfs directory.

%%%
Index: sys/nfsserver/nfs_serv.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/nfsserver/nfs_serv.c	(revision 183511)
+++ sys/nfsserver/nfs_serv.c	(working copy)
@@ -3597,9 +3597,12 @@ again:
 	 * Probe one of the directory entries to see if the filesystem
 	 * supports VGET.
 	 */
-	if (VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp) =3D=3D
-	    EOPNOTSUPP) {
-		error =3D NFSERR_NOTSUPP;
+	error =3D VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE, &nvp);
+	if (error) {
+		if (error =3D=3D EOPNOTSUPP)
+			error =3D NFSERR_NOTSUPP;
+		else
+			error =3D NFSERR_SERVERFAULT;
 		vrele(vp);
 		vp =3D NULL;
 		free((caddr_t)cookies, M_TEMP);
%%%

And here's an attempt to add support for .zfs control directory entries
(bug 2)) in zfs_vget(). The patch is very experimental and it only works
for snapshots which are already active (mounted).

%%%
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 		VN_RELE(ZTOV(zp));
 		err = EINVAL;
 	}
-	if (err != 0)
-		*vpp = NULL;
-	else {
+	if (err != 0) {
+		/* try .zfs control directory */
+		err = zfsctl_vget(vfsp, ino, flags, vpp);
+	} else {
 		*vpp = ZTOV(zp);
 		vn_lock(*vpp, flags);
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(working copy)
@@ -1047,6 +1047,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
 	return (error);
 }
 
+int
+zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+	vnode_t *dvp, *vp;
+	zfsctl_snapdir_t *sdp;
+	zfsctl_node_t *zcp;
+	zfs_snapentry_t *sep;
+	int error;
+
+	*vpp = NULL;
+
+	ASSERT(zfsvfs->z_ctldir != NULL);
+	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
+	    NULL, 0, NULL, kcred);
+	if (error != 0)
+		return (error);
+
+	if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) {
+		if (nodeid == ZFSCTL_INO_SNAPDIR)
+			*vpp = dvp;
+		else {
+			VN_RELE(dvp);
+			*vpp = zfsvfs->z_ctldir;
+			VN_HOLD(*vpp);
+		}
+		/* XXX: LK_RETRY? */
+		vn_lock(*vpp, flags | LK_RETRY);
+		return (0);
+	}
+
+	sdp = dvp->v_data;
+
+	mutex_enter(&sdp->sd_lock);
+	sep = avl_first(&sdp->sd_snaps);
+	while (sep != NULL) {
+		vp = sep->se_root;
+		zcp = vp->v_data;
+		if (zcp->zc_id == nodeid)
+			break;
+
+		sep = AVL_NEXT(&sdp->sd_snaps, sep);
+	}
+
+	if (sep != NULL) {
+		VN_HOLD(vp);
+		*vpp = vp;
+		vn_lock(*vpp, flags);
+	} else
+		error = EINVAL;
+
+	mutex_exit(&sdp->sd_lock);
+
+	VN_RELE(dvp);
+
+	return (error);
+}
 /*
  * Unmount any snapshots for the given filesystem.  This is called from
  * zfs_umount() - if we have a ctldir, then go through and unmount all the
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(revision 183587)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(working copy)
@@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha
     int flags, vnode_t *rdir, cred_t *cr);
 
 int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp);
+int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp);
 
 #define	ZFSCTL_INO_ROOT		0x1
 #define	ZFSCTL_INO_SNAPDIR	0x2
%%%

-- 
Jaakko
Comment 9 Weldon Godfrey 2008-10-09 14:19:38 UTC
I am rebuilding right now.

FYI --- I modified the patch (corrected number of lines)

-@@ -1047,6 +1047,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
+@@ -1047,6 +1047,62 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64

Weldon
Comment 10 Weldon Godfrey 2008-10-09 17:23:12 UTC
Is this patch based on 8-CURRENT or 7-RELEASE?  If 8-CURRENT, I don't
know if I can test it, as I would like to stick with 7-RELEASE for now.
However, I would like to move to ZFS11, so if there is a patch for 7 for
ZFS11 (assuming your patch is based on the v11 code), I would like to
apply that.



/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073:33: error: macro "vn_lock" requires 3 arguments, but only 2 given
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c: In function 'zfsctl_vget':
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: 'vn_lock' undeclared (first use in this function)
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: (Each undeclared identifier is reported only once
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1073: error: for each function it appears in.)
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:1093:22: error: macro "vn_lock" requires 3 arguments, but only 2 given

Weldon

-----Original Message-----
From: Jaakko Heinonen [mailto:jh@saunalahti.fi]
Sent: Tuesday, October 07, 2008 10:37 AM
To: Volker Werth
Cc: Weldon Godfrey; bug-followup@freebsd.org
Subject: Re: kern/125149: [zfs][nfs] changing into .zfs dir from nfs
client causes endless panic loop


Hi,

On 2008-10-02, Volker Werth wrote:
> > #8  0xffffffff804f06fa in vput (vp=0x0) at atomic.h:142
> > #9  0xffffffff8060670d in nfsrv_readdirplus (nfsd=0xffffff000584f100,
> > slp=0xffffff0005725900,
> >     td=0xffffff00059a0340, mrq=0xffffffffdf761af0) at
> > /usr/src/sys/nfsserver/nfs_serv.c:3613
> > #10 0xffffffff80615a5d in nfssvc (td=Variable "td" is not available.
> > ) at /usr/src/sys/nfsserver/nfs_syscalls.c:461
> > #11 0xffffffff8072f377 in syscall (frame=0xffffffffdf761c70) at
> > /usr/src/sys/amd64/amd64/trap.c:852
> > #12 0xffffffff807158bb in Xfast_syscall () at
> > /usr/src/sys/amd64/amd64/exception.S:290
> > #13 0x000000080068746c in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
>
> I think the problem is the NULL pointer to vput. A maintainer needs to
> check how nvp can get a NULL pointer (judging by assuming my fresh
> codebase is not too different from yours).

Comment 11 Jaakko Heinonen 2008-10-09 20:44:38 UTC
On 2008-10-09, Weldon Godfrey wrote:
> Is this patch based on 8-CURRENT or 7-RELEASE?  If 8-CURRENT, I don't
> know if I can test as I would like to stick with 7-RELEASE for now.

Patches are against head. Sorry that I didn't mention that. The nfs
patch applies against RELENG_7 with offset and here's the zfs patch
against RELENG_7. (Disclaimer: compile tested only)

%%%
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -759,9 +759,10 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 		VN_RELE(ZTOV(zp));
 		err = EINVAL;
 	}
-	if (err != 0)
-		*vpp = NULL;
-	else {
+	if (err != 0) {
+		/* try .zfs control directory */
+		err = zfsctl_vget(vfsp, ino, flags, vpp);
+	} else {
 		*vpp = ZTOV(zp);
 		vn_lock(*vpp, flags, curthread);
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	(working copy)
@@ -1044,6 +1044,63 @@ zfsctl_lookup_objset(vfs_t *vfsp, uint64
 	return (error);
 }
 
+int
+zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+	vnode_t *dvp, *vp;
+	zfsctl_snapdir_t *sdp;
+	zfsctl_node_t *zcp;
+	zfs_snapentry_t *sep;
+	int error;
+
+	*vpp = NULL;
+
+	ASSERT(zfsvfs->z_ctldir != NULL);
+	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
+	    NULL, 0, NULL, kcred);
+	if (error != 0)
+		return (error);
+
+	if (nodeid == ZFSCTL_INO_ROOT || nodeid == ZFSCTL_INO_SNAPDIR) {
+		if (nodeid == ZFSCTL_INO_SNAPDIR)
+			*vpp = dvp;
+		else {
+			VN_RELE(dvp);
+			*vpp = zfsvfs->z_ctldir;
+			VN_HOLD(*vpp);
+		}
+		/* XXX: LK_RETRY? */
+		vn_lock(*vpp, flags | LK_RETRY, curthread);
+		return (0);
+	}
+		
+	sdp = dvp->v_data;
+
+	mutex_enter(&sdp->sd_lock);
+	sep = avl_first(&sdp->sd_snaps);
+	while (sep != NULL) {
+		vp = sep->se_root;
+		zcp = vp->v_data;
+		if (zcp->zc_id == nodeid)
+			break;
+
+		sep = AVL_NEXT(&sdp->sd_snaps, sep);
+	}
+
+	if (sep != NULL) {
+		VN_HOLD(vp);
+		*vpp = vp;
+		vn_lock(*vpp, flags, curthread);
+	} else
+		error = EINVAL;
+
+	mutex_exit(&sdp->sd_lock);
+
+	VN_RELE(dvp);
+
+	return (error);
+}
 /*
  * Unmount any snapshots for the given filesystem.  This is called from
  * zfs_umount() - if we have a ctldir, then go through and unmount all the
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(revision 183727)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ctldir.h	(working copy)
@@ -60,6 +60,7 @@ int zfsctl_root_lookup(vnode_t *dvp, cha
     int flags, vnode_t *rdir, cred_t *cr);
 
 int zfsctl_lookup_objset(vfs_t *vfsp, uint64_t objsetid, zfsvfs_t **zfsvfsp);
+int zfsctl_vget(vfs_t *vfsp, uint64_t nodeid, int flags, vnode_t **vpp);
 
 #define	ZFSCTL_INO_ROOT		0x1
 #define	ZFSCTL_INO_SNAPDIR	0x2
%%%

-- 
Jaakko
Comment 12 Weldon Godfrey 2008-10-10 14:11:17 UTC
That's okay, although I won't be able to help test since I am close to
putting the system into production.  We can live without access to the
.zfs directory from a client.  Also, I have set the nordirplus option on
the clients now.

Btw, could this also be the other issue I was seeing?  When we tested
rigorously from CentOS 3.x clients, after 2-3 hrs of testing the system
would panic.  From the freebsd-fs list, it was noted from the backtrace
that the vnode was becoming invalid.  This happened much less often with
CentOS 5.x clients, although I did get one panic recently.  I am
rerunning the tests over this weekend.

Thank you for helping!

Weldon

Comment 13 Jaakko Heinonen 2008-10-13 16:11:46 UTC
On 2008-10-10, Weldon Godfrey wrote:
> Which, btw, could this also be the other issue I was seeing?  When we
> tested rigoriously from CentOS 3.x clients, after 2-3 hrs of testing,
> the system would panic.  From the fbsd-fs list, it was noted from the
> backtrace that the vnode was becoming invalid.

Well, if you mean this message
http://lists.freebsd.org/pipermail/freebsd-fs/2008-August/005120.html
and Rick's analysis is correct, then I am quite certain that they are
different issues.

-- 
Jaakko
Comment 14 Jaakko Heinonen 2009-09-10 07:44:28 UTC
Hi,

On 2009-09-09, pjd@FreeBSD.org wrote:
> Is this still a problem with FreeBSD 8? I'm not able to reproduce it.

The NFS part has been committed (see r186165). However, you still can't
list the .zfs control directory with readdirplus-enabled NFS clients
(bug 2 in my earlier message). Now the NFS server just returns
NFSERR_SERVERFAULT if you try to list the .zfs directory:

09:37:49.696845 IP localhost.2376948419 > localhost.nfs: 136 readdirplus [|nfs]
09:37:49.696947 IP localhost.nfs > localhost.2376948419: reply ok 116 readdirplus ERROR: Unspecified error on server

-- 
Jaakko
Comment 15 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2009-09-10 07:47:02 UTC
On Thu, Sep 10, 2009 at 09:44:28AM +0300, Jaakko Heinonen wrote:
> 
> Hi,
> 
> On 2009-09-09, pjd@FreeBSD.org wrote:
> > Is this still a problem with FreeBSD 8? I'm not able to reproduce it.
> 
> The NFS part has been committed (see r186165). However you still can't
> list the .zfs control directory with readdirplus enabled NFS clients (2)
> in my earlier message). Now NFS server just returns NFSERR_SERVERFAULT
> if you try to list the .zfs directory:
> 
> 09:37:49.696845 IP localhost.2376948419 > localhost.nfs: 136 readdirplus [|nfs]
> 09:37:49.696947 IP localhost.nfs > localhost.2376948419: reply ok 116 readdirplus ERROR: Unspecified error on server


I was trying to test this by using the following command:

	# mount -t nfs -o rdirplus 127.0.0.1:/tank /mnt

Everything worked fine, but maybe there is a bug in passing the rdirplus
flag to the kernel somewhere and I wasn't actually using rdirplus?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Comment 16 Jaakko Heinonen 2009-09-11 10:24:42 UTC
On 2009-09-11, pjd@FreeBSD.org wrote:
> I was trying to test this by using the following command:
>
> # mount -t nfs -o rdirplus 127.0.0.1:/tank /mnt
>
> Everything worked fine, but maybe there is a bug in passing rdirplus
> flag to the kernel somewhere and I wasn't actually using rdirplus?

I tried that exact command and readdirplus worked for me on recent
current. You can use tcpdump(1) to see if it's really used.

-- 
Jaakko
Comment 17 dfilter service freebsd_committer freebsd_triage 2009-09-13 17:05:34 UTC
Author: pjd
Date: Sun Sep 13 16:05:20 2009
New Revision: 197167
URL: http://svn.freebsd.org/changeset/base/197167

Log:
  Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
  by just returning EOPNOTSUPP. This will allow the NFS server to fall back
  to regular READDIR.
  
  Note that converting an inode number to a snapshot's vnode is an expensive
  operation. Snapshots are stored in an AVL tree keyed by their names, not
  their inode numbers, so to convert an inode to a snapshot vnode we have to
  iterate over all snapshots.
  
  This is not a problem in OpenSolaris, because their READDIRPLUS
  implementation uses VOP_LOOKUP() on d_name, instead of VFS_VGET() on
  d_fileno as we do.
  
  PR:		kern/125149
  Reported by:	Weldon Godfrey <wgodfrey@ena.com>
  Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
  MFC after:	3 days

Modified:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Sun Sep 13 15:42:19 2009	(r197166)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Sun Sep 13 16:05:20 2009	(r197167)
@@ -1114,6 +1114,20 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 	znode_t		*zp;
 	int 		err;
 
+	/*
+	 * XXXPJD: zfs_zget() can't operate on virtual entries like .zfs/ or
+	 * .zfs/snapshot/ directories, so for now just return EOPNOTSUPP.
+	 * This will make NFS fall back to using READDIR instead of
+	 * READDIRPLUS.
+	 * Also snapshots are stored in AVL tree, but based on their names,
+	 * not inode numbers, so it will be very inefficient to iterate
+	 * over all snapshots to find the right one.
+	 * Note that OpenSolaris READDIRPLUS implementation does LOOKUP on
+	 * d_name, and not VGET on d_fileno as we do.
+	 */
+	if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR)
+		return (EOPNOTSUPP);
+
 	ZFS_ENTER(zfsvfs);
 	err = zfs_zget(zfsvfs, ino, &zp);
 	if (err == 0 && zp->z_unlinked) {
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 18 dfilter service freebsd_committer freebsd_triage 2009-09-15 12:14:05 UTC
Author: pjd
Date: Tue Sep 15 11:13:40 2009
New Revision: 197215
URL: http://svn.freebsd.org/changeset/base/197215

Log:
  MFC r196456,r196457,r196458,r196662,r196702,r196703,r196919,r196927,r196928,
  r196943,r196944,r196947,r196950,r196953,r196954,r196965,r196978,r196979,
  r196980,r196982,r196985,r196992,r197131,r197133,r197150,r197151,r197152,
  r197153,r197167,r197172,r197177,r197200,r197201:
  
  r196456:
  - Give minclsyspri and maxclsyspri real values (consulted with kmacy).
  - Honour 'pri' argument for thread_create().
  
  r196457:
  Set priority of vdev_geom threads and zvol threads to PRIBIO.
  
  r196458:
  - Hide ZFS kernel threads under zfskern process.
  - Use better (shorter) threads names:
  	'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
  	'vdev:worker da0' -> 'vdev da0'
  
  r196662:
  Add missing mountpoint vnode locking.
  This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
  regular user tries to mount dataset owned by him.
  
  r196702:
  Remove empty directory.
  
  r196703:
  Backport the 'dirtying dbuf' panic fix from newer ZFS version.
  
  Reported by:	Thomas Backman <serenity@exscape.org>
  
  r196919:
  bzero() on-stack argument, so mutex_init() won't misinterpret that the
  lock is already initialized if we have some garbage on the stack.
  
  PR:	kern/135480
  Reported by:	Emil Mikulic <emikulic@gmail.com>
  
  r196927:
  Changing provider size is not really supported by GEOM, but doing so when
  provider is closed should be ok.
  When administrator requests to change ZVOL size do it immediately if ZVOL
  is closed or do it on last ZVOL close.
  
  PR:	kern/136942
  Requested by:	Bernard Buri <bsd@ask-us.at>
  
  r196928:
  Teach zdb(8) how to obtain GEOM provider size.
  
  PR:	kern/133134
  Reported by:	Philipp Wuensche <cryx-freebsd@h3q.com>
  
  r196943:
  - Avoid holding mutex around M_WAITOK allocations.
  - Add locking for mnt_opt field.
  
  r196944:
  Don't recheck ownership on update mount. This will eliminate LOR between
  vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.
  
  Noticed by:	kib
  Reviewed by:	kib
  
  r196947:
  Defer thread start until we set priority.
  
  Reviewed by:	kib
  
  r196950:
  Fix detection of file system being shared. Now zfs unshare/destroy/rename
  command will properly remove exported file systems.
  
  r196953:
  When snapshot mount point is busy (for example we are still in it)
  we will fail to unmount it, but it won't be removed from the tree,
  so in that case there is no need to reinsert it.
  
  Reported by:	trasz
  
  r196954:
  If we have to use avl_find(), optimize a bit and use avl_insert() instead of
  avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).
  Fix similar case in the code that is currently commented out.
  
  r196965:
  Fix reference count leak for a case where snapshot's mount point is updated.
  
  r196978:
  Call ZFS_EXIT() after locking the vnode.
  
  r196979:
  On FreeBSD we don't have to look for snapshot's mount point,
  because fhtovp method is already called with proper mount point.
  
  r196980:
  When we automatically mount snapshot we want to return vnode of the mount point
  from the lookup and not covered vnode. This is one of the fixes for using .zfs/
  over NFS.
  
  r196982:
  We don't export individual snapshots, so mnt_export field in snapshot's
  mount point is NULL. That's why when we try to access snapshots over NFS
  use mnt_export field from the parent file system.
  
  r196985:
  Only log successful commands! Without this fix we log even unsuccessful
  commands executed by unprivileged users. Action is not really taken, but it is
  logged to pool history, which might be confusing.
  
  Reported by:	Denis Ahrens <denis@h3q.com>
  
  r196992:
  Implement __assert() for Solaris-specific code. Until now Solaris code was
  using Solaris prototype for __assert(), but FreeBSD's implementation.
  Both take different arguments, so we were either core-dumping in assert()
  or printing garbage.
  
  Reported by:	avg
  
  r197131:
  Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
  NULL, but also can point to dead vnode, take that into account.
  
  PR:	kern/132068
  Reported by:	Edward Fisk <7ogcg7g02@sneakemail.com>, kris
  Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
  
  r197133:
  - Protect reclaim with z_teardown_inactive_lock.
  - Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
    z_dbuf field is NULL - this might happen in case of rollback or forced
    unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
  - On forced unmount wait for all znodes to be destroyed - destruction can be
    done asynchronously via zfs_reclaim_complete().
  
  r197150:
  There is a bug where mze_insert() can trigger an assert() when inserting
  the same entry twice. This bug is not fixed yet, but it leads to a
  situation where trying to access the corrupted directory panics the
  kernel. Until the bug is properly fixed, try to recover from it and log
  that it happened.
  
  Reported by:	marck
  OpenSolaris bug:	6709336
  
  r197151:
  Be sure not to overflow struct fid.
  
  r197152:
  Extend scope of the z_teardown_lock lock for consistency and "just in case".
  
  r197153:
  When zfs.ko is compiled with debug, make sure that znode and vnode point at
  each other.
  
  r197167:
  Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
  by just returning EOPNOTSUPP. This will allow the NFS server to fall back
  to regular READDIR.
  Note that converting an inode number to a snapshot's vnode is an expensive
  operation. Snapshots are stored in an AVL tree keyed by their names, not
  their inode numbers, so to convert an inode to a snapshot vnode we have to
  iterate over all snapshots.
  This is not a problem in OpenSolaris, because their READDIRPLUS
  implementation uses VOP_LOOKUP() on d_name, instead of VFS_VGET() on
  d_fileno as we do.
  
  PR:	kern/125149
  Reported by:	Weldon Godfrey <wgodfrey@ena.com>
  Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
  
  r197172:
  Add missing \n.
  
  Reported by:	marck
  
  r197177:
  Support both case: when snapshot is already mounted and when it is not yet
  mounted.
  
  r197200:
  Modify mount(8) to skip MNT_IGNORE file systems by default, just like df(1)
  does. This is not POLA violation, because there is no single file system in the
  base that use MNT_IGNORE currently, although ZFS snapshots will be mounted with
  MNT_IGNORE after next commit.
  
  Reviewed by:	kib
  
  r197201:
  - Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
    df(1) and mount(8) output. This is a bit similar to OpenSolaris and follows
    the ZFS route of not listing snapshots by default with 'zfs list' command.
  - Add UPDATING entry to note that ZFS snapshots are no longer visible in
    mount(8) and df(1) output by default.
  
  Reviewed by:	kib
  
  Approved by:	re (bz)

Added:
  stable/8/cddl/compat/opensolaris/include/assert.h
     - copied unchanged from r196992, head/cddl/compat/opensolaris/include/assert.h
Deleted:
  stable/8/cddl/contrib/opensolaris/head/assert.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/rpc/
Modified:
  stable/8/UPDATING
  stable/8/cddl/compat/opensolaris/   (props changed)
  stable/8/cddl/contrib/opensolaris/   (props changed)
  stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c
  stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
  stable/8/sbin/mount/   (props changed)
  stable/8/sbin/mount/mount.8
  stable/8/sbin/mount/mount.c
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
  stable/8/sys/cddl/compat/opensolaris/sys/mutex.h
  stable/8/sys/cddl/compat/opensolaris/sys/proc.h
  stable/8/sys/cddl/compat/opensolaris/sys/vfs.h
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/callb.h
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/UPDATING
==============================================================================
--- stable/8/UPDATING	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/UPDATING	Tue Sep 15 11:13:40 2009	(r197215)
@@ -22,6 +22,10 @@ NOTE TO PEOPLE WHO THINK THAT FreeBSD 8.
 	to maximize performance.  (To disable malloc debugging, run
 	ln -s aj /etc/malloc.conf.)
 
+20090915:
+	ZFS snapshots are now mounted with MNT_IGNORE flag. Use -v option for
+	mount(8) and -a option for df(1) to see them.
+
 20090813:
 	Remove the option STOP_NMI.  The default action is now to use NMI
 	only for KDB via the newly introduced function stop_cpus_hard()

Copied: stable/8/cddl/compat/opensolaris/include/assert.h (from r196992, head/cddl/compat/opensolaris/include/assert.h)
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ stable/8/cddl/compat/opensolaris/include/assert.h	Tue Sep 15 11:13:40 2009	(r197215, copy of r196992, head/cddl/compat/opensolaris/include/assert.h)
@@ -0,0 +1,55 @@
+/*-
+ * Copyright (c) 2009 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#undef assert
+#undef _assert
+
+#ifdef NDEBUG
+#define	assert(e)	((void)0)
+#define	_assert(e)	((void)0)
+#else
+#define	_assert(e)	assert(e)
+
+#define	assert(e)	((e) ? (void)0 : __assert(#e, __FILE__, __LINE__))
+#endif /* NDEBUG */
+
+#ifndef _ASSERT_H_
+#define _ASSERT_H_
+#include <stdio.h>
+#include <stdlib.h>
+
+static __inline void
+__assert(const char *expr, const char *file, int line)
+{
+
+	(void)fprintf(stderr, "Assertion failed: (%s), file %s, line %d.\n",
+	    expr, file, line);
+	abort();
+	/* NOTREACHED */
+}
+#endif /* !_ASSERT_H_ */

Modified: stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c
==============================================================================
--- stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -1322,6 +1322,14 @@ dump_label(const char *dev)
 		exit(1);
 	}
 
+	if (S_ISCHR(statbuf.st_mode)) {
+		if (ioctl(fd, DIOCGMEDIASIZE, &statbuf.st_size) == -1) {
+			(void) printf("failed to get size of '%s': %s\n", dev,
+			    strerror(errno));
+			exit(1);
+		}
+	}
+
 	psize = statbuf.st_size;
 	psize = P2ALIGN(psize, (uint64_t)sizeof (vdev_label_t));
 

Modified: stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
==============================================================================
--- stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -172,6 +172,7 @@ is_shared(libzfs_handle_t *hdl, const ch
 
 		*tab = '\0';
 		if (strcmp(buf, mountpoint) == 0) {
+#if defined(sun)
 			/*
 			 * the protocol field is the third field
 			 * skip over second field
@@ -194,6 +195,10 @@ is_shared(libzfs_handle_t *hdl, const ch
 					return (0);
 				}
 			}
+#else
+			if (proto == PROTO_NFS)
+				return (SHARED_NFS);
+#endif
 		}
 	}
 

Modified: stable/8/sbin/mount/mount.8
==============================================================================
--- stable/8/sbin/mount/mount.8	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sbin/mount/mount.8	Tue Sep 15 11:13:40 2009	(r197215)
@@ -469,6 +469,12 @@ or
 option.
 .It Fl v
 Verbose mode.
+If the
+.Fl v
+is used alone, show all file systems, including those that were mounted with the
+.Dv MNT_IGNORE
+flag and show additional information about each file system (including fsid
+when run by root).
 .It Fl w
 The file system object is to be read and write.
 .El

Modified: stable/8/sbin/mount/mount.c
==============================================================================
--- stable/8/sbin/mount/mount.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sbin/mount/mount.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -348,6 +348,9 @@ main(int argc, char *argv[])
 				if (checkvfsname(mntbuf[i].f_fstypename,
 				    vfslist))
 					continue;
+				if (!verbose &&
+				    (mntbuf[i].f_flags & MNT_IGNORE) != 0)
+					continue;
 				prmount(&mntbuf[i]);
 			}
 		}

Modified: stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -45,20 +45,33 @@ vfs_setmntopt(vfs_t *vfsp, const char *n
 {
 	struct vfsopt *opt;
 	size_t namesize;
+	int locked;
+
+	if (!(locked = mtx_owned(MNT_MTX(vfsp))))
+		MNT_ILOCK(vfsp);
 
 	if (vfsp->mnt_opt == NULL) {
-		vfsp->mnt_opt = malloc(sizeof(*vfsp->mnt_opt), M_MOUNT, M_WAITOK);
-		TAILQ_INIT(vfsp->mnt_opt);
+		void *opts;
+
+		MNT_IUNLOCK(vfsp);
+		opts = malloc(sizeof(*vfsp->mnt_opt), M_MOUNT, M_WAITOK);
+		MNT_ILOCK(vfsp);
+		if (vfsp->mnt_opt == NULL) {
+			vfsp->mnt_opt = opts;
+			TAILQ_INIT(vfsp->mnt_opt);
+		} else {
+			free(opts, M_MOUNT);
+		}
 	}
 
-	opt = malloc(sizeof(*opt), M_MOUNT, M_WAITOK);
+	MNT_IUNLOCK(vfsp);
 
+	opt = malloc(sizeof(*opt), M_MOUNT, M_WAITOK);
 	namesize = strlen(name) + 1;
 	opt->name = malloc(namesize, M_MOUNT, M_WAITOK);
 	strlcpy(opt->name, name, namesize);
 	opt->pos = -1;
 	opt->seen = 1;
-
 	if (arg == NULL) {
 		opt->value = NULL;
 		opt->len = 0;
@@ -67,16 +80,23 @@ vfs_setmntopt(vfs_t *vfsp, const char *n
 		opt->value = malloc(opt->len, M_MOUNT, M_WAITOK);
 		bcopy(arg, opt->value, opt->len);
 	}
-	/* TODO: Locking. */
+
+	MNT_ILOCK(vfsp);
 	TAILQ_INSERT_TAIL(vfsp->mnt_opt, opt, link);
+	if (!locked)
+		MNT_IUNLOCK(vfsp);
 }
 
 void
 vfs_clearmntopt(vfs_t *vfsp, const char *name)
 {
+	int locked;
 
-	/* TODO: Locking. */
+	if (!(locked = mtx_owned(MNT_MTX(vfsp))))
+		MNT_ILOCK(vfsp);
 	vfs_deleteopt(vfsp->mnt_opt, name);
+	if (!locked)
+		MNT_IUNLOCK(vfsp);
 }
 
 int
@@ -92,12 +112,13 @@ vfs_optionisset(const vfs_t *vfsp, const
 }
 
 int
-domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
+mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype, char *fspath,
     char *fspec, int fsflags)
 {
 	struct mount *mp;
 	struct vfsconf *vfsp;
 	struct ucred *cr;
+	vnode_t *vp;
 	int error;
 
 	/*
@@ -112,23 +133,28 @@ domount(kthread_t *td, vnode_t *vp, cons
 	if (vfsp == NULL)
 		return (ENODEV);
 
+	vp = *vpp;
 	if (vp->v_type != VDIR)
 		return (ENOTDIR);
+	/*
+	 * We need vnode lock to protect v_mountedhere and vnode interlock
+	 * to protect v_iflag.
+	 */
+	vn_lock(vp, LK_SHARED | LK_RETRY);
 	VI_LOCK(vp);
-	if ((vp->v_iflag & VI_MOUNT) != 0 ||
-	    vp->v_mountedhere != NULL) {
+	if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) {
 		VI_UNLOCK(vp);
+		VOP_UNLOCK(vp, 0);
 		return (EBUSY);
 	}
 	vp->v_iflag |= VI_MOUNT;
 	VI_UNLOCK(vp);
+	VOP_UNLOCK(vp, 0);
 
 	/*
 	 * Allocate and initialize the filesystem.
 	 */
-	vn_lock(vp, LK_SHARED | LK_RETRY);
 	mp = vfs_mount_alloc(vp, vfsp, fspath, td->td_ucred);
-	VOP_UNLOCK(vp, 0);
 
 	mp->mnt_optnew = NULL;
 	vfs_setmntopt(mp, "from", fspec, 0);
@@ -138,11 +164,18 @@ domount(kthread_t *td, vnode_t *vp, cons
 	/*
 	 * Set the mount level flags.
 	 */
-	if (fsflags & MNT_RDONLY)
-		mp->mnt_flag |= MNT_RDONLY;
-	mp->mnt_flag &=~ MNT_UPDATEMASK;
+	mp->mnt_flag &= ~MNT_UPDATEMASK;
 	mp->mnt_flag |= fsflags & (MNT_UPDATEMASK | MNT_FORCE | MNT_ROOTFS);
 	/*
+	 * Snapshots are always read-only.
+	 */
+	mp->mnt_flag |= MNT_RDONLY;
+	/*
+	 * We don't want snapshots to be visible in regular
+	 * mount(8) and df(1) output.
+	 */
+	mp->mnt_flag |= MNT_IGNORE;
+	/*
 	 * Unprivileged user can trigger mounting a snapshot, but we don't want
 	 * him to unmount it, so we switch to privileged of original mount.
 	 */
@@ -150,11 +183,6 @@ domount(kthread_t *td, vnode_t *vp, cons
 	mp->mnt_cred = crdup(vp->v_mount->mnt_cred);
 	mp->mnt_stat.f_owner = mp->mnt_cred->cr_uid;
 	/*
-	 * Mount the filesystem.
-	 * XXX The final recipients of VFS_MOUNT just overwrite the ndp they
-	 * get.  No freeing of cn_pnbuf.
-	 */
-	/*
 	 * XXX: This is evil, but we can't mount a snapshot as a regular user.
 	 * XXX: Is is safe when snapshot is mounted from within a jail?
 	 */
@@ -163,7 +191,7 @@ domount(kthread_t *td, vnode_t *vp, cons
 	error = VFS_MOUNT(mp);
 	td->td_ucred = cr;
 
-	if (!error) {
+	if (error == 0) {
 		if (mp->mnt_opt != NULL)
 			vfs_freeopts(mp->mnt_opt);
 		mp->mnt_opt = mp->mnt_optnew;
@@ -175,42 +203,33 @@ domount(kthread_t *td, vnode_t *vp, cons
 	*/
 	mp->mnt_optnew = NULL;
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-	/*
-	 * Put the new filesystem on the mount list after root.
-	 */
 #ifdef FREEBSD_NAMECACHE
 	cache_purge(vp);
 #endif
-	if (!error) {
+	VI_LOCK(vp);
+	vp->v_iflag &= ~VI_MOUNT;
+	VI_UNLOCK(vp);
+	if (error == 0) {
 		vnode_t *mvp;
 
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
 		vp->v_mountedhere = mp;
+		/*
+		 * Put the new filesystem on the mount list.
+		 */
 		mtx_lock(&mountlist_mtx);
 		TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list);
 		mtx_unlock(&mountlist_mtx);
 		vfs_event_signal(NULL, VQ_MOUNT, 0);
 		if (VFS_ROOT(mp, LK_EXCLUSIVE, &mvp))
 			panic("mount: lost mount");
-		mountcheckdirs(vp, mvp);
-		vput(mvp);
-		VOP_UNLOCK(vp, 0);
-		if ((mp->mnt_flag & MNT_RDONLY) == 0)
-			error = vfs_allocate_syncvnode(mp);
+		vput(vp);
 		vfs_unbusy(mp);
-		if (error)
-			vrele(vp);
-		else
-			vfs_mountedfrom(mp, fspec);
+		*vpp = mvp;
 	} else {
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
-		VOP_UNLOCK(vp, 0);
+		vput(vp);
 		vfs_unbusy(mp);
 		vfs_mount_destroy(mp);
+		*vpp = NULL;
 	}
 	return (error);
 }

Modified: stable/8/sys/cddl/compat/opensolaris/sys/mutex.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/mutex.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/mutex.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -32,9 +32,9 @@
 #ifdef _KERNEL
 
 #include <sys/param.h>
-#include <sys/proc.h>
 #include <sys/lock.h>
 #include_next <sys/mutex.h>
+#include <sys/proc.h>
 #include <sys/sx.h>
 
 typedef enum {

Modified: stable/8/sys/cddl/compat/opensolaris/sys/proc.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/proc.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/proc.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -34,13 +34,17 @@
 #include_next <sys/proc.h>
 #include <sys/stdint.h>
 #include <sys/smp.h>
+#include <sys/sched.h>
+#include <sys/lock.h>
+#include <sys/mutex.h>
+#include <sys/unistd.h>
 #include <sys/debug.h>
 
 #ifdef _KERNEL
 
 #define	CPU		curcpu
-#define	minclsyspri	0
-#define	maxclsyspri	0
+#define	minclsyspri	PRIBIO
+#define	maxclsyspri	PVM
 #define	max_ncpus	mp_ncpus
 #define	boot_max_ncpus	mp_ncpus
 
@@ -54,11 +58,13 @@ typedef	struct thread	kthread_t;
 typedef struct thread	*kthread_id_t;
 typedef struct proc	proc_t;
 
+extern struct proc *zfsproc;
+
 static __inline kthread_t *
 thread_create(caddr_t stk, size_t stksize, void (*proc)(void *), void *arg,
     size_t len, proc_t *pp, int state, pri_t pri)
 {
-	proc_t *p;
+	kthread_t *td = NULL;
 	int error;
 
 	/*
@@ -67,13 +73,20 @@ thread_create(caddr_t stk, size_t stksiz
 	ASSERT(stk == NULL);
 	ASSERT(len == 0);
 	ASSERT(state == TS_RUN);
+	ASSERT(pp == &p0);
 
-	error = kproc_create(proc, arg, &p, 0, stksize / PAGE_SIZE,
-	    "solthread %p", proc);
-	return (error == 0 ? FIRST_THREAD_IN_PROC(p) : NULL);
+	error = kproc_kthread_add(proc, arg, &zfsproc, &td, RFSTOPPED,
+	    stksize / PAGE_SIZE, "zfskern", "solthread %p", proc);
+	if (error == 0) {
+		thread_lock(td);
+		sched_prio(td, pri);
+		sched_add(td, SRQ_BORING);
+		thread_unlock(td);
+	}
+	return (td);
 }
 
-#define	thread_exit()	kproc_exit(0)
+#define	thread_exit()	kthread_exit()
 
 #endif	/* _KERNEL */
 

Modified: stable/8/sys/cddl/compat/opensolaris/sys/vfs.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/vfs.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/vfs.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -110,8 +110,8 @@ void vfs_setmntopt(vfs_t *vfsp, const ch
     int flags __unused);
 void vfs_clearmntopt(vfs_t *vfsp, const char *name);
 int vfs_optionisset(const vfs_t *vfsp, const char *opt, char **argp);
-int domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
-    char *fspec, int fsflags);
+int mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype,
+    char *fspath, char *fspec, int fsflags);
 
 typedef	uint64_t	vfs_feature_t;
 

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -864,10 +864,11 @@ restore_object(struct restorearg *ra, ob
 		/* currently allocated, want to be allocated */
 		dmu_tx_hold_bonus(tx, drro->drr_object);
 		/*
-		 * We may change blocksize, so need to
-		 * hold_write
+		 * We may change blocksize and delete old content,
+		 * so need to hold_write and hold_free.
 		 */
 		dmu_tx_hold_write(tx, drro->drr_object, 0, 1);
+		dmu_tx_hold_free(tx, drro->drr_object, 0, DMU_OBJECT_END);
 		err = dmu_tx_assign(tx, TXG_WAIT);
 		if (err) {
 			dmu_tx_abort(tx);

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -415,7 +415,7 @@ void
 dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
     dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
 {
-	int i, old_nblkptr;
+	int i, nblkptr;
 	dmu_buf_impl_t *db = NULL;
 
 	ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
@@ -445,6 +445,8 @@ dnode_reallocate(dnode_t *dn, dmu_object
 		dnode_free_range(dn, 0, -1ULL, tx);
 	}
 
+	nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
+
 	/* change blocksize */
 	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
 	if (blocksize != dn->dn_datablksz &&
@@ -457,6 +459,8 @@ dnode_reallocate(dnode_t *dn, dmu_object
 	dnode_setdirty(dn, tx);
 	dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
 	dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
+	if (dn->dn_nblkptr != nblkptr)
+		dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
 	rw_exit(&dn->dn_struct_rwlock);
 	if (db)
 		dbuf_rele(db, FTAG);
@@ -466,19 +470,15 @@ dnode_reallocate(dnode_t *dn, dmu_object
 
 	/* change bonus size and type */
 	mutex_enter(&dn->dn_mtx);
-	old_nblkptr = dn->dn_nblkptr;
 	dn->dn_bonustype = bonustype;
 	dn->dn_bonuslen = bonuslen;
-	dn->dn_nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
+	dn->dn_nblkptr = nblkptr;
 	dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
 	dn->dn_compress = ZIO_COMPRESS_INHERIT;
 	ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
 
-	/* XXX - for now, we can't make nblkptr smaller */
-	ASSERT3U(dn->dn_nblkptr, >=, old_nblkptr);
-
-	/* fix up the bonus db_size if dn_nblkptr has changed */
-	if (dn->dn_bonus && dn->dn_bonuslen != old_nblkptr) {
+	/* fix up the bonus db_size */
+	if (dn->dn_bonus) {
 		dn->dn_bonus->db.db_size =
 		    DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
 		ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,12 +19,10 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
-#pragma ident	"%Z%%M%	%I%	%E% SMI"
-
 #include <sys/zfs_context.h>
 #include <sys/dbuf.h>
 #include <sys/dnode.h>
@@ -534,18 +532,12 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 			/* XXX shouldn't the phys already be zeroed? */
 			bzero(dnp, DNODE_CORE_SIZE);
 			dnp->dn_nlevels = 1;
+			dnp->dn_nblkptr = dn->dn_nblkptr;
 		}
 
-		if (dn->dn_nblkptr > dnp->dn_nblkptr) {
-			/* zero the new blkptrs we are gaining */
-			bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
-			    sizeof (blkptr_t) *
-			    (dn->dn_nblkptr - dnp->dn_nblkptr));
-		}
 		dnp->dn_type = dn->dn_type;
 		dnp->dn_bonustype = dn->dn_bonustype;
 		dnp->dn_bonuslen = dn->dn_bonuslen;
-		dnp->dn_nblkptr = dn->dn_nblkptr;
 	}
 
 	ASSERT(dnp->dn_nlevels > 1 ||
@@ -605,6 +597,30 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 		return;
 	}
 
+	if (dn->dn_next_nblkptr[txgoff]) {
+		/* this should only happen on a realloc */
+		ASSERT(dn->dn_allocated_txg == tx->tx_txg);
+		if (dn->dn_next_nblkptr[txgoff] > dnp->dn_nblkptr) {
+			/* zero the new blkptrs we are gaining */
+			bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
+			    sizeof (blkptr_t) *
+			    (dn->dn_next_nblkptr[txgoff] - dnp->dn_nblkptr));
+#ifdef ZFS_DEBUG
+		} else {
+			int i;
+			ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
+			/* the blkptrs we are losing better be unallocated */
+			for (i = dn->dn_next_nblkptr[txgoff];
+			    i < dnp->dn_nblkptr; i++)
+				ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
+#endif
+		}
+		mutex_enter(&dn->dn_mtx);
+		dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
+		dn->dn_next_nblkptr[txgoff] = 0;
+		mutex_exit(&dn->dn_mtx);
+	}
+
 	if (dn->dn_next_nlevels[txgoff]) {
 		dnode_increase_indirection(dn, tx);
 		dn->dn_next_nlevels[txgoff] = 0;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -1419,6 +1419,7 @@ dsl_dataset_drain_refs(dsl_dataset_t *ds
 {
 	struct refsarg arg;
 
+	bzero(&arg, sizeof(arg));
 	mutex_init(&arg.lock, NULL, MUTEX_DEFAULT, NULL);
 	cv_init(&arg.cv, NULL, CV_DEFAULT, NULL);
 	arg.gone = FALSE;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -160,6 +160,7 @@ typedef struct dnode {
 	uint16_t dn_datablkszsec;	/* in 512b sectors */
 	uint32_t dn_datablksz;		/* in bytes */
 	uint64_t dn_maxblkid;
+	uint8_t dn_next_nblkptr[TXG_SIZE];
 	uint8_t dn_next_nlevels[TXG_SIZE];
 	uint8_t dn_next_indblkshift[TXG_SIZE];
 	uint16_t dn_next_bonuslen[TXG_SIZE];

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -231,8 +231,27 @@ typedef struct znode {
 /*
  * Convert between znode pointers and vnode pointers
  */
+#ifdef DEBUG
+static __inline vnode_t *
+ZTOV(znode_t *zp)
+{
+	vnode_t *vp = zp->z_vnode;
+
+	ASSERT(vp == NULL || vp->v_data == NULL || vp->v_data == zp);
+	return (vp);
+}
+static __inline znode_t *
+VTOZ(vnode_t *vp)
+{
+	znode_t *zp = (znode_t *)vp->v_data;
+
+	ASSERT(zp == NULL || zp->z_vnode == NULL || zp->z_vnode == vp);
+	return (zp);
+}
+#else
 #define	ZTOV(ZP)	((ZP)->z_vnode)
 #define	VTOZ(VP)	((znode_t *)(VP)->v_data)
+#endif
 
 /*
  * ZFS_ENTER() is called on entry to each ZFS vnode and vfs operation.

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -194,6 +194,10 @@ vdev_geom_worker(void *arg)
 	zio_t *zio;
 	struct bio *bp;
 
+	thread_lock(curthread);
+	sched_prio(curthread, PRIBIO);
+	thread_unlock(curthread);
+
 	ctx = arg;
 	for (;;) {
 		mtx_lock(&ctx->gc_queue_mtx);
@@ -203,7 +207,7 @@ vdev_geom_worker(void *arg)
 				ctx->gc_state = 2;
 				wakeup_one(&ctx->gc_state);
 				mtx_unlock(&ctx->gc_queue_mtx);
-				kproc_exit(0);
+				kthread_exit();
 			}
 			msleep(&ctx->gc_queue, &ctx->gc_queue_mtx,
 			    PRIBIO | PDROP, "vgeom:io", 0);
@@ -530,8 +534,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psi
 	vd->vdev_tsd = ctx;
 	pp = cp->provider;
 
-	kproc_create(vdev_geom_worker, ctx, NULL, 0, 0, "vdev:worker %s",
-	    pp->name);
+	kproc_kthread_add(vdev_geom_worker, ctx, &zfsproc, NULL, 0, 0,
+	    "zfskern", "vdev %s", pp->name);
 
 	/*
 	 * Determine the actual size of the device.

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -181,10 +181,11 @@ mze_compare(const void *arg1, const void
 	return (0);
 }
 
-static void
+static int
 mze_insert(zap_t *zap, int chunkid, uint64_t hash, mzap_ent_phys_t *mzep)
 {
 	mzap_ent_t *mze;
+	avl_index_t idx;
 
 	ASSERT(zap->zap_ismicro);
 	ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
@@ -194,7 +195,12 @@ mze_insert(zap_t *zap, int chunkid, uint
 	mze->mze_chunkid = chunkid;
 	mze->mze_hash = hash;
 	mze->mze_phys = *mzep;
-	avl_add(&zap->zap_m.zap_avl, mze);
+	if (avl_find(&zap->zap_m.zap_avl, mze, &idx) != NULL) {
+		kmem_free(mze, sizeof (mzap_ent_t));
+		return (EEXIST);
+	}
+	avl_insert(&zap->zap_m.zap_avl, mze, idx);
+	return (0);
 }
 
 static mzap_ent_t *
@@ -329,10 +335,15 @@ mzap_open(objset_t *os, uint64_t obj, dm
 			if (mze->mze_name[0]) {
 				zap_name_t *zn;
 
-				zap->zap_m.zap_num_entries++;
 				zn = zap_name_alloc(zap, mze->mze_name,
 				    MT_EXACT);
-				mze_insert(zap, i, zn->zn_hash, mze);
+				if (mze_insert(zap, i, zn->zn_hash, mze) == 0)
+					zap->zap_m.zap_num_entries++;
+				else {
+					printf("ZFS WARNING: Duplicated ZAP "
+					    "entry detected (%s).\n",
+					    mze->mze_name);
+				}
 				zap_name_free(zn);
 			}
 		}
@@ -771,7 +782,7 @@ again:
 			if (zap->zap_m.zap_alloc_next ==
 			    zap->zap_m.zap_num_chunks)
 				zap->zap_m.zap_alloc_next = 0;
-			mze_insert(zap, i, zn->zn_hash, mze);
+			VERIFY(0 == mze_insert(zap, i, zn->zn_hash, mze));
 			return;
 		}
 	}

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -669,9 +669,12 @@ zfsctl_snapdir_remove(vnode_t *dvp, char
 	if (sep) {
 		avl_remove(&sdp->sd_snaps, sep);
 		err = zfsctl_unmount_snap(sep, MS_FORCE, cr);
-		if (err)
-			avl_add(&sdp->sd_snaps, sep);
-		else
+		if (err) {
+			avl_index_t where;
+
+			if (avl_find(&sdp->sd_snaps, sep, &where) == NULL)
+				avl_insert(&sdp->sd_snaps, sep, where);
+		} else
 			err = dmu_objset_destroy(snapname);
 	} else {
 		err = ENOENT;
@@ -877,20 +880,20 @@ domount:
 	mountpoint = kmem_alloc(mountpoint_len, KM_SLEEP);
 	(void) snprintf(mountpoint, mountpoint_len, "%s/.zfs/snapshot/%s",
 	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
-	err = domount(curthread, *vpp, "zfs", mountpoint, snapname, 0);
+	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
 	kmem_free(mountpoint, mountpoint_len);
-	/* FreeBSD: This line was moved from below to avoid a lock recursion. */
-	if (err == 0)
-		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
-	mutex_exit(&sdp->sd_lock);
-	/*
-	 * If we had an error, drop our hold on the vnode and
-	 * zfsctl_snapshot_inactive() will clean up.
-	 */
-	if (err) {
-		VN_RELE(*vpp);
-		*vpp = NULL;
+	if (err == 0) {
+		/*
+		 * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
+		 *
+		 * This is where we lie about our v_vfsp in order to
+		 * make .zfs/snapshot/<snapname> accessible over NFS
+		 * without requiring manual mounts of <snapname>.
+		 */
+		ASSERT(VTOZ(*vpp)->z_zfsvfs != zfsvfs);
+		VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
 	}
+	mutex_exit(&sdp->sd_lock);
 	ZFS_EXIT(zfsvfs);
 	return (err);
 }
@@ -1344,7 +1347,17 @@ zfsctl_umount_snapshots(vfs_t *vfsp, int
 		if (vn_ismntpt(sep->se_root)) {
 			error = zfsctl_unmount_snap(sep, fflags, cr);
 			if (error) {
-				avl_add(&sdp->sd_snaps, sep);
+				avl_index_t where;
+
+				/*
+				 * Before reinserting snapshot to the tree,
+				 * check if it was actually removed. For example
+				 * when snapshot mount point is busy, we will
+				 * have an error here, but there will be no need
+				 * to reinsert snapshot.
+				 */
+				if (avl_find(&sdp->sd_snaps, sep, &where) == NULL)
+					avl_insert(&sdp->sd_snaps, sep, where);
 				break;
 			}
 		}

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -3021,8 +3021,10 @@ zfsdev_ioctl(struct cdev *dev, u_long cm
 	if (error == 0)
 		error = zfs_ioc_vec[vec].zvec_func(zc);
 
-	if (zfs_ioc_vec[vec].zvec_his_log == B_TRUE)
-		zfs_log_history(zc);
+	if (error == 0) {
+		if (zfs_ioc_vec[vec].zvec_his_log == B_TRUE)
+			zfs_log_history(zc);
+	}
 
 	return (error);
 }
@@ -3057,6 +3059,7 @@ zfsdev_fini(void)
 }
 
 static struct root_hold_token *zfs_root_token;
+struct proc *zfsproc;
 
 uint_t zfs_fsyncer_key;
 extern uint_t rrw_tsd_key;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -97,6 +97,8 @@ static int zfs_root(vfs_t *vfsp, int fla
 static int zfs_statfs(vfs_t *vfsp, struct statfs *statp);
 static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp);
 static int zfs_sync(vfs_t *vfsp, int waitfor);
+static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp, int *numsecflavors, int **secflavors);
 static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp);
 static void zfs_objset_close(zfsvfs_t *zfsvfs);
 static void zfs_freevfs(vfs_t *vfsp);
@@ -108,6 +110,7 @@ static struct vfsops zfs_vfsops = {
 	.vfs_statfs =		zfs_statfs,
 	.vfs_vget =		zfs_vget,
 	.vfs_sync =		zfs_sync,
+	.vfs_checkexp =		zfs_checkexp,
 	.vfs_fhtovp =		zfs_fhtovp,
 };
 
@@ -337,6 +340,13 @@ zfs_register_callbacks(vfs_t *vfsp)
 	os = zfsvfs->z_os;
 
 	/*
+	 * This function can be called for a snapshot when we update snapshot's
+	 * mount point, which isn't really supported.
+	 */
+	if (dmu_objset_is_snapshot(os))
+		return (EOPNOTSUPP);
+
+	/*
 	 * The act of registering our callbacks will destroy any mount
 	 * options we may have.  In order to enable temporary overrides
 	 * of mount options, we stash away the current values and
@@ -719,7 +729,10 @@ zfs_mount(vfs_t *vfsp)
 	error = secpolicy_fs_mount(cr, mvp, vfsp);
 	if (error) {
 		error = dsl_deleg_access(osname, ZFS_DELEG_PERM_MOUNT, cr);
-		if (error == 0) {
+		if (error != 0)
+			goto out;
+
+		if (!(vfsp->vfs_flag & MS_REMOUNT)) {
 			vattr_t		vattr;
 
 			/*
@@ -729,7 +742,9 @@ zfs_mount(vfs_t *vfsp)
 
 			vattr.va_mask = AT_UID;
 
+			vn_lock(mvp, LK_SHARED | LK_RETRY);
 			if (error = VOP_GETATTR(mvp, &vattr, cr)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
 
@@ -741,18 +756,19 @@ zfs_mount(vfs_t *vfsp)
 			}
 #else
 			if (error = secpolicy_vnode_owner(mvp, cr, vattr.va_uid)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
 
 			if (error = VOP_ACCESS(mvp, VWRITE, cr, td)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
+			VOP_UNLOCK(mvp, 0);
 #endif
-
-			secpolicy_fs_mount_clearopts(cr, vfsp);
-		} else {
-			goto out;
 		}
+
+		secpolicy_fs_mount_clearopts(cr, vfsp);
 	}
 
 	/*
@@ -931,6 +947,18 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea
 		zfsvfs->z_unmounted = B_TRUE;
 		rrw_exit(&zfsvfs->z_teardown_lock, FTAG);
 		rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+#ifdef __FreeBSD__
+		/*
+		 * Some znodes might not be fully reclaimed, wait for them.
+		 */
+		mutex_enter(&zfsvfs->z_znodes_lock);
+		while (list_head(&zfsvfs->z_all_znodes) != NULL) {
+			msleep(zfsvfs, &zfsvfs->z_znodes_lock, 0,
+			    "zteardown", 0);
+		}
+		mutex_exit(&zfsvfs->z_znodes_lock);
+#endif
 	}
 
 	/*
@@ -1086,6 +1114,20 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 	znode_t		*zp;
 	int 		err;
 
+	/*
+	 * XXXPJD: zfs_zget() can't operate on virtual entires like .zfs/ or
+	 * .zfs/snapshot/ directories, so for now just return EOPNOTSUPP.
+	 * This will make NFS to fall back to using READDIR instead of
+	 * READDIRPLUS.
+	 * Also snapshots are stored in AVL tree, but based on their names,
+	 * not inode numbers, so it will be very inefficient to iterate
+	 * over all snapshots to find the right one.
+	 * Note that OpenSolaris READDIRPLUS implementation does LOOKUP on
+	 * d_name, and not VGET on d_fileno as we do.
+	 */
+	if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR)
+		return (EOPNOTSUPP);
+
 	ZFS_ENTER(zfsvfs);
 	err = zfs_zget(zfsvfs, ino, &zp);
 	if (err == 0 && zp->z_unlinked) {
@@ -1103,6 +1145,28 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 }
 
 static int
+zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp, int *numsecflavors, int **secflavors)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+
+	/*
+	 * If this is regular file system vfsp is the same as
+	 * zfsvfs->z_parent->z_vfs, but if it is snapshot,
+	 * zfsvfs->z_parent->z_vfs represents parent file system
+	 * which we have to use here, because only this file system
+	 * has mnt_export configured.
+	 */
+	vfsp = zfsvfs->z_parent->z_vfs;
+
+	return (vfs_stdcheckexp(zfsvfs->z_parent->z_vfs, nam, extflagsp,
+	    credanonp, numsecflavors, secflavors));
+}
+
+CTASSERT(SHORT_FID_LEN <= sizeof(struct fid));
+CTASSERT(LONG_FID_LEN <= sizeof(struct fid));
+
+static int
 zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp)
 {
 	zfsvfs_t	*zfsvfs = vfsp->vfs_data;
@@ -1117,7 +1181,11 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 
 	ZFS_ENTER(zfsvfs);
 
-	if (fidp->fid_len == LONG_FID_LEN) {
+	/*
+	 * On FreeBSD we can get snapshot's mount point or its parent file
+	 * system mount point depending if snapshot is already mounted or not.
+	 */
+	if (zfsvfs->z_parent == zfsvfs && fidp->fid_len == LONG_FID_LEN) {
 		zfid_long_t	*zlfid = (zfid_long_t *)fidp;
 		uint64_t	objsetid = 0;
 		uint64_t	setgen = 0;
@@ -1160,9 +1228,8 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 		} else {
 			VN_HOLD(*vpp);
 		}
-		ZFS_EXIT(zfsvfs);
-		/* XXX: LK_RETRY? */
 		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
+		ZFS_EXIT(zfsvfs);
 		return (0);
 	}
 
@@ -1184,7 +1251,6 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 	}
 
 	*vpp = ZTOV(zp);
-	/* XXX: LK_RETRY? */
 	vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
 	vnode_create_vobject(*vpp, zp->z_phys->zp_size, curthread);
 	ZFS_EXIT(zfsvfs);

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 19 dfilter service freebsd_committer freebsd_triage 2010-01-06 16:10:16 UTC
Author: netchild
Date: Wed Jan  6 16:09:58 2010
New Revision: 201651
URL: http://svn.freebsd.org/changeset/base/201651

Log:
  MFC several ZFS related commits:
  
  r196980:
  ---snip---
      When we automatically mount a snapshot we want to return the vnode of the
      mount point from the lookup, not the covered vnode. This is one of the
      fixes for using .zfs/ over NFS.
  ---snip---
  
  r196982:
  ---snip---
      We don't export individual snapshots, so the mnt_export field in a
      snapshot's mount point is NULL. That's why, when we try to access
      snapshots over NFS, we use the mnt_export field from the parent file
      system.
  ---snip---
  
  r197131:
  ---snip---
      Tighten up the check for a race in zfs_zget() - ZTOV(zp) can not only
      contain NULL, but can also point to a dead vnode; take that into account.
  
      PR:				kern/132068
      Reported by:		Edward Fisk <7ogcg7g02@sneakemail.com>, kris
      Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
  ---snip---
  
  r197133:
  ---snip---
      - Protect reclaim with z_teardown_inactive_lock.
      - Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
        z_dbuf field is NULL - this might happen in case of rollback or forced
        unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
      - On forced unmount wait for all znodes to be destroyed - destruction can be
        done asynchronously via zfs_reclaim_complete().
  ---snip---
  
  r197153:
  ---snip---
      When zfs.ko is compiled with debug, make sure that znode and vnode point at
      each other.
  ---snip---
  
  r197167:
  ---snip---
      Work around the READDIRPLUS problem with the .zfs/ and .zfs/snapshot/
      directories by just returning EOPNOTSUPP. This allows the NFS server to
      fall back to regular READDIR.

      Note that converting an inode number to a snapshot's vnode is an
      expensive operation. Snapshots are stored in an AVL tree keyed by their
      names, not inode numbers, so to convert an inode to a snapshot vnode we
      have to iterate over all snapshots.
  
      This is not a problem in OpenSolaris, because in their READDIRPLUS
      implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
      d_fileno as we do.
  
      PR:			kern/125149
      Reported by:	Weldon Godfrey <wgodfrey@ena.com>
      Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
  ---snip---
  
  r197177:
  ---snip---
      Support both cases: when the snapshot is already mounted and when it is
      not yet mounted.
  ---snip---
  
  r197201:
  ---snip---
      - Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
        df(1) and mount(8) output. This is a bit similar to OpenSolaris and follows
        ZFS route of not listing snapshots by default with 'zfs list' command.
      - Add UPDATING entry to note that ZFS snapshots are no longer visible in
        mount(8) and df(1) output by default.
  
      Reviewed by:	kib
  ---snip---
  Note: the MNT_IGNORE part is commented out in this commit and the UPDATING
  entry is not merged, as this would be a POLA violation on a stable branch.
  This revision is included here, as it also makes locking changes and makes
  sure that a snapshot is mounted RO.
  
  r197426:
  ---snip---
      Restore BSD behaviour - when creating a new directory entry, use the
      parent directory's gid to set group ownership, not the process gid.
  
      This was overlooked during v6 -> v13 switch.
  
      PR:			kern/139076
      Reported by:	Sean Winn <sean@gothic.net.au>
  ---snip---
  
  r197458:
  ---snip---
      Close a race in zfs_zget(). We have to increase the usecount first and
      then check the VI_DOOMED flag. Before this change the vnode could be
      reclaimed between checking the flag and increasing the usecount.
  ---snip---

Modified:
  stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
  stable/7/sys/cddl/compat/opensolaris/sys/vfs.h
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/cddl/contrib/opensolaris/   (props changed)
  stable/7/sys/contrib/dev/acpica/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
==============================================================================
--- stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -115,12 +115,13 @@ extern struct mount *vfs_mount_alloc(str
     const char *fspath, struct thread *td);
 
 int
-domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
+mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype, char *fspath,
     char *fspec, int fsflags)
 {
 	struct mount *mp;
 	struct vfsconf *vfsp;
 	struct ucred *cr;
+	vnode_t *vp;
 	int error;
 
 	/*
@@ -135,23 +136,28 @@ domount(kthread_t *td, vnode_t *vp, cons
 	if (vfsp == NULL)
 		return (ENODEV);
 
+	vp = *vpp;
 	if (vp->v_type != VDIR)
 		return (ENOTDIR);
+	/*
+	 * We need vnode lock to protect v_mountedhere and vnode interlock
+	 * to protect v_iflag.
+	 */
+	vn_lock(vp, LK_SHARED | LK_RETRY, td);
 	VI_LOCK(vp);
-	if ((vp->v_iflag & VI_MOUNT) != 0 ||
-	    vp->v_mountedhere != NULL) {
+	if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) {
 		VI_UNLOCK(vp);
+		VOP_UNLOCK(vp, 0, td);
 		return (EBUSY);
 	}
 	vp->v_iflag |= VI_MOUNT;
 	VI_UNLOCK(vp);
+	VOP_UNLOCK(vp, 0, td);
 
 	/*
 	 * Allocate and initialize the filesystem.
 	 */
-	vn_lock(vp, LK_SHARED | LK_RETRY, td);
 	mp = vfs_mount_alloc(vp, vfsp, fspath, td);
-	VOP_UNLOCK(vp, 0,td);
 
 	mp->mnt_optnew = NULL;
 	vfs_setmntopt(mp, "from", fspec, 0);
@@ -161,11 +167,20 @@ domount(kthread_t *td, vnode_t *vp, cons
 	/*
 	 * Set the mount level flags.
 	 */
-	if (fsflags & MNT_RDONLY)
-		mp->mnt_flag |= MNT_RDONLY;
-	mp->mnt_flag &=~ MNT_UPDATEMASK;
+	mp->mnt_flag &= ~MNT_UPDATEMASK;
 	mp->mnt_flag |= fsflags & (MNT_UPDATEMASK | MNT_FORCE | MNT_ROOTFS);
 	/*
+	 * Snapshots are always read-only.
+	 */
+	mp->mnt_flag |= MNT_RDONLY;
+#if 0
+	/*
+	 * We don't want snapshots to be visible in regular
+	 * mount(8) and df(1) output.
+	 */
+	mp->mnt_flag |= MNT_IGNORE;
+#endif
+	/*
 	 * Unprivileged user can trigger mounting a snapshot, but we don't want
 	 * him to unmount it, so we switch to privileged of original mount.
 	 */
@@ -173,11 +188,6 @@ domount(kthread_t *td, vnode_t *vp, cons
 	mp->mnt_cred = crdup(vp->v_mount->mnt_cred);
 	mp->mnt_stat.f_owner = mp->mnt_cred->cr_uid;
 	/*
-	 * Mount the filesystem.
-	 * XXX The final recipients of VFS_MOUNT just overwrite the ndp they
-	 * get.  No freeing of cn_pnbuf.
-	 */
-	/*
 	 * XXX: This is evil, but we can't mount a snapshot as a regular user.
 	 * XXX: Is is safe when snapshot is mounted from within a jail?
 	 */
@@ -186,7 +196,7 @@ domount(kthread_t *td, vnode_t *vp, cons
 	error = VFS_MOUNT(mp, td);
 	td->td_ucred = cr;
 
-	if (!error) {
+	if (error == 0) {
 		if (mp->mnt_opt != NULL)
 			vfs_freeopts(mp->mnt_opt);
 		mp->mnt_opt = mp->mnt_optnew;
@@ -198,42 +208,33 @@ domount(kthread_t *td, vnode_t *vp, cons
 	*/
 	mp->mnt_optnew = NULL;
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
-	/*
-	 * Put the new filesystem on the mount list after root.
-	 */
 #ifdef FREEBSD_NAMECACHE
 	cache_purge(vp);
 #endif
-	if (!error) {
+	VI_LOCK(vp);
+	vp->v_iflag &= ~VI_MOUNT;
+	VI_UNLOCK(vp);
+	if (error == 0) {
 		vnode_t *mvp;
 
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
 		vp->v_mountedhere = mp;
+		/*
+		 * Put the new filesystem on the mount list.
+		 */
 		mtx_lock(&mountlist_mtx);
 		TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list);
 		mtx_unlock(&mountlist_mtx);
 		vfs_event_signal(NULL, VQ_MOUNT, 0);
 		if (VFS_ROOT(mp, LK_EXCLUSIVE, &mvp, td))
 			panic("mount: lost mount");
-		mountcheckdirs(vp, mvp);
-		vput(mvp);
-		VOP_UNLOCK(vp, 0, td);
-		if ((mp->mnt_flag & MNT_RDONLY) == 0)
-			error = vfs_allocate_syncvnode(mp);
+		vput(vp);
 		vfs_unbusy(mp, td);
-		if (error)
-			vrele(vp);
-		else
-			vfs_mountedfrom(mp, fspec);
+		*vpp = mvp;
 	} else {
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
-		VOP_UNLOCK(vp, 0, td);
+		vput(vp);
 		vfs_unbusy(mp, td);
 		vfs_mount_destroy(mp);
+		*vpp = NULL;
 	}
 	return (error);
 }

Modified: stable/7/sys/cddl/compat/opensolaris/sys/vfs.h
==============================================================================
--- stable/7/sys/cddl/compat/opensolaris/sys/vfs.h	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/compat/opensolaris/sys/vfs.h	Wed Jan  6 16:09:58 2010	(r201651)
@@ -110,8 +110,8 @@ void vfs_setmntopt(vfs_t *vfsp, const ch
     int flags __unused);
 void vfs_clearmntopt(vfs_t *vfsp, const char *name);
 int vfs_optionisset(const vfs_t *vfsp, const char *opt, char **argp);
-int domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
-    char *fspec, int fsflags);
+int mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype,
+    char *fspath, char *fspec, int fsflags);
 
 typedef	uint64_t	vfs_feature_t;
 

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Wed Jan  6 16:09:58 2010	(r201651)
@@ -231,8 +231,27 @@ typedef struct znode {
 /*
  * Convert between znode pointers and vnode pointers
  */
+#ifdef DEBUG
+static __inline vnode_t *
+ZTOV(znode_t *zp)
+{
+	vnode_t *vp = zp->z_vnode;
+
+	ASSERT(vp == NULL || vp->v_data == NULL || vp->v_data == zp);
+	return (vp);
+}
+static __inline znode_t *
+VTOZ(vnode_t *vp)
+{
+	znode_t *zp = (znode_t *)vp->v_data;
+
+	ASSERT(zp == NULL || zp->z_vnode == NULL || zp->z_vnode == vp);
+	return (zp);
+}
+#else
 #define	ZTOV(ZP)	((ZP)->z_vnode)
 #define	VTOZ(VP)	((znode_t *)(VP)->v_data)
+#endif
 
 /*
  * ZFS_ENTER() is called on entry to each ZFS vnode and vfs operation.

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -1841,7 +1841,7 @@ zfs_perm_init(znode_t *zp, znode_t *pare
 				fgid = zfs_fuid_create_cred(zfsvfs,
 				    ZFS_GROUP, tx, cr, fuidp);
 #ifdef __FreeBSD__
-				gid = parent->z_phys->zp_gid;
+				gid = fgid = parent->z_phys->zp_gid;
 #else
 				gid = crgetgid(cr);
 #endif

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -879,20 +879,20 @@ domount:
 	mountpoint = kmem_alloc(mountpoint_len, KM_SLEEP);
 	(void) snprintf(mountpoint, mountpoint_len, "%s/.zfs/snapshot/%s",
 	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
-	err = domount(curthread, *vpp, "zfs", mountpoint, snapname, 0);
+	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
 	kmem_free(mountpoint, mountpoint_len);
-	/* FreeBSD: This line was moved from below to avoid a lock recursion. */
-	if (err == 0)
-		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY, curthread);
-	mutex_exit(&sdp->sd_lock);
-	/*
-	 * If we had an error, drop our hold on the vnode and
-	 * zfsctl_snapshot_inactive() will clean up.
-	 */
-	if (err) {
-		VN_RELE(*vpp);
-		*vpp = NULL;
+	if (err == 0) {
+		/*
+		 * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
+		 *
+		 * This is where we lie about our v_vfsp in order to
+		 * make .zfs/snapshot/<snapname> accessible over NFS
+		 * without requiring manual mounts of <snapname>.
+		 */
+		ASSERT(VTOZ(*vpp)->z_zfsvfs != zfsvfs);
+		VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
 	}
+	mutex_exit(&sdp->sd_lock);
 	ZFS_EXIT(zfsvfs);
 	return (err);
 }

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -97,6 +97,8 @@ static int zfs_root(vfs_t *vfsp, int fla
 static int zfs_statfs(vfs_t *vfsp, struct statfs *statp, kthread_t *td);
 static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp);
 static int zfs_sync(vfs_t *vfsp, int waitfor, kthread_t *td);
+static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp);
 static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp);
 static void zfs_objset_close(zfsvfs_t *zfsvfs);
 static void zfs_freevfs(vfs_t *vfsp);
@@ -108,6 +110,7 @@ static struct vfsops zfs_vfsops = {
 	.vfs_statfs =		zfs_statfs,
 	.vfs_vget =		zfs_vget,
 	.vfs_sync =		zfs_sync,
+	.vfs_checkexp =		zfs_checkexp,
 	.vfs_fhtovp =		zfs_fhtovp,
 };
 
@@ -955,6 +958,18 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea
 		zfsvfs->z_unmounted = B_TRUE;
 		rrw_exit(&zfsvfs->z_teardown_lock, FTAG);
 		rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+#ifdef __FreeBSD__
+		/*
+		 * Some znodes might not be fully reclaimed, wait for them.
+		 */
+		mutex_enter(&zfsvfs->z_znodes_lock);
+		while (list_head(&zfsvfs->z_all_znodes) != NULL) {
+			msleep(zfsvfs, &zfsvfs->z_znodes_lock, 0,
+			    "zteardown", 0);
+		}
+		mutex_exit(&zfsvfs->z_znodes_lock);
+#endif
 	}
 
 	/*
@@ -1114,6 +1129,20 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 	znode_t		*zp;
 	int 		err;
 
+	/*
+	 * XXXPJD: zfs_zget() can't operate on virtual entires like .zfs/ or
+	 * .zfs/snapshot/ directories, so for now just return EOPNOTSUPP.
+	 * This will make NFS to fall back to using READDIR instead of
+	 * READDIRPLUS.
+	 * Also snapshots are stored in AVL tree, but based on their names,
+	 * not inode numbers, so it will be very inefficient to iterate
+	 * over all snapshots to find the right one.
+	 * Note that OpenSolaris READDIRPLUS implementation does LOOKUP on
+	 * d_name, and not VGET on d_fileno as we do.
+	 */
+	if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR)
+		return (EOPNOTSUPP);
+
 	ZFS_ENTER(zfsvfs);
 	err = zfs_zget(zfsvfs, ino, &zp);
 	if (err == 0 && zp->z_unlinked) {
@@ -1134,6 +1163,26 @@ CTASSERT(SHORT_FID_LEN <= sizeof(struct 
 CTASSERT(LONG_FID_LEN <= sizeof(struct fid));
 
 static int
+zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+
+	/*
+	 * If this is regular file system vfsp is the same as
+	 * zfsvfs->z_parent->z_vfs, but if it is snapshot,
+	 * zfsvfs->z_parent->z_vfs represents parent file system
+	 * which we have to use here, because only this file system
+	 * has mnt_export configured.
+	 */
+	vfsp = zfsvfs->z_parent->z_vfs;
+
+	return (vfs_stdcheckexp(zfsvfs->z_parent->z_vfs, nam, extflagsp,
+	    credanonp));
+}
+
+
+static int
 zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp)
 {
 	zfsvfs_t	*zfsvfs = vfsp->vfs_data;
@@ -1148,7 +1197,11 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 
 	ZFS_ENTER(zfsvfs);
 
-	if (fidp->fid_len == LONG_FID_LEN) {
+	/*
+	 * On FreeBSD we can get snapshot's mount point or its parent file
+	 * system mount point depending if snapshot is already mounted or not.
+	 */
+	if (zfsvfs->z_parent == zfsvfs && fidp->fid_len == LONG_FID_LEN) {
 		zfid_long_t	*zlfid = (zfid_long_t *)fidp;
 		uint64_t	objsetid = 0;
 		uint64_t	setgen = 0;

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -4340,11 +4340,20 @@ zfs_reclaim_complete(void *arg, int pend
 	znode_t	*zp = arg;
 	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 
-	ZFS_LOG(1, "zp=%p", zp);
-	ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
-	zfs_znode_dmu_fini(zp);
-	ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
+	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
+	if (zp->z_dbuf != NULL) {
+		ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
+		zfs_znode_dmu_fini(zp);
+		ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
+	}
 	zfs_znode_free(zp);
+	rw_exit(&zfsvfs->z_teardown_inactive_lock);
+	/*
+	 * If the file system is being unmounted, there is a process waiting
+	 * for us, wake it up.
+	 */
+	if (zfsvfs->z_unmounted)
+		wakeup_one(zfsvfs);
 }
 
 static int
@@ -4356,6 +4365,9 @@ zfs_freebsd_reclaim(ap)
 {
 	vnode_t	*vp = ap->a_vp;
 	znode_t	*zp = VTOZ(vp);
+	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+
+	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
 
 	ASSERT(zp != NULL);
 
@@ -4366,7 +4378,7 @@ zfs_freebsd_reclaim(ap)
 
 	mutex_enter(&zp->z_lock);
 	ASSERT(zp->z_phys != NULL);
-	ZTOV(zp) = NULL;
+	zp->z_vnode = NULL;
 	mutex_exit(&zp->z_lock);
 
 	if (zp->z_unlinked)
@@ -4374,7 +4386,6 @@ zfs_freebsd_reclaim(ap)
 	else if (zp->z_dbuf == NULL)
 		zfs_znode_free(zp);
 	else /* if (!zp->z_unlinked && zp->z_dbuf != NULL) */ {
-		zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 		int locked;
 
 		locked = MUTEX_HELD(ZFS_OBJ_MUTEX(zfsvfs, zp->z_id)) ? 2 :
@@ -4397,6 +4408,7 @@ zfs_freebsd_reclaim(ap)
 	vp->v_data = NULL;
 	ASSERT(vp->v_holdcnt >= 1);
 	VI_UNLOCK(vp);
+	rw_exit(&zfsvfs->z_teardown_inactive_lock);
 	return (0);
 }
 

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -110,7 +110,7 @@ znode_evict_error(dmu_buf_t *dbuf, void 
 		mutex_exit(&zp->z_lock);
 		zfs_znode_free(zp);
 	} else if (vp->v_count == 0) {
-		ZTOV(zp) = NULL;
+		zp->z_vnode = NULL;
 		vhold(vp);
 		mutex_exit(&zp->z_lock);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
@@ -896,9 +896,25 @@ again:
 		if (zp->z_unlinked) {
 			err = ENOENT;
 		} else {
-			if (ZTOV(zp) != NULL)
-				VN_HOLD(ZTOV(zp));
+			int dying = 0;
+
+			vp = ZTOV(zp);
+			if (vp == NULL)
+				dying = 1;
 			else {
+				VN_HOLD(vp);
+				if ((vp->v_iflag & VI_DOOMED) != 0) {
+					dying = 1;
+					/*
+					 * Don't VN_RELE() vnode here, because
+					 * it can call vn_lock() which creates
+					 * LOR between vnode lock and znode
+					 * lock. We will VN_RELE() the vnode
+					 * after droping znode lock.
+					 */
+				}
+			}
+			if (dying) {
 				if (first) {
 					ZFS_LOG(1, "dying znode detected (zp=%p)", zp);
 					first = 0;
@@ -910,6 +926,8 @@ again:
 				dmu_buf_rele(db, NULL);
 				mutex_exit(&zp->z_lock);
 				ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
+				if (vp != NULL)
+					VN_RELE(vp);
 				tsleep(zp, 0, "zcollide", 1);
 				goto again;
 			}
@@ -1531,7 +1549,7 @@ zfs_create_fs(objset_t *os, cred_t *cr, 
 	ZTOV(rootzp)->v_data = NULL;
 	ZTOV(rootzp)->v_count = 0;
 	ZTOV(rootzp)->v_holdcnt = 0;
-	ZTOV(rootzp) = NULL;
+	rootzp->z_vnode = NULL;
 	VOP_UNLOCK(vp, 0, curthread);
 	vdestroy(vp);
 	dmu_buf_rele(rootzp->z_dbuf, NULL);
Comment 20 Martin Matuska freebsd_committer freebsd_triage 2011-07-18 12:55:44 UTC
Is this problem already resolved? Can we close this PR?

-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk
Comment 21 Marcelo Araujo freebsd_committer freebsd_triage 2012-04-16 02:37:53 UTC
State Changed
From-To: suspended->feedback

It doesn't happen anymore on CURRENT (10), but I ran into some problems 
on 8.2-RELEASE; I'm doing some more debugging now to figure out what is 
happening with it.
Comment 22 Marcelo Araujo freebsd_committer freebsd_triage 2012-04-16 07:31:59 UTC
State Changed
From-To: feedback->closed

It was solved in PR 150544.
Comment 23 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2014-06-01 06:50:15 UTC
State Changed
From-To: open->feedback

Is this still a problem with FreeBSD 8? I'm not able to reproduce it. 


Comment 24 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2014-06-01 06:50:15 UTC
Responsible Changed
From-To: freebsd-fs->pjd

I'll take this one.
Comment 25 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2014-06-01 06:50:15 UTC
State Changed
From-To: feedback->suspended

I committed a work-around for this problem: just return EOPNOTSUPP 
and let the NFS server fall back to regular READDIR. 

There are two problems with the patch: 
- It doesn't automount the snapshot (as you mentioned). 
- It doesn't scale - the AVL tree of snapshots is keyed by names, not 
inode numbers, so it is quite expensive to look through all the 
snapshots when zfs_zget() returns an error. 

In OpenSolaris they use LOOKUP on d_name instead of VGET on d_fileno, 
which handles both automounting and snapshot lookup. 

I have a prototype patch for our NFS server to use LOOKUP if VGET fails, 
but it is too risky so early before a release and needs serious review, 
so we are sure we can't cross mount points when we don't want to, etc.
so we are sure we can't cross mount points when we don't want to, etc.