132068 – [zfs] page fault when using ZFS over NFS on 7.1-RELEASE/amd64

Bug 132068 - [zfs] page fault when using ZFS over NFS on 7.1-RELEASE/amd64

Summary: [zfs] page fault when using ZFS over NFS on 7.1-RELEASE/amd64

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	7.1-RELEASE
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	Pawel Jakub Dawidek

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-02-24 14:40 UTC by Edward Fisk
Modified:	2010-01-06 16:20 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Edward Fisk 2009-02-24 14:40:03 UTC

FreeBSD 7.1-RELEASE/amd64, GENERIC kernel, 16GB RAM, 2x dual-core AMD 2.8GH=
z CPUs

Standard install on UFS volume, but with one 500GB ZFS volume, exported
via NFS to various clients (FreeBSD, Debian Linux, Mac OS X).

No ZFS snapshots in use.

After anywhere from a few hours to a few days of use, the system will
either hang or panic.  Running iozone on the clients on the NFS mounted
volume doesn't appear to have any effect one way or the other.

If you require any more information or testing, please let me know.  --Ed

# dmesg
CPU: Dual-Core AMD Opteron(tm) Processor 2220 (2800.12-MHz K8-class CPU)
usable memory =3D 17166274560 (16371 MB)

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  aacd0s1f  ONLINE       0     0     0

errors: No known data errors

# cat /etc/sysctl.conf
kern.maxvnodes=3D400000

# cat /boot/loader.conf
vm.kmem_size_max=3D"1024M"
vm.kmem_size=3D"1024M"
vfs.zfs.arc_max=3D"100M"

# kgdb kernel.debug /var/crash/vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain condition=
s.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid =3D 1; apic id =3D 01
fault virtual address	=3D 0x1c4
fault code		=3D supervisor read data, page not present
instruction pointer	=3D 0x8:0xffffffff80651673
stack pointer	        =3D 0x10:0xffffffffdd9d6600
frame pointer	        =3D 0x10:0xffffff00341efa38
code segment		=3D base rx0, limit 0xfffff, type 0x1b
			=3D DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	=3D interrupt enabled, resume, IOPL =3D 0
current process		=3D 754 (nfsd)
trap number		=3D 12
panic: page fault
cpuid =3D 1
Uptime: 19d19h23m22s
Physical memory: 16371 MB
Dumping 1425 MB: 1410 1394 1378 1362 1346 1330 1314 1298 1282 1266 1250 123=
4 1218 1202 1186 1170 1154 1138 1122 1106 1090 1074 1058 1042 1026 1010 994=
 978 962 946 930 914 898 882 866 850 834 818 802 786 770 754 738 722 706 69=
0 674 658 642 626 610 594 578 562 546 530 514 498 482 466 450 434 418 402 3=
86 370 354 338 322 306 290 274 258 242 226 210 194 178 162 146 130 114 98 8=
2 66 50 34 18 2

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kerne=
l/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /bo=
ot/kernel/opensolaris.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /boot/kernel=
/pf.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/pf.ko
#0  doadump () at pcpu.h:195
195		__asm __volatile("movq %%gs:0,%0" : "=3Dr" (td));
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x0000000000000004 in ?? ()
#2  0xffffffff804b4ce9 in boot (howto=3D260) at /usr/src/sys/kern/kern_shut=
down.c:418
#3  0xffffffff804b50f2 in panic (fmt=3D0x104 <Address 0x104 out of bounds>)=
 at /usr/src/sys/kern/kern_shutdown.c:574
#4  0xffffffff8078a173 in trap_fatal (frame=3D0xffffff0003793370, eva=3DVar=
iable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:764
#5  0xffffffff8078a545 in trap_pfault (frame=3D0xffffffffdd9d6550, usermode=
=3D0) at /usr/src/sys/amd64/amd64/trap.c:680
#6  0xffffffff8078ae88 in trap (frame=3D0xffffffffdd9d6550) at /usr/src/sys=
/amd64/amd64/trap.c:449
#7  0xffffffff8077067e in calltrap () at /usr/src/sys/amd64/amd64/exception=
.S:209
#8  0xffffffff80651673 in nfsrv_readdirplus (nfsd=3D0xffffff0041546700, slp=
=3D0xffffff0079a7ad00, td=3D0xffffff0003793370, mrq=3D0xffffffffdd9d6b00) a=
t /usr/src/sys/nfsserver/nfs_serv.c:3645
#9  0xffffffff8065e6dd in nfssvc (td=3DVariable "td" is not available.
) at /usr/src/sys/nfsserver/nfs_syscalls.c:456
#10 0xffffffff8078a7c7 in syscall (frame=3D0xffffffffdd9d6c80) at /usr/src/=
sys/amd64/amd64/trap.c:907
#11 0xffffffff8077088b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exce=
ption.S:330
#12 0x0000000800687d5c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) list *0xffffffff80651673
0xffffffff80651673 is in nfsrv_readdirplus (/usr/src/sys/nfsserver/nfs_serv=
.c:3645).
3640				 */
3641				if (VFS_VGET(vp->v_mount, dp->d_fileno, LK_EXCLUSIVE,
3642				    &nvp))
3643					goto invalid;
3644				bzero((caddr_t)nfhp, NFSX_V3FH);
3645				nfhp->fh_fsid =3D
3646					nvp->v_mount->mnt_stat.f_fsid;
3647				/*
3648				 * XXXRW: Assert the mountpoints are the same so that
3649				 * we know that acquiring Giant based on the
(kgdb) frame 8
#8  0xffffffff80651673 in nfsrv_readdirplus (nfsd=3D0xffffff0041546700, slp=
=3D0xffffff0079a7ad00, td=3D0xffffff0003793370, mrq=3D0xffffffffdd9d6b00) a=
t /usr/src/sys/nfsserver/nfs_serv.c:3645
3645				nfhp->fh_fsid =3D
(kgdb) p *vp
$1 =3D {v_type =3D VDIR, v_tag =3D 0xffffffffdd978d4d "zfs", v_op =3D 0xfff=
fffffdd97bda0, v_data =3D 0xffffff00240d4870, v_mount =3D 0xffffff0003d4f6f=
0, v_nmntvnodes =3D {tqe_next =3D 0xffffff004cb92bd0,=20
    tqe_prev =3D 0xffffff0003d9a610}, v_un =3D {vu_mount =3D 0x0, vu_socket=
 =3D 0x0, vu_cdev =3D 0x0, vu_fifoinfo =3D 0x0, vu_yield =3D 0}, v_hashlist=
 =3D {le_next =3D 0x0, le_prev =3D 0x0}, v_hash =3D 0,=20
  v_cache_src =3D {lh_first =3D 0xffffff003b891148}, v_cache_dst =3D {tqh_f=
irst =3D 0x0, tqh_last =3D 0xffffff0051425a38}, v_dd =3D 0x0, v_cstart =3D =
0, v_lasta =3D 0, v_lastw =3D 0, v_clen =3D 0, v_lock =3D {
    lk_object =3D {lo_name =3D 0xffffffffdd978d4d "zfs", lo_type =3D 0xffff=
ffffdd978d4d "zfs", lo_flags =3D 70844416, lo_witness_data =3D {lod_list =
=3D {stqe_next =3D 0x0}, lod_witness =3D 0x0}},=20
    lk_interlock =3D 0xffffffff80ab32b0, lk_flags =3D 64, lk_sharecount =3D=
 0, lk_waitcount =3D 0, lk_exclusivecount =3D 0, lk_prio =3D 80, lk_timo =
=3D 51, lk_lockholder =3D 0xffffffffffffffff,=20
    lk_newlock =3D 0x0}, v_interlock =3D {lock_object =3D {lo_name =3D 0xff=
ffffff8085132a "vnode interlock", lo_type =3D 0xffffffff8085132a "vnode int=
erlock", lo_flags =3D 16973824,=20
      lo_witness_data =3D {lod_list =3D {stqe_next =3D 0x0}, lod_witness =
=3D 0x0}}, mtx_lock =3D 4, mtx_recurse =3D 0}, v_vnlock =3D 0xffffff0051425=
a70, v_holdcnt =3D 2, v_usecount =3D 1, v_iflag =3D 0,=20
  v_vflag =3D 0, v_writecount =3D 0, v_freelist =3D {tqe_next =3D 0xffffff0=
045f415e8, tqe_prev =3D 0xffffff0045f12f08}, v_bufobj =3D {bo_mtx =3D 0xfff=
fff0051425ac0, bo_clean =3D {bv_hd =3D {
        tqh_first =3D 0x0, tqh_last =3D 0xffffff0051425b30}, bv_root =3D 0x=
0, bv_cnt =3D 0}, bo_dirty =3D {bv_hd =3D {tqh_first =3D 0x0, tqh_last =3D =
0xffffff0051425b50}, bv_root =3D 0x0, bv_cnt =3D 0},=20
    bo_numoutput =3D 0, bo_flag =3D 0, bo_ops =3D 0xffffffff80a336a0, bo_bs=
ize =3D 131072, bo_object =3D 0xffffff00511165b0, bo_synclist =3D {le_next =
=3D 0x0, le_prev =3D 0x0},=20
    bo_private =3D 0xffffff00514259d8, __bo_vnode =3D 0xffffff00514259d8}, =
v_pollinfo =3D 0x0, v_label =3D 0x0, v_lockf =3D 0x0}
(kgdb) p *dp
$3 =3D {d_fileno =3D 69115, d_reclen =3D 52, d_type =3D 8 '\b', d_namlen =
=3D 43 '+',=20
  d_name =3D "1202493068.863_1.mailserver1.XXXXXXXXXXX:2,\000=A8 \002\0008\=
000\b,1234513680.8945_1.mailserver1.XXXXXXXXXXX:2,\000:2,=C1<\001\0008\000\=
b.1223469304.21605_1.mailserver1.XXXXXXXXXXX:2,S\0002\034\005\001\0008\000\=
b,1199341929.2724_2.mailserver1.XXXXXX"...}

Comment 1 Mark Linimon freebsd_committer

2009-02-24 17:47:30 UTC

Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).

Comment 2 Jaakko Heinonen 2009-02-24 19:32:42 UTC

Hi,

On 2009-02-24, Edward Fisk wrote:
> If you require any more information or testing, please let me know.

> (kgdb) frame 8
> #8  0xffffffff80651673 in nfsrv_readdirplus (nfsd=3D0xffffff0041546700, slp=
> =3D0xffffff0079a7ad00, td=3D0xffffff0003793370, mrq=3D0xffffffffdd9d6b00) a=
> t /usr/src/sys/nfsserver/nfs_serv.c:3645
> 3645				nfhp->fh_fsid =3D

Can you give output of these commands in frame 8:

p *nvp
p *nvp->v_mount

If nvp is NULL it's likely a bug in zfs_zget(). Looks like the zfs
version in head has updated zfs_zget() which might fix the issue.

-- 
Jaakko

Comment 3 Edward Fisk 2009-02-24 20:25:13 UTC

(kgdb) frame 8
#8  0xffffffff80651673 in nfsrv_readdirplus (nfsd=0xffffff0041546700, slp=0xffffff0079a7ad00, td=0xffffff0003793370, 
    mrq=0xffffffffdd9d6b00) at /usr/src/sys/nfsserver/nfs_serv.c:3645
3645				nfhp->fh_fsid =
(kgdb) p *nvp
$1 = {v_type = VBAD, v_tag = 0xffffffff807e5627 "none", v_op = 0xffffffff80a18220, v_data = 0x0, v_mount = 0x0, v_nmntvnodes = {
    tqe_next = 0xffffff00529b2bd0, tqe_prev = 0xffffff000af92028}, v_un = {vu_mount = 0x0, vu_socket = 0x0, vu_cdev = 0x0, 
    vu_fifoinfo = 0x0, vu_yield = 0}, v_hashlist = {le_next = 0x0, le_prev = 0x0}, v_hash = 0, v_cache_src = {lh_first = 0x0}, 
  v_cache_dst = {tqh_first = 0x0, tqh_last = 0xffffff0051130258}, v_dd = 0x0, v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0, 
  v_lock = {lk_object = {lo_name = 0xffffffffdd978d4d "zfs", lo_type = 0xffffffffdd978d4d "zfs", lo_flags = 70844416, 
      lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, lk_interlock = 0xffffffff80ab2c80, lk_flags = 64, 
    lk_sharecount = 0, lk_waitcount = 0, lk_exclusivecount = 0, lk_prio = 80, lk_timo = 51, lk_lockholder = 0xffffffffffffffff, 
    lk_newlock = 0x0}, v_interlock = {lock_object = {lo_name = 0xffffffff8085132a "vnode interlock", 
      lo_type = 0xffffffff8085132a "vnode interlock", lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next = 0x0}, 
        lod_witness = 0x0}}, mtx_lock = 4, mtx_recurse = 0}, v_vnlock = 0xffffff0051130290, v_holdcnt = 1, v_usecount = 1, 
  v_iflag = 128, v_vflag = 0, v_writecount = 0, v_freelist = {tqe_next = 0xffffff00529b2bd0, tqe_prev = 0xffffffff80aca090}, 
  v_bufobj = {bo_mtx = 0xffffff00511302e0, bo_clean = {bv_hd = {tqh_first = 0x0, tqh_last = 0xffffff0051130350}, bv_root = 0x0, 
      bv_cnt = 0}, bo_dirty = {bv_hd = {tqh_first = 0x0, tqh_last = 0xffffff0051130370}, bv_root = 0x0, bv_cnt = 0}, 
    bo_numoutput = 0, bo_flag = 0, bo_ops = 0xffffffff80a336a0, bo_bsize = 131072, bo_object = 0x0, bo_synclist = {le_next = 0x0, 
      le_prev = 0x0}, bo_private = 0xffffff00511301f8, __bo_vnode = 0xffffff00511301f8}, v_pollinfo = 0x0, v_label = 0x0, 
  v_lockf = 0x0}
(kgdb) p *nvp->v_mount
Cannot access memory at address 0x0

---
> If nvp is NULL it's likely a bug in zfs_zget(). Looks like the zfs
> version in head has updated zfs_zget() which might fix the issue.

Thanks for the tip.  I will try building a kernel with revision 1.15.2.3 of zfs_znode.c and see what happens.

Comment 4 Edward Fisk 2009-02-25 12:03:12 UTC

I GENERIC kernel built from RELENG_7 sources (20090225) doesn't appear to have helped.  The system locked up hard this time though, so I was unfortunately unable to obtain a dump.

Comment 5 Jaakko Heinonen 2009-02-26 17:44:21 UTC

On 2009-02-24, Edward Fisk wrote:
>  (kgdb) p *nvp
>  $1 = {v_type = VBAD, v_tag = 0xffffffff807e5627 "none", v_op = 0xffffffff80a18220, v_data = 0x0, v_mount = 0x0, v_nmntvnodes = {

Thanks for the info. If you can't try 8.0-CURRENT here is an attempt to
backport some bits from head to RELENG_7. I have only compile tested the
patch so be careful.

--- patch begins here ---
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(revision 189044)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(working copy)
@@ -554,10 +554,10 @@ zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_
 	dmu_buf_t	*db;
 	znode_t		*zp;
 	vnode_t		*vp;
-	int err;
+	int err, first = 1;
 
 	*zpp = NULL;
-
+again:
 	ZFS_OBJ_HOLD_ENTER(zfsvfs, obj_num);
 
 	err = dmu_bonus_hold(zfsvfs->z_os, obj_num, NULL, &db);
@@ -574,64 +574,60 @@ zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_
 		return (EINVAL);
 	}
 
-	ASSERT(db->db_object == obj_num);
-	ASSERT(db->db_offset == -1);
-	ASSERT(db->db_data != NULL);
-
 	zp = dmu_buf_get_user(db);
-
 	if (zp != NULL) {
 		mutex_enter(&zp->z_lock);
 
+		/*
+		 * Since we do immediate eviction of the z_dbuf, we
+		 * should never find a dbuf with a znode that doesn't
+		 * know about the dbuf.
+		 */
+		ASSERT3P(zp->z_dbuf, ==, db);
 		ASSERT3U(zp->z_id, ==, obj_num);
 		if (zp->z_unlinked) {
-			dmu_buf_rele(db, NULL);
-			mutex_exit(&zp->z_lock);
-			ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
-			return (ENOENT);
-		} else if (zp->z_dbuf_held) {
-			dmu_buf_rele(db, NULL);
+			err = ENOENT;
 		} else {
-			zp->z_dbuf_held = 1;
-			VFS_HOLD(zfsvfs->z_vfs);
-		}
-
-		if (ZTOV(zp) != NULL)
-			VN_HOLD(ZTOV(zp));
-		else {
-			err = getnewvnode("zfs", zfsvfs->z_vfs, &zfs_vnodeops,
-			    &zp->z_vnode);
-			ASSERT(err == 0);
-			vp = ZTOV(zp);
-			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
-			vp->v_data = (caddr_t)zp;
-			vp->v_vnlock->lk_flags |= LK_CANRECURSE;
-			vp->v_vnlock->lk_flags &= ~LK_NOSHARE;
-			vp->v_type = IFTOVT((mode_t)zp->z_phys->zp_mode);
-			if (vp->v_type == VDIR)
-				zp->z_zn_prefetch = B_TRUE;	/* z_prefetch default is enabled */
-			vp->v_vflag |= VV_FORCEINSMQ;
-			err = insmntque(vp, zfsvfs->z_vfs);
-			vp->v_vflag &= ~VV_FORCEINSMQ;
-			KASSERT(err == 0, ("insmntque() failed: error %d", err));
-			VOP_UNLOCK(vp, 0, curthread);
+			if (ZTOV(zp) != NULL)
+				VN_HOLD(ZTOV(zp));
+			else {
+				if (first) {
+					ZFS_LOG(1, "dying znode detected (zp=%p)", zp);
+					first = 0;
+				}
+				/*
+				 * znode is dying so we can't reuse it, we must
+				 * wait until destruction is completed.
+				 */
+				dmu_buf_rele(db, NULL);
+				mutex_exit(&zp->z_lock);
+				ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
+				tsleep(zp, 0, "zcollide", 1);
+				goto again;
+			}
+			*zpp = zp;
+			err = 0;
 		}
+		dmu_buf_rele(db, NULL);
 		mutex_exit(&zp->z_lock);
 		ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
-		*zpp = zp;
-		return (0);
+		return (err);
 	}
 
 	/*
 	 * Not found create new znode/vnode
 	 */
 	zp = zfs_znode_alloc(zfsvfs, db, obj_num, doi.doi_data_block_size);
-	ASSERT3U(zp->z_id, ==, obj_num);
-	zfs_znode_dmu_init(zp);
+
+	vp = ZTOV(zp);
+	vp->v_vflag |= VV_FORCEINSMQ;
+	err = insmntque(vp, zfsvfs->z_vfs);
+	vp->v_vflag &= ~VV_FORCEINSMQ;
+	KASSERT(err == 0, ("insmntque() failed: error %d", err));
+	VOP_UNLOCK(vp, 0, curthread);
+
 	ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
 	*zpp = zp;
-	if ((vp = ZTOV(zp)) != NULL)
-		VOP_UNLOCK(vp, 0, curthread);
 	return (0);
 }
 
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(revision 189044)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(working copy)
@@ -3475,9 +3475,9 @@ zfs_freebsd_reclaim(ap)
 	ASSERT(zp->z_phys);
 	ASSERT(zp->z_dbuf_held);
 	zfsvfs = zp->z_zfsvfs;
+	ZTOV(zp) = NULL;
 	if (!zp->z_unlinked) {
 		zp->z_dbuf_held = 0;
-		ZTOV(zp) = NULL;
 		mutex_exit(&zp->z_lock);
 		dmu_buf_rele(zp->z_dbuf, NULL);
 	} else {
--- patch ends here ---

-- 
Jaakko

Comment 6 Mark Linimon freebsd_committer

2009-03-05 07:28:23 UTC

State Changed
From-To: open->feedback

To submitter: can you test the patch on 8-CURRENT?

Comment 7 Mark Linimon 2009-03-05 10:27:38 UTC

On Thu, Mar 05, 2009 at 07:29:03AM +0000, linimon@FreeBSD.org wrote:
> To submitter: can you test the patch on 8-CURRENT?

Er, what I *should* have said is: can you either test the patch,
or try 8-CURRENT, which has the patch in it?

mcl

Comment 8 Edward Fisk 2009-03-05 11:46:25 UTC

I've tried the patch kindly supplied by Jaakko, but the machine still deadlocks (no panic though).  I then did a source upgrade to 8.0-CURRENT (20090227).  The machine would still deadlock every 1-2 hours under the same load.  I've now added the following deadlock debugging options:

options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTIC

Unfortunately, the machine has now been running for over 48 hours without a crash, while still being under the same load/usage pattern.  -Ed

Comment 9 Edward Fisk 2009-03-05 11:47:54 UTC

kern/129174 looks like a similar problem

just to clarify, the clients using my system use both NFS2 and NFS3

Comment 10 Edward Fisk 2009-03-05 11:50:41 UTC

The machine did a panic earlier, but was unfortunately unable to do a dump.

------
panic: Bad link elm 0xffffff014f3e2400 prev->next != elm
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
xprt_unregister_locked() at xprt_unregister_locked+0xad
xprt_unregister() at xprt_unregister+0x2c
svc_run_internal() at svc_run_internal+0x42f
svc_run() at svc_run+0x94
nfssvc_nfsd() at nfssvc_nfsd+0xa2
nfssvc() at nfssvc+0x12d
syscall() at syscall+0x1e7
Xfast_syscall() at Xfast_syscall+0xab
--- syscall (155, FreeBSD ELF64, nfssvc), rip = 0x800695c4c, rsp = 0x7fffffffe8e8, rbp = 0 ---
Uptime: 3d4h21m8s
Physical memory: 16368 MB
Dumping 3849 MB: 3834 3818 3802 3786 3770 3754 3738 3722 3706 3690Error dumping block 0x0

** DUMP FAILED (ERROR 5) **
aac0: shutting down controller...FAILED.
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
------

Any idea as to why it was unable to complete the crash dump?

Comment 11 Mark Linimon 2009-03-05 15:29:11 UTC

----- Forwarded message from Edward Fisk <7ogcg7g02@sneakemail.com> -----

To: freebsd-bugs@FreeBSD.org
From: Edward Fisk <7ogcg7g02@sneakemail.com>
Subject: Re: kern/132068: [zfs] page fault when using ZFS over NFS on
	7.1-RELEASE/amd64

I tried with the patch kindly provided by Jaakko.  The machine would no
longer panic, but would instead deadlock every 1-2 hours when in use.
I then did a source upgrade to 8-CURRENT (27022008) with a GENERIC
kernel, which had the same result: a deadlock every 1 to 2 hours when in use.

The machine is now running a GENERIC kernel with the following options added:

options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTIC

Unfortunately, I've yet to be able to get it to deadlock, despite the
workload on the machine not having changed.

----- End forwarded message -----

Comment 12 Mark Linimon 2009-03-05 15:29:39 UTC

----- Forwarded message from Edward Fisk <7ogcg7g02@sneakemail.com> -----

To: freebsd-bugs@FreeBSD.org
From: Edward Fisk <7ogcg7g02@sneakemail.com>
Subject: Re: kern/132068: [zfs] page fault when using ZFS over NFS on
	7.1-RELEASE/amd64

kern/129174 looks like a similar problem, altohugh there's not much
useful information there

----- End forwarded message -----

Comment 13 Mark Linimon 2009-03-05 19:26:45 UTC

----- Forwarded message from Weldon S Godfrey 3 <weldon@excelsus.com> -----

From: Weldon S Godfrey 3 <weldon@excelsus.com>
To: Mark Linimon <linimon@lonesome.com>
cc: freebsd-fs@FreeBSD.org
Subject: Re: kern/132068: [zfs] page fault when using ZFS over NFS on
	7.1-RELEASE/amd64

Are you using v3 NFS or v2?  Switching to v2 has made things MUCH more
stable for me, however, I still loose stability with ZIL enabled (even if
prefetch is disabled)(and ZIL disabled is NOT desirable as I know the
potential client side corruption with that, but so far I haven't run into
that).  I currently have ZIL and prefetch disabled.  I currently limit
ARC to 2GB as well and set kmem to 4GB (I currently am using FreeBSD 8)

I am up to about 1.5 weeks without a panic now.  The system does a
constant 80-120Mb/s in read and 30Mb/s write/s during the day and has 9
NFS clients.

Weldon

----- End forwarded message -----

Comment 14 Jaakko Heinonen 2009-03-26 19:01:10 UTC

I was able to trigger a panic on current using this patch:

%%%
Index: sys/nfsserver/nfs_srvsubs.c
===================================================================
--- sys/nfsserver/nfs_srvsubs.c	(revision 190316)
+++ sys/nfsserver/nfs_srvsubs.c	(working copy)
@@ -1169,6 +1169,8 @@ nfsrv_fhtovp(fhandle_t *fhp, int lockfla
 	vfs_unbusy(mp);
 	if (error)
 		goto out;
+	if ((*vpp)->v_type == VBAD)
+		panic("VBAD *vpp in nfsrv_fhtovp()");
 #ifdef MNT_EXNORESPORT
 	if (!(exflags & (MNT_EXNORESPORT|MNT_EXPUBLIC))) {
 		saddr = (struct sockaddr_in *)nam;
%%%

#2  0xc08537d2 in panic (fmt=Variable "fmt" is not available.)
    at /home/jaakko/src/head/sys/kern/kern_shutdown.c:576
#3  0xc0a1c17e in nfsrv_fhtovp (fhp=0xf48ecae4, lockflag=1, vpp=0xf48ecad8, 
    vfslockedp=0xf48ecac4, nfsd=0xf48ecbb0, slp=0x0, nam=0xc63aeac4, 
    rdonlyp=0xf48ecad0, pubflag=1)
    at /home/jaakko/src/head/sys/nfsserver/nfs_srvsubs.c:1173
#4  0xc0a0ff20 in nfsrv_commit (nfsd=0xf48ecbb0, slp=0x0, mrq=0xf48ecba8)
    at /home/jaakko/src/head/sys/nfsserver/nfs_serv.c:3836
#5  0xc0a1b1a7 in nfssvc_program (rqst=0xca95d000, xprt=0xc63aea00)
    at /home/jaakko/src/head/sys/nfsserver/nfs_srvkrpc.c:420
#6  0xc0a35fe2 in svc_run_internal (pool=0xc57a7a80, ismaster=0)
    at /home/jaakko/src/head/sys/rpc/svc.c:883
#7  0xc0a362c0 in svc_thread_start (arg=0xc57a7a80)
    at /home/jaakko/src/head/sys/rpc/svc.c:1188
#8  0xc082fd38 in fork_exit (callout=0xc0a362b0 <svc_thread_start>, 
    arg=0xc57a7a80, frame=0xf48ecd38)
    at /home/jaakko/src/head/sys/kern/kern_fork.c:821
#9  0xc0b41630 in fork_trampoline ()
    at /home/jaakko/src/head/sys/i386/i386/exception.s:270

I now know what is going on. The vnode may be reclaimed during
zfs_zget() because it doesn't hold the vnode lock (except when a new
znode is created).

I tried to modify zfs_zget() to grab the vnode lock and check if the
vnode is doomed. I was able to trigger the condition. However eventually
vnode locking deadlocked with lookup(). I used exclusive locking so
maybe one could avoid deadlocking with shared locking but it isn't a
real solution especially because shared lookups can be disabled. Also
other deadlock scenarios may exist.

On 2009-03-05, Edward Fisk wrote:
>  The machine did a panic earlier, but was unfortunately unable to do a dump.
>  
>  panic: Bad link elm 0xffffff014f3e2400 prev->next != elm

>  xprt_unregister_locked() at xprt_unregister_locked+0xad

This is a different issue. You may have already seen this discussion:

http://lists.freebsd.org/pipermail/freebsd-current/2009-March/005097.html

-- 
Jaakko

Comment 15 Jaakko Heinonen 2009-04-10 12:46:02 UTC

On 2009-03-26, Jaakko Heinonen wrote:
> I now know what is going on. The vnode may be reclaimed during
> zfs_zget() because it doesn't hold the vnode lock (except when a new
> znode is created).

OK, I have now put together a patch which should avoid the original
panic you reported. The same panic was also reported by Weldon Godfrey
(Cc'd) on -fs:
http://lists.freebsd.org/pipermail/freebsd-fs/2008-August/004998.html

--- patch begins here ---
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(revision 190593)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(working copy)
@@ -890,8 +890,9 @@ again:
 		if (zp->z_unlinked) {
 			err = ENOENT;
 		} else {
-			if (ZTOV(zp) != NULL)
-				VN_HOLD(ZTOV(zp));
+			vp = ZTOV(zp);
+			if (vp != NULL)
+				VN_HOLD(vp);
 			else {
 				if (first) {
 					ZFS_LOG(1, "dying znode detected (zp=%p)", zp);
@@ -907,12 +908,25 @@ again:
 				tsleep(zp, 0, "zcollide", 1);
 				goto again;
 			}
-			*zpp = zp;
 			err = 0;
 		}
 		dmu_buf_rele(db, NULL);
 		mutex_exit(&zp->z_lock);
 		ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
+		if (err == 0) {
+			/*
+			 * Check if we lost lost race against reclaim.
+			 */
+			VI_LOCK(vp);
+			if (vp->v_iflag & VI_DOOMED) {
+				VI_UNLOCK(vp);
+				VN_RELE(vp);
+				ZFS_LOG(1, "doomed vnode detected (zp=%p)", zp);
+				goto again;
+			}
+			VI_UNLOCK(vp);
+			*zpp = zp;
+		}
 		return (err);
 	}
--- patch ends here ---

The fix isn't perfect. Vnodes may be still reclaimed during zfs_zget()
in forced unmount case. However zfs doesn't support forced unmounts at
all right now. The patch is againt 8.0-CURRENT.

-- 
Jaakko

Comment 16 Jaakko Heinonen 2009-04-22 15:38:57 UTC

On 2009-04-10, Jaakko Heinonen wrote:
> OK, I have now put together a patch which should avoid the original
> panic you reported.

Have you had a chance to test the patch?

http://www.freebsd.org/cgi/query-pr.cgi?pr=132068

-- 
Jaakko

Comment 17 Weldon Godfrey 2009-04-23 13:40:10 UTC

Sorry.  Around Dec 12 I switched to head.  By increasing kmem to 4GB and
using NFS v2, that reduced the panics to a few times a month.  The
server is in production.  I'll need to acquire some additional drives so
I can install the OS on different drives (in case I need to backout) and
wait until summer to attempt to upgrade.

Weldon

-----Original Message-----
From: Jaakko Heinonen [mailto:jh@saunalahti.fi]=20
Sent: Wednesday, April 22, 2009 9:39 AM
To: Edward Fisk
Cc: bug-followup@FreeBSD.org; Weldon Godfrey
Subject: Re: kern/132068: [zfs] page fault when using ZFS over NFS
on7.1-RELEASE/amd64

On 2009-04-10, Jaakko Heinonen wrote:
> OK, I have now put together a patch which should avoid the original
> panic you reported.

Have you had a chance to test the patch?

http://www.freebsd.org/cgi/query-pr.cgi?pr=3D132068

--=20
Jaakko

Comment 18 Jaakko Heinonen 2009-04-24 11:14:09 UTC

On 2009-04-23, Weldon Godfrey wrote:
> Around Dec 12 I switched to head.
...
> I'll need to acquire some additional drives so I can install the OS on
> different drives (in case I need to backout) and wait until summer to
> attempt to upgrade.

FYI, the patch is against head. 

-- 
Jaakko

Comment 19 dfilter service freebsd_committer

2009-09-12 20:28:07 UTC

Author: pjd
Date: Sat Sep 12 19:27:54 2009
New Revision: 197131
URL: http://svn.freebsd.org/changeset/base/197131

Log:
  Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
  NULL, but also can point to dead vnode, take that into account.
  
  PR:		kern/132068
  Reported by:	Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
  Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
  MFC after:	1 week

Modified:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Sat Sep 12 19:07:03 2009	(r197130)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Sat Sep 12 19:27:54 2009	(r197131)
@@ -890,8 +890,16 @@ again:
 		if (zp->z_unlinked) {
 			err = ENOENT;
 		} else {
-			if (ZTOV(zp) != NULL)
-				VN_HOLD(ZTOV(zp));
+			if ((vp = ZTOV(zp)) != NULL) {
+				VI_LOCK(vp);
+				if ((vp->v_iflag & VI_DOOMED) != 0) {
+					VI_UNLOCK(vp);
+					vp = NULL;
+				} else
+					VI_UNLOCK(vp);
+			}
+			if (vp != NULL)
+				VN_HOLD(vp);
 			else {
 				if (first) {
 					ZFS_LOG(1, "dying znode detected (zp=%p)", zp);
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"

Comment 20 dfilter service freebsd_committer

2009-09-15 12:14:04 UTC

Author: pjd
Date: Tue Sep 15 11:13:40 2009
New Revision: 197215
URL: http://svn.freebsd.org/changeset/base/197215

Log:
  MFC r196456,r196457,r196458,r196662,r196702,r196703,r196919,r196927,r196928,
  r196943,r196944,r196947,r196950,r196953,r196954,r196965,r196978,r196979,
  r196980,r196982,r196985,r196992,r197131,r197133,r197150,r197151,r197152,
  r197153,r197167,r197172,r197177,r197200,r197201:
  
  r196456:
  - Give minclsyspri and maxclsyspri real values (consulted with kmacy).
  - Honour 'pri' argument for thread_create().
  
  r196457:
  Set priority of vdev_geom threads and zvol threads to PRIBIO.
  
  r196458:
  - Hide ZFS kernel threads under zfskern process.
  - Use better (shorter) threads names:
  	'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
  	'vdev:worker da0' -> 'vdev da0'
  
  r196662:
  Add missing mountpoint vnode locking.
  This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
  regular user tries to mount dataset owned by him.
  
  r196702:
  Remove empty directory.
  
  r196703:
  Backport the 'dirtying dbuf' panic fix from newer ZFS version.
  
  Reported by:	Thomas Backman <serenity@exscape.org>
  
  r196919:
  bzero() on-stack argument, so mutex_init() won't misinterpret that the
  lock is already initialized if we have some garbage on the stack.
  
  PR:	kern/135480
  Reported by:	Emil Mikulic <emikulic@gmail.com>
  
  r196927:
  Changing provider size is not really supported by GEOM, but doing so when
  provider is closed should be ok.
  When administrator requests to change ZVOL size do it immediately if ZVOL
  is closed or do it on last ZVOL close.
  
  PR:	kern/136942
  Requested by:	Bernard Buri <bsd@ask-us.at>
  
  r196928:
  Teach zdb(8) how to obtain GEOM provider size.
  
  PR:	kern/133134
  Reported by:	Philipp Wuensche <cryx-freebsd@h3q.com>
  
  r196943:
  - Avoid holding mutex around M_WAITOK allocations.
  - Add locking for mnt_opt field.
  
  r196944:
  Don't recheck ownership on update mount. This will eliminate LOR between
  vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.
  
  Noticed by:	kib
  Reviewed by:	kib
  
  r196947:
  Defer thread start until we set priority.
  
  Reviewed by:	kib
  
  r196950:
  Fix detection of file system being shared. Now zfs unshare/destroy/rename
  command will properly remove exported file systems.
  
  r196953:
  When snapshot mount point is busy (for example we are still in it)
  we will fail to unmount it, but it won't be removed from the tree,
  so in that case there is no need to reinsert it.
  
  Reported by:	trasz
  
  r196954:
  If we have to use avl_find(), optimize a bit and use avl_insert() instead of
  avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).
  Fix similar case in the code that is currently commented out.
  
  r196965:
  Fix reference count leak for a case where snapshot's mount point is updated.
  
  r196978:
  Call ZFS_EXIT() after locking the vnode.
  
  r196979:
  On FreeBSD we don't have to look for snapshot's mount point,
  because fhtovp method is already called with proper mount point.
  
  r196980:
  When we automatically mount snapshot we want to return vnode of the mount point
  from the lookup and not covered vnode. This is one of the fixes for using .zfs/
  over NFS.
  
  r196982:
  We don't export individual snapshots, so mnt_export field in snapshot's
  mount point is NULL. That's why when we try to access snapshots over NFS
  use mnt_export field from the parent file system.
  
  r196985:
  Only log successful commands! Without this fix we log even unsuccessful
  commands executed by unprivileged users. Action is not really taken, but it is
  logged to pool history, which might be confusing.
  
  Reported by:	Denis Ahrens <denis@h3q.com>
  
  r196992:
  Implement __assert() for Solaris-specific code. Until now Solaris code was
  using Solaris prototype for __assert(), but FreeBSD's implementation.
  Both take different arguments, so we were either core-dumping in assert()
  or printing garbage.
  
  Reported by:	avg
  
  r197131:
  Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
  NULL, but also can point to dead vnode, take that into account.
  
  PR:	kern/132068
  Reported by:	Edward Fisk <7ogcg7g02@sneakemail.com>, kris
  Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
  
  r197133:
  - Protect reclaim with z_teardown_inactive_lock.
  - Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
    z_dbuf field is NULL - this might happen in case of rollback or forced
    unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
  - On forced unmount wait for all znodes to be destroyed - destruction can be
    done asynchronously via zfs_reclaim_complete().
  
  r197150:
  There is a bug where mze_insert() can trigger an assert() of inserting
  the same entry twice. This bug is not fixed yet, but leads to situation
  where when try to access corrupted directory the kernel will panic.
  Until the bug is properly fixed, try to recover from it and log that it
  happened.
  
  Reported by:	marck
  OpenSolaris bug:	6709336
  
  r197151:
  Be sure not to overflow struct fid.
  
  r197152:
  Extend scope of the z_teardown_lock lock for consistency and "just in case".
  
  r197153:
  When zfs.ko is compiled with debug, make sure that znode and vnode point at
  each other.
  
  r197167:
  Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
  by just returning EOPNOTSUPP. This will allow NFS server to fall back to
  regular READDIR.
  Note that converting inode number to snapshot's vnode is expensive operation.
  Snapshots are stored in AVL tree, but based on their names, not inode numbers,
  so to convert inode to snapshot vnode we have to interate over all snalshots.
  This is not a problem in OpenSolaris, because in their READDIRPLUS
  implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
  d_fileno as we do.
  
  PR:	kern/125149
  Reported by:	Weldon Godfrey <wgodfrey@ena.com>
  Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
  
  r197172:
  Add missing \n.
  
  Reported by:	marck
  
  r197177:
  Support both case: when snapshot is already mounted and when it is not yet
  mounted.
  
  r197200:
  Modify mount(8) to skip MNT_IGNORE file systems by default, just like df(1)
  does. This is not POLA violation, because there is no single file system in the
  base that use MNT_IGNORE currently, although ZFS snapshots will be mounted with
  MNT_IGNORE after next commit.
  
  Reviewed by:	kib
  
  r197201:
  - Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
    df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
    ZFS route of not listing snapshots by default with 'zfs list' command.
  - Add UPDATING entry to note that ZFS snapshots are no longer visible in
    mount(8) and df(1) output by default.
  
  Reviewed by:	kib
  
  Approved by:	re (bz)

Added:
  stable/8/cddl/compat/opensolaris/include/assert.h
     - copied unchanged from r196992, head/cddl/compat/opensolaris/include/assert.h
Deleted:
  stable/8/cddl/contrib/opensolaris/head/assert.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/rpc/
Modified:
  stable/8/UPDATING
  stable/8/cddl/compat/opensolaris/   (props changed)
  stable/8/cddl/contrib/opensolaris/   (props changed)
  stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c
  stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
  stable/8/sbin/mount/   (props changed)
  stable/8/sbin/mount/mount.8
  stable/8/sbin/mount/mount.c
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
  stable/8/sys/cddl/compat/opensolaris/sys/mutex.h
  stable/8/sys/cddl/compat/opensolaris/sys/proc.h
  stable/8/sys/cddl/compat/opensolaris/sys/vfs.h
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
  stable/8/sys/cddl/contrib/opensolaris/uts/common/sys/callb.h
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/UPDATING
==============================================================================
--- stable/8/UPDATING	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/UPDATING	Tue Sep 15 11:13:40 2009	(r197215)
@@ -22,6 +22,10 @@ NOTE TO PEOPLE WHO THINK THAT FreeBSD 8.
 	to maximize performance.  (To disable malloc debugging, run
 	ln -s aj /etc/malloc.conf.)
 
+20090915:
+	ZFS snapshots are now mounted with MNT_IGNORE flag. Use -v option for
+	mount(8) and -a option for df(1) to see them.
+
 20090813:
 	Remove the option STOP_NMI.  The default action is now to use NMI
 	only for KDB via the newly introduced function stop_cpus_hard()

Copied: stable/8/cddl/compat/opensolaris/include/assert.h (from r196992, head/cddl/compat/opensolaris/include/assert.h)
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ stable/8/cddl/compat/opensolaris/include/assert.h	Tue Sep 15 11:13:40 2009	(r197215, copy of r196992, head/cddl/compat/opensolaris/include/assert.h)
@@ -0,0 +1,55 @@
+/*-
+ * Copyright (c) 2009 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#undef assert
+#undef _assert
+
+#ifdef NDEBUG
+#define	assert(e)	((void)0)
+#define	_assert(e)	((void)0)
+#else
+#define	_assert(e)	assert(e)
+
+#define	assert(e)	((e) ? (void)0 : __assert(#e, __FILE__, __LINE__))
+#endif /* NDEBUG */
+
+#ifndef _ASSERT_H_
+#define _ASSERT_H_
+#include <stdio.h>
+#include <stdlib.h>
+
+static __inline void
+__assert(const char *expr, const char *file, int line)
+{
+
+	(void)fprintf(stderr, "Assertion failed: (%s), file %s, line %d.\n",
+	    expr, file, line);
+	abort();
+	/* NOTREACHED */
+}
+#endif /* !_ASSERT_H_ */

Modified: stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c
==============================================================================
--- stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -1322,6 +1322,14 @@ dump_label(const char *dev)
 		exit(1);
 	}
 
+	if (S_ISCHR(statbuf.st_mode)) {
+		if (ioctl(fd, DIOCGMEDIASIZE, &statbuf.st_size) == -1) {
+			(void) printf("failed to get size of '%s': %s\n", dev,
+			    strerror(errno));
+			exit(1);
+		}
+	}
+
 	psize = statbuf.st_size;
 	psize = P2ALIGN(psize, (uint64_t)sizeof (vdev_label_t));
 

Modified: stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
==============================================================================
--- stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -172,6 +172,7 @@ is_shared(libzfs_handle_t *hdl, const ch
 
 		*tab = '\0';
 		if (strcmp(buf, mountpoint) == 0) {
+#if defined(sun)
 			/*
 			 * the protocol field is the third field
 			 * skip over second field
@@ -194,6 +195,10 @@ is_shared(libzfs_handle_t *hdl, const ch
 					return (0);
 				}
 			}
+#else
+			if (proto == PROTO_NFS)
+				return (SHARED_NFS);
+#endif
 		}
 	}
 

Modified: stable/8/sbin/mount/mount.8
==============================================================================
--- stable/8/sbin/mount/mount.8	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sbin/mount/mount.8	Tue Sep 15 11:13:40 2009	(r197215)
@@ -469,6 +469,12 @@ or
 option.
 .It Fl v
 Verbose mode.
+If the
+.Fl v
+is used alone, show all file systems, including those that were mounted with the
+.Dv MNT_IGNORE
+flag and show additional information about each file system (including fsid
+when run by root).
 .It Fl w
 The file system object is to be read and write.
 .El

Modified: stable/8/sbin/mount/mount.c
==============================================================================
--- stable/8/sbin/mount/mount.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sbin/mount/mount.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -348,6 +348,9 @@ main(int argc, char *argv[])
 				if (checkvfsname(mntbuf[i].f_fstypename,
 				    vfslist))
 					continue;
+				if (!verbose &&
+				    (mntbuf[i].f_flags & MNT_IGNORE) != 0)
+					continue;
 				prmount(&mntbuf[i]);
 			}
 		}

Modified: stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -45,20 +45,33 @@ vfs_setmntopt(vfs_t *vfsp, const char *n
 {
 	struct vfsopt *opt;
 	size_t namesize;
+	int locked;
+
+	if (!(locked = mtx_owned(MNT_MTX(vfsp))))
+		MNT_ILOCK(vfsp);
 
 	if (vfsp->mnt_opt == NULL) {
-		vfsp->mnt_opt = malloc(sizeof(*vfsp->mnt_opt), M_MOUNT, M_WAITOK);
-		TAILQ_INIT(vfsp->mnt_opt);
+		void *opts;
+
+		MNT_IUNLOCK(vfsp);
+		opts = malloc(sizeof(*vfsp->mnt_opt), M_MOUNT, M_WAITOK);
+		MNT_ILOCK(vfsp);
+		if (vfsp->mnt_opt == NULL) {
+			vfsp->mnt_opt = opts;
+			TAILQ_INIT(vfsp->mnt_opt);
+		} else {
+			free(opts, M_MOUNT);
+		}
 	}
 
-	opt = malloc(sizeof(*opt), M_MOUNT, M_WAITOK);
+	MNT_IUNLOCK(vfsp);
 
+	opt = malloc(sizeof(*opt), M_MOUNT, M_WAITOK);
 	namesize = strlen(name) + 1;
 	opt->name = malloc(namesize, M_MOUNT, M_WAITOK);
 	strlcpy(opt->name, name, namesize);
 	opt->pos = -1;
 	opt->seen = 1;
-
 	if (arg == NULL) {
 		opt->value = NULL;
 		opt->len = 0;
@@ -67,16 +80,23 @@ vfs_setmntopt(vfs_t *vfsp, const char *n
 		opt->value = malloc(opt->len, M_MOUNT, M_WAITOK);
 		bcopy(arg, opt->value, opt->len);
 	}
-	/* TODO: Locking. */
+
+	MNT_ILOCK(vfsp);
 	TAILQ_INSERT_TAIL(vfsp->mnt_opt, opt, link);
+	if (!locked)
+		MNT_IUNLOCK(vfsp);
 }
 
 void
 vfs_clearmntopt(vfs_t *vfsp, const char *name)
 {
+	int locked;
 
-	/* TODO: Locking. */
+	if (!(locked = mtx_owned(MNT_MTX(vfsp))))
+		MNT_ILOCK(vfsp);
 	vfs_deleteopt(vfsp->mnt_opt, name);
+	if (!locked)
+		MNT_IUNLOCK(vfsp);
 }
 
 int
@@ -92,12 +112,13 @@ vfs_optionisset(const vfs_t *vfsp, const
 }
 
 int
-domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
+mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype, char *fspath,
     char *fspec, int fsflags)
 {
 	struct mount *mp;
 	struct vfsconf *vfsp;
 	struct ucred *cr;
+	vnode_t *vp;
 	int error;
 
 	/*
@@ -112,23 +133,28 @@ domount(kthread_t *td, vnode_t *vp, cons
 	if (vfsp == NULL)
 		return (ENODEV);
 
+	vp = *vpp;
 	if (vp->v_type != VDIR)
 		return (ENOTDIR);
+	/*
+	 * We need vnode lock to protect v_mountedhere and vnode interlock
+	 * to protect v_iflag.
+	 */
+	vn_lock(vp, LK_SHARED | LK_RETRY);
 	VI_LOCK(vp);
-	if ((vp->v_iflag & VI_MOUNT) != 0 ||
-	    vp->v_mountedhere != NULL) {
+	if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) {
 		VI_UNLOCK(vp);
+		VOP_UNLOCK(vp, 0);
 		return (EBUSY);
 	}
 	vp->v_iflag |= VI_MOUNT;
 	VI_UNLOCK(vp);
+	VOP_UNLOCK(vp, 0);
 
 	/*
 	 * Allocate and initialize the filesystem.
 	 */
-	vn_lock(vp, LK_SHARED | LK_RETRY);
 	mp = vfs_mount_alloc(vp, vfsp, fspath, td->td_ucred);
-	VOP_UNLOCK(vp, 0);
 
 	mp->mnt_optnew = NULL;
 	vfs_setmntopt(mp, "from", fspec, 0);
@@ -138,11 +164,18 @@ domount(kthread_t *td, vnode_t *vp, cons
 	/*
 	 * Set the mount level flags.
 	 */
-	if (fsflags & MNT_RDONLY)
-		mp->mnt_flag |= MNT_RDONLY;
-	mp->mnt_flag &=~ MNT_UPDATEMASK;
+	mp->mnt_flag &= ~MNT_UPDATEMASK;
 	mp->mnt_flag |= fsflags & (MNT_UPDATEMASK | MNT_FORCE | MNT_ROOTFS);
 	/*
+	 * Snapshots are always read-only.
+	 */
+	mp->mnt_flag |= MNT_RDONLY;
+	/*
+	 * We don't want snapshots to be visible in regular
+	 * mount(8) and df(1) output.
+	 */
+	mp->mnt_flag |= MNT_IGNORE;
+	/*
 	 * Unprivileged user can trigger mounting a snapshot, but we don't want
 	 * him to unmount it, so we switch to privileged of original mount.
 	 */
@@ -150,11 +183,6 @@ domount(kthread_t *td, vnode_t *vp, cons
 	mp->mnt_cred = crdup(vp->v_mount->mnt_cred);
 	mp->mnt_stat.f_owner = mp->mnt_cred->cr_uid;
 	/*
-	 * Mount the filesystem.
-	 * XXX The final recipients of VFS_MOUNT just overwrite the ndp they
-	 * get.  No freeing of cn_pnbuf.
-	 */
-	/*
 	 * XXX: This is evil, but we can't mount a snapshot as a regular user.
 	 * XXX: Is is safe when snapshot is mounted from within a jail?
 	 */
@@ -163,7 +191,7 @@ domount(kthread_t *td, vnode_t *vp, cons
 	error = VFS_MOUNT(mp);
 	td->td_ucred = cr;
 
-	if (!error) {
+	if (error == 0) {
 		if (mp->mnt_opt != NULL)
 			vfs_freeopts(mp->mnt_opt);
 		mp->mnt_opt = mp->mnt_optnew;
@@ -175,42 +203,33 @@ domount(kthread_t *td, vnode_t *vp, cons
 	*/
 	mp->mnt_optnew = NULL;
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-	/*
-	 * Put the new filesystem on the mount list after root.
-	 */
 #ifdef FREEBSD_NAMECACHE
 	cache_purge(vp);
 #endif
-	if (!error) {
+	VI_LOCK(vp);
+	vp->v_iflag &= ~VI_MOUNT;
+	VI_UNLOCK(vp);
+	if (error == 0) {
 		vnode_t *mvp;
 
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
 		vp->v_mountedhere = mp;
+		/*
+		 * Put the new filesystem on the mount list.
+		 */
 		mtx_lock(&mountlist_mtx);
 		TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list);
 		mtx_unlock(&mountlist_mtx);
 		vfs_event_signal(NULL, VQ_MOUNT, 0);
 		if (VFS_ROOT(mp, LK_EXCLUSIVE, &mvp))
 			panic("mount: lost mount");
-		mountcheckdirs(vp, mvp);
-		vput(mvp);
-		VOP_UNLOCK(vp, 0);
-		if ((mp->mnt_flag & MNT_RDONLY) == 0)
-			error = vfs_allocate_syncvnode(mp);
+		vput(vp);
 		vfs_unbusy(mp);
-		if (error)
-			vrele(vp);
-		else
-			vfs_mountedfrom(mp, fspec);
+		*vpp = mvp;
 	} else {
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
-		VOP_UNLOCK(vp, 0);
+		vput(vp);
 		vfs_unbusy(mp);
 		vfs_mount_destroy(mp);
+		*vpp = NULL;
 	}
 	return (error);
 }

Modified: stable/8/sys/cddl/compat/opensolaris/sys/mutex.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/mutex.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/mutex.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -32,9 +32,9 @@
 #ifdef _KERNEL
 
 #include <sys/param.h>
-#include <sys/proc.h>
 #include <sys/lock.h>
 #include_next <sys/mutex.h>
+#include <sys/proc.h>
 #include <sys/sx.h>
 
 typedef enum {

Modified: stable/8/sys/cddl/compat/opensolaris/sys/proc.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/proc.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/proc.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -34,13 +34,17 @@
 #include_next <sys/proc.h>
 #include <sys/stdint.h>
 #include <sys/smp.h>
+#include <sys/sched.h>
+#include <sys/lock.h>
+#include <sys/mutex.h>
+#include <sys/unistd.h>
 #include <sys/debug.h>
 
 #ifdef _KERNEL
 
 #define	CPU		curcpu
-#define	minclsyspri	0
-#define	maxclsyspri	0
+#define	minclsyspri	PRIBIO
+#define	maxclsyspri	PVM
 #define	max_ncpus	mp_ncpus
 #define	boot_max_ncpus	mp_ncpus
 
@@ -54,11 +58,13 @@ typedef	struct thread	kthread_t;
 typedef struct thread	*kthread_id_t;
 typedef struct proc	proc_t;
 
+extern struct proc *zfsproc;
+
 static __inline kthread_t *
 thread_create(caddr_t stk, size_t stksize, void (*proc)(void *), void *arg,
     size_t len, proc_t *pp, int state, pri_t pri)
 {
-	proc_t *p;
+	kthread_t *td = NULL;
 	int error;
 
 	/*
@@ -67,13 +73,20 @@ thread_create(caddr_t stk, size_t stksiz
 	ASSERT(stk == NULL);
 	ASSERT(len == 0);
 	ASSERT(state == TS_RUN);
+	ASSERT(pp == &p0);
 
-	error = kproc_create(proc, arg, &p, 0, stksize / PAGE_SIZE,
-	    "solthread %p", proc);
-	return (error == 0 ? FIRST_THREAD_IN_PROC(p) : NULL);
+	error = kproc_kthread_add(proc, arg, &zfsproc, &td, RFSTOPPED,
+	    stksize / PAGE_SIZE, "zfskern", "solthread %p", proc);
+	if (error == 0) {
+		thread_lock(td);
+		sched_prio(td, pri);
+		sched_add(td, SRQ_BORING);
+		thread_unlock(td);
+	}
+	return (td);
 }
 
-#define	thread_exit()	kproc_exit(0)
+#define	thread_exit()	kthread_exit()
 
 #endif	/* _KERNEL */
 

Modified: stable/8/sys/cddl/compat/opensolaris/sys/vfs.h
==============================================================================
--- stable/8/sys/cddl/compat/opensolaris/sys/vfs.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/compat/opensolaris/sys/vfs.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -110,8 +110,8 @@ void vfs_setmntopt(vfs_t *vfsp, const ch
     int flags __unused);
 void vfs_clearmntopt(vfs_t *vfsp, const char *name);
 int vfs_optionisset(const vfs_t *vfsp, const char *opt, char **argp);
-int domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
-    char *fspec, int fsflags);
+int mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype,
+    char *fspath, char *fspec, int fsflags);
 
 typedef	uint64_t	vfs_feature_t;
 

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -864,10 +864,11 @@ restore_object(struct restorearg *ra, ob
 		/* currently allocated, want to be allocated */
 		dmu_tx_hold_bonus(tx, drro->drr_object);
 		/*
-		 * We may change blocksize, so need to
-		 * hold_write
+		 * We may change blocksize and delete old content,
+		 * so need to hold_write and hold_free.
 		 */
 		dmu_tx_hold_write(tx, drro->drr_object, 0, 1);
+		dmu_tx_hold_free(tx, drro->drr_object, 0, DMU_OBJECT_END);
 		err = dmu_tx_assign(tx, TXG_WAIT);
 		if (err) {
 			dmu_tx_abort(tx);

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -415,7 +415,7 @@ void
 dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
     dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
 {
-	int i, old_nblkptr;
+	int i, nblkptr;
 	dmu_buf_impl_t *db = NULL;
 
 	ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
@@ -445,6 +445,8 @@ dnode_reallocate(dnode_t *dn, dmu_object
 		dnode_free_range(dn, 0, -1ULL, tx);
 	}
 
+	nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
+
 	/* change blocksize */
 	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
 	if (blocksize != dn->dn_datablksz &&
@@ -457,6 +459,8 @@ dnode_reallocate(dnode_t *dn, dmu_object
 	dnode_setdirty(dn, tx);
 	dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
 	dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
+	if (dn->dn_nblkptr != nblkptr)
+		dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
 	rw_exit(&dn->dn_struct_rwlock);
 	if (db)
 		dbuf_rele(db, FTAG);
@@ -466,19 +470,15 @@ dnode_reallocate(dnode_t *dn, dmu_object
 
 	/* change bonus size and type */
 	mutex_enter(&dn->dn_mtx);
-	old_nblkptr = dn->dn_nblkptr;
 	dn->dn_bonustype = bonustype;
 	dn->dn_bonuslen = bonuslen;
-	dn->dn_nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
+	dn->dn_nblkptr = nblkptr;
 	dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
 	dn->dn_compress = ZIO_COMPRESS_INHERIT;
 	ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
 
-	/* XXX - for now, we can't make nblkptr smaller */
-	ASSERT3U(dn->dn_nblkptr, >=, old_nblkptr);
-
-	/* fix up the bonus db_size if dn_nblkptr has changed */
-	if (dn->dn_bonus && dn->dn_bonuslen != old_nblkptr) {
+	/* fix up the bonus db_size */
+	if (dn->dn_bonus) {
 		dn->dn_bonus->db.db_size =
 		    DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
 		ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,12 +19,10 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
-#pragma ident	"%Z%%M%	%I%	%E% SMI"
-
 #include <sys/zfs_context.h>
 #include <sys/dbuf.h>
 #include <sys/dnode.h>
@@ -534,18 +532,12 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 			/* XXX shouldn't the phys already be zeroed? */
 			bzero(dnp, DNODE_CORE_SIZE);
 			dnp->dn_nlevels = 1;
+			dnp->dn_nblkptr = dn->dn_nblkptr;
 		}
 
-		if (dn->dn_nblkptr > dnp->dn_nblkptr) {
-			/* zero the new blkptrs we are gaining */
-			bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
-			    sizeof (blkptr_t) *
-			    (dn->dn_nblkptr - dnp->dn_nblkptr));
-		}
 		dnp->dn_type = dn->dn_type;
 		dnp->dn_bonustype = dn->dn_bonustype;
 		dnp->dn_bonuslen = dn->dn_bonuslen;
-		dnp->dn_nblkptr = dn->dn_nblkptr;
 	}
 
 	ASSERT(dnp->dn_nlevels > 1 ||
@@ -605,6 +597,30 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 		return;
 	}
 
+	if (dn->dn_next_nblkptr[txgoff]) {
+		/* this should only happen on a realloc */
+		ASSERT(dn->dn_allocated_txg == tx->tx_txg);
+		if (dn->dn_next_nblkptr[txgoff] > dnp->dn_nblkptr) {
+			/* zero the new blkptrs we are gaining */
+			bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
+			    sizeof (blkptr_t) *
+			    (dn->dn_next_nblkptr[txgoff] - dnp->dn_nblkptr));
+#ifdef ZFS_DEBUG
+		} else {
+			int i;
+			ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
+			/* the blkptrs we are losing better be unallocated */
+			for (i = dn->dn_next_nblkptr[txgoff];
+			    i < dnp->dn_nblkptr; i++)
+				ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
+#endif
+		}
+		mutex_enter(&dn->dn_mtx);
+		dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
+		dn->dn_next_nblkptr[txgoff] = 0;
+		mutex_exit(&dn->dn_mtx);
+	}
+
 	if (dn->dn_next_nlevels[txgoff]) {
 		dnode_increase_indirection(dn, tx);
 		dn->dn_next_nlevels[txgoff] = 0;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -1419,6 +1419,7 @@ dsl_dataset_drain_refs(dsl_dataset_t *ds
 {
 	struct refsarg arg;
 
+	bzero(&arg, sizeof(arg));
 	mutex_init(&arg.lock, NULL, MUTEX_DEFAULT, NULL);
 	cv_init(&arg.cv, NULL, CV_DEFAULT, NULL);
 	arg.gone = FALSE;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -160,6 +160,7 @@ typedef struct dnode {
 	uint16_t dn_datablkszsec;	/* in 512b sectors */
 	uint32_t dn_datablksz;		/* in bytes */
 	uint64_t dn_maxblkid;
+	uint8_t dn_next_nblkptr[TXG_SIZE];
 	uint8_t dn_next_nlevels[TXG_SIZE];
 	uint8_t dn_next_indblkshift[TXG_SIZE];
 	uint16_t dn_next_bonuslen[TXG_SIZE];

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Tue Sep 15 11:13:40 2009	(r197215)
@@ -231,8 +231,27 @@ typedef struct znode {
 /*
  * Convert between znode pointers and vnode pointers
  */
+#ifdef DEBUG
+static __inline vnode_t *
+ZTOV(znode_t *zp)
+{
+	vnode_t *vp = zp->z_vnode;
+
+	ASSERT(vp == NULL || vp->v_data == NULL || vp->v_data == zp);
+	return (vp);
+}
+static __inline znode_t *
+VTOZ(vnode_t *vp)
+{
+	znode_t *zp = (znode_t *)vp->v_data;
+
+	ASSERT(zp == NULL || zp->z_vnode == NULL || zp->z_vnode == vp);
+	return (zp);
+}
+#else
 #define	ZTOV(ZP)	((ZP)->z_vnode)
 #define	VTOZ(VP)	((znode_t *)(VP)->v_data)
+#endif
 
 /*
  * ZFS_ENTER() is called on entry to each ZFS vnode and vfs operation.

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -194,6 +194,10 @@ vdev_geom_worker(void *arg)
 	zio_t *zio;
 	struct bio *bp;
 
+	thread_lock(curthread);
+	sched_prio(curthread, PRIBIO);
+	thread_unlock(curthread);
+
 	ctx = arg;
 	for (;;) {
 		mtx_lock(&ctx->gc_queue_mtx);
@@ -203,7 +207,7 @@ vdev_geom_worker(void *arg)
 				ctx->gc_state = 2;
 				wakeup_one(&ctx->gc_state);
 				mtx_unlock(&ctx->gc_queue_mtx);
-				kproc_exit(0);
+				kthread_exit();
 			}
 			msleep(&ctx->gc_queue, &ctx->gc_queue_mtx,
 			    PRIBIO | PDROP, "vgeom:io", 0);
@@ -530,8 +534,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psi
 	vd->vdev_tsd = ctx;
 	pp = cp->provider;
 
-	kproc_create(vdev_geom_worker, ctx, NULL, 0, 0, "vdev:worker %s",
-	    pp->name);
+	kproc_kthread_add(vdev_geom_worker, ctx, &zfsproc, NULL, 0, 0,
+	    "zfskern", "vdev %s", pp->name);
 
 	/*
 	 * Determine the actual size of the device.

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -181,10 +181,11 @@ mze_compare(const void *arg1, const void
 	return (0);
 }
 
-static void
+static int
 mze_insert(zap_t *zap, int chunkid, uint64_t hash, mzap_ent_phys_t *mzep)
 {
 	mzap_ent_t *mze;
+	avl_index_t idx;
 
 	ASSERT(zap->zap_ismicro);
 	ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
@@ -194,7 +195,12 @@ mze_insert(zap_t *zap, int chunkid, uint
 	mze->mze_chunkid = chunkid;
 	mze->mze_hash = hash;
 	mze->mze_phys = *mzep;
-	avl_add(&zap->zap_m.zap_avl, mze);
+	if (avl_find(&zap->zap_m.zap_avl, mze, &idx) != NULL) {
+		kmem_free(mze, sizeof (mzap_ent_t));
+		return (EEXIST);
+	}
+	avl_insert(&zap->zap_m.zap_avl, mze, idx);
+	return (0);
 }
 
 static mzap_ent_t *
@@ -329,10 +335,15 @@ mzap_open(objset_t *os, uint64_t obj, dm
 			if (mze->mze_name[0]) {
 				zap_name_t *zn;
 
-				zap->zap_m.zap_num_entries++;
 				zn = zap_name_alloc(zap, mze->mze_name,
 				    MT_EXACT);
-				mze_insert(zap, i, zn->zn_hash, mze);
+				if (mze_insert(zap, i, zn->zn_hash, mze) == 0)
+					zap->zap_m.zap_num_entries++;
+				else {
+					printf("ZFS WARNING: Duplicated ZAP "
+					    "entry detected (%s).\n",
+					    mze->mze_name);
+				}
 				zap_name_free(zn);
 			}
 		}
@@ -771,7 +782,7 @@ again:
 			if (zap->zap_m.zap_alloc_next ==
 			    zap->zap_m.zap_num_chunks)
 				zap->zap_m.zap_alloc_next = 0;
-			mze_insert(zap, i, zn->zn_hash, mze);
+			VERIFY(0 == mze_insert(zap, i, zn->zn_hash, mze));
 			return;
 		}
 	}

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -669,9 +669,12 @@ zfsctl_snapdir_remove(vnode_t *dvp, char
 	if (sep) {
 		avl_remove(&sdp->sd_snaps, sep);
 		err = zfsctl_unmount_snap(sep, MS_FORCE, cr);
-		if (err)
-			avl_add(&sdp->sd_snaps, sep);
-		else
+		if (err) {
+			avl_index_t where;
+
+			if (avl_find(&sdp->sd_snaps, sep, &where) == NULL)
+				avl_insert(&sdp->sd_snaps, sep, where);
+		} else
 			err = dmu_objset_destroy(snapname);
 	} else {
 		err = ENOENT;
@@ -877,20 +880,20 @@ domount:
 	mountpoint = kmem_alloc(mountpoint_len, KM_SLEEP);
 	(void) snprintf(mountpoint, mountpoint_len, "%s/.zfs/snapshot/%s",
 	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
-	err = domount(curthread, *vpp, "zfs", mountpoint, snapname, 0);
+	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
 	kmem_free(mountpoint, mountpoint_len);
-	/* FreeBSD: This line was moved from below to avoid a lock recursion. */
-	if (err == 0)
-		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
-	mutex_exit(&sdp->sd_lock);
-	/*
-	 * If we had an error, drop our hold on the vnode and
-	 * zfsctl_snapshot_inactive() will clean up.
-	 */
-	if (err) {
-		VN_RELE(*vpp);
-		*vpp = NULL;
+	if (err == 0) {
+		/*
+		 * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
+		 *
+		 * This is where we lie about our v_vfsp in order to
+		 * make .zfs/snapshot/<snapname> accessible over NFS
+		 * without requiring manual mounts of <snapname>.
+		 */
+		ASSERT(VTOZ(*vpp)->z_zfsvfs != zfsvfs);
+		VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
 	}
+	mutex_exit(&sdp->sd_lock);
 	ZFS_EXIT(zfsvfs);
 	return (err);
 }
@@ -1344,7 +1347,17 @@ zfsctl_umount_snapshots(vfs_t *vfsp, int
 		if (vn_ismntpt(sep->se_root)) {
 			error = zfsctl_unmount_snap(sep, fflags, cr);
 			if (error) {
-				avl_add(&sdp->sd_snaps, sep);
+				avl_index_t where;
+
+				/*
+				 * Before reinserting snapshot to the tree,
+				 * check if it was actually removed. For example
+				 * when snapshot mount point is busy, we will
+				 * have an error here, but there will be no need
+				 * to reinsert snapshot.
+				 */
+				if (avl_find(&sdp->sd_snaps, sep, &where) == NULL)
+					avl_insert(&sdp->sd_snaps, sep, where);
 				break;
 			}
 		}

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -3021,8 +3021,10 @@ zfsdev_ioctl(struct cdev *dev, u_long cm
 	if (error == 0)
 		error = zfs_ioc_vec[vec].zvec_func(zc);
 
-	if (zfs_ioc_vec[vec].zvec_his_log == B_TRUE)
-		zfs_log_history(zc);
+	if (error == 0) {
+		if (zfs_ioc_vec[vec].zvec_his_log == B_TRUE)
+			zfs_log_history(zc);
+	}
 
 	return (error);
 }
@@ -3057,6 +3059,7 @@ zfsdev_fini(void)
 }
 
 static struct root_hold_token *zfs_root_token;
+struct proc *zfsproc;
 
 uint_t zfs_fsyncer_key;
 extern uint_t rrw_tsd_key;

Modified: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
==============================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Tue Sep 15 02:25:03 2009	(r197214)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Tue Sep 15 11:13:40 2009	(r197215)
@@ -97,6 +97,8 @@ static int zfs_root(vfs_t *vfsp, int fla
 static int zfs_statfs(vfs_t *vfsp, struct statfs *statp);
 static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp);
 static int zfs_sync(vfs_t *vfsp, int waitfor);
+static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp, int *numsecflavors, int **secflavors);
 static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp);
 static void zfs_objset_close(zfsvfs_t *zfsvfs);
 static void zfs_freevfs(vfs_t *vfsp);
@@ -108,6 +110,7 @@ static struct vfsops zfs_vfsops = {
 	.vfs_statfs =		zfs_statfs,
 	.vfs_vget =		zfs_vget,
 	.vfs_sync =		zfs_sync,
+	.vfs_checkexp =		zfs_checkexp,
 	.vfs_fhtovp =		zfs_fhtovp,
 };
 
@@ -337,6 +340,13 @@ zfs_register_callbacks(vfs_t *vfsp)
 	os = zfsvfs->z_os;
 
 	/*
+	 * This function can be called for a snapshot when we update snapshot's
+	 * mount point, which isn't really supported.
+	 */
+	if (dmu_objset_is_snapshot(os))
+		return (EOPNOTSUPP);
+
+	/*
 	 * The act of registering our callbacks will destroy any mount
 	 * options we may have.  In order to enable temporary overrides
 	 * of mount options, we stash away the current values and
@@ -719,7 +729,10 @@ zfs_mount(vfs_t *vfsp)
 	error = secpolicy_fs_mount(cr, mvp, vfsp);
 	if (error) {
 		error = dsl_deleg_access(osname, ZFS_DELEG_PERM_MOUNT, cr);
-		if (error == 0) {
+		if (error != 0)
+			goto out;
+
+		if (!(vfsp->vfs_flag & MS_REMOUNT)) {
 			vattr_t		vattr;
 
 			/*
@@ -729,7 +742,9 @@ zfs_mount(vfs_t *vfsp)
 
 			vattr.va_mask = AT_UID;
 
+			vn_lock(mvp, LK_SHARED | LK_RETRY);
 			if (error = VOP_GETATTR(mvp, &vattr, cr)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
 
@@ -741,18 +756,19 @@ zfs_mount(vfs_t *vfsp)
 			}
 #else
 			if (error = secpolicy_vnode_owner(mvp, cr, vattr.va_uid)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
 
 			if (error = VOP_ACCESS(mvp, VWRITE, cr, td)) {
+				VOP_UNLOCK(mvp, 0);
 				goto out;
 			}
+			VOP_UNLOCK(mvp, 0);
 #endif
-
-			secpolicy_fs_mount_clearopts(cr, vfsp);
-		} else {
-			goto out;
 		}
+
+		secpolicy_fs_mount_clearopts(cr, vfsp);
 	}
 
 	/*
@@ -931,6 +947,18 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea
 		zfsvfs->z_unmounted = B_TRUE;
 		rrw_exit(&zfsvfs->z_teardown_lock, FTAG);
 		rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+#ifdef __FreeBSD__
+		/*
+		 * Some znodes might not be fully reclaimed, wait for them.
+		 */
+		mutex_enter(&zfsvfs->z_znodes_lock);
+		while (list_head(&zfsvfs->z_all_znodes) != NULL) {
+			msleep(zfsvfs, &zfsvfs->z_znodes_lock, 0,
+			    "zteardown", 0);
+		}
+		mutex_exit(&zfsvfs->z_znodes_lock);
+#endif
 	}
 
 	/*
@@ -1086,6 +1114,20 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 	znode_t		*zp;
 	int 		err;
 
+	/*
+	 * XXXPJD: zfs_zget() can't operate on virtual entires like .zfs/ or
+	 * .zfs/snapshot/ directories, so for now just return EOPNOTSUPP.
+	 * This will make NFS to fall back to using READDIR instead of
+	 * READDIRPLUS.
+	 * Also snapshots are stored in AVL tree, but based on their names,
+	 * not inode numbers, so it will be very inefficient to iterate
+	 * over all snapshots to find the right one.
+	 * Note that OpenSolaris READDIRPLUS implementation does LOOKUP on
+	 * d_name, and not VGET on d_fileno as we do.
+	 */
+	if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR)
+		return (EOPNOTSUPP);
+
 	ZFS_ENTER(zfsvfs);
 	err = zfs_zget(zfsvfs, ino, &zp);
 	if (err == 0 && zp->z_unlinked) {
@@ -1103,6 +1145,28 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 }
 
 static int
+zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp, int *numsecflavors, int **secflavors)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+
+	/*
+	 * If this is regular file system vfsp is the same as
+	 * zfsvfs->z_parent->z_vfs, but if it is snapshot,
+	 * zfsvfs->z_parent->z_vfs represents parent file system
+	 * which we have to use here, because only this file system
+	 * has mnt_export configured.
+	 */
+	vfsp = zfsvfs->z_parent->z_vfs;
+
+	return (vfs_stdcheckexp(zfsvfs->z_parent->z_vfs, nam, extflagsp,
+	    credanonp, numsecflavors, secflavors));
+}
+
+CTASSERT(SHORT_FID_LEN <= sizeof(struct fid));
+CTASSERT(LONG_FID_LEN <= sizeof(struct fid));
+
+static int
 zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp)
 {
 	zfsvfs_t	*zfsvfs = vfsp->vfs_data;
@@ -1117,7 +1181,11 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 
 	ZFS_ENTER(zfsvfs);
 
-	if (fidp->fid_len == LONG_FID_LEN) {
+	/*
+	 * On FreeBSD we can get snapshot's mount point or its parent file
+	 * system mount point depending if snapshot is already mounted or not.
+	 */
+	if (zfsvfs->z_parent == zfsvfs && fidp->fid_len == LONG_FID_LEN) {
 		zfid_long_t	*zlfid = (zfid_long_t *)fidp;
 		uint64_t	objsetid = 0;
 		uint64_t	setgen = 0;
@@ -1160,9 +1228,8 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 		} else {
 			VN_HOLD(*vpp);
 		}
-		ZFS_EXIT(zfsvfs);
-		/* XXX: LK_RETRY? */
 		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
+		ZFS_EXIT(zfsvfs);
 		return (0);
 	}
 
@@ -1184,7 +1251,6 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 	}
 
 	*vpp = ZTOV(zp);
-	/* XXX: LK_RETRY? */
 	vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY);
 	vnode_create_vobject(*vpp, zp->z_phys->zp_size, curthread);
 	ZFS_EXIT(zfsvfs);

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"

Comment 21 dfilter service freebsd_committer

2010-01-06 16:10:16 UTC

Author: netchild
Date: Wed Jan  6 16:09:58 2010
New Revision: 201651
URL: http://svn.freebsd.org/changeset/base/201651

Log:
  MFC several ZFS related commits:
  
  r196980:
  ---snip---
      When we automatically mount snapshot we want to return vnode of the mount point
      from the lookup and not covered vnode. This is one of the fixes for using .zfs/
      over NFS.
  ---snip---
  
  r196982:
  ---snip---
      We don't export individual snapshots, so mnt_export field in snapshot's
      mount point is NULL. That's why when we try to access snapshots over NFS
      use mnt_export field from the parent file system.
  ---snip---
  
  r197131:
  ---snip---
      Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
      NULL, but also can point to dead vnode, take that into account.
  
      PR:				kern/132068
      Reported by:		Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
      Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
  ---snip---
  
  r197133:
  ---snip---
      - Protect reclaim with z_teardown_inactive_lock.
      - Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
        z_dbuf field is NULL - this might happen in case of rollback or forced
        unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
      - On forced unmount wait for all znodes to be destroyed - destruction can be
        done asynchronously via zfs_reclaim_complete().
  ---snip---
  
  r197153:
  ---snip---
      When zfs.ko is compiled with debug, make sure that znode and vnode point at
      each other.
  ---snip---
  
  r197167:
  ---snip---
      Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
      by just returning EOPNOTSUPP. This will allow NFS server to fall back to
      regular READDIR.
  
      Note that converting inode number to snapshot's vnode is expensive operation.
      Snapshots are stored in AVL tree, but based on their names, not inode numbers,
      so to convert inode to snapshot vnode we have to interate over all snalshots.
  
      This is not a problem in OpenSolaris, because in their READDIRPLUS
      implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
      d_fileno as we do.
  
      PR:			kern/125149
      Reported by:	Weldon Godfrey <wgodfrey@ena.com>
      Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
  ---snip---
  
  r197177:
  ---snip---
      Support both case: when snapshot is already mounted and when it is not yet
      mounted.
  ---snip---
  
  r197201:
  ---snip---
      - Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
        df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
        ZFS route of not listing snapshots by default with 'zfs list' command.
      - Add UPDATING entry to note that ZFS snapshots are no longer visible in
        mount(8) and df(1) output by default.
  
      Reviewed by:	kib
  ---snip---
  Note: the MNT_IGNORE part is commented out in this commit and the UPDATING
  entry is not merged, as this would be a POLA violation on a stable branch.
  This revision is included here, as it also makes locking changes and makes
  sure that a snapshot is mounted RO.
  
  r197426:
  ---snip---
      Restore BSD behaviour - when creating new directory entry use parent directory
      gid to set group ownership and not process gid.
  
      This was overlooked during v6 -> v13 switch.
  
      PR:			kern/139076
      Reported by:	Sean Winn <sean@gothic.net.au>
  ---snip---
  
  r197458:
  ---snip---
      Close race in zfs_zget(). We have to increase usecount first and then
      check for VI_DOOMED flag. Before this change vnode could be reclaimed
      between checking for the flag and increasing usecount.
  ---snip---

Modified:
  stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
  stable/7/sys/cddl/compat/opensolaris/sys/vfs.h
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/cddl/contrib/opensolaris/   (props changed)
  stable/7/sys/contrib/dev/acpica/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
==============================================================================
--- stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -115,12 +115,13 @@ extern struct mount *vfs_mount_alloc(str
     const char *fspath, struct thread *td);
 
 int
-domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
+mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype, char *fspath,
     char *fspec, int fsflags)
 {
 	struct mount *mp;
 	struct vfsconf *vfsp;
 	struct ucred *cr;
+	vnode_t *vp;
 	int error;
 
 	/*
@@ -135,23 +136,28 @@ domount(kthread_t *td, vnode_t *vp, cons
 	if (vfsp == NULL)
 		return (ENODEV);
 
+	vp = *vpp;
 	if (vp->v_type != VDIR)
 		return (ENOTDIR);
+	/*
+	 * We need vnode lock to protect v_mountedhere and vnode interlock
+	 * to protect v_iflag.
+	 */
+	vn_lock(vp, LK_SHARED | LK_RETRY, td);
 	VI_LOCK(vp);
-	if ((vp->v_iflag & VI_MOUNT) != 0 ||
-	    vp->v_mountedhere != NULL) {
+	if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) {
 		VI_UNLOCK(vp);
+		VOP_UNLOCK(vp, 0, td);
 		return (EBUSY);
 	}
 	vp->v_iflag |= VI_MOUNT;
 	VI_UNLOCK(vp);
+	VOP_UNLOCK(vp, 0, td);
 
 	/*
 	 * Allocate and initialize the filesystem.
 	 */
-	vn_lock(vp, LK_SHARED | LK_RETRY, td);
 	mp = vfs_mount_alloc(vp, vfsp, fspath, td);
-	VOP_UNLOCK(vp, 0,td);
 
 	mp->mnt_optnew = NULL;
 	vfs_setmntopt(mp, "from", fspec, 0);
@@ -161,11 +167,20 @@ domount(kthread_t *td, vnode_t *vp, cons
 	/*
 	 * Set the mount level flags.
 	 */
-	if (fsflags & MNT_RDONLY)
-		mp->mnt_flag |= MNT_RDONLY;
-	mp->mnt_flag &=~ MNT_UPDATEMASK;
+	mp->mnt_flag &= ~MNT_UPDATEMASK;
 	mp->mnt_flag |= fsflags & (MNT_UPDATEMASK | MNT_FORCE | MNT_ROOTFS);
 	/*
+	 * Snapshots are always read-only.
+	 */
+	mp->mnt_flag |= MNT_RDONLY;
+#if 0
+	/*
+	 * We don't want snapshots to be visible in regular
+	 * mount(8) and df(1) output.
+	 */
+	mp->mnt_flag |= MNT_IGNORE;
+#endif
+	/*
 	 * Unprivileged user can trigger mounting a snapshot, but we don't want
 	 * him to unmount it, so we switch to privileged of original mount.
 	 */
@@ -173,11 +188,6 @@ domount(kthread_t *td, vnode_t *vp, cons
 	mp->mnt_cred = crdup(vp->v_mount->mnt_cred);
 	mp->mnt_stat.f_owner = mp->mnt_cred->cr_uid;
 	/*
-	 * Mount the filesystem.
-	 * XXX The final recipients of VFS_MOUNT just overwrite the ndp they
-	 * get.  No freeing of cn_pnbuf.
-	 */
-	/*
 	 * XXX: This is evil, but we can't mount a snapshot as a regular user.
 	 * XXX: Is is safe when snapshot is mounted from within a jail?
 	 */
@@ -186,7 +196,7 @@ domount(kthread_t *td, vnode_t *vp, cons
 	error = VFS_MOUNT(mp, td);
 	td->td_ucred = cr;
 
-	if (!error) {
+	if (error == 0) {
 		if (mp->mnt_opt != NULL)
 			vfs_freeopts(mp->mnt_opt);
 		mp->mnt_opt = mp->mnt_optnew;
@@ -198,42 +208,33 @@ domount(kthread_t *td, vnode_t *vp, cons
 	*/
 	mp->mnt_optnew = NULL;
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
-	/*
-	 * Put the new filesystem on the mount list after root.
-	 */
 #ifdef FREEBSD_NAMECACHE
 	cache_purge(vp);
 #endif
-	if (!error) {
+	VI_LOCK(vp);
+	vp->v_iflag &= ~VI_MOUNT;
+	VI_UNLOCK(vp);
+	if (error == 0) {
 		vnode_t *mvp;
 
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
 		vp->v_mountedhere = mp;
+		/*
+		 * Put the new filesystem on the mount list.
+		 */
 		mtx_lock(&mountlist_mtx);
 		TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list);
 		mtx_unlock(&mountlist_mtx);
 		vfs_event_signal(NULL, VQ_MOUNT, 0);
 		if (VFS_ROOT(mp, LK_EXCLUSIVE, &mvp, td))
 			panic("mount: lost mount");
-		mountcheckdirs(vp, mvp);
-		vput(mvp);
-		VOP_UNLOCK(vp, 0, td);
-		if ((mp->mnt_flag & MNT_RDONLY) == 0)
-			error = vfs_allocate_syncvnode(mp);
+		vput(vp);
 		vfs_unbusy(mp, td);
-		if (error)
-			vrele(vp);
-		else
-			vfs_mountedfrom(mp, fspec);
+		*vpp = mvp;
 	} else {
-		VI_LOCK(vp);
-		vp->v_iflag &= ~VI_MOUNT;
-		VI_UNLOCK(vp);
-		VOP_UNLOCK(vp, 0, td);
+		vput(vp);
 		vfs_unbusy(mp, td);
 		vfs_mount_destroy(mp);
+		*vpp = NULL;
 	}
 	return (error);
 }

Modified: stable/7/sys/cddl/compat/opensolaris/sys/vfs.h
==============================================================================
--- stable/7/sys/cddl/compat/opensolaris/sys/vfs.h	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/compat/opensolaris/sys/vfs.h	Wed Jan  6 16:09:58 2010	(r201651)
@@ -110,8 +110,8 @@ void vfs_setmntopt(vfs_t *vfsp, const ch
     int flags __unused);
 void vfs_clearmntopt(vfs_t *vfsp, const char *name);
 int vfs_optionisset(const vfs_t *vfsp, const char *opt, char **argp);
-int domount(kthread_t *td, vnode_t *vp, const char *fstype, char *fspath,
-    char *fspec, int fsflags);
+int mount_snapshot(kthread_t *td, vnode_t **vpp, const char *fstype,
+    char *fspath, char *fspec, int fsflags);
 
 typedef	uint64_t	vfs_feature_t;
 

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	Wed Jan  6 16:09:58 2010	(r201651)
@@ -231,8 +231,27 @@ typedef struct znode {
 /*
  * Convert between znode pointers and vnode pointers
  */
+#ifdef DEBUG
+static __inline vnode_t *
+ZTOV(znode_t *zp)
+{
+	vnode_t *vp = zp->z_vnode;
+
+	ASSERT(vp == NULL || vp->v_data == NULL || vp->v_data == zp);
+	return (vp);
+}
+static __inline znode_t *
+VTOZ(vnode_t *vp)
+{
+	znode_t *zp = (znode_t *)vp->v_data;
+
+	ASSERT(zp == NULL || zp->z_vnode == NULL || zp->z_vnode == vp);
+	return (zp);
+}
+#else
 #define	ZTOV(ZP)	((ZP)->z_vnode)
 #define	VTOZ(VP)	((znode_t *)(VP)->v_data)
+#endif
 
 /*
  * ZFS_ENTER() is called on entry to each ZFS vnode and vfs operation.

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -1841,7 +1841,7 @@ zfs_perm_init(znode_t *zp, znode_t *pare
 				fgid = zfs_fuid_create_cred(zfsvfs,
 				    ZFS_GROUP, tx, cr, fuidp);
 #ifdef __FreeBSD__
-				gid = parent->z_phys->zp_gid;
+				gid = fgid = parent->z_phys->zp_gid;
 #else
 				gid = crgetgid(cr);
 #endif

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -879,20 +879,20 @@ domount:
 	mountpoint = kmem_alloc(mountpoint_len, KM_SLEEP);
 	(void) snprintf(mountpoint, mountpoint_len, "%s/.zfs/snapshot/%s",
 	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
-	err = domount(curthread, *vpp, "zfs", mountpoint, snapname, 0);
+	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
 	kmem_free(mountpoint, mountpoint_len);
-	/* FreeBSD: This line was moved from below to avoid a lock recursion. */
-	if (err == 0)
-		vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY, curthread);
-	mutex_exit(&sdp->sd_lock);
-	/*
-	 * If we had an error, drop our hold on the vnode and
-	 * zfsctl_snapshot_inactive() will clean up.
-	 */
-	if (err) {
-		VN_RELE(*vpp);
-		*vpp = NULL;
+	if (err == 0) {
+		/*
+		 * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
+		 *
+		 * This is where we lie about our v_vfsp in order to
+		 * make .zfs/snapshot/<snapname> accessible over NFS
+		 * without requiring manual mounts of <snapname>.
+		 */
+		ASSERT(VTOZ(*vpp)->z_zfsvfs != zfsvfs);
+		VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
 	}
+	mutex_exit(&sdp->sd_lock);
 	ZFS_EXIT(zfsvfs);
 	return (err);
 }

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -97,6 +97,8 @@ static int zfs_root(vfs_t *vfsp, int fla
 static int zfs_statfs(vfs_t *vfsp, struct statfs *statp, kthread_t *td);
 static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp);
 static int zfs_sync(vfs_t *vfsp, int waitfor, kthread_t *td);
+static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp);
 static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp);
 static void zfs_objset_close(zfsvfs_t *zfsvfs);
 static void zfs_freevfs(vfs_t *vfsp);
@@ -108,6 +110,7 @@ static struct vfsops zfs_vfsops = {
 	.vfs_statfs =		zfs_statfs,
 	.vfs_vget =		zfs_vget,
 	.vfs_sync =		zfs_sync,
+	.vfs_checkexp =		zfs_checkexp,
 	.vfs_fhtovp =		zfs_fhtovp,
 };
 
@@ -955,6 +958,18 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea
 		zfsvfs->z_unmounted = B_TRUE;
 		rrw_exit(&zfsvfs->z_teardown_lock, FTAG);
 		rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+#ifdef __FreeBSD__
+		/*
+		 * Some znodes might not be fully reclaimed, wait for them.
+		 */
+		mutex_enter(&zfsvfs->z_znodes_lock);
+		while (list_head(&zfsvfs->z_all_znodes) != NULL) {
+			msleep(zfsvfs, &zfsvfs->z_znodes_lock, 0,
+			    "zteardown", 0);
+		}
+		mutex_exit(&zfsvfs->z_znodes_lock);
+#endif
 	}
 
 	/*
@@ -1114,6 +1129,20 @@ zfs_vget(vfs_t *vfsp, ino_t ino, int fla
 	znode_t		*zp;
 	int 		err;
 
+	/*
+	 * XXXPJD: zfs_zget() can't operate on virtual entires like .zfs/ or
+	 * .zfs/snapshot/ directories, so for now just return EOPNOTSUPP.
+	 * This will make NFS to fall back to using READDIR instead of
+	 * READDIRPLUS.
+	 * Also snapshots are stored in AVL tree, but based on their names,
+	 * not inode numbers, so it will be very inefficient to iterate
+	 * over all snapshots to find the right one.
+	 * Note that OpenSolaris READDIRPLUS implementation does LOOKUP on
+	 * d_name, and not VGET on d_fileno as we do.
+	 */
+	if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR)
+		return (EOPNOTSUPP);
+
 	ZFS_ENTER(zfsvfs);
 	err = zfs_zget(zfsvfs, ino, &zp);
 	if (err == 0 && zp->z_unlinked) {
@@ -1134,6 +1163,26 @@ CTASSERT(SHORT_FID_LEN <= sizeof(struct 
 CTASSERT(LONG_FID_LEN <= sizeof(struct fid));
 
 static int
+zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp,
+    struct ucred **credanonp)
+{
+	zfsvfs_t *zfsvfs = vfsp->vfs_data;
+
+	/*
+	 * If this is regular file system vfsp is the same as
+	 * zfsvfs->z_parent->z_vfs, but if it is snapshot,
+	 * zfsvfs->z_parent->z_vfs represents parent file system
+	 * which we have to use here, because only this file system
+	 * has mnt_export configured.
+	 */
+	vfsp = zfsvfs->z_parent->z_vfs;
+
+	return (vfs_stdcheckexp(zfsvfs->z_parent->z_vfs, nam, extflagsp,
+	    credanonp));
+}
+
+
+static int
 zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vnode_t **vpp)
 {
 	zfsvfs_t	*zfsvfs = vfsp->vfs_data;
@@ -1148,7 +1197,11 @@ zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, vno
 
 	ZFS_ENTER(zfsvfs);
 
-	if (fidp->fid_len == LONG_FID_LEN) {
+	/*
+	 * On FreeBSD we can get snapshot's mount point or its parent file
+	 * system mount point depending if snapshot is already mounted or not.
+	 */
+	if (zfsvfs->z_parent == zfsvfs && fidp->fid_len == LONG_FID_LEN) {
 		zfid_long_t	*zlfid = (zfid_long_t *)fidp;
 		uint64_t	objsetid = 0;
 		uint64_t	setgen = 0;

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -4340,11 +4340,20 @@ zfs_reclaim_complete(void *arg, int pend
 	znode_t	*zp = arg;
 	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 
-	ZFS_LOG(1, "zp=%p", zp);
-	ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
-	zfs_znode_dmu_fini(zp);
-	ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
+	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
+	if (zp->z_dbuf != NULL) {
+		ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
+		zfs_znode_dmu_fini(zp);
+		ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
+	}
 	zfs_znode_free(zp);
+	rw_exit(&zfsvfs->z_teardown_inactive_lock);
+	/*
+	 * If the file system is being unmounted, there is a process waiting
+	 * for us, wake it up.
+	 */
+	if (zfsvfs->z_unmounted)
+		wakeup_one(zfsvfs);
 }
 
 static int
@@ -4356,6 +4365,9 @@ zfs_freebsd_reclaim(ap)
 {
 	vnode_t	*vp = ap->a_vp;
 	znode_t	*zp = VTOZ(vp);
+	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+
+	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
 
 	ASSERT(zp != NULL);
 
@@ -4366,7 +4378,7 @@ zfs_freebsd_reclaim(ap)
 
 	mutex_enter(&zp->z_lock);
 	ASSERT(zp->z_phys != NULL);
-	ZTOV(zp) = NULL;
+	zp->z_vnode = NULL;
 	mutex_exit(&zp->z_lock);
 
 	if (zp->z_unlinked)
@@ -4374,7 +4386,6 @@ zfs_freebsd_reclaim(ap)
 	else if (zp->z_dbuf == NULL)
 		zfs_znode_free(zp);
 	else /* if (!zp->z_unlinked && zp->z_dbuf != NULL) */ {
-		zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 		int locked;
 
 		locked = MUTEX_HELD(ZFS_OBJ_MUTEX(zfsvfs, zp->z_id)) ? 2 :
@@ -4397,6 +4408,7 @@ zfs_freebsd_reclaim(ap)
 	vp->v_data = NULL;
 	ASSERT(vp->v_holdcnt >= 1);
 	VI_UNLOCK(vp);
+	rw_exit(&zfsvfs->z_teardown_inactive_lock);
 	return (0);
 }
 

Modified: stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
==============================================================================
--- stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Wed Jan  6 16:05:33 2010	(r201650)
+++ stable/7/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	Wed Jan  6 16:09:58 2010	(r201651)
@@ -110,7 +110,7 @@ znode_evict_error(dmu_buf_t *dbuf, void 
 		mutex_exit(&zp->z_lock);
 		zfs_znode_free(zp);
 	} else if (vp->v_count == 0) {
-		ZTOV(zp) = NULL;
+		zp->z_vnode = NULL;
 		vhold(vp);
 		mutex_exit(&zp->z_lock);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
@@ -896,9 +896,25 @@ again:
 		if (zp->z_unlinked) {
 			err = ENOENT;
 		} else {
-			if (ZTOV(zp) != NULL)
-				VN_HOLD(ZTOV(zp));
+			int dying = 0;
+
+			vp = ZTOV(zp);
+			if (vp == NULL)
+				dying = 1;
 			else {
+				VN_HOLD(vp);
+				if ((vp->v_iflag & VI_DOOMED) != 0) {
+					dying = 1;
+					/*
+					 * Don't VN_RELE() vnode here, because
+					 * it can call vn_lock() which creates
+					 * LOR between vnode lock and znode
+					 * lock. We will VN_RELE() the vnode
+					 * after droping znode lock.
+					 */
+				}
+			}
+			if (dying) {
 				if (first) {
 					ZFS_LOG(1, "dying znode detected (zp=%p)", zp);
 					first = 0;
@@ -910,6 +926,8 @@ again:
 				dmu_buf_rele(db, NULL);
 				mutex_exit(&zp->z_lock);
 				ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
+				if (vp != NULL)
+					VN_RELE(vp);
 				tsleep(zp, 0, "zcollide", 1);
 				goto again;
 			}
@@ -1531,7 +1549,7 @@ zfs_create_fs(objset_t *os, cred_t *cr, 
 	ZTOV(rootzp)->v_data = NULL;
 	ZTOV(rootzp)->v_count = 0;
 	ZTOV(rootzp)->v_holdcnt = 0;
-	ZTOV(rootzp) = NULL;
+	rootzp->z_vnode = NULL;
 	VOP_UNLOCK(vp, 0, curthread);
 	vdestroy(vp);
 	dmu_buf_rele(rootzp->z_dbuf, NULL);
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"

Comment 22 Pawel Jakub Dawidek freebsd_committer

2014-06-01 07:01:11 UTC

State Changed
From-To: feedback->patched

Fixed in HEAD. Thanks for the report!

Comment 23 Pawel Jakub Dawidek freebsd_committer

2014-06-01 07:01:11 UTC

Responsible Changed
From-To: freebsd-fs->pjd

I'll take this one.

Comment 24 Pawel Jakub Dawidek freebsd_committer

2014-06-01 07:01:11 UTC

State Changed
From-To: patched->closed

Fix merged to stable/8.