Hi! This is a production machine, one of two identical redundant storage systems. After upgrading one to FreeBSD 10.0-RELEASE-p3, it crashes within a few hours in production. Running bonnie locally we manage to reproduce the crash within minutes.

System has 32 GB RAM.

zpool list -v
NAME        SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
tank        21,8T  7,71T  14,0T  35%  1.00x  ONLINE  -
  raidz2    21,8T  7,71T  14,0T  -
    da0     -      -      -      -
    da1     -      -      -      -
    da2     -      -      -      -
    da3     -      -      -      -
    da4     -      -      -      -
    da5     -      -      -      -
  da6p1    39,8G   4,44M  39,7G  -
cache      -       -      -      -      -      -
  da6p2    19,6G   248M   19,4G  -

zpool version 28
zfs version 5

panic: solaris assert: sm->sm_space + size <= sm->sm_size, file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 121

(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff808af860 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:447
#2  0xffffffff808afc24 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:754
#3  0xffffffff81a31202 in assfail (a=<value optimized out>, f=<value optimized out>, l=<value optimized out>) at /usr/src/sys/modules/opensolaris/../../cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
#4  0xffffffff8185fc24 in space_map_add (sm=0xfffff80016a24600, start=1029308416, size=131072) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c:121
#5  0xffffffff81860429 in space_map_load (sm=0xfffff80016a24600, ops=0xffffffff8191ad58, maptype=<value optimized out>, smo=0xfffff800169b2b20, os=0xfffff80016475c00) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c:391
#6  0xffffffff8184aee0 in metaslab_activate (msp=0xfffff800169b2b00, activation_weight=9223372036854775808) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:978
#7  0xffffffff8184a209 in metaslab_alloc (spa=0xfffff80016285000, mc=0xfffff800161a0900, psize=131072, bp=0xfffffe01f5c12008, ndvas=1, txg=9442048, hintbp=0xfffff8001681e000, flags=<value optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1520
#8  0xffffffff81882f23 in zio_alloc_zil (spa=0xfffff80016285000, txg=9442048, new_bp=0xfffffe01f5c12008, old_bp=0xfffff804f6ff5e48, size=131072, use_slog=<value optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2448
#9  0xffffffff81880b71 in zil_lwb_write_start (zilog=0xfffff800168a1c00, lwb=0xfffff804f6ff5e40) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:987
#10 0xffffffff8187f488 in zil_commit (zilog=0xfffff800168a1c00, foid=<value optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1060
#11 0xffffffff818a4a78 in zfs_sync (vfsp=<value optimized out>, waitfor=<value optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:161
#12 0xffffffff8094e2e0 in sync_fsync (ap=<value optimized out>) at /usr/src/sys/kern/vfs_subr.c:3666
#13 0xffffffff80d9d458 in VOP_FSYNC_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:1330
#14 0xffffffff8094eb6a in sched_sync () at vnode_if.h:549
#15 0xffffffff80881cba in fork_exit (callout=0xffffffff8094e7a0 <sched_sync>, arg=0x0, frame=0xfffffe085c3e2ac0) at /usr/src/sys/kern/kern_fork.c:995
#16 0xffffffff80c79f0e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#17 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
I just tried BETA3, and it panics within minutes, just running bonnie++ -d /tank/zfs/directory. The file that bonnie managed to create was just one (1) percent of the size of the files created with 10.0p3, so BETA3 is even worse in this respect?

# kgdb kernel.debug /var/crash/vmcore.1
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: solaris assert: range_tree_space(rt) + size <= sm->sm_size (0x10020000 <= 0x10000000), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 121
cpuid = 8
KDB: stack backtrace:
#0 0xffffffff80946fa0 at kdb_backtrace+0x60
#1 0xffffffff8090c0c5 at panic+0x155
#2 0xffffffff81b9b22f at assfail3+0x2f
#3 0xffffffff819a5665 at space_map_load+0x355
#4 0xffffffff8198bb0e at metaslab_load+0x2e
#5 0xffffffff8198e615 at metaslab_preload+0x65
#6 0xffffffff8193f260 at taskq_run+0x10
#7 0xffffffff80955415 at taskqueue_run_locked+0xe5
#8 0xffffffff80955ea8 at taskqueue_thread_loop+0xa8
#9 0xffffffff808dcb0a at fork_exit+0x9a
#10 0xffffffff80cf295e at fork_trampoline+0xe
Uptime: 1h31m29s
Dumping 1806 out of 32665 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko.symbols...done.
Loaded symbols for /boot/kernel/zfs.ko.symbols
Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
Loaded symbols for /boot/kernel/opensolaris.ko.symbols
Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219             __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff8090bd42 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:452
#2  0xffffffff8090c104 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff81b9b22f in assfail3 (a=<value optimized out>, lv=<value optimized out>, op=<value optimized out>, rv=<value optimized out>, f=<value optimized out>, l=<value optimized out>) at /usr/src/sys/modules/opensolaris/../../cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:91
#4  0xffffffff819a5665 in space_map_load (sm=0xfffff800176f9480, rt=0xfffff8001747d000, maptype=<value optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c:120
#5  0xffffffff8198bb0e in metaslab_load (msp=0xfffff800233c2000) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1295
#6  0xffffffff8198e615 in metaslab_preload (arg=0xfffff800233c2000) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1652
#7  0xffffffff8193f260 in taskq_run (arg=0xfffff8006ecb60f0, pending=0) at /usr/src/sys/modules/zfs/../../cddl/compat/opensolaris/kern/opensolaris_taskq.c:109
#8  0xffffffff80955415 in taskqueue_run_locked (queue=0xfffff80017822400) at /usr/src/sys/kern/subr_taskqueue.c:342
#9  0xffffffff80955ea8 in taskqueue_thread_loop (arg=<value optimized out>) at /usr/src/sys/kern/subr_taskqueue.c:563
#10 0xffffffff808dcb0a in fork_exit (callout=0xffffffff80955e00 <taskqueue_thread_loop>, arg=0xfffff80017fd9090, frame=0xfffffe085c5dfac0) at /usr/src/sys/kern/kern_fork.c:996
#11 0xffffffff80cf295e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#12 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
(kgdb)
(In reply to Palle Girgensohn from comment #1) > I just tried BETA3, That is 10.1-BETA3, of course...
Is this reproducible on a newly created pool? It looks like you are using a pool created with an older on-disk format that was never upgraded (DO NOT upgrade it now!), and there may be existing damage to the space map -- in that case the only way to recover would be to copy all data off the pool, recreate it, and restore the data.
(In reply to Xin LI from comment #3)
> Is this reproducible on a newly created pool? It looks like you are using a
> pool created with an older on-disk format that was never upgraded (DO NOT
> upgrade it now!), and there may be existing damage to the space map -- in
> that case the only way to recover would be to copy all data off the pool,
> recreate it, and restore the data.

Hi Xin Li, thanks for the reply!

I did not try a newly created pool. It is a large pool with data, one of two redundant systems where we use zfs send | ssh | zfs recv to keep them in sync. The other machine is still on 9.3, and we got this problem after updating one system to 10.0. So we cannot really upgrade the pool just yet. Besides, simply running an older pool version shouldn't cause such a big problem...?

But as you say, there seems to be something fishy with the pool, and maybe there is nothing wrong with the kernel itself.

Are you sure there is no other way to fix this than to recreate the pool? There are just terabytes of data; it will take a week... :-/

Is there no zdb magic, or zpool export + scrub + zpool import, possibly with vfs.zfs.recover=1, that could help?
(In reply to Palle Girgensohn from comment #4)
> I did not try a newly created pool. It is a large pool with data, one of two
> redundant systems where we use zfs send | ssh | zfs recv to keep them in
> sync. The other machine is still on 9.3, and we got this problem after
> updating one system to 10.0. So we cannot really upgrade the pool just yet.
> Besides, simply running an older pool version shouldn't cause such a big
> problem...?

Since you haven't upgraded the pool, would it be possible for you to downgrade your kernel/world and see if you can still reproduce the problem?

> But as you say, there seems to be something fishy with the pool, and maybe
> there is nothing wrong with the kernel itself.
>
> Are you sure there is no other way to fix this than to recreate the pool?
> There are just terabytes of data; it will take a week... :-/

Yeah, I know :(

> Is there no zdb magic, or zpool export + scrub + zpool import, possibly with
> vfs.zfs.recover=1, that could help?

Unfortunately, space map corruption is currently fatal. In the future we could perhaps implement a tool that traverses all data and rebuilds the space map from it, but that is not trivial...
Closing due to timeout. Please reopen if the problem recurs or if you have additional information.