I've got a system that panicked on a failed assertion (0 == zap_increment_int) in dmu_objset_do_userquota_updates, and now the same panic occurs whenever that filesystem is mounted:

panic: solaris assert: 0 == zap_increment_int(os, (-2ULL), user, delta, tx) (0x0 == 0x7a), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c, line: 1203
cpuid = 2
KDB: stack backtrace
#0 0xffffffff80xxxxxx at kdb_backtrace+0x67
#1 0xffffffff80xxxxxx at panic+0x182
#2 0xffffffff80xxxxxx at do_userquota_update+0xad
#3 0xffffffff80xxxxxx at assfail3+0x2c
#4 0xffffffff80xxxxxx at dmu_objset_do_userquota_updates+0x111f
#5 0xffffffff80xxxxxx at dsl_pool_sync+0x18f
#6 0xffffffff80xxxxxx at spa_sync+0x7ce
#7 0xffffffff80xxxxxx at txg_sync_thread+0x389
#8 0xffffffff80xxxxxx at fork_exit+0x85
#9 0xffffffff80xxxxxx at fork_trampoline+0xe

Booting from a USB live CD and importing the pool also triggers the same crash, although if the pool is imported unmounted the crash does not occur. Only one filesystem causes the panic when mounted. The stack trace is the same as the one mentioned here: https://lists.freebsd.org/pipermail/freebsd-stable/2012-July/068938.html

The system is a dual-socket machine; as per that thread I have tried removing one of the CPUs, but it hasn't helped. ECC memory, no issues reported. If the affected filesystem is destroyed, the system will boot, but after a few days the issue appears to reoccur with another filesystem. The pool has also been destroyed and recreated, with files migrated via zfs send/recv. The pool scrubs fine without any errors.

Mainboard: X8DTi-F
CPU: Intel X5680
RAM: 96GB ECC
Just a note that 0x7a is 122, which is ECKSUM.
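For anyone else chasing the value down, a minimal sketch of the decode (the ECKSUM mapping is taken from the comment above, not re-verified against the headers here):

```python
# The assert message prints both sides of the failed comparison in hex:
#   0 == zap_increment_int(...) (0x0 == 0x7a)
# The right-hand side is the errno returned by zap_increment_int().
returned_errno = 0x7a
print(returned_errno)  # -> 122, identified above as ECKSUM (a ZFS-internal errno)
```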
(In reply to Andriy Gapon from comment #1)
Interesting - I tried to work out what that errno was, but came up blank. With that in mind I imported the zpool unmounted and ran a scrub (in hindsight, my "scrub is clean" comment was from before the issue appeared, and after resolving it by nuking the affected filesystem and restoring from backup). Bingo! CKSUM errors!

  pool: system-ssd
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h12m with 1 errors on Thu Jun 22 10:02:50 2017
config:

        NAME                 STATE     READ WRITE CKSUM
        system-ssd           ONLINE       0     0     1
          mirror-0           ONLINE       0     0     4
            gpt/system-ssd1  ONLINE       0     0     4
            gpt/system-ssd2  ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        system-ssd/storage/ezjail/jailname:<0xfffffffffffffffe>

I'm on an LSI SAS2008 card flashed to IT mode (unsure of the exact firmware version; "15" comes to mind). The fact that it affected both disks is rather strange. I have another zpool with HDDs on the same SAS card which hasn't had any issues (and is considerably larger), so TRIM might be a factor? I don't know enough about it, but if it is TRIM, it seems strange that it would have killed both disks at once. How similar is the physical layout of blocks in a mirrored pool between the disks? I think I might try moving the SSDs to the mainboard SATA controller (sadly only SATA 3Gb/s). I've used those disks with TRIM before in another system without issues, but not through a SAS2008 card.
Thinking about it, though, it's kind of strange that of all the blocks TRIM could have hit, it's those, and on multiple occasions... missing every other block I care about. The filesystem isn't exactly empty (185GB used of 464GB), and there's a decent bit of activity from a PostgreSQL database running on it, among other things. The ports are swapped over to the mainboard SATA ports for now; I'll keep an eye on it. Definitely interested if anyone has any other ideas on what might be going on, though. Is there a specific firmware version for SAS2008 IT mode that should be used, for example, that might have a bearing on this?
(In reply to neovortex from comment #2)
Some thoughts/comments. The block layout is exactly the same between mirror members. It's quite unlikely that both drives would give errors for the same blocks at the same time. The problem could potentially be caused by the checksum being calculated incorrectly when the data is written, but such a bug is also not very likely (though not impossible). Perhaps running "zdb -dddddd system-ssd/storage/ezjail/jailname -2" would give some additional clues...
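As a side note on reading that error report: the object shown in the zpool status output, <0xfffffffffffffffe>, is just (uint64_t)-2, i.e. the same -2ULL object that zap_increment_int() was being called on in the panic (one of ZFS's internal space-accounting objects, hence no filename). A quick sketch of the decode:

```python
# Interpret the 64-bit object number printed by "zpool status -v"
# as a signed value to see that it is -2, matching the (-2ULL)
# argument in the panicking zap_increment_int() call.
obj = 0xFFFFFFFFFFFFFFFE
signed = obj - (1 << 64) if obj >= (1 << 63) else obj
print(signed)  # -> -2
```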
(In reply to Andriy Gapon from comment #4)
Sadly I've already nuked the filesystem and restored it from backup. If it happens again (it's already happened twice this week, so if the SAS2008 controller is a red herring I'd expect to see it again pretty soon) I'll definitely run this and save the output. The filesystems that got corrupted (different jails in each instance) had no real load on them. If the checksum was being calculated incorrectly (a bit flip in RAM that ECC couldn't detect, or a CPU error, although I actually swapped the CPU over when I dropped from two sockets to one, so pretty much only the mainboard and RAM are left at this point) I'd have expected to see this occur more often on actual data rather than on metadata. There are also other pools that have a _LOT_ more writes than the SSD pool, so I'd have expected it to appear there too, but it scrubs clean (all ~9.5TB worth). Mirrored disks having an identical layout would, I guess, match if it was TRIM though.

As an aside, I found the SAS2008 firmware version in case it's relevant:

mps0: Firmware: 15.00.00.00, Driver: 21.01.00.00-fbsd
Since moving the SSDs from the SAS2008 controller to the mainboard SATA controller, this issue hasn't recurred. Considering how frequently it was recurring before, I'd say that's done the trick. I guess this issue really has two parts. The first is what's causing corruption with SSDs on the SAS2008 controller: my guess is something TRIM-related, i.e. a bug in FreeBSD, a bug in the SAS2008 firmware, or a bug in the SSD firmware that only gets triggered on a SAS2008 controller but not on other controllers. For completeness, the SSDs are the following:

=== START OF INFORMATION SECTION ===
Model Family:     Marvell based SanDisk SSDs
Device Model:     SanDisk SDSSDHII480G
Serial Number:    xxx
LU WWN Device Id: xxx
Firmware Version: X31200RL
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jul  5 23:56:36 2017 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The second issue is whether a filesystem with corrupt metadata could be handled more gracefully, e.g. zfs mount returning an error rather than causing a panic. I'm not sure how practical this is, but having on-disk corruption cause a panic was an unusual experience compared to the behaviour of other filesystems.
Two hints, perhaps:

Try upgrading your SAS2008 firmware to the same version as the driver (20.00.07.00 IT)? Yours is pretty old compared to the driver.

The following sysctl should also give you more info from the controller if the bug ever occurs again:

dev.mps.0.debug_level=0x1B
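If it helps, a sketch of persisting that debug level across reboots via /etc/sysctl.conf (the unit number 0 is an assumption; adjust dev.mps.N to match the controller):

```
# /etc/sysctl.conf (hypothetical entry; mps unit number assumed to be 0)
dev.mps.0.debug_level=0x1B
```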
One last thing: I've read that it's better to avoid mixing SATA and SAS disks, though I don't remember whether that was in the same pool or on the same controller... Your SSDs are SATA; I'm not sure, however, about the other disks plugged into this SAS2008.
(In reply to Ben RUBSON from comment #8)
They are separate pools, and all the disks are SATA. The SATA HDD pool is still on the SAS2008 card, and has never had a checksum error throughout all of this. I did notice that about the controller firmware, although finding newer versions of the IT-mode firmware is proving to be a tad difficult.