Bug 206448 - ZFS hang/stall when drives in ATA mode
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
Reported: 2016-01-20 21:37 UTC by Danny McGrath
Modified: 2016-08-08 08:14 UTC

Attachments
Screenshot of ata console error (4.84 KB, image/png)
2016-01-20 21:37 UTC, Danny McGrath
Console errors relating to swap and device changing (8.76 KB, image/png)
2016-01-20 21:37 UTC, Danny McGrath

Description Danny McGrath 2016-01-20 21:37:01 UTC
Created attachment 165888
Screenshot of ata console error

I had a Dell PowerEdge R210 amd64 system that was exhibiting some odd behaviour. A year or two ago one of the system's two 1TB SATA drives dropped out of the RAID, but surprisingly, I simply added it back and it has been fine ever since. Then this week I installed py27-salt on the servers.

After installing salt, everything seemed fine for the first day. After the daily mails for the machine came in, however, I noticed that the daily periodic run had got stuck running some smartd checks for the log. I tried to kill the process but was unable to, which prompted a reboot. After the reboot, some jails refused to start and I suddenly found myself unable to do any writes to the drive, with only the message "ata2: already connected!" showing up on the console.

After some digging (thanks to auditd, salt and the system logs), I was able to narrow the trigger down to some `camcontrol inquiry` and `camcontrol identify` commands that would reliably reproduce the problem.

After some more digging I noticed that only this server (out of several identical or near-identical ones) was showing the problem, and that for some strange reason /dev/gpt/swap0 (and swap1) entries existed only on this system. Also odd: the second I tried to stop the swap mirror as a test (`gmirror stop swap`), the system redetected it under different device names (see the console screenshot in the attachments). I also noticed that the dmesg of this system, and only this system, was showing some odd "unmapped" messages:

  GEOM_MIRROR: cancelling unmapped because of ada0p2
  GEOM_MIRROR: cancelling unmapped because of ada1p2
  GEOM_MIRROR: Device mirror/swap launched (2/2).
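
To check other machines for the same messages, something like the following Python sketch could be used. It assumes the dmesg text has already been captured into a string; the function name `find_unmapped_cancellations` is hypothetical, and the pattern is based only on the three lines quoted above, so other message variants may exist.

```python
import re

# Pattern derived solely from the message format quoted in this report.
_UNMAPPED = re.compile(r"GEOM_MIRROR: cancelling unmapped because of (\S+)")


def find_unmapped_cancellations(dmesg_text):
    """Return the providers named in 'cancelling unmapped' messages, in order."""
    return _UNMAPPED.findall(dmesg_text)


# The three dmesg lines from this report, used as sample input.
sample = """\
  GEOM_MIRROR: cancelling unmapped because of ada0p2
  GEOM_MIRROR: cancelling unmapped because of ada1p2
  GEOM_MIRROR: Device mirror/swap launched (2/2).
"""

print(find_unmapped_cancellations(sample))  # ['ada0p2', 'ada1p2']
```

An empty list would mean the messages did not appear in the captured text, which matches what I saw on the unaffected servers.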

As for the ZFS symptoms: once the console showed the "already attached!" error, ZFS (this was a ZFS install with the mirrored swap option enabled) would no longer allow writes (or would allow them only very slowly, around 1 IOPS), and reads would eventually fail (when testing with `find /`), which I assume happens once they run out of cached entries.

In the end I stumbled on the BIOS setting: the drives were set to ATA mode instead of AHCI or RAID, and correcting this setting seems to have solved the problem. While I can't know for sure whether this is a "bug" or just a known limitation of ATA, it almost seems as if camcontrol was briefly disconnecting the drives when issuing commands, which in turn caused the swap device to switch from ada0p2 to gpt/swap0 and vice versa, possibly triggering some sort of bug in ZFS.
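
One way to spot the same BIOS misconfiguration without rebooting might be to look at which driver each disk is attached through. The Python sketch below parses text in the layout I would expect from `camcontrol devlist -v`, where each bus header reads `scbusN on <driver> bus N:`; that layout is an assumption, as are the function names, and the sample text is made up for illustration rather than taken from the affected machine. Legacy ATA channels are assumed to appear as `ataN` (as in the "ata2" console message) and AHCI channels as `ahcichN`.

```python
import re

# Assumed layout of `camcontrol devlist -v` bus headers and device lines.
_BUS = re.compile(r"^scbus\d+ on (\S+) bus \d+")
_DEVS = re.compile(r"\(([^)]*)\)\s*$")


def disks_by_controller(devlist_text):
    """Map each adaN disk to the controller channel its bus is attached to."""
    mapping = {}
    controller = None
    for line in devlist_text.splitlines():
        bus = _BUS.match(line.strip())
        if bus:
            controller = bus.group(1)
            continue
        devs = _DEVS.search(line)
        if devs and controller:
            for name in devs.group(1).split(","):
                if name.startswith("ada"):
                    mapping[name] = controller
    return mapping


def on_legacy_ata(controller):
    """True if the channel name looks like a legacy ATA channel (ataN)."""
    return re.fullmatch(r"ata\d+", controller) is not None


# Hypothetical sample, not captured from the machine in this report.
sample_devlist = """\
scbus0 on ata2 bus 0:
<EXAMPLE DISK A 1.0>   at scbus0 target 0 lun 0 (ada0,pass0)
<EXAMPLE DISK B 1.0>   at scbus0 target 1 lun 0 (ada1,pass1)
<>                     at scbus0 target -1 lun ffffffff ()
"""

for disk, ctrl in sorted(disks_by_controller(sample_devlist).items()):
    print(disk, ctrl, "legacy ATA" if on_legacy_ata(ctrl) else "not legacy ATA")
```

Any disk reported on an `ataN` channel would be a candidate for checking the BIOS SATA mode.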

Anyway, this is the report; hopefully it helps fix a possible bug lurking in the system that could cause problems for other users.

Cheers o/
Comment 1 Danny McGrath 2016-01-20 21:37:59 UTC
Created attachment 165889
Console errors relating to swap and device changing
Comment 2 Richard Palusaar 2016-04-22 00:29:52 UTC
I have a similar issue on a PowerEdge 400SC.

For me the issue is on 10.3-RELEASE i386 and gives me:

  GEOM_MIRROR: cancelling unmapped because of ada0p2
  GEOM_MIRROR: cancelling unmapped because of ada1p2
  GEOM_MIRROR: Device mirror/swap launched (2/2)

This results in a kernel panic and boot failure.

For me the resolution is either to disable swap in fstab or to boot with ACPI off.

If I boot with swap disabled, then once booted I can re-enable swap and all works fine until the next boot.

Just my 5 cents on the topic.