204233 – 'Livelocked' Geom mirror when one PATA provider experienced write timeouts

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	10.2-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-geom (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-11-02 23:12 UTC by Kevin Thompson
Modified:	2015-11-03 03:48 UTC
CC List:	5 users

See Also:

Description Kevin Thompson 2015-11-02 23:12:54 UTC

I had a geom mirror built from two older 160 GB Western Digital PATA HDDs, ada0 and ada1, exposed as /dev/mirror/gm0.

One of the drives failed in such a way that it responded to writes only with a timeout:


> Oct 30 12:34:20 angst (ada0:ata0:0:0:0): WRITE_DMA48. ACB: 35 00 0f a3 64 40 11 00 00 00 20 00
> Oct 30 12:34:20 angst (ada0:ata0:0:0:0): CAM status: Command timeout
> Oct 30 12:34:20 angst (ada0:ata0:0:0:0): Retrying command
> Oct 30 12:34:20 angst (ada0:ata0:0:0:0): WRITE_DMA48. ACB: 35 00 e1 07 6d 40 12 00 00 00 20 00
> Oct 30 12:38:21 angst (ada0:ata0:0:0:0): CAM status: Command timeout
> Oct 30 12:38:21 angst (ada0:ata0:0:0:0): Retrying command

(The drive was old, so I suspect hardware failure.)

Unfortunately, this slowed most of the OS to a crawl for many hours. gmirror was blocking all I/O to the gm0 provider while it waited for the writes to time out, which took many minutes. The system was not completely dead, though: queued block read requests seemed to get through whenever they could slip in between the blocking writes, once a write timeout had elapsed.

Eventually, I was able to log into the system and manually remove the dead disk from the mirror, after which point the system came back to life.

Why didn't gmirror drop the disk automatically?

Comment 1 Enji Cooper freebsd_committer 2015-11-03 00:57:27 UTC

The modifications we have at $work to gmirror may or may not help this case.

Comment 2 amvandemore 2015-11-03 03:48:26 UTC

ada(4) has some options that may help, as does the kern.geom.mirror sysctl tree.
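
As a rough illustration of why the ada(4) timeout and retry settings matter for this report, the sketch below estimates how long a single write can block before the driver gives up. It rests on several assumptions: the comment does not name the tunables, so kern.cam.ada.default_timeout and kern.cam.ada.retry_count are assumed to be the ones meant; each retry is assumed to run for the full timeout; and 30 seconds with 4 retries are assumed defaults that should be checked with `sysctl kern.cam.ada` on the affected system. The function name is hypothetical.

```python
def worst_case_stall_seconds(timeout_s: int, retry_count: int) -> int:
    """Upper bound on how long one command can block.

    Assumes the first attempt and every retry each run to the full
    timeout before the error is finally reported upward.
    """
    if timeout_s < 0 or retry_count < 0:
        raise ValueError("timeout and retry count must be non-negative")
    return timeout_s * (retry_count + 1)


# Assumed defaults for kern.cam.ada.default_timeout and
# kern.cam.ada.retry_count; verify them on the real system.
ASSUMED_DEFAULT_TIMEOUT_S = 30
ASSUMED_DEFAULT_RETRY_COUNT = 4

if __name__ == "__main__":
    # Compare the assumed defaults with a hypothetical, more aggressive setting.
    for timeout_s, retries in [
        (ASSUMED_DEFAULT_TIMEOUT_S, ASSUMED_DEFAULT_RETRY_COUNT),
        (10, 1),
    ]:
        stall = worst_case_stall_seconds(timeout_s, retries)
        print(f"timeout={timeout_s}s retries={retries}: up to {stall}s per command")
```

Under these assumptions, lowering either value shortens the time each failing write can hold up the mirror, at the cost of giving a slow but healthy drive less time to respond.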