Bug 223808 - zpool attach (and other commands?) fail with misleading error if a wiped disk had previously been used in a RAID array
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
Reported: 2017-11-23 02:30 UTC by Stilez
Modified: 2017-11-24 17:32 UTC (History)
Description Stilez 2017-11-23 02:30:33 UTC
I'm not sure that this is ZFS-specific; it might be more related to some other kernel device handling, or to error message improvement. The issue is consistently reproducible with ZFS, but I'm not sure what else might cause the same error, or at what level the error is raised. Probably the kernel?

SITUATION:

Suppose a user has a mix of new and used HDD "spares". Needing a new HDD in their pool, they wipe the disk and its MBR/GPT as usual, connect it to their server, and use a standard command such as 'zpool attach <pool> <existing pool device> <this HDD>'. The wiping might have been done on Windows using DISKPART -> CLEAN, or using gpart, or any similar tool.

Suppose also that before being wiped, the disk had been used at some time in the past with a hardware or "soft" RAID controller. (This issue would probably happen with LSI controllers and most other firmware/hardware RAID, as they all store metadata on the HDD, but it clearly does happen with the Intel RST "soft" or fake RAID.)

Suppose finally that the disk is correctly sized and has had its MBR/GPT and surface wiped: there's nothing wrong with the disk, system or command.

EXPECTED RESULT:

Most users will expect the command to work and not think twice. 

The user expects that the disk will be quickly recognised by the system and given an identifier ("da5") when connected, and that the commands "zpool attach tank da1 da5" and "zpool attach tank /dev/da1 /dev/da5" will both work (assuming da1 is a suitable existing disk for the command, and provided that da5 works and was wiped before use).

ACTUAL RESULT:

The command, in all its permutations, fails with the obscure error "no such device or dataset". No other debug info is provided to the user, and none of the usual reasons for this error provide any help in troubleshooting.

It is also left ambiguous to the user which of the three items (1 x pool and 2 x devices) is the one with the problem.

DISCUSSION:

The issue is that when such a disk is plugged in, it's recognised on the basis of the old RAID metadata. The usual disk wipe processes do not always remove that metadata, even if they remove the MBR/GPT and perform a surface wipe of all user data, so the metadata written by the previous RAID controller is still there afterwards.

FreeBSD, in identifying the disk, treats it both as a single device (da5) and as a degraded RAID array (raid/r0p1). Therefore zpool attach fails, but it does so in a way, and with an error message, that makes the true cause very obscure.
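
For anyone trying to confirm this on their own system, a rough sketch of a check is below. It assumes the array is handled by GEOM RAID, so that it shows up in graid(8), and that "graid status -s" prints one line per component in the form "<array> <status> <component> ...". The function name raid_owner is made up for illustration; it is not part of any FreeBSD tool.

```sh
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): print the name of any GEOM RAID
# array that lists the given disk as a component.
# Assumption: stdin is in the script-friendly `graid status -s` form, i.e.
# one line per component: "<array>  <status>  <component> ...".
raid_owner() {
    disk="$1"
    # Field 3 is the component (disk) name, field 1 the array name.
    awk -v d="$disk" '$3 == d { print $1 }'
}

# Typical use on a live system (prints nothing if no array claims the disk):
#   graid status -s | raid_owner da5
```

If this prints an array name for the newly connected disk, leftover RAID metadata is the likely culprit, and graid(8) is the place to look for how to erase it.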

The problem is that users tend to assume wiping means wiping. But wiping doesn't always remove RAID metadata, and this can leave a disk that acts in strange ways, with expected commands failing obscurely.

A second problem is that zpool attach (and perhaps other commands) does not make clear *which* device the error refers to.

Does "no such device or dataset" in response to "zpool attach tank da0 da1" mean the pool (and similarly named dataset) "tank" doesn't exist? Or da0 doesn't exist? Or da1 doesn't exist? Except - other commands show they all do exist. This avoidably confuses the user and would be worth correcting.

ENHANCEMENT:

The error should probably be more specific in all cases: "no such device or dataset: <name>" so the user knows exactly which item is alleged not to exist. If it doesn't mean literally that it doesn't exist, then the wording should convey that it does not exist *or cannot be used*.

If a disk is connected that has RAID metadata but is not part of a known array, the error should probably be more specific: "cannot attach: da5 is part of a RAID array", and not that it doesn't exist.
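
To make the proposal concrete, here is a small shell mock-up of the wording only. It is not zpool's actual code and says nothing about how zpool would detect the RAID case; the function name attach_error and its second argument are invented for illustration.

```sh
#!/bin/sh
# Hypothetical mock-up of the proposed messages; not taken from zpool.
# $1: the pool or device name that failed the lookup
# $2: non-empty if the device is known to be claimed by a RAID array
attach_error() {
    name="$1"
    claimed_by_raid="$2"
    if [ -n "$claimed_by_raid" ]; then
        # Device exists but cannot be used: say why.
        printf 'cannot attach: %s is part of a RAID array\n' "$name"
    else
        # Name the item that is alleged not to exist.
        printf 'no such device or dataset: %s\n' "$name"
    fi
}
```

For example, "attach_error da5 yes" prints the RAID-specific message, while "attach_error tank" names the pool as the missing item.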
Comment 1 Stilez 2017-11-24 14:49:26 UTC
As this could affect other situations, perhaps it's worth adding a console message during connection/recognition when a disk being connected contains RAID metadata. That would be helpful in other cases too, and the console connection messages are a logical place to expose it.
Comment 2 Andriy Gapon freebsd_committer 2017-11-24 17:32:50 UTC
Doesn't g_raid print a message when it claims a disk?