Bug 240145

Summary: [smartpqi][zfs] kernel panic with hanging vdev
Product:   Base System
Component: kern
Version:   12.0-RELEASE
Hardware:  amd64
OS:        Any
Status:    New
Severity:  Affects Only Me
Priority:  ---
Keywords:  panic
Reporter:  rainer
Assignee:  freebsd-scsi mailing list <scsi>
CC:        pen

Description rainer 2019-08-27 12:45:43 UTC
Hi,

I get kernel panics like this one:

2019-08-27T09:51:47+02:00 server-log03-prod kernel: <118>[51] 2019-08-27T09:51:47+02:00 server-log03-prod 1 2019-08-27T09:51:47.264114+02:00 server-log03-prod savecore 75563 - - reboot after panic: I/O to pool 'datapool' appears to be hung on vdev guid 3442909230652761189 at '/dev/da0'.

dmesg shows:

[167] [ERROR]::[17:655.0][0,84,0][CPU 7][pqi_map_request][540]:bus_dmamap_load_ccb failed = 36 count = 131072
[167] [WARN]:[17:655.0][CPU 7][pqisrc_io_start][794]:In Progress on 84
[167] Assertion failed at file /usr/src/sys/dev/smartpqi/smartpqi_response.c line 203


before it crashes.
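
If the "failed = 36" in the first line is an errno value (just a guess on my part), it decodes like this on FreeBSD:

# grep -w 36 /usr/include/sys/errno.h
#define EINPROGRESS     36              /* Operation now in progress */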

A scrub is running, and I assume the panic is triggered by that.

The hardware is an HP DL380 Gen10 with 2*8-disk RAIDZ2, booting from a separate controller.


zpool status
  pool: datapool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub in progress since Tue Aug 27 03:49:26 2019
	596G scanned at 832M/s, 429M issued at 599K/s, 14.6T total
	0 repaired, 0.00% done, no estimated completion time
config:

	NAME        STATE     READ WRITE CKSUM
	datapool    ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    da11    ONLINE       0     0     0
	    da10    ONLINE       0     0     0
	    da9     ONLINE       0     0     0
	    da8     ONLINE       0     0     0
	    da12    ONLINE       0     0     0
	    da13    ONLINE       0     0     0
	    da14    ONLINE       0     0     0
	    da15    ONLINE       0     0     0
	  raidz2-2  ONLINE       0     0     0
	    da16    ONLINE       0     0     0
	    da17    ONLINE       0     0     0
	    da18    ONLINE       0     0     0
	    da19    ONLINE       0     0     0
	    da20    ONLINE       0     0     0
	    da21    ONLINE       0     0     0
	    da22    ONLINE       0     0     0
	    da23    ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da24p4  ONLINE       0     0     0
	    da25p4  ONLINE       0     0     0

errors: No known data errors

datapool is on an HPE E208i-p SR Gen10 1.98
zroot is on an HPE P408i-a SR Gen10 1.98


I've updated all the firmware to what is available in SPP 2019.03.01.

It might be a hardware issue, but I'm not really sure where to pin it.
Is it da0?

What do these error-messages mean?
Comment 1 Andriy Gapon (FreeBSD committer) 2019-08-28 08:48:04 UTC
ZFS just reported a stuck I/O operation.
The problem is likely to be either in the driver or in the hardware.
Maybe it's triggered by the I/O load that a scrub creates.
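
For reference, that panic string is printed by the ZFS "deadman" watchdog for stalled I/O. A rough sketch of the knobs involved on FreeBSD 12 (names from memory, verify with sysctl -d on your system):

# sysctl -d vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms vfs.zfs.deadman_checktime_ms

deadman_enabled controls whether a stalled I/O panics the machine (setting it to 0 from loader.conf only suppresses the panic, the I/O is still stuck), deadman_synctime_ms is the stall timeout, and deadman_checktime_ms is how often outstanding I/O is checked.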
Comment 2 rainer 2019-08-28 09:18:04 UTC
OK, thanks.

I have two of these servers; this is actually the one that has less I/O (and fewer drives; it finished scrubbing 19T in 4.5h yesterday).

So, I would also tend to point towards hardware. But what is it?
A specific drive? Or is the HBA toast?

I'll have to look if I can actually swap out the HBA or if I need to swap the motherboard.

I've disabled scrubs, so the server works for the moment.
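
(In case it's useful to anyone: if the scrubs come from periodic(8), turning them off and stopping the one already running is roughly the following.)

# sysrc -f /etc/periodic.conf daily_scrub_zfs_enable=NO
# zpool scrub -s datapool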
Comment 3 Peter Eriksson 2019-08-28 11:27:39 UTC
Just another (rather worthless, but anyway) datapoint: 

The same thing happened to us on one of our production file servers just this Monday, during prime daytime (1pm). No scrub was running, just a normal load of SMB and NFS traffic (~400 SMB clients and ~40 NFS clients).

FreeBSD kernel: 11.2-RELEASE-p10

Hardware: Dell PowerEdge R730xd with an LSI SAS3008 (Dell-branded) HBA. The DATA pool the error occurred in has 12 x 10TB SAS 7200rpm drives in a RAID-Z2 config.

After the reboot, no errors could be found via smartctl or in any logs (other than the "panic" message) on that disk or any other disk.

The vdev pointed to in the panic message was the one named "diskid/DISK-7PK8RSLC" below.

# zpool status -v DATA
  pool: DATA
 state: ONLINE
  scan: scrub repaired 0 in 83h42m with 0 errors on Tue Jan  8 07:44:05 2019
config:

	NAME                              STATE     READ WRITE CKSUM
	DATA                              ONLINE       0     0     0
	  raidz2-0                        ONLINE       0     0     0
	    diskid/DISK-7PK784UC          ONLINE       0     0     0
	    diskid/DISK-7PK2GT9G          ONLINE       0     0     0
	    diskid/DISK-7PK8RSLC          ONLINE       0     0     0
	    diskid/DISK-7PK77Z2C          ONLINE       0     0     0
	    diskid/DISK-7PK1U91G          ONLINE       0     0     0
	    diskid/DISK-7PK2GBPG          ONLINE       0     0     0
	  raidz2-1                        ONLINE       0     0     0
	    diskid/DISK-7PK1AZ4G          ONLINE       0     0     0
	    diskid/DISK-7PK2GEEG          ONLINE       0     0     0
	    diskid/DISK-7PK14ARG          ONLINE       0     0     0
	    diskid/DISK-7PK7HS5C          ONLINE       0     0     0
	    diskid/DISK-7PK2GERG          ONLINE       0     0     0
	    diskid/DISK-7PK200TG          ONLINE       0     0     0
	logs
	  diskid/DISK-BTHV7146043R400NGN  ONLINE       0     0     0
	  diskid/DISK-BTHV715403A9400NGN  ONLINE       0     0     0
	cache
	  diskid/DISK-CVCQ72660083400AGN  ONLINE       0     0     0
	spares
	  diskid/DISK-7PK1RNVG            AVAIL   
	  diskid/DISK-7PK784NC            AVAIL   

errors: No known data errors

# sas3ircu 0 DISPLAY 
Avago Technologies SAS3 IR Configuration Utility.
Version 11.00.00.00 (2015.08.04) 
Copyright (c) 2009-2015 Avago Technologies. All rights reserved. 

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS3008
  BIOS version                            : 8.37.00.00
  Firmware version                        : 16.00.04.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 543
  Concurrent commands supported           : 9584
  Slot                                    : 5
  Segment                                 : 0
  Bus                                     : 2
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No
...
Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 2
  SAS Address                             : 5000cca-2-51b8-fbb1
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 9470975/2424569855
  Manufacturer                            : HGST    
  Model Number                            : HUH721010AL4200 
  Firmware Revision                       : LS17
  Serial No                               : 7PK8RSLC
  GUID                                    : N/A
  Protocol                                : SAS
  Drive Type                              : SAS_HDD
...
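
(If anyone wants to repeat the per-disk check: mapping the diskid label from the panic message back to its daN device and pulling the drive's counters is roughly the following, with smartctl coming from sysutils/smartmontools.)

# glabel status | grep DISK-7PK8RSLC
# smartctl -a /dev/daN

where daN is whatever device glabel reports for that label.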
Comment 4 rainer 2019-09-02 00:06:42 UTC
So, replacing the controller (HPE E208i-p SR Gen10) seems to have helped.

The scrub went through.

I know hardware errors are difficult to diagnose from the OS running on top of the hardware, but maybe there could be more diagnostics somehow?


We will have to send back this controller (we pre-ordered a new one on a hunch).
Comment 5 rainer 2019-10-05 12:11:19 UTC
Now, the other of the two servers is also acting up.

After rebooting, it finished its scrub though.

I've not yet ordered a replacement HBA but will do so soon.

The server with the replaced HBA has never shown a problem again. So far ;-)