Bug 179932 - [ciss] ciss i/o stall problem with HP Bl Gen8 (and HP Bl Gen7 + Storage Blade)
Summary: [ciss] ciss i/o stall problem with HP Bl Gen8 (and HP Bl Gen7 + Storage Blade)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any
OS: Any
Importance: Normal
Severity: Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-24 18:50 UTC by Philipp Maechler
Modified: 2019-01-20 03:52 UTC
CC: 1 user

See Also:


Attachments
20130626_olddebugsession.txt (33.90 KB, text/plain; charset=UTF-8) - 2013-06-26 12:12 UTC, Philipp Maechler
verbose boot g8 ciss head.txt (145.55 KB, text/plain) - 2013-06-27 14:01 UTC, markus.gebert
file.dat (4 bytes, text/plain; charset=us-ascii) - 2013-06-27 14:01 UTC, markus.gebert

Description Philipp Maechler 2013-06-24 18:50:00 UTC
In normal operation our blade servers generate a very high level of disk
I/O, especially the systems with storage blades, which are used as
database servers. Overall I/O performance is okay, but from time to time
the RAID controller or its FreeBSD driver stops handling I/O requests and
stalls completely. So far we have been able to narrow the problem down to
the driver or the controller. During such an I/O stall the server is
still operating, but all disk I/O operations are queued by the kernel and
are never completed by the controller or the disks. We also tried to
access the disks directly during a stall, bypassing the filesystem
layers, but those requests were likewise only queued and never finished
by the disks. The only way to get the system working again in such a case
is a complete reboot.

So far we have analysed logs and kernel statistics to figure out where
the problem could lie. We were able to exclude almost everything except
the RAID controller itself and the kernel driver (ciss). While
experiencing an I/O stall we tried to write directly to the swap
partitions of different disks, but this I/O could not be completed
either. With these tests we tried to rule out any problems caused by the
filesystem.

It is very difficult for us to reproduce the I/O pattern needed to cause
this problem. So far we only know that systems experiencing higher disk
I/O are affected more often. Some of our database servers see
extraordinarily high disk I/O during peak hours, and the stalls occur
more often on those. Some systems run for a couple of weeks without any
problems, while others stall several times a day - but then run for 1-2
weeks without any problems again.

We are also not sure whether we are facing exactly the same problem on
servers with P220i and servers with P410i RAID controllers, but the
symptoms on the systems are very similar.

We were in contact with the Linux cciss people, and some of their
developers recommended that we file a PR - here it is.

At the moment we are testing the old SIMPLE transport mode, activated via
loader.conf, on our Gen8 machines. Earlier tests on our Gen7
storage-blade systems did not succeed; to be precise, it reduced the
frequency of the stalls but did not resolve them (on the Gen7 storage
blades).
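For reference, the SIMPLE mode is selected through a ciss(4) loader
tunable; a minimal sketch of the loader.conf entry, assuming the
hw.ciss.force_transport tunable is available in this driver version:

  # /boot/loader.conf
  # 1 = force the legacy SIMPLE transport, 2 = force PERFORMANT,
  # 0/unset = let the driver choose
  hw.ciss.force_transport="1"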

How-To-Repeat: In short: heavy I/O and network load for around 1-2 weeks
on the listed kind of hardware; the more machines you have, the more
often the I/O stalls happen.

With 15 Gen8 machines we experience the problem around 6-7 times per
week; often 2-3 days pass completely without problems. The problem also
occurs outside heavy-load periods, for example in the early morning
(without a backup running).

We have taken the load off the Gen7 storage-blade systems so as not to
interrupt production any longer; but on a hot-standby system we can
reproduce it every 1-3 weeks as long as we keep scrubbing all the time
(just to reproduce it...).

Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-06-25 06:07:07 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-scsi

Over to maintainer(s).
Comment 2 sean_bruno 2013-06-25 16:41:14 UTC
I have updated head a bit to handle some edge cases better.  I'll send a
patch for 9.1p4 into this ticket for you to test.

In the meantime, can you update your kernel configs with
BREAK_TO_DEBUGGER or ALT_BREAK_TO_DEBUGGER so you can get a dump when
the system stalls? We might be able to determine what the system is
waiting on, which would give us a clue about what the issue is.

If you are able to get a system to stall with this set, break to the db>
prompt on the serial console, and email the output of "alltrace" here so
I can do some analysis.
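A minimal sketch of that setup (kernel config options plus the ddb
commands; the ALT break sequence on the serial console is <CR>~^B):

  # kernel configuration additions (either option is sufficient)
  options BREAK_TO_DEBUGGER       # a serial BREAK drops into ddb
  options ALT_BREAK_TO_DEBUGGER   # the sequence <CR>~^B drops into ddb

  # once a stall happens, on the serial console:
  db> alltrace    # stack traces for every thread
  db> ps          # process list, useful for cross-referencing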

Sean
Comment 3 Philipp Maechler 2013-06-26 12:12:40 UTC
hi Sean

At the moment we are running the serial console at 9600 baud; we would
have to tune this to 115200 baud, and even then an alltrace will take
"forever" because of the hundreds of Apache threads. Because these
machines are in production and non-redundant, we have to figure out a
procedure for doing this long-running alltrace without interrupting the
services for that long.

As an attachment, I can provide you with an old debugger session
containing (part of) a ps and some backtraces. It is from one of our HP
Gen7 storage-blade systems running 9.1-RELEASE or higher, which hung
earlier this year with the same problem.

If you find enough information in there to say "an alltrace is not
needed", we'd be happy. But until then we will look for ways to do such
an alltrace...

We have already enabled the break possibility by setting the sysctl
  debug.kdb.break_to_debugger=1

As soon as you send us the mentioned patch we will start testing it and
roll it out to the affected systems.

Philipp

p.s. Please note: even though there are a lot of processes in various ZFS
wait states, we also tried dd to some raw partitions on disk and were not
successful, so we don't think it's a ZFS problem. In gstat we saw the
wait queues stall at specific values with nothing going on anymore.
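For illustration, the kind of raw-device test we ran during a stall
(device and partition names here are only examples):

  # write a few MB straight to a swap partition, bypassing ZFS/UFS
  dd if=/dev/zero of=/dev/da1p3 bs=1m count=16
  # watch the request queues; during a stall L(q) freezes at some value
  # and ops/s drops to zero
  gstat -a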
Comment 4 sean_bruno 2013-06-26 18:36:47 UTC
For clarification: the production p220i systems are configured as RAID1
with ZFS on top of them?

You mention running them in SIMPLE mode; that just slows them down under
the new scheme of things. I wouldn't bother.

I would request a system dmesg with verbose boot turned on so I can see
what the hardware is reporting as parameters.
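If touching the console at boot time is awkward, something like this
should do (standard loader variable, so just a sketch):

  # /boot/loader.conf
  boot_verbose="YES"          # same effect as booting with -v
  # after the reboot:
  dmesg -a > /var/tmp/dmesg.verbose.txt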

Also, have you attempted a 10-CURRENT install and test?

sean

Comment 5 markus.gebert 2013-06-27 14:01:43 UTC
Hi Sean

I'm taking over from Philipp as he's out of office the next couple of days.

No, neither the p220i in the G8 blades nor the p410i in the G7 blades are configured as RAID1. We expose all disks to the OS by creating one raid0 "stripe" per disk. In the case of the G8 blades we do ZFS mirroring on top of that; in the case of the G7 with storage blades, we do raidz2 across 10 disks (2 internal (ciss0), 8 in the storage blade (ciss1)).

We haven't tried CURRENT so far, but I've created a patch that brings most of the ciss driver changes from head to 9.1. Specifically, I used ciss from head (r250031) and backed out these two commits, because they affected more than just the ciss driver (i.e., changes to the whole CAM layer):

- r246713
- r249468

Some servers have been running quite stably for over a week now, but we don't know yet whether that is coincidence, the patch, or simple mode. We rolled the patch out to all the affected systems with the recent security upgrade and at the same time activated simple mode on all of them. That of course doesn't make it easier to find out what really brought some relief.

This is why we rebooted 15 G8 blades back into performance mode yesterday, some of them with my patch and some without. So what we are testing right now is the following:

- 11 G8 blades with patch and SIMPLE mode (we keep them up to rule out coincidence; these have over a week of uptime right now)
- 6 G8 blades in PERF mode without patch (to show that the problem comes back when the default settings are restored)
- 9 G8 blades in PERF mode with patch (to test whether it is the patch that actually helps)

I think it's too early to state that something actually helped. The situation with simple mode and the patch has certainly been better so far, but then again simple mode could just have slowed things down to the point where the problem no longer occurs often enough.

What we can say, though, is that we did have an IO stall while using that same ciss patch on the G7 blades. That was a few weeks ago and unfortunately we haven't been able to reproduce it since. Simple mode also did not seem to help on the G7 blades.

The servers we rebooted yesterday now run at 115k console speed, so we should be able to pull out debug information once the problem occurs. Unfortunately, serial console and ddb are the only way to get any information, as a memory dump to the swap partition obviously doesn't work when all disk IO is stuck.
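For completeness, the serial console setup on those rebooted servers looks roughly like this (a sketch; the exact settings differ slightly per machine):

  # /boot.config
  -S115200 -h
  # /boot/loader.conf
  boot_serial="YES"
  comconsole_speed="115200"
  console="comconsole"
  # /etc/ttys
  ttyu0  "/usr/libexec/getty std.115200"  vt100  on secure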

On the servers with the ciss head patch, we've also defined CISS_DEBUG in ciss.c. If one of those servers should have the problem, we could also print the driver debug information in ddb. Would that be valuable to you? Or do you think it's even bad to have CISS_DEBUG defined, as it could slow down things and make our problem less likely to occur?

The servers without ciss head patch don't have CISS_DEBUG defined, because the driver won't compile on 9.1 with that.

Verbose boot output of a server with ciss head patch and CISS_DEBUG defined should be attached.


Thanks for looking into this.

Markus

Comment 6 markus.gebert 2013-06-28 14:23:25 UTC
I already replied to Sean last Friday but the copy has not made its way
into the PR system, probably due to the size of the alltrace attachment.
So here's a Dropbox link:

https://dl.dropboxusercontent.com/u/10669369/alltrace%20G8%20PERF%209.1ciss.txt

And the email I sent:

----
Hi again

One of the G8 blades we had booted back into PERF mode yesterday stalled
last night. This one was _not_ patched with the ciss changes from head,
so it was the plain 9.1 driver running in the default PERF mode. This
means that as soon as we restore the original default configuration, the
stalls come back quite quickly. Of course we still don't know whether it
is the patch, SIMPLE mode or the combination of both that seems to help
the other systems. Hopefully we'll know that soon.

Anyway, we got an alltrace from the server that stalled. Unfortunately
there is no ciss debug output, since it was not patched.


Markus
----
Comment 7 sean_bruno 2013-07-03 04:13:38 UTC
While I wait for a p420 and p410 to test your configuration, I set up a
dl180g6 with a p400 in a test configuration.  I've had some interesting
results.

I set up two disks in a RAID1+0 volume for root and swap, using UFS.

I created a 10 disk raidz2 volume named "ztest":
--------------------------------------------------------------------------------
bash-4.2# swapinfo; df -k; zpool status
Device          512-blocks     Used    Avail Capacity
/dev/da0p3        16777216        0 16777216     0%
--------------------------------------------------------------------------------
Filesystem 1024-blocks    Used      Avail Capacity  Mounted on
/dev/da0p2     4058430 1256338    2477418    34%    /
devfs                1       1          0   100%    /dev
/dev/da0p4   245197222 6081304  219500142     3%    /home
procfs               4       4          0   100%    /proc
ztest       1118637925      67 1118637858     0%    /ztest
--------------------------------------------------------------------------------
  pool: ztest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ztest       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0

errors: No known data errors
--------------------------------------------------------------------------------
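For reference, the test pool above was created along these lines (a
single command; the exact options used may have differed):

  zpool create ztest raidz2 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10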

I find that as we reach maximum capacity of /ztest, zfs really starts
bogging down.  

I'm running iozone with the following parameters:
--------------------------------------------------------------------------------
iozone -s 50G -r 256K -+q 30 -i 0 -i 1 -R -t 12
-F /ztest/1 /ztest/2 /ztest/3 /ztest/4 /ztest/5 /ztest/6 /ztest/7 /ztest/8 /ztest/9 /ztest/10 /ztest/11 /ztest/12
--------------------------------------------------------------------------------


This would be a good start to more reliably testing your configuration.
I didn't bother with GPT partitions and a geom partitioning scheme in
this configuration because it's not bootable.

Sean
Comment 8 markus.gebert 2013-07-05 09:19:58 UTC
Hey Sean

I'm glad to hear you're getting the same controller as ours to test. In
the meantime, the backported ciss changes from head seem to help a lot on
the G8 blades with the p220 controllers. It's quite likely that the G8
problem is already fixed in head. Of course we can't be sure yet, but it
might still be better to focus on the G7 with p410 and storage blade,
where the issue has occurred even with ciss from head. So it's good
you're getting a p410.

We discussed your test scenario. ZFS is known to go nuts and generate a
lot of extra IO once a zpool gets quite full, so is your goal just to
maximise IO to reproduce the problem more reliably? Or is there a
specific reason why you want us to fill a zpool?

Our problem is that half of the G7 blades are in production, so filling
the zpool is not an option there. The second half is where the first half
replicates all its data to, so they are a kind of hot standby and we're
more flexible doing tests there, but we still have to keep the
replication running, which makes filling the pool impossible as well.

The day before yesterday we installed the patched kernel that has ciss
from head and CISS_DEBUG defined on all these standby systems. We run
zpool scrubs non-stop on all of them to generate IO, and as they are
replication targets they also receive a certain amount of write IO. That
way we hope to get a system to stall more often, so we can make quicker
progress debugging the G7 problem. If you think that more write IO would
help, we can look into using iozone, but as stated before, we won't be
able to do things like filling the zpool.
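The scrub load is generated with a trivial loop along these lines (a
sketch; the pool name is illustrative):

  # restart a scrub as soon as the previous one has finished
  while true; do
      zpool scrub zdata
      while zpool status zdata | grep -q "scrub in progress"; do
          sleep 300
      done
  done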

Also, once a G7 blade stalls, is there any information apart from the
alltrace and the DDB ciss debug print that you want us to pull out of the
system?

When reading through the ciss driver source I noticed that the DDB print
may only output information about the first controller. Since the storage
blade contains a second p410, do you think it would be worth altering the
debug function to print information about every ciss controller in the
system?


Markus
Comment 9 markus.gebert 2013-07-05 13:28:42 UTC
Hey Sean

Two of the G7 blades stalled today, that was quick… Here's an alltrace
and ciss debug output for each:

First stall:
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%201%20(12s)/20130705%20-%201%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%201%20(12s)/20130705%20-%201%20-%20cissdebug.txt

Second stall:
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%202%20(16s)/20130705%20-%202%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%202%20(16s)/20130705%20-%202%20-%20cissdebug.txt

As said, the debug information is for ciss0 only, because the driver does
not iterate through all controllers when printing debug information.
Also, if you need more information the next time this happens, please let
us know.


Markus
Comment 10 markus.gebert 2013-07-05 15:14:26 UTC
Hey Steven

Thanks for your input.

On 05.07.2013, at 15:43, Steven Hartland <killing@multiplay.co.uk> wrote:

> Might also want to get the output from "show sleepchain" for all threads
> too as that will easily identify sleep lock dead locks.


Is there an easy way to do this for all threads with one command? The first server that crashed had 800 threads. If not, we should probably script this outside of ddb using the thread ids from the alltrace output. Or is there a subset of threads you're particularly interested in?
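What we would run per thread, based on the alltrace output, is sketched
below; "show allchains", if this ddb build has it, might be an all-in-one
alternative:

  db> show sleepchain 100123    # tid taken from the alltrace/ps output (example value)
  db> show allchains            # lock chains for all blocked threads, if available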


> Also whats the check_disk process?


This is Nagios' check_disk plugin, which we use to check the filesystem usage on all mountpoints. It runs quite frequently; that's why multiple instances may get started before we notice and break into the debugger.


Markus
Comment 11 seanbru 2013-07-14 18:17:29 UTC
I updated stable/9 with most of the changes made to head recently.  See
if that makes any difference.

Sean
Comment 12 sean_bruno 2013-07-14 18:54:50 UTC
I've updated the DDB hook to display as many adapters as possible.  Can
you add this to your tests please?

http://people.freebsd.org/~sbruno/ciss_ddb_update.txt

Sean
Comment 13 markus.gebert 2013-07-15 14:51:06 UTC
I checked your MFC and all the fixes are already included in my patch for
9.1 that we're currently testing with. With that patch, all G8 blades are
still running stable and have not shown any more IO stalls. The G7 ones
still reliably crash under our test load. So I think we can state that we
have already tested whether the changes from head help or not. Is there
another reason you want us to test with a stable/9 kernel, or should we
stick with the patched 9.1 for now?

In any case I'll apply your DDB hook patch to our patched 9.1 kernel, so
we'll get more debug information out when a G7 blade stalls next time.


Markus
Comment 14 markus.gebert 2013-07-15 15:42:29 UTC
The patch applied, but it broke the kernel build. I've corrected it as I
think it was intended. The version I used is here:

https://dl.dropboxusercontent.com/u/10669369/ciss_ddb_update_v2.txt


Markus
Comment 15 markus.gebert 2013-07-19 09:52:27 UTC
We had another G7 IO stall and were able to get ciss debug output for
both controllers. ciss0 is the internal one, ciss1 is the one built into
the storage blade.

https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130717%20-%20G7%20crash%20%2815s%29/20130717%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130717%20-%20G7%20crash%20%2815s%29/20130717%20-%20cissdebug.txt

I hope this helps…

G8 blades are still running stable with the ciss changes from head.


Markus
Comment 16 dfilter service freebsd_committer freebsd_triage 2013-07-27 18:21:00 UTC
Author: sbruno
Date: Sat Jul 27 17:20:43 2013
New Revision: 253712
URL: http://svnweb.freebsd.org/changeset/base/253712

Log:
  update ciss debugger command to iterate over multiple raid controllers up to
  max controllers supported
  
  PR:	kern/179932
  Reviewed by:	Markus Gebert <markus.gebert@hostpoint.ch>
  MFC after:	2 weeks
  Sponsored by:	Yahoo! Inc.

Modified:
  head/sys/dev/ciss/ciss.c

Modified: head/sys/dev/ciss/ciss.c
==============================================================================
--- head/sys/dev/ciss/ciss.c	Sat Jul 27 17:16:59 2013	(r253711)
+++ head/sys/dev/ciss/ciss.c	Sat Jul 27 17:20:43 2013	(r253712)
@@ -4378,11 +4378,17 @@ ciss_print_adapter(struct ciss_softc *sc
 DB_COMMAND(ciss_prt, db_ciss_prt)
 {
     struct ciss_softc	*sc;
+    devclass_t dc;
+    int maxciss, i;
 
-    sc = devclass_get_softc(devclass_find("ciss"), 0);
-    if (sc == NULL) {
-	printf("no ciss controllers\n");
-    } else {
+    dc = devclass_find("ciss");
+    if ( dc == NULL ) {
+        printf("%s: can't find devclass!\n", __func__);
+        return;
+    }
+    maxciss = devclass_get_maxunit(dc);
+    for (i = 0; i < maxciss; i++) {
+        sc = devclass_get_softc(dc, i);
 	ciss_print_adapter(sc);
     }
 }
Comment 17 sean_bruno 2013-11-25 16:45:07 UTC
I have a test box for this now with an external array.

Currently, it's configured with an onboard p420i with a single one-disk
raid0, and a P222 handling a 40TB array external to the chassis.

I'll see if I can do something to make this die in my Gen8 box.

sean
Comment 18 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:00:32 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 19 Oleksandr Tymoshenko freebsd_committer freebsd_triage 2019-01-20 03:52:48 UTC
Is this PR still relevant?

Thanks