In normal operation our blade servers generate a very high level of disk I/O, especially the systems with storage blades, which are used as database servers. The overall I/O performance is okay, but from time to time the RAID controller, or the FreeBSD RAID controller driver, stops handling any I/O requests and stalls completely. So far we have been able to determine that the problem must lie with the driver or the controller.

During such an I/O stall the server is still operating, but all disk I/O operations are queued by the kernel and are never finished by the controller or the disks. We also tried to access the disks directly during such a stall, bypassing any filesystem layers, but those requests were likewise only queued and never completed. The only way to get the system working again is a complete reboot.

We have done several analyses of logs and kernel statistics to figure out where the problem could lie, and we have managed to exclude almost everything except the RAID controller itself and the kernel driver (ciss). While experiencing an I/O stall we tried to write directly to the swap partitions of different disks, but this I/O could not be completed either; with these tests we tried to rule out problems caused by the filesystem.

It is very difficult for us to reproduce the I/O needed to cause this problem. So far we only know that systems experiencing higher disk I/O are affected more often. Some of our database servers see extraordinarily high disk I/O during peak hours, and crashes occur more often there. Some systems run for a couple of weeks without any problems and some crash several times a day, but then again run 1-2 weeks without any problems. We are also not sure whether we face exactly the same problem on servers with P220i and on servers with P410i RAID controllers, but the symptoms on the systems are very similar.

We had contact with the Linux cciss staff, and some of their developers recommended that we file a PR; here it is.

At the moment we are testing the "old SIMPLE mode", activated via loader.conf, on our Gen8 machines (see the loader.conf sketch at the end of this report). Earlier tests on our Gen7 storage blade systems did not succeed; to be clear, it reduced the frequency of the stalls but did not resolve them (on the Gen7 storage blades).

How-To-Repeat:

In short: heavy I/O and network load for around 1-2 weeks on the listed kind of hardware; the more machines you have, the more often the I/O stall happens. With 15 Gen8 machines we are experiencing the problem around 6-7 times per week; often 2-3 days pass completely without problems. The problem also occurs at non-heavy-load times, such as early in the morning (without a backup running). We unloaded the Gen7 storage blade systems so they no longer interrupt productive environments, but on a hot-standby system we can reproduce it every 1-3 weeks as long as we keep scrubbing all the time (to reproduce it).

In detail: As described above, it is very difficult for us to reproduce the I/O needed to cause this problem. Systems experiencing higher disk I/O (such as our database servers during peak hours) are affected more often, crash frequency varies from several times a day to weeks of trouble-free uptime, and we cannot say for sure whether the P220i and P410i systems face exactly the same problem, only that the symptoms are very similar.
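For reference, the "old SIMPLE mode" mentioned above is forced through a ciss(4) loader tunable; a minimal loader.conf sketch (the tunable is documented in ciss(4), so treat this as illustrative rather than as our exact configuration):

# /boot/loader.conf: force ciss(4) onto the legacy "simple" transport method
hw.ciss.force_transport="1"     # 1 = force simple mode, 2 = force performant mode, unset = driver default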
Responsible Changed From-To: freebsd-bugs->freebsd-scsi Over to maintainer(s).
I have updated head a bit to handle some edge cases better. I'll send a patch to this ticket for 9.1p4 for you to test.

In the meantime, can you update your kernel configs with BREAK_TO_DEBUGGER or ALT_BREAK_TO_DEBUGGER so you can get a dump when the system stalls? We might be able to determine what the system is waiting on, which would give us a clue as to what the issue is. If you are able to get a system to stall with this set, break to the db> prompt on the serial console and email the output of "alltrace" here so I can do some analysis.

Sean
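A minimal kernel config sketch for the options mentioned above (this assumes the usual KDB/DDB options are also present; ALT_BREAK_TO_DEBUGGER reacts to the CR ~ Ctrl-B sequence, which is less likely to be triggered accidentally on a noisy serial line than a plain BREAK):

options KDB                     # kernel debugger framework
options DDB                     # interactive debugger backend (db> prompt)
options ALT_BREAK_TO_DEBUGGER   # enter the debugger on CR ~ Ctrl-B over the console
#options BREAK_TO_DEBUGGER      # alternative: enter the debugger on a serial BREAK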
Hi Sean

At the moment we are running the serial console at 9600 baud; we will have to raise this to 115200 baud, and even then an alltrace will take "forever" because of hundreds of Apache threads. Because these machines are productive and non-redundant, we have to figure out a procedure that lets us run such a long alltrace without interrupting the services for that long.

As an attachment I can provide an old debugger session with (part of) a ps and some backtraces. It is from one of our HP Gen7 storage blade systems running 9.1-RELEASE or higher, which hung earlier this year with the same problem. If you find enough information in there to say "alltrace is not needed", we'd be happy. Until then we will look for a way to do such an alltrace.

We have already enabled the break possibility by setting the sysctl debug.kdb.break_to_debugger=1.

As soon as you send us the mentioned patch, we will start testing it and roll it out to the affected systems.

Philipp

p.s. Please note: even though a lot of processes are stuck in various zfs states, we also tried dd to some raw partitions on disk and were not successful, so we think it's not a ZFS problem. In gstat we saw the wait queues stall at specific values with nothing going on anymore.
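For completeness, the settings referred to above are ordinary loader and sysctl knobs, roughly like this (a sketch; the exact console setup naturally differs per machine):

# /boot/loader.conf
console="comconsole"
comconsole_speed="115200"        # raise the serial console from 9600 to 115200 baud

# /etc/sysctl.conf
debug.kdb.break_to_debugger=1    # allow breaking into ddb from the console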
For clarification, are the production P220i configured as RAID1 with ZFS on top of them? You mention running them in simple mode; that simply slows them down in the new order of things, so I wouldn't bother. I would request a system dmesg with verbose boot turned on so I can see what parameters the hardware is reporting. Also, have you attempted a 10-CURRENT install and test?

Sean
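The verbose boot requested above can be enabled once with "boot -v" at the loader prompt, or persistently with a loader.conf entry such as:

# /boot/loader.conf
boot_verbose="YES"    # verbose device probing, so the ciss(4) parameters show up in dmesg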
Hi Sean

I'm taking over from Philipp as he's out of office for the next couple of days.

No, neither the P220i in the G8 blades nor the P410i in the G7 blades are configured as RAID1. We expose all disks to the OS by creating one RAID0 "stripe" per disk. On the G8 blades we do ZFS mirroring on top of that; on the G7 blades with storage blades we do raidz2 across 10 disks (2 internal (ciss0), 8 in the storage blade (ciss1)). A sketch of the resulting pool layout follows below.

We haven't tried CURRENT so far, but I've created a patch that brings most ciss driver changes from head to 9.1. Actually, I used ciss from head (r250031) and backed out these two commits that affected more than the ciss driver (i.e. changes to the whole CAM layer):

- r246713
- r249468

Some servers have been running quite stable for over a week now, but we don't know yet whether that's coincidence, the patch, or simple mode. We propagated the patch to all affected systems with the recent security upgrade and at the same time activated simple mode on all of them. That of course doesn't make it easier to find out what really brought some relief. This is why we rebooted 15 G8 blades back into performance mode yesterday, some of them with my patch, some without. So what we are testing right now is the following:

- 11 G8 blades with patch and SIMPLE mode (we keep them up to rule out coincidence; these have over a week of uptime right now)
- 6 G8 blades in PERF mode without patch (to prove the problem comes back when restoring default settings)
- 9 G8 blades in PERF mode with patch (to test whether it is the patch that actually helps)

I think it's too early to state that something actually helped. The situation with simple mode and the patch has certainly been better so far, but then again simple mode could just have slowed things down to the point where the problem no longer occurs often enough. What we can say, though, is that we did have an I/O stall while using that same ciss patch on the G7 blades. That was a few weeks ago and unfortunately we haven't been able to reproduce it since. Also, simple mode did not seem to help on the G7 blades.

The servers we rebooted yesterday now run at 115k console speed, so we should be able to pull out debug information once the problem occurs. Unfortunately, serial console and ddb are the only way to get any information, as a memory dump to the swap partition obviously doesn't work when all disk I/O is stuck.

On the servers with the ciss head patch, we've also defined CISS_DEBUG in ciss.c. If one of those servers hits the problem, we could also print the driver debug information in ddb. Would that be valuable to you? Or do you think it's even bad to have CISS_DEBUG defined, as it could slow things down and make our problem less likely to occur? The servers without the ciss head patch don't have CISS_DEBUG defined, because the driver won't compile on 9.1 with it.

Verbose boot output of a server with the ciss head patch and CISS_DEBUG defined should be attached.

Thanks for looking into this.

Markus
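To illustrate the layout described above (the pool name and the daN numbering are hypothetical; each daN is one single-disk RAID0 logical drive exported by the controller):

# G8 blade (P220i): two single-disk logical drives, mirrored by ZFS
zpool create data mirror da0 da1

# G7 blade plus storage blade (P410i): 2 internal disks on ciss0 and 8 on ciss1,
# all exported as single-disk logical drives and combined into one raidz2 vdev
zpool create data raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9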
I already replied to Sean last Friday but the copy has not made its way into the PR system, probably due to the size of the alltrace attachment. So here's a dropbox link:

https://dl.dropboxusercontent.com/u/10669369/alltrace%20G8%20PERF%209.1ciss.txt

And the email I sent:

----

Hi again

One of the G8 blades we had booted back into PERF mode yesterday stalled last night. This one was _not_ patched with the ciss changes from head, so it was the plain 9.1 driver running in default PERF mode. This means that as soon as we restore the original default configuration, the stalls come back quite quickly. Of course we still don't know whether it is the patch, SIMPLE mode, or the combination of both that seems to help the other systems. Hopefully we'll know that soon.

Anyway, we got an alltrace from the server that stalled. Unfortunately there is no ciss debug output, since it was not patched.

Markus

----
While I wait for a p420 and p410 to test your configuration, I set up a dl180g6 with a p400 in a test configuration. I've had some interesting results.

I set up two disks in a RAID1+0 volume for root, swap and UFS, and created a 10 disk raidz2 volume named "ztest":

--------------------------------------------------------------------------------
bash-4.2# swapinfo; df -k; zpool status
Device          512-blocks     Used    Avail Capacity
/dev/da0p3        16777216        0 16777216     0%
--------------------------------------------------------------------------------
Filesystem    1024-blocks    Used      Avail Capacity  Mounted on
/dev/da0p2        4058430 1256338    2477418    34%    /
devfs                   1       1          0   100%    /dev
/dev/da0p4      245197222 6081304 219500142     3%    /home
procfs                  4       4          0   100%    /proc
ztest          1118637925      67 1118637858     0%    /ztest
--------------------------------------------------------------------------------
  pool: ztest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ztest       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0

errors: No known data errors
--------------------------------------------------------------------------------

I find that as we reach maximum capacity of /ztest, zfs really starts bogging down. I'm running iozone with the following parameters:

--------------------------------------------------------------------------------
iozone -s 50G -r 256K -+q 30 -i 0 -i 1 -R -t 12 -F /ztest/1 /ztest/2 /ztest/3 /ztest/4 /ztest/5 /ztest/6 /ztest/7 /ztest/8 /ztest/9 /ztest/10 /ztest/11 /ztest/12
--------------------------------------------------------------------------------

This would be a good start to more reliably test your configuration. I didn't bother doing gpt partitions and the geom partitioning scheme in this configuration because it's not bootable.

Sean
Hey Sean

I'm glad to hear you're getting the same controller as ours to test. In the meantime, the backported ciss changes from head seem to help a lot on the G8 blades with the P220i controllers. It's quite likely that the G8 problem is already fixed in head. Of course we can't be sure yet, but it might still be better to focus on the G7 with the P410i and storage blade, where the issue has occurred even with ciss from head. So it's good you're getting a P410.

We discussed your test scenario. ZFS is known to go nuts and do a lot of IO once a zpool gets quite full, so is your goal just to maximise IO to reproduce the problem more reliably? Or is there a specific reason why you want us to fill a zpool?

Our problem is that half of the G7 blades are productive, so filling the zpool is no option there. The second half is where the first half replicates all data to, so they're kind of a hot standby and we're more flexible doing tests there, but we still have to keep the replication running, which makes filling the pool impossible as well.

The day before yesterday we installed the patched kernel that has ciss from head and CISS_DEBUG defined on all these standby systems. We run zpool scrubs non-stop on all of them to generate IO (roughly the loop sketched below), and as they are replication targets they also receive some amount of write IO. Like that, we hope to get a system to stall more often, so we can make quicker progress debugging the G7 problem. If you think that more write IO would help, we can look into using iozone, but as stated before, we won't be able to do things like filling the zpool.

Also, once a G7 blade stalls, is there any information apart from alltrace and the DDB ciss debug print that you want us to pull out of the system?

When reading through the ciss driver source I noticed that the DDB print may only output information about the first controller. Since the storage blade contains a second P410, do you think it'd be worth altering the debug function to print out information about every ciss controller in the system?

Markus
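The non-stop scrubbing mentioned above boils down to a loop along these lines (the pool name is hypothetical, and the grep pattern assumes the usual "scrub in progress" wording in the zpool status output):

#!/bin/sh
# keep a scrub running permanently on one pool to generate constant read IO
pool=data
while true; do
    zpool scrub "$pool" 2>/dev/null    # fails harmlessly if a scrub is already running
    while zpool status "$pool" | grep -q "scrub in progress"; do
        sleep 60                       # poll until the current scrub has finished
    done
done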
Hey Sean

Two of the G7 blades stalled today, that was quick... Here's an alltrace and ciss debug output for each:

First stall:
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%201%20(12s)/20130705%20-%201%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%201%20(12s)/20130705%20-%201%20-%20cissdebug.txt

Second stall:
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%202%20(16s)/20130705%20-%202%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130705%20-%20G7%20crash%202%20(16s)/20130705%20-%202%20-%20cissdebug.txt

As said, the debug information is for ciss0 only, because the driver does not iterate through all controllers when printing debug information. Also, if you need more information when this happens the next time, please let us know.

Markus
Hey Steven

Thanks for your input.

On 05.07.2013, at 15:43, Steven Hartland <killing@multiplay.co.uk> wrote:

> Might also want to get the output from "show sleepchain" for all threads
> too as that will easily identify sleep lock dead locks.

Is there an easy way to do this for all threads with one command? The first server that crashed had 800 threads. If not, we should probably script this outside of ddb, using the thread ids from the alltrace output (see the sketch below). Or is there a subset of threads you're particularly interested in?

> Also whats the check_disk process?

This is Nagios' check_disk plugin, which we use to check the filesystem usage on all mountpoints. It runs quite frequently, which is why multiple instances may get started before we notice the stall and break into the debugger.

Markus
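If we end up scripting this outside of ddb, the idea would be roughly the following: pull the tid out of every "Tracing command ... pid ... tid ..." header in an alltrace capture and turn it into one "show sleepchain" invocation per thread (this assumes show sleepchain accepts a tid argument; the header format is taken from our alltrace output):

# generate one ddb command per thread seen in an alltrace capture
awk '/^Tracing command/ {
        for (i = 1; i < NF; i++)
            if ($i == "tid")
                print "show sleepchain", $(i + 1)
     }' alltrace.txt > sleepchain-commands.txt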
I updated stable/9 with most of the changes made to head recently. See if that makes any difference. Sean
I've updated the DDB hook to display as many adapters as possible. Can you add this to your tests please? http://people.freebsd.org/~sbruno/ciss_ddb_update.txt Sean
I checked your MFC and all the fixes are already included in my patch for 9.1 that we're currently testing with. With that patch, all G8 blades are still running stable and have not shown any more IO stalls. The G7 ones still reliably crash with our test load. So I think we can state that we have already tested whether the changes from head help or not. Is there another reason you want us to test with a stable/9 kernel, or should we stick with the patched 9.1 for now?

In any case I'll apply your DDB hook patch to our patched 9.1 kernel, so we'll get more debug information out when a G7 blade stalls next time.

Markus
The patch applied, but it broke the kernel build. I've corrected it as I think it was intended. The version I used is here:

https://dl.dropboxusercontent.com/u/10669369/ciss_ddb_update_v2.txt

Markus
We had another G7 IO stall and were able to get ciss debug output for both controllers. ciss0 is the internal one, ciss1 is the one built into the storage blade.

https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130717%20-%20G7%20crash%20%2815s%29/20130717%20-%20alltrace.txt
https://dl.dropboxusercontent.com/u/10669369/fbsd%20ciss%20iostall%20debug/20130717%20-%20G7%20crash%20%2815s%29/20130717%20-%20cissdebug.txt

I hope this helps...

The G8 blades are still running stable with the ciss changes from head.

Markus
Author: sbruno
Date: Sat Jul 27 17:20:43 2013
New Revision: 253712
URL: http://svnweb.freebsd.org/changeset/base/253712

Log:
  update ciss debugger command to iterate over multiple raid controllers
  up to max controllers supported

  PR:           kern/179932
  Reviewed by:  Markus Gebert <markus.gebert@hostpoint.ch>
  MFC after:    2 weeks
  Sponsored by: Yahoo! Inc.

Modified:
  head/sys/dev/ciss/ciss.c

Modified: head/sys/dev/ciss/ciss.c
==============================================================================
--- head/sys/dev/ciss/ciss.c    Sat Jul 27 17:16:59 2013    (r253711)
+++ head/sys/dev/ciss/ciss.c    Sat Jul 27 17:20:43 2013    (r253712)
@@ -4378,11 +4378,17 @@ ciss_print_adapter(struct ciss_softc *sc
 DB_COMMAND(ciss_prt, db_ciss_prt)
 {
     struct ciss_softc *sc;
+    devclass_t dc;
+    int maxciss, i;

-    sc = devclass_get_softc(devclass_find("ciss"), 0);
-    if (sc == NULL) {
-        printf("no ciss controllers\n");
-    } else {
+    dc = devclass_find("ciss");
+    if ( dc == NULL ) {
+        printf("%s: can't find devclass!\n", __func__);
+        return;
+    }
+    maxciss = devclass_get_maxunit(dc);
+    for (i = 0; i < maxciss; i++) {
+        sc = devclass_get_softc(dc, i);
         ciss_print_adapter(sc);
     }
 }
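For anyone following along: with this change in a DDB-enabled kernel, the state of every detected adapter can be dumped from the debugger prompt with a single command, for example:

db> ciss_prt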
I have a test box for this now with an external array. Currently, it's configured with an onboard P420i with a single RAID0 of one disk, and a P222 handling a 40TB array external to the chassis. I'll see if I can do something to make this die in my Gen8 box.

Sean
For bugs matching the following criteria:

  Status: In Progress
  Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags. Mail being skipped.
Is this PR still relevant? Thanks