Summary: [vfs] [patch] Heavy disk I/O may hang system
Product: Base System
Reporter: Klaus Weber <fbsd-bugs-2013-1>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open
Severity: Affects Only Me
CC: agh, chris, emaste, pr, rainer
Priority: Normal
Keywords: patch
Version: Unspecified
Hardware: Any
OS: Any
Attachments:
Description
Klaus Weber
2013-05-26 21:00:00 UTC
On Sun, 26 May 2013, Klaus Weber wrote:

>> Environment:
> FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013 root@filepile:/usr/obj/usr/src/sys/FILEPILE amd64
> (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
>> Description:
> Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.

Please use the unix newline character in mail.

A huge drop in disk throughput is normal with multiple processes, unfortunately. Perhaps not quite this much.

> Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., and processes that are already running will continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.

Hangs are bugs though.

I have been arguing with kib@ about some methods of handling heavy disk i/o being nonsense, since they either make the problem worse (by switching to direct unclustered writes, so that slow i/o goes even slower) or have no effect except to complicate the analysis of the problem (because they are old hackish methods, and newer, better methods make the cases handled by the old methods unreachable). But the analysis is complicated, and we couldn't agree on anything.
One thing I noticed more recently than my last mail to kib about this is that even a single heavy disk write (say dd bs=256m count=2 </dev/zero >foo) can block another non-write-bound process (say iostat 1 or systat -v 1 to watch the progress) from starting up (sometimes even when it has just been run to load it into cache(s)) until the dd completes 10 seconds later. This was on a 2-core system running a fairly old version of FreeBSD (8.0-CURRENT from ~2008).

I found that the problem could be fixed by killing cluster_write() by turning it into bdwrite() (by editing the running kernel using ddb, since this is easier than rebuilding the kernel). I was trying many similar things since I had a theory that cluster_write() is useless. Actually, cluster_write() gives some minor optimizations, but this misbehaviour indicates that it has a resource starvation bug. My theory for what the bug is is that cluster_write() and cluster_read() share the limited resource of pbufs. pbufs are not managed as carefully as normal buffers. In particular, there is nothing to limit write pressure from pbufs like there is for normal buffers.

However, this behaviour doesn't happen on the much older version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many local modifications). It starts iostat to watch dd instantly, even running on a single-core system.

> gpart create -s gpt da0
> gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
> newfs -b 64k -f 8k /dev/da0p1
> mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

The default for newfs is -b 32k. This asks for buffer cache fragmentation. Someone increased the default from 16k to 32k without changing the buffer cache's preferred size (BKVASIZE = 16K). BKVASIZE has always been too small, but on 32-bit arches kernel virtual memory is too limited to have a larger BKVASIZE by default. BKVASIZE is still 16K on all arches although this problem doesn't affect 64-bit arches.

-b 64k is worse.
> When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.

You got the buffer cache fragmentation that you asked for. Buffer cache fragmentation caused serious problems 10-20 years ago, but the ones causing hangs were supposed to be fixed about 12 years ago, and the ones causing extreme slowness for just the software parts of handling fragmentation were fixed a little later. But there are apparently still some.

However, I have never been able to reproduce serious fragmentation problems from using too-large block sizes, or demonstrate significant improvements from avoiding the known fragmentation problem by increasing BKVASIZE. Perhaps my systems are too small, or have tuning or local changes that accidentally avoid the problem. Apparently you found a way to reproduce the serious fragmentation problems. Try using a block size that doesn't ask for the problem.

Increasing BKVASIZE would take more work than this, since although it was intended to be a system parameter which could be changed to reduce the fragmentation problem, one of the many bugs in it is that it was never converted into a "new" kernel option. Another of the bugs in it is that doubling it halves the number of buffers, so doubling it does more than use twice as much kva. This severely limited the number of buffers back when memory sizes were 64MB. It is not a very significant limitation if the memory size is 1GB or larger.

> I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec.
> I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk.

I get ~39MB/sec with 1 "dd bs=128n count=2 /dev/zero >foo" writing to a nearly full old ffs file system on an old PATA disk, and 2 times 20MB/sec with 2 dd's. This is almost as good as possible. The configuration was somewhat flawed:
- the writes all fitted in memory because my disks are too full to hold more. This shouldn't be a problem for write benchmarks, because write pressure is (too) limited by writing out dirty buffers (too) early to limit write pressure, so writes are mostly physical.
- the fs block size was 16K. This is not a flaw, just the right size, but other sizes should be tested.
- the file system was mounted async. This should have little effect, since for large writes all data is written async anyway (the write pressure handling causes async writes instead of delayed writes even for not so large writes).
- ffs was modified to remove its bogus write pressure handling. This should have no effect, but I didn't verify it.

> I have, however, logged the sysctl vfs subtree every second while the problem occurs:
> As long as both bonnie++ are in the "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), it starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
>
> 00-03-28.log:vfs.numdirtybuffers: 107
> 00-03-29.log:vfs.numdirtybuffers: 106
> 00-03-30.log:vfs.numdirtybuffers: 105
> ...
> [first bonnie++ starts Rewriting]
> 00-03-41.log:vfs.numdirtybuffers: 294
> 00-03-42.log:vfs.numdirtybuffers: 404
> 00-03-43.log:vfs.numdirtybuffers: 959
> 00-03-44.log:vfs.numdirtybuffers: 1071

There are various write pressure handling mechanisms that have different thresholds and get in each other's way a bit. The dirty buffer watermarks are presumably much higher than 105 or even 1071. Other mechanisms apparently work well and limit the number of dirty buffers to only 107 with 1 writer. Writes to memory can go much faster than writes to the disk even when the latter go at full speed, but some mechanism blocks writers (too) early, so the 1 writer soon reaches a steady state with not many dirty buffers.

> 00-03-45.log:vfs.numdirtybuffers: 1155
> 00-03-46.log:vfs.numdirtybuffers: 1253
> 00-03-47.log:vfs.numdirtybuffers: 1366
> 00-03-48.log:vfs.numdirtybuffers: 1457
> ...
> 00-04-21.log:vfs.numdirtybuffers: 8156
> 00-04-22.log:vfs.numdirtybuffers: 8362
> 00-04-23.log:vfs.numdirtybuffers: 8558
> [second bonnie goes Rewriting as well]
> 00-04-24.log:vfs.numdirtybuffers: 11586
> 00-04-25.log:vfs.numdirtybuffers: 16325
> 00-04-26.log:vfs.numdirtybuffers: 24333
> ...
> 00-04-54.log:vfs.numdirtybuffers: 52096
> 00-04-57.log:vfs.numdirtybuffers: 52098
> 00-05-00.log:vfs.numdirtybuffers: 52096
> [ etc. ]

This is a rather large buildup and may indicate a problem. Try reducing the dirty buffer watermarks. Their default values are mostly historical nonsense. They are just some fixed fraction (about 1/4) of the number of buffers. With multiple GB of memory, the number of buffers may be huge. On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh is 35523. But the size of main memory has nothing to do with the correct limit on the number of dirty buffers (unless the latter is not being used to limit write pressure, but only to limit starvation of buffers for readers. This is perhaps all it is supposed to do, but see below).
We want buildups of dirty buffers to be limited so that the system never gets too far behind on physical writes. The correct limit depends on the physical write capabilities of the machine and on which write channels have pressure on them. Default global limits don't work at all for this. You can fix the defaults, but you can't distribute the limits properly across channels, or easily keep the limits up to date as channels are added. It is easy to add slow USB channels. These won't increase the write capability much, but it is very easy for a slow disk, or the channel to it, to grab many dirty buffers and keep them for a long time so that faster disks are starved.

Other mechanisms handle write pressure better than the simple dirty buffer count one can (others count the amount of i/o, which is more relevant to how long it will take). So assume that any effect of the dirty buffer count mechanism on write pressure is unintentional. There are still the following problems with it:
- as above, slow channels can starve fast channels, thus giving a severe side effect on the write pressure (at best the users of the fast channels block so that they are slowed down to the speed of the slow channels)
- buffer starvation for readers can happen anyway. vfs.nbuf is the number of buffers that are allocated, but not all of these can be used when some file systems use buffer sizes larger than about BKVASIZE. The worst case is when all file systems use a block size of 64KB. Then only about 1/4 of the allocated buffers can be used. The number that can be used is limited by vfs.maxbufspace and related variables. The default for vfs.maxbufspace is controlled indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf = 128K; the ratio is 16K = BKVASIZE).

So when you ask for buffer cache fragmentation with -b 64k, you get even more. The effective nbuf is reduced by a factor of 4.
This means that all the dirty buffer count watermarks except one are slightly above the effective nbuf, so the higher watermarks cannot quite be reached, and when they are nearly reached, readers are starved of buffers.

> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> (the machine has 32 GB RAM)

Fairly consistent with my example of a machine with 24GB RAM. The upper watermark is 33% higher, etc. Your nbuf will be more like 200K than 150K. You reached numdirtybuffers = 52096. At 64K each, that's about 3GB. maxbufspace will be about the same, and bufspace (the amount used) will be not much smaller. There will be little space for buffers for readers, and the 2 writers apparently manage to starve even each other, perhaps for similar reasons. The backlog of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec it would take almost an hour, so it is not much better than a hang.

> I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

I think the hardware is usually not the problem (at least after kib's fix for the above), except when it is large it can stress the software more.

Bruce

First of all, thank you very much for looking into this, and for your detailed explanations. Much appreciated.

On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> On Sun, 26 May 2013, Klaus Weber wrote:
>>> Description:
>> During heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.
>
> Please use the unix newline character in mail.

My apologies. I submitted the report via the web-interface and did not realize that it would come out this way.
> A huge drop in disk throughput is normal with multiple processes,
> unfortunately. Perhaps not quite this much.

Yes, the slight performance drop from ~1 GB/sec to ~600 or even ~400 MB/sec does not concern me (this might even be an issue with the arcmsr driver or the controller's firmware; I found some hints that this might be the case). However, a performance drop of three orders of magnitude for just two concurrent processes seemed to indicate a problem.

> Hangs are bugs though.
>
> I have been arguing with kib@ about some methods of handling heavy disk
> i/o being nonsense since they either make the problem worse (by switching
> to direct unclustered writes, so that slow i/o goes even slower) or have
> no effect except to complicate the analysis of the problem (because they
> are old hackish methods, and newer better methods make the cases handled
> by the old methods unreachable). But the analysis is complicated, and
> we couldn't agree on anything.

The machine where I experience this problem is not in production yet. I can reboot it at any time, test patches, etc. Just let me know if I can do anything helpful. The only limitation is that I usually have access to the server only on weekends.

> [single process with heavy write access can block other non-write-bound process]

I have not experienced this so far, but I can test this when I have access to the server next time.

> I found that
> the problem could be fixed by killing cluster_write() by turning it into
> bdwrite() (by editing the running kernel using ddb, since this is easier
> than rebuilding the kernel). I was trying many similar things since I
> had a theory that cluster_write() is useless. [...]

If that would provide a useful datapoint, I could try whether that makes a difference on my system. What changes would be required to test this? Surely it's not as easy as replacing the function body of cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
> My theory for what the bug is is that
> cluster_write() and cluster_read() share the limited resource of pbufs.
> pbufs are not managed as carefully as normal buffers. In particular,
> there is nothing to limit write pressure from pbufs like there is for
> normal buffers.

Is there anything I can do to confirm or rebut this? Is the number of pbufs in use visible via a sysctl, or could I add debug printfs that are triggered when certain limits are reached?

>> newfs -b 64k -f 8k /dev/da0p1
>
> The default for newfs is -b 32k. This asks for buffer cache fragmentation.
> Someone increased the default from 16k to 32k without changing the buffer
> cache's preferred size (BKVASIZE = 16K). BKVASIZE has always been too
> small, but on 32-bit arches kernel virtual memory is too limited to have
> a larger BKVASIZE by default. BKVASIZE is still 16K on all arches
> although this problem doesn't affect 64-bit arches.
>
> -b 64k is worse.

Thank you for this explanation. I was not aware that -b 64k (or even the default values to newfs) would have this effect. I will repeat the tests with 32/4k and 16/2k, although I seem to remember that 64/8k provided a significant performance boost over the defaults. This, and the reduced fsck times, was my original motivation to go with the larger values.

Given the potentially drastic effects of block sizes other than 16/2k, maybe a warning should be added to the newfs manpage? I only found the strong advice to maintain an 8:1 buffer:fragment ratio.

>> When both bonnie++ processes are in their "Rewriting" phase, the system
>> hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
>> bufdaemon takes about 40% CPU time and is in state "qsleep" when not
>> active.
>
> You got the buffer cache fragmentation that you asked for.

Looking at vfs_bio.c, I see that it has defrag code in it. Should I try adding some debug output to this code to get some insight into why this does not work, or is not as effective as it should be?
> Apparently you found a way to reproduce the serious fragmentation
> problems.

A key factor seems to be the "Rewriting" operation. I see no problem during the "normal" writing, nor could I reproduce it with concurrent dd runs.

> Try using a block size that doesn't ask for the problem.

Will do, although for production use I would really prefer a 64/8k system, due to the higher performance.

> Increasing BKVASIZE would take more work than this, since although it
> was intended to be a system parameter which could be changed to reduce
> the fragmentation problem, one of the many bugs in it is that it was
> never converted into a "new" kernel option. Another of the bugs in
> it is that doubling it halves the number of buffers, so doubling it
> does more than use twice as much kva. This severely limited the number
> of buffers back when memory sizes were 64MB. It is not a very
> significant limitation if the memory size is 1GB or larger.

Should I try to experiment with BKVASIZE of 65536? If so, can I somehow up the number of buffers again? Also, after modifying BKVASIZE, is it sufficient to compile and install a new kernel, or do I have to build and install the entire world?

> I get ~39MB/sec with 1 "dd bs=128n count=2 /dev/zero >foo" writing to
> a nearly full old ffs file system on an old PATA disk, and 2 times
> 20MB/sec with 2 dd's. This is almost as good as possible.

I agree. Performance with dd, or during bonnie++'s "Writing intelligently" phase, is very reasonable, with both a single process and with two processes simultaneously. Something specific to the "Rewriting..." workload is triggering the problem.

>> [second bonnie goes Rewriting as well]
>> 00-04-24.log:vfs.numdirtybuffers: 11586
>> 00-04-25.log:vfs.numdirtybuffers: 16325
>> 00-04-26.log:vfs.numdirtybuffers: 24333
>> ...
>> 00-04-54.log:vfs.numdirtybuffers: 52096
>> 00-04-57.log:vfs.numdirtybuffers: 52098
>> 00-05-00.log:vfs.numdirtybuffers: 52096
>> [ etc. ]
>
> This is a rather large buildup and may indicate a problem. Try reducing
> the dirty buffer watermarks. Their default values are mostly historical
> nonsense.

You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be reasonable starting values for experimenting? 800/200?

> [Helpful explanation snipped]
> - buffer starvation for readers can happen anyway. vfs.nbuf is the
>   number of buffers that are allocated, but not all of these can be
>   used when some file systems use buffer sizes larger than about
>   BKVASIZE. The worst case is when all file systems use a block size of
>   64KB. Then only about 1/4 of the allocated buffers can be used.
>   The number that can be used is limited by vfs.maxbufspace and
>   related variables. The default for vfs.maxbufspace is controlled
>   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
>   128K; the ratio is 16K = BKVASIZE).
>
> So when you ask for buffer cache fragmentation with -b 64k, you get
> even more. The effective nbuf is reduced by a factor of 4. This
> means that all the dirty buffer count watermarks except one are
> slightly above the effective nbuf, so the higher watermarks cannot
> quite be reached, and when they are nearly reached readers are
> starved of buffers.

That makes sense - and bonnie++ is both reader and writer in the rewriting phase, and thus may even "starve itself"?

>> vfs.hidirtybuffers: 52139
>> vfs.lodirtybuffers: 26069
>> (the machine has 32 GB RAM)
>
> Fairly consistent with my example of a machine with 24GB RAM.
> The upper watermark is 33% higher, etc. Your nbuf will be more like
> 200K than 150K. You reached numdirtybuffers = 52096. At 64K each,
> that's about 3GB. maxbufspace will be about the same, and bufspace
> (the amount used) will be not much smaller. There will be little
> space for buffers for readers, and the 2 writers apparently manage
> to starve even each other, perhaps for similar reasons.
> The backlog
> of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
> it would take almost an hour, so it is not much better than a hang.

OK, I will experiment with these parameters when I have access to the system again (which won't be before Thursday, unfortunately).

Thanks again so far,
Klaus

On Mon, 27 May 2013, Klaus Weber wrote:
> On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
>> On Sun, 26 May 2013, Klaus Weber wrote:
>>>> Description:
>>> During heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.
>
>> Please use the unix newline character in mail.
>
> My apologies. I submitted the report via the web-interface and did not
> realize that it would come out this way.

Thanks. The log output somehow came out right.

>> I found that
>> the problem could be fixed by killing cluster_write() by turning it into
>> bdwrite() (by editing the running kernel using ddb, since this is easier
>> than rebuilding the kernel). I was trying many similar things since I
>> had a theory that cluster_write() is useless. [...]
>
> If that would provide a useful datapoint, I could try if that makes a
> difference on my system. What changes would be required to test this?
>
> Surely it's not as easy as replacing the function body of
> cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?

That should work for testing, but it is safer to edit ffs_write() and remove the block where it calls cluster_write() (or bawrite()), so that it falls through to call bdwrite() in most cases.

>> My theory for what the bug is is that
>> cluster_write() and cluster_read() share the limited resource of pbufs.
>> pbufs are not managed as carefully as normal buffers.
>> In particular,
>> there is nothing to limit write pressure from pbufs like there is for
>> normal buffers.
>
> Is there anything I can do to confirm or rebut this? Is the number of
> pbufs in use visible via a sysctl, or could I add debug printfs that
> are triggered when certain limits are reached?

Here I don't really know what to look for. First add a sysctl to read the number of free pbufs. The variable for this is cluster_pbuf_freecnt in vm.

>>> newfs -b 64k -f 8k /dev/da0p1
>>
>> The default for newfs is -b 32k. This asks for buffer cache fragmentation.
>> Someone increased the default from 16k to 32k without changing the buffer
>> cache's preferred size (BKVASIZE = 16K). BKVASIZE has always been too
>> small, but on 32-bit arches kernel virtual memory is too limited to have
>> a larger BKVASIZE by default. BKVASIZE is still 16K on all arches
>> although this problem doesn't affect 64-bit arches.
>>
>> -b 64k is worse.
>
> Thank you for this explanation. I was not aware that -b 64k (or even
> the default values to newfs) would have this effect. I will repeat the
> tests with 32/4k and 16/2k, although I seem to remember that 64/8k
> provided a significant performance boost over the defaults. This, and
> the reduced fsck times, was my original motivation to go with the
> larger values.

The reduced fsck time and perhaps the reduced number of cylinder groups are the main advantages of large clusters. vfs-level clustering turns most physical i/o's into 128K-blocks (especially for large files) so there is little difference between the i/o speed for all fs block sizes unless the fs block size is very small.

> Given the potentially drastic effects of block sizes other than 16/2k,
> maybe a warning should be added to the newfs manpage? I only found the
> strong advice to maintain an 8:1 buffer:fragment ratio.

Once the kernel misconfiguration is understood enough for such a warning to not be FUD, it should be easy to fix.
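Bruce's suggestion above (export cluster_pbuf_freecnt so it can be watched) could look roughly like the following. This is a kernel fragment for illustration only: the variable name is from the thread, but the sysctl oid name, flags, and placement (e.g. in vfs_cluster.c on 9.x) are my assumptions, and it will not compile outside a kernel source tree.

```
/* Kernel-only sketch: expose the free pbuf count so it can be watched
 * with "sysctl vfs.cluster_pbuf_freecnt" while reproducing the hang.
 * cluster_pbuf_freecnt is set up in the vm code; the oid name is
 * illustrative, not an existing sysctl. */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

extern int cluster_pbuf_freecnt;

SYSCTL_INT(_vfs, OID_AUTO, cluster_pbuf_freecnt, CTLFLAG_RD,
    &cluster_pbuf_freecnt, 0, "Free pbufs available for cluster I/O");
```

With this in place, logging the counter once per second alongside vfs.numdirtybuffers would show whether pbuf exhaustion coincides with the "Rewriting..." stalls.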
>>> When both bonnie++ processes are in their "Rewriting" phase, the system
>>> hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
>>> bufdaemon takes about 40% CPU time and is in state "qsleep" when not
>>> active.
>>
>> You got the buffer cache fragmentation that you asked for.
>
> Looking at vfs_bio.c, I see that it has defrag code in it. Should I
> try adding some debug output to this code to get some insight why this
> does not work, or not as effective as it should?

Don't start there, since it is complicated and timing-dependent. Maybe add some printfs to make it easy to see when it enters and leaves defrag mode.

>> Apparently you found a way to reproduce the serious fragmentation
>> problems.
>
> A key factor seems to be the "Rewriting" operation. I see no problem
> during the "normal" writing, nor could I reproduce it with concurrent
> dd runs.

I don't know exactly what bonnie's rewrite mode does. Is it just read/[modify]/write of sequential blocks with a fairly small block size? Old bonnie docs say that the block size is always 8K. One reason I don't like bonnie. Clustering should work fairly normally with that. Anything with random seeks would break clustering.

>> Increasing BKVASIZE would take more work than this, since although it
>> was intended to be a system parameter which could be changed to reduce
>> the fragmentation problem, one of the many bugs in it is that it was
>> never converted into a "new" kernel option. Another of the bugs in
>> it is that doubling it halves the number of buffers, so doubling it
>> does more than use twice as much kva. This severely limited the number
>> of buffers back when memory sizes were 64MB. It is not a very
>> significant limitation if the memory size is 1GB or larger.
>
> Should I try to experiment with BKVASIZE of 65536? If so, can I
> somehow up the number of buffers again?
> Also, after modifying
> BKVASIZE, is it sufficient to compile and install a new kernel, or do
> I have to build and install the entire world?

Just the kernel, but changing sys/param.h will make most of the world want to recompile itself according to dependencies. I don't like rebuilding things, and often set timestamps in header files back to what they were to avoid rebuilding (after rebuilding only the object files that actually depend on the change). Use this hack with caution, or rebuild kernels in a separate tree that doesn't affect the world.

>>> [second bonnie goes Rewriting as well]
>>> 00-04-24.log:vfs.numdirtybuffers: 11586
>>> 00-04-25.log:vfs.numdirtybuffers: 16325
>>> 00-04-26.log:vfs.numdirtybuffers: 24333
>>> ...
>>> 00-04-54.log:vfs.numdirtybuffers: 52096
>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem. Try reducing
>> the dirty buffer watermarks. Their default values are mostly historical
>> nonsense.
>
> You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
> reasonable starting values for experimenting? 800/200?

1000 or 10000 (if nbuf is 50000). 1000 is probably too conservative, but I think it is plenty for most loads.

Bruce

Sorry for the late reply, testing took longer than expected. (I have combined your replies from separate mails into one, and reordered some of the text.)

On Tue, May 28, 2013 at 10:03:10PM +1000, Bruce Evans wrote:
> On Mon, 27 May 2013, Klaus Weber wrote:
> > On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> >> On Sun, 26 May 2013, Klaus Weber wrote:
> However, I have never been able to reproduce serious fragmentation problems
> from using too-large block sizes, or demonstrate significant improvements
> from avoiding the known fragmentation problem by increasing BKVASIZE.
> Perhaps my systems are too small, or have tuning or local changes that
> accidentally avoid the problem.
> Apparently you found a way to reproduce the serious fragmentation
> problems. Try using a block size that doesn't ask for the problem.
> (...)
> The reduced fsck time and perhaps the reduced number of cylinder groups
> are the main advantages of large clusters. vfs-level clustering turns
> most physical i/o's into 128K-blocks (especially for large files) so
> there is little difference between the i/o speed for all fs block sizes
> unless the fs block size is very small.

I have now repeated the tests with several variations of block and fragment sizes. In all cases, I did two tests:

1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
   bonnie++ -s 64g -n 0 -f -D -d /mnt/t2

The dd is simply to give a rough idea of the performance impact of the fs parameters; with the two bonnie++ processes, I was mainly interested in performance and hangs when both bonnie++ processes are in their "Rewriting" phase. I have also tested variations where the block:fragment ratio does not follow the 8:1 recommendation.

64/8k, kernel unpatched:
  dd: 1218155712 bytes/sec
  bonnie++: around 300 MB/sec, then drops to 0 and system hangs

32/4k, kernel unpatched:
  dd: 844188881 bytes/sec
  bonnie++: jumps between 25 and 900 MB/sec, no hang

16/2k, kernel unpatched:
  dd: 517996142 bytes/sec
  bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of 400-650 MB/sec, no hang

64/4k, kernel unpatched:
  dd: 1156041159 bytes/sec
  bonnie++: hangs system quickly once both processes are in rewriting

32/8k, kernel unpatched:
  dd: 938072430 bytes/sec
  bonnie++: 29-50 MB/sec, with 3-10 second "bursts" of up to 650 MB/sec, no hang (but I canceled the test after an hour or so)

So while a file system created with the current (32/4k) or old (16/2k) defaults does prevent the hangs, it also reduces the sequential write performance to 70% and 43% of a 64/8k fs. The problem seems to be the 64k block size, not the 8k fragment size.
In all cases, the vfs.numdirtybuffers count remained fairly small as long as the bonnie++ processes were writing the test files. It rose to vfs.hidirtybuffers (slower with only one process in "Rewriting", much faster when both processes are rewriting).

>> 00-04-57.log:vfs.numdirtybuffers: 52098
>> 00-05-00.log:vfs.numdirtybuffers: 52096
>> [ etc. ]
>
> This is a rather large buildup and may indicate a problem. Try reducing
> the dirty buffer watermarks.

I have tried to tune system parameters as per your advice, in an attempt to get a 64/8k fs running stable and with reasonable write performance. (dd results omitted for brevity; all normal for a 64/8k fs)

All with 64/8k, kernel unpatched:

vfs.lodirtybuffers=250
vfs.hidirtybuffers=1000
vfs.dirtybufthresh=800
  bonnie++: 40-150 MB/sec, no hang
  vfs.numdirtybuffers rises to 1000 when both processes are rewriting.

vfs.lodirtybuffers=1000
vfs.hidirtybuffers=4000
vfs.dirtybufthresh=3000
  bonnie++: 380-50 MB/sec, no hang

For the next tests, I kept lo/hidirtybuffers at 1000/4000, and only varied dirtybufthresh:

1200: bonnie++: 80-370 MB/sec
1750: bonnie++: around 600 MB/sec
1900: bonnie++: around 580 MB/sec. vfs.numdirtybuffers=3800 (i.e. does not reach vfs.hidirtybuffers anymore!)

(no hangs in any of the tests)

I then re-tested with lo/hidirtybuffers at their defaults, and only dirtybufthresh set to slightly less than half of hidirtybuffers:

vfs.lodirtybuffers=26069
vfs.hidirtybuffers=52139
vfs.dirtybufthresh=26000
  dd: 1199121549 bytes/sec
  bonnie++: 180-650 MB/sec, mostly around 500, no hang

By testing with 3, 4 and 5 bonnie++ processes running simultaneously, I found that (vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly less than vfs.hidirtybuffers for reasonable performance without hangs. vfs.numdirtybuffers rises to (vfs.dirtybufthresh) * (number of bonnie++ processes), and as long as this is below vfs.hidirtybuffers, the system will not hang.
> >> I found that
> >> the problem could be fixed by killing cluster_write() by turning it
> >> into bdwrite() (by editing the running kernel using ddb, since this
> >> is easier than rebuilding the kernel). I was trying many similar
> >> things since I had a theory that cluster_write() is useless. [...]
> >
> > If that would provide a useful datapoint, I could try whether that
> > makes a difference on my system. What changes would be required to
> > test this?
> >
> > Surely it's not as easy as replacing the function body of
> > cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
>
> That should work for testing, but it is safer to edit ffs_write()
> and remove the block where it calls cluster_write() (or bawrite()),
> so that it falls through to call bdwrite() in most cases.

I was not sure whether to disable the "bawrite(bp);" in the else part as
well. Here is what I used for the next test (in ffs_write):

	} else if (vm_page_count_severe() ||
	    buf_dirty_count_severe() ||
	    (ioflag & IO_ASYNC)) {
		bp->b_flags |= B_CLUSTEROK;
		bawrite(bp);
/* KWKWKW	} else if (xfersize + blkoffset == fs->fs_bsize) {
		if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
			bp->b_flags |= B_CLUSTEROK;
			cluster_write(vp, bp, ip->i_size, seqcount);
		} else {
			bawrite(bp);
		}
   KWKWKW */
	} else if (ioflag & IO_DIRECT) {
		bp->b_flags |= B_CLUSTEROK;
		bawrite(bp);
	} else {
		bp->b_flags |= B_CLUSTEROK;
		bdwrite(bp);
	}

dd: 746804775 bytes/sec

During the dd tests, iostat shows a weird, sawtooth-like behavior:

 64.00 26730 1670.61   0  0 12  3 85
 64.00 13308  831.73   0  0  4  1 95
 64.00  5534  345.85   0  0 10  1 89
 64.00    12    0.75   0  0 16  0 84
 64.00 26544 1658.99   0  0 10  2 87
 64.00 12172  760.74   0  0  3  1 95
 64.00  8190  511.87   0  0  8  1 91
 64.00    10    0.62   0  0 14  0 86
 64.00 22578 1411.11   0  0 14  3 83
 64.00 12634  789.63   0  0  3  1 95
 64.00 11695  730.96   0  0  6  2 92
 48.00     7    0.33   0  0 13  0 87
 64.00 11801  737.58   0  0 17  1 82
 64.00 19113 1194.59   0  0  6  2 92
 64.00 15996  999.77   0  0  4  2 94
 64.00     3    0.19   0  0 13  0 87
 64.00 10202  637.63   0  0 16  1 83
 64.00 20443 1277.71   0  0  8  2 90
 64.00 15586  974.10   0  0  4  1 95
 64.00   682   42.64   0  0 13  0 87

With two bonnie++ processes in the "Writing intelligently" phase, iostat
jumped between 9 and 350 MB/sec. I cancelled the test before the first
bonnie++ process reached the "Rewriting..." phase, due to the dismal
performance. Already during the "Writing intelligently" phase,
vfs.numdirtybuffers reaches vfs.hidirtybuffers (during previous tests,
vfs.numdirtybuffers only rose to high numbers in the "Rewriting..."
phase).

After reverting the source change, I decided to try mounting the file
system with "-o noclusterr,noclusterw", and re-test. This is equivalent
to disabling only the if-part of the expression in the source snippet
above.

dd: 1206199580 bytes/sec
bonnie++: 550-700 MB/sec, no hang

During the tests, vfs.numdirtybuffers remains low; lo/hidirtybuffers and
dirtybufthresh are at their defaults:

vfs.dirtybufthresh: 46925
vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
vfs.numdirtybuffers: 15

So it looks like you were spot-on in suspecting cluster_write(). Further
tests confirmed that "-o noclusterw" is sufficient to prevent the hangs
and provide good performance. "-o noclusterr" on its own makes no
difference; the system will hang.

I have also tested with write-clustering enabled, but with
vfs.write_behind=0 and vfs.write_behind=2, respectively. In both cases,
the system hangs with two bonnie++ processes in "Rewriting...".

I have also tested with BKVASIZE set to 65536. As you explained, this
reduced the number of buffers:

vfs.dirtybufthresh: 11753
vfs.hidirtybuffers: 13059
vfs.lodirtybuffers: 6529

dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec, and
numdirtybuffers reaches its maximum:

vfs.numdirtybuffers: 13059

Even though numdirtybuffers reaches hidirtybuffers, the system does not
hang, but performance is not very good.
With BKVASIZE set to 65536 _and_ the fs mounted "-o noclusterw", the
performance is the same as with a BKVASIZE of 16k, and the system does
not hang. I have now reverted BKVASIZE to its default, as the main factor
for a stable and fast system seems to be the noclusterw mount option.

> >> Apparently you found a way to reproduce the serious fragmentation
> >> problems.
> >
> > A key factor seems to be the "Rewriting" operation. I see no problem
> > during the "normal" writing, nor could I reproduce it with concurrent
> > dd runs.
>
> I don't know exactly what bonnie rewrite mode does. Is it just read/
> [modify]/write of sequential blocks with a fairly small block size?
> Old bonnie docs say that the block size is always 8K. One reason I
> don't like bonnie. Clustering should work fairly normally with that.
> Anything with random seeks would break clustering.

Here is the relevant part from bonnie++'s source (in C++):
---------------
  bufindex = 0;
  for(words = 0; words < num_chunks; words++)
  { // for each chunk in the file
    dur.start();
    if (file.read_block(PVOID(buf)) == -1)
      return 1;
    bufindex = bufindex % globals.io_chunk_size();
    buf[bufindex]++;
    bufindex++;
    if (file.seek(-1, SEEK_CUR) == -1)
      return 1;
    if (file.write_block(PVOID(buf)) == -1)
      return io_error("re write(2)");
-----------

globals.io_chunk_size() is 8k (by default and in all of my tests), and
bonnie++ makes sure that buf is page-aligned.

So you are correct: bonnie++ re-reads the file that was created
previously in the "Writing intelligently..." phase in blocks, modifies
one byte in each block, and writes the block back. Something in this
specific workload is triggering the huge buildup of numdirtybuffers when
write-clustering is enabled.

I am now looking at vfs_cluster.c to see whether I can find which part is
responsible for letting numdirtybuffers rise without bounds, and why only
*re*-writing a file causes problems, not the initial writing. Any
suggestions on where to start looking are very welcome.
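For readers without the bonnie++ sources at hand, the loop above can be
mimicked in a few lines of Python (a sketch only; the chunk count and
temp-file handling are mine). Note that it drives both the read and the
write through the same file descriptor, which is exactly the pattern
under discussion.

```python
import os
import tempfile

CHUNK = 8192  # bonnie++'s io_chunk_size() default

def rewrite(path):
    """Read each CHUNK of `path`, bump one byte, seek back and rewrite
    the chunk in place -- the same pattern as bonnie++'s
    "Rewriting..." phase."""
    with open(path, "r+b") as f:
        index = 0
        while True:
            buf = bytearray(f.read(CHUNK))
            if not buf:
                break
            buf[index % CHUNK] = (buf[index % CHUNK] + 1) % 256
            index += 1
            f.seek(-len(buf), os.SEEK_CUR)  # seek back over the chunk...
            f.write(buf)                    # ...and overwrite it

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(bytes(4 * CHUNK))  # 4 zero-filled chunks
rewrite(path)
with open(path, "rb") as f:
    data = f.read()
os.unlink(path)
print(sum(data))  # 4: exactly one byte was incremented per chunk
```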
Klaus On Fri, 31 May 2013, Klaus Weber wrote: > Sorry for the late reply, testing took longer than expected. This is very interesting. It will take me a long time to complete testing and replying too. Maybe more later. > (I have combined your replies from separate mails into one, and > reordered some of the text.) Very readable now. > ... > I have now repeated the tests with several variations of block- and > fragment sizes. In all cases, I did two tests: > > 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000 > 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1 > bonnie++ -s 64g -n 0 -f -D -d /mnt/t2 > > The dd is simply to give a rough idea of the performance impact of the > fs parameters, with the two bonnie++ processes, I was mainly > interested in performance and hangs when both bonnie++ processes are > in their "Rewriting" phase. I have also tested variations where the > block:fragment ratio does not follow the 8:1 recommendation. > > 64/8k, kernel unpatched: > dd: 1218155712 bytes/sec > bonnie++: around 300 MB/sec, then drops to 0 and system hangs My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified), there is at most a 50% drop with 2 bonnies and no tuning, and no hangs. Except it hung in my debugging code for understanding a related problem. I had this code enabled for a month or 2 and had forgotten about it. Only 2 bonnies caused enough load to reach it. > 32/4k, kernel unpatched: > dd: 844188881 bytes/sec > bonnie++: jumps between 25 and 900 MB/sec, no hang > > 16/2k, kernel unpatched: > dd: 517996142 bytes/sec > bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of > 400-650 MB/sec, no hang The non-hangs might be due to BKVASIZE not being a problem. > So while a file system created with the current (32/4k) or old (16/2k) > defaults does prevent the hangs, it also reduces the sequential write > performance to 70% and 43% of an 64/8k fs. 
I think bufdaemon and other write-pressure handling code just don't work
as well as the upper layers for clustering and ordering the writes. They
are currently broken, as shown by the hangs. Clustering shouldn't be
generating so much pressure. I can explain why it does (see below). But
random i/o can easily generate the same pressure, so bufdaemon etc. need
to handle it. Also, pressure should be the usual case for async mounts,
and bufdaemon etc. should be able to do clustering and ordering _better_
than the upper layers, since they have more blocks to work with.
cluster_write() intentionally leaves as much as possible to bufdaemon in
order to potentially optimize it.

> In all cases, the vfs.numdirtybuffers count remained fairly small as
> long as the bonnie++ processes were writing the testfiles. It rose to
> vfs.hidirtybuffers (slower with only one process in "Rewriting", much
> faster when both processes are rewriting).

The slow buildup is caused by fragmentation. If the files happen to be
laid out contiguously, then cluster_write() acts like bawrite(), but
better. With your 64K-blocks and FreeBSD's 128K max cluster size, it
normally combines just 2 64K-blocks to create a 128K-cluster, and does 1
write instead of 2. After this write, both methods leave the same number
of dirty buffers (none for data, but a couple for metadata).

In more detail, the write pattern with bonnie's 8K-blocks is:

write; bdwrite
...
write; bdwrite                  (7 bdwrites to the same buffer)
write; cluster_write or bawrite (buffer is now full, and dirty iff
                                 clusterw)
write; bdwrite
...
write; bdwrite                  (7 bdwrites to the next buffer)
write; cluster_write or bawrite (2 buffers are now full and both are
                                 clean)

However, with fragmentation, cluster_write() "leaks" a dirty buffer
although it is full, while bawrite() just writes all full buffers. The
leakage is because cluster_write() wants physical contiguity. This is
too much to ask for. The leak may be fast if the file system is very
fragmented.
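The 8K-into-64K pattern above can be counted mechanically. This is a toy
model, not kernel code; the function name is mine, and the sizes are the
ones used in this thread.

```python
def buffer_ops(n_app_writes, app_io=8192, fs_block=65536):
    """For sequential app_io-sized writes into fs_block-sized buffers,
    return (delayed bdwrites, full-buffer pushes): 8K writes fill a
    64K buffer after 8 writes, so 7 of every 8 are plain bdwrite()s and
    the 8th triggers cluster_write() or bawrite()."""
    per_block = fs_block // app_io
    pushes = n_app_writes // per_block
    delayed = n_app_writes - pushes
    return delayed, pushes

print(buffer_ops(16))  # (14, 2): 7 bdwrites + 1 push per 64K buffer
```

With bawrite() the pushed buffers are always written; the "leak" Bruce
describes is that under fragmentation cluster_write() keeps some of those
full buffers dirty instead.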
I happened to test on a file system which was almost full and thus very
fragmented. Two test files of size 256M each had 452 extents (counting
indirect blocks as an extent, although they might be contiguous). The
buildup with my test program (which tries to produce bonnie's i/o
pattern) was often to the same magic number (something like 768, a little
higher than the number of extents). This much leakage was enough to make
clustering a pessimization, although it worked to combine 4 blocks in
most cases (my fs block size is 16K and my max cluster size is 64K). The
disks are old, so 16K is as good an i/o size as 64K for them. But smaller
sizes wouldn't be.

>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem. Try
>> reducing the dirty buffer watermarks.
>
> I have tried to tune system parameters as per your advice, in an
> attempt to get a 64/8k fs running stable and with reasonable write
> performance. (dd results omitted for brevity, all normal for 64/8k fs)

I got mostly worse behaviour by tuning things related to dirtybuffers.

> [... You managed to fix the hangs, at a cost of too much performance. ]

> By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> I found that
> (vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly
> less than vfs.hidirtybuffers for reasonable performance without hangs.
>
> vfs.numdirtybuffers rises to
> (vfs.dirtybufthresh) * (number of bonnie++ processes)
> and as long as this is below vfs.hidirtybuffers, the system will not
> hang.

I wonder why it is linear in the number of processes. I saw indications
of similar behaviour (didn't test extensively). 1 bonnie created about
2/3 of vfs.hidirtybuffers and 2 bonnies saturated at vfs.hidirtybuffers.
This is with vfs.hidirtybuffers much smaller than yours.

> ...
> I was not sure whether to disable the "bawrite(bp);" in the else part
> as well.
Here is what I used for the next test (in ffs_write): > > } else if (vm_page_count_severe() || > buf_dirty_count_severe() || > (ioflag & IO_ASYNC)) { > bp->b_flags |= B_CLUSTEROK; > bawrite(bp); Note that this bawrite() is used when hidirtybuffers is reached. This is OK for this buffer if this buffer is large, but it doesn't stop the flood -- the system stays saturated at hidirtybuffs until bonnie stops writing. This has a bad effect on other parts of the system, e.g., if they are writing small buffers. > /* KWKWKW } else if (xfersize + blkoffset == fs->fs_bsize) { > if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) { > bp->b_flags |= B_CLUSTEROK; > cluster_write(vp, bp, ip->i_size, seqcount); > } else { > bawrite(bp); > } KWKWKW */ OK. I tried a bdwrite() for the async mount case. Async mounts should give delayed writes always, but it is a bugfeature that they give delayed writes for critical metadata but normally async() writes for data (via cluster_write()). Using bdwrite here let me control the behaviour using a combination of mount flags. It became clearer that the bugfeature is not just a bug. Without cluster_write() forcing out writes for async mounts, the write pressure is much the same as from bonnie rewrite. > During the dd tests, iostat shows a weird, sawtooth-like behavior: > 64.00 26730 1670.61 0 0 12 3 85 > ... Some bad interaction of bufdaemon with the writers. I believe there are locking problems, but didn't know anything about them until recently. FreeBSD-5.2 is simpler and seems to have fewer of them. I got a deadlock in it by putting a sleep waiting for the write pressure to clear in the above. This gave deadlock, since bufdaemon is apparently prevented from doing anything while we hold a lock. > ... > After reverting the source change, I have decided to try mounting the > file system with "-o noclusterr,noclusterw", and re-test. This is > equivalent to disabling only the if-part of the expression in the > source snippet above. Good idea. 
> dd: 1206199580 bytes/sec > bonnie++: 550-700 MB/sec, no hang > > During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers > and dirtybufthresh are at their defaults: > vfs.dirtybufthresh: 46925 > vfs.hidirtybuffers: 52139 > vfs.lodirtybuffers: 26069 > vfs.numdirtybuffers: 15 > > So it looks like you were spot-on by suspecting cluster_write(). Not for the reasons that I suspected it! > Further tests confirmed that "-o noclusterw" is sufficient to prevent > the hangs and provide good performance. "-o noclusterr" on its own > makes no difference; the system will hang. I thought -noclusterw would be worse until now... > I have also tested with BKVASIZE set to 65536. As you explained, this > reduced the number of buffers: > vfs.dirtybufthresh: 11753 > vfs.hidirtybuffers: 13059 > vfs.lodirtybuffers: 6529 > > dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats > with 2 process in "Rewriting..." jump between 70 and 800 MB/sec and > numdirtybuffers reaches max: > > vfs.numdirtybuffers: 13059 > > Even though numdirtybuffers reaches hidirtybuffers, the system does > not hang, but performance is not very good. ... I hoped BKVASIZE would be the problem, but it is at most a secondary problem. > Here is the relevant part from bonnie++'s source (in C++): > ... > So you are correct: bonnie++ re-reads the file that was created > previously in the "Writing intelligently..." phase in blocks, modifies > one byte in the block, and writes the block back. > > Something in this specific workload is triggering the huge buildup of > numdirtybuffers when write-clustering is enabled. I can explain this. Readers and writers share the offset for the sequential heuristic (it is per-open-file), so cluster_write() cannot tell that bonnie's writes are sequential. 
The sequential_heuristic() function sees the offset moving back and
forth, which normally means random access, so it tells cluster_write()
that the access is random, although bonnie seeks back after each read()
so that all the writes() are sequential. cluster_write() rarely if ever
pushes out random writes, so the write pressure builds up to a saturation
value almost instantly. Then all subsequent writes work even more poorly
due to the pressure. Truly random writes would give a similar buildup of
pressure.

> I am now looking at vfs_cluster.c to see whether I can find which part
> is responsible for letting numdirtybuffers rise without bounds and
> why only *re*-writing a file causes problems, not the initial
> writing. Any suggestions on where to start looking are very welcome.

It is very complicated, but it was easy to find its comments saying that
it tries not to force out the writes for non-sequential accesses. I am
currently trying the following workarounds:

% diff -u2 vfs_cluster.c~ vfs_cluster.c
% --- vfs_cluster.c~	Tue Apr  6 20:11:55 2004
% +++ vfs_cluster.c	Tue Jun  4 04:53:52 2013
% @@ -593,4 +714,6 @@
%  }
%
% +int buf_dirty_count_large(void);
% +
%  /*
%   * Do clustered write for FFS.

This is like buf_dirty_count_severe() except it has a lower threshold.
It doesn't really help.

% @@ -630,4 +753,44 @@
%  	vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
%
% +	/*
% +	 * XXX the 13 year old changes in the algorithm to use seqcount
% +	 * are too aggressive. Avoid them by faking sequential access if
% +	 * we are under write pressure. This causes us to write out
% +	 * clusters in more cases (in particular, when clusters are
% +	 * complete).

This doesn't really help. It's too late to start faking after pressure
has built up.

% +	 *
% +	 * XXX we should probably use a more conservative condition for
% +	 * write pressure than is currently available. Try not to leave
% +	 * anything to bufdaemon, since it can too easily fail to clear
% +	 * write pressure because the processes creating the write
% +	 * pressure have the vnode locked. E.g., use lodirtybuffers
% +	 * instead of hidirtybuffers in buf_dirty_count_severe(), and
% +	 * also check dirtybufthresh.

I use lodirtybuffers as a default. This is still too high, and almost
any value reduces performance.

% +	 *
% +	 * XXX seqcount itself is broken. It gets cleared by reads of
% +	 * the same open file, even if the application seeks back after
% +	 * reading so that writes are perfectly sequential. This case
% +	 * is probably unimportant, but bonnie does it.

As described differently above.

% +	 *
% +	 * XXX I think that if bonnie didn't seek back then the write
% +	 * pressure would be the same (provided the bonnie write size is
% +	 * smaller than the fs block size). Then the writes wouldn't be
% +	 * sequential, so a fixed seqcount wouldn't help force them out.
% +	 * But they need to be forced out under write pressure, and I
% +	 * don't see anything to do it. Check this.
% +	 *
% +	 * XXX the writing system shouldn't collapse when we get this
% +	 * slightly wrong. Why doesn't bdwrite() force things when
% +	 * we don't? It has some code to do so. It is still better
% +	 * for us to force out full clusters than to leave it to
% +	 * the writing system to possibly force out sub-buffers.
% +	 *
% +	 * XXX try killing async under the same conditions.

This is too late to reduce pressure from async too.

% +	 */
% +	if (vm_page_count_severe() || buf_dirty_count_large()) {
% +		seqcount = 2;
% +		async = 0;
% +	}
% +
%  	if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
%  	    (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {

% @@ -719,4 +882,5 @@
%  	} else {	/* Wait for rest of cluster */
%  		vp->v_cstart = lbn;
% +		/* XXX problem if we're async and orig maxclen = 0. */
%  		bdwrite(bp);
%  	}

This case for async breaks the bugfeature of not honoring it. The
bugfeature often prevents buildup of excessive pressure, but some cases
arrive here and then it doesn't. This is the only old change I have in
cluster_write().

% @@ -726,8 +890,13 @@
%  	 * are operating sequentially, otherwise let the buf or
%  	 * update daemon handle it.
% +	 *
% +	 * Algorithm changeback: ignore seqcount here, at least for
% +	 * now, to work around readers breaking it for writers. It
% +	 * is too late to start ignoring after write pressure builds
% +	 * up, since not writing out here is the main source of the
% +	 * buildup.
%  	 */
%  	bdwrite(bp);
% -	if (seqcount > 1)
% -		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart,
% -		    vp->v_clen + 1);
% +	cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
%  	vp->v_clen = 0;
%  	vp->v_cstart = lbn + 1;

This works fairly well. Changing the algorithm back in all cases reduces
performance, but here we have usually finished with the buffer. Even if
seqcount is fixed so that it isn't clobbered by reads, it might be right
to ignore it here, so as to reduce write pressure, until write pressure
is handled better.
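The shared-offset problem behind all of this can be illustrated with a
toy heuristic. The sketch below is a simplification in Python, not the
kernel's sequential_heuristic(); it only checks whether each access
begins where the previous one ended, which is roughly what a
per-open-file heuristic sees.

```python
def looks_sequential(offsets, iosize):
    """True if every access starts where the previous one ended --
    roughly what a per-open-file sequential heuristic observes."""
    return all(b == a + iosize for a, b in zip(offsets, offsets[1:]))

IO = 8192
# One descriptor: read at 0, seek back, write at 0, read at 8K, ...
# The shared offset visits each position twice.
one_fd = [off for blk in range(4) for off in (blk * IO, blk * IO)]
# Two descriptors: the writer's own offsets advance monotonically.
two_fd = [blk * IO for blk in range(4)]

print(looks_sequential(one_fd, IO))  # False -> writes treated as random
print(looks_sequential(two_fd, IO))  # True  -> writes clustered normally
```

This is why the same workload behaves so differently through one
descriptor versus two, as the tests later in the thread confirm.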
I used the following program to generate i/o like bonnie's:

%%%
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define IOSIZE	8192

static char buf[IOSIZE] __aligned(4096);

int
main(int argc, char **argv)
{
	ssize_t n;
	int ifd, ofd;

	if (argc != 2 && argc != 3)
		errx(1, "usage: ./prog file1 [file2]");
	ifd = open(argv[1], O_RDWR);
	if (ifd < 0)
		err(1, "open");
	if (argc == 3)
		ofd = open(argv[2], O_RDWR);
	else
		ofd = ifd;
	if (ofd < 0)		/* fixed: was a duplicate check of ifd */
		err(1, "open");
	for (;;) {
		n = read(ifd, buf, IOSIZE);
		if (n == 0) {
			printf("finished: ioffset %#jx; ooffset = %#jx\n",
			    (intmax_t)lseek(ifd, 0, SEEK_CUR),
			    (intmax_t)lseek(ofd, 0, SEEK_CUR));
			exit(0);
		}
		if (n < 0)
			err(1, "read error");
		if (n != IOSIZE)
			errx(1, "short read %d", (int)n);
		if (ofd == ifd)
			lseek(ofd, -IOSIZE, SEEK_CUR);
		n = write(ofd, buf, IOSIZE);
		if (n < 0)
			err(1, "write error");
		if (n != IOSIZE)
			errx(1, "short write %d", (int)n);
	}
}
%%%

Usage: create a test file zz of a suitable size (the program normally
doesn't extend it); then "./prog zz" does read/write like bonnie;
"./prog zz zz" does the same with independent file positions.

Bruce

On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> On Fri, 31 May 2013, Klaus Weber wrote:

Again, I have reordered some sections of your mail in my reply. I have
also removed freebsd-bugs from Cc: - it seems that replies to PRs get
forwarded there automatically, and judging from the web interface to the
mailing list archives, the previous mails were all duplicated.

> > So you are correct: bonnie++ re-reads the file that was created
> > previously in the "Writing intelligently..." phase in blocks,
> > modifies one byte in the block, and writes the block back.
> > Something in this specific workload is triggering the huge buildup
> > of numdirtybuffers when write-clustering is enabled.
>
> I can explain this. Readers and writers share the offset for the
> sequential heuristic (it is per-open-file), so cluster_write() cannot
> tell that bonnie's writes are sequential. The sequential_heuristic()
> function sees the offset moving back and forth, which normally means
> random access, so it tells cluster_write() that the access is random,
> although bonnie seeks back after each read() so that all the
> writes() are sequential.

Indeed. Using the test program you provided, I can confirm that reading
and writing to the same file via _different_ file descriptors works
without hangs and with good performance (similar to "-o noclusterw", see
below for details). The vfs.numdirtybuffers count remained low, and
write rates were pretty stable. Since your program is more flexible than
bonnie++, I have done all tests in this mail with it.

> cluster_write() rarely if ever pushes out
> random writes, so the write pressure builds up to a saturation value
> almost instantly. Then all subsequent writes work even more poorly
> due to the pressure. Truly random writes would give similar buildup
> of pressure.

I have briefly tested and confirmed this as well (see below). Mounting a
file system with "-o async" shows the "quickly rising
vfs.numdirtybuffers" symptom already during writing (not re-writing) a
file, e.g. with dd. I could not get the system to hang, but I did not
test this very thoroughly.

> > So while a file system created with the current (32/4k) or old
> > (16/2k) defaults does prevent the hangs, it also reduces the
> > sequential write performance to 70% and 43% of a 64/8k fs.
>
> I think bufdaemon and other write-pressure handling code just don't
> work as well as upper layers for clustering and ordering the writes.
> They are currently broken, as shown by the hangs. Clustering shouldn't
> be generating so much pressure.
> I can explain why it does (see below).
> But random i/o can easily generate the same pressure, so bufdaemon etc.
> need to handle it. Also, pressure should be the usual case for async
> mounts, and bufdaemon etc. should be able to do clustering and ordering
> _better_ than upper layers, since they have more blocks to work with.
> cluster_write() intentionally leaves as much as possible to bufdaemon
> in order to potentially optimize it.
>
> > In all cases, the vfs.numdirtybuffers count remained fairly small as
> > long as the bonnie++ processes were writing the testfiles. It rose
> > to vfs.hidirtybuffers (slower with only one process in "Rewriting",
> > much faster when both processes are rewriting).
>
> The slow buildup is caused by fragmentation. If the files happen to
> be laid out contiguously, then cluster_write() acts like bawrite(),
> but better. With your 64K-blocks and FreeBSD's 128K max cluster size,
> it normally combines just 2 64K-blocks to create a 128K-cluster, and
> does 1 write instead of 2. After this write, both methods leave the
> same number of dirty buffers (none for data, but a couple for
> metadata).
>
> [Analysis of bonnie++'s write pattern]
>
> However, with fragmentation, cluster_write() "leaks" a dirty buffer
> although it is full, while bawrite() just writes all full buffers. The
> leakage is because cluster_write() wants physical contiguity. This is
> too much to ask for. The leak may be fast if the file system is very
> fragmented.

In all of my (older) tests with bonnie++, the file system was newfs'ed
before mounting it, so there was no fragmentation for me.

> > I have tried to tune system parameters as per your advice, in an
> > attempt to get a 64/8k fs running stable and with reasonable write
> > performance. (dd results omitted for brevity, all normal for 64/8k
> > fs)
>
> I got mostly worse behaviour by tuning things related to dirtybuffers.

Me too, performance-wise.
The only exception was reducing vfs.dirtybufthresh so that
vfs.numdirtybuffers no longer reaches vfs.hidirtybuffers; this is needed
to prevent the hangs. The exact value does not seem to matter much.

> > [... You managed to fix the hangs, at a cost of too much
> > performance. ]
>
> > By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> > I found that
> > (vfs.dirtybufthresh) * (number of bonnie++ processes) must be
> > slightly less than vfs.hidirtybuffers for reasonable performance
> > without hangs.
> >
> > vfs.numdirtybuffers rises to
> > (vfs.dirtybufthresh) * (number of bonnie++ processes)
> > and as long as this is below vfs.hidirtybuffers, the system will not
> > hang.

Actually, further testing revealed that it is not linear with the number
of processes, but with the number of files being rewritten. With
bonnie++, you cannot make this distinction; two bonnie++ processes
cannot re-write the same file. However, with the test program you
provided, you can actually re-write the same file from two (or more)
processes.

Testing this shows that the correct formula is actually
"vfs.numdirtybuffers rises to
(vfs.dirtybufthresh) * (number of files being re-written),
and as long as this is below vfs.hidirtybuffers, the system will not
hang." So, any number of processes can re-write the same file, and the
number of processes has no effect on vfs.numdirtybuffers.

> I wonder why it is linear in the number of processes. I saw
> indications of similar behaviour (didn't test extensively). 1 bonnie
> created about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
> vfs.hidirtybuffers. This is with vfs.hidirtybuffers much smaller than
> yours.

Does this correlate with vfs.dirtybufthresh on your system as well?
(i.e. is vfs.dirtybufthresh about 2/3 of your vfs.hidirtybuffers?)

> I tried a bdwrite() for the async mount case.
> Async mounts should give
> delayed writes always, but it is a bugfeature that they give delayed
> writes for critical metadata but normally async writes for data (via
> cluster_write()). Using bdwrite here let me control the behaviour
> using a combination of mount flags. It became clearer that the
> bugfeature is not just a bug. Without cluster_write() forcing out
> writes for async mounts, the write pressure is much the same as from
> bonnie rewrite.

I have tested this as well, and I can confirm that with a file system
mounted "-o async" I see the problematic behavior with "normal" writes
(not re-writes) as well: while writing a large file (e.g. via dd),
vfs.numdirtybuffers rises quickly to vfs.dirtybufthresh. On non-async
file systems, I could only provoke this via re-writing.

> > After reverting the source change, I have decided to try mounting
> > the file system with "-o noclusterr,noclusterw", and re-test.
> [...]
> > Further tests confirmed that "-o noclusterw" is sufficient to
> > prevent the hangs and provide good performance.
>
> I thought -noclusterw would be worse until now...

I wanted to check the performance effects of write clustering more
thoroughly, so I have tested 3 variants (with the test program you
provided, using a 64 GB file (2x RAM)):

1) Re-writing a single file with two file descriptors, file system
   mounted with default options.

2) Re-writing a single file with two file descriptors, file system
   mounted with "-o noclusterw".

3) Re-writing a single file via a single file descriptor, file system
   mounted with "-o noclusterw" (required in this case to prevent the
   system hanging).
First, I have tested on a logical volume on the RAID array (3 test runs
each):

1) (2 desc, w/ clustering)
0.669u 97.119s 3:23.34 48.0% 5+187k 524441+524311io 0pf+0w
0.472u 95.501s 3:22.40 47.4% 5+188k 523095+524311io 0pf+0w
0.606u 93.263s 3:21.13 46.6% 5+188k 522999+524311io 0pf+0w

2) (2 desc, -o noclusterw)
0.524u 94.180s 3:19.76 47.4% 5+189k 524442+1048576io 0pf+0w
0.622u 95.200s 3:23.30 47.1% 5+187k 523090+1048576io 0pf+0w
0.475u 94.232s 3:19.96 47.3% 5+186k 522767+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
0.922u 95.917s 3:33.74 45.3% 5+187k 524442+1048576io 0pf+0w
0.679u 95.392s 3:32.62 45.1% 5+187k 522256+1048576io 0pf+0w
0.976u 93.639s 3:33.68 44.2% 5+187k 521902+1048576io 0pf+0w

As you can see, the performance differences are very small. If you
consider the differences significant, then "-o noclusterw" is actually a
tiny bit faster than clustering. You can also see clustering at work:
the number of writes doubles for cases 2) and 3).

I have then repeated the tests on a single disk, connected directly to a
mainboard SATA port, to take the RAID controller out of the equation.
(With its CPU and on-board RAM, it certainly has the ability to cluster
and re-order writes after the kernel sends them.) Since the tests take a
long time to complete, I have only done a single test run for each case.

1) (2 desc, w/ clustering)
3.497u 575.616s 24:39.49 39.1% 5+187k 524445+524314io 0pf+0w

2) (2 desc, -o noclusterw)
5.960u 876.735s 33:35.75 43.7% 5+187k 524445+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
7.014u 741.382s 29:56.98 41.6% 5+188k 524445+1048576io 0pf+0w

Here, clustering does seem to have a positive effect, and the
differences are more pronounced. So it really seems that clustering does
provide performance benefits, but the RAID controller is able to make up
for the lack of clustering (either because clustering is disabled, or
because it does not work effectively due to interspersed reads and seeks
on the same file descriptor).
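The i/o counts in the time(1) output above check out arithmetically.
This is a sketch using the sizes from the test setup (64 GB file, 64K fs
blocks, 128K max cluster size); the small excess over the computed write
counts is presumably metadata.

```python
FS_BLOCK = 64 * 1024
MAX_CLUSTER = 128 * 1024
FILE_SIZE = 64 * 1024 ** 3   # the 64 GB test file

blocks = FILE_SIZE // FS_BLOCK
print(blocks)                    # 1048576 writes with -o noclusterw
print(FILE_SIZE // MAX_CLUSTER)  # 524288 writes when fully clustered
```

These match the observed 1048576io (unclustered) and ~524311io
(clustered, two 64K blocks per 128K write) almost exactly.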
> >I am now looking at vfs_cluster.c to see whether I can find which part > >is responsible for letting numdirtybuffers raise without bounds and > >why only *re* writing a file causes problems, not the initial > >writing. Any suggestions on where to start looking are very welcome. > > It is very complicated, but it was easy to find its comments saying that > it tries not to force out the writes for non-seqential accesses. I > am currently trying the following workarounds: I have decided to start testing with only a single change from the list of changes you provided: > % diff -u2 vfs_cluster.c~ vfs_cluster.c > % @@ -726,8 +890,13 @@ > % * are operating sequentially, otherwise let the buf or > % * update daemon handle it. > % + * > % + * Algorithm changeback: ignore seqcount here, at least for > % + * now, to work around readers breaking it for writers. It > % + * is too late to start ignoring after write pressure builds > % + * up, since not writing out here is the main source of the > % + * buildup. 
> %  	 */
> %  	bdwrite(bp);
> % -	if (seqcount > 1)
> % -		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> % +	cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> %  	vp->v_clen = 0;
> %  	vp->v_cstart = lbn + 1;

And sure enough, a kernel with only this one-line change[*] is able to handle reading and re-writing a file through a single file descriptor just fine: good performance, no hangs, vfs.numdirtybuffers remains low:

4) (1 desc, cluster_wbuild_wb unconditionally)
0.864u 96.145s 3:32.34 45.6% 5+188k 524441+524265io 0pf+0w
0.976u 94.081s 3:29.70 45.3% 5+187k 489756+524265io 0pf+0w
0.903u 95.676s 3:33.38 45.2% 5+188k 523124+524265io 0pf+0w

Same test on a single disk, as above:

4) (1 desc, cluster_wbuild_wb unconditionally)
8.399u 822.055s 33:01.01 41.9% 5+187k 524445+524262io 0pf+0w

Unfortunately, for the single-disk case, the performance is the same as for the "-o noclusterw" case (even though the writes _are_ clustered now, see the number of write operations).

[*] I still need to confirm that this is really the only change in the kernel, but I'm running out of time now. I will test this next weekend.

> This works fairly well. Changing the algorithm back in all cases reduces
> performance, but here we have usually finished with the buffer. Even if
> seqcount is fixed so that it isn't clobbered by reads, it might be right
> to ignore it here, so as to reduce write pressure, until write pressure is
> handled better.

For me, this change makes the difference between a working system with good performance and a system that limps along and occasionally hangs completely, for this specific workload. Thank you for this!

Next weekend, I want to test some scenarios with and without this change. Most importantly, I want to investigate:

a) behavior with random I/O (any suggestions on a benchmark or test program that can generate "representative" random I/O? IOzone?)
b) behavior with sequential writes

c) behavior with re-writing files, both via one and two file descriptors

d) maybe test the other changes to vfs_cluster.c you provided, to see whether they make a difference.

e) investigate the role of vfs.dirtybufthresh. When vfs.numdirtybuffers reaches this number, the system switches to a mode where vfs.numdirtybuffers no longer increases and seems to handle the load just fine. I want to understand why the system sees no need to limit numdirtybuffers earlier. Maybe this can be used to improve re-write performance to the "2 desc, with clustering" case.

f) I have experienced system freezes when unmounting my test file system after running the respective tests. I'm not sure whether this is related to the current problem, or whether it is caused by some test patch that I accidentally left in the code. I need to check whether this is reproducible, and whether there is a specific workload that causes it.

Klaus

On Mon, 10 Jun 2013, Klaus Weber wrote:
> On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
>> On Fri, 31 May 2013, Klaus Weber wrote:

This thread is getting very long, and I will only summarize a couple of things that I found last week here. Maybe more later.

o Everything seems to be working as well as intended (not very well) except in bufdaemon and friends. Perhaps it is already fixed there. I forgot to check which version of FreeBSD you are using. You may be missing some important fixes. There were some by kib@ a few months ago, and some by jeff@ after this thread started. I don't run any version of FreeBSD new enough to have these, and the version that I run also doesn't seem to have any serious bugs in bufdaemon. It just works mediocrely.

o Writing in blocks of size less than the fs block size, as bonnie normally does, gives much the same rewriting effect as bonnie does explicitly, because the system is forced to read each block before doing a partial write to it.
This at best doubles the amount of i/o and halves the throughput of the writes.

o There are some minor bugs in the read-before-write system code, and some differences in the "much the same" that may be important in some cases:

- when bonnie does the read-before-write, the system normally uses cluster_read() on the read descriptor and thus uses the system's idea of sequentiality on the read descriptor. For both ffs and msdosfs, this normally results in reading a small cluster (up to the end of the current read) followed by async read-ahead of a larger cluster or 2. The separate clusters improve latency but reduce performance. scottl@ recently committed a sysctl vfs.read_min for avoiding the earlier splitting. Using it made some interesting but ultimately unimportant differences to the 2-bonnie problem.

- when the system does the read-before-write, ffs normally uses cluster_read() on the write descriptor and thus uses the system's idea of sequentiality on the write descriptor. ffs doesn't know the correct amount to read in this case, and it always asks for MAXBSIZE, which is both too small and too large. This value is the amount that should be read synchronously. It is too large since the normal amount is the application's block size, which is normally smaller, and it is too small since MAXBSIZE is only half of the max cluster size. The correct tradeoff of latency vs throughput is even less clear than for a user read, and further off from being dynamic. msdosfs doesn't even use cluster_read() for this (my bad). It uses plain bread(). This gives very low performance when the block size is small. So msdosfs worked much better in the 2-bonnie benchmark than for rewrites generated by dd just writing with a small block size and conv=notrunc. After fixing this, msdosfs worked slightly better than ffs in all cases.

- whoever does the read-before-write, cluster reading tends to generate a bad i/o pattern.
I saw patterns like the following (on ~5.2 where the max cluster size is only 64K, after arranging to mostly use this size):

file1: read 64K offset 0
file1: read ahead 64K offset 64K
file2: read 64K offset 0
file1: read ahead 64K offset 64K
file1: write 64K offset 0
file1: read 64K offset 128K
file1: read ahead 64K offset 192K
file2: write 64K offset 0

The 2 files make the disk seek a lot, and the read-and-read-ahead gives even more seeks to get back to the write position. My drives are old and have only about 2MB of cache. Seeks with patterns like the above are apparently just large enough to break the drives' caching. OTOH, if I use your trick of mounting with -noclusterw, the seeks are reduced significantly and my throughput increases by almost a factor of 2, even though this gives writes of only 16K. Apparently the seeks are reduced just enough for the drives' caches to work well. I think the same happens for you. Your i/o system is better, but it only takes a couple of bonnies and perhaps the read pointers getting even further ahead of the write pointers to defeat the drive's caching. Small timing differences probably allow the difference to build up. Mounting with -noclusterw also gives some synchronization that will prevent this buildup.

- when the system does the read-before-write, the sequential heuristic isn't necessarily clobbered, but it turns out that the clobbering gives the best possible behaviour, except for limitations and bugs in bufdaemon!...

> [... good stuff clipped]
> So it really seems that clustering does provide performance benefits,
> but the RAID controller seems to be able to make up for the lack
> of clustering (either because clustering is disabled, or because it
> does not work effectively due to interspersed reads and seeks on the
> same file descriptor).
Yes, the seek pattern caused by async-but-not-long-delayed writes (whether done by cluster_write() a bit later or bawrite() directly) combined with reading far ahead (whether done explicitly or implicitly) is very bad even for 1 file, but can often be compensated for by caching in the drives. With 2 files or random writes on 1 file it is much worse, but apparently mounting with -noclusterw limits it enough for the drives to compensate in the case of 2 bonnies. I think the best we can do in general is delay writes as long as possible and then schedule them perfectly. But scheduling them perfectly is difficult and only happens accidentally.

>>> I am now looking at vfs_cluster.c to see whether I can find which part
>>> is responsible for letting numdirtybuffers raise without bounds and
>>> why only *re* writing a file causes problems, not the initial
>>> writing. Any suggestions on where to start looking are very welcome.
>>
>> It is very complicated, but it was easy to find its comments saying that
>> it tries not to force out the writes for non-sequential accesses. I
>> am currently trying the following workarounds:
>
> I have decided to start testing with only a single change from the
> list of changes you provided:
>
>> % diff -u2 vfs_cluster.c~ vfs_cluster.c
>> % @@ -726,8 +890,13 @@
>> %  	 * are operating sequentially, otherwise let the buf or
>> %  	 * update daemon handle it.
>> % +	 *
>> % +	 * Algorithm changeback: ignore seqcount here, at least for
>> % +	 * now, to work around readers breaking it for writers.  It
>> % +	 * is too late to start ignoring after write pressure builds
>> % +	 * up, since not writing out here is the main source of the
>> % +	 * buildup.
>> %  	 */
>> %  	bdwrite(bp);
>> % -	if (seqcount > 1)
>> % -		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> % +	cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> %  	vp->v_clen = 0;
>> %  	vp->v_cstart = lbn + 1;
>
> And sure enough, a kernel with only this one line change[*] is able to
> handle reading and re-writing a file through a single file descriptor
> just fine, good performance, no hangs, vfs.numdirtybuffers remains
> low:

After more testing, I found that this was almost perfectly backwards for my hardware! I think for your hardware it allows the drives to compensate, much like with -noclusterw but with a slightly improved throughput due to the larger writes. But with my drives, it mostly just gives more seeks.

After changing this back and being more careful with the comparisons, I found that the best results are obtained (in ~5.2) by letting numdirtybuffers build up. The breakage of the sequential heuristic causes the above to never force out the cluster immediately for the 2-bonnie case. I get similar behaviour by always using delayed writes in ffs_write(). This might depend on setting B_CLUSTEROK in more cases, so that the clustering always gets done later.

Typical throughputs for me:

- my drives can do 55MB/sec max and get 48 for writing 1 file with large blocks using dd

- 48 drops to half of 20-24 with read-before-write for 1 file. That's a 4-fold reduction. One half is for the doubled i/o and the other half is for the seeks.

- half of 20-24 drops to half of 10-12 with 2 files and read-before-write of each, in the best case. That's an 8-fold reduction. Another factor of 2 is apparently lost to more seeks.

- half of 10-12 drops to half of 5-6, as in the previous point but in the worst case. That's a 16-fold reduction. The worst case is with my modification above. It maximizes the seeks. My original idea for a fix (in the above diff) gave this case.
It gave almost perfect clustering and almost no buildup of numdirtybuffers, but throughput was still worst. (My drives can do 16K blocks at full speed provided the blocks are contiguous, so they don't benefit much from clustering except for its side effect of reducing seeks to other blocks in between accessing the contiguous ones.)

Some of this typical behaviour is not very dependent on block sizes. The drives become seek-bound, and anything that doubles the number of seeks halves the throughput.

Bruce

Just a quick reply while I am away from the system.
On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
> o Everything seems to be working as well as intended (not very well)
> except in bufdaemon and friends. Perhaps it is already fixed there.
> I forgot to check which version of FreeBSD you are using. You may
> be missing some important fixes. There were some by kib@ a few
> months ago, and some by jeff@ after this thread started. I don't
> run any version of FreeBSD new enough to have these, and the version
> that I run also doesn't seem to have any serious bugs in bufdaemon.
> It just works mediocrely.
All tests were done with 9-STABLE from early May. I also tested with
10-CURRENT kernel, downloaded from an ftp mirror on May 5th,
2013. This kernel showed the same behavior, so I did all further
testing with the 9-STABLE version.
I will update my sources to 9-STABLE first thing next weekend, and
will also re-test with an up-to-date CURRENT kernel.
Klaus
On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
> On Mon, 10 Jun 2013, Klaus Weber wrote:
> > On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> >> On Fri, 31 May 2013, Klaus Weber wrote:
>
> This thread is getting very long, and I will only summarize a couple
> of things that I found last week here. Maybe more later.
>
> o Everything seems to be working as well as intended (not very well)
>   except in bufdaemon and friends. Perhaps it is already fixed there.
>   I forgot to check which version of FreeBSD you are using. You may
>   be missing some important fixes. There were some by kib@ a few
>   months ago, and some by jeff@ after this thread started.

I have now tested with both an up-to-date 9-STABLE as well as a 10-CURRENT kernel from ftp.freebsd.org (8.6.2013):

FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #27 r251798M: Sun Jun 16 16:19:18 CEST 2013 root@filepile:/usr/obj/usr/src/sys/FILEPILE amd64

FreeBSD filepile 10.0-CURRENT FreeBSD 10.0-CURRENT #0: Sat Jun 8 22:10:23 UTC 2013 root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

The bad news is that I can still reproduce the hangs reliably, with the same methods as before. I do have the feeling that the machine survives the re-write load a bit longer, but it's just a gut feeling - I have no means to quantify this.

The "dirtybuf" sysctl values have changed their default values; I believe you were involved in the discussion about how they are computed that led to changing them. They are now:

vfs.lodirtybuffers: 13251
vfs.hidirtybuffers: 26502
vfs.dirtybufthresh: 23851

The 10-CURRENT kernel shows a message on the console when the hang occurs (but see below for a twist):

g_vfs_done():da0p1[WRITE(offset=<x>, length=<y>)]error = 11

The message is printed to the console about once a second when the hang occurs, until I hard-reset the server. <x> keeps changing, <y> is mostly 65536, sometimes 131072 (like, one screenful of 64k lines, then one 128k line, etc.)
Error=11 seems to be EDEADLK, so it looks like a -CURRENT kernel does detect a deadlock situation whereas the -STABLE kernel doesn't. (I have also re-checked the -CURRENT from 5.5.2013 which I briefly tested in my initial report; it also shows these messages. My apologies for not noticing this earlier.)

And now there is an additional point where things get really weird: the "g_vfs_done()..." messages _only_ appear when the system hangs as a result of bonnie++'s rewriting load. If I repeat the same test with the test program you provided earlier, the system will still hang, but without the console output. (I have repeated the tests 10 times to make sure I wasn't just imagining this; 5 times with bonnie++ (message appears every time), 5 times with your test program (message did not appear a single time).)

I'm not really sure what difference between these two programs causes this. Bonnie++ writes the file immediately before starting to rewrite it, while your program works on pre-fabricated files - maybe this is it.

Anyway: is there a way to find out which resource(s) are involved in the EDEADLK situation? Please keep in mind that I cannot enter the debugger via the console, and panicking the machine leaves me with a dead keyboard, which seems to make any useful debugging pretty hard. (If you believe that a working debugger is required to make progress on this, let me know and I will try to get a serial console working somehow.)

>> [observations regarding seek behavior]

I did not get to do much other testing this weekend (hanging the system is pretty time-consuming, as after the hard reset I have to wait for the root partition to re-mirror). I have thought about how I could observe the seek behavior of my disks when testing, but I have come to the conclusion that I cannot do this really.
My RAID controller presents the array as a single disk to the OS, and its on-board RAM allows it to cache writes and perform read-aheads without the kernel knowing, so neither reading from nor writing to two blocks far apart will necessarily result in a physical head move.

I think I'll concentrate next on finding out where/why the hangs are occurring. Getting rid of the hangs and hard resets will hopefully speed up testing cycles.

Klaus

Sorry for the late reply, did lots of testing. I believe I have identified the root cause of the buffer depletion and subsequent i/o hangs.

Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:

	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
		return;

This short-circuits the entire flushing code if _this bufobj_ is using less than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh is 90% of all available buffers). In other words, one bufobj may hog up to 90% of all buffers.

This works well if only one file is being re-written. As documented up-thread, with just one file vfs.numdirtybuffers rises until it reaches vfs.dirtybufthresh, and then remains relatively stable at that point, and the system will not hang.

With two (or more) files being rewritten, the code above tries to let _each_ file have 90% of all available buffers, which lets vfs.numdirtybuffers hit vfs.hidirtybuffers. Even worse, with two processes competing for buffer space, often both remain under the magic 90% mark, effectively disabling all flushing of buffers entirely. At that point, the only way that buffers can be freed is flushing in different code paths, or other processes not involved in this buffer space competition releasing some.

Sooner or later, the system reaches a state where none of the bufobjs has more than 90% of the buffers, and none of the other (normal) processes can release any buffers. The only way out then is a hard reset.
With the following patch (also attached), the system performs reasonably well and does not hang:

Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c	(revision 253339)
+++ sys/kern/vfs_bio.c	(working copy)
@@ -1146,7 +1146,7 @@
 	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
 		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
 		altbufferflushes++;
-	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+	} else if ( bdflush_required(bo) ) {
 		BO_LOCK(bo);
 		/*
 		 * Try to find a buffer to flush.
@@ -1438,6 +1438,28 @@
 }
 
 /*
+ * Flushing is deemed necessary if at least one of the following is true:
+ * - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ * - numdirtybuffers exceeds dirtybufthresh
+ * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
+ *
+ * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+	struct thread *td = curthread;
+
+	KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+	if ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
+	    (numdirtybuffers > dirtybufthresh) ||
+	    (td && (td->td_pflags & TDP_NORUNNINGBUF)) )
+		return(1);
+	return(0);
+}
+
+/*
  * brelse:
  *
  *	Release a busy buffer and, if requested, free its resources.
Index: sys/sys/bio.h
===================================================================
--- sys/sys/bio.h	(revision 253339)
+++ sys/sys/bio.h	(working copy)
@@ -151,6 +151,10 @@
 #define physread physio
 #define physwrite physio
 
+extern struct bufobj *bo;
+int bdflush_required(struct bufobj *bo);
+
+
 #endif /* _KERNEL */
 #endif /* !_SYS_BIO_H_ */

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c	(revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -2161,7 +2161,7 @@
 	struct buf *nbp;
 	int bp_bdskip;
 
-	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+	if ( !bdflush_required(bo) )
 		return;
 	td = curthread;

I have modified ffs_bdflush() in ffs_snapshot.c and the generic bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a kernel of 5.2 vintage will have this code still in bdwrite().)

While the patch makes rewriting work reliably, it may need some further refinements:

- Right now, it limits numdirtybuffers to dirtybufthresh. It might be better to use hidirtybuffers as a limit instead?

- the set of conditions for which the flush is performed needs to be reviewed by someone who understands this better than I do. Limiting numdirtybuffers is necessary to fix the hangs. Limiting bo->bo_dirty.bv_cnt was what the code did before, so I left this in as well. And finally, I'm testing for TDP_NORUNNINGBUF based on the idea that if bufdaemon, syncer or any process acting on their behalf sees the need to flush buffers, then it should not be up to *bdflush() to second-guess their intention.

- the patch may have style issues

Regarding the threshold and conditions: I am still in the process of doing some basic benchmarks with and without the patch. However, it seems that it causes a slight performance decrease in the cases that work well without the patch (i.e. only one process re-writing). My guess is that it now flushes more often than absolutely necessary.
Performance is unchanged for cases that do not cause a huge buffer buildup (reading and writing via separate file descriptors). Of course, in cases that did result in hangs before and now don't, the patch is a clear win. I still believe there are some bugs in the clustering code that affect this case, making the re-writing less efficient than it could be. I wish I had the time to do more testing, but I need to bring the server into production soon. I'm putting some notes here for things I intended to test, but probably won't have the time to: - in vfs_cluster.c, cluster_write_gb(), where it says "Change to algorithm...": Also push the previous cluster when it has reached maximum size, not only when seqcount > 0. No point waiting for buf_daemon when the cluster cannot grow. Verify that condition "vp->v_clen <= cursize" really tests "reached max. size". XXX "maxclen <= cursize"? Add rate-limited printf()s! - create a new "cluster_awrite", from DragonFly BSD: http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3 (plug-compatible to bawrite(), but with clustering). Do not replace vfs_bio_awrite (yet). Then, selectively replace some bawrite() calls with calls to cluster_awrite() to identify where current bawrite() calls are inefficient. XXX requires mods to cluster_wbuild()! If anyone wants me to test any additional things, or variants of the patch above, I can still do this next weekend. Since the system no longer hangs with the patch, I think this bug can now be closed. Klaus On Sun, Jul 21, 2013 at 07:42:15PM +0200, Klaus Weber wrote:
> With the following patch (also attached)
Attachment was missing.
Klaus
After studying style(9), I have now made a revised version of the patch, relative to 9/STABLE of about 3 weeks ago:

Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c	(revision 253339)
+++ sys/kern/vfs_bio.c	(working copy)
@@ -1146,7 +1146,7 @@
 	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
 		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
 		altbufferflushes++;
-	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+	} else if (bdflush_required(bo)) {
 		BO_LOCK(bo);
 		/*
 		 * Try to find a buffer to flush.
@@ -1438,6 +1438,24 @@
 }
 
 /*
+ * Flushing is deemed necessary if at least one of the following is true:
+ * - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ * - numdirtybuffers exceeds dirtybufthresh
+ * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+	struct thread *td = curthread;
+
+	KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+	return ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
+	    (numdirtybuffers > dirtybufthresh) ||
+	    (td && (td->td_pflags & TDP_NORUNNINGBUF)));
+}
+
+/*
  * brelse:
  *
  *	Release a busy buffer and, if requested, free its resources.

Index: sys/sys/buf.h
===================================================================
--- sys/sys/buf.h	(revision 253339)
+++ sys/sys/buf.h	(working copy)
@@ -486,6 +486,7 @@
 void	bdata2bio(struct buf *bp, struct bio *bip);
 void	bwillwrite(void);
 int	buf_dirty_count_severe(void);
+int	bdflush_required(struct bufobj *bo);
 void	bremfree(struct buf *);
 void	bremfreef(struct buf *);	/* XXX Force bremfree, only for nfs.
 */
 int	bread(struct vnode *, daddr_t, int, struct ucred *, struct buf **);

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c	(revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -2161,7 +2161,7 @@
 	struct buf *nbp;
 	int bp_bdskip;
 
-	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+	if (!bdflush_required(bo))
 		return;
 	td = curthread;

[End of patch]

Besides being more concise, I hope this patch no longer has style issues, although I was not sure about the indentation of the second and third lines of the return statement. I chose to left-align the conditions.

I have also tested limiting numdirtybuffers to hidirtybuffers (instead of dirtybufthresh), but that does not prevent the system from hanging.

I would appreciate it if someone could review and eventually commit this patch (or a variant thereof), and then close this bug.

Thanks, Klaus

Author: kib
Date: Sun Sep 22 19:23:48 2013
New Revision: 255797

URL: http://svnweb.freebsd.org/changeset/base/255797

Log:
  Increase the chance of the buffer write from the bufdaemon helper
  context to succeed.  If the locked vnode which owns the buffer to be
  written is shared locked, try the non-blocking upgrade of the lock to
  exclusive.
PR:		kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)

Modified:
  head/sys/kern/vfs_bio.c

Modified: head/sys/kern/vfs_bio.c
==============================================================================
--- head/sys/kern/vfs_bio.c	Sun Sep 22 19:15:24 2013	(r255796)
+++ head/sys/kern/vfs_bio.c	Sun Sep 22 19:23:48 2013	(r255797)
@@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
 	int hasdeps;
 	int flushed;
 	int queue;
+	int error;
+	bool unlock;
 
 	flushed = 0;
 	queue = QUEUE_DIRTY;
@@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
 			BUF_UNLOCK(bp);
 			continue;
 		}
-		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
+		if (lvp == NULL) {
+			unlock = true;
+			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
+		} else {
+			ASSERT_VOP_LOCKED(vp, "getbuf");
+			unlock = false;
+			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
+			    vn_lock(vp, LK_UPGRADE | LK_NOWAIT);
+		}
+		if (error == 0) {
 			mtx_unlock(&bqdirty);
 			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
 			    bp, bp->b_vp, bp->b_flags);
@@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
 				notbufdflushes++;
 			}
 			vn_finished_write(mp);
-			VOP_UNLOCK(vp, 0);
+			if (unlock)
+				VOP_UNLOCK(vp, 0);
 			flushwithdeps += hasdeps;
 			flushed++;

_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"

Author: kib
Date: Sun Sep 29 18:04:57 2013
New Revision: 255941

URL: http://svnweb.freebsd.org/changeset/base/255941

Log:
  Reimplement r255797 using LK_TRYUPGRADE.

  The r255797 was:
  Increase the chance of the buffer write from the bufdaemon helper
  context to succeed.  If the locked vnode which owns the buffer to be
  written is shared locked, try the non-blocking upgrade of the lock to
  exclusive.
PR:		kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)

Modified:
  head/sys/kern/vfs_bio.c

Modified: head/sys/kern/vfs_bio.c
==============================================================================
--- head/sys/kern/vfs_bio.c	Sun Sep 29 18:02:23 2013	(r255940)
+++ head/sys/kern/vfs_bio.c	Sun Sep 29 18:04:57 2013	(r255941)
@@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
 	int hasdeps;
 	int flushed;
 	int queue;
+	int error;
+	bool unlock;
 
 	flushed = 0;
 	queue = QUEUE_DIRTY;
@@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
 			BUF_UNLOCK(bp);
 			continue;
 		}
-		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
+		if (lvp == NULL) {
+			unlock = true;
+			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
+		} else {
+			ASSERT_VOP_LOCKED(vp, "getbuf");
+			unlock = false;
+			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
+			    vn_lock(vp, LK_TRYUPGRADE);
+		}
+		if (error == 0) {
 			mtx_unlock(&bqdirty);
 			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
 			    bp, bp->b_vp, bp->b_flags);
@@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
 				notbufdflushes++;
 			}
 			vn_finished_write(mp);
-			VOP_UNLOCK(vp, 0);
+			if (unlock)
+				VOP_UNLOCK(vp, 0);
 			flushwithdeps += hasdeps;
 			flushed++;

This is still a problem on 10.0-RELEASE.

HP DL380G8, 2*E5-2630 (2*6 cores), 64GB RAM, 2*300 GB disk in RAID1, 2 GB cache.

cd /usr/ports/databases/rrdtool12
make install
cd /usr/ports/databases/rrdtool12/work/rrdtool-1.2.30/examples
./perftest.pl

Can someone comment on how (and when) this is going to be fixed?

(In reply to rainer from comment #15)

We seem to experience this problem independently too (tested on 10.2-RELEASE and 10.1-RELEASE-p21).
In the testing environment the system behaves as described by rainer. In production, the symptoms seem to be the same (but the traffic is not generated artificially). Is there any progress or plans about this issue? Klaus, please, did you solve the problem? Thanks in advance, Pavel For bugs matching the following conditions: - Status == In Progress - Assignee == "bugs@FreeBSD.org" - Last Modified Year <= 2017 Do - Set Status to "Open" Keyword: patch or patch-ready – in lieu of summary line prefix: [patch] * bulk change for the keyword * summary lines may be edited manually (not in bulk). Keyword descriptions and search interface: <https://bugs.freebsd.org/bugzilla/describekeywords.cgi> |