Bug 178997

Summary: [vfs] [patch] Heavy disk I/O may hang system
Product: Base System
Reporter: Klaus Weber <fbsd-bugs-2013-1>
Component: kern
Assignee: freebsd-bugs mailing list <bugs>
Status: Open
Severity: Affects Only Me
CC: chris, emaste, pr, rainer
Priority: Normal
Version: unspecified
Hardware: Any
OS: Any
Attachments:
bo_dirty.bv_cnt_FIX_prod.patch (flags: none)
bo_dirty.bv_cnt_FIX_FINAL.patch (flags: none)

Description Klaus Weber 2013-05-26 21:00:00 UTC
Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase. 

Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., and processes that are already running will continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.

Most testing was done on 9/Stable and an Areca ARC-1882 controller (arcmsr), but I have test-booted a 10.0-current kernel from early May ("kernel-10.0-CURRENT-5.5.2013"), and verified that the problem still exists there. I have also tested with an on-board SATA port ("Intel Cougar Point AHCI SATA controller"); with this setup, I can only observe extreme throughput degradation, but no hangs (see below for details).

Test-setup:
da0 is a 600 GB volume on the Areca controller:
da0 at arcmsr0 bus 0 scbus0 target 0 lun 0
da0: <Areca ARC-1882-VOL#000 R001> Fixed Direct Access SCSI-5 device 
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 572202MB (1171870720 512 byte sectors: 255H 63S/T 72945C)

gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

To observe the problem, I start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

(In order to better see the various stages, I start the second bonnie++ about a minute after the first. There is nothing magic about the 64 GB; the size just needs to be large enough to keep the disk busy for a few minutes.)

I then observe disk throughput for da0 with 

iostat -dC -w 1 da0

             da0             cpu
  KB/t tps  MB/s  us ni sy in id
[... first bonnie++ starts, "writing 1 byte at a time" phase ...]
 64.00   6  0.37   0  0 12  0 88
 64.00   4  0.25   0  0 13  0 87
 64.00   6  0.37   0  0 12  0 88
[first bonnie++ enters "writing intelligently" phase]
 63.97 3847 240.33   0  0 12  0 87
 64.00 20824 1301.51   0  0 14  0 85
 64.00 20449 1278.04   0  0 14  1 85
 64.00 17840 1115.01   0  0 14  1 85
 64.00 16595 1037.21   0  0 13  1 85
[this is my baseline for upper limit on throughput: ~1 GB/sec]
[...second bonnie++ starts, no big changes]
 64.00 12869 804.30   0  0 15  1 84
[second bonnie++ gets to "Writing intelligently" phase]
 64.00 7773 485.78   0  0  9  0 90
 64.00 7310 456.86   0  0  7  1 92
 64.00 7430 464.40   0  0 10  0 89
 64.00 7151 446.93   0  0  6  1 93
 64.00 7410 463.10   0  0  8  0 91
[...]
 64.00 7349 459.29   0  0 10  0 90
 64.00 7478 467.35   0  0  7  0 93
 64.00 7153 447.06   0  0  9  0 90
 64.00 7335 458.41   0  0  7  0 93
 64.00 7918 494.88   0  0 10  0 90
[first bonnie++ gets to "Rewriting" phase]
 64.00 7426 464.10   0  0 12  1 87
 64.00 7251 453.17   0  0  9  0 90
 64.00 7036 439.75   0  0  6  1 93
 64.00 7296 455.98   0  0  8  1 90
 64.00 6808 425.51   0  0  6  1 93
 64.00 7531 470.66   0  0  8  1 91
 64.00 7323 457.67   0  0  8  0 91
 64.00 7666 479.14   0  0  6  0 94
 64.00 11921 745.07   0  0 12  1 87
 64.00 12043 752.69   0  0 12  1 87
 64.00 10814 675.89   0  0 11  1 89
 64.00 10925 682.82   0  0 12  0 88
[...]
[second bonnie++ starts "Rewriting" phase]
 64.00 10873 679.56   0  0 10  1 89
 64.00 12242 765.12   0  0 13  1 86
 64.00 10739 671.19   0  0 11  0 88
 64.00 3486 217.91   0  0  8  0 91
 64.00 6327 395.42   0  0 14  0 86
 64.00 6637 414.84   0  0  9  0 90
 64.00 4931 308.19   0  0 10  0 89
 64.00 4874 304.63   0  0 13  0 87
 64.00 4322 270.11   0  0 18  0 82
 64.00 4616 288.52   0  0 19  0 80
 64.00 3567 222.96   0  0 20  0 80
 64.00 3395 212.16   0  0 30  0 70
 64.00 1694 105.89   0  0 33  0 67
 64.00 1898 118.63   0  0 36  0 64
  0.00   0  0.00   0  0 12  0 88
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  4  0 96
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
[etc. The system does not recover from this state. All attempts to do disk I/O hang.]

When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.
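For the record, these wait states can be observed without a debugger; a sketch along these lines should work (column keywords per ps(1); the exact process-name patterns are assumptions):

```shell
# Show the wait channel of the bonnie++ processes and of bufdaemon; during
# the hang the former sit in "nbufkv" and the latter sleeps in "qsleep".
ps -ax -o pid,state,wchan,command | egrep 'bonnie|bufdaemon'
```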

I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:

[same setup as above, but on single disk connected to onboard SATA port.]
Baseline: 150 MB/sec (first bonnie++ in "Writing intelligently" phase):
            ada2             cpu
  KB/t tps  MB/s  us ni sy in id
 128.00 1221 152.60   0  0  2  0 97
 128.00 1173 146.60   0  0  2  0 98
 128.00 1171 146.35   0  0  2  0 98
[... first bonnie++ in "Rewriting", second in "Writing intelligently" phase:]
 127.83 1140 142.30   0  0  2  0 97
 128.00 1121 140.11   0  0  2  0 97
 128.00 1114 139.24   0  0  1  0 98
 127.29 904 112.39   0  0  5  0 95
[second bonnie++ starts rewriting]
 120.24  66  7.74   0  0  0  0 100
 128.00  59  7.37   0  0  0  0 100
 122.43  69  8.24   0  0  1  0 99
 128.00 222 27.72   0  0  0  0 100
 128.00 1122 140.23   0  0  3  0 97
 128.00 1080 134.99   0  0  3  0 97
 128.00 1157 144.61   0  0  3  0 97
 127.75 504 62.94   0  0  2  0 98
 128.00  59  7.37   0  0  1  0 99
 128.00 708 88.54   0  0  2  0 98
 128.00 969 121.13   0  0 11  0 89
 127.89 1161 144.99   0  0  3  0 97
 127.60 479 59.63   0  0  1  0 99
 128.00 596 74.55   0  0  2  0 98
 128.00 1169 146.10   0  0  3  0 97
 127.88 1076 134.37   0  0  2  0 97
 127.89 1202 150.10   0  0  3  0 97
 128.00 1074 134.24   0  0  4  0 96
 128.00 1039 129.87   0  0  8  0 92
 128.00 913 114.14   0  0 10  0 90
 127.73 896 111.78   0  0 11  0 89
 128.00 930 116.26   0  0 10  0 90
 127.93 909 113.57   0  0 11  0 89
 128.00 887 110.89   0  0 13  0 87
 128.00 743 92.91   0  0 18  0 81
 128.00 548 68.56   0  0 23  0 76
 128.00 523 65.43   0  0 24  0 76
 127.89 587 73.36   0  0 21  0 79
 127.89 586 73.24   0  0 22  0 78
 128.00 465 58.07   0  0 28  0 72
 127.89 558 69.74   0  0 25  0 75
 127.79 595 74.30   0  0 21  0 79
 128.00 593 74.18   0  0 23  0 77
 128.00 546 68.31   0  0 22  0 78
 127.70 428 53.32   0  0 32  0 68
 128.00 561 70.18   0  0 25  0 75
 128.00 645 80.67   0  0 22  0 78
 127.78 582 72.68   0  0 23  0 77
 114.68 518 58.07   0  0 27  0 73
 103.33 497 50.20   0  0 29  0 71
 110.00 128 13.74   0  0 35  0 65
 65.94  33  2.12   0  0 37  0 63
 74.97 286 20.92   0  0 36  0 64
 105.83 306 31.59   0  0 34  0 66
 95.12 399 37.03   0  0 30  0 70
 108.57 336 35.59   0  0 29  0 71
 127.22 246 30.53   0  0 41  0 59
 104.16 255 25.91   0  0 32  0 68
 103.68 528 53.51   0  0 23  0 77
 127.14 373 46.27   0  0 29  0 71
 90.65 341 30.16   0  0 35  0 65
 127.30 368 45.70   0  0 30  0 70
 66.56  25  1.62   0  0 37  0 63
 66.00  32  2.06   0  0 38  0 62
 94.99  95  8.80   0  0 38  0 62
 94.00 433 39.71   0  0 35  0 65
 107.66 666 70.06   0  0 26  0 74
 119.43 478 55.69   0  0 27  0 73
 95.64 441 41.15   0  0 28  0 72
 116.45 449 51.01   0  0 29  0 71
 102.22 503 50.26   0  0 28  0 72
 127.62 509 63.50   0  0 29  0 71
 111.84 388 42.33   0  0 29  0 71

(i.e. disk throughput jumps wildly between full speed and ~1% of maximum, despite continuous I/O pressure)


Unfortunately, I cannot use kdb on this machine - "ALT-CTRL-ESC" does nothing and "sysctl debug.kdb.enter=1" just says "debug.kdb.enter: 0 -> 0". "sysctl debug.kdb.panic=1" does panic the machine, but the keyboard is not working.

I have, however, logged the sysctl vfs subtree every second while the problem occurs:
As long as both bonnie++ are in the "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), vfs.numdirtybuffers starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:

00-03-28.log:vfs.numdirtybuffers: 107
00-03-29.log:vfs.numdirtybuffers: 106
00-03-30.log:vfs.numdirtybuffers: 105
00-03-31.log:vfs.numdirtybuffers: 105
00-03-32.log:vfs.numdirtybuffers: 110
00-03-33.log:vfs.numdirtybuffers: 112
00-03-34.log:vfs.numdirtybuffers: 110
00-03-35.log:vfs.numdirtybuffers: 115
00-03-36.log:vfs.numdirtybuffers: 117
00-03-37.log:vfs.numdirtybuffers: 118
00-03-38.log:vfs.numdirtybuffers: 118
00-03-39.log:vfs.numdirtybuffers: 118
00-03-40.log:vfs.numdirtybuffers: 118
[first bonnie++ starts Rewriting]
00-03-41.log:vfs.numdirtybuffers: 294
00-03-42.log:vfs.numdirtybuffers: 404
00-03-43.log:vfs.numdirtybuffers: 959
00-03-44.log:vfs.numdirtybuffers: 1071
00-03-45.log:vfs.numdirtybuffers: 1155
00-03-46.log:vfs.numdirtybuffers: 1253
00-03-47.log:vfs.numdirtybuffers: 1366
00-03-48.log:vfs.numdirtybuffers: 1457
00-03-49.log:vfs.numdirtybuffers: 1549
00-03-50.log:vfs.numdirtybuffers: 1680
00-03-51.log:vfs.numdirtybuffers: 1904
00-03-52.log:vfs.numdirtybuffers: 2136
00-03-53.log:vfs.numdirtybuffers: 2363
00-03-54.log:vfs.numdirtybuffers: 2503
00-03-55.log:vfs.numdirtybuffers: 3023
00-03-56.log:vfs.numdirtybuffers: 3268
00-03-57.log:vfs.numdirtybuffers: 3601
00-03-58.log:vfs.numdirtybuffers: 3743
00-03-59.log:vfs.numdirtybuffers: 3904
00-04-00.log:vfs.numdirtybuffers: 4106
00-04-01.log:vfs.numdirtybuffers: 4325
00-04-02.log:vfs.numdirtybuffers: 4461
00-04-03.log:vfs.numdirtybuffers: 4643
00-04-04.log:vfs.numdirtybuffers: 5063
00-04-05.log:vfs.numdirtybuffers: 5202
00-04-06.log:vfs.numdirtybuffers: 5456
00-04-07.log:vfs.numdirtybuffers: 5593
00-04-08.log:vfs.numdirtybuffers: 5819
00-04-09.log:vfs.numdirtybuffers: 6092
00-04-10.log:vfs.numdirtybuffers: 6249
00-04-11.log:vfs.numdirtybuffers: 6412
00-04-12.log:vfs.numdirtybuffers: 6591
00-04-13.log:vfs.numdirtybuffers: 6698
00-04-14.log:vfs.numdirtybuffers: 6853
00-04-15.log:vfs.numdirtybuffers: 7031
00-04-16.log:vfs.numdirtybuffers: 7192
00-04-17.log:vfs.numdirtybuffers: 7387
00-04-18.log:vfs.numdirtybuffers: 7609
00-04-19.log:vfs.numdirtybuffers: 7757
00-04-20.log:vfs.numdirtybuffers: 7888
00-04-21.log:vfs.numdirtybuffers: 8156
00-04-22.log:vfs.numdirtybuffers: 8362
00-04-23.log:vfs.numdirtybuffers: 8558
[second bonnie goes Rewriting as well]
00-04-24.log:vfs.numdirtybuffers: 11586
00-04-25.log:vfs.numdirtybuffers: 16325
00-04-26.log:vfs.numdirtybuffers: 24333
00-04-27.log:vfs.numdirtybuffers: 29058
00-04-28.log:vfs.numdirtybuffers: 33752
00-04-29.log:vfs.numdirtybuffers: 38792
00-04-30.log:vfs.numdirtybuffers: 42663
00-04-31.log:vfs.numdirtybuffers: 47305
00-04-33.log:vfs.numdirtybuffers: 52070
00-04-34.log:vfs.numdirtybuffers: 52066
00-04-35.log:vfs.numdirtybuffers: 52063
00-04-37.log:vfs.numdirtybuffers: 52098
00-04-39.log:vfs.numdirtybuffers: 52098
00-04-41.log:vfs.numdirtybuffers: 52097
00-04-44.log:vfs.numdirtybuffers: 52097
00-04-46.log:vfs.numdirtybuffers: 52098
00-04-49.log:vfs.numdirtybuffers: 52096
00-04-52.log:vfs.numdirtybuffers: 52098
00-04-54.log:vfs.numdirtybuffers: 52096
00-04-57.log:vfs.numdirtybuffers: 52098
00-05-00.log:vfs.numdirtybuffers: 52096
[ etc. ]

vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
(the machine has 32 GB RAM)

(I could upload the archive of the vfs logs somewhere, so you can check other values over time, if that would be helpful)
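For reference, the per-second logging described above can be reproduced with a small loop like the following (a hypothetical reconstruction; the exact script used for the logs is not part of this report, and the log directory name is an assumption):

```shell
#!/bin/sh
# Dump the vfs sysctl subtree once per second, one timestamped file per
# sample, matching the HH-MM-SS.log names in the excerpts above.
mkdir -p /var/tmp/vfslog
while :; do
    sysctl vfs > "/var/tmp/vfslog/$(date +%H-%M-%S).log"
    sleep 1
done
```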


I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

The symptoms described in r189878 (issues in bufdaemon, causing processes to hang in nbufkv) sound similar. 

Please let me know if there is any other info that I can provide.

Fix: 

No fix known.
How-To-Repeat: gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

Start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

Observe disk throughput for da0 with 

iostat -dC -w 1 da0

in particular when both bonnie++ processes are in their "Rewriting..." phase.
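The steps above can be combined into one sketch script (device name da0, mount point /mnt and the 60-second stagger are as described in the report; run as root, and note this destroys any data on da0):

```shell
#!/bin/sh -e
# Reproduce the hang: 64k/8k UFS on da0, then two concurrent bonnie++ runs.
gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt
mkdir /mnt/t1 /mnt/t2
chmod g+w /mnt/t1 /mnt/t2

bonnie++ -s 64g -n 0 -d /mnt/t1 &
sleep 60                        # stagger the second instance, as in the report
bonnie++ -s 64g -n 0 -d /mnt/t2 &

iostat -dC -w 1 da0             # throughput collapses in the Rewriting phase
```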
Comment 1 Bruce Evans (freebsd committer) 2013-05-27 06:57:56 UTC
On Sun, 26 May 2013, Klaus Weber wrote:

>> Environment:
> FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root@filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
> (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
>> Description:
> Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.

Please use the unix newline character in mail.

A huge drop in disk throughput is normal with multiple processes,
unfortunately.  Perhaps not quite this much.

> Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., processes that are already running will continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.

Hangs are bugs though.

I have been arguing with kib@ about some methods of handling heavy disk
i/o being nonsense since they either make the problem worse (by switching
to direct unclustered writes, so that slow i/o goes even slower) or have
no effect except to complicate the analysis of the problem (because they
are old hackish methods, and newer better methods make the cases handled
by the old methods unreachable).  But the analysis is complicated, and
we couldn't agree on anything.

One thing I noticed more recently than my last mail to kib about this
is that even a single heavy disk write (say dd bs=256m count=2
</dev/zero >foo) can block another non-write-bound process (say iostat
1 or systat -v 1 to watch the progress) from starting up (sometimes
even when it has just been run to load it into cache(s)) until the dd
completes 10 seconds later.  This was on a 2-core system running a
fairly old version of FreeBSD (8.0-CURRENT from ~2008).  I found that
the problem could be fixed by killing cluster_write() by turning it into
bdwrite() (by editing the running kernel using ddb, since this is easier
than rebuilding the kernel).  I was trying many similar things since I
had a theory that cluster_write() is useless.  Actually, cluster_write()
gives some minor optimizations, but this misbehaviour indicates that it
has a resource starvation bug.  My theory for what the bug is is that
cluster_write() and cluster_read() share the limit resource of pbufs.
pbufs are not managed as carefully as normal buffers.  In particular,
there is nothing to limit write pressure from pbufs like there is for
normal buffers.  However, this behaviour doesn't happen on the much older
version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many
local modifications).  It starts iostat to watch dd instantly, even running
on a single core system.

> gpart create -s gpt da0
> gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
> newfs -b 64k -f 8k /dev/da0p1
> mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
Someone increased the default from 16k to 32k without changing the buffer
cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
small, but on 32-bit arches kernel virtual memory is too limited to have
a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
although this problem doesn't affect 64-bit arches.

-b 64k is worse.

> When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.

You got the buffer cache fragmentation that you asked for.

Buffer cache fragmentation caused serious problems 10-20 years ago, but the
ones causing hangs were supposed to be fixed about 12 years ago and the ones
causing extreme slowness for just the software parts of handling
fragmentation were fixed a little later.  But there are apparently still some.

However, I have never been able to reproduce serious fragmentation problems
from using too-large-block sizes, or demonstrate significant improvements
from avoiding the known fragmentation problem by increasing BKVASIZE.
Perhaps my systems are too small, or have tuning or local changes that
accidentally avoid the problem.

Apparently you found a way to reproduce the serious fragmentation
problems.  Try using a block size that doesn't ask for the problem.
Increasing BKVASIZE would take more work than this, since although it
was intended to be a system parameter which could be changed to reduce
the fragmentation problem, one of the many bugs in it is that it was
never converted into a "new" kernel option.  Another of the bugs in
it is that doubling it halves the number of buffers, so doubling it
does more than use twice as much kva.  This severely limited the number
of buffers back when memory sizes were 64MB.  It is not a very
significant limitation if the memory size is 1GB or larger.

> I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:

I get ~39MB/sec with 1 "dd bs=128m count=2 </dev/zero >foo" writing to
a nearly full old ffs file system on an old PATA disk, and 2 times
20MB/sec with 2 dd's.  This is almost as good as possible.  The
configuration was somewhat flawed:
- the writes all fitted in memory because my disks are too full to hold
   more.  This shouldn't be a problem for write benchmarks, because write
   pressure is (too) limited by writing out dirty buffers (too) early to
   limit write pressure, so writes are mostly physical.
- the fs block size was 16K.  This not a flaw, just the right size, but
   other sizes should be tested.
- the file system was mounted async.  This should have little effect,
   since for large writes all data is written async anyway (the write
   pressure handling causes async writes instead of delayed writes even
   for not so large writes).
- ffs was modified to remove its bogus write pressure handling.  This
   should have no effect, but I didn't verify it.

> I have, however, logged the sysctl vas subtree every second while the problem occurs:
> As long as both bonnie++ are in "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below) starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
>
> 00-03-28.log:vfs.numdirtybuffers: 107
> 00-03-29.log:vfs.numdirtybuffers: 106
> 00-03-30.log:vfs.numdirtybuffers: 105
> ...
> [first bonnie++ starts Rewriting]
> 00-03-41.log:vfs.numdirtybuffers: 294
> 00-03-42.log:vfs.numdirtybuffers: 404
> 00-03-43.log:vfs.numdirtybuffers: 959
> 00-03-44.log:vfs.numdirtybuffers: 1071

There are various write pressure handling mechanisms that have different
thresholds and get in each way a bit.

The dirty buffer watermarks are presumably much higher than 105 or even
1071.  Other mechanisms apparently work well and limit the number of
dirty buffers to only 107 with 1 writer.  Writes to memory can go much
faster than writes to the disk even when the latter go at full speed, but
some mechanism blocks writers (too) early so the 1 writer soon reaches a
steady state with not many dirty buffers.

> 00-03-45.log:vfs.numdirtybuffers: 1155
> 00-03-46.log:vfs.numdirtybuffers: 1253
> 00-03-47.log:vfs.numdirtybuffers: 1366
> 00-03-48.log:vfs.numdirtybuffers: 1457
> ...
> 00-04-21.log:vfs.numdirtybuffers: 8156
> 00-04-22.log:vfs.numdirtybuffers: 8362
> 00-04-23.log:vfs.numdirtybuffers: 8558
> [second bonnie goes Rewriting as well]
> 00-04-24.log:vfs.numdirtybuffers: 11586
> 00-04-25.log:vfs.numdirtybuffers: 16325
> 00-04-26.log:vfs.numdirtybuffers: 24333
> ...
> 00-04-54.log:vfs.numdirtybuffers: 52096
> 00-04-57.log:vfs.numdirtybuffers: 52098
> 00-05-00.log:vfs.numdirtybuffers: 52096
> [ etc. ]

This is a rather large buildup and may indicate a problem.  Try reducing
the dirty buffer watermarks.  Their default values are mostly historical
nonsense.  They are just some fixed fraction (about 1/4) of the number
of buffers.  With multiple GB of memory, the number of buffers may be
huge.  On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh
is 35523.  But the size of main memory has nothing to do with the
correct limit on the number of dirty buffers (unless the latter is not
being used to limit write pressure, but only to limit starvation of
buffers for readers.  This is perhaps all it is supposed to do, but see
below).  We want buildups of dirty buffers to be limited so that the
system never gets too far behind on physical writes.  The correct limit
depends on the physical write capabilities of the machine and on which
write channels have pressure on them.  Default global limits don't work
at all for this.  You can fix the defaults but you can't distribute the
limits properly across channels, or easily keep the limits up to date
as channels are added.  It is easy to add slow USB channels.  These
won't increase the write capability much, but it is very easy for a
slow disk or channel to it to grab many dirty buffers and keep them for
a long time so that faster disks are starved.

Other mechanisms handle write pressure better than the simple dirty
buffer count one can (others count the amount of i/o, which is more
relevant to how long it will take).  So assume that any effect of
the dirty buffer count mechanism on write pressure is unintentional.
There are still the following problems with it:
- as above, slow channels can starve fast channels, thus giving a
   severe side effect on the write pressure (at best the users of the
   fast channels block so that they are slowed down to the speed of
   the slow channels)
- buffer starvation for readers can happen anyway.  vfs.nbuf is the
   number of buffers that are allocated, but not all of these can be
   used when some file systems use buffer sizes larger than about
   BKVASIZE.  The worst case is when all file systems use a block size of
   64KB.  Then only about 1/4 of the allocated buffers can be used.
   The number that can be used is limited by vfs.maxbufspace and
   related variables.  The default for vfs.maxbufspace is controlled
   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
   128K; the ratio is 16K = BKVASIZE).

So when you ask for buffer cache fragmentation with -b 64k, you get
even more.  The effective nbuf is reduced by a factor of 4.  This
means that all the dirty buffer count watermarks except one are
slightly above the effective nbuf, so if the higher watermarks
cannot quite be reached and when they are nearly reached readers
are starved of buffers.

> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> (the machine has 32 GB RAM)

Fairly consistent with my examples of a machine with 24GB RAM.
The upper watermark is 33% higher, etc.  Your nbuf will be more like
200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
that's about 3GB.  maxbufspace will be about the same, and bufspace
(the amount used) will be not much smaller.  There will be little
space for buffers for readers, and the 2 writers apparently manage
to starve even each other, perhaps for similar reasons.  The backlog
of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
it would take almost an hour, so it is not much better than a hang.
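The backlog estimate above can be checked with shell arithmetic (the 64K buffer size and the numdirtybuffers value are taken from the report; the drain rates are the two throughputs observed there):

```shell
# ~52096 dirty buffers of 64 KB each: how large is the write backlog, and
# how long would it take to drain at the observed throughputs?
buffers=52096
bufkb=64
backlog_mb=$((buffers * bufkb / 1024))
echo "backlog: ${backlog_mb} MB"              # about 3.2 GB
echo "at 150 MB/s: $((backlog_mb / 150)) s"   # roughly 20 seconds
echo "at 1 MB/s: $((backlog_mb / 60)) min"    # close to an hour
```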

> I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

I think the hardware is usually not the problem (at least after kib's fix
for the above), except when it is large it can stress the software more.

Bruce
Comment 2 Klaus Weber 2013-05-27 19:57:32 UTC
First of all, thank you very much for looking into this, and for your
detailed explanations. Much appreciated.

On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> On Sun, 26 May 2013, Klaus Weber wrote:
> >>Description:
> >Heavy disk I/O (two bonnie++ processes working on the same disk
> >simultaneously) causes an extreme degradation in disk throughput (combined
> >throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
> >when both bonnie++ processes are in the "Rewriting..." phase.

 
> Please use the unix newline character in mail.

My apologies. I submitted the report via the web-interface and did not
realize that it would come out this way.

> A huge drop in disk throughput is normal with multiple processes,
> unfortunately.  Perhaps not quite this much.

Yes, the slight performance drop from ~1 GB/sec to ~600 or even 
~400 MB/sec does not concern me (this might even be an issue with the
arcmsr driver or the controller's firmware; I found some hints that
this might be the case.)

However, a performance drop of three orders of magnitude for just two
concurrent processes seemed to indicate a problem.
 
> Hangs are bugs though.
> 
> I have been arguing with kib@ about some methods of handling heavy disk
> i/o being nonsense since they either make the problem worse (by switching
> to direct unclustered writes, so that slow i/o goes even slower) or have
> no effect except to complicate the analysis of the problem (because they
> are old hackish methods, and newer better methods make the cases handled
> by the old methods unreachable).  But the analysis is complicated, and
> we couldn't agree on anything.

The machine where I experience this problem is not in production
yet. I can reboot it at any time, test patches, etc. Just let me know
if I can do anything helpful. The only limitation is that I usually
have access to the server only on weekends.

> [single process with heavy write access can block other
> non-write-bound process]

I have not experienced this so far, but I can test this when I have
access to the server next time.

> I found that
> the problem could be fixed by killing cluster_write() by turning it into
> bdwrite() (by editing the running kernel using ddb, since this is easier
> than rebuilding the kernel).  I was trying many similar things since I
> had a theory that cluster_write() is useless.  [...]

If that would provide a useful datapoint, I could try whether that
makes a difference on my system. What changes would be required to test this?

Surely it's not as easy as replacing the function body of
cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?

> My theory for what the bug is is that
> cluster_write() and cluster_read() share the limit resource of pbufs.
> pbufs are not managed as carefully as normal buffers.  In particular,
> there is nothing to limit write pressure from pbufs like there is for
> normal buffers.  

Is there anything I can do to confirm or rebut this? Is the number of
pbufs in use visible via a sysctl, or could I add debug printfs that
are triggered when certain limits are reached?

> >newfs -b 64k -f 8k /dev/da0p1
> 
> The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
> Someone increased the default from 16k to 32k without changing the buffer
> cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
> small, but on 32-bit arches kernel virtual memory is too limited to have
> a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
> although this problem doesn't affect 64-bit arches.
> 
> -b 64k is worse.

Thank you for this explanation. I was not aware that -b 64k (or even
the default values to newfs) would have this effect. I will repeat the
tests with 32/4k and 16/2k, although I seem to remember that 64/8k
provided a significant performance boost over the defaults. This, and
the reduced fsck times, was my original motivation to go with the
larger values.

Given the potentially drastic effects of block sizes other than 16/2k,
maybe a warning should be added to the newfs manpage? I only found the
strong advice to maintain an 8:1 buffer:fragment ratio.


> >When both bonnie++ processes are in their "Rewriting" phase, the system 
> >hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". 
> >bufdaemon takes about 40% CPU time and is in state "qsleep" when not 
> >active.
> 
> You got the buffer cache fragmentation that you asked for.

Looking at vfs_bio.c, I see that it has defrag code in it. Should I
try adding some debug output to this code to get some insight into why
it does not work, or is not as effective as it should be?

> Apparently you found a way to reproduce the serious fragmentation
> problems.  

A key factor seems to be the "Rewriting" operation. I see no problem
during the "normal" writing, nor could I reproduce it with concurrent
dd runs.

> Try using a block size that doesn't ask for the problem.

Will do, although for production use I would really prefer a 64/8k
system, due to the higher performance.

> Increasing BKVASIZE would take more work than this, since although it
> was intended to be a system parameter which could be changed to reduce
> the fragmentation problem, one of the many bugs in it is that it was
> never converted into a "new" kernel option.  Another of the bugs in
> it is that doubling it halves the number of buffers, so doubling it
> does more than use twice as much kva.  This severely limited the number
> of buffers back when memory sizes were 64MB.  It is not a very
> significant limitation if the memory size is 1GB or larger.

Should I try to experiment with BKVASIZE of 65536? If so, can I
somehow up the number of buffers again? Also, after modifying
BKVASIZE, is it sufficient to compile and install a new kernel, or do
I have to build and install the entire world?
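If it helps, the experiment might look roughly like this (assumptions to verify: BKVASIZE is defined in sys/sys/param.h on this branch, a kernel rebuild without a world rebuild is sufficient, and FILEPILE is the kernel config named in the environment above):

```shell
# Sketch: double BKVASIZE and rebuild only the kernel.
cd /usr/src
grep -n 'BKVASIZE' sys/sys/param.h       # confirm where it is defined
# edit the define: 16384 -> 65536, then:
make buildkernel KERNCONF=FILEPILE
make installkernel KERNCONF=FILEPILE
shutdown -r now
```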


> I get ~39MB/sec with 1 "dd bs=128m count=2 </dev/zero >foo" writing to
> a nearly full old ffs file system on an old PATA disk, and 2 times
> 20MB/sec with 2 dd's.  This is almost as good as possible. 

I agree. Performance with dd, or during bonnie++'s "Writing
intelligently" phase is very reasonable, with both a single process
and with two processes simultaneously.

Something specific to the "Rewriting..." workload is triggering the
problem.


> >[second bonnie goes Rewriting as well]
> >00-04-24.log:vfs.numdirtybuffers: 11586
> >00-04-25.log:vfs.numdirtybuffers: 16325
> >00-04-26.log:vfs.numdirtybuffers: 24333
> >...
> >00-04-54.log:vfs.numdirtybuffers: 52096
> >00-04-57.log:vfs.numdirtybuffers: 52098
> >00-05-00.log:vfs.numdirtybuffers: 52096
> >[ etc. ]
> 
> This is a rather large buildup and may indicate a problem.  Try reducing
> the dirty buffer watermarks.  Their default values are mostly historical
> nonsense. 

You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
reasonable starting values for experimenting? 800/200?

> [Helpful explanation snipped]  

> - buffer starvation for readers can happen anyway.  vfs.nbuf is the
>   number of buffers that are allocated, but not all of these can be
>   used when some file systems use buffer sizes larger than about
>   BKVASIZE.  The worst case is when all file systems use a block size of
>   64KB.  Then only about 1/4 of the allocated buffers can be used.
>   The number that can be used is limited by vfs.maxbufspace and
>   related variables.  The default for vfs.maxbufspace is controlled
>   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
>   128K; the ratio is 16K = BKVASIZE).
> 
> So when you ask for buffer cache fragmentation with -b 64k, you get
> even more.  The effective nbuf is reduced by a factor of 4.  This
> means that all the dirty buffer count watermarks except one are
> slightly above the effective nbuf, so if the higher watermarks
> cannot quite be reached and when they are nearly reached readers
> are starved of buffers.

That makes sense - and bonnie++ is both reader and writer in the
rewriting phase, and thus may even "starve itself"?


> >vfs.hidirtybuffers: 52139
> >vfs.lodirtybuffers: 26069
> >(the machine has 32 GB RAM)
> 
> Fairly consistent with my examples of a machine with 24GB RAM.
> The upper watermark is 33% higher, etc.  Your nbuf will be more like
> 200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
> that's about 3GB.  maxbufspace will be about the same, and bufspace
> (the amount used) will be not much smaller.  There will be little
> space for buffers for readers, and the 2 writers apparently manage
> to starve even each other, perhaps for similar reasons.  The backlog
> of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
> it would take almost an hour, so it is not much better than a hang.

OK, I will experiment with these parameters when I have access to
the system again (which won't be before Thursday, unfortunately).

Thanks again so far,

  Klaus
Comment 3 Bruce Evans freebsd_committer 2013-05-28 13:03:10 UTC
On Mon, 27 May 2013, Klaus Weber wrote:

> On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
>> On Sun, 26 May 2013, Klaus Weber wrote:
>>>> Description:
>>> During heavy disk I/O (two bonnie++ processes working on the same disk
>>> simultaneously) causes an extreme degradation in disk throughput (combined
>>> throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
>>> when both bonnie++ processes are in the "Rewriting..." phase.
>
>> Please use the unix newline character in mail.
>
> My apologies. I submitted the report via the web-interface and did not
> realize that it would come out this way.

Thanks.  The log output somehow came out right.

>> I found that
>> the problem could be fixed by killing cluster_write() by turning it into
>> bdwrite() (by editing the running kernel using ddb, since this is easier
>> than rebuilding the kernel).  I was trying many similar things since I
>> had a theory that cluster_write() is useless.  [...]
>
> If that would provide a useful datapoint, I could try if that make a
> difference on my system. What changes would be required to test this?
>
> Surely its not as easy as replacing the function body of
> cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?

That should work for testing, but it is safer to edit ffs_write()
and remove the block where it calls cluster_write() (or bawrite()),
so that it falls through to call bdwrite() in most cases.

>> My theory for what the bug is is that
>> cluster_write() and cluster_read() share the limit resource of pbufs.
>> pbufs are not managed as carefully as normal buffers.  In particular,
>> there is nothing to limit write pressure from pbufs like there is for
>> normal buffers.
>
> Is there anything I can do to confirm or rebut this? Is the number of
> pbufs in use visible via a sysctl, or could I add debug printfs that
> are triggered when certain limits are reached?

Here I don't really know what to look for.  First add a sysctl to read
the number of free pbufs.  The variable for this is cluster_pbuf_freecnt
in vm.

>>> newfs -b 64k -f 8k /dev/da0p1
>>
>> The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
>> Someone increased the default from 16k to 32k without changing the buffer
>> cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
>> small, but on 32-bit arches kernel virtual memory is too limited to have
>> a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
>> although this problem doesn't affect 64-bit arches.
>>
>> -b 64k is worse.
>
> Thank you for this explanation. I was not aware that -b 64k (or even
> the default values to newfs) would have this effect. I will repeat the
> tests with 32/4k and 16/2k, although I seem to remember that 64/8k
> provided a significant performance boost over the defaults. This, and
> the reduced fsck times was my original motivation to go with the
> larger values.

The reduced fsck time and perhaps the reduced number of cylinder groups
are the main advantages of large clusters.  vfs-level clustering turns
most physical i/o's into 128K-blocks (especially for large files) so
there is little difference between the i/o speed for all fs block sizes
unless the fs block size is very small.

> Given the potentially drastic effects of block sizes other than 16/2k,
> maybe a warning should be added to the newfs manpage? I only found the
> strong advice to maintain an 8:1 block:fragment ratio.

Once the kernel misconfiguration is understood enough for such a warning to
not be FUD, it should be easy to fix.

>>> When both bonnie++ processes are in their "Rewriting" phase, the system
>>> hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
>>> bufdaemon takes about 40% CPU time and is in state "qsleep" when not
>>> active.
>>
>> You got the buffer cache fragmentation that you asked for.
>
> Looking at vfs_bio.c, I see that it has defrag-code in it. Should I
> try adding some debug output to this code to get some insight why this
> does not work, or not as effective as it should?

Don't start there, since it is complicated and timing-dependent.  Maybe
add some printfs to make it easy to see when it enters and leaves defrag
mode.

>> Apparently you found a way to reproduce the serious fragmentation
>> problems.
>
> A key factor seems to be the "Rewriting" operation. I see no problem
> during the "normal" writing, nor could I reproduce it with concurrent
> dd runs.

I don't know exactly what bonnie rewrite mode does.  Is it just read/
[modify]/write of sequential blocks with a fairly small block size?
Old bonnie docs say that the block size is always 8K.  One reason I
don't like bonnie.  Clustering should work fairly normally with that.
Anything with random seeks would break clustering.

>> Increasing BKVASIZE would take more work than this, since although it
>> was intended to be a system parameter which could be changed to reduce
>> the fragmentation problem, one of the many bugs in it is that it was
>> never converted into a "new" kernel option.  Another of the bugs in
>> it is that doubling it halves the number of buffers, so doubling it
>> does more than use twice as much kva.  This severely limited the number
>> of buffers back when memory sizes were 64MB.  It is not a very
>> significant limitation if the memory size is 1GB or larger.
>
> Should I try to experiment with BKVASIZE of 65536? If so, can I
> somehow up the number of buffers again? Also, after modifying
> BKVASIZE, is it sufficient to compile and install a new kernel, or do
> I have to build and install the entire world?

Just the kernel, but changing sys/param.h will make most of the world
want to recompile itself according to dependencies.  I don't like rebuilding
things, and often set timestamps in header files back to what they were
to avoid rebuilding (after rebuilding only the object files that actually
depend on the change).  Use this hack with caution, or rebuild kernels in
a separate tree that doesn't affect the world.

>>> [second bonnie goes Rewriting as well]
>>> 00-04-24.log:vfs.numdirtybuffers: 11586
>>> 00-04-25.log:vfs.numdirtybuffers: 16325
>>> 00-04-26.log:vfs.numdirtybuffers: 24333
>>> ...
>>> 00-04-54.log:vfs.numdirtybuffers: 52096
>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem.  Try reducing
>> the dirty buffer watermarks.  Their default values are mostly historical
>> nonsense.
>
> You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
> reasonable starting values for experimenting? 800/200?

1000 or 10000 (if nbuf is 50000).  1000 is probably too conservative, but
I think it is plenty for most loads.

Bruce
Comment 4 Klaus Weber 2013-05-31 17:31:50 UTC
Sorry for the late reply, testing took longer than expected.

(I have combined your replies from separate mails into one, and
reordered some of the text.)

On Tue, May 28, 2013 at 10:03:10PM +1000, Bruce Evans wrote:
> On Mon, 27 May 2013, Klaus Weber wrote:
> >On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> >>On Sun, 26 May 2013, Klaus Weber wrote:

> However, I have never been able to reproduce serious fragmentation problems
> from using too-large-block sizes, or demonstrate significant improvements
> from avoiding the known fragmentation problem by increasing BKVASIZE.
> Perhaps my systems are too small, or have tuning or local changes that
> accidentally avoid the problem.
> 
> Apparently you found a way to reproduce the serious fragmentation
> problems.  Try using a block size that doesn't ask for the problem.
> (...)
> The reduced fsck time and perhaps the reduced number of cylinder groups
> are the main advantages of large clusters.  vfs-level clustering turns
> most physical i/o's into 128K-blocks (especially for large files) so
> there is little difference between the i/o speed for all fs block sizes
> unless the fs block size is very small.

I have now repeated the tests with several variations of block- and
fragment sizes. In all cases, I did two tests:

1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2

The dd is simply to give a rough idea of the performance impact of the
fs parameters, with the two bonnie++ processes, I was mainly
interested in performance and hangs when both bonnie++ processes are
in their "Rewriting" phase. I have also tested variations where the
block:fragment ratio does not follow the 8:1 recommendation.

64/8k, kernel unpatched:
dd: 1218155712 bytes/sec
bonnie++: around 300 MB/sec, then drops to 0 and system hangs

32/4k, kernel unpatched:
dd: 844188881 bytes/sec
bonnie++: jumps between 25 and 900 MB/sec, no hang

16/2k, kernel unpatched:
dd: 517996142 bytes/sec
bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of 
                  400-650 MB/sec, no hang

64/4k, kernel unpatched:
dd: 1156041159 bytes/sec
bonnie++: hangs system quickly once both processes are in rewriting

32/8k, kernel unpatched:
dd: 938072430 bytes/sec
bonnie++: 29-50 MB/sec, with 3-10 second "bursts" of 
                  up to 650 MB/sec, no hang (but I canceled the test
		  after an hour or so).

So while a file system created with the current (32/4k) or old (16/2k)
defaults does prevent the hangs, it also reduces the sequential write
performance to 70% and 43% of a 64/8k fs.

The problem seems to be 64k block size, not the 8k fragment size.

In all cases, the vfs.numdirtybuffers count remained fairly small as
long as the bonnie++ processes were writing the testfiles. It rose to
vfs.hidirtybuffers (slower with only one process in "Rewriting", much
faster when both processes are rewriting).

> >00-04-57.log:vfs.numdirtybuffers: 52098
> >00-05-00.log:vfs.numdirtybuffers: 52096
> >[ etc. ]
> 
> This is a rather large buildup and may indicate a problem.  Try reducing
> the dirty buffer watermarks. 

I have tried to tune system parameters as per your advice, in an attempt
to get a 64/8k fs running stable and with reasonable write performance.
(dd: results omitted for brevity, all normal for 64/8k fs)

all with 64/8k, kernel unpatched:
vfs.lodirtybuffers=250
vfs.hidirtybuffers=1000
vfs.dirtybufthresh=800

bonnie++: 40-150 MB/sec, no hang

vfs.numdirtybuffers raises to 1000 when both processes are rewriting.


vfs.lodirtybuffers=1000
vfs.hidirtybuffers=4000
vfs.dirtybufthresh=3000

bonnie++: 50-380 MB/sec, no hang

For the next tests, I kept lo/hidirtybuffers at 1000/4000, and only
varied dirtybufthresh:
1200: bonnie++: 80-370 MB/sec
1750: bonnie++: around 600 MB/sec
1900: bonnie++: around 580 MB/sec. vfs.numdirtybuffers=3800 (i.e. does not
          reach vfs.hidirtybuffers anymore!)
(no hangs in any of the tests).

I then re-tested with lo/hidirtybuffers at their defaults, and only
dirtybufthresh set to slightly less half of hidirtybuffers:

vfs.lodirtybuffers=26069
vfs.hidirtybuffers=52139
vfs.dirtybufthresh=26000

dd: 1199121549 bytes/sec
bonnie++: 180-650 MB/sec, mostly around 500, no hang


By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
I found that
(vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly
less than vfs.hidirtybuffers for reasonable performance without hangs.

vfs.numdirtybuffers rises to
(vfs.dirtybufthresh) * (number of bonnie++ processes)
and as long as this is below vfs.hidirtybuffers, the system will not
hang.
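In other words (a toy model of the observed numbers, not of the kernel logic):

```python
# Toy model of the empirical rule above: numdirtybuffers peaks near
# dirtybufthresh * n_writers, and the system hangs once that peak
# reaches hidirtybuffers.  Pure arithmetic on the measured values.
def peak_dirty(dirtybufthresh, n_writers):
    return dirtybufthresh * n_writers

def expect_hang(dirtybufthresh, n_writers, hidirtybuffers):
    return peak_dirty(dirtybufthresh, n_writers) >= hidirtybuffers

# tuned values that stayed stable: peak 3800 < hidirtybuffers 4000
print(expect_hang(1900, 2, 4000))      # False
# defaults on this machine with 2 bonnie++ processes: hangs
print(expect_hang(46925, 2, 52139))    # True
```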


> >>I found that
> >>the problem could be fixed by killing cluster_write() by turning it into
> >>bdwrite() (by editing the running kernel using ddb, since this is easier
> >>than rebuilding the kernel).  I was trying many similar things since I
> >>had a theory that cluster_write() is useless.  [...]
> >
> >If that would provide a useful datapoint, I could try if that make a
> >difference on my system. What changes would be required to test this?
> >
> >Surely its not as easy as replacing the function body of
> >cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
> 
> That should work for testing, but it is safer to edit ffs_write()
> and remove the block where it calls cluster_write() (or bawrite()),
> so that it falls through to call bdwrite() in most cases.

I was not sure whether to disable the "bawrite(bp);" in the else part
as well. Here is what I used for the next test (in ffs_write):

  } else if (vm_page_count_severe() ||
              buf_dirty_count_severe() ||
              (ioflag & IO_ASYNC)) {
          bp->b_flags |= B_CLUSTEROK;
          bawrite(bp);
          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
                  bp->b_flags |= B_CLUSTEROK;
                  cluster_write(vp, bp, ip->i_size, seqcount);
          } else {
                  bawrite(bp);
                  } KWKWKW */
  } else if (ioflag & IO_DIRECT) {
          bp->b_flags |= B_CLUSTEROK;
          bawrite(bp);
  } else {
          bp->b_flags |= B_CLUSTEROK;
          bdwrite(bp);
  }

dd: 746804775 bytes/sec

During the dd tests, iostat shows a weird, sawtooth-like behavior:
 64.00 26730 1670.61   0  0 12  3 85
 64.00 13308 831.73   0  0  4  1 95
 64.00 5534 345.85   0  0 10  1 89
 64.00  12  0.75   0  0 16  0 84
 64.00 26544 1658.99   0  0 10  2 87
 64.00 12172 760.74   0  0  3  1 95
 64.00 8190 511.87   0  0  8  1 91
 64.00  10  0.62   0  0 14  0 86
 64.00 22578 1411.11   0  0 14  3 83
 64.00 12634 789.63   0  0  3  1 95
 64.00 11695 730.96   0  0  6  2 92
 48.00   7  0.33   0  0 13  0 87
 64.00 11801 737.58   0  0 17  1 82
 64.00 19113 1194.59   0  0  6  2 92
 64.00 15996 999.77   0  0  4  2 94
 64.00   3  0.19   0  0 13  0 87
 64.00 10202 637.63   0  0 16  1 83
 64.00 20443 1277.71   0  0  8  2 90
 64.00 15586 974.10   0  0  4  1 95
 64.00 682 42.64   0  0 13  0 87

With two bonnie++ processes in "Writing intelligently" phase, iostat
jumped between 9 and 350 MB/sec. I had cancelled the test before the
first bonnie++ process reached "Rewriting..." phase, due to the dismal
performance.

Already during the "Writing intelligently" phase, vfs.numdirtybuffers
reaches vfs.hidirtybuffers (during previous tests, vfs.numdirtybuffers
only rose to high numbers in the "Rewriting..." phase).


After reverting the source change, I have decided to try mounting the
file system with "-o noclusterr,noclusterw", and re-test. This is
equivalent to disabling only the if-part of the expression in the
source snippet above.

dd: 1206199580 bytes/sec
bonnie++: 550-700 MB/sec, no hang

During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
and dirtybufthresh are at their defaults:
vfs.dirtybufthresh: 46925
vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
vfs.numdirtybuffers: 15

So it looks like you were spot-on by suspecting cluster_write(). 
Further tests confirmed that "-o noclusterw" is sufficient to prevent
the hangs and provide good performance. "-o noclusterr" on its own
makes no difference; the system will hang.


I have also have tested with write-clustering enabled, but with
vfs.write_behind=0 and vfs.write_behind=2, respectively. In both
cases, the system hangs with two bonnie++ in "Rewriting...".


I have also tested with BKVASIZE set to 65536. As you explained, this
reduced the number of buffers:
vfs.dirtybufthresh: 11753
vfs.hidirtybuffers: 13059
vfs.lodirtybuffers: 6529

dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
numdirtybuffers reaches max:

vfs.numdirtybuffers: 13059

Even though numdirtybuffers reaches hidirtybuffers, the system does
not hang, but performance is not very good.

With BKVASIZE set to 65536 _and_ the fs mounted "-o noclusterw",
the performance is same as with BKVASIZE of 16k, and the system does
not hang.

I have now reverted BKVASIZE to its default, as the main factor for a
stable and fast system seems to be the noclusterw mount option.


> >>Apparently you found a way to reproduce the serious fragmentaion
> >>problems.
> >
> >A key factor seems to be the "Rewriting" operation. I see no problem
> >during the "normal" writing, nor could I reproduce it with concurrent
> >dd runs.
> 
> I don't know exactly what bonnie rewrite mode does.  Is it just read/
> [modify]/write of sequential blocks with a fairly small block size?
> Old bonnie docs say that the block size is always 8K.  One reason I
> don't like bonnie.  Clustering should work fairly normally with that.
> Anything with random seeks would break clustering.

Here is the relevant part from bonnie++'s source (in C++):
---------------
bufindex = 0;
for(words = 0; words < num_chunks; words++)
{ // for each chunk in the file
  dur.start();
  if (file.read_block(PVOID(buf)) == -1)
    return 1;
  bufindex = bufindex % globals.io_chunk_size();
  buf[bufindex]++;
  bufindex++;
  if (file.seek(-1, SEEK_CUR) == -1)
    return 1;
  if (file.write_block(PVOID(buf)) == -1)
    return io_error("re write(2)");
-----------
globals.io_chunk_size() is 8k (by default and in all of my tests), and
bonnie++ makes sure that buf is page-aligned. 

So you are correct: bonnie++ re-reads the file that was created
previously in the "Writing intelligently..." phase in blocks, modifies
one byte in the block, and writes the block back.

Something in this specific workload is triggering the huge buildup of
numdirtybuffers when write-clustering is enabled.
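For reference, the same access pattern as stand-alone Python (just the syscall sequence, with arbitrary small sizes instead of bonnie++'s actual test size). Note that each write is sequential with respect to the previous write, but the file offset moves backwards between each read and the following write:

```python
import os
import tempfile

# Stand-alone illustration of bonnie++'s "Rewriting..." phase: read a
# chunk, modify one byte, seek back over the chunk, write it in place.
CHUNK = 8 * 1024      # bonnie++'s io_chunk_size
N_CHUNKS = 4          # small stand-in for the real test file

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * (CHUNK * N_CHUNKS))
    os.lseek(fd, 0, os.SEEK_SET)
    for _ in range(N_CHUNKS):
        buf = bytearray(os.read(fd, CHUNK))    # read the next chunk
        buf[0] = (buf[0] + 1) % 256            # modify one byte
        os.lseek(fd, -CHUNK, os.SEEK_CUR)      # seek back over it
        os.write(fd, buf)                      # rewrite it in place
    os.lseek(fd, 0, os.SEEK_SET)
    rewritten = os.read(fd, CHUNK * N_CHUNKS)  # read back the result
finally:
    os.close(fd)
    os.unlink(path)
```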


I am now looking at vfs_cluster.c to see whether I can find which part
is responsible for letting numdirtybuffers rise without bound and
why only *re* writing a file causes problems, not the initial
writing. Any suggestions on where to start looking are very welcome.

Klaus
Comment 5 Bruce Evans freebsd_committer 2013-06-03 22:09:59 UTC
On Fri, 31 May 2013, Klaus Weber wrote:

> Sorry for the late reply, testing took longer than expected.

This is very interesting.  It will take me a long time to complete
testing and replying too.  Maybe more later.

> (I have combined your replies from separate mails into one, and
> reordered some of the text.)

Very readable now.

> ...
> I have now repeated the tests with several variations of block- and
> fragment sizes. In all cases, I did two tests:
>
> 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
> 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
>    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
>
> The dd is simply to give a rough idea of the performance impact of the
> fs parameters, with the two bonnie++ processes, I was mainly
> interested in performance and hangs when both bonnie++ processes are
> in their "Rewriting" phase. I have also tested variations where the
> block:fragment ratio does not follow the 8:1 recommendation.
>
> 64/8k, kernel unpatched:
> dd: 1218155712 bytes/sec
> bonnie++: around 300 MB/sec, then drops to 0 and system hangs

My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified),
there is at most a 50% drop with 2 bonnies and no tuning, and no hangs.
Except it hung in my debugging code for understanding a related problem.
I had this code enabled for a month or 2 and had forgotten about it.
Only 2 bonnies caused enough load to reach it.

> 32/4k, kernel unpatched:
> dd: 844188881 bytes/sec
> bonnie++: jumps between 25 and 900 MB/sec, no hang
>
> 16/2k, kernel unpatched:
> dd: 517996142 bytes/sec
> bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of
>                  400-650 MB/sec, no hang

The non-hangs might be due to BKVASIZE not being a problem.

> So while a file system created with the current (32/4k) or old (16/2k)
> defaults  does prevent the hangs, it also reduces the sequential write
> performance to 70% and 43% of an 64/8k fs.

I think bufdaemon and other write-pressure handling code just don't work
as well as upper layers for clustering and ordering the writes.  They
are currently broken, as shown by the hangs.  Clustering shouldn't be
generating so much pressure.  I can explain why it does (see below).
But random i/o can easily generate the same pressure, so bufdaemon etc.
need to handle it.  Also, pressure should be the usual case for async
mounts, and bufdaemon etc. should be able to do clustering and ordering
_better_ than upper layers, since they have more blocks to work with.
cluster_write() intentionally leaves as much as possible to bufdaemon
in order to potentially optimize it.

> In all cases, the vfs.numdirtybuffers count remained fairly small as
> long as the bonnie++ processes were writing the testfiles. It rose to
> vfs.hidirtybuffers (slowler with only one process in "Rewriting", much
> faster when both processes are rewriting).

The slow buildup is caused by fragmentation.  If the files happen to
be laid out contiguously, then cluster_write() acts like bawrite(),
but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
it normally combines just 2 64K-blocks to create a 128K-cluster, and
does 1 write instead of 2.  After this write, both methods have the
same number of dirty buffers (none for data, but a couple for metadata).
In more detail, the write pattern with bonnie's 8K-blocks is:

   write; bdwrite
   ...
   write; bdwrite  (7 bdwrites to the same buffer)
   write; cluster_write or bawrite (buffer is now full, and dirty iff clusterw)
   write; bdwrite
   ...
   write; bdwrite  (7 bdwrites to the next buffer)
   write; cluster_write or bawrite (2 buffers are now full and both are clean)

However, with fragmentation, cluster_write() "leaks" a dirty buffer
although it is full, while bawrite() just writes all full buffers.  The
leakage is because cluster_write() wants physical contiguity.  This is
too much to ask for.  The leak may be fast if the file system is very
fragmented.
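A counting model makes the rate obvious (grossly simplified; the real cluster code tracks much more state than this):

```python
# Counting model of the "leak": a full buffer is pushed out only when
# the next block is physically contiguous, so a full buffer sitting at
# the end of an extent stays dirty, waiting for a neighbour that never
# comes.  One dirty buffer leaks per extent boundary crossed.
def dirty_buffers_left(n_blocks, extent_starts):
    """extent_starts: block numbers where a new physical extent begins."""
    starts = set(extent_starts)
    leaked = 0
    for blk in range(n_blocks):
        if blk + 1 in starts:      # next block starts a new extent:
            leaked += 1            # this full buffer is left dirty
    return leaked

print(dirty_buffers_left(1000, [0]))                 # 0 (contiguous file)
print(dirty_buffers_left(1000, range(0, 1000, 2)))   # 499 (2-block extents)
```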

I happened to test on a file system which was almost full and thus very
fragmented.  Two test files of size 256M each had 452 extents (counting
indirect blocks as an extent although the might be contiguous).  The
buildup with my test program (which tries to produce bonnie's i/o pattern)
was often to the same magic number (something like 768, a little higher
than the number of extents).

This much leakage was enough to make clustering a pessimization, although
it worked to combine 4 blocks in most cases (my fs block size is 16K and
my max cluster size is 64K).  The disks are old, so 16K is as good an i/o
size as 64K for them.  But smaller sizes wouldn't be.

>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem.  Try reducing
>> the dirty buffer watermarks.
>
> I have tried to tune system parameters as per your advice, in an attempt
> to get a 64/8k fs running stable and with reasonable write performance.
> (dd: results omitted for brevity, all normal for 64/8k fs)

I got mostly worse behaviour by tuning things related to dirtybuffers.

> [... You managed to fix the hangs, at a cost of too much performance. ]

> By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> I found that
> (vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly
> less than vfs.hidirtybuffers for reasonable performance without hangs.
>
> vfs.numdirtybuffers rises to
> (vfs.dirtybufthresh) * (number of bonnie++ processes)
> and as long as this is below vfs.hidirtybuffers, the system will not
> hang.

I wonder why it is linear in the number of processes.  I saw indications
of similar behaviour (didn't test extensively).  1 bonnie created
about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
yours.


> ...
> I was not sure whether to disable the "bawrite(bp);" in the else part
> as well. Here is what I used for the next test (in ffs_write):
>
>  } else if (vm_page_count_severe() ||
>              buf_dirty_count_severe() ||
>              (ioflag & IO_ASYNC)) {
>          bp->b_flags |= B_CLUSTEROK;
>          bawrite(bp);

Note that this bawrite() is used when hidirtybuffers is reached.  This
is OK for this buffer if this buffer is large, but it doesn't stop
the flood -- the system stays saturated at hidirtybuffs until bonnie
stops writing.  This has a bad effect on other parts of the system,
e.g., if they are writing small buffers.

>          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
>          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
>                  bp->b_flags |= B_CLUSTEROK;
>                  cluster_write(vp, bp, ip->i_size, seqcount);
>          } else {
>                  bawrite(bp);
>                  } KWKWKW */

OK.

I tried a bdwrite() for the async mount case.  Async mounts should give
delayed writes always, but it is a bugfeature that they give delayed
writes for critical metadata but normally async() writes for data (via
cluster_write()).  Using bdwrite here let me control the behaviour
using a combination of mount flags.  It became clearer that the
bugfeature is not just a bug.  Without cluster_write() forcing out
writes for async mounts, the write pressure is much the same as from
bonnie rewrite.

> During the dd tests, iostat shows a weird, sawtooth-like behavior:
> 64.00 26730 1670.61   0  0 12  3 85
> ...

Some bad interaction of bufdaemon with the writers.  I believe there are
locking problems, but didn't know anything about them until recently.
FreeBSD-5.2 is simpler and seems to have fewer of them.  I got a deadlock
in it by putting a sleep waiting for the write pressure to clear in the
above.  This gave deadlock, since bufdaemon is apparently prevented from
doing anything while we hold a lock.

> ...
> After reverting the source change, I have decided to try mounting the
> file system with "-o noclusterr,noclusterw", and re-test. This is
> equivalent to disabling only the if-part of the expression in the
> source snippet above.

Good idea.

> dd: 1206199580 bytes/sec
> bonnie++: 550-700 MB/sec, no hang
>
> During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
> and dirtybufthresh are at their defaults:
> vfs.dirtybufthresh: 46925
> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> vfs.numdirtybuffers: 15
>
> So it looks like you were spot-on by suspecting cluster_write().

Not for the reasons that I suspected it!

> Further tests confirmed that "-o noclusterw" is sufficient to prevent
> the hangs and provide good performance. "-o noclusterr" on its own
> makes no difference; the system will hang.

I thought -noclusterw would be worse until now...

> I have also tested with BKVASIZE set to 65536. As you explained, this
> reduced the number of buffers:
> vfs.dirtybufthresh: 11753
> vfs.hidirtybuffers: 13059
> vfs.lodirtybuffers: 6529
>
> dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
> with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
> numdirtybuffers reaches max:
>
> vfs.numdirtybuffers: 13059
>
> Even though numdirtybuffers reaches hidirtybuffers, the system does
> not hang, but performance is not very good.

... I hoped BKVASIZE would be the problem, but it is at most a secondary
problem.

> Here is the relevant part from bonnie++'s source (in C++):
> ...
> So you are correct: bonnie++ re-reads the file that was created
> previously in the "Writing intelligently..." phase in blocks, modifies
> one byte in the block, and writes the block back.
>
> Something in this specific workload is triggering the huge buildup of
> numdirtybuffers when write-clustering is enabled.

I can explain this.  Readers and writers share the offset for the
sequential heuristic (it is per-open-file), so cluster_write() cannot
tell that bonnie's writes are sequential.  The sequential_heuristic()
function sees the offset moving back and forth, which normally means
random access, so it tells cluster_write() that the access is random,
although bonnie seeks back after each read() so that all the
writes() are sequential.  cluster_write() rarely if ever pushes out
random writes, so the write pressure builds up to a saturation value
almost instantly.  Then all subsequent writes work even more poorly
due to the pressure.  Truly random writes would give similar buildup
of pressure.
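As a toy model of this (the real sequential_heuristic() also maintains a seqcount and other state, so this only captures the gist):

```python
# Simplified model of the shared-offset problem: an access counts as
# sequential only if it starts where the previous access (read OR
# write) on the same open file ended.
def classify(accesses):
    """accesses: (start, length) pairs on one open file."""
    out, last_end = [], 0
    for start, length in accesses:
        out.append("seq" if start == last_end else "random")
        last_end = start + length
    return out

CHUNK = 8 * 1024
# rewrite pattern: read chunk n, seek back, write chunk n, read n+1, ...
accesses = []
for n in range(3):
    accesses.append((n * CHUNK, CHUNK))   # read
    accesses.append((n * CHUNK, CHUNK))   # write after seeking back
print(classify(accesses))
# ['seq', 'random', 'seq', 'random', 'seq', 'random']
# -> every write is classified as random, although the writes by
#    themselves are perfectly sequential
```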

> I am now looking at vfs_cluster.c to see whether I can find which part
> is responsible for letting numdirtybuffers rise without bound and
> why only *re* writing a file causes problems, not the initial
> writing. Any suggestions on where to start looking are very welcome.

It is very complicated, but it was easy to find its comments saying that
it tries not to force out the writes for non-sequential accesses.  I
am currently trying the following workarounds:

% diff -u2 vfs_cluster.c~ vfs_cluster.c
% --- vfs_cluster.c~	Tue Apr  6 20:11:55 2004
% +++ vfs_cluster.c	Tue Jun  4 04:53:52 2013
% @@ -593,4 +714,6 @@
%  }
% 
% +int	buf_dirty_count_large(void);
% +
%  /*
%   * Do clustered write for FFS.

This is like buf_dirty_count_severe() except it has a lower threshold.
It doesn't really help.

% @@ -630,4 +753,44 @@
%  		vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
% 
% +	/*
% +	 * XXX the 13 year old changes in the algorithm to use seqcount
% +	 * are too aggressive.  Avoid them by faking sequential access if
% +	 * we are under write pressure.  This causes us to write out
% +	 * clusters in more cases (in particular, when clusters are
% +	 * complete).

This doesn't really help.  It's too late to start faking after pressure
has built up.

% +	 *
% +	 * XXX we should probably use a more conservative condition for
% +	 * write pressure than is currently available.  Try not to leave
% +	 * anything to bufdaemon, since it can too easily fail to clear
% +	 * write pressure because the processes creating the write
% +	 * pressure have the vnode locked.  E.g., use lodirtybuffers
% +	 * instead of hidirtybuffers in buf_dirty_count_severe(), and
% +	 * also check dirtybufthresh.

I use lodirtybuffers as a default.  This is still too high, and almost
any value reduces performance.

% +	 *
% +	 * XXX seqcount itself is broken.  It gets cleared by reads of
% +	 * the same open file, even if the application seeks back after
% +	 * reading so that writes are perfectly sequential.  This case
% +	 * is probably unimportant, but bonnie does it.

As described above, in different words.

% +	 *
% +	 * XXX I think that if bonnie didn't seek back then the write
% +	 * pressure would be the same (provided the bonnie write size is
% +	 * smaller than the fs block size).  Then the writes wouldn't be
% +	 * sequential, so a fixed seqcount wouldn't help force them out.
% +	 * But they need to be forced out under write pressure, and I
% +	 * don't see anything to do it.  Check this.
% +	 *
% +	 * XXX the writing system shouldn't collapse when we get this
% +	 * slightly wrong.  Why doesn't bdwrite() force things when
% +	 * we don't?  It has some code to do so.  It is still better
% +	 * for us to force out full clusters than to leave it to
% +	 * the writing system to possibly force out sub-buffers.
% +	 *
% +	 * XXX try killing async under the same conditions.

This is too late to reduce pressure from async too.

% +	 */
% +	if (vm_page_count_severe() || buf_dirty_count_large()) {
% +		seqcount = 2;
% +		async = 0;
% +	}
% +
%  	if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
%  	    (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
% @@ -719,4 +882,5 @@
%  		} else {	/* Wait for rest of cluster */
%  			vp->v_cstart = lbn;
% +			/* XXX problem if we're async and orig maxclen = 0. */
%  			bdwrite(bp);
%  		}

This case for async breaks the bugfeature of not honoring it.  The
bugfeature often prevents buildup of excessive pressure, but some cases
arrive here and then it doesn't.  This is the only old change I have in
cluster_write().

% @@ -726,8 +890,13 @@
%  		 * are operating sequentially, otherwise let the buf or
%  		 * update daemon handle it.
% +		 *
% +		 * Algorithm changeback: ignore seqcount here, at least for
% +		 * now, to work around readers breaking it for writers.  It
% +		 * is too late to start ignoring after write pressure builds
% +		 * up, since not writing out here is the main source of the
% +		 * buildup.
%  		 */
%  		bdwrite(bp);
% -		if (seqcount > 1)
% -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
% +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
%  		vp->v_clen = 0;
%  		vp->v_cstart = lbn + 1;

This works fairly well.  Changing the algorithm back in all cases reduces
performance, but here we have usually finished with the buffer.  Even if
seqcount is fixed so that it isn't clobbered by reads, it might be right
to ignore it here, so as to reduce write pressure, until write pressure is
handled better.

I used the following program to generate i/o like bonnie's:

%%%
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define	IOSIZE	8192

static char buf[IOSIZE] __aligned(4096);

int
main(int argc, char **argv)
{
	ssize_t n;
	int ifd, ofd;

	if (argc != 2 && argc != 3)
		errx(1, "usage: ./prog file1 [file2]");
	ifd = open(argv[1], O_RDWR);
	if (ifd < 0)
		err(1, "open");
	if (argc == 3)
		ofd = open(argv[2], O_RDWR);
	else
		ofd = ifd;
	if (ofd < 0)
		err(1, "open");
	for (;;) {
		n = read(ifd, buf, IOSIZE);
		if (n == 0) {
			printf("finished: ioffset %#jx; ooffset = %#jx\n",
			    (intmax_t)lseek(ifd, 0, SEEK_CUR),
			    (intmax_t)lseek(ofd, 0, SEEK_CUR));
			exit(0);
		}
		if (n < 0)
			err(1, "read error");
		if (n != IOSIZE)
			errx(1, "short read %d", (int)n);
		if (ofd == ifd)
			lseek(ofd, -IOSIZE, SEEK_CUR);
		n = write(ofd, buf, IOSIZE);
		if (n < 0)
			err(1, "write error");
		if (n != IOSIZE)
			errx(1, "short write %d", (int)n);
	}
}
%%%

Usage: create a test file zz of a suitable size (the program normally
doesn't extend it); then ./prog zz does read/write like bonnie;
./prog zz zz does the same with independent file positions.

Bruce
Comment 6 Klaus Weber 2013-06-10 00:07:21 UTC

On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> On Fri, 31 May 2013, Klaus Weber wrote:

Again, I have reordered some sections of your mail in my reply. (I have
also removed freebsd-bugs from Cc: - it seems that replies to PRs get
forwarded there automatically, and judging from the web interface to
the mailing list archives, the previous mails were all duplicated.)

> >So you are correct: bonnie++ re-reads the file that was created
> >previously in the "Writing intelligently..." phase in blocks, modifies
> >one byte in the block, and writes the block back.
> >
> >Something in this specific workload is triggering the huge buildup of
> >numdirtybuffers when write-clustering is enabled.
> 
> I can explain this.  Readers and writers share the offset for the
> sequential heuristic (it is per-open-file), so cluster_write() cannot
> tell that bonnie's writes are sequential.  The sequential_heuristic()
> function sees the offset moving back and forth, which normally means
> random access, so it tells cluster_write() that the access is random,
> although bonnie seeks back after each read() so that all the
> writes() are sequential. 

Indeed. Using the test program you provided, I can confirm that
reading and writing to the same file via _different_ file descriptors
works without hangs and with good performance (similar to "-o
noclusterw", see below for details). vfs.numdirtybuffers count
remained low, and write rates pretty stable.

Since your program is more flexible than bonnie++, I have done all
tests in this mail with it.

> cluster_write() rarely if ever pushes out
> random writes, so the write pressure builds up to a saturation value
> almost instantly.  Then all subsequent writes work even more poorly
> due to the pressure.  Truly random writes would give similar buildup
> of pressure.

I have briefly tested and confirmed this as well (see below). Mounting
a file system with "-o async" shows the "quickly raising
vfs.numdirtybuffers" symptom already during writing (not re-writing) a
file, e.g. with dd. I could not get the system to hang, but I did not
test this very thoroughly.


> >So while a file system created with the current (32/4k) or old (16/2k)
> >defaults  does prevent the hangs, it also reduces the sequential write
> >performance to 70% and 43% of a 64/8k fs.
> XXX
> I think bufdaemon and other write-pressure handling code just don't work
> as well as upper layers for clustering and ordering the writes.  They
> are currently broken, as shown by the hangs.  Clustering shouldn't be
> generating so much pressure.  I can explain why it does (see below).
> But random i/o can easily generate the same pressure, so bufdaemon etc.
> need to handle it.  Also, pressure should be the usual case for async
> mounts, and bufdaemon etc. should be able to do clustering and ordering
> _better_ than upper layers, since they have more blocks to work with.
> cluster_write() intentionally leaves as much as possible to bufdaemon
> in order to potentially optimize it.
> 
> >In all cases, the vfs.numdirtybuffers count remained fairly small as
> >long as the bonnie++ processes were writing the testfiles. It rose to
> >vfs.hidirtybuffers (slowler with only one process in "Rewriting", much
> >faster when both processes are rewriting).
> 
> The slow buildup is caused by fragmentation.  If the files happen to
> be laid out contiguously, then cluster_write() acts like bawrite(),
> but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
> it normally combines just 2 64K-blocks to create a 128K-cluster, and
> does 1 write instead of 2.  After this write, both methods have the
> same number of dirty buffers (none for data, but a couple for metadata).
>
> [Analysis of bonnie++'s write pattern]
>
> However, with fragmentation, cluster_write() "leaks" a dirty buffer
> although it is full, while bawrite() just writes all full buffers.  The
> leakage is because cluster_write() wants physical contiguity.  This is
> too much to ask for.  The leak may be fast if the file system is very
> fragmented.

In all of my (older) tests with bonnie++, the file system was newfs'ed
before mounting it, so there was no fragmentation for me.


> >I have tried to tune system parameters as per your advice, in an attempt
> >to get a 64/8k fs running stable and with reasonable write performance.
> >(dd: results omitted for brevity, all normal for 64/8k fs)
> 
> I got mostly worse behaviour by tuning things related to dirtybuffers.

Me too, performance-wise. The only exception was reducing
vfs.dirtybufthresh so that vfs.numdirtybuffers no longer reaches
vfs.hidirtybuffers; this is needed to prevent the hangs. The exact
value does not seem to matter much.


> >[... You managed to fix the hangs, at a cost of too much performance. ]
> 
> >By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> >I found that
> >(vfs.dirtybufthresh) * (number of bonnie++ process) must be slightly
> >less than vfs.hidirtybuffers for reasonable performance without hangs.
> >
> >vfs.numdirtybuffers rises to
> >(vfs.dirtybufthresh) * (number of bonnie++ process)
> >and as long as this is below vfs.hidirtybuffers, the system will not
> >hang.

Actually, further testing revealed that it is not linear with the
number of processes, but with the number of files being
rewritten. With bonnie++, you cannot make this distinction; two
bonnie++ processes cannot re-write the same file.

However, with the test program you provided you can actually re-write
the same file from two (or more) processes. Testing this shows that
the correct formula is actually 

"vfs.numdirtybuffers rises to
(vfs.dirtybufthresh) * (number of files being re-written)
and as long as this is below vfs.hidirtybuffers, the system will not
hang."

So, any number of processes can re-write the same file, and the number
of processes has no effect on vfs.numdirtybuffers.


> I wonder why it is linear in the number of processes.  I saw indications
> of similar behaviour (didn't test extensively).  1 bonnie created
> about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
> vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
> yours.

Does this correlate with vfs.dirtybufthresh on your system as well?
(i.e. is vfs.dirtybufthresh about 2/3 of your vfs.hidirtybuffers?)


> I tried a bdwrite() for the async mount case.  Async mounts should give
> delayed writes always, but it is a bugfeature that they give delayed
> writes for critical metadata but normally async() writes for data (via
> cluster_write()).  Using bdwrite here let me control the behaviour
> using a combination of mount flags.  It became clearer that the
> bugfeature is not just a bug.  Without cluster_write() forcing out
> writes for async mounts, the write pressure is much the same as from
> bonnie rewrite.

I have tested this as well, and I can confirm that with a file system
mounted with "-o async" I see the problematic behavior with "normal"
writes (not re-writes) as well: while writing a large file (e.g. via
dd), vfs.numdirtybuffers rises quickly to vfs.dirtybufthresh. On
non-async file systems, I could only provoke this via re-writing.



> >After reverting the source change, I have decided to try mounting the
> >file system with "-o noclusterr,noclusterw", and re-test.
> > [...]
> >Further tests confirmed that "-o noclusterw" is sufficient to prevent
> >the hangs and provide good performance.
> 
> I thought -noclusterw would be worse until now...

I wanted to check the performance effects of write clustering more
thoroughly, so I have tested 3 variants (with the test program you
provided, using a 64 GB file (2x RAM)):

1) Re-writing a single file with two file descriptors, file system
   mounted with default options.

2) Re-writing a single file with two file descriptors, file system
   mounted with "-o noclusterw".

3) Re-writing a single file via a single file descriptor, file system
   mounted with "-o noclusterw" (required in this case to prevent the
   system hanging).

First, I have tested on a logical volume on the RAID array (3 test
runs each):

1) (2 desc, w/ clustering)
0.669u 97.119s 3:23.34 48.0%	5+187k 524441+524311io 0pf+0w
0.472u 95.501s 3:22.40 47.4%	5+188k 523095+524311io 0pf+0w
0.606u 93.263s 3:21.13 46.6%	5+188k 522999+524311io 0pf+0w

2) (2 desc, -o noclusterw)
0.524u 94.180s 3:19.76 47.4%	5+189k 524442+1048576io 0pf+0w
0.622u 95.200s 3:23.30 47.1%	5+187k 523090+1048576io 0pf+0w
0.475u 94.232s 3:19.96 47.3%	5+186k 522767+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
0.922u 95.917s 3:33.74 45.3%	5+187k 524442+1048576io 0pf+0w
0.679u 95.392s 3:32.62 45.1%	5+187k 522256+1048576io 0pf+0w
0.976u 93.639s 3:33.68 44.2%	5+187k 521902+1048576io 0pf+0w

As you can see, the performance differences are very small. If you
consider the differences significant, then "-o noclusterw" is actually
a tiny bit faster than clustering. You also see clustering at work,
the number of writes doubles for cases 2) and 3).

I have then repeated the tests on a single disk, connected directly to
mainboard SATA port, to take the RAID controller out of the
equation. (With its CPU and on-board RAM, it certainly has the ability
to cluster and re-order writes after the kernel sends them.) Since the
tests take a long time to complete, I have only done a single test run
for each case.


1) (2 desc, w/ clustering)
3.497u 575.616s 24:39.49 39.1%	5+187k 524445+524314io 0pf+0w

2) (2 desc, -o noclusterw)
5.960u 876.735s 33:35.75 43.7%	5+187k 524445+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
7.014u 741.382s 29:56.98 41.6%	5+188k 524445+1048576io 0pf+0w

Here, clustering does seem to have a positive effect, and the
differences are more pronounced.

So it really seems that clustering does provide performance benefits,
but the RAID controller seems to be able to make up for the lack
of clustering (either because clustering is disabled, or because it
does not work effectively due to interspersed reads and seeks on the
same file descriptor).


> >I am now looking at vfs_cluster.c to see whether I can find which part
> is responsible for letting numdirtybuffers rise without bounds and
> >why only *re* writing a file causes problems, not the initial
> >writing. Any suggestions on where to start looking are very welcome.
> 
> It is very complicated, but it was easy to find its comments saying that
> it tries not to force out the writes for non-sequential accesses.  I
> am currently trying the following workarounds:

I have decided to start testing with only a single change from the
list of changes you provided:
 
> % diff -u2 vfs_cluster.c~ vfs_cluster.c
> % @@ -726,8 +890,13 @@
> %  		 * are operating sequentially, otherwise let the buf or
> %  		 * update daemon handle it.
> % +		 *
> % +		 * Algorithm changeback: ignore seqcount here, at least for
> % +		 * now, to work around readers breaking it for writers.  It
> % +		 * is too late to start ignoring after write pressure builds
> % +		 * up, since not writing out here is the main source of the
> % +		 * buildup.
> %  		 */
> %  		bdwrite(bp);
> % -		if (seqcount > 1)
> % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> %  		vp->v_clen = 0;
> %  		vp->v_cstart = lbn + 1;

And sure enough, a kernel with only this one line change[*] is able to
handle reading and re-writing a file through a single file descriptor
just fine, good performance, no hangs, vfs.numdirtybuffers remains
low:

4) (1 desc, cluster_wbuild_wb unconditionally)
0.864u 96.145s 3:32.34 45.6%	5+188k 524441+524265io 0pf+0w
0.976u 94.081s 3:29.70 45.3%	5+187k 489756+524265io 0pf+0w
0.903u 95.676s 3:33.38 45.2%	5+188k 523124+524265io 0pf+0w

Same test on a single disk, as above:

4) (1 desc, cluster_wbuild_wb unconditionally)
8.399u 822.055s 33:01.01 41.9%	5+187k 524445+524262io 0pf+0w

Unfortunately, for the single disk case, the performance is the same
as for the "-o noclusterw" case (even though the writes _are_
clustered now, see the number of write operations).


[*] I still need to confirm that this is really the only change in the
kernel, but I'm running out of time now. I will test this next weekend.

> This works fairly well.  Changing the algorithm back in all cases reduces
> performance, but here we have usually finished with the buffer.  Even if
> seqcount is fixed so that it isn't clobbered by reads, it might be right
> to ignore it here, so as to reduce write pressure, until write pressure is
> handled better.

For me, this change makes the difference between a working system with
good performance and a system that limps along and occasionally hangs
completely, at least for this specific workload.

Thank you for this!

Next weekend, I want to test some scenarios with and without this
change. Most importantly, I want to investigate

a) behavior with random I/O (any suggestions on a benchmark or test
program that can generate "representative" random I/O? IOzone?)

b) behavior with sequential writes

c) behavior with re-writing files, both via one and two file
descriptors

d) maybe test the other changes to vfs_cluster.c you provided, to see
whether they make a difference.

e) Investigate the role of vfs.dirtybufthresh. When vfs.numdirtybuffers
reaches this number, the system switches to a mode where
vfs.numdirtybuffers no longer increases and seems to handle the load
just fine. I want to understand why the system sees no need to limit
numdirtybuffers earlier. Maybe this can be used to improve re-write
performance to match the "2 desc, with clustering" case.

f) I have experienced system freezes when unmounting my test file
system after running the respective tests. I'm not sure whether this
is related to the current problem, or whether it is caused by some test
patch that I accidentally left in the code. I need to check whether
this is reproducible, and whether there is a specific workload that
causes it.

Klaus
Comment 7 Bruce Evans freebsd_committer 2013-06-10 02:00:29 UTC
On Mon, 10 Jun 2013, Klaus Weber wrote:

> On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
>> On Fri, 31 May 2013, Klaus Weber wrote:

This thread is getting very long, and I will only summarize a couple
of things that I found last week here.  Maybe more later.

o Everything seems to be working as well as intended (not very well)
   except in bufdaemon and friends.  Perhaps it is already fixed there.
   I forgot to check which version of FreeBSD you are using.  You may
   be missing some important fixes.  There were some by kib@ a few
   months ago, and some by jeff@ after this thread started.  I don't
   run any version of FreeBSD new enough to have these, and the version
   that I run also doesn't seem to have any serious bugs in bufdaemon.
   It just works mediocrely.

o Writing in blocks of size less than the fs block size, as bonnie
   normally does, gives much the same rewriting effect as bonnie does
   explicitly, because the system is forced to read each block before
   doing a partial write to it.  This at best doubles the amount of
   i/o and halves the throughput of the writes.

o There are some minor bugs in the read-before-write system code, and
   some differences in the "much the same" that may be important in
   some cases:
   - when bonnie does the read-before-write, the system normally uses
     cluster_read() on the read descriptor and thus uses the system's
     idea of sequentiality on the read descriptor.  For both ffs and
     msdosfs, this normally results in reading a small cluster (up to
     the end of the current read) followed by async read-ahead of a
     larger cluster or 2.  The separate clusters improve latency but
     reduce performance.  scottl@ recently committed a sysctl
     vfs.read_min for avoiding the earlier splitting.  Using it made
     some interesting but ultimately unimportant differences to the
     2-bonnie problem.
   - when the system does the read-before-write, ffs normally uses
     cluster_read() on the write descriptor and thus uses the system's
     idea of sequentiality on the write descriptor.  ffs doesn't know
     the correct amount to read in this case, and it always asks
     for MAXBSIZE, which is both too small and too large.  This value
     is the amount that should be read synchronously.  It is too large
     since the normal amount is the application's block size which is
     normally smaller, and it is too small since MAXBSIZE is only
     half of the max cluster size.  The correct tradeoff of latency
     vs throughput is even less clear than for a user read, and further
     off from being dynamic.  msdosfs doesn't even use cluster_read()
     for this (my bad).  It uses plain bread().  This gives very low
     performance when the block size is small.  So msdosfs worked much
     better in the 2-bonnie benchmark than for rewrites generated by
     dd just writing with a small block size and conv=notrunc.  After
     fixing this, msdosfs worked slightly better than ffs in all cases.
   - whoever does the read-before-write, cluster reading tends to generate
     a bad i/o pattern.  I saw patterns like the following (on ~5.2 where
     the max cluster size is only 64K, after arranging to mostly use this
     size):
       file1: read       64K offset 0
       file1: read ahead 64K offset 64K
       file2: read       64K offset 0
       file1: read ahead 64K offset 64K
       file1: write      64K offset 0
       file1: read       64K offset 128K
       file1: read ahead 64K offset 192K
       file2: write      64K offset 0
     The 2 files make the disk seek a lot, and the read-and-read-ahead
     gives even more seeks to get back to the write position.  My drives
     are old and have only about 2MB of cache.  Seeks with patterns like
     the above are apparently just large enough to break the drives'
     caching.  OTOH, if I use your trick of mounting with -noclusterw,
     the seeks are reduced significantly and my throughput increases by
     almost a factor of 2, even though this gives writes of only 16K.
     Apparently the seeks are reduced just enough for the drives' caches
     to work well.  I think the same happens for you.  Your i/o system
     is better, but it only takes a couple of bonnies and perhaps the
     read pointers getting even further ahead of the write pointers
     to defeat the drive's caching.  Small timing differences probably
     allow the difference to build up.  Mounting with -noclusterw also
     gives some synchronization that will prevent this buildup.
   - when the system does the read-before-write, the sequential heuristic
     isn't necessarily clobbered, but it turns out that the clobbering
     gives the best possible behaviour, except for limitations and bugs
     in bufdaemon!...

> [... good stuff clipped]
> So it really seems that clustering does provide performance benefits,
> but the RAID controller seems to be able to make up for the lack
> of clustering (either because clustering is disabled, or because it
> does not work effectively due to interspersed reads and seeks on the
> same file descriptor).

Yes, the seek pattern caused by async-but-not-long-delayed writes
(whether done by cluster_write() a bit later or bawrite() directly)
combined with reading far ahead (whether done explicitly or implicitly)
is very bad even for 1 file, but can often be compensated for by caching
in the drives.  With 2 files or random writes on 1 file it is much worse,
> but apparently mounting with -noclusterw limits it enough for the
drives to compensate in the case of 2 bonnies.  I think the best we
can do in general is delay writes as long as possible and then
schedule them perfectly.  But scheduling them perfectly is difficult
and only happens accidentally.

>>> I am now looking at vfs_cluster.c to see whether I can find which part
>>> is responsible for letting numdirtybuffers rise without bounds and
>>> why only *re* writing a file causes problems, not the initial
>>> writing. Any suggestions on where to start looking are very welcome.
>>
>> It is very complicated, but it was easy to find its comments saying that
>> it tries not to force out the writes for non-sequential accesses.  I
>> am currently trying the following workarounds:
>
> I have decided to start testing with only a single change from the
> list of changes you provided:
>
>> % diff -u2 vfs_cluster.c~ vfs_cluster.c
>> % @@ -726,8 +890,13 @@
>> %  		 * are operating sequentially, otherwise let the buf or
>> %  		 * update daemon handle it.
>> % +		 *
>> % +		 * Algorithm changeback: ignore seqcount here, at least for
>> % +		 * now, to work around readers breaking it for writers.  It
>> % +		 * is too late to start ignoring after write pressure builds
>> % +		 * up, since not writing out here is the main source of the
>> % +		 * buildup.
>> %  		 */
>> %  		bdwrite(bp);
>> % -		if (seqcount > 1)
>> % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> %  		vp->v_clen = 0;
>> %  		vp->v_cstart = lbn + 1;
>
> And sure enough, a kernel with only this one line change[*] is able to
> handle reading and re-writing a file through a single file descriptor
> just fine, good performance, no hangs, vfs.numdirtybuffers remains
> low:

After more testing, I found that this was almost perfectly backwards for
my hardware!  I think for your hardware it allows the drives to
compensate, much like with -noclusterw but with a slightly improved
throughput due to the larger writes.  But with my drives, it mostly
just gives more seeks.  After changing this back and being more careful
with the comparisons, I found that best results are obtained (in ~5.2)
by letting numdirtybuffers build up.  The breakage of the sequential
heuristic causes the above to never force out the cluster immediately
for the 2-bonnie case.  I get similar behaviour by always using delayed
writes in ffs_write().  This might depend on setting B_CLUSTEROK in more
cases, so that the clustering always gets done later.

Typical throughputs for me:
- my drives can do 55MB/sec max and get 48 for writing 1 file with large
   blocks using dd
- 48 drops to half of 20-24 with read-before-write for 1 file.  That's
   a 4-fold reduction.  One half is for the doubled i/o and the other half
   is for the seeks.
- half of 20-24 drops to half of 10-12 with 2 files and read-before-write
   of each, in the best case.  That's an 8-fold reduction.  Another factor
   of 2 is apparently lost to more seeks.
- half of 10-12 drops to half of 5-6, as in the previous point but in the
   worst case.  That's a 16-fold reduction.  The worst case is with my
   modification above.  It maximizes the seeks.  My original idea for a
   fix (in the above diff) gave this case.  It gave almost perfect
   clustering and almost no buildup of numdirtybuffers, but throughput
   was still worst.  (My drives can do 16K blocks at full speed provided
   the blocks are contiguous, so they don't benefit much from clustering
   except for its side effect of reducing seeks to other blocks in between
   accessing the contiguous ones.)
Some of this typical behaviour is not very dependent on block sizes.  The
drives become seek-bound, and anything that doubles the number of seeks
halves the throughput.

Bruce
Comment 8 Klaus Weber 2013-06-10 17:38:19 UTC
Just a quick reply while I am away from the system.

On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
> o Everything seems to be working as well as intended (not very well)
>   except in bufdaemon and friends.  Perhaps it is already fixed there.
>   I forgot to check which version of FreeBSD you are using.  You may
>   be missing some important fixes.  There were some by kib@ a few
>   months ago, and some by jeff@ after this thread started.  I don't
>   run any version of FreeBSD new enough to have these, and the version
>   that I run also doesn't seem to have any serious bugs in bufdaemon.
>   It just works mediocrely.

All tests were done with 9-STABLE from early May. I also tested with
a 10-CURRENT kernel, downloaded from an ftp mirror on May 5th,
2013. This kernel showed the same behavior, so I did all further
testing with the 9-STABLE version.

I will update my sources to 9-STABLE first thing next weekend, and
will also re-test with an up-to-date CURRENT kernel.

Klaus
Comment 9 Klaus Weber 2013-06-17 01:17:16 UTC
On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
> On Mon, 10 Jun 2013, Klaus Weber wrote:
> >On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> >>On Fri, 31 May 2013, Klaus Weber wrote:
>
> This thread is getting very long, and I will only summarize a couple
> of things that I found last week here.  Maybe more later.
>
> o Everything seems to be working as well as intended (not very well)
>   except in bufdaemon and friends.  Perhaps it is already fixed there.
>   I forgot to check which version of FreeBSD you are using.  You may
>   be missing some important fixes.  There were some by kib@ a few
>   months ago, and some by jeff@ after this thread started. 

I have now tested with both an up-to-date 9-STABLE as well as a
10-CURRENT kernel from ftp.freebsd.org (8.6.2013):

FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #27 r251798M: Sun Jun
16 16:19:18 CEST 2013 root@filepile:/usr/obj/usr/src/sys/FILEPILE
amd64

FreeBSD filepile 10.0-CURRENT FreeBSD 10.0-CURRENT #0: Sat Jun 8
22:10:23 UTC 2013 root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC
amd64

The bad news is that I can still reproduce the hangs reliably, with
the same methods as before. I do have the feeling that the machine
survives the re-write load a bit longer, but it's just a gut feeling -
I have no means to quantify this.

The "dirtybuf" sysctl values have changed their default values; I
believe you were involved in the discussion about how they are
computed that led to changing them. They are now:
vfs.lodirtybuffers: 13251
vfs.hidirtybuffers: 26502
vfs.dirtybufthresh: 23851

The 10-CURRENT kernel shows a message on the console when
the hang occurs (but see below for a twist):

g_vfs_done():da0p1[WRITE(offset=<x>, length=<y>)]error = 11

The message is printed to the console about once a second when the
hang occurs, until I hard-reset the server. <x> keeps changing,
<y> is mostly 65536, sometimes 131072 (like, one screenful of 64k's
then one 128k line, etc.)

Error=11 seems to be EDEADLK, so it looks like a -CURRENT kernel does
detect a deadlock situation whereas the -STABLE kernel doesn't. (I have
also re-checked the -CURRENT from 5.5.2013 which I briefly tested in
my initial report; it also shows these messages. My apologies for not
noticing this earlier.)

And now there is an additional point where things get really weird:
The "g_vfs_done()..." messages _only_ appear when the system hangs as
a result of bonnie++'s rewriting load. If I repeat the same test with
the test program you provided earlier, the system will still hang, but
without the console output.

(I have repeated the tests 10 times to make sure I wasn't just
imagining this: 5 times with bonnie++ (the message appears every
time), and 5 times with your test program (the message did not appear
a single time).)

I'm not really sure what difference between these two programs causes
this. Bonnie++ writes the file immediately before starting to rewrite
it, while your program works on pre-fabricated files - maybe that is it.


Anyway: Is there a way to find out which resource(s) are involved in
the EDEADLK-situation? Please keep in mind that I cannot enter the
debugger via the console, and panicking the machine leaves me with a
dead keyboard, which seems to make any useful debugging pretty hard.

(If you believe that a working debugger is required to make progress
on this, let me know and I will try to get a serial console working
somehow.)

>>[observations regarding seek behavior] 

I did not get to do much other testing this weekend (hanging the
system is pretty time-consuming, as after the hard-reset I have to
wait for the root partition to re-mirror). 

I have thought about how I could observe the seek behavior of my disks
when testing, but I have come to the conclusion that I really cannot.
My RAID controller presents the array as a single disk to the OS, and
its on-board RAM allows it to cache writes and perform read-aheads
without the kernel knowing, so neither reading from nor writing to two
blocks far apart will necessarily result in a physical head move.


I think I'll concentrate next on finding out where/why the hangs are
occurring. Getting rid of the hangs and hard resets will hopefully
speed up testing cycles.

Klaus
Comment 10 Klaus Weber 2013-07-21 18:42:15 UTC
Sorry for the late reply; I did lots of testing. I believe I have
identified the root cause of the buffer depletion and subsequent I/O
hangs.

Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:

        if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
                return;

This short-circuits the entire flushing code if _this bufobj_ is using
less than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh
is 90% of all available buffers). In other words, one bufobj may hog
up to 90% of all buffers.

This works well if only one file is being re-written. As documented
up-thread, with just one file vfs.numdirtybuffers rises until it 
reaches vfs.dirtybufthresh, and then remains relatively stable at that
point, and the system will not hang.

With two (or more) files being rewritten, the code above tries to let
_each_ file have 90% of all available buffers, which lets
vfs.numdirtybuffers hit vfs.hidirtybuffers.

Even worse, with two processes competing for buffer space, often both
remain under the magic 90% mark, effectively disabling all flushing of
buffers entirely. At that point, the only way buffers can be freed is
through flushing in other code paths, or when processes not involved
in this buffer-space competition release some. Sooner or later, the
system reaches a state where none of the bufobjs has more than 90% of
the buffers, and none of the other (normal) processes can release any
buffers. The only way out then is a hard reset.


With the following patch (also attached), the system performs
reasonably well and does not hang:

Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c  (revision 253339)
+++ sys/kern/vfs_bio.c  (working copy)
@@ -1146,7 +1146,7 @@
        if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
                (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
                altbufferflushes++;
-       } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+       } else if ( bdflush_required(bo) ) {
                BO_LOCK(bo);
                /*
                 * Try to find a buffer to flush.
@@ -1438,6 +1438,28 @@
 }
 
 /*
+ * Flushing is deemed necessary if at least one of the following is true:
+ * - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ * - numdirtybuffers exceeds dirtybufthresh
+ * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
+ *
+ * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+  struct thread *td = curthread;
+
+  KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+  if ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
+      (numdirtybuffers > dirtybufthresh) ||
+      (td && (td->td_pflags & TDP_NORUNNINGBUF)) )
+    return(1);
+  return(0);
+  }
+
+/*
  *     brelse:
  *
  *     Release a busy buffer and, if requested, free its resources.  The
Index: sys/sys/bio.h
===================================================================
--- sys/sys/bio.h       (revision 253339)
+++ sys/sys/bio.h       (working copy)
@@ -151,6 +151,10 @@
 #define physread physio
 #define physwrite physio
 
+extern struct bufobj *bo;
+int bdflush_required(struct bufobj *bo);
+
+
 #endif /* _KERNEL */
 
 #endif /* !_SYS_BIO_H_ */
Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c  (revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
@@ -2161,7 +2161,7 @@
        struct buf *nbp;
        int bp_bdskip;
 
-       if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+       if ( !bdflush_required(bo) )
                return;
 
        td = curthread;


I have modified ffs_bdflush() in ffs_snapshot.c and the generic
bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a
kernel of 5.2 vintage will still have this code in bdwrite().)

While the patch makes rewriting work reliably, it may need some
further refinements:

- Right now, it limits numdirtybuffers to dirtybufthresh. It might be
  better to use hidirtybuffers as a limit instead?

- the set of conditions for which the flush is performed needs to be
  reviewed by someone who understands this better than I do. Limiting
  numdirtybuffers is necessary to fix the hangs. Limiting
  bo->bo_dirty.bv_cnt was what the code did before, so I left this in
  as well. And finally, I'm testing for TDP_NORUNNINGBUF based on the
  idea that if bufdaemon, syncer or any process acting on their behalf
  sees the need to flush buffers, then it should not be up to *bdflush()
  to second-guess their intention.

- the patch may have style issues

Regarding the threshold and conditions: I am still in the process of
doing some basic benchmarks with and without the patch. However, it
seems that it causes a slight performance decrease in the cases that
work well without the patch (i.e. only one process re-writing). My
guess is that it now flushes more often than absolutely
necessary.

Performance is unchanged for cases that do not cause a huge buffer
buildup (reading and writing via separate file descriptors). Of
course, in cases that did result in hangs before and now don't, the
patch is a clear win.


I still believe there are some bugs in the clustering code that affect
this case, making the re-writing less efficient than it could be.

I wish I had the time to do more testing, but I need to bring the
server into production soon. I'm putting some notes here for
things I intended to test, but probably won't have the time to:

- in vfs_cluster.c, cluster_write_gb(), where it says "Change to
  algorithm...": Also push the previous cluster when it has reached
  maximum size, not only when seqcount > 0. No point waiting for
  buf_daemon when the cluster cannot grow. Verify that condition
  "vp->v_clen <=  cursize" really tests "reached max. size". XXX
  "maxclen <=  cursize"?  Add rate-limited printf()s!

- create a new "cluster_awrite", from DragonFly BSD:
  http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3
  (plug-compatible to bawrite(), but with clustering). Do not replace
  vfs_bio_awrite (yet). Then, selectively replace some bawrite() calls
  with calls to cluster_awrite() to identify where current bawrite()
  calls are inefficient. XXX requires mods to cluster_wbuild()!


If anyone wants me to test any additional things, or variants of the
patch above, I can still do this next weekend. 

Since the system no longer hangs with the patch, I think this bug can
now be closed.

Klaus
Comment 11 Klaus Weber 2013-07-21 18:45:25 UTC
On Sun, Jul 21, 2013 at 07:42:15PM +0200, Klaus Weber wrote:
> With the following patch (also attached)

Attachment was missing.

Klaus
Comment 12 Klaus Weber 2013-07-31 17:18:59 UTC
After studying style(9), I have now made a revised version of the
patch, relative to 9/STABLE of about 3 weeks ago:

Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c  (revision 253339)
+++ sys/kern/vfs_bio.c  (working copy)
@@ -1146,7 +1146,7 @@
        if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
                (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
                altbufferflushes++;
-       } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+       } else if (bdflush_required(bo)) {
                BO_LOCK(bo);
                /*
                 * Try to find a buffer to flush.
@@ -1438,6 +1438,24 @@
 }

 /*
+ *     Flushing is deemed necessary if at least one of the following is true:
+ *     - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ *     - numdirtybuffers exceeds dirtybufthresh
+ *     - we are or act as bufdaemon (TDP_NORUNNINGBUF)
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+       struct thread *td = curthread;
+
+       KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+       return ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
+               (numdirtybuffers > dirtybufthresh) ||
+               (td && (td->td_pflags & TDP_NORUNNINGBUF)));
+}
+
+/*
  *     brelse:
  *
  *     Release a busy buffer and, if requested, free its resources.  The
Index: sys/sys/buf.h
===================================================================
--- sys/sys/buf.h       (revision 253339)
+++ sys/sys/buf.h       (working copy)
@@ -486,6 +486,7 @@
 void   bdata2bio(struct buf *bp, struct bio *bip);
 void   bwillwrite(void);
 int    buf_dirty_count_severe(void);
+int    bdflush_required(struct bufobj *bo);
 void   bremfree(struct buf *);
 void   bremfreef(struct buf *);        /* XXX Force bremfree, only for nfs. */
 int    bread(struct vnode *, daddr_t, int, struct ucred *, struct buf **);
Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c  (revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
@@ -2161,7 +2161,7 @@
        struct buf *nbp;
        int bp_bdskip;

-       if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+       if (!bdflush_required(bo))
                return;

        td = curthread;

[End of patch]

Besides being more concise, I hope this patch no longer has style
issues, although I was not sure about the indentation of the second
and third lines of the return statement. I chose to left-align the
conditions.

I have also tested limiting numdirtybuffers to hidirtybuffers (instead
of dirtybufthresh), but that does not prevent the system from hanging.

I would appreciate if someone could review and eventually commit this
patch (or a variant thereof), and then close this bug.

Thanks, Klaus
Comment 13 dfilter service freebsd_committer 2013-09-22 20:23:56 UTC
Author: kib
Date: Sun Sep 22 19:23:48 2013
New Revision: 255797
URL: http://svnweb.freebsd.org/changeset/base/255797

Log:
  Increase the chance of the buffer write from the bufdaemon helper
  context to succeed.  If the locked vnode which owns the buffer to be
  written is shared locked, try the non-blocking upgrade of the lock to
  exclusive.
  
  PR:	kern/178997
  Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week
  Approved by:	re (marius)

Modified:
  head/sys/kern/vfs_bio.c

Modified: head/sys/kern/vfs_bio.c
==============================================================================
--- head/sys/kern/vfs_bio.c	Sun Sep 22 19:15:24 2013	(r255796)
+++ head/sys/kern/vfs_bio.c	Sun Sep 22 19:23:48 2013	(r255797)
@@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
 	int hasdeps;
 	int flushed;
 	int queue;
+	int error;
+	bool unlock;
 
 	flushed = 0;
 	queue = QUEUE_DIRTY;
@@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
 			BUF_UNLOCK(bp);
 			continue;
 		}
-		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
+		if (lvp == NULL) {
+			unlock = true;
+			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
+		} else {
+			ASSERT_VOP_LOCKED(vp, "getbuf");
+			unlock = false;
+			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
+			    vn_lock(vp, LK_UPGRADE | LK_NOWAIT);
+		}
+		if (error == 0) {
 			mtx_unlock(&bqdirty);
 			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
 			    bp, bp->b_vp, bp->b_flags);
@@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
 				notbufdflushes++;
 			}
 			vn_finished_write(mp);
-			VOP_UNLOCK(vp, 0);
+			if (unlock)
+				VOP_UNLOCK(vp, 0);
 			flushwithdeps += hasdeps;
 			flushed++;
 
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 14 dfilter service freebsd_committer 2013-09-29 19:05:10 UTC
Author: kib
Date: Sun Sep 29 18:04:57 2013
New Revision: 255941
URL: http://svnweb.freebsd.org/changeset/base/255941

Log:
  Reimplement r255797 using LK_TRYUPGRADE.
  
  The  r255797 was:
  Increase the chance of the buffer write from the bufdaemon helper
  context to succeed.  If the locked vnode which owns the buffer to be
  written is shared locked, try the non-blocking upgrade of the lock to
  exclusive.
  
  PR:	kern/178997
  Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week
  Approved by:	re (glebius)

Modified:
  head/sys/kern/vfs_bio.c

Modified: head/sys/kern/vfs_bio.c
==============================================================================
--- head/sys/kern/vfs_bio.c	Sun Sep 29 18:02:23 2013	(r255940)
+++ head/sys/kern/vfs_bio.c	Sun Sep 29 18:04:57 2013	(r255941)
@@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
 	int hasdeps;
 	int flushed;
 	int queue;
+	int error;
+	bool unlock;
 
 	flushed = 0;
 	queue = QUEUE_DIRTY;
@@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
 			BUF_UNLOCK(bp);
 			continue;
 		}
-		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
+		if (lvp == NULL) {
+			unlock = true;
+			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
+		} else {
+			ASSERT_VOP_LOCKED(vp, "getbuf");
+			unlock = false;
+			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
+			    vn_lock(vp, LK_TRYUPGRADE);
+		}
+		if (error == 0) {
 			mtx_unlock(&bqdirty);
 			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
 			    bp, bp->b_vp, bp->b_flags);
@@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
 				notbufdflushes++;
 			}
 			vn_finished_write(mp);
-			VOP_UNLOCK(vp, 0);
+			if (unlock)
+				VOP_UNLOCK(vp, 0);
 			flushwithdeps += hasdeps;
 			flushed++;
 
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 15 rainer 2014-08-20 14:08:05 UTC
This is still a problem on 10.0-RELEASE.

HP DL380G8.
2*E5-2630
2*6 Cores
64GB RAM
2*300 GB Disk in RAID1, 2 GB Cache.

cd /usr/ports/databases/rrdtool12
make install
cd /usr/ports/databases/rrdtool12/work/rrdtool-1.2.30/examples
./perftest.pl

Can someone comment on how (and when) this is going to be fixed?
Comment 16 pr 2015-12-04 13:47:02 UTC
(In reply to rainer from comment #15)

We seem to experience this problem independently too (tested on 10.2-RELEASE and 10.1-RELEASE-p21). In the testing environment the system behaves as described by rainer. In production, the symptoms seem to be the same (but the traffic is not generated artificially).

Is there any progress or plans about this issue?

Klaus, please, did you solve the problem? 

Thanks in advance, Pavel
Comment 17 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:52:02 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"