209680 – ipfw: when enabled, net connections time out/ssh results in "broken pipe"

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Many People
Assignee:	freebsd-ipfw (Nobody)

URL:
Keywords:	patch

Depends on:
Blocks:

Reported:	2016-05-21 17:11 UTC by O. Hartmann
Modified:	2021-01-16 21:19 UTC
CC List:	9 users

See Also:

Attachments
(Hopefully) make TCP/IP connections reliable under memory pressure again (2.27 KB, patch) 2016-05-23 14:41 UTC, Fabian Keil
ipfw: Prefill the dynamic rule zone and prevent uma from freeing unused items (1.03 KB, patch) 2016-05-23 14:48 UTC, Fabian Keil

Description O. Hartmann 2016-05-21 17:11:22 UTC

For a couple of weeks (if not more than a month by now) I have observed that when IPFW is enabled (in kernel, no module load!), network performance is sometimes worse and server/client connections drop erratically. Affected are PostgreSQL 9.5, Apache 2.4 web services, copies of large files via rsync (> 200 GB; I think it is the time the copy takes that is relevant, not the size; the connection is 1 GBit), and especially ssh connections to remote systems (remote maintenance has been a nightmare recently).

I'm not deep into debugging; I observe, and I can give you this information. The problem occurs on different systems, all of which run the most recent CURRENT (at the moment r300375). The systems have different amd64 hardware (Core2Duo dual-socket Xeons as well as Haswell single-socket Xeons) with different NICs (i210, i219, Broadcom, some Realtek, some Intel em). What these systems also have in common is that IPFW is compiled statically into the kernel. Some private systems also have libalias/in-kernel NAT and pppoe, but that doesn't seem to matter, and neither does the ruleset: the problems occur with the vanilla ipfw scripts delivered with FreeBSD (used via type WORKSTATION) as well as with custom ipfw ruleset scripts.

On an erratic basis, the connection drops or hangs for several seconds. This prevents us from uploading large vector maps for GIS applications into PostgreSQL databases provided by a FBSD server. The connection has timeouts or drops. Using SSH for maintenance is a nightmare. Sometimes several seconds after establishing the connection, sometimes after 30 minutes or more, the connection dies with a broken pipe (ssh: Fssh_packet_write_wait: Connection to XXX.XXX.XXX.XXX port 22: Broken pipe).

All of those reported problems do vanish if I disable IPFW via "ipfw disable firewall".

My in-kernel config for IPFW is (this is the config of a home system, beware that NAT is not enabled on the servers):

#
#       IPFW Firewall
#
options         IPFIREWALL              # firewall
options         IPFIREWALL_VERBOSE      # enable logging to syslogd(8)
options         IPFIREWALL_VERBOSE_LIMIT=10    #limit verbosity
#options         IPFIREWALL_NAT          # ipfw kernel nat support
#options         LIBALIAS                # ipfw kernel nat support
options         IPDIVERT                # divert sockets
options         DUMMYNET        # traffic shaper, bandwidth manager and delay emulator
#options                HZ=2000         # strongly recommended
#
#options                IPFIREWALL_DEFAULT_TO_ACCEPT    # allow everything by default

Comment 1 graham 2016-05-22 07:57:02 UTC

I suspect I'm having the same problem. I back up my system via "s3cmd sync" each week. The backup file is about 2.5Gb in size and the s3 transfer usually dies after a few hundred Mb. I've broken the backup file into 500Mb chunks and it eventually got through after a few tries.

I have only seen this in the last few weeks. But I hadn't updated for a few weeks before then, so the problem could have started any time in the last 6 weeks or so.

I'm running 11-current amd64 and using ipfw with kernel NAT.

I'm happy to do any diagnosis or testing if required.

Comment 2 Fabian Keil 2016-05-23 14:41:33 UTC

Created attachment 170568 [details]
(Hopefully) make TCP/IP connections reliable under memory pressure again

I don't use ipfw, but have occasionally seen similar issues recently
and am currently testing the attached patch in an attempt to prevent them.

While I haven't seen the problem since applying the patch,
I'm not absolutely sure yet that the patch is responsible for
this.

Given that you seem to be able to reliably reproduce the issue
I'd be interested to know if the patch makes a difference for your
workloads.

Comment 3 Fabian Keil 2016-05-23 14:48:57 UTC

Created attachment 170569 [details]
ipfw: Prefill the dynamic rule zone and prevent uma from freeing unused items

If the previous patch doesn't make a difference you could try
adding this one which may work around the problem. If it does,
this could help diagnosing the cause of the problem.

As I don't use ipfw myself I only compile-tested the patch.
It will increase the memory used by ipfw.

Comment 4 graham 2016-05-23 22:43:39 UTC

(In reply to graham from comment #1)

Sorry, I'm an idiot. This isn't happening on my 11-current box - it's on my 10-stable box. However, the point still stands - it was reliable up until a few weeks ago and now it's not. I'll attempt to diagnose some more.

Comment 5 flat 2016-06-01 18:48:00 UTC

I believe that I'm seeing the same issue when doing backup transfers via a SSH tunnel (Fssh_packet_write_wait: Connection to <IP> port 22: Broken pipe)

However I'm using PF and not ipfw. This is happening using an up-to-date FreeBSD 10.3 (10.3-RELEASE-p4) with the default kernel.
I can't say when the issue was introduced since it is a freshly installed machine, but I'm running the exact same SSH tunnel setup on a Linux machine without any issues.

A workaround for me seems to be to limit the transfer speed to something way below the link speed. At least it's better than connections breaking.
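
One way such a rate cap could be applied, sketched under assumptions: the comment does not say which tool or which limit was used. The helper below (the name bwlimit_kib is made up for this example) converts a link speed in Mbit/s and a percentage into KiB/s, which is the unit rsync's --bwlimit option expects when no suffix is given. The 25 percent figure is only an illustration.

```shell
# Hypothetical helper: compute a cap for rsync --bwlimit, in KiB/s.
# $1 = link speed in Mbit/s, $2 = percentage of the link to use.
bwlimit_kib() {
    # 1 Mbit/s = 125000 bytes/s, so bytes/s * percent / 100 / 1024
    # simplifies to mbit * 1250 * percent / 1024 (integer division).
    echo $(( $1 * 1250 * $2 / 1024 ))
}

# Example with illustrative values: use about a quarter of a 1 Gbit/s link.
#   rsync -a --bwlimit="$(bwlimit_kib 1000 25)" backup/ user@host:backup/
```

scp has a comparable -l option, but it takes the limit in Kbit/s rather than KiB/s.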

Comment 6 O. Hartmann 2016-06-04 10:12:50 UTC

Today I made another observation in this matter. A server that has in-kernel NAT and LIBALIAS and is attached to the net via an ADSL SoHo connection can no longer be reached from the outside world. It worked a couple of weeks ago with the ipfw rules I use, including the proper forwarding rules, but for about two weeks, since these "broken pipe issues" started getting worse and worse, connecting to the provided www server or ssh has not been possible. I then started checking for mistakes in the ipfw ruleset. Today I had the chance to access the box from the outside world while simultaneously having access to the server and its IPFW itself, and after a clean reboot of

FreeBSD 11.0-ALPHA2 #10 r301307: Sat Jun  4 11:03:17 CEST 2016 amd64

trying to connect to the server's Apache server or ssh failed. Then we simply restarted the local ipfw several times via "service ipfw restart" and voila, it worked!

Sorry for the poor material I can provide at the moment, but time constraints are tight, my debugging abilities are limited, and setting up alternative serving systems to circumvent the issue reported here eats a lot of time.

Comment 7 O. Hartmann 2016-06-04 19:45:57 UTC

Applying both patches seems to solve the problem of the "broken pipe" with ssh. So far, three ssh sessions from one system under load to another server, also under heavy load, are still active after two hours. This wasn't the case before; the connections died rather quickly even under relaxed conditions.

It does not solve the problem with NAT/port forwarding.

Comment 8 Michael Osipov 2016-10-15 10:14:43 UTC

I assume it is enough to apply the patches to /usr/src and run make buildkernel && make installkernel?

Comment 9 Fabian Keil 2016-10-15 10:37:24 UTC

That's correct, rebuilding the userland isn't necessary.
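
The kernel-only rebuild confirmed above could look roughly like the following sketch. Several details are assumptions that the thread does not state: the patch strip level (-p0 here; a patch made with git would need -p1), the kernel configuration name (GENERIC by default), and the helper names run and rebuild_kernel_with_patch, which are made up for this example. With DRY_RUN set, the commands are only printed.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: apply one patch to /usr/src and rebuild only the kernel.
# $1 = path to the patch file, $2 = kernel config name (assumed GENERIC).
rebuild_kernel_with_patch() {
    patchfile=$1
    kernconf=${2:-GENERIC}
    run cd /usr/src &&
    run patch -p0 -i "$patchfile" &&   # strip level is an assumption
    run make buildkernel KERNCONF="$kernconf" &&
    run make installkernel KERNCONF="$kernconf"
}

# Example (prints the four commands without running them):
#   DRY_RUN=1; rebuild_kernel_with_patch /path/to/attachment.diff
```

The new kernel only takes effect after a reboot.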

Comment 10 Michael Osipov 2016-10-18 12:24:33 UTC

This patch does not work for me. The same issue happens even with the patch, but if I switch from graid3 to ZFS raidz, everything is fine. It must be the geom class in my case.

Comment 11 Len White 2017-01-28 06:39:31 UTC

I've been having the same issue, it's very random.  I've spent A LOT of time debugging it, adding extra print statements in ipfw... unfortunately I can't trigger the issue at will.  It does seem to happen more often if I start up World of Warcraft from a system behind the ipfw machine.  But it seems like whatever the issue is, it's causing the connections to "expire" prematurely.  When it happens new connections will die in 5-15 seconds over and over.  I can reboot the system and it will come back up, still doing the same thing, then 5-10 mins later it will be fine.  Never any errors in logs or dmesg when it happens.

Running 11.0-RELEASE-p5

Comment 12 Fabian Keil 2017-01-28 10:17:19 UTC

I'm still not using ipfw, but the patch from comment
two seems to have fixed the issue for me.

The patch from comment three should be safe to test.

Running "vmstat -z" while the system is showing symptoms
could help to decide whether or not the patches might be
worth trying.
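
A minimal sketch of how the "vmstat -z" output could be screened, assuming the usual layout in which each zone line reads "name: size, limit, used, free, requests, failures, sleeps" (newer releases append further columns). The function name failing_zones is made up for this example; it reads the output on stdin and lists the zones whose failure counter is non-zero.

```shell
# Hypothetical helper: list UMA zones with a non-zero FAIL counter.
# Assumes the zone name comes before the first colon and that FAIL is
# the sixth comma-separated value after the last colon.
failing_zones() {
    awk -F: '
        NF >= 2 {
            name = $1
            n = split($NF, col, ",")
            if (n >= 6 && col[6] + 0 > 0) {
                gsub(/^ +| +$/, "", name)
                printf "%s: %d failed allocations\n", name, col[6]
            }
        }'
}

# Example, run while the system is showing symptoms:
#   vmstat -z | failing_zones
```

Failure counts that grow for network-related zones while connections stall would be consistent with the memory-pressure theory behind the patches; an empty result would point elsewhere.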

Comment 13 Charles Mercadal 2017-02-21 20:16:03 UTC

Fabian:  Thank you for the patch, attachment 170568 [details] appears to have fixed my issues.

I was having similar issues that others were describing here, in my case 11.0-RELEASE on arm:  I would enable ipfw and begin adding some firewall rules, and I'd start to lose SIP registration on my IP phones, had unexpected delays and stalls in interactive ssh sessions, etc.  Initially I thought it was an ipfw bug, but then tried pf, and found similar behavior.

Comment 14 Len White 2017-02-21 20:23:35 UTC

Yes thank you very much Fabian, it fixed my issues too.

Is there any possible way this can get pushed upstream?  I personally feel it's a rather serious bug and there are no doubt many other people running into the same issues.

Comment 15 Charles Mercadal 2018-03-12 19:00:47 UTC

Has anyone else upgraded to 11.1-RELEASE?

I rebuilt kernel & world about a week ago, and forgot to re-apply the changes in attachment 170568 [details].

I'm still running ipfw.  Even without the changes in the patches, so far there have been no stalling issues/no connection drops.

Comment 16 O. Hartmann 2018-03-12 19:06:19 UTC

ipfw has undergone changes in the meantime, and while running 11.1-RELENG-p7 and CURRENT I haven't seen the reported issue for a while now.