Summary: | mbuf cluster leak on pf+bird2 bgp routers
---|---
Product: | Base System
Reporter: | Thomas Steen Rasmussen / Tykling <thomas>
Component: | kern
Assignee: | freebsd-pf (Nobody) <pf>
Status: | Closed Unable to Reproduce
Severity: | Affects Some People
CC: | borjam, glebius, markj, melifaro, mike, net, olivier, stefan+freebsd, zarychtam, zlei
Priority: | ---
Version: | 13.2-STABLE
Hardware: | Any
OS: | Any
Attachments: | vm.uma.mbuf_cluster.stats.current growth (attachment 252362)
Description
Thomas Steen Rasmussen / Tykling
2024-02-07 15:25:11 UTC
ps. The very rudimentary netstat -m exporter is here; it needs jq and sponge installed:

```
[tykling@dgncr2a ~]$ cat /etc/cron.d/netstat_mbuf_exporter
# Run netstat_mbuf_exporter.sh every minute and put the output in the
# prometheus textfile collector directory
* * * * * root /usr/local/bin/netstat_mbuf_exporter.sh | /usr/local/bin/sponge /var/tmp/node_exporter/netstat-mbuf.prom

[tykling@dgncr2a ~]$ cat /usr/local/bin/netstat_mbuf_exporter.sh
#!/bin/sh
/usr/bin/netstat -m --libxo json | /usr/local/bin/jq -r '."mbuf-statistics" | keys_unsorted[] as $k | "\($k) \(.[$k])"' | /usr/bin/tr "-" "_" | /usr/bin/sed "s/^/freebsd_netstat_mbuf_/g"

[tykling@dgncr2a ~]$ head -5 /var/tmp/node_exporter/netstat-mbuf.prom
freebsd_netstat_mbuf_mbuf_current 1495568
freebsd_netstat_mbuf_mbuf_cache 3547
freebsd_netstat_mbuf_mbuf_total 1499115
freebsd_netstat_mbuf_cluster_current 749044
freebsd_netstat_mbuf_cluster_cache 3558
[tykling@dgncr2a ~]$
```

Gleb, based on the report this sounds more like a leak in the routing socket code, no? There's no mention of pf except in the bug title.
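For reference, the renaming stage of the exporter can be exercised on a canned sample instead of live `netstat -m --libxo json` output. This sketch skips the jq step (so the input is assumed to already be in `key value` form) and uses values taken from the `head -5` output above:

```shell
# Sketch of the exporter's key-to-metric renaming on fixed sample input,
# without the jq step: dashes become underscores, then each line gets the
# freebsd_netstat_mbuf_ prefix.
printf 'mbuf-current 1495568\nmbuf-cache 3547\n' \
  | tr '-' '_' \
  | sed 's/^/freebsd_netstat_mbuf_/'
# Prints:
# freebsd_netstat_mbuf_mbuf_current 1495568
# freebsd_netstat_mbuf_mbuf_cache 3547
```

The result matches the metric names in the `.prom` file above, which is why the real pipeline needs no further post-processing for the node_exporter textfile collector.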
> I at one point tried adding the missing kernel export filter (as to at least silence the noisy warnings in the logs), and imagine my surprise when the mbuf cluster leak stopped.
I'm not too familiar with how this works - does this basically install a bunch of routes in the kernel, so most likely you're hitting an mbuf leak in the routing socket code? This may be fixed in 14.0 by virtue of having reimplemented parts of that interface using netlink.
On Wed Feb 7 15:25:11 2024 UTC, thomas@gibfest.dk wrote:

> Over the holidays I upgraded from bird 2.0.9 to bird 2.14, as well as upgrading FreeBSD from 13-STABLE-384a885111ad to 13-STABLE-2cbd132986a7. I suspect one of these two changes made this problem appear. I made no changes to bird or router config other than the upgrades.

What I would suspect here is NETLINK. Lots of stuff was merged between 384a885111ad and 2cbd132986a7. Thomas, is it possible for you to work more on isolating the regression? Things to check:

1) Did the bird upgrade from 2.0.9 to 2.14 switch bird to use NETLINK instead of the route socket?

If 1) is false, there are two options: 2.0.9 and 2.14 both used NETLINK, or both used the route socket. If the latter, then my guess is totally wrong and Mark's guess is much better. If the former, then we need to bisect between 384a885111ad and 2cbd132986a7.

2) If 1) is true, then please compile 2.14 with NETLINK disabled and check whether the leak is gone. If 1) and 2) are both true, the problem could have been present in 384a885111ad as well, but you did not use NETLINK.

3) Check whether running with NETLINK on 384a885111ad reproduces the leak or not. (Be careful, as lots of bugs were fixed after 384a885111ad.)

Depending on 3) we may need to run a bisection. Anyway, please keep us updated when you have more info, starting with 1).

Now that we have only the FreeBSD 13, 14 and CURRENT branches supported, and all of them have a reworked routing stack with NETLINK support included, bird2-netlink is better suited to run on FreeBSD and should probably become the default flavor of the net/bird2 port. The transition is important to avoid such situations in the future. The netlink flavor supports ECMP, its memory footprint is much lower compared to the rtsock version, and it will run with the same config file, though small config changes are recommended. The user experience with bird2-netlink is better since it can run undisturbed for months on FreeBSD 13.2+ without any observable drawbacks.
I may have something related (or not). I have observed a severe mbuf leak, with the following sysctl variables growing steadily:

vm.uma.mbuf_packet.stats.current
vm.uma.mbuf_cluster.stats.current
vm.uma.mbuf.stats.current

The server on which I have been observing this runs quite a lot of stuff, so, suspecting some particular process(es) of causing this, I checked for the culprit. Turns out the culprit (trigger, I should say) is a version of nfsen that uses threads. My main suspects are the nfcapd or sfcapd daemons, which create and destroy threads. The funny thing is, I run two concurrent versions of these daemons. The old ones do *not* leak mbufs, while the recent ones do. The daemons are supposed to be pretty straightforward, receiving UDP packets in Netflow/sFlow/IPFix format.

Created attachment 252362 [details]
vm.uma.mbuf_cluster.stats.current growth
The number grows *only* when the version 1.7.4 nfcapd/sfcapd daemons are running.
It goes back to near zero only when I reboot the server.
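To put a number on "grows steadily", one rough approach is to sample the counter twice and convert the delta into a rate. This is a minimal sketch: the `leak_rate` helper and the sample values are made up for illustration; on an affected box the two samples would come from `sysctl -n vm.uma.mbuf_cluster.stats.current` taken some seconds apart.

```shell
#!/bin/sh
# Hypothetical helper: growth of a counter per hour, given two samples
# ($1 first, $2 second) taken $3 seconds apart. POSIX sh arithmetic only.
leak_rate() {
    echo $(( ($2 - $1) * 3600 / $3 ))
}

# Illustrative numbers: cluster count rose by 1800 over a 600-second window.
leak_rate 749044 750844 600   # prints 10800 (clusters per hour)
```

A rate like this, logged alongside process start/stop times, is what makes it possible to tie the growth to specific daemons, as done above with the 1.7.4 nfcapd/sfcapd processes.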
(In reply to Borja Marcos from comment #7)

> The number grows *only* when the version 1.7.4 nfcapd/sfcapd daemons are running.

Perhaps you would rather submit a new PR reporting the leak against net-mgmt/nfdump? The original PR here refers to net/bird2 and seems to be only loosely related to the problems you have encountered.

(In reply to Marek Zarychta from comment #8)

Sorry, given that I only suffer this leak with a recent version of nfdump which uses threads (the old version, which doesn't trigger this bug, is single-threaded), I thought this might add a useful data point and ring a bell. Is thread usage one of the differences between your Bird versions? Or some thread usage pattern? In my case I have identified the leak on FreeBSD 14, but in hindsight I had similar problems on the 13- branch.

(In reply to Borja Marcos from comment #9)

> Is thread usage one of the differences between your Bird versions? Or some thread usage pattern?

The BIRD daemon from the port net/bird2 is still single-threaded. BIRD 3, which is currently in an early ALPHA stage, will be multi-threaded, but we don't have it in the ports tree. Moreover, I am not able to reproduce the original problem reported here. Running 4 full-view sessions 2x(IP+IPv6) on BIRD 2.15.1, after 1.5 months of system and session uptime I see no memory leaks:

```
2492 root  1 20 0 365M 339M select 1 51.4H 0.00% bird

31306/2717384/2748690 mbufs in use (current/cache/total)
2/3554/3556/1014047 mbuf clusters in use (current/cache/total/max)
2/3554 mbuf+clusters out of packet secondary zone in use (current/cache)
26740/3740/30480/507023 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/150229 9k jumbo clusters in use (current/cache/total/max)
0/0/0/84503 16k jumbo clusters in use (current/cache/total/max)
114801K/701414K/816215K bytes allocated to network (current/cache/total)
```

Hello,

It has been quite a few months, the issue has not reappeared, and at this point it is unlikely to do so by itself.
bird2 is now at 2.16, and the routers still run 13-STABLE, somewhere after 13.4 at the moment. Originally the issue appeared to be tied to bird2 exporting a route to the kernel which already existed in the kernel. This was and is the rtsock flavour. So I tried just now removing the route filters I added back in January to make the leak go away, and nothing really happened. I don't know if it is solved or I just haven't been able to tickle out the right set of circumstances. I will report back if I ever encounter it again! Closing for now.