Bug 247163 - ipsec / cesa memory issue
Summary: ipsec / cesa memory issue
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: 11.2-RELEASE
Hardware: arm Any
Importance: --- Affects Some People
Assignee: freebsd-arm (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-11 11:02 UTC by Graham Collinson
Modified: 2020-07-10 19:22 UTC
CC List: 2 users

See Also:


Attachments
cesa session limit patch (untested) (1.58 KB, patch)
2020-07-10 19:22 UTC, Mark Johnston

Description Graham Collinson 2020-06-11 11:02:51 UTC
We have approximately 30 ipsec tunnels on a Netgate SG-3100.  We've been getting errors that stop tunnels from coming up and require a reboot to recover.

Found that we're getting this kind of error message at the time of the problems:

charon: 12[KNL] <con13000|200> unable to add SAD entry with SPI cb36cc85: Cannot allocate memory (12)
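
The trailing "(12)" in that message is an errno value. A minimal sketch for decoding it, assuming a libc where ENOMEM is 12 (true on FreeBSD); the helper names are hypothetical and not part of charon or the kernel:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: the C library's text for an errno value. */
const char *
errno_text(int err)
{
	return (strerror(err));
}

/* Hypothetical helper: nonzero when the logged code is ENOMEM. */
int
is_enomem(int err)
{
	return (err == ENOMEM);
}
```

Calling is_enomem(12) should confirm that the SAD entry failure is an out-of-memory style error rather than a bad argument.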

With ipsec debugging turned on, this message appears when the memory problem happens:

kernel: key_setsaval: unable to initialize SA type 3.

This error appears to be logged in sys/netipsec/key.c when xform_init fails (type 3 = SADB_SATYPE_ESP).

I think the error is passed back from somewhere in this code:

	case SADB_SATYPE_ESP:
		if ((sav->flags & (SADB_X_EXT_OLD | SADB_X_EXT_DERIV)) ==
		    (SADB_X_EXT_OLD | SADB_X_EXT_DERIV)) {
			ipseclog((LOG_DEBUG, "%s: invalid flag (derived) "
			    "given to old-esp.\n", __func__));
			error = EINVAL;
			goto fail;
		}
		error = xform_init(sav, XF_ESP);
		break;

I think xform_init eventually ends up in crypto_newsession in sys/opencrypto/crypto.c, and the error is likely returned at:
err = CRYPTODEV_NEWSESSION(cap->cc_dev, res, csp);

(I may not have been looking at the right versions of the source code when I researched this, sorry)

We had the same config working on an older model of Netgate.  The difference seems to be that the 3100 has a "Marvell Cryptographic Engine and Security Accelerator".
To rule this out we've set net.inet.ipsec.crypto_support to software only (33554432) and the problem has gone away.
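
For reference, 33554432 is 0x02000000. A minimal sketch of how such a value decodes, assuming the flag values match CRYPTO_FLAG_HARDWARE and CRYPTO_FLAG_SOFTWARE in sys/opencrypto/cryptodev.h (check the header for the release in use); crypto_support_name is a hypothetical helper, not a kernel function:

```c
#include <stdint.h>

/* Assumed to match sys/opencrypto/cryptodev.h; verify against your tree. */
#define CRYPTO_FLAG_HARDWARE	0x01000000	/* hardware accelerated */
#define CRYPTO_FLAG_SOFTWARE	0x02000000	/* software implementation */

/* Hypothetical helper: describe a net.inet.ipsec.crypto_support value. */
const char *
crypto_support_name(uint32_t value)
{
	int hw = (value & CRYPTO_FLAG_HARDWARE) != 0;
	int sw = (value & CRYPTO_FLAG_SOFTWARE) != 0;

	if (hw && sw)
		return ("hardware or software");
	if (hw)
		return ("hardware only");
	if (sw)
		return ("software only");
	return ("no flag set");
}
```

Under that assumption, 16777216 would select hardware only and 50331648 would allow either.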

Using cesa we were having to reboot every 2 to 3 days.
We've now been running for 10 days and there's no sign of the issue.

Unfortunately I don't have any spare hardware available to try out debug builds to track the issue down further, but it appears to be related to cesa in some way.  I also don't have access to anything I can try a newer build on.
I've searched existing bugs and can't find anything similar but I could easily have missed something.

uname -a gives:
FreeBSD pfSense 11.2-RELEASE-p10 FreeBSD 11.2-RELEASE-p10 #12 10fea60fdde(factory-RELENG_2_4_4): Thu May 16 06:02:25 EDT 2019     root@buildbot1-nyi.netgate.com:/build/factory-crossbuild-244/obj/armv6/tDULKC6G/arm.armv6/build/factory-crossbuild-244/pfSense/tmp/FreeBSD-src/sys/pfSense-SG-3100  arm
Comment 1 Mark Johnston freebsd_committer 2020-06-15 20:26:24 UTC
I'm having trouble seeing how cesa_newsession() could raise an error here.  The only place where it returns a non-zero error number is if cesa_prep_aes_key() returns an error, which only happens if the IPsec code passes an invalid key length.
Comment 2 Graham Collinson 2020-06-19 13:29:08 UTC
(In reply to Mark Johnston from comment #1)
I've probably got something wrong in the way I'm interpreting the source code.

The error value appears to be checked straight after the xform_init call, and a non-zero value produces the message I've seen.

	if (error) {
		ipseclog((LOG_DEBUG, "%s: unable to initialize SA type %u.\n",
		    __func__, mhp->msg->sadb_msg_satype));
		goto fail;
	}

We're still running fine since taking cesa out.  Perhaps it wasn't cesa directly causing our issues but something to do with using a hardware driver.
Comment 3 Graham Collinson 2020-06-19 14:05:04 UTC
I think I'm now looking at the correct version of the cesa code:
https://github.com/freebsd/freebsd/blob/release/11.2.0/sys/dev/cesa/cesa.c

I see there was a point where it would return ENOMEM:

	cs = cesa_alloc_session(sc);
	if (!cs)
		return (ENOMEM);

This code has changed since the release we're on.  It appears to be this commit: https://github.com/freebsd/freebsd/commit/99ba792d73cb1765bd7271160d3d81500308a2c6

So this is probably not a problem in later versions.
Comment 4 Graham Collinson 2020-06-19 15:17:41 UTC
Checking through git it looks as though that commit made its way into the 12 release.  11 releases still use the old memory management and likely still have the issue we're seeing.
Comment 5 Graham Collinson 2020-06-21 12:29:22 UTC
It looks like ENOMEM is returned when cesa has run out of available sessions; the limit is set at 64 (CESA_SESSIONS in https://github.com/freebsd/freebsd/blob/release/11.2.0/sys/dev/cesa/cesa.h).
Perhaps there's a session leak somehow or we just get to a point where our system is demanding more than 64 sessions at a time?
There doesn't seem to be a way I can track the allocation of sessions.
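
To illustrate the failure mode, here is a minimal userspace sketch of a fixed, pre-allocated session pool that returns ENOMEM once all slots are taken. It is not the cesa driver code; every name in it (session_pool, pool_newsession, pool_freesession, the allocated counter) is hypothetical, and only the limit of 64 comes from the header cited above:

```c
#include <errno.h>
#include <stddef.h>

#define POOL_SESSIONS	64	/* same value as the CESA_SESSIONS limit */

struct session {
	int in_use;
};

struct session_pool {
	struct session slots[POOL_SESSIONS];	/* pre-allocated, never grows */
	unsigned int allocated;			/* counter for observability */
};

/* Create a session: 0 on success, ENOMEM once every slot is taken. */
int
pool_newsession(struct session_pool *p, struct session **out)
{
	size_t i;

	for (i = 0; i < POOL_SESSIONS; i++) {
		if (!p->slots[i].in_use) {
			p->slots[i].in_use = 1;
			p->allocated++;
			*out = &p->slots[i];
			return (0);
		}
	}
	return (ENOMEM);
}

/* Return a session to the pool; a missed call here is a leak. */
void
pool_freesession(struct session_pool *p, struct session *s)
{
	if (s != NULL && s->in_use) {
		s->in_use = 0;
		p->allocated--;
	}
}
```

With a counter like allocated exposed somewhere, a leak (the count only ever rising) could be told apart from genuine demand for more than 64 sessions.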

Perhaps there could be a way for crypto_select_driver in https://github.com/freebsd/freebsd/blob/release/11.2.0/sys/opencrypto/crypto.c to know that a device has hit its maximum number of sessions and not select it?

Or a way to fall back to software if a CRYPTODEV_NEWSESSION call fails on a hardware device?

The same restriction of 64 sessions in cesa appears to still be in place in 11.4.
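
The fallback idea could look roughly like the following userspace sketch. It does not reflect how crypto_newsession is actually structured; struct driver, newsession_with_fallback and the two stand-in callbacks are all hypothetical names used only for illustration:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver handle: a name plus a session-creation callback. */
struct driver {
	const char *name;
	int (*newsession)(void);	/* returns 0 or an errno value */
};

/* Stand-in for a hardware driver whose session table is full. */
int
hw_full_newsession(void)
{
	return (ENOMEM);
}

/* Stand-in for a software driver that always has room. */
int
sw_newsession(void)
{
	return (0);
}

/*
 * Try the preferred driver first; if it fails, try the fallback.
 * On success, *chosen points at the driver that accepted the session.
 */
int
newsession_with_fallback(const struct driver *preferred,
    const struct driver *fallback, const struct driver **chosen)
{
	int error;

	error = preferred->newsession();
	if (error == 0) {
		*chosen = preferred;
		return (0);
	}
	if (fallback == NULL)
		return (error);
	error = fallback->newsession();
	if (error == 0)
		*chosen = fallback;
	return (error);
}
```

The trade-off is that a silent fallback hides the exhausted hardware pool, so logging the first failure would still be useful.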
Comment 6 Mark Johnston freebsd_committer 2020-07-10 19:22:36 UTC
Created attachment 216367
cesa session limit patch (untested)

(In reply to Graham Collinson from comment #5)
Nice catch, I forgot that this had been refactored since FreeBSD 11.

It looks like the driver imposes a session limit only because it pre-allocates session structures.  That limit is gone now in head, where session management has been factored out of the original drivers.  I see no such limit in 11 in the software crypto driver.

I would guess that your workload simply requires more than 64 crypto sessions.  It would be interesting to see the output of "vmstat -m | grep crypto" from a system that has been up for a while.

I wrote an untested patch that bumps the session limit in cesa and makes it configurable at boot time.  I'll let the pfSense folks know about it; I'm not sure if the issue you're seeing has been observed elsewhere.
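
As a general illustration of a boot-time configurable limit (this is not the attached patch, whose contents are not reproduced here), the following userspace sketch parses a limit from a string and keeps a default when the value is missing or invalid. The function name and the idea of passing the tunable as a string are assumptions:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a session limit supplied as text.
 * Returns the parsed value, or fallback if the text is missing,
 * malformed, or not a positive int.
 */
int
session_limit_from_string(const char *value, int fallback)
{
	char *end;
	long n;

	if (value == NULL || *value == '\0')
		return (fallback);
	errno = 0;
	n = strtol(value, &end, 10);
	if (errno != 0 || *end != '\0' || n < 1 || n > INT_MAX)
		return (fallback);
	return ((int)n);
}
```

For example, session_limit_from_string("256", 64) yields 256, while an unset or malformed value keeps the default of 64.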