Bug 257788

Summary: databases/postgres{12|13|14}-{server|client}: severe Kernel TLS issues
Product: Ports & Packages Reporter: O. Hartmann <ohartmann>
Component: Individual Port(s) Assignee: pgsql
Status: New ---    
Severity: Affects Many People CC: lwhsu, mizhka, peter, ygy
Priority: ---    
Version: Latest   
Hardware: Any   
OS: Any   

Description O. Hartmann 2021-08-12 14:46:52 UTC
Since May 2021 we have faced severe DB issues on a couple of systems running 14-CURRENT, at this time FreeBSD 14.0-CURRENT #11 main-n248668-aecd31a8a3b: Thu Aug 12 15:15:58 CEST 2021 amd64, in dual-stack (IPv4/IPv6) configurations. The ports databases/postgresql13-{server|client|contrib} have been recompiled via "portmaster -df postgresql" several times now on two specific hosts, without success so far. Before I describe the phenomenon, I should state that we use customized kernel configurations, kernel TLS is enabled in the kernel by default, and we also played with the kernel OID


but I'll report on that later. For the tests described below, kern.ipc.tls.enable is set to ZERO ("0"). Otherwise an error occurs, see below.

For the record: both systems in question are running on older Intel Ivy Bridge hardware (Intel(R) Core(TM) i5-3470 CPU and Intel(R) Xeon(R) CPU E3-1245 V2).

The XEON host also acts as a poudriere package builder (see below); it seems important to me to mention this here.

The phenomenon is as follows. On the hosts running PostgreSQL 12, 13 or 14 as server, login via "psql -U postgres -d postgres" is always possible via the local socket, but "psql -U postgres -d postgres -h localhost" (or replace localhost by or ::1 to exclude any misunderstandings) fails; after a while the client hits a timeout:

#: psql -U postgres -d postgres -h
psql: error: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
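The comparison described above (local socket versus TCP) could be repeated with a small script along the following lines. This is only a sketch: the function name try_connect and the PSQL override variable are made up for illustration, the 10-second value for libpq's PGCONNECT_TIMEOUT is an arbitrary choice, and the host names are the ones mentioned in this report.

```shell
#!/bin/sh
# Sketch only: try_connect and the PSQL override are hypothetical helpers,
# not part of any port. PGCONNECT_TIMEOUT is libpq's connect timeout (seconds).
PSQL="${PSQL:-psql}"   # lets another client binary be substituted

try_connect() {
    # $1 = label for the report line, remaining arguments = extra psql options
    label="$1"; shift
    if PGCONNECT_TIMEOUT=10 "$PSQL" -U postgres -d postgres "$@" \
        -c 'SELECT 1' >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED (exit $?)"
    fi
}

try_connect "unix socket"
try_connect "tcp localhost" -h localhost
try_connect "tcp ::1" -h ::1
```

On the affected hosts one would expect the first line to report ok and the TCP lines to report a failure, matching the behaviour described here.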

Checking via sockstat -4|-6 indicates that postgresql is listening on its default port 5432 on the machines in question, and IPFW is set up properly or disabled (simply set to "OPEN") for test purposes.
Configuring the PostgreSQL server's logging to debug does not give anything useful; the only thing one can see in the log, if logging is set to "info", is:

root@:~ # 2021-08-12 13:51:05.137 GMT [2132] LOG:  connection received: host=host1.local.net port=41162

Then - silence! As if the server went deaf.

To make sure that neither a corrupted DB nor a hidden misconfiguration in pg_hba.conf and/or postgresql.conf causes the problems, we installed versions 12, 13 and even 14 of the software on both systems (compiled via classical make). The problem is the same with all versions on those hosts.

To exclude any issues regarding self-compiling postgresql, we also fetched the pkg tarball of postgresql13-server from an official FreeBSD mirror and installed that one. The problem remains, which leaves us with either a broken world or a broken kernel so far. Recompiling world and kernel with vanilla settings did not change anything. Using GENERIC as the kernel does not mitigate or resolve the problem either.

As initially mentioned, the XEON box also acts as a poudriere package host. It builds packages for 13-STABLE as well, using the very same make.conf as the host (and thus as the non-working DB host itself).

From a client running a recent 13-STABLE and equipped with the packages built on the host in question above, IT IS POSSIBLE to connect to the PostgreSQL 13 server, as long as


is set to 0. If one sets kern.ipc.tls.enable to "1", the client (running psql 13.3) receives:

psql: error: SSL SYSCALL error: EOF detected
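Since the outcome depends on kern.ipc.tls.enable, it may help to record the OID's current value next to every connection attempt. A minimal sketch follows, assuming a sysctl(8) that supports -n; the function name ktls_state and the SYSCTL override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: ktls_state and the SYSCTL override are hypothetical helpers.
# kern.ipc.tls.enable is the OID named in this report.
OID=kern.ipc.tls.enable
SYSCTL="${SYSCTL:-sysctl}"

ktls_state() {
    # Map the OID's value to a word; anything unexpected (including a
    # missing OID or a missing sysctl binary) is reported as "unknown".
    case "$("$SYSCTL" -n "$OID" 2>/dev/null)" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}

echo "kernel TLS is currently: $(ktls_state)"
# To switch it off for a test run (as root): sysctl kern.ipc.tls.enable=0
```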

So the PostgreSQL 13.3 server itself on the failing host is serving as expected, and it seems to be the client that has severe problems.

The problems occurred on all affected systems at almost the same time, around May 26th this year, when we did our weekly updates of the 14-CURRENT base system and the portmaster jobs for ports. That might be a hint, although I do not remember when LLVM 12 was introduced or KTLS was activated.

Also, to exclude any issue with iflib and the i350 NICs on the servers, we disabled all hardware checksum offloading for vlan and RX/TX, so that in the end a "naked" interface without any hardware support is used. But that didn't resolve anything either.

Another test went really sideways. We moved the complete configuration (base system, kernel, sysctl.conf, postgresql13 configs and databases) to another, more modern platform (it is a XEON-based system that is not remotely accessible, so I can't report its hardware specs). On this box, based on 14-CURRENT and with postgresql13 in a jail, the server acts as expected, and local connections as well as remote connections are possible. This is really weird and leaves me with the preliminary conclusion that something is really wrong.

I'm out of ideas here and floating like a dead man in the water ...