Created attachment 225871
Traffic dump collected on lo0 interface
Some RACK fixes and improvements recently MFCed to stable/13 have made RACK unusable in the jail environment. The last working build is from June 4th and was probably 92f49c769b4e. Problem description: The database client (mysql/mariadb) running in the jail can no longer access the database on the host (mysql/mariadb) over a network socket, neither via the 127.0.0.1 address nor via the shared interface address (10.x.x.x). It is a classic jail, not a VNET jail. Switching to net.inet.tcp.functions_default=freebsd solves the issue. The problem occurs regardless of the net.inet.tcp.cc.algorithm in use (tested with htcp and newreno). Conclusion: Testing RACK on FreeBSD is becoming a really hard task. Switching back to the default net.inet.tcp.functions_default=freebsd solves a lot of issues and reduces the rate of PR submissions.
Created attachment 225872
sysctl.conf
I have almost no experience with jails and mysql/mariadb. Can you reproduce the problem using nc? Could you share the commands how to reproduce that? I guess you want to run a TCP-based server on the host, so you could use nc -4 -k -l 3000 for that. A client on the host could then be run as nc -4 -N 127.0.0.1 3000. Can you reproduce the issue using nc? If yes, what are the necessary steps to reproduce the issue?
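For anyone without nc(1) at hand, here is a minimal Python 3 sketch of the same test. It is an illustration, not something from the original report. `loopback_exchange` is a hypothetical helper, and both sockets simply use whatever stack net.inet.tcp.functions_default selects. Passing an empty payload mimics a client that connects and then closes without sending any data.

```python
import socket
import threading
from typing import List


def loopback_exchange(payload: bytes, timeout: float = 5.0) -> bytes:
    """Rough stand-in for `nc -4 -l` (server) plus `nc -4 -N 127.0.0.1 PORT`
    (client): connect over IPv4 loopback, send `payload`, half-close the
    client's write side, and return what the server read before EOF."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # IPv4-only listener, like nc -4
    srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    srv.listen(1)
    srv.settimeout(timeout)
    received: List[bytes] = []

    def serve() -> None:
        conn, _ = srv.accept()
        with conn:
            conn.settimeout(timeout)
            while True:
                chunk = conn.recv(4096)
                if not chunk:  # b"" means the client shut down its write side
                    break
                received.append(chunk)

    server = threading.Thread(target=serve, daemon=True)
    server.start()
    with socket.create_connection(srv.getsockname(), timeout=timeout) as cli:
        cli.sendall(payload)
        cli.shutdown(socket.SHUT_WR)  # what nc -N does after EOF on stdin
        # If the exchange stalls, join() gives up after `timeout` seconds and
        # whatever arrived so far is returned.
        server.join(timeout)
    srv.close()
    return b"".join(received)
```

For example, `loopback_exchange(b"hello\n")` should return `b"hello\n"` on a healthy stack.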
Created attachment 225874
Traffic dump collected on lo0 interface (working fine)
No, I am not able to reproduce it with either nc(1) or ssh(1). I can only submit a traffic dump of a working connection. I have tried disabling ECN, but it doesn't help.
This is strange. When using nc, did you enter CTRL-D on the client side right after starting the client? That would match the failing trace: the client initiates a connection and terminates it without transferring any data. I'm not aware of any jail-specific handling that has to be done in the TCP code. I don't think we have ever tested RACK in jails; I never test TCP stuff in jails in general. So I'm surprised that there are jail-specific issues, but maybe I'm missing something fundamental here.
(In reply to Michael Tuexen from comment #2)
>Could you share the commands how to reproduce that?
Try to connect from the jail to the socket on the host:
mysql -p -u root -h 127.0.0.1
The host has a MariaDB server (mariadb103-server-10.3.29) running with skip_networking=off set.
(In reply to Michael Tuexen from comment #4)
>When using nc, did you enter CTRL-D on the client side right after starting the client?
When nc(1) or ssh(1) is used from the jail, the connection to the host can be established and works without issues. Both traffic dumps were collected on the same lo0 of the jail host with the filter 'tcp and dst or src port 3306'. Connections with ssh(1) and nc(1) were tested and working prior to submitting this PR.
Another question: Does the same problem occur when running the database client and server on the host using RACK?
When your database client hangs, what is the output of sockstat -sSPtcp when running it in the jail and when running it on the host?
(In reply to Michael Tuexen from comment #6) Good question. Indeed, a client on the host cannot connect to the database on the same host (via the network socket on 127.0.0.1) either, so it can be easily reproduced.
(In reply to Marek Zarychta from comment #8) OK, good. Jails are not relevant here (this is what I expected). It would be great if you could provide the output of sockstat -sSPtcp when running both on the same host.
(In reply to Michael Tuexen from comment #9)
>sockstat -sSPtcp
I have tested it on another host with the same issue.
Before switching to net.inet.tcp.functions_default=freebsd, with net/freeradius3 started, when it gets stuck connecting to the database:
freeradius radiusd 3310 6 tcp4 127.0.4.1:17353 127.0.4.1:3306 ESTABLISHED rack
mysql mysqld 1277 18 tcp46 *:3306 *:* LISTEN rack
After switching to net.inet.tcp.functions_default=freebsd, with net/freeradius3 started, when it is connected to the database:
freeradius radiusd 3413 6 tcp4 127.0.4.1:31278 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 7 tcp4 127.0.4.1:58141 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 8 tcp4 127.0.4.1:15369 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 9 tcp4 127.0.4.1:62900 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 10 tcp4 127.0.4.1:46794 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 11 tcp4 127.0.4.1:59657 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 12 tcp4 127.0.4.1:31240 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 13 tcp4 127.0.4.1:61097 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 14 tcp4 127.0.4.1:16596 127.0.4.1:3306 ESTABLISHED freebsd
freeradius radiusd 3413 15 tcp4 127.0.4.1:32351 127.0.4.1:3306 ESTABLISHED freebsd
mysql mysqld 1277 18 tcp46 *:3306 *:* LISTEN rack
mysql mysqld 1277 71 tcp4 10.0.0.250:36519 10.0.0.255:3306 SYN_SENT freebsd
mysql mysqld 1277 77 tcp4 127.0.4.1:3306 127.0.4.1:31278 ESTABLISHED rack
mysql mysqld 1277 78 tcp4 127.0.4.1:3306 127.0.4.1:58141 ESTABLISHED rack
mysql mysqld 1277 79 tcp4 127.0.4.1:3306 127.0.4.1:15369 ESTABLISHED rack
mysql mysqld 1277 80 tcp4 127.0.4.1:3306 127.0.4.1:62900 ESTABLISHED rack
mysql mysqld 1277 81 tcp4 127.0.4.1:3306 127.0.4.1:46794 ESTABLISHED rack
mysql mysqld 1277 82 tcp4 127.0.4.1:3306 127.0.4.1:59657 ESTABLISHED rack
mysql mysqld 1277 83 tcp4 127.0.4.1:3306 127.0.4.1:31240 ESTABLISHED rack
mysql mysqld 1277 84 tcp4 127.0.4.1:3306 127.0.4.1:61097 ESTABLISHED rack
mysql mysqld 1277 85 tcp4 127.0.4.1:3306 127.0.4.1:16596 ESTABLISHED rack
mysql mysqld 1277 86 tcp4 127.0.4.1:3306 127.0.4.1:32351 ESTABLISHED rack
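The two listings differ in one telling way. In the failing case the radiusd socket is ESTABLISHED, but mysqld holds only its tcp46 LISTEN socket and nothing for the other end of that connection. In the working case, every radiusd socket has a matching mysqld socket. The Python sketch below flags such one-sided loopback connections automatically. It assumes the nine-column layout shown above (user, command, PID, FD, protocol, local address, foreign address, state, stack), and the helper names are hypothetical. A flagged socket only tells you that no listed socket holds the other end; it does not say why.

```python
from typing import List, NamedTuple


class Sock(NamedTuple):
    user: str
    command: str
    pid: str  # kept as text: sockstat may print "?" for sockets without an owner
    fd: str
    proto: str
    local: str
    foreign: str
    state: str
    stack: str


def parse_sockstat(text: str) -> List[Sock]:
    """Parse rows with exactly nine whitespace-separated fields."""
    socks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or fields[0] == "USER":
            continue  # skip blank lines, headers and rows with another layout
        socks.append(Sock(*fields))
    return socks


def one_sided_loopback(socks: List[Sock], prefix: str = "127.") -> List[Sock]:
    """ESTABLISHED loopback sockets for which no listed socket has the
    reversed (local, foreign) pair, i.e. nobody holds the other end."""
    pairs = {(s.local, s.foreign) for s in socks}
    return [
        s
        for s in socks
        if s.state == "ESTABLISHED"
        and s.foreign.startswith(prefix)
        and (s.foreign, s.local) not in pairs
    ]
```

Fed the two "before" rows above, it flags the radiusd socket with PID 3310. Fed the "after" rows, it flags nothing.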
I'm confused. Which config is working and which is not? Initially you have net.inet.tcp.functions_default=rack and start a server. When you then start the client on the same host (which would mean both use the RACK stack), does that work? Then you switch to net.inet.tcp.functions_default=freebsd and start the client (that would mean the client uses the FreeBSD stack and the server uses the RACK stack). Does that work?
(In reply to Michael Tuexen from comment #11)
>I'm confused. Which config is working and which is not?
So am I. I have done another experiment: I switched to the freebsd stack, restarted the database server, then switched back to rack and restarted the client, and everything works this way. So the only affected (broken) setup is with RACK enabled on both sides (database client and database server) on the same host.
(In reply to Marek Zarychta from comment #12) OK. So you set the default stack to RACK, start the database server, start the database client, and it stalls. Right? What is the output of truss mysql -p -u root -h 127.0.0.1 (or whatever command you use to start the database client)? I guess it must be something special that the client or the server does. Right now I'm focusing on the client.
Created attachment 225883
truss out
(In reply to Michael Tuexen from comment #13)
>What is the output of
>truss mysql -p -u root -h 127.0.0.1
Output attached.
(In reply to Marek Zarychta from comment #14) OK. At the end there is the interaction with the TCP stack. Can you keep the server running using RACK, switch the default stack to FreeBSD, and rerun the command? Then the client would use the default stack and it should work. I would like to see what happens at the client side in this case.
Created attachment 225887
truss out (working)
(In reply to Michael Tuexen from comment #15)
>Can you keep the server running using RACK, switch the default stack to FreeBSD, and rerun the command?
Done, output from truss attached.
(In reply to Marek Zarychta from comment #16) Thanks. Interesting. Could you repeat both tests and capture the packets on the loopback interface? Please provide the .pcap files, not only the text output.
(In reply to Michael Tuexen from comment #17) And it would be interesting to see the output of netstat -sptcp before and after the failing test.
I did:
sudo pkg install mariadb105-server
sudo service mysql-server onestart
mysql -u root -h 127.0.0.1
and I get an "access denied" error message. I did that with RACK as the default stack, all on stable/13. What am I missing to reproduce the issue?
Created attachment 225890
pcap files
(In reply to Michael Tuexen from comment #17)
>Please provide the .pcap files, not only the text output.
Pcap files attached. The server socket was opened with RACK.
(In reply to Michael Tuexen from comment #19)
Affected software includes mariadb103-client-10.3.29 and mariadb103-server-10.3.29, both built from ports.
Created attachment 225891
netstat -sptcp diffed output before and after fail
(In reply to Michael Tuexen from comment #18)
>And it would be interesting to see the output of
>netstat -sptcp
>before and after the failing test.
Diff file attached.
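A plain diff shows which counter lines changed. If the size of each change matters, the following Python sketch computes per-counter deltas between two `netstat -s -p tcp` dumps. It assumes both dumps come from the same command on the same kernel, so their lines can be matched by position. `netstat_delta` is a hypothetical helper, not part of any FreeBSD tool.

```python
import re
from typing import List, Tuple

_NUM = re.compile(r"\d+")


def netstat_delta(before: str, after: str) -> List[Tuple[str, List[int]]]:
    """Compare two `netstat -s -p tcp` dumps line by line (by position).

    For every line whose numbers changed, return the "after" text and the
    per-number deltas, e.g. "5 data packets (160 bytes)" -> [1, 60].
    """
    b_lines, a_lines = before.splitlines(), after.splitlines()
    if len(b_lines) != len(a_lines):
        raise ValueError("dumps differ in shape; were they taken with the same command?")
    changes = []
    for b, a in zip(b_lines, a_lines):
        b_nums = [int(x) for x in _NUM.findall(b)]
        a_nums = [int(x) for x in _NUM.findall(a)]
        if len(b_nums) != len(a_nums):
            raise ValueError(f"line layout changed: {a!r}")
        if b_nums != a_nums:
            changes.append((a.strip(), [y - x for x, y in zip(b_nums, a_nums)]))
    return changes
```

Matching by position rather than by label text avoids trouble with pluralized labels such as "1 packet sent" versus "2 packets sent".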
(In reply to Marek Zarychta from comment #21) I will provide some notes tomorrow on how to set up TCP Black Box logging. Hopefully we can figure out what the server does not like...
(In reply to Michael Tuexen from comment #22) Thanks for the clue. We should figure it out soon. There is still hope that I will be able to renew my Netflix subscription and watch it without issues during Christmas if we are successful enough in spotting this issue ;) Getting back to the point, I have enabled TCP Black Box logging following this description: https://github.com/Netflix/tcplog_dumper. I am using RACK, not BBR, but perhaps it works with RACK as well, since I have caught some data. I will share it, but I am not willing to disclose it publicly, so I will send a link to the pcap file in a private email message.
(In reply to Marek Zarychta from comment #23) Yes, please share them privately. RACK works perfectly fine with black box logging.
Fixed in https://cgit.FreeBSD.org/src/commit/?id=66aec14a5391bda1e9a20f5e4381626797c3e0fb
Thank you for fixing it. I can confirm that the committed patch solves the issue. The bug should probably be closed after the MFC, so I will leave that to the committers. I really appreciate this great gift from Netflix; thanks for upstreaming TCP RACK and for the opportunity to test it for some time.
A commit in branch stable/13 references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=648c68168c24592967df9d9c6a4d79e1792e4562
commit 648c68168c24592967df9d9c6a4d79e1792e4562
Author: Randall Stewart <rrs@FreeBSD.org>
AuthorDate: 2021-06-24 18:42:21 +0000
Commit: Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2021-07-13 18:51:55 +0000
    tcp: Rack not being very friendly with V6:4 socket and having a connection from V4
    There were two bugs that prevented V4 sockets from connecting to a rack server running a V4/V6 socket. As well as a bug that stops the mapped v4 in V6 address from working.
    Reviewed by: tuexen
    Sponsored by: Netflix Inc
    Differential Revision: https://reviews.freebsd.org/D30885
    PR: 256657
    (cherry picked from commit 66aec14a5391bda1e9a20f5e4381626797c3e0fb)
 sys/netinet/tcp_stacks/rack.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
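The commit message describes an IPv4 client connecting to a RACK server that listens on a combined V4/V6 socket. That matches the `tcp46 *:3306` LISTEN entry for mysqld in the sockstat output above. It may also explain why the IPv4-only `nc -4 -k -l 3000` listener suggested earlier did not reproduce the problem.

Below is a minimal Python 3.8+ sketch of that socket layout without MariaDB. It assumes a host with IPv6 support and that net.inet.tcp.functions_default is set to the stack under test, so both ends use it. The function name is hypothetical. The sketch clears IPV6_V6ONLY explicitly rather than relying on the system default.

```python
import socket
from typing import Tuple


def v4_client_to_dual_stack_listener(timeout: float = 5.0) -> Tuple[str, bytes]:
    """Connect a plain IPv4 client to an IPv6 listener that also accepts
    IPv4 (shown as tcp46 by sockstat). Returns the peer address seen by the
    server and the bytes the server received."""
    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # Accept IPv4 connections too; they appear as IPv4-mapped IPv6 addresses.
    srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    srv.bind(("::", 0))  # wildcard address, kernel-chosen port
    srv.listen(1)
    srv.settimeout(timeout)
    port = srv.getsockname()[1]

    # AF_INET client to 127.0.0.1, like `mysql -h 127.0.0.1`.
    cli = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    cli.sendall(b"ping")
    cli.shutdown(socket.SHUT_WR)

    # The kernel completes the handshake before accept() is called.
    conn, peer = srv.accept()
    conn.settimeout(timeout)
    data = b""
    while chunk := conn.recv(4096):
        data += chunk
    for s in (conn, cli, srv):
        s.close()
    return peer[0], data
```

On a working stack this should return `("::ffff:127.0.0.1", b"ping")`. On an affected kernel, one would expect a socket timeout instead, though that has not been verified here.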
Thank you for the MFC.