Bug 250725 - databases/mariadb103-server: wsrep_sst_rsync checking for lsof
Summary: databases/mariadb103-server: wsrep_sst_rsync checking for lsof
Status: New
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: Bernard Spil
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-29 16:15 UTC by Andrew Nicks
Modified: 2021-07-29 13:59 UTC
CC List: 1 user

See Also:
Flags: bugzilla: maintainer-feedback? (brnrd)


Description Andrew Nicks 2020-10-29 16:15:55 UTC
wsrep_sst_rsync has FreeBSD-specific code in the check_pid_and_port function that uses sockstat instead of lsof.

However, there is still a wsrep_check_programs call looking for lsof in the main script, so if lsof is not in the PATH the script still fails.
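
To illustrate the mismatch, here is a simplified sketch of the two relevant pieces of wsrep_sst_rsync (a sketch only, not the exact upstream code):

# Simplified sketch -- not the exact upstream wsrep_sst_rsync code.

check_pid_and_port()
{
    local rsync_port="$1"

    case "$(uname -s)" in
    FreeBSD)
        # FreeBSD-specific branch: sockstat is used, lsof is never called here
        sockstat -l -P tcp | grep -q ":${rsync_port}"
        ;;
    *)
        # Other platforms query the port with lsof
        lsof -Pnl -i ":${rsync_port}" | grep -q rsync
        ;;
    esac
}

# ...yet the main body of the script still lists lsof as required on every
# platform, so the whole SST aborts on FreeBSD when lsof is not installed:
wsrep_check_programs lsof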

I removed the check, which allowed my cluster to sync successfully. The check is clearly not required on FreeBSD.
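
For reference, the edit amounts to something like the following (a sketch only; in my case I simply deleted the lsof requirement, but a platform guard would keep it for systems whose port check still uses lsof):

# Instead of requiring lsof unconditionally in the main script...
#   wsrep_check_programs lsof
# ...only require it where check_pid_and_port actually uses it; the FreeBSD
# branch relies on sockstat, which is in the base system.
case "$(uname -s)" in
FreeBSD)
    # sockstat path: lsof not required
    ;;
*)
    wsrep_check_programs lsof
    ;;
esac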

I presume this will be the same in other versions besides 10.3.

Thanks
Andrew
Comment 1 Andrew Nicks 2021-07-29 13:59:26 UTC
The upstream update to MariaDB 10.3.30 seems to have reworked the WSREP_SST scripts, so the specific point described above is no longer an issue.

The new version, however, shows the same issue as highlighted in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256112 for 10.6: the state transfer is never requested after rsync is started, and the joiner node becomes stuck waiting for the unrequested transfer. As such, while this issue may be resolved, it is not testable.
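
For anyone trying to reproduce this: assuming the rsync SST is on its usual default port (4444, unless overridden), the stuck state on the joiner looks like rsync listening with no transfer ever progressing; a diagnostic sketch, not taken from the linked reports:

# On the stuck joiner, the rsync daemon started for the SST is listening...
sockstat -4 -l -P tcp -p 4444
# ...and the SST script is still running, but no transfer ever progresses:
pgrep -lf wsrep_sst_rsync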

My logs match those reported in https://jira.mariadb.org/browse/MDEV-26101, so it seems likely the same bug affects all active versions that include the reworked scripts.

I noticed the issue when trying to add a new node (a fresh FreeBSD 12.2 install, so a full SST was required) to the existing cluster.

The existing cluster has two FreeBSD 11 nodes still running MariaDB 10.3.29, plus another node already upgraded to FreeBSD 12.2 and MariaDB 10.3.30. It is likely that during that node's upgrade there was no state change on the database, so IST wasn't needed. The goal was to add the new nodes and then retire the old servers.

While troubleshooting, I also dropped the existing node running 10.3.30 out of the cluster and made sure the DB state changed while it was out. Restarting the node left it in the same stuck state while attempting IST.

The same test on a node still running 10.3.29 worked normally.

I replaced the WSREP_SST scripts on the 10.3.30 nodes with those from 10.3.29, and both the new node (via SST) and the IST rejoin completed without further issue.
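
For anyone needing the same workaround: the SST scripts are plain shell scripts installed next to the server binaries (under /usr/local/bin with the FreeBSD port, as far as I can tell), so reverting them is just a file copy from a 10.3.29 tree or backup, roughly:

# Rough outline of the workaround. The /usr/local/bin location and the
# mysql-server rc script name are assumptions based on the FreeBSD port,
# and /path/to/10.3.29 stands in for wherever the older scripts live
# (a backup, or an extracted 10.3.29 package).
cd /usr/local/bin
for s in wsrep_sst_common wsrep_sst_rsync; do
    cp "${s}" "${s}.10.3.30.bak"        # keep the shipped versions around
    cp "/path/to/10.3.29/${s}" "${s}"   # drop in the 10.3.29 scripts
    chmod 755 "${s}"
done
service mysql-server restart    # restart so new SST/IST attempts pick them up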