Bug 271460 - ctld ports become inaccessible due to concurrent service restarts
Summary: ctld ports become inaccessible due to concurrent service restarts
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-16 23:06 UTC by Alan Somers
Modified: 2024-06-12 19:44 UTC (History)

See Also:


Attachments
Example ctl configuration file (7.97 KB, text/plain)
2023-05-16 23:06 UTC, Alan Somers

Description Alan Somers freebsd_committer freebsd_triage 2023-05-16 23:06:51 UTC
Created attachment 242225
Example ctl configuration file

If two separate processes run "service ctld restart" at the same time, they can race.  The result is ctl ports that are inaccessible (clients can't connect) and that don't get torn down after ctld exits.  Starting ctld again does not fix the stuck ports (though new ports can be added).  The only remedy is to reboot.

Steps to reproduce
==================
1) Create about 32 zvols (I've also observed this bug with file-backed LUNs)
2) Configure /etc/ctl.conf as shown in the attached file
3) Run the following in two separate terminals (the loop uses bash syntax):
for ((i=0; i<10000; i++)); do service ctld onerestart || break; done

After some time, usually less than 1 second, one terminal will fail with an error like this:
ctld: LUN modification error: LUN 31 is not managed by the block backend
ctld: failed to modify lun "disk31", CTL lun 31
ctld: CTL_LUN_MAP ioctl failed: Device not configured
ctld: failed to apply configuration; exiting
/etc/rc.d/ctld: WARNING: failed to start ctld


Next, kill the loop in the other terminal, ensure that no ctld process is running, and run "ctladm portlist".  All 32 ports will still be shown.  Attempting to start ctld one more time will result in an error like this:

ctld: error returned from port creation request: target "iqn.2018-10.myhost:disk0" for portal group tag 257 already exists
ctld: failed to update port pg0-iqn.2018-10.myhost:disk0
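
Step 1 of the reproduction can be scripted. The following is only a sketch: the pool name "tank", the volume names "diskN", and the 1G size are assumptions and must match the paths used in /etc/ctl.conf. create_zvols is a hypothetical helper, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: create the zvols for step 1 of the reproduction.
# Assumptions: pool name, "diskN" volume names, and the 1G size are
# placeholders; adjust them to match /etc/ctl.conf.
# ZFS defaults to "echo zfs", so commands are printed, not executed.
ZFS="${ZFS:-echo zfs}"

create_zvols() {
    pool="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        # zfs create -V <size> <volume> creates a zvol of the given size.
        $ZFS create -V 1G "${pool}/disk${i}" || return 1
        i=$((i + 1))
    done
}
```

Call it as "create_zvols tank 32" to review the commands, and set ZFS=zfs in the environment to actually create the volumes.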
Comment 1 Alan Somers freebsd_committer freebsd_triage 2024-06-06 16:16:17 UTC
I've discovered an undocumented command that allows recovery from this situation without a reboot.  First shut down ctld, then remove each iSCSI target port with the undocumented command, and then restart ctld.  The command is:

ctladm port -d iscsi -r -p DONTCARE -O cfiscsi_portal_group_tag=TAG -O cfiscsi_target=TARGET

Where "DONTCARE" must be an integer but otherwise its value does not matter, "TAG" can be found via "ctladm portlist -v" and is typically 257 or greater, and TARGET can also be found via "ctladm portlist -v".
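
With 32 stuck ports, the removal is easier in a loop. A minimal sketch, assuming ctld is already stopped and that the tag and target names were read by hand from "ctladm portlist -v"; remove_stuck_ports is a hypothetical helper, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: remove stuck iSCSI ports one target at a time.
# Assumptions: ctld is stopped; the tag and target names come from
# "ctladm portlist -v". remove_stuck_ports is a hypothetical helper.
# CTLADM defaults to "echo ctladm", so commands are printed, not executed.
CTLADM="${CTLADM:-echo ctladm}"

remove_stuck_ports() {
    tag="$1"
    shift
    for target in "$@"; do
        # "-p 0": the value must be an integer but is otherwise ignored.
        $CTLADM port -d iscsi -r -p 0 \
            -O "cfiscsi_portal_group_tag=${tag}" \
            -O "cfiscsi_target=${target}" || return 1
    done
}
```

For example, "remove_stuck_ports 257 iqn.2018-10.myhost:disk0 iqn.2018-10.myhost:disk1" prints the two removal commands; set CTLADM=ctladm in the environment to execute them.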

I'll update the man page to document this syntax and also add ATF tests for it.
Comment 2 Alan Somers freebsd_committer freebsd_triage 2024-06-12 19:44:27 UTC
I've confirmed that the cause of the problem is that ctld opens its pidfile too late.  It reads the current list of targets from the kernel, then reads the config file, then opens its pidfile, and then applies changes based on the differences between the kernel's state and the config file.  But another ctld instance can change the kernel's state after the list is read and before the pidfile is opened, so the changes are computed against stale state.
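
Until ctld itself takes its pidfile earlier, callers could serialize their restarts so that no two of them overlap. This is a caller-side workaround sketch, not the ctld fix, and whether it avoids every instance of the race is untested here. run_exclusive and the lock path are hypothetical; the lock relies on mkdir being atomic, and a stale lock directory left by a killed caller has to be removed by hand.

```shell
#!/bin/sh
# Sketch: run a command while holding a simple mkdir-based lock.
# run_exclusive is a hypothetical helper; the lock path is the caller's choice.
run_exclusive() {
    lockdir="$1"
    shift
    # mkdir either creates the directory or fails, atomically, so only one
    # caller at a time gets past this loop.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    status=0
    "$@" || status=$?
    # Release the lock and hand back the command's exit status.
    rmdir "$lockdir"
    return "$status"
}
```

For example, each terminal could run "run_exclusive /var/run/ctld-restart.lock service ctld onerestart" instead of calling service directly.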

I've hacked ctld to open the pidfile earlier and verified that this fixes the problem.  However, doing it properly is hard, because the code for opening the config file is intermingled with the code for interacting with the kernel.  The biggest problem is the conf_pports list, added in 057abcb00413010898f3046f7704444b8f537bab.