Created attachment 242225 [details] Example ctl configuration file If two separate processes do "service ctld restart", then they can race. The result is ctl ports that are inaccessible (clients can't connect), and the ports don't get torn down after ctld exits. Attempting to start ctld again fails to fix the stuck ports (though new ports can be added). The only remedy is to restart. Steps to reproduce ================== 1) Create about 32 zvols (i've also observed this bug with file-backed LUNs) 2) Configure /etc/ctl.conf as shown in the attached file 3) Run the following in two separate terminals: for ((i=0; i<10000; i=$i+1)); do service ctld onerestart|| break; done After some time, usually < 1 second, one terminal will fail with an error like this: ctld: LUN modification error: LUN 31 is not managed by the block backend ctld: failed to modify lun "disk31", CTL lun 31 ctld: CTL_LUN_MAP ioctl failed: Device not configured ctld: failed to apply configuration; exiting /etc/rc.d/ctld: WARNING: failed to start ctld Then, kill the loop in the other terminal. Then ensure that no ctld process is running, and do "ctladm portlist". All 32 ports will be shown. Attempting to start ctld one more time will result in an error like this: ctld: error returned from port creation request: target "iqn.2018-10.myhost:disk0" for portal group tag 257 already exists ctld: failed to update port pg0-iqn.2018-10.myhost:disk0
I've discovered a highly undocumented command that can allow one to recover from this situation without a reboot. First shutdown ctld, then remove each iSCSI target port with the undocumented command, and then restart ctld. The command is: ctladm port -d iscsi -r -p DONTCARE -O cfiscsi_portal_group_tag=TAG -O cfiscsi_target=TARGET Where "DONTCARE" must be an integer but otherwise its value does not matter, "TAG" can be found via "ctladm portlist -v" and is typically 257 or greater, and TARGET can also be found via "ctladm portlist -v". I'll update the man page to document this syntax and also add ATF tests for it.
I've confirmed that the cause of the problem is that ctld opens its pidfile too late. It reads the current list of targets from the kernel, then reads the config file, then opens its pidfile, and then applies changes based on the differences between the kernel's state and the config file. But the kernel's state could've changed before the pidfile got opened. I've hacked ctld to open the pidfile earlier and verified that this fixes the problem. However, doing it properly is hard, because the code for opening the config file is intermingled with the code for interacting with the kernel. The biggest problem is the conf_pports list, added in 057abcb00413010898f3046f7704444b8f537bab .
Commit 969876fcee57ea1cb1c7b4d2ee757793cbfbe353 refactored ctld to better separate config file parsing from kernel interaction. Now it's easier to solve this bug. See the linked URL for the patch.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=5f89aea7b74aa4605b25af62e31303097a4a48cc commit 5f89aea7b74aa4605b25af62e31303097a4a48cc Author: Alan Somers <asomers@FreeBSD.org> AuthorDate: 2024-08-07 15:21:08 +0000 Commit: Alan Somers <asomers@FreeBSD.org> CommitDate: 2024-09-18 20:06:31 +0000 ctld: fix several process setup/teardown bugs All of the below bugs could result in a system where ctld is not running, but LUNs and targets still exist in the kernel; a difficult situation to recover from. * open the pidfile earlier. Open the pidfile before reading the kernel's current state, so two racing ctld processes won't step on each others' toes. * close the pidfile later. Close it after tearing down the configuration, for the same reason. * If the configured pidfile changes, then rename it on SIGHUP rather than remove and recreate it. * When running in debug mode, don't close the pidfile while handling a new connection. Only do that in non-debug mode, in the child of the fork. * Register signal handlers earlier. Otherwise a SIGTERM signal received during startup could kill ctld without tearing down the configuration. MFC after: 2 weeks PR: 271460 Sponsored by: Axcient Reviewed by: mav Pull Request: https://github.com/freebsd/freebsd-src/pull/1370 usr.sbin/ctld/ctld.c | 70 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 40 insertions(+), 30 deletions(-)
^Triage: assign to committer for MFC review.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b64d88424dda7785eba815cefcc99709780c2953 commit b64d88424dda7785eba815cefcc99709780c2953 Author: Alan Somers <asomers@FreeBSD.org> AuthorDate: 2024-08-07 15:21:08 +0000 Commit: Alan Somers <asomers@FreeBSD.org> CommitDate: 2024-10-10 19:11:37 +0000 ctld: fix several process setup/teardown bugs All of the below bugs could result in a system where ctld is not running, but LUNs and targets still exist in the kernel; a difficult situation to recover from. * open the pidfile earlier. Open the pidfile before reading the kernel's current state, so two racing ctld processes won't step on each others' toes. * close the pidfile later. Close it after tearing down the configuration, for the same reason. * If the configured pidfile changes, then rename it on SIGHUP rather than remove and recreate it. * When running in debug mode, don't close the pidfile while handling a new connection. Only do that in non-debug mode, in the child of the fork. * Register signal handlers earlier. Otherwise a SIGTERM signal received during startup could kill ctld without tearing down the configuration. PR: 271460 Sponsored by: Axcient Reviewed by: mav Pull Request: https://github.com/freebsd/freebsd-src/pull/1370 (cherry picked from commit 5f89aea7b74aa4605b25af62e31303097a4a48cc) usr.sbin/ctld/ctld.c | 70 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 40 insertions(+), 30 deletions(-)
thanks Alan, this also worked for me to get unwedged without reboot.