Bug 271460 - ctld ports become inaccessible due to concurrent service restarts
Summary: ctld ports become inaccessible due to concurrent service restarts
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Alan Somers
URL: https://github.com/freebsd/freebsd-sr...
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-16 23:06 UTC by Alan Somers
Modified: 2024-11-11 17:34 UTC (History)
1 user (show)

See Also:
asomers: mfc-stable14+
asomers: mfc-stable13?


Attachments
Example ctl configuration file (7.97 KB, text/plain)
2023-05-16 23:06 UTC, Alan Somers
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Somers freebsd_committer freebsd_triage 2023-05-16 23:06:51 UTC
Created attachment 242225 [details]
Example ctl configuration file

If two separate processes do "service ctld restart", then they can race.  The result is ctl ports that are inaccessible (clients can't connect), and the ports don't get torn down after ctld exits.  Attempting to start ctld again fails to fix the stuck ports (though new ports can be added).  The only remedy is to restart.

Steps to reproduce
==================
1) Create about 32 zvols (i've also observed this bug with file-backed LUNs)
2) Configure /etc/ctl.conf as shown in the attached file
3) Run the following in two separate terminals:
for ((i=0; i<10000; i=$i+1)); do  service ctld onerestart|| break; done

After some time, usually < 1 second, one terminal will fail with an error like this:
ctld: LUN modification error: LUN 31 is not managed by the block backend
ctld: failed to modify lun "disk31", CTL lun 31
ctld: CTL_LUN_MAP ioctl failed: Device not configured
ctld: failed to apply configuration; exiting
/etc/rc.d/ctld: WARNING: failed to start ctld


Then, kill the loop in the other terminal.  Then ensure that no ctld process is running, and do "ctladm portlist".  All 32 ports will be shown.  Attempting to start ctld one more time will result in an error like this:

ctld: error returned from port creation request: target "iqn.2018-10.myhost:disk0" for portal group tag 257 already exists
ctld: failed to update port pg0-iqn.2018-10.myhost:disk0
Comment 1 Alan Somers freebsd_committer freebsd_triage 2024-06-06 16:16:17 UTC
I've discovered a highly undocumented command that can allow one to recover from this situation without a reboot.  First shutdown ctld, then remove each iSCSI target port with the undocumented command, and then restart ctld.  The command is:

ctladm port -d iscsi -r -p DONTCARE -O cfiscsi_portal_group_tag=TAG -O cfiscsi_target=TARGET

Where "DONTCARE" must be an integer but otherwise its value does not matter, "TAG" can be found via "ctladm portlist -v" and is typically 257 or greater, and TARGET can also be found via "ctladm portlist -v".

I'll update the man page to document this syntax and also add ATF tests for it.
Comment 2 Alan Somers freebsd_committer freebsd_triage 2024-06-12 19:44:27 UTC
I've confirmed that the cause of the problem is that ctld opens its pidfile too late.  It reads the current list of targets from the kernel, then reads the config file, then opens its pidfile, and then applies changes based on the differences between the kernel's state and the config file.  But the kernel's state could've changed before the pidfile got opened.

I've hacked ctld to open the pidfile earlier and verified that this fixes the problem.  However, doing it properly is hard, because the code for opening the config file is intermingled with the code for interacting with the kernel.  The biggest problem is the conf_pports list, added in 057abcb00413010898f3046f7704444b8f537bab .
Comment 3 Alan Somers freebsd_committer freebsd_triage 2024-08-07 17:00:11 UTC
Commit 969876fcee57ea1cb1c7b4d2ee757793cbfbe353 refactored ctld to better separate config file parsing from kernel interaction.  Now it's easier to solve this bug.  See the linked URL for the patch.
Comment 4 commit-hook freebsd_committer freebsd_triage 2024-09-19 01:28:06 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5f89aea7b74aa4605b25af62e31303097a4a48cc

commit 5f89aea7b74aa4605b25af62e31303097a4a48cc
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2024-08-07 15:21:08 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2024-09-18 20:06:31 +0000

    ctld: fix several process setup/teardown bugs

    All of the below bugs could result in a system where ctld is not
    running, but LUNs and targets still exist in the kernel; a difficult
    situation to recover from.

    * open the pidfile earlier.  Open the pidfile before reading the
      kernel's current state, so two racing ctld processes won't step on
      each others' toes.

    * close the pidfile later.  Close it after tearing down the
      configuration, for the same reason.

    * If the configured pidfile changes, then rename it on SIGHUP rather
      than remove and recreate it.

    * When running in debug mode, don't close the pidfile while handling a
      new connection.  Only do that in non-debug mode, in the child of the
      fork.

    * Register signal handlers earlier.  Otherwise a SIGTERM signal received
      during startup could kill ctld without tearing down the configuration.

    MFC after:      2 weeks
    PR:             271460
    Sponsored by:   Axcient
    Reviewed by:    mav
    Pull Request:   https://github.com/freebsd/freebsd-src/pull/1370

 usr.sbin/ctld/ctld.c | 70 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 30 deletions(-)
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2024-10-01 01:12:00 UTC
^Triage: assign to committer for MFC review.
Comment 6 commit-hook freebsd_committer freebsd_triage 2024-10-10 19:17:56 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b64d88424dda7785eba815cefcc99709780c2953

commit b64d88424dda7785eba815cefcc99709780c2953
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2024-08-07 15:21:08 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2024-10-10 19:11:37 +0000

    ctld: fix several process setup/teardown bugs

    All of the below bugs could result in a system where ctld is not
    running, but LUNs and targets still exist in the kernel; a difficult
    situation to recover from.

    * open the pidfile earlier.  Open the pidfile before reading the
      kernel's current state, so two racing ctld processes won't step on
      each others' toes.

    * close the pidfile later.  Close it after tearing down the
      configuration, for the same reason.

    * If the configured pidfile changes, then rename it on SIGHUP rather
      than remove and recreate it.

    * When running in debug mode, don't close the pidfile while handling a
      new connection.  Only do that in non-debug mode, in the child of the
      fork.

    * Register signal handlers earlier.  Otherwise a SIGTERM signal received
      during startup could kill ctld without tearing down the configuration.

    PR:             271460
    Sponsored by:   Axcient
    Reviewed by:    mav
    Pull Request:   https://github.com/freebsd/freebsd-src/pull/1370

    (cherry picked from commit 5f89aea7b74aa4605b25af62e31303097a4a48cc)

 usr.sbin/ctld/ctld.c | 70 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 30 deletions(-)
Comment 7 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-11 13:41:58 UTC
thanks Alan, this also worked for me to get unwedged without reboot.