Bug 261387 - Should cam be calling callout drain for struct cam_sim and struct cam_ed?
Summary: Should cam be calling callout drain for struct cam_sim and struct cam_ed?
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: freebsd-scsi (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-01-21 17:33 UTC by Herbie.Robinson
Modified: 2022-01-28 15:38 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Herbie.Robinson 2022-01-21 17:33:17 UTC
The documentation for callout says that callout_drain must be called before destroying any storage containing a callout structure.  I noticed the CAM doesn't do that for struct cam_sim and struct cam_id.

I pose this bug report as a question, because it does seem a little scary to fix it and it's certainly beyond my level of experience with the FreeBSD kernel.

The simple patches to fix this would be

--- cam_sim.c
+++ cam_sim.c
@@ -165,6 +165,7 @@

     if (free_devq)
          cam_simq_free(sim->devq);
+    callout_drain(&sim->callout);
     free(sim, M_CAMSIM);
 }

--- cam_xpt.c
+++ cam_xpt.c
@@ -4831,6 +4831,7 @@
 {
     struct cam_ed  *device = context;

+    callout_drain(&device->callout);
     mtx_lock(&device->device_mtx);
     mtx_destroy(&device->device_mtx);
     free(device, M_CAMDEV);

This is relative to the main branch https://cgit.freebsd.org/src/commit/?id=b252fb24301c1f7e7d83eab631e7d9fa947e227d.

The routines patched are cam_sim_free and xpt_destroy_device.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2022-01-25 15:38:06 UTC
Looks like the cam_sim callout has been removed: https://cgit.freebsd.org/src/commit/?id=28027f28e607012b83ee9062eed3b8ed82e819c1

I think you're right that it should have been drained, but it's somewhat moot since it apparently wasn't used.

The cam_ed callout is a bit more complicated: it's interlocked by the device's SIM's devq lock, dev->sim->devq->send_mtx.  So the callout handler, xpt_release_devq_timeout(), runs with send_mtx held, as does xpt_release_device() when it stops the callout.  The interlock is mostly sufficient to ensure that the callout does not continue running.  It's not completely sufficient, though, since xpt_release_devq_timeout() may drop the send_mtx lock to schedule I/O (presumably from a different device).  During this window, xpt_release_device() may try to stop the callout, fail, and continue destroying the device, while the callout is still running.

It's not very clear whether it's ok to free a callout while the handler is running.  At that point the callout subsystem no longer references the callout, and xpt_release_devq_timeout() presumably won't access a "doomed" device after dropping the send lock, since it's shouldn't appear in the devq anymore.

So I suspect that it's mostly a theoretical problem, though the callout documentation really doesn't make this clear.
Comment 2 Herbie.Robinson 2022-01-27 17:54:43 UTC
It's good that the callout was removed from the sim.  Calling callout_drain from that removal routine would have been scary...

I agree on the mostly theoretical aspect; however, on systems with huge caches and many cores, theory can become practical on rare occasions.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2022-01-28 15:38:09 UTC
(In reply to Herbie.Robinson from comment #2)
"theoretical" was a bad word choice.  I don't mean that this becomes a problem with sufficiently large timing delays or anything like that.  (At least, I can't see how it does.)  Rather, this code seemingly violates the implicit contract that consumers have with the callout subsystem, that callout structures shouldn't be freed while the callout handler is running.  In fact, I believe this is safe to do in this particular case.  softclock_call_cc(), the function that actually invokes the callout handler, does not access the callout structure after calling the handler except in two cases which do not apply to the cam_ed callout.

So while it's formally correct to drain the callout before destroying the device in cam_destroy_device(), I can't see how it would fix any observed problems.  To be clear, I don't disagree with the patch.