Bug 254437 - Deadlock in ses_set_elm_status
Summary: Deadlock in ses_set_elm_status
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Alan Somers
Depends on:
Reported: 2021-03-20 16:45 UTC by Alan Somers
Modified: 2021-07-29 21:57 UTC

See Also:

Drop enc_cache_lock before cam_periph_sleep (3.70 KB, patch)
2021-07-29 21:57 UTC, Alan Somers

Description Alan Somers freebsd_committer 2021-03-20 16:45:58 UTC
I ran `sudo sesutil fault all off` on a 12.1-STABLE system with 18 SES expanders and lots of disks.  It hung.  Procstat shows the following:

49857 103730 sesutil             -                   mi_switch+0xd4 sleepq_wait+0x2c _sleep+0x253 ses_set_elm_status+0x86 enc_ioctl+0x4f1 devfs_ioctl+0xb0 VOP_IOCTL_APV+0x7b vn_ioctl+0x16a devfs_ioctl_f+0x1e kern_ioctl+0x2b7 sys_ioctl+0xfa amd64_syscall+0x387 fast_syscall_common+0xf8

   55 100353 enc_daemon8         -                   mi_switch+0xd4 sleepq_wait+0x2c _sx_xlock_hard+0x3ee ses_publish_cache+0x1d1 enc_daemon+0x37f fork_exit+0x7e fork_trampoline+0xe 

It looks like sesutil acquired enc->enc_cache_lock in enc_ioctl at line 438 (line numbers correspond to 13.0-RC2 sources), then blocked in cam_periph_sleep(enc->periph, &req, PUSER, "encstat", 0) in ses_set_elm_status at line 2794.  Meanwhile, enc_daemon is blocked trying to acquire enc->enc_cache_lock in ses_publish_cache at line 1971.  But enc_daemon itself is responsible for waking up sesutil, via the wakeups in either ses_fill_control_request or ses_process_control_request, so neither thread can make progress.
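
For readers who want to see the shape of the problem outside the kernel, here is a minimal userland sketch using pthreads. It is not the kernel code: cache_lock, periph_lock, req_cv, daemon_thread and wait_holding_cache_lock are hypothetical stand-ins for enc_cache_lock, the periph mutex, the sleep channel, enc_daemon and the ioctl path. A 200 ms timeout replaces the real indefinite sleep so that the sketch terminates instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userland stand-ins, not the kernel's types. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays enc_cache_lock */
static pthread_mutex_t periph_lock = PTHREAD_MUTEX_INITIALIZER; /* plays the periph mutex */
static pthread_cond_t req_cv = PTHREAD_COND_INITIALIZER;        /* plays the sleep channel */
static bool req_done;

/* Plays enc_daemon: it needs cache_lock before it can issue the wakeup. */
static void *
daemon_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cache_lock);	/* blocks while the requester holds it */
	pthread_mutex_unlock(&cache_lock);

	pthread_mutex_lock(&periph_lock);
	req_done = true;
	pthread_cond_broadcast(&req_cv);
	pthread_mutex_unlock(&periph_lock);
	return (NULL);
}

/*
 * Plays the ioctl path: wait for the wakeup while still holding cache_lock.
 * Returns ETIMEDOUT if no wakeup arrived within 200 ms, 0 otherwise.
 */
static int
wait_holding_cache_lock(void)
{
	pthread_t td;
	struct timespec deadline;
	int error = 0;

	pthread_mutex_lock(&cache_lock);
	pthread_create(&td, NULL, daemon_thread, NULL);

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_nsec += 200L * 1000 * 1000;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&periph_lock);
	while (!req_done && error == 0)
		error = pthread_cond_timedwait(&req_cv, &periph_lock, &deadline);
	pthread_mutex_unlock(&periph_lock);

	pthread_mutex_unlock(&cache_lock);	/* only now can the daemon proceed */
	pthread_join(td, NULL);
	return (error);
}
```

In this sketch the wait always times out, because the only thread that can post the wakeup is stuck behind the lock the waiter is holding. With an untimed sleep, as in the trace above, that becomes a permanent hang.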
Comment 1 Alan Somers freebsd_committer 2021-07-15 16:41:44 UTC
I just hit it again on 12.2-RELEASE.  I'm going to try to fix it, maybe in August.
Comment 2 Alan Somers freebsd_committer 2021-07-28 23:16:45 UTC
I could not reproduce the problem on 14.0-CURRENT with a GENERIC kernel after about 2 hours of trying.  However, with a GENERIC-NODEBUG kernel, it reproduced in two minutes.
Comment 3 Alan Somers freebsd_committer 2021-07-29 21:57:26 UTC
Created attachment 226787
Drop enc_cache_lock before cam_periph_sleep

This patch fixes the deadlock by dropping enc_cache_lock before calling cam_periph_sleep.  With this patch, I can do "sesutil fault all on; sesutil fault all off" in a tight loop for 3 hours, whereas before it would deadlock within a few minutes.  However, it exposed another problem.  About once an hour, a sesutil process hangs because of a missing wakeup.  I haven't yet figured out why the wakeups are missing, but I don't think they were introduced by this patch.
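
The idea of dropping the lock before sleeping can be sketched in userland with pthreads. This is an illustration of the pattern only, not the attached patch: cache_lock, periph_lock, req_cv, daemon_thread and wait_after_dropping_cache_lock are hypothetical stand-ins, and the sketch says nothing about whether or where the real code reacquires enc_cache_lock afterwards.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland stand-ins, not the kernel's types. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays enc_cache_lock */
static pthread_mutex_t periph_lock = PTHREAD_MUTEX_INITIALIZER; /* plays the periph mutex */
static pthread_cond_t req_cv = PTHREAD_COND_INITIALIZER;        /* plays the sleep channel */
static bool req_done;

/* Plays enc_daemon: takes cache_lock, then posts the wakeup. */
static void *
daemon_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cache_lock);
	pthread_mutex_unlock(&cache_lock);

	pthread_mutex_lock(&periph_lock);
	req_done = true;
	pthread_cond_broadcast(&req_cv);
	pthread_mutex_unlock(&periph_lock);
	return (NULL);
}

/*
 * Plays the ioctl path with the lock dropped before the sleep.
 * Returns 0 once the wakeup has been observed.
 */
static int
wait_after_dropping_cache_lock(void)
{
	pthread_t td;

	pthread_mutex_lock(&cache_lock);
	/* ... work that needs the cache would go here ... */
	pthread_create(&td, NULL, daemon_thread, NULL);
	pthread_mutex_unlock(&cache_lock);	/* drop before sleeping */

	pthread_mutex_lock(&periph_lock);
	while (!req_done)			/* predicate loop guards against lost wakeups */
		pthread_cond_wait(&req_cv, &periph_lock);
	pthread_mutex_unlock(&periph_lock);

	pthread_join(td, NULL);
	return (req_done ? 0 : -1);
}
```

Because the waiter no longer holds cache_lock while it sleeps, the daemon stand-in can take the lock, finish its work and post the wakeup. The flag is checked under periph_lock in a loop, so a wakeup posted before the waiter goes to sleep is not lost in this sketch.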