Bug 255445 - lang/python 3.8/3.9 SIGSEGV core dumps in libthr TrueNAS
Summary: lang/python 3.8/3.9 SIGSEGV core dumps in libthr TrueNAS
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-python (Nobody)
URL:
Keywords: crash, needs-patch, needs-qa
Depends on:
Blocks:
 
Reported: 2021-04-27 18:41 UTC by yocalebo@gmail.com
Modified: 2021-05-10 12:11 UTC (History)
3 users

See Also:
bugzilla: maintainer-feedback? (python)


Attachments
python38-fix-threading-v1.patch (6.90 KB, patch)
2021-04-29 16:18 UTC, Kai Knoblich
no flags

Description yocalebo@gmail.com 2021-04-27 18:41:24 UTC
Seeing many TrueNAS (previously FreeNAS) users dump core on the main middlewared process (python) starting with our version 12.0 release.

Relevant OS information:
12.2-RELEASE-p6 FreeBSD 12.2-RELEASE-p6 f2858df162b(HEAD) TRUENAS  amd64

Python versions that experience the core dump:
Python 3.8.7
Python 3.9.4

When initially researching this, I found a regression with threading on Python 3.8 on FreeBSD and was able to resolve that particular problem by backporting these commits: https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8 and https://github.com/python/cpython/commit/9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6.

I backported those commits because all of the core dumps that I've analyzed are crashing in the same spot (or very close to it). For example, here are two backtraces showing a null-pointer dereference.

Core was generated by `python3.8: middlewared'.
 Program terminated with signal SIGSEGV, Segmentation fault.
 #0 cond_signal_common (cond=<optimized out>) at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:457
warning: Source file is more recent than executable.
 457 mp = td->mutex_obj;
 [Current thread is 1 (LWP 100733)]
 (gdb) list
 452                _sleepq_unlock(cvp);
 453                    return (0);
 454                }
 455
 456                td = _sleepq_first(sq);
 457                mp = td->mutex_obj;
 458                cvp->__has_user_waiters = _sleepq_remove(sq, td);
 459                if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
 460                    if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
 461                        _thr_wake_all(curthread->defer_waiters, 

(gdb) p *td
Cannot access memory at address 0x0


and another one
Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>) at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:459
warning: Source file is more recent than executable.
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
[Current thread is 1 (LWP 101105)]
(gdb) list
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
462                                 curthread->nwaiter_defer);
463                             curthread->nwaiter_defer = 0;
(gdb) p *mp
Cannot access memory at address 0x0

I'm trying to instrument a program to "stress" test threading (tearing down and recreating threads, etc.) but I've been unsuccessful at tickling this particular problem. The end users that have seen this core dump sometimes go a month or more without a problem. I'm hoping someone more knowledgeable can at least give me a pointer or help me figure this one out. I have access to my VM that has all the relevant core dumps available, so if someone needs remote access to it to "poke" around, please let me know. You can reach me at caleb [at] ixsystems.com
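For reference, the kind of stress test described above could be sketched as follows. This is a hypothetical illustration, not the reporter's actual instrumentation, and it has not been shown to trigger the crash; it simply exercises condition-variable signalling (which ends up in libthr's cond_signal_common) under repeated thread creation and teardown.

```python
# Hypothetical stress-test sketch (not the reporter's script): each
# iteration creates a batch of threads that signal a condition variable
# (pthread_cond_signal in libthr underneath) and then tears them down.
import threading

def churn_once(nthreads=8):
    cond = threading.Condition()
    signalled = []

    def worker():
        with cond:
            signalled.append(threading.get_ident())
            cond.notify()  # reaches cond_signal_common in libthr

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(signalled)
```

Running `churn_once()` in a long loop is one way to churn thread creation/destruction, though as noted above this has not reproduced the dump.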
Comment 1 Li-Wen Hsu freebsd_committer 2021-04-27 19:01:56 UTC
(In reply to yocalebo@gmail.com from comment #0)
Thanks for reporting this. To narrow down the cause and to verify the fix, do you think it's possible to:

- Produce a minimal test case that reproduces the issue.
- Test it on vanilla FreeBSD, here 12.2-RELEASE; it would be nice if it could also be tested on 13.0-RELEASE and -CURRENT.

I know that debugging threading issues is hard and that you're trying to do the first item, but we first need to reproduce this on an unmodified FreeBSD.

Also, you mentioned backporting the two commits; does that help?
Comment 2 yocalebo@gmail.com 2021-04-27 19:22:03 UTC
(In reply to Li-Wen Hsu from comment #1)

Hi Li-Wen Hsu, thanks for the feedback. I will definitely use the stock GENERIC FreeBSD 12.2 kernel. The hard part is instrumenting a program that causes the SIGSEGV :(.

I can, however, confirm that stock 12.2 has a regression with threads on py3.8.7. You can use this script to test it.

from multiprocessing.pool import ThreadPool
from datetime import datetime

class Broken(object):
    def __init__(self):
        self.pool = ThreadPool()

    def __del__(self):
        self.pool.close()
        self.pool.join()

ohno = Broken()
date = datetime.now()

print(str(date), ohno)

Running that 52 times caused an indefinite hang ~19.2% of the time. After cherry-picking the 2 commits I linked above, I confirmed that it resolved the "hangs". It never did reproduce a core dump.

Anyways, I'll keep trying to write a program that tickles the problem. If you or anyone reading this has any ideas, I'm all ears and will happily try anything at this point :)
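For anyone else trying to reproduce this, the shell-loop approach can be wrapped in a small harness so an indefinite hang shows up as a timeout instead of blocking the terminal. This helper is hypothetical (not part of the report); the script name and the 30-second timeout are assumptions.

```python
# Hypothetical reproduction harness: runs a reproducer such as the
# script above in a subprocess with a timeout, so an indefinite hang
# is classified instead of blocking the calling shell forever.
import subprocess
import sys

def run_once(argv, timeout=30):
    """Classify one run of argv as 'ok', 'fail', 'crash', or 'hang'."""
    try:
        proc = subprocess.run(
            argv,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    if proc.returncode < 0:
        # A negative return code means the child died from a signal,
        # e.g. -11 for SIGSEGV.
        return "crash"
    return "ok" if proc.returncode == 0 else "fail"
```

Calling `run_once([sys.executable, "script.py"])` 52 times and counting the `"hang"` results would give a hang rate comparable to the ~19.2% reported above.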
Comment 3 Li-Wen Hsu freebsd_committer 2021-04-27 19:29:53 UTC
(In reply to yocalebo@gmail.com from comment #2)
> Running that 52 times caused an indefinite hang ~19.2% of the time. After cherry-picking the 2 commits I linked above, I confirmed that it resolved the "hangs". It never did reproduce a core dump.

If this is the case, I think we need to backport this to the current python 3.8 (and 3.9?) ports. Has this been done upstream? We can do it locally, but we also need to check that it won't cause other regressions. It would be nice to work with upstream to reduce the possible regressions and have the next version of 3.8 include the fix.
Comment 4 yocalebo@gmail.com 2021-04-27 19:37:32 UTC
(In reply to Li-Wen Hsu from comment #3)

Upstream Python 3.9 already includes these patches by default; they were never backported to 3.8. I'm fairly confident there are no regressions because:
1. they were committed to upstream Python for version 3.9
2. as of yesterday, there are 21,736 users running py3.8.7 with those 2 backported commits, and we've seen no further reports of regressions.
Comment 5 Li-Wen Hsu freebsd_committer 2021-04-27 19:41:30 UTC
(In reply to yocalebo@gmail.com from comment #4)
Thanks for the numbers; it sounds like we should have those patches included. Do you think you can ask whether they also want to backport them to 3.8, or do you mean they have been asked but don't want to do so?
Comment 6 yocalebo@gmail.com 2021-04-27 19:51:59 UTC
(In reply to Li-Wen Hsu from comment #5)
Unfortunately, I'm not sure why they did not include the threading patch in Python 3.8. I suspect it's because the use case is relatively small (daemon=True threads and how they're destroyed/cleaned up by the calling thread) and they were very late in the release process for 3.8? I'm not sure; it's hard to know.
Comment 7 Kubilay Kocak freebsd_committer freebsd_triage 2021-04-28 03:42:23 UTC
(In reply to yocalebo@gmail.com from comment #4)

Upstream has a merge policy of the latest minor version plus the two previous minors. If you can identify the issues and/or PRs associated with those two missing merge commits, I can take care of liaising with upstream to get them merged in the long term.

Thank you for the excellent report, assessment and upstream research.
Comment 8 Kubilay Kocak freebsd_committer freebsd_triage 2021-04-28 03:44:47 UTC
(In reply to yocalebo@gmail.com from comment #6)

I run FreeBSD 12.x and CURRENT for upstream CPython, which tests all non-EoL branches (latest and back two versions), so I'm surprised by this report. This would appear to hint at something else contributing to the issue.
Comment 9 yocalebo@gmail.com 2021-04-28 14:04:22 UTC
(In reply to Kubilay Kocak from comment #7)

Hi Kubilay, thanks for adding Victor Stinner to this. Victor will more than likely be able to give a more definitive reason why those 2 commits did not get backported to Python 3.8, since Victor is the author of those commits :)

Anyways, here are the upstream Python bugs for those 2 commits:

https://github.com/python/cpython/commit/9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6 links to https://bugs.python.org/issue19466

https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8 links to https://bugs.python.org/issue39511

Without those 2 commits, if you take the Python script I posted earlier and run it repeatedly in a while-true loop in the shell, you will notice that it eventually hangs indefinitely. With the 2 commits, the while-true loop runs forever as expected.
Comment 10 yocalebo@gmail.com 2021-04-28 14:09:25 UTC
(In reply to Kubilay Kocak from comment #8)

I was quite surprised by the regression with py3.8 as well. But that was before I started looking at some of the changes from py3.7 to 3.8; some of them are very intrusive. For example, here is a quote from one of the release note entries.

"The ``PyGC_Head`` struct has changed completely.  All code that touched the struct member should be rewritten.  (See :issue:`33597`.)"

That struct is the main struct for Python's garbage-collected objects (AFAICT), and I understand that it's a very "low-level" change that shouldn't affect too much, but it seems strange to me to make what I perceive to be such an intrusive change in a minor release.

But I will admit, I do not follow the release process for python very closely and this very well could be par for the course. :)
Comment 11 yocalebo@gmail.com 2021-04-29 00:15:54 UTC
Update on the matter. I was able to analyze a python3.9.4 core dump, and I've narrowed it down to the fact that the thread on CPU when it crashed was in the concurrent.futures._base.Executor class.

Core was generated by `python3.9: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pymalloc_alloc (ctx=<optimized out>, nbytes=56) at Objects/obmalloc.c:1621
warning: Source file is more recent than executable.
1621            if (UNLIKELY((pool->freeblock = *(block **)bp) == NULL)) {
[Current thread is 1 (LWP 100893)]
(gdb) info locals
size = 3
pool = 0x83757b000
bp = 0x0
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.9/xml/etree/ElementPath.py", line 282, in select
    yield elem
  File "/usr/local/lib/python3.9/xml/etree/ElementPath.py", line 389, in find
    return next(iterfind(elem, path, namespaces), None)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/disk_/disk_info_freebsd.py", line 96, in label_to_disk
    part = geom.class_by_name('PART').xml.find(f'.//provider[name="{dev}"]/../name')
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1521, in call_sync
    )
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 416, in transform_topology
    disk = self.middleware.call_sync('disk.label_to_disk', *args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 435, in transform_topology
    x[i] = self.transform_topology(x[i], dict(options, geom_scan=False))
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 432, in transform_topology
    x[key] = self.transform_topology(x[key], dict(options, geom_scan=False))
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 435, in transform_topology
    x[i] = self.transform_topology(x[i], dict(options, geom_scan=False))
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 432, in transform_topology
    x[key] = self.transform_topology(x[key], dict(options, geom_scan=False))
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 469, in pool_extend
    'topology': self.transform_topology(zpool['groups']),
  File "/usr/local/lib/python3.9/site-packages/middlewared/utils/io_thread_pool_executor.py", line 25, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/utils/io_thread_pool_executor.py", line 308, in _target
  File "/usr/local/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 912, in _bootstrap
    self._bootstrap_inner()
Comment 12 Kai Knoblich freebsd_committer 2021-04-29 16:18:12 UTC
Created attachment 224534 [details]
python38-fix-threading-v1.patch

At the outset: Thanks for the report and the detailed descriptions.

The threading issue with Python 3.8 (tested with 3.8.9 from the Ports tree) can be reproduced via the script from comment #2.  The following vanilla releases of FreeBSD were tested:

- 11.4-RELEASE-p9 [amd64/i386]
- 12.2-RELEASE-p6 [amd64/i386]
- 13.0-RELEASE-p0 [amd64]
- 14.0-CURRENT main-n246037-ca6e1fa3ce8 [amd64]

If the attached patch is used, which contains the two backported commits noted in comment #0 (BPO-19466 + BPO-39511), the regression doesn't occur on any platform listed above.

A side note: If a VM has only one vCPU, the threading issue doesn't seem to be reproducible with an unpatched version of Python 3.8.9.
Comment 13 yocalebo@gmail.com 2021-04-29 18:03:25 UTC
Maybe some progress... not sure, but I'm cautiously optimistic.

I analyzed 7 core dumps again to see if I spotted anything interesting. 6 of them were py3.8 and 1 was py3.9.

100% of them share a common pattern: the thread's *tstate has lxml.etree._ParserDictionaryContext in its frames, either in the thread that was on CPU when it dumped core or in some other thread. Since this is seemingly memory corruption, lxml has become my suspect, since it uses its own C bindings for the libxml2 and libxslt libraries.

I've instrumented a somewhat complicated script that daemonizes, creates a concurrent.futures._base.Executor class, and calls methods that use the lxml library to parse geom XML information in a while-true loop.

Am I looking at a red herring? I don't know, but this grabbed my attention, so I'm running the script to see if it'll tickle the problem. Maybe the version of py-lxml we're using has a subtle issue with py3.8+. (Cue my overly dramatic rant about the py3.8 changes to the PyGC_Head struct.)
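As a structural illustration only, the shape of that reproduction attempt is roughly the following. The real script daemonizes and uses lxml's C bindings against live geom XML; this self-contained stand-in substitutes the stdlib parser (the same `provider[name=...]/../name` pattern that appears in the comment #11 traceback) and an inline document, so the names and data here are placeholders, not the actual middleware code.

```python
# Structural sketch (hypothetical) of the reproduction attempt: an
# Executor repeatedly parsing geom-style XML. The inline document and
# all names are placeholders; the real script used lxml and real geom data.
import concurrent.futures
import xml.etree.ElementTree as ET

GEOM_XML = """<mesh><class><name>PART</name>
  <geom><name>ada0</name><provider><name>ada0p1</name></provider></geom>
</class></mesh>"""

def label_to_disk(dev):
    # Same XPath shape as the traceback: find the provider whose <name>
    # matches dev, then return its parent geom's <name>.
    root = ET.fromstring(GEOM_XML)
    node = root.find(f'.//provider[name="{dev}"]/../name')
    return node.text if node is not None else None

def churn(iterations=100):
    # Submit the parse repeatedly through a thread-pool executor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(label_to_disk, "ada0p1")
                   for _ in range(iterations)]
        return [f.result() for f in futures]
```

Wrapping `churn()` in an outer while-true loop mirrors the described stress pattern, minus the daemonization and the lxml C bindings that are the actual suspects.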
Comment 14 Victor Stinner 2021-04-30 15:25:09 UTC
Using the script from comment #2, I cannot reproduce the hang with Python 3.9.4 on FreeBSD 13. I used:

for i in $(seq 1 100); do python3.9 script.py || break; done
Comment 15 yocalebo@gmail.com 2021-04-30 15:29:01 UTC
(In reply to Victor Stinner from comment #14)

Hi Victor, thanks for testing it out. Your results are expected, since the 2 commits that fix the problem were included in Python 3.9. Have you tried python3.8?
Comment 16 Victor Stinner 2021-04-30 16:18:59 UTC
I can reproduce the issue with Python 3.8 on Linux.

Well, first, don't rely on implicit resource management: don't rely on the __del__() destructor.

You should use "with pool:" and then call pool.join().
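A minimal rewrite of the comment #2 reproducer along these lines might look as follows (the function names are illustrative; the point is that close() and join() run deterministically on the main thread rather than in a destructor at interpreter shutdown):

```python
# The comment #2 reproducer rewritten with explicit resource management:
# the pool is closed and joined deterministically instead of relying on
# a __del__ destructor running during interpreter finalization.
from multiprocessing.pool import ThreadPool
from datetime import datetime

def square(n):
    return n * n

pool = ThreadPool(4)
try:
    results = pool.map(square, range(4))
finally:
    pool.close()  # accept no new work
    pool.join()   # wait for worker threads to exit

print(datetime.now(), results)
```

With this structure the worker threads are gone before the interpreter begins tearing itself down, which sidesteps the shutdown race entirely.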

--

I'm not sure if it's related, but daemon threads are evil and should be avoided if possible. They're another variant of "implicit resource management" (don't join the thread; expect it to disappear magically at exit). I've fixed many race conditions involving daemon threads:

* https://vstinner.github.io/daemon-threads-python-finalization-python32.html
* https://vstinner.github.io/threading-shutdown-race-condition.html
* https://vstinner.github.io/gil-bugfixes-daemon-threads-python39.html

Python 3.9 avoids daemon threads in concurrent.futures:
https://bugs.python.org/issue39812

But multiprocessing still uses daemon threads.

--

I also fixed multiple bugs in the Python finalization to make it more reliable. My notes on that topic:
https://pythondev.readthedocs.io/finalization.html
Comment 17 Victor Stinner 2021-04-30 16:23:11 UTC
Backporting "bpo-19466: Py_Finalize() clears daemon threads earlier (GH-18848)" sounds dangerous. This fix relies on many other fixes; without them, clearing daemon threads earlier can hang or crash.

I suggest not relying on implicit resource management. If you really have to, use Python 3.9, which seems to work for your issue.
Comment 18 yocalebo@gmail.com 2021-04-30 16:27:49 UTC
(In reply to Victor Stinner from comment #17)

Thanks for the tips and the links to your notes. However, the crashes are still happening on py3.9.4. I've analyzed 11 core dumps at this point: 10 from py3.8 and 1 from py3.9.4. All of their tstates have a reference to lxml.etree._ParserDictionaryContext. I actually just analyzed a multiprocessing core dump, and it is crashing during garbage collection (visit_decref), but the story is the same: _ParserDictionaryContext is in the thread at the time of the dump.
Comment 19 Kubilay Kocak freebsd_committer freebsd_triage 2021-05-02 04:15:11 UTC
(In reply to Victor Stinner from comment #17)

Thank you for the feedback, Victor. Does this also mean (in your opinion) that carrying local patches for these commits in FreeBSD Ports is not a wise resolution?