Bug 218597 - [CRYPTODEV] spurious wakeup and synchronisation problem
Summary: [CRYPTODEV] spurious wakeup and synchronisation problem
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: John Baldwin
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2017-04-12 16:07 UTC by Alexandre martins
Modified: 2019-03-11 22:18 UTC (History)
4 users (show)

See Also:


Attachments
proposed patch (1.72 KB, patch)
2017-04-12 16:07 UTC, Alexandre martins
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexandre martins 2017-04-12 16:07:03 UTC
Created attachment 181725 [details]
proposed patch

I using a dual core ARMADA 38X platform that contains a CESA module to accelerate cryptography operations.

I have some crashes when using the cryptodev through OpenSSL.

The crash only happen when the following conditions are met:
 - Multi-core system is used
 - Hardware acceleration is used
 - Hardware acceleration perform an interruption to signal the operations completion

That are the scenario, first:
 - A process ask a cryptographic operation
 - The crypto operation is send to the CESA
 - msleep is called, but the process seems to be migrated on another CPU, and msleep return immediately (spurious wakeup)
   (I puted a KASSERT that check if the CRYPTO_F_DONE is present and it rise)
 - The operation looks like complete and memory is freed
 - The CESA performs the cryptographic operation on the freed memory that may cause corruption in some other parts of the kernel.

Second:
 - The call to crypto_dispatch is done, but the scheduler prempt the process before the lock on ces->lock is done
 - The CESA perform the operation and crypto_done is called by the ithread
 - The flag CRYPTO_F_DONE is put and scheduler preempt the ithread
 - The user process continue, take the lock and check the flag, that is present, and doesn't call msleep.
 - It free the memory, and return to the userland
 - The ithread is resumed and calls the callbacks that crash by accessing the freed memory.

I attach my proposed patch to fix that issue
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2017-04-16 06:37:07 UTC
I do not know the opencrypto code.  From the brief look around the fragments that you patched, my first question is it possible for one session (struc csession) to have more than one requests (struct cryptop) active simultaneously ?  If yes, I do not see how the completion flag could be moved into session.

Also, from your description, it seems to be more of an issue in the interface between the driver and crypto infrastructure.  I.e. crp must not be freed until consumer acknowledge.
Comment 2 Alexandre martins 2017-04-18 07:35:54 UTC
It seems that the "struct csession" can only perform one request at a time, done by cryptodev_op function.

I agree that the synchronisation is not well formed. I made my patch to be as small as possible to fix the issue. A re-work may be needed
Comment 3 John Baldwin freebsd_committer freebsd_triage 2017-04-18 23:17:26 UTC
There's certainly nothing to prevent multiple threads from issuing ioctls against the same session.  (However, having 'error' in the cse is buggy, but it should also just be removed as all the uses of it check the crp->crp_etype it is copied from just before using it anyway.)

The problem is that CRYPTO_F_DONE isn't really safe for how the /dev/crypto ioctl handlers are using it (and probably isn't safe to use at all and should arguably be removed).  Instead, the sleeps need to be conditional on a flag that the callback function itself sets.  Probably the right way to handle this is to allocate a separate structure that crp_opaque points to that contains the "done" flag and a pointer to the cse.  The callback would set the done flag under the cse lock and then issue the wakeup, and the ioctl routines would sleep on this done flag in the new structure.  Something like:

struct crypt_op_data {
    bool done;
    struct csession *cse;
}

crypto_dev_op()
{
    struct crypt_op_data *cod;

    ...
    cod = malloc(sizeof(*cod), M_WAITOK | M_ZERO);
    cod->cse = cse;
    ...
    crp->crp_opaque = cod;  /* not 'cse' anymore */
    ...
    crypto_dispatch();
    mtx_lock(&cse->lock);
    while (error == 0 && !cod->done)
       mtx_sleep(crp, &cse->lock, ...);
    mtx_unlock(&cse->lock);
    ...
    /* be sure to clear done if retrying */
}

cryptodev_cb()
{
    struct cryptop *crp = op;
    struct crypto_op_data *cod = crp->crp_opaque;

    mtx_lock(&cod->cse->lock);
    cod->done = true;
    mtx_unlock(&cod->cse->lock);
    wakeup(crp);    
}
Comment 4 Alexandre martins 2017-04-19 07:23:54 UTC
Nice catch !

Just one question, instead of mallocing crypt_op_data, can we use the stack, as a closure ?
Comment 5 Alexandre martins 2017-04-19 08:05:50 UTC
(In reply to Alexandre martins from comment #4)

It seems that the cession is not thread safe, for example in cryptodev_op:

 ...
 712         cse->uio.uio_iov = &cse->iovec;
 713         cse->uio.uio_iovcnt = 1;
 714         cse->uio.uio_offset = 0;
 ...     

There is no lock around that.

Should I propose a bigger patch to fix that ?
Comment 6 Konstantin Belousov freebsd_committer freebsd_triage 2017-04-19 10:47:08 UTC
(In reply to Alexandre martins from comment #4)
If you want to use on-stack memory from another thread or from an interrupt context, be aware that the thread's kstacks are generally swappable.  Of course, the stack is swapped in when the thread is marked runnable, but otherwise you may get into a situation when other thread accessing unmapped kstack.

You may work around this with PHOLD(), but that would keep whole stack in memory when system desperately needs it.  In other words, using malloc(9) is less troublesome.
Comment 7 John Baldwin freebsd_committer freebsd_triage 2017-04-19 16:18:53 UTC
(In reply to Alexandre martins from comment #5)
Well, hmm.  If the uio is in the csession than perhaps we should just formalize "only one op at a time" by having the crypto ioctls detect that case and either fail or block.  Alternatively we could move the uio into the 'cod'.  Probably it is simpler to just enforce "one op at a time" and then your original patch should work I think.
Comment 8 John Baldwin freebsd_committer freebsd_triage 2017-08-17 00:25:00 UTC
FYI, I have a potential fix for this in the middle of a larger patch series I'm working on to enable  AES-GCM offloading for the OpenSSL /dev/crypto engine and some other things.  It uses malloc'd per-op buffer.  While concurrent operations on a session are probably rare, we don't do anything to prevent them currently and this makes them safe.  I also prefer to not muddy the per-session state with per-op state.  The relevant commits are:

https://github.com/freebsd/freebsd/commit/6d93ba89e16b137cca55b377c46430815cb68464

along with 3 followup fixes in this branch:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:openssl_engine

I don't have an ETA for when this might reach HEAD apart from "soon-ish".  I have some other open reviews for /dev/crypto changes that I haven't been able to get any review for already.
Comment 9 John Baldwin freebsd_committer freebsd_triage 2018-01-16 01:02:56 UTC
I've posted the commits I mentioned earlier in a review at https://reviews.freebsd.org/D13928
Comment 10 commit-hook freebsd_committer freebsd_triage 2018-01-26 23:22:10 UTC
A commit references this bug:

Author: jhb
Date: Fri Jan 26 23:21:50 UTC 2018
New revision: 328453
URL: https://svnweb.freebsd.org/changeset/base/328453

Log:
  Move per-operation data out of the csession structure.

  Create a struct cryptop_data which contains state needed for a single
  symmetric crypto operation and move that state out of the session. This
  closes a race with the CRYPTO_F_DONE flag that can result in use after
  free.

  While here, remove the 'cse->error' member.  It was just a copy of
  'crp->crp_etype' and cryptodev_op() and cryptodev_aead() checked both
  'crp->crp_etype' and 'cse->error'.  Similarly, do not check for an
  error from mtx_sleep() since it is not used with PCATCH or a timeout
  so cannot fail with an error.

  PR:		218597
  Reviewed by:	kib
  Sponsored by:	Chelsio Communications
  Differential Revision:	https://reviews.freebsd.org/D13928

Changes:
  head/sys/opencrypto/cryptodev.c
Comment 11 commit-hook freebsd_committer freebsd_triage 2019-03-11 22:18:21 UTC
A commit references this bug:

Author: jhb
Date: Mon Mar 11 22:17:53 UTC 2019
New revision: 345033
URL: https://svnweb.freebsd.org/changeset/base/345033

Log:
  MFC 328453: Move per-operation data out of the csession structure.

  Create a struct cryptop_data which contains state needed for a single
  symmetric crypto operation and move that state out of the session. This
  closes a race with the CRYPTO_F_DONE flag that can result in use after
  free.

  While here, remove the 'cse->error' member.  It was just a copy of
  'crp->crp_etype' and cryptodev_op() and cryptodev_aead() checked both
  'crp->crp_etype' and 'cse->error'.  Similarly, do not check for an
  error from mtx_sleep() since it is not used with PCATCH or a timeout
  so cannot fail with an error.

  PR:		218597
  Sponsored by:	Chelsio Communications

Changes:
_U  stable/11/
  stable/11/sys/opencrypto/cryptodev.c