
gh-73069: fix race conditions in os.scandir and associated direntry #131642

Open · wants to merge 8 commits into main
Conversation

@duaneg commented Mar 24, 2025

This fixes the first two (c1 & c2) crashes from gh-73069. Let me know if I should add a news entry.

@bedevere-app bot commented Mar 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@ZeroIntensity (Member) left a comment


  • A blurb/news entry makes sense here.
  • Mutexes are prone to re-entrancy deadlocks in cases where we call back into Python, e.g., in an object's finalizer. We should be using critical sections, probably alongside Argument Clinic (see the sketch below).
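For illustration, a minimal sketch of the suggested shape, assuming the public critical-section C API added in CPython 3.13 (the locked helper name is hypothetical, and the iterator type is per posixmodule.c); Argument Clinic can generate the equivalent wrapper via its @critical_section directive:

```c
/* Sketch only. On free-threaded builds the critical section is
 * implicitly released whenever the thread blocks or calls back into
 * Python (e.g. a finalizer), avoiding the re-entrancy deadlocks a
 * plain mutex can hit; on GIL builds it is a no-op. */
static PyObject *
ScandirIterator_iternext(ScandirIterator *iterator)
{
    PyObject *result;
    Py_BEGIN_CRITICAL_SECTION(iterator);
    result = ScandirIterator_iternext_locked(iterator);  /* hypothetical */
    Py_END_CRITICAL_SECTION();
    return result;
}
```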

@duaneg (Author) commented Mar 24, 2025

  • A blurb/news entry makes sense here.

Righto, will add.

  • Mutexes are prone to re-entrancy deadlocks in cases where we call back into Python, e.g., in an object's finalizer. We should be using critical sections, probably alongside argument clinic.

Hmm, it seems re-entrancy is implicit in a lot of innocuous-looking calls: when I first looked through the code it didn't seem like it could be re-entrant, but looking more closely I guess even something like PyType_GetModule could theoretically call into Python code, right?

Regardless, I don't think a critical section works for this bug: these methods make blocking IO calls, which would release that lock. Note that these crashes occur on a GIL build, where AIUI a critical section cannot be an appropriate fix (since it is a no-op there). FWIW I tried it quickly and confirmed it still crashes with closedir and iternext protected by a critical section (on both GIL and free-threading builds).

Would using a _PyRecursiveMutex instead of a PyMutex be an acceptable alternative? I'll push out a new version using that shortly.
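For reference, a minimal sketch of what that would look like, assuming the private _PyRecursiveMutex API from pycore_lock.h (the iterator layout is illustrative):

```c
#include <dirent.h>
#include "pycore_lock.h"        /* _PyRecursiveMutex (private runtime API) */

typedef struct {
    PyObject_HEAD
    DIR *dirp;                  /* illustrative layout */
    _PyRecursiveMutex mutex;    /* zero-initialized, like PyMutex */
} ScandirIterator;

static void
ScandirIterator_close(ScandirIterator *it)
{
    _PyRecursiveMutex_Lock(&it->mutex);
    /* If a finalizer re-enters this function on the same thread, the
     * lock level is bumped instead of deadlocking. */
    if (it->dirp != NULL) {
        closedir(it->dirp);
        it->dirp = NULL;
    }
    _PyRecursiveMutex_Unlock(&it->mutex);
}
```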

duaneg added 4 commits March 25, 2025 12:31
It seems we do need to worry about re-entrancy, so switch to using a recursive
mutex. Note we can't use a critical section since we need to perform blocking
IO while holding the lock, which a critical section would implicitly release.
@ZeroIntensity (Member)

  • Would using a _PyRecursiveMutex instead of a PyMutex be an acceptable alternative? I'll push out a new version using that shortly.

Maybe, but we should be very cautious with it. _PyRecursiveMutex is safe in the re-entrant case, but it's still prone to lock-ordering deadlocks. Imagine a case where:

  • Thread A holds the recursive lock, and calls into Python via a finalizer.
  • Thread A begins waiting on a semaphore. The recursive lock is not released!
  • Thread B is waiting on the recursive lock, but it's also the same thread that unblocks the semaphore.
  • Deadlock :(

We might have to resort to some atomic compare-exchange shenanigans if mutexes prove to be too difficult.
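A stand-alone illustration of that ordering, sketched with a pthreads recursive mutex and a POSIX semaphore standing in for the CPython primitives (the program intentionally hangs):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t lock;    /* recursive, like _PyRecursiveMutex */
static sem_t sem;

static void *thread_b(void *arg)
{
    pthread_mutex_lock(&lock);  /* blocks: thread A holds the lock... */
    sem_post(&sem);             /* ...so this post never happens */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);
    sem_init(&sem, 0, 0);

    pthread_t b;
    pthread_mutex_lock(&lock);   /* thread A takes the recursive lock */
    pthread_create(&b, NULL, thread_b, NULL);
    sem_wait(&sem);              /* deadlock: B is stuck on the lock */
    puts("never reached");
    return 0;
}
```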

@duaneg (Author) commented Mar 25, 2025

  • Would using a _PyRecursiveMutex instead of a PyMutex be an acceptable alternative? I'll push out a new version using that shortly.

  • Maybe, but we should be very cautious with it. _PyRecursiveMutex is safe in the re-entrant case, but it's still prone to lock-ordering deadlocks.

Ah, true! Unfortunately I don't think using atomics will help here, though: we ultimately must prevent concurrent calls to readdir and closedir, and the semantics of the API require blocking behaviour, even if it is implemented by spinning on atomics. I think for this to be safe the code holding the lock must not be re-entrant.

Looking again at the code, I think it can be refactored so the section holding the lock does not call anything that could possibly call back into Python. Then we can use a standard non-recursive mutex. I'll work on that now.

@duaneg (Author) commented Mar 25, 2025

After playing with various options I think the c1 and c2 fixes require different approaches.

The c1 fix is more-or-less as described above: refactor to hold the lock only while retrieving the next entry, ensuring that no re-entrancy is possible. The only wrinkle is that the dirent result from readdir needs to be copied into a buffer while holding the lock, since closedir on the directory being read frees it.
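A sketch of that shape on the POSIX side (names illustrative; the naive sizeof copy here is refined for unsized d_name fields in the commit message further down):

```c
/* The mutex is held only across readdir() and a copy of its result;
 * nothing in that window can call back into Python, so a plain
 * PyMutex suffices. Returns 1 with *out filled in, 0 on exhaustion
 * or a closed iterator, -1 with errno set on error. */
static int
scandir_next_entry(ScandirIterator *it, struct dirent *out)
{
    int result = 0;
    PyMutex_Lock(&it->mutex);
    if (it->dirp != NULL) {
        struct dirent *direntp;
        Py_BEGIN_ALLOW_THREADS
        errno = 0;
        direntp = readdir(it->dirp);
        Py_END_ALLOW_THREADS
        if (direntp != NULL) {
            /* Copy before unlocking: a concurrent closedir() frees
             * the memory readdir() returned. */
            memcpy(out, direntp, sizeof(*out));
            result = 1;
        }
        else if (errno != 0) {
            result = -1;
        }
    }
    PyMutex_Unlock(&it->mutex);
    return result;
}
```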

Note the Windows and POSIX code is looking very similar after all that. I think with a bit of additional refactoring it should be possible for them to share most of the same code and eliminate some duplication between #ifdef blocks. If this sounds useful I will carry on with that.

A similar approach doesn't really work for the c2 fix, however. In this case I think atomic compare/exchange is the better option: in a race there may be duplicated stat calls, but that isn't really avoidable, and the losers will be correctly cleaned up instead of leaking.
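A sketch of that pattern for the cached stat result, using the private compare-exchange helpers from pyatomic.h (DirEntry_fetch_lstat is a hypothetical helper that performs the actual system call and builds the stat_result):

```c
static PyObject *
DirEntry_get_lstat(DirEntry *self)
{
    PyObject *lstat = _Py_atomic_load_ptr(&self->lstat);
    if (lstat == NULL) {
        lstat = DirEntry_fetch_lstat(self);  /* may block; no lock held */
        if (lstat == NULL) {
            return NULL;                     /* exception set */
        }
        PyObject *expected = NULL;
        if (!_Py_atomic_compare_exchange_ptr(&self->lstat, &expected, lstat)) {
            /* Another thread won the race: drop our copy, use theirs. */
            Py_DECREF(lstat);
            lstat = expected;
        }
    }
    return Py_NewRef(lstat);
}
```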

Should I rebase these commits to get rid of the introduced-and-then-removed failed attempts?

@ZeroIntensity (Member)

  • Should I rebase these commits to get rid of the introduced-and-then-removed failed attempts?

Nope, we squash everything at the end anyway. Force-pushing and getting rid of commits just makes reviewing more difficult :)

@duaneg (Author) commented Mar 26, 2025

The new commit is just a little bit of additional refactoring to unify some of the Windows/POSIX code and minimise #ifdef code blocks. I'm not sure if this sort of semi-related cleanup-in-passing is considered helpful or unnecessary churn/risk here: if it is unhelpful (or should be done in a separate PR) just let me know and I'll drop/move it.

…(the final filename field being an unsized array) and in actual practice on the WASI platform. Thus just copying one into a stack-allocated structure is not good enough.

The obvious way to address this is to allocate memory on the heap and return that. However, it seems wasteful to dynamically allocate a temporary chunk of data just to copy some data in and then out again.

Instead, if the name field is potentially not long enough, allocate some extra space *on the stack* following the structure such that it must be sufficient to hold the full record length (and copy all of that, not just the potentially smaller fixed structure size).
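A sketch of the trick in plain POSIX C (NAME_MAX availability is an assumption; real code would need a platform-appropriate bound):

```c
#include <dirent.h>
#include <limits.h>   /* NAME_MAX */
#include <stddef.h>   /* offsetof */
#include <string.h>

/* Room for the fixed part of struct dirent plus a maximum-length
 * name, even when d_name is declared as an unsized/flexible array. */
typedef union {
    struct dirent entry;
    char storage[offsetof(struct dirent, d_name) + NAME_MAX + 1];
} dirent_buffer;

static int
copy_next_entry(DIR *dirp, dirent_buffer *buf)
{
    struct dirent *direntp = readdir(dirp);
    if (direntp == NULL) {
        return 0;
    }
    /* Copy the full record, not just sizeof(struct dirent). */
    size_t len = offsetof(struct dirent, d_name)
                 + strlen(direntp->d_name) + 1;
    memcpy(&buf->entry, direntp, len);
    return 1;
}
```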