gh-73069: fix race conditions in os.scandir and associated direntry #131642
base: main
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the …
- A blurb/news entry makes sense here.
- Mutexes are prone to re-entrancy deadlocks in cases where we call back into Python, e.g., in an object's finalizer. We should be using critical sections, probably alongside argument clinic.
Righto, will add.
Hmm, it seems re-entrancy is implicit in a lot of innocuous-looking calls: when I first looked through the code it didn't seem like it could be re-entrant, but looking more closely I guess even something like … could be.

Regardless, I don't think a critical section works for this bug: these methods make blocking IO calls, which would release that lock. Note that these crashes occur on a GIL build, which AIUI means a critical section cannot be an appropriate fix (since it is a no-op there). FWIW I did try it quickly and confirmed it still crashes with …

Would using a … ?
It seems we do need to worry about re-entrancy, so switch to using a recursive mutex. Note we can't use a critical section, since we need to perform blocking IO while holding the lock, which a critical section would implicitly release.
Maybe, but we should be very cautious with it.
We might have to resort to some atomic compare-exchange shenanigans if mutexes prove to be too difficult.
Ah, true! Unfortunately I don't think using atomics will help though: we ultimately must prevent concurrent calls to either or both of ….

Looking again at the code, I think it can be refactored so the section holding the lock does not call anything that could possibly call back into Python. Then we can use a standard non-recursive mutex. I'll work on that now.
… could be re-entrant.
…to avoid possibility of deadlock upon re-entrance.
After playing with various options I think the c1 and c2 fixes require different approaches.

The c1 fix is more or less as described above: refactor to hold the lock only while retrieving the next entry, ensuring that no re-entrancy is possible. The only wrinkle is the direntry result from …

Note the Windows and POSIX code is looking very similar after all that. I think with a bit of additional refactoring it should be possible for them to share most of the same code and eliminate some duplication between #ifdef blocks. If this sounds useful I will carry on with that.

A similar approach doesn't really work for the c2 fix, however. In this case I think atomic compare/exchange is the better option: in a race there may be duplicated …

Should I rebase these commits to get rid of the introduced-and-then-removed failed attempts?
Nope, we squash everything at the end anyway. Force-pushing and getting rid of commits just makes reviewing more difficult :)
The new commit is just a little bit of additional refactoring to unify some of the Windows/POSIX code and minimise #ifdef code blocks. I'm not sure if this sort of semi-related cleanup-in-passing is considered helpful or unnecessary churn/risk here: if it is unhelpful (or should be done in a separate PR) just let me know and I'll drop/move it.
…(the final filename field being an unsized array) and in actual practice on the WASI platform. Thus just copying one into a stack-allocated structure is not good enough. The obvious way to address this is to allocate memory on the heap and return that. However, it seems wasteful to dynamically allocate a temporary chunk of data just to copy some data in and then out again. Instead, if the name field is potentially not long enough, allocate some extra space *on the stack* following the structure such that it must be sufficient to hold the full record length (and copy all of that, not just the potentially smaller fixed structure size).
This fixes the first two (c1 & c2) crashes from gh-73069. Let me know if I should add a news entry.