Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when thread using dynamically loaded Rust library exits #91979

Open
devongovett opened this issue Dec 15, 2021 · 14 comments
Open
Labels
A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@devongovett
Copy link

Scenario: I have a Rust cdylib, which is loaded by a C program via dlopen. The C program creates a thread, and loads the Rust module inside it. It proceeds to call one of the Rust functions, and closes the library via dlclose. Then the thread exits. The Rust program has a thread local variable with a struct that implements Drop, which it modifies in the function called from C.

Full reproduction here: https://github.com/devongovett/rust-threadlocal-bug

On CentOS 7, which uses glibc 2.17, it segfaults at __nptl_deallocate_tsd() inside pthread_create.c. With later versions of glibc, there is no crash. I believe the crash occurs because Rust creates a thread local key with pthread_key_create but never calls pthread_key_delete (the call in the destructor is commented out):

impl Drop for Key {
fn drop(&mut self) {
// Right now Windows doesn't support TLS key destruction, but this also
// isn't used anywhere other than tests, so just leak the TLS key.
// unsafe { imp::destroy(self.key) }
}
}

When the thread exits, glibc tries to call the destructor for the key, but because the dynamic library has already been unloaded via dlclose at this point, the function no longer exists and we get a crash.

My theory is that this only occurs with glibc 2.17 and not later versions is due to __cxa_thread_atexit_impl not existing in these older versions. This function is used when available to register destructors, otherwise a fallback implementation is used:

if !__cxa_thread_atexit_impl.is_null() {
type F = unsafe extern "C" fn(
dtor: unsafe extern "C" fn(*mut u8),
arg: *mut u8,
dso_handle: *mut u8,
) -> libc::c_int;
mem::transmute::<*const libc::c_void, F>(__cxa_thread_atexit_impl)(
dtor,
t,
&__dso_handle as *const _ as *mut _,
);
return;
}
However, I'm not sure about that. It could be some other change in glibc.

I have not tested, but I think the bug could potentially be fixed if the commented out destructor linked above were actually called. The comment indicates something about windows not supporting this, so maybe it could be called conditionally?

glibc 2.17 is indeed pretty old, however, it is the version used by the current CentOS 7 version which is not EOL until 2024, so I do think this bug should be fixed.

Meta

rustc --version --verbose:

rustc 1.57.0 (f1edd0429 2021-11-29)
binary: rustc
commit-hash: f1edd0429582dd29cccacaf50fd134b05593bd9c
commit-date: 2021-11-29
host: x86_64-unknown-linux-gnu
release: 1.57.0
LLVM version: 13.0.0
@devongovett devongovett added the C-bug Category: This is a bug. label Dec 15, 2021
@devongovett
Copy link
Author

Upon further research, I'm pretty sure the reason it "works" with newer glibc versions is that the library is actually never fully unloaded if there are TLS destructors registered via __cxa_thread_atexit_impl. In the source code, l_tls_dtor_count is incremented when a destructor is registered: https://github.com/bminor/glibc/blob/91cc803d27bda34919717b496b53cf279e44a922/stdlib/cxa_thread_atexit_impl.c#L137

And in dlclose, this is checked: https://github.com/bminor/glibc/blob/90b37cac8b5a3e1548c29d91e3e0bff1014d2e5c/elf/dl-close.c#L186

This can be verified by running my reproduction program above with the LD_DEBUG=files environment variable. In glibc 2.17, it calls the .fini section immediately on dlclose (just before segfaulting), but in newer versions this is deferred until process exit (essentially a leak?!).

@Brooooooklyn
Copy link

Brooooooklyn/canvas#377 maybe relate to it

@hkratz
Copy link
Contributor

hkratz commented Dec 16, 2021

@rustbot label T-libs A-thread-locals

@rustbot rustbot added A-thread-locals Area: Thread local storage (TLS) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 16, 2021
@follower
Copy link
Contributor

follower commented Dec 16, 2021

It may be useful/informative to read some of the prior discussion of dynamic libraries & thread local storage in this & linked issues: #28794

FWIW based on my experience the only "reliable" approach has been to simply never allow dylibs to be unloaded: follower/foreigner@3845586

Edit: Especially nagisa/rust_libloading#41 & also https://sourceware.org/glibc/wiki/Destructor%20support%20for%20thread_local%20variables

(I first encountered this issue when using wasmtime from C & C++.)

@Aaron1011
Copy link
Member

See also this upstream glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=21032

@devongovett
Copy link
Author

devongovett commented Dec 16, 2021

A bit more context about how this affects real world code. This currently happens to all native Node.js modules that include thread locals, when loaded within Node's worker threads. Node controls when dlopen and dlclose are called, so there isn't really a great way for module authors to simply prevent unloading the dylib. One way is for users to also load the module from the main thread in addition to workers, which in effect delays unloading until process exit, but this is not very obvious or ergonomic. And this is far from the only case where this could occur.

One potential solution is to register a destructor function in the .fini_array section of the ELF binary, which will be called when the library is unloaded by dlclose. In C this is possible via the __attribute__((destructor)) syntax, which was used to solve this related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1065695. Perhaps Rust could do something similar to ensure the thread local key is dropped when the dylib is unloaded?

@joshtriplett joshtriplett added the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label Jan 26, 2022
@joshtriplett joshtriplett removed the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label Mar 1, 2022
@thomcc
Copy link
Member

thomcc commented Mar 1, 2022

Hmm, is this possibly caused by #88737 ? (edit: no, but it is related). That bug shouldn't be that hard to fix, if so. (edit: ☹️)

@mcollina
Copy link

Apprently, this was made significantly worse in v1.83.0, which has starting to segfault for me when using native addons written in Rust in Node.js.

@devongovett
Copy link
Author

FYI, @Brooooooklyn found a workaround. Compiling with the following linker args prevents the dylib from being unloaded.

[target.'cfg(target_env = "gnu")']
rustflags = ["-C", "link-args=-Wl,-z,nodelete"]

Would be nice to get a real fix though. This has indeed gotten much worse with recent Rust versions.

thecrypticace added a commit to tailwindlabs/tailwindcss that referenced this issue Mar 18, 2025
…read (#17276)

When Tailwind is loaded in a Node Worker thread, it currently causes a
segmentation fault on Linux when the thread exits. This is due to a
longstanding issue in Rust that affects all native modules:
rust-lang/rust#91979. I reported this years
ago but unfortunately it is still not fixed, and seems to have gotten
worse in Rust 1.83.0 and later. Looks like Tailwind recently updated
Rust versions and this issue started appearing when run in tools like
Parcel that use worker threads.

The workaround is to prevent the native module from ever being unloaded.
One way to do that is to always load the native module in the main
thread in addition to workers, but this is hard to enforce.
@Brooooooklyn found another method, which is to use a linker option for
this. I tested this on an Ubuntu system and verified it fixed the issue.
You can test with the following script.

```js
// test.js
const {Worker} = require('worker_threads');
new Worker('./worker.js');

// worker.js
require('@tailwindcss/oxide');
```

Without this change, a segmentation fault will occur.

---------

Co-authored-by: Jordan Pittman <jordan@cryptica.me>
philipp-spiess pushed a commit to tailwindlabs/tailwindcss that referenced this issue Mar 19, 2025
…read (#17276)

When Tailwind is loaded in a Node Worker thread, it currently causes a
segmentation fault on Linux when the thread exits. This is due to a
longstanding issue in Rust that affects all native modules:
rust-lang/rust#91979. I reported this years
ago but unfortunately it is still not fixed, and seems to have gotten
worse in Rust 1.83.0 and later. Looks like Tailwind recently updated
Rust versions and this issue started appearing when run in tools like
Parcel that use worker threads.

The workaround is to prevent the native module from ever being unloaded.
One way to do that is to always load the native module in the main
thread in addition to workers, but this is hard to enforce.
@Brooooooklyn found another method, which is to use a linker option for
this. I tested this on an Ubuntu system and verified it fixed the issue.
You can test with the following script.

```js
// test.js
const {Worker} = require('worker_threads');
new Worker('./worker.js');

// worker.js
require('@tailwindcss/oxide');
```

Without this change, a segmentation fault will occur.

---------

Co-authored-by: Jordan Pittman <jordan@cryptica.me>
@workingjubilee
Copy link
Member

workingjubilee commented Mar 26, 2025

Now that CentOS 7 is EOL, it is very hard to imagine that this would be best fixed by a method other than raising the requirements to a version of glibc that we actually can work with. Every other hack I have seen for running destructors on dlclose has tended to result in other, novel kinds of misery that show up a few months later. That said, that would only work if glibc 2.17 is the only case that reaches this, and besides that, I'm not sure that the .fini_array approach wouldn't work.

@X547
Copy link

X547 commented Mar 26, 2025

raising the requirements to a version of glibc that we actually can work with

As I understand, issue is still present with latest glibc, but it is a leak instead of crash.

@X547
Copy link

X547 commented Mar 26, 2025

This bug is observed when closing Vulkan 3D applications with Mesa NVK driver that use Rust code.

It is important to properly support dlopen/dlclose for OpenGL/Vulkan drivers because it is needed for compositor GPU hotplug, driver updates and other long living system 3D applications use cases.

@workingjubilee
Copy link
Member

workingjubilee commented Mar 26, 2025

As I understand, issue is still present with latest glibc, but it is a leak instead of crash.

And so is Box::leak or mem::forget, which is a safe function, whereas it is slightly harder to induce a segfault with safe code.

It is not clear that we can in fact support this "properly" if we define it as "without leaking".

@X547
Copy link

X547 commented Mar 26, 2025

And so is Box::leak or mem::forget, which is a safe function, whereas it is slightly harder to induce a segfault with safe code.

The problem is not safety, but logic breakage. Missing pthread_key_delete will cause leaking shared library reference and make impossible to live reload of shared library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests