Fix alignment issue of TensorData bytes #2416

WorldSEnder · 2024-10-24T06:46:36Z

Checklist

Confirmed that run-checks all script has been executed.
I have some issues with flaky tests, in particular quantize_dynamic_int8 and downsample_interpolation are off-by-1 in a few cases. I've ignored them locally, they seem to be fine in CI.
Further, I've stumbled over a test failure seemingly related to my AMD GPU driver

ACO ERROR:
In file ../mesa-24.1.4/src/amd/compiler/aco_assembler.cpp:1056
Unsupported opcode: v4: %370:v[8-11] = v_wmma_f32_16x16x16_f16 (latekill)%369:v[36-43].xx, (latekill)%359:v[48-55].xx, %368:v[8-11].xx

Made sure the book is up-to-date with this PR. Not applicable, changes to internals only.

Related Issues/PRs

Fixes #2375.

Changes

Add a new opaque structure Bytes which holds on to tensor data. As explained in the linked issue, using a Vec<u8> is incorrect, as this will deallocate the memory with an incorrect alignment/memory layout given to the allocator, which violates the allocator contract. Instead, the structure remembers the alignment of the data and uses that when deallocating.

For serialization, only the bytes are written and no alignment is taken into account. Instead, when deserializing, the data is allocated with over-alignment (currently align_of::<u128>) which should be sufficient to interpret a slice of the data as a slice of some larger type, such as f32 etc.

Note that this means that serialization is not strictly a round-trip, which can show in the difference between how slices and Vec works:

to re-interpret a [u8] as a [E], the slice needs to be aligned at least as strictly as the alignment of E.
to re-interpet a Vec<A> as a Vec<B>, the alignments must be exactly equal. This means that a deserialized instance of Bytes possibly can not be converted via Bytes::try_into_vec. In case this is needed in the future, one would either need to get access to the needed alignment during deserialization or copy the data to a correctly aligned buffer. Such an API is possible but not included in this PR.

Testing

Some limited tests of TensorData has been run through miri manually to confirm that it doesn't complain after the patch. In case you want tests in the repo, I would need more guidance how to set that up, since miri should probably only run a limited number of tests for CI performance reasons.

introduce max alignment (which depends on platform anyway) and dont serialize that part fixes Clone, Debug, and Eq impls to work on the bytes, not the pointers.

codecov · 2024-10-24T07:19:50Z

Codecov Report

Attention: Patch coverage is 84.57944% with 33 lines in your changes missing coverage. Please review.

Project coverage is 82.91%. Comparing base (80c831b) to head (06c7236).
Report is 44 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/burn-tensor/src/tensor/bytes.rs	84.07%	32 Missing ⚠️
crates/burn-core/src/record/serde/data.rs	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2416      +/-   ##
==========================================
- Coverage   85.26%   82.91%   -2.35%     
==========================================
  Files         792      811      +19     
  Lines      104516   105103     +587     
==========================================
- Hits        89115    87149    -1966     
- Misses      15401    17954    +2553

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

WorldSEnder · 2024-10-24T07:48:51Z

Re Codecov report:

The missed lines are for the most part the Debug impl of Bytes. I think it makes sense to invest a bit more time into nice formatting there, any comments on what the output should be would be appreciated.

laggui · 2024-11-04T14:01:08Z

Sorry for the delayed response! We discussed this offline but forgot to get back to you.

First, thanks for identifying the potential issue.

While these changes to handle alignment concerns more generically could certainly provide better safety, it also adds complexity and might not be immediately necessary. That's why we're a bit hesitant on making this change for now. As you noticed this is not an issue for most allocators.

Of course that doesn't mean it is not an issue, just that it does not affect the majority of use cases. So we should still keep this in mind, and we might circle back in the future thanks to your investigation.

WorldSEnder · 2024-11-04T15:41:07Z

While I understand the added complexity makes it difficult to review, I am a bit disappointed. There is no way around the fact that the currently implementation is misusing unsafe. miri rightfully complains about it. Yes, on the major platforms I have on my hand (linux and windows) the allocator plays along, yet burn-tensor is no_std. I do not have a ras-pi on my hands to test this, but the burn book mentions the embedded-alloc crate and that already seems much more fickle about the alloc requirements. After all, this is rust but I understand that you don't have the time to properly review this, with the ongoing work on cubecl and all. Don't read this as me trying to rush this in, I would also not be happy to see this merged without a proper review.

I only noticed this when I implemented the deserialization logic, but please note that the current impl will also happily allocate a Vec<u8> aligned to "just" a u8 even for element types of f32 or whatever you element type one has that will surprisingly and inconveniently fail to cast to a slice of its own element type for example here.

In any case, I suppose there is no problem with just leaving this open for the time being.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

laggui · 2024-11-04T17:07:39Z

Don't read this as me trying to rush this in, I would also not be happy to see this merged without a proper review.

No worries, we're on the same page here 🙂

I think our main concerns were more in terms of trade-offs, mostly time and possible ramifications to this breaking change (though I think my position has changed on that.. as you'll see below).

burn-tensor is no_std. I do not have a ras-pi on my hands to test this, but the burn book mentions the embedded-alloc crate and that already seems much more fickle about the alloc requirements

I'm glad you brought this up! Honestly I think I kind of forgot to take that into consideration.. this should be especially important for Burn as we aim for portability. I think our previous position was that the current implementation was not an issue for most modern processors and platforms, but that's not good enough.

I should really get a ras-pi myself and start messing around with Burn on no_std so we can improve this.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

I'm not too wary about this tbh.

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was opened to the change but did not think it was required for the short term. But now I think we should make sure that this is addressed (ideally, without performance impacts to de/serialization - so we should make sure to benchmark that for some practical use cases).

WorldSEnder · 2024-11-04T22:54:21Z

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was opened to the change but did not think it was required for the short term.

No worries, I too think we are on the same page :) While strictly speaking violating the allocator contract like atm could lead to memory unsafety, I don't think any allocator actually does use the violated guarantees in a meaningful way (but I'm open to surprises). On the other hand, misaligned allocations would be a real concern but at least as far as I can tell mostly lead to panics, not undefined behaviour or violations of memory safety, so a relaxed approach to fixing this is fine 👍 Sorry if I came off as a bit panicky

laggui · 2024-11-08T13:40:19Z

@WorldSEnder I was looking at your PR yesterday and I think we'd like to stay away from a new opaque type for now.

I also benchmarked some solutions to preserve alignment and none of them come close to the current pointer re-interpret sadly.

I think we can keep both approaches instead based on some target architectures for now, since we know that heap allocations are overaligned on mainstream systems. And for others with stricter alignment rules, we fallback to something similar to what you implemented in the Bytes opaque type. We also add some documentation for anyone that might come across the code in the future 🙂

    /// Initializes a new tensor data structure from the provided values.
    fn init<E: Element, S: Into<Vec<usize>>>(
        mut value: Vec<E>,
        shape: S,
        dtype: DType,
    ) -> Self {
        // Ensure shape is valid
        let shape = shape.into();
        let shape_numel = Self::numel(&shape);
        value.truncate(shape_numel);
        let numel = value.len();
        assert_eq!(
            shape_numel, numel,
            "Shape {:?} is invalid for input of size {:?}",
            shape, numel,
        );

        let bytes = Self::to_bytes(value);

        Self {
            bytes,
            shape,
            dtype,
        }
    }

    fn to_bytes<E: Element>(value: Vec<E>) -> Vec<u8> {
        // Fast path for architectures with heap allocation overalignment.
        #[cfg(all(
            not(miri),
            any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")
        ))]
        {
            let mut value = value;
            let factor = std::mem::size_of::<E>() / std::mem::size_of::<u8>();
            let len = value.len() * factor;
            let capacity = value.capacity() * factor;
            let ptr = value.as_mut_ptr();

            std::mem::forget(value);

            // SAFETY: Practically, mainstream systems align heap allocations beyond
            // the minimum required alignment for the element type (e.g., 8 bytes on 64-bit
            // systems).
            //
            // This is not guaranteed by Rust's memory model, which follows strict alignment
            // requirements as specified by the type's layout.
            //
            // But we leverage the overalignment and efficient unaligned access to interpret
            // the same memory layout as individual bytes instead of `E` elements.
            // This simple trick makes this pointer cast acceptable (and requires no copy).
            unsafe { Vec::from_raw_parts(ptr as *mut u8, len, capacity) }
        }

        // Fallback for architectures with stricter alignment rules.
        #[cfg(any(
            miri,
            not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))
        ))]
        {
            let num_bytes = value.len() * std::mem::size_of::<E>();
            let mut bytes = vec![0u8; num_bytes];
            unsafe {
                std::ptr::copy_nonoverlapping(
                    value.as_ptr() as *const u8,
                    bytes.as_mut_ptr(),
                    num_bytes,
                );
            }
            bytes
        }
    }

I also added miri to the conditions to make sure it doesn't complain 😅 let me know how you feel about this.

WorldSEnder · 2024-11-08T23:32:58Z

@laggui I think in the long run, an opaque wrapper should happen at some point (note the wrapper in this PR does no mem-copy so should not have the overhead you mention). A publicly exposed field of Vec<u8> lets the user mem::swap in any Vec<u8> that they may have constructed themselves. I'd want to double check if that entails that the storage has to be aligned if the system allocator always aligns it, my current assumption would be that it entails more weirdness that an opaque wrapper struct would cover.

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used, and the check is of course only detecting the target_*, not whether System is being used as the #[global_allocator] and as far as I know there is no way to detect which allocator ends up being used in the end as the forwarding Global does not expose this.
Note for future reference: It seems the choice of default system allocator is made in the std in here if we want to go the route of just using system information as a proxy.
Another comment would be to just let miri complain. This would still be technically violate the allocator contract and since there is no way to check which allocator is used, I would still see it as unsound. So let miri panic and don't paint over the wound.

What exactly is the concern with an opaque struct? It seems there would be a simple almost-zero-cost conversion from Vec<u8> to TensorData, a Deref<Target=[u8]> impl and reasonably efficient conversion back to Vec<u8> (though that might entail a mem-copy or the downside of possibly-undefined-behaviour-maybe-just-panicking.

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

laggui · 2024-11-11T16:07:37Z

A publicly exposed field of Vec lets the user mem::swap in any Vec that they may have constructed themselves. I'd want to double check if that entails that the storage has to be aligned if the system allocator always aligns it, my current assumption would be that it entails more weirdness that an opaque wrapper struct would cover.

Yeah I think we could (should) remove the public field and users could only get the slice with the current available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used, and the check is of course only detecting the target_*, not whether System is being used as the #[global_allocator] and as far as I know there is no way to detect which allocator ends up being used in the end as the forwarding Global does not expose this.

That's what I found as well. Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%. I wasn't sure if a new type was warranted at this time but I now realize that it might be required to do this cleanly.

Another comment would be to just let miri complain. [...] So let miri panic and don't paint over the wound.

Oh I agree! This was mostly for the current tests with miri 😅

What exactly is the concern with an opaque struct?

An opaque struct like Bytes is a nice to encapsulate and control the alignment details, but it introduces a breaking change with potential overhead (and complexity, though in its current form it is fairly simple and contained). I think my main concern is/was with potential redundancy since the alignment provided by Vec<u8> is sufficient on common platforms with Rust's default allocator. I thought we might be able keep away from the new opaque struct to avoid the breaking change and deal with the alignment requirements when necessary, though I'm not sure if this is feasible for 100% coverage (and I am not entirely informed on the different allocators). I'm bouncing back and forth between the two positions a bit (as with any trade-off, thanks for the constructive feedback / discussion in this thread btw!)

Also tiny note on the current Bytes implementation: what about sending between threads? Might need to be more careful (as opposed to Vec<u8>).

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

Ran some benchmarks for TensorData creation from existing floating point data (32 x 512 x 1024) and deserialization (12 linear layers with shape [2048, 2048]) on both the current state and with your changes.

The results seem to be on par with the current, which is great 🙏 I think I recalled incorrectly that there was a copy in from_elems without looking back at the code, so I was expecting a slowdown. But that's not the case (it's in another path).

Current

Benchmark	Feature	Backend	Device	Median
from_data	ndarray	`ndarray`	Cpu	95.000µs
load_record_lazy	ndarray	`ndarray`	Cpu	105.000µs
load_record_manual	ndarray	`ndarray`	Cpu	0.000ns
load_record_sync	ndarray	`ndarray`	Cpu	253.399ms

This PR

Benchmark	Feature	Backend	Device	Median
from_data	ndarray	`ndarray`	Cpu	96.000µs
load_record_lazy	ndarray	`ndarray`	Cpu	112.000µs
load_record_manual	ndarray	`ndarray`	Cpu	0.000ns
load_record_sync	ndarray	`ndarray`	Cpu	269.023ms

WorldSEnder · 2024-11-11T17:32:14Z

Yeah I think we could (should) remove the public field and users could only get the slice with the current available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

I was wondering if it was intentional that you could push further bytes after construction, cause that seems a bit weird since you could push e.g. a single u8 even when the TensorData contains u16 etc. Good to hear that this was never intended, thought I missed something :D

Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%.

After more investigation, and a bit longer look than I had wished, I found that the default allocator on windows actually does use the passed align in a non-trivial way. Specifically, data below MIN_ALIGN is just allocated as-is, but anything beyond that will allocate additional space and contain a header with a pointer to the actual allocation, pictogram:

x ... unused bytes
u ... user data

 v --- aligned to MIN_ALIGN, this pointer is allocated from HeapAlloc
 | x x x x Header(ptr) | u u u u u u u ..... |
 ^----------------^-^
                       ^  aligned to requested alignment, this pointer is returned to user-code

The reason the current code gets away with it - supposedly - is because on 64-bit platforms MIN_ALIGN is 16, which is larger than the data types that currently get put into a tensor. But it's awefully brittle and it just feels like corner-case after corner-case to find these kind of "exceptions" to the allocator playing nicely along to wrong alignment.

it introduces a breaking change

Yeah, the timing close to the release of 0.15 is a bit unfortunate. Not sure about your planned release cycles, I wouldn't mind holding it until the breaking 0.16 comes along naturally.

Also tiny note on the current Bytes implementation: what about sending between threads?

Thread-safety is indeed a good point. The current Bytes should be Send + Sync without a problem since Element: Send + Sync and &Bytes offering only read access, but that could be encoded in some static asserts and tests.

The results seem to be on par with the current, which is great 🙏

There seems to be a slight slowdown in deserialization, which might perform an extra copy to an aligned buffer. I think in that case we can get away with actually checking the runtime alignment instead (still storing and de-allocating with the alignment of the actual buffer) and not copying/only copying when someone wants to convert it an actual Vec<E>. I will take another look 👍

this already improves readability a bit by separating out alloc/dealloc logic and adding a bunch of safety comments and better error messages

borrowing from the deserializer will not save a copy, and is moreover inefficient when we could take ownership of an existing byte buffer

both changes only target miri's borrowing semantics, oprationally the pointers are the same, but they obey different borrow-stack rules.

WorldSEnder · 2024-11-12T18:07:47Z

Alright, have added both suggestions (Sync and Send + avoiding a copy if the deserialized vector ends up being aligned). I have added further tests and these can be run under miri. In case you want to do that yourself, try it with

cargo +nightly miri test --package burn-tensor --lib -- tensor::bytes::tests --show-output

WorldSEnder added 3 commits October 23, 2024 11:53

implement memory-safe bytes that can be serialized and cloned

cc8549f

change serialization to only serialize the bytes

45748ca

introduce max alignment (which depends on platform anyway) and dont serialize that part fixes Clone, Debug, and Eq impls to work on the bytes, not the pointers.

make bytes no-std compatible

f84c73a

WorldSEnder force-pushed the fix-vec-align branch from 5e9679a to f84c73a Compare October 24, 2024 07:03

WorldSEnder added 5 commits November 12, 2024 15:04

enforce Send and Sync for Bytes

4af1183

avoid a copy during deserialization if data is already aligned

1e56a16

this already improves readability a bit by separating out alloc/dealloc logic and adding a bunch of safety comments and better error messages

revert back to using Vec as deserialization intermediate

2a771e9

borrowing from the deserializer will not save a copy, and is moreover inefficient when we could take ownership of an existing byte buffer

add serialization and conversion tests

52654b3

make Bytes tests run under miri

06c7236

both changes only target miri's borrowing semantics, oprationally the pointers are the same, but they obey different borrow-stack rules.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix alignment issue of TensorData bytes #2416

Fix alignment issue of TensorData bytes #2416

WorldSEnder commented Oct 24, 2024 •

edited

Loading

codecov bot commented Oct 24, 2024 •

edited

Loading

WorldSEnder commented Oct 24, 2024

laggui commented Nov 4, 2024

WorldSEnder commented Nov 4, 2024 •

edited

Loading

laggui commented Nov 4, 2024

WorldSEnder commented Nov 4, 2024

laggui commented Nov 8, 2024 •

edited

Loading

WorldSEnder commented Nov 8, 2024 •

edited

Loading

laggui commented Nov 11, 2024

WorldSEnder commented Nov 11, 2024 •

edited

Loading

WorldSEnder commented Nov 12, 2024

Fix alignment issue of TensorData bytes #2416

Are you sure you want to change the base?

Fix alignment issue of TensorData bytes #2416

Conversation

WorldSEnder commented Oct 24, 2024 • edited Loading

Checklist

Related Issues/PRs

Changes

Testing

codecov bot commented Oct 24, 2024 • edited Loading

Codecov Report

WorldSEnder commented Oct 24, 2024

laggui commented Nov 4, 2024

WorldSEnder commented Nov 4, 2024 • edited Loading

laggui commented Nov 4, 2024

WorldSEnder commented Nov 4, 2024

laggui commented Nov 8, 2024 • edited Loading

WorldSEnder commented Nov 8, 2024 • edited Loading

laggui commented Nov 11, 2024

WorldSEnder commented Nov 11, 2024 • edited Loading

WorldSEnder commented Nov 12, 2024

WorldSEnder commented Oct 24, 2024 •

edited

Loading

codecov bot commented Oct 24, 2024 •

edited

Loading

WorldSEnder commented Nov 4, 2024 •

edited

Loading

laggui commented Nov 8, 2024 •

edited

Loading

WorldSEnder commented Nov 8, 2024 •

edited

Loading

WorldSEnder commented Nov 11, 2024 •

edited

Loading