null terminated bytestrings? #483

Open
hasufell opened this issue Feb 4, 2022 · 5 comments

@hasufell
Member

hasufell commented Feb 4, 2022

This is more of a discussion/question than an issue.

I was looking into calling into the libc function strpbrk, because it is much faster than any equivalent of findIndex could be, see:

But then I noticed that Haskell ByteStrings are not null-terminated, and adding the terminator would require a full memcpy, which defeats the purpose when the goal is optimization.
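To make the cost concrete, here is a minimal sketch of calling strpbrk through the FFI today. It relies on `useAsCString` from `Data.ByteString`, which copies the bytes into a fresh null-terminated buffer before running the callback. The wrapper name `findAnyViaStrpbrk` is hypothetical, and the result diverges from a pure search if the haystack contains an interior null byte.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as B
import Foreign.C.String (CString)
import Foreign.Ptr (minusPtr, nullPtr)

-- libc: char *strpbrk(const char *s, const char *accept);
foreign import ccall unsafe "string.h strpbrk"
  c_strpbrk :: CString -> CString -> IO CString

-- Hypothetical wrapper: offset of the first byte of the haystack that
-- occurs in the accept set. Both arguments are copied and null-terminated
-- by useAsCString, which is the memcpy the comment above refers to.
findAnyViaStrpbrk :: B.ByteString -> B.ByteString -> IO (Maybe Int)
findAnyViaStrpbrk accept haystack =
  B.useAsCString haystack $ \hay ->
    B.useAsCString accept $ \acc -> do
      hit <- c_strpbrk hay acc
      pure $ if hit == nullPtr
               then Nothing
               else Just (hit `minusPtr` hay)
```

The copy is O(n) in the length of the haystack, so any speed advantage of strpbrk has to outweigh it.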

So I wondered:

  1. what if ByteStrings were null-terminated internally, without changing any of the external API? That would make it easier to just pass them to C functions expecting null-terminated strings without copying. Yes, I'm aware that a bytestring can have null bytes anywhere and that you'd potentially get divergent behavior between strpbrk and a "pure Haskell implementation"
  2. what if there were another module enforcing the invariant? Via a newtype, maybe?
  3. are there other tricks that could be employed? Lazy bytestrings, obviously, don't help here. Could Text be an alternative? The main reason I use ByteString for this task is because it has those very fast elemIndex functions implemented via memchr.

I think there might be many more such C functions that are not re-implemented for ByteString for exactly that reason.

@Bodigrim
Contributor

Bodigrim commented Feb 4, 2022

You cannot retain constant-time slicing for null-terminated strings.
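A small sketch of why this is so: `take` returns a view into the same buffer in O(1), so the byte immediately after a prefix slice is whatever follows in the original string, not a terminator. The helper name `byteAfterSlice` is hypothetical, and the example reads one byte past the slice, which is only safe here because the slice is strictly shorter than the buffer it points into.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Storable (peekByteOff)

-- Hypothetical helper: take the first n bytes (an O(1) slice sharing the
-- original buffer) and peek at the byte just past the end of that slice.
-- Precondition: 0 < n < length bs, so the read stays inside the buffer.
byteAfterSlice :: Int -> BC.ByteString -> IO Word8
byteAfterSlice n bs =
  unsafeUseAsCStringLen (BC.take n bs) $ \(ptr, len) ->
    peekByteOff ptr len
```

For `byteAfterSlice 5 (BC.pack "hello world")` the byte read is 32, the space after "hello". Making that slice null-terminated would mean either overwriting the shared buffer or copying the prefix.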

@hasufell
Member Author

hasufell commented Feb 4, 2022

> You cannot retain constant-time slicing for null-terminated strings.

My suggestion wasn't to change the internal representation of ByteString. But you could have, e.g., a module that exposes only functions for which adding the null byte during construction is trivial (e.g. fromString), and then maintain that invariant across all operations.
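A rough sketch of what such a module could look like, assuming a newtype whose constructor would be hidden behind an export list. All names here (`NulTerminated`, `fromByteString`, `withNulTerminated`, `dropBytes`) are hypothetical and not part of bytestring. This version also rejects interior null bytes, which is one possible design choice rather than something proposed in the thread.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCString)
import Foreign.C.String (CString)

-- Invariant: the wrapped ByteString ends in exactly one 0 byte
-- and contains no other 0 byte.
newtype NulTerminated = NulTerminated B.ByteString

-- Construction pays for one copy (snoc) to append the terminator.
fromByteString :: B.ByteString -> Maybe NulTerminated
fromByteString bs
  | B.elem 0 bs = Nothing
  | otherwise   = Just (NulTerminated (B.snoc bs 0))

-- Dropping the terminator is an O(1) slice.
toByteString :: NulTerminated -> B.ByteString
toByteString (NulTerminated bs) = B.init bs

-- Zero-copy hand-off to C; the callback must not mutate the buffer
-- or let the pointer escape.
withNulTerminated :: NulTerminated -> (CString -> IO a) -> IO a
withNulTerminated (NulTerminated bs) = unsafeUseAsCString bs

-- Suffixes keep the terminator, so dropping from the front stays O(1).
-- There is deliberately no O(1) 'take': a prefix would need a copy.
dropBytes :: Int -> NulTerminated -> NulTerminated
dropBytes n (NulTerminated bs) =
  NulTerminated (B.drop (max 0 (min n (B.length bs - 1))) bs)
```

The missing `take` is where the constant-time slicing objection shows up in the API: only operations that preserve the trailing byte can avoid copying.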

@Bodigrim
Contributor

Bodigrim commented Feb 4, 2022

The trade-offs and API would be vastly different from bytestring, so I don't really see it fitting into this package.

We can discuss adding cbits implementation for findIndex instead.

@hasufell
Member Author

hasufell commented Feb 4, 2022

> We can discuss adding cbits implementation for findIndex instead.

It wouldn't even need to be a full findIndex; something like this would do:

findIndex' :: [Word8]
           -> ByteString
           -> Maybe (Int, Word8)

I don't have a clear idea of how to implement this efficiently.
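As a baseline for the intended semantics of that signature, a naive version can be written on top of the existing `findIndex`. This is only a reference sketch, not the fast implementation being asked for: the list membership test costs O(k) per byte for a set of k bytes.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Naive reference: position and value of the first byte of the input
-- that is a member of the given set, or Nothing if there is none.
findIndex' :: [Word8] -> B.ByteString -> Maybe (Int, Word8)
findIndex' set bs = do
  i <- B.findIndex (`elem` set) bs
  pure (i, B.index bs i)
```

Returning the matched byte alongside the index saves the caller a second lookup to find out which member of the set was hit.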

@Bodigrim
Contributor

Bodigrim commented Feb 17, 2022

Something like this could be implemented very efficiently, even without cbits:

newtype Mask = Mask ByteString -- 256 bits = 32 bytes
findIndexInMask :: Mask -> ByteString -> Maybe (Int, Word8)
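Only the two type signatures above come from the comment. The following is one possible reading of them, assuming the mask is a 32-byte bitmap in which bit `w` is set when byte `w` belongs to the search set. The helpers `mkMask` and `member` are hypothetical names introduced for the sketch.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.List (foldl')
import Data.Word (Word8)

newtype Mask = Mask B.ByteString -- 256 bits = 32 bytes

-- Hypothetical constructor: byte i of the mask holds the membership
-- bits for the values 8*i .. 8*i+7.
mkMask :: [Word8] -> Mask
mkMask ws = Mask (B.pack (map byteAt [0 .. 31]))
  where
    byteAt :: Int -> Word8
    byteAt i =
      foldl' setBit 0
        [ fromIntegral w .&. 7 | w <- ws, fromIntegral w `shiftR` 3 == i ]

-- Constant-time membership test: one index and one bit test.
-- unsafeIndex is in bounds because w `shiftR` 3 is always in 0..31.
member :: Mask -> Word8 -> Bool
member (Mask m) w =
  testBit (BU.unsafeIndex m (fromIntegral w `shiftR` 3)) (fromIntegral w .&. 7)

findIndexInMask :: Mask -> B.ByteString -> Maybe (Int, Word8)
findIndexInMask mask bs =
  case B.findIndex (member mask) bs of
    Nothing -> Nothing
    Just i  -> Just (i, BU.unsafeIndex bs i)
```

Compared with a list-based set, the per-byte cost no longer depends on how many bytes are being searched for, and the mask can be built once and reused across many inputs.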
