
KV Data Loading: Non WASM Pluggability, Ability to Create Indexes and Other Ingestion Features #57

Open
thegreatfatzby opened this issue Jun 2, 2024 · 2 comments

@thegreatfatzby

Similar to the question about logging during data loading, I wonder if we could allow more functionality and flexibility in the data loading corner of the KV world: creating more interesting indexes, more complex loading and batching logic, and so on. As an example, during our bidder's indexing phase we do a lot with bitmaps, and I'd guess that if we could build those in C/C++ and then read them from the WASM, we'd be able to optimize a lot more than if we can only store string-based KVs. Since loading data is safe (I think) from a privacy perspective, there could be a Chromium space hook (loadFile, loadData, etc.) that invokes compiled code with richer access to the in-memory store.

This would pair nicely with being able to expose our own reading functions. Since that is not privacy safe, though, I'd think we could instead submit PRs to the repo for new reading UDFs as makes sense; the writing side would not require that.
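To make the bitmap idea concrete, here is a minimal sketch of an index that is built once at load time and only read at query time. It uses plain Python integers as bitmaps; the names `build_bitmap_index` and `bitmap_to_ids` and the sample `ads` data are hypothetical and are not part of the K/V server or of any bidder's code.

```python
from collections import defaultdict


def build_bitmap_index(ads):
    """Load-time step: map each attribute to an integer bitmap.

    Bit i of the bitmap is set when the ad with id i carries the attribute.
    """
    index = defaultdict(int)
    for ad_id, attributes in ads.items():
        for attribute in attributes:
            index[attribute] |= 1 << ad_id
    return dict(index)


def bitmap_to_ids(bitmap):
    """Read-time helper: expand a bitmap into a sorted list of ad ids."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical sample data: ad id -> attributes.
ads = {0: ["games"], 1: ["games", "news"], 2: ["news"], 5: ["games", "news"]}
index = build_bitmap_index(ads)

# Intersection is a single bitwise AND over the two bitmaps.
games_and_news = index["games"] & index["news"]
print(bitmap_to_ids(games_and_news))  # prints [1, 5]
```

The point of the split is that all writes happen in `build_bitmap_index`, while the query path only reads the prebuilt bitmaps and never mutates them.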

@kelvintatendagorekore
Contributor

Thanks for your questions and suggestions!

We have been discussing and working on some features internally to expand functionality and flexibility in the data loading and query areas, and I wanted to describe some of the existing features to see whether they would help address any of your needs:

  1. Set query language - The K/V server supports loading data as sets of strings (std::string) and exposes a UDF read API, runQuery("query"), for executing queries over those sets. The query language supports simple set operations: union (|), difference (-), and intersection (&). As an example, suppose that we have indexed two sets of ads, related to games and news, in the K/V server memory store. We can find the set of ads related to both games and news by intersecting the two sets with the following query in a UDF: games_and_news_ads = runQuery("games & news"). See the following example of how to use the runQuery API: https://github.com/privacysandbox/protected-auction-key-value-service/tree/release-0.16/getting_started/examples/sample_word2vec#the-word2vec-sample
  2. Set query based on bitmaps - Building on (1) above, the K/V server also supports indexing sets of uint32 values and running queries over them using the UDF read API runSetQueryInt("query"). Queries over uint32 sets are implemented using bitmaps for performance. More documentation is coming soon.
  3. Binary data for key/value lookups - The K/V server supports a key/value read API, getValuesBinary, that can be called from UDFs and returns a binary serialized protobuf response, avoiding the JSON serialization overhead of the similar getValues API. The feature is documented here: https://github.com/privacysandbox/protected-auction-key-value-service/blob/release-0.16/docs/udf_read_apis_with_binary_data.md#udf-datastore-read-apis-with-binary-data
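The semantics of the three set operators in (1) can be illustrated outside the server with ordinary Python sets. This is only a sketch of what the operators mean, not the server's implementation: `run_query_sketch` and the sample `store` are hypothetical, queries must be whitespace-separated, and evaluation is strictly left to right, which is an assumption made here because operator precedence and parentheses are not described in this thread.

```python
def run_query_sketch(store, query):
    """Evaluate a query such as "games & news" over named sets.

    Supports | (union), - (difference) and & (intersection), applied
    left to right. Unknown set names are treated as empty sets.
    """
    tokens = query.split()
    if not tokens or len(tokens) % 2 == 0:
        raise ValueError(f"malformed query: {query!r}")

    # Copy the first operand so the stored set is never mutated.
    result = set(store.get(tokens[0], set()))
    for operator, name in zip(tokens[1::2], tokens[2::2]):
        operand = store.get(name, set())
        if operator == "|":
            result |= operand
        elif operator == "&":
            result &= operand
        elif operator == "-":
            result -= operand
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return result


# Hypothetical sample data standing in for sets loaded into the store.
store = {
    "games": {"ad1", "ad2", "ad3"},
    "news": {"ad2", "ad3", "ad4"},
}

games_and_news_ads = run_query_sketch(store, "games & news")
print(sorted(games_and_news_ads))  # prints ['ad2', 'ad3']
```

The same three operators map directly onto bitwise OR, AND NOT, and AND when the sets hold small integers stored as bitmaps, which is the general idea behind (2).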

We are also curious to hear more about your questions and needs.

  1. Re: "creating more interesting indexes...": Do you mind providing a concrete example of what you mean by creating indexes? As an example, by indexing do you mean you want to have multiple keys pointing to the same copy of the value to save space? In general, what are your goals for having indexes?
  2. Re: "more complex loading and batching logic, etc": Do you mind explaining further what you mean here? An example would be helpful.
  3. Re: "...we do a lot with bitmaps, and I'd guess that if we could do that in C/C++ and then read using the WASM, we'd be able to optimize a lot more...": Can you expand on how you envision using bitmaps? Would it happen during data loading or at query time? Would this require any side effects on the memory store during request processing (note that this is not allowed, for privacy reasons)?
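For reference, the "multiple keys pointing to the same copy of the value" interpretation in question 1 could look roughly like the sketch below. `SharedValueStore` is a hypothetical illustration, not an existing K/V server structure, and it never evicts values, which a real store would need to handle.

```python
class SharedValueStore:
    """Toy key/value store in which equal values are stored only once."""

    def __init__(self):
        self._canonical = {}  # value -> the single stored copy of that value
        self._by_key = {}     # key -> reference to the stored copy

    def put(self, key, value):
        # Reuse the existing copy if an equal value was already stored.
        stored = self._canonical.setdefault(value, value)
        self._by_key[key] = stored

    def get(self, key):
        return self._by_key.get(key)

    def distinct_values(self):
        return len(self._canonical)


shared = SharedValueStore()
shared.put("campaign:1", "creative-payload")
# Build an equal string at runtime so it starts out as a separate object.
shared.put("campaign:2", "-".join(["creative", "payload"]))

print(shared.get("campaign:1") is shared.get("campaign:2"))  # prints True
print(shared.distinct_values())  # prints 1
```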

@lx3-g
Collaborator

lx3-g commented Jul 16, 2024

@thegreatfatzby Anything else we need to clarify? Does the current functionality meet your needs?
