Knowing the source site in the aggregation API / aggregate queries need key discovery mechanism #583
Comments
Thanks for filing @alois-bissuel. The use-case makes sense. To me, something like (3) seems to be the most natural choice, since it puts most of the onus on the aggregation service to provide the ability to measure the aggregates you are interested in. We would need to think through what that means for the input encoding and query model.

One question: is it feasible for you to encode the publisher in a dense encoding (i.e. using a dictionary) vs. a sparse encoding (like the raw domain bytes)?

Note we are also thinking through a design for token-based authentication, which would interact with (1) and (2). We will try to have something published soon.
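To illustrate the dense-vs-sparse distinction above, here is a minimal sketch; the dictionary contents, domains, and sizes are made up for illustration:

```python
# Dense encoding: the adtech maintains a dictionary of the publishers it
# works with, so each source site compresses to a small integer index.
PUBLISHER_DICT = {"news.example": 0, "blog.example": 1, "video.example": 2}

def dense_id(source_site: str) -> int:
    # Three known publishers fit in 2 bits of an aggregation key;
    # unknown sites would need a reserved fallback index.
    return PUBLISHER_DICT[source_site]

# Sparse encoding: the raw domain bytes, which quickly exhaust a 128-bit
# key (16 bytes) and leave no room for other key dimensions.
def sparse_bytes(source_site: str) -> bytes:
    return source_site.encode("utf-8")

print(dense_id("blog.example"))                # 1
print(len(sparse_bytes("video.example")) * 8)  # 104 bits for one domain
```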
Thanks for the quick answer!
I hope this makes sense!
@alois-bissuel yes it does make sense, thank you for clarifying!
FWIW, to me, (3) also seems the easiest choice from the API designer's perspective. It would be up to the adtech to encode the domain/URL/whatever they need using one or more 128-bit keys. We could consider changes to the aggregation service to allow discovery of keys (e.g. if the value for a given key exceeds a suitably large noisy threshold, we'd allow reporting on it even if the key has not been pre-declared in a domain file at query time). Such an API extension might let you, as an adtech, implement something RAPPOR-like on top of it. cc @cilvento
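For concreteness, here is one way an adtech could pack a hashed source site together with a campaign id into a single 128-bit key, per the suggestion above; the 96/32-bit split and the use of SHA-256 are illustrative assumptions, not anything the API prescribes:

```python
import hashlib

def make_bucket_key(source_site: str, campaign_id: int) -> int:
    """Pack a 96-bit hash of the source site and a 32-bit campaign id
    into one 128-bit aggregation key (illustrative layout)."""
    site_hash = int.from_bytes(
        hashlib.sha256(source_site.encode("utf-8")).digest()[:12], "big")
    assert 0 <= campaign_id < 2**32
    return (site_hash << 32) | campaign_id  # [96-bit site hash][32-bit campaign]

key = make_bucket_key("news.example", campaign_id=42)
assert key < 2**128
print(hex(key))
```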
+1 to option 3; discovery of keys sounds like a more general, useful mechanism.
Impression source domain is a fundamental piece of information that factors into almost every ads use-case, and as the information that allows advertisers to identify who is being advertised to is reduced or eliminated by privacy-preserving changes, the impression source domain will become a more critical factor in impression purchase decisions. Given that, any restrictions on the availability of source domain should be considered very carefully: limiting what can be reported limits what can be measured, and that will have a direct impact on what inventory buyers will support. Marketers are not going to buy impressions they can't tie to a source domain.

Given the importance of source domain, I suggest we consider making it a requirement that any measurement solution include it; if we don't, I think the degradation in usability will inhibit wide adoption and push participants to alternative, more privacy-invasive measurement tools and/or to shift their spend to contexts in which measurement is better supported.

With that preamble: could source domain be included in aggregatable reports as part of the encrypted payload? That would allow for including it in the aggregation key when it had value for a specific report, and the aggregation service could redact, filter, or noise outputs to prevent revealing too much source-domain-related information.
It seems like it could be helpful to call out when source domain is needed for cross-site measurement versus same-site measurement/reporting. For example, reporting that an ad was served on a given domain (modulo restrictions in FLEDGE reporting) is a same-site reporting use-case. This same-site reporting could also be helpful for scoping the set of possible source domains for key discovery, although it's not immediately clear to me how efficient this would be (particularly if conversion rates are low).

The case @alois-bissuel is outlining above for source-domain discovery makes sense for ARA, but are there other use-cases that should be considered in the design? For example, are there other use-cases for unique reach, frequency reporting, etc. that would require different source-domain discovery methods in the Private Aggregation API? Or is just making the source domain available within encrypted reports a sufficient first step?
@cilvento I've started a response; still thinking things through before I post it. In considering what you said, it occurred to me that I'm not entirely clear on how the "mechanism for key retrieval or discovery in the aggregation service" identified by @alois-bissuel in the 3rd option above would work. If someone could add a description, that would be most appreciated.
Hey @bmayd, I think what @alois-bissuel is referring to is something like the following: instead of requiring every output key to be pre-declared at query time, the aggregation service would support a query that returns whichever keys turn out to have sufficiently large (noisy) aggregate values.

This additional query option will often look like a histogram with a thresholding step applied (see this paper for some technical details). In this way, the query result helps you "discover" the non-zero keys, i.e. which publishers saw conversions. This key discovery is not possible today due to the constraint that the output domain needs to be fully specified at query time.
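A toy version of that histogram-plus-threshold flow (Laplace noise drawn as a difference of exponentials; the epsilon and threshold values are made up, and the real service's mechanism and parameters would differ):

```python
import random
from collections import Counter

# Aggregated contributions, keyed by (source_site, campaign) buckets.
raw_histogram = Counter({
    ("news.example", 42): 950,
    ("blog.example", 42): 14,
    ("video.example", 42): 310,
})

EPSILON = 1.0    # made-up privacy parameter
THRESHOLD = 100  # made-up cutoff: small buckets stay hidden

def noisy_threshold_query(histogram):
    """Return only buckets whose noised count clears the threshold,
    'discovering' keys without a pre-declared output domain."""
    discovered = {}
    for key, value in histogram.items():
        # Difference of two Exp(eps) draws is Laplace(0, 1/eps) noise.
        noise = random.expovariate(EPSILON) - random.expovariate(EPSILON)
        if value + noise >= THRESHOLD:
            discovered[key] = round(value + noise)
    return discovered

print(noisy_threshold_query(raw_histogram))
# Typically reveals news.example and video.example; blog.example is dropped.
```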
Thanks for the additional detail, @csharrison. The paper you refer to is rather opaque to me, but I gather the gist is that there must be sufficient value inputs from partition members before the partition is revealed. In the context of A-ARA: source sites could act as partitions and could be included in outputs if there was enough contribution from them to assure their inclusion wouldn't provide information that might allow identification of specific inputs. Please let me know if I didn't get that right.

Assuming my understanding is correct, I think the approach is reasonable and assume it would allow for reporting of top-converting source sites plus an "other" bucket with something like a count of unspecified source sites and the conversions attributable to them.

In terms of encoding the source site, I suggest including it in the encrypted payload and not as part of the 128-bit aggregation keys. Doing this would allow the aggregation service to control when the source site was revealed, while also keeping aggregation keys separate from source sites: source sites would not consume key-space, which would reduce key complexity and allow for keys that could potentially be reused across campaigns, with source site as an additional bucketing option. I think there would be other benefits from having the source site available in the aggregation service as well; for example, there are other sources of source-site information, but sourcing it directly from the browser and through the aggregation service provides a unique point of validation, coming from browsers via a protected channel, versus other systems which are more subject to manipulation.
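To make the "in the encrypted payload, not in the key" suggestion concrete, here is roughly what the decrypted payload of an aggregatable report could carry under it. The `source_site` field is the hypothetical addition (today's payloads carry only the operation and the bucket/value pairs), and the exact layout is an assumption:

```python
# Hypothetical plaintext payload of an aggregatable report, before the
# browser encrypts it toward the aggregation service. Only the service
# would see source_site in the clear, and only post-thresholding.
proposed_payload = {
    "operation": "histogram",
    "data": [
        {
            "bucket": (0x2A).to_bytes(16, "big"),  # 128-bit key, no site bits
            "value": 1,
        },
    ],
    # Proposed addition: kept out of the 128-bit key space entirely.
    "source_site": "https://news.example",
}
print(proposed_payload)
```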
Yeah, the idea is that any "key" could act as a partition. For this particular use-case you could imagine a key that includes the source site, e.g. source_site x campaign is a key.

On the "other" bucket: I think this is technically possible, but we'd need to carefully design this functionality to be privacy preserving. It is not immediately available with the technique I linked (which just drops the below-threshold buckets).

On moving the source site into the encrypted payload: this is an interesting suggestion and I agree it comes with some benefits, but for completeness I think the downsides are worth discussing as well.

On validation: can you say more about this? Is the concern about a bad actor mutating an aggregation key, or about an aggregation key that is securely generated from bad information?
Sorry for the very late answer. First of all, I completely support @bmayd's explanation of the need to have the domain in the reports. I think that including the domain within the encrypted part of the report is an extremely good idea, which would nicely balance the security and usability properties of the API.

For the API surface, I reckon a simple interface could be created. For instance, we could query the provided keys without the domain (e.g. cross-domain reporting), or ask to get the provided keys crossed with the domain (and maybe apply the further thresholding step @csharrison introduced). I don't think we need to specifically filter by source domain (e.g. "I want only the reports for example.com"), so we don't need a specific query language to describe the source site.
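A sketch of what that two-mode query surface might look like; all field names here (`output_domain`, `group_by`, `source_site_threshold`) are hypothetical, as the aggregation service exposes no such parameters today:

```python
# Mode 1: query the pre-declared keys without the domain
# (cross-domain reporting, as today).
query_without_domain = {
    "output_domain": ["0x1a", "0x2b", "0x3c"],  # pre-declared bucket keys
    "group_by": ["bucket"],
}

# Mode 2: the same keys crossed with the source domain, gated by the
# thresholding step so only well-represented domains are revealed.
query_with_domain = {
    "output_domain": ["0x1a", "0x2b", "0x3c"],
    "group_by": ["bucket", "source_site"],
    "source_site_threshold": 100,  # hypothetical noisy-threshold knob
}
print(query_without_domain, query_with_domain)
```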
I was actually thinking of sources of site information that are outside ARA entirely, such as ad servers. As we become increasingly data-limited, our ability to confidently corroborate claims is reduced, making information provided through a trusted channel, such as ARA, much more important.
I agree with @alois-bissuel here; I think it is a good start that addresses the majority of reporting needs and will give us a solid basis for evaluating the API. If it turns out there are significant unaddressed use-cases, we can consider them when they're surfaced.
@csharrison It's been a long time since we discussed this, but I wasn't able to find anything regarding the resolution. Will it be possible to get the impression source domain included in the encrypted payload, so that it is available for inclusion in group-by keys and we can report by source domain?
The first step toward this (introducing a key discovery mechanism) is still under consideration, and I think it is a prerequisite for this use-case. Once it is supported, the use-case of getting impression domains included is partially supported via hash-based methods (which, as I understand from this thread, are non-ideal). However, we can probably build from this foundation toward more advanced techniques like encrypting the whole site, though we haven't made a ton of progress on that yet. cc @keke123
Hello,
We have two use cases in advertising that are hard to fit into the current version of the aggregate API. Advertisers and marketers want to know from which domain conversions were made. And for fraud prevention, knowing on which domain clicks were made is paramount for detecting and banishing shady websites set up to siphon money off advertisers.
The source site (i.e. the publisher domain) has been removed from aggregatable reports in #445. Before this pull request (which is trying to solve #439), the source site was available in the clear (as the `attribution_destination` currently is). Encoding the publisher domain in the aggregate API in its current state (i.e. with no `source_site` in the aggregatable reports) is a very hard problem given its characteristics.

So far, I see three potential solutions, the first two of which use plausible deniability to add the `source_site` back in the clear to the aggregatable reports. The first is to add the `source_site` back in aggregatable reports and to send, with some probability, empty conversion reports (e.g. a random key and a zero value) from any website the user has visited (see the sketch in the P.S. below). This might enable the exfiltration of even more user data than before: a very targeted campaign would allow a bad party to gain knowledge of the browsing habits of the targeted user group. Hence the second proposition. The third is to keep the current report format and add a mechanism for key retrieval or discovery in the aggregation service.

What are your suggestions?
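P.S. To make the first option concrete, here is a minimal sketch of the decoy-report idea; the probability, report shape, and helper names are illustrative assumptions:

```python
import random
import secrets

DECOY_PROBABILITY = 0.05  # made-up rate; tuning it is the hard part

def maybe_send_decoy(source_site: str, send_report) -> None:
    """With some probability, emit an empty conversion report (random
    128-bit key, zero value) from a site the user merely visited, so a
    real report no longer proves a conversion happened there."""
    if random.random() < DECOY_PROBABILITY:
        send_report({
            "source_site": source_site,       # back in the clear
            "bucket": secrets.randbits(128),  # random key
            "value": 0,                       # empty contribution
        })

# The browser would call this on ordinary page visits:
maybe_send_decoy("https://blog.example", send_report=print)
```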
N.B. This issue also concerns the Private Aggregation API, as it uses the same report format as ARA for slightly different use cases. Cross-posting a very similar issue there.