Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.5.0
-
None
Description
We got many reports that dropDuplicates does not clean up the state even though they have set a watermark for the query. We document the behavior clearly that the event time column should be a part of the subset columns for deduplication to clean up the state, but it cannot be applied to the customers as timestamps are not exactly the same for duplicated events in their use cases.
We propose to deduce a new API of dropDuplicates which has following different characteristics compared to existing dropDuplicates:
- Weaker constraints on the subset (key)
- Does not require an event time column on the subset.
- Looser semantics on deduplication
- Only guarantee to deduplicate events within the watermark.
Since the new API leverages event time, the new API has following new requirements:
- The input must be streaming DataFrame.
- The watermark must be defined.
- The event time column must be defined in the input DataFrame.
More specifically on the semantic, once the operator processes the first arrived event, events arriving within the watermark for the first event will be deduplicated.
(Technically, the expiration time should be the “event time of the first arrived event + watermark delay threshold”, to match up with future events.)
Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events. (If they are unsure, they can alternatively set the delay threshold large enough, e.g. 48 hours.)
Longer design doc will be attached.