
Monitoring to surface "low-traffic" jobs isolation failure
Closed, Resolved, Public

Description

In T378385, a large influx of flaggedrevs_CacheUpdate jobs resulted in delayed execution of other job types that are similarly mapped to the shared "low-traffic jobs" changeprop-jobqueue config.

In this scenario (or a future one like it) it would have been handy to identify the problem earlier, and potentially mitigate by isolating the "no longer low-traffic" job to its own consumer config.

At present, we don't really have a way to surface this automatically. The purpose of this task is to assess how to do that.

Event Timeline

Change #1083904 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] Add JobQueueLowTrafficProcessingRateTooHigh alert

https://gerrit.wikimedia.org/r/1083904

Scott_French changed the task status from Open to In Progress. Wed, Nov 13, 8:01 PM
Scott_French raised the priority of this task from Low to High.

There are really two parts to this:

  1. Identifying a sufficiently accurate alert signal and implementing the alert.
  2. Defining an appropriate response and documenting it in a runbook.

Alerting

One simple option is to create an alert that identifies jobs executed at an unusually high rate by the low-traffic config. That has the upside of a signal that is very simple to reason about and that points directly to a likely "antagonist" job. The downside is that it's cause-based rather than impact-based, which makes choosing the execution-rate threshold non-obvious (lest we alert when there is no impact, i.e., a false positive). That's what's sketched out in https://gerrit.wikimedia.org/r/1083904.
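
For illustration, a minimal sketch of that cause-based signal, using the same metric and rule labels as the expression below; the 15m window and the threshold of 50/s are placeholders that would need tuning, not values from the patch:

# Per-job execution rate through the shared low-traffic consumer
# (window and threshold are placeholders).
sum by (rule) (
  rate(cpjobqueue_normal_rule_processing_count{rule=~"low-traffic-jobs-mediawiki-job-.*"}[15m])
) > 50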

Another option is to look at the execution delay for all jobs that are bundled into the low-traffic config, and use that in some way. The trouble there is that the data is tricky to work with (e.g., sparse, particularly when impacted jobs are starved for throughput), so you have to get creative with the alert signal. For example:

count(
  histogram_quantile(0.95, sum(rate(cpjobqueue_normal_rule_processing_delay_bucket{rule=~"low-traffic-jobs-mediawiki-job-.*"}[30m])) by (rule, le)) > 120
) / count(
  cpjobqueue_normal_rule_processing_count{rule=~"low-traffic-jobs-mediawiki-job-.*"}
)

gives us something like "the fraction of low-traffic job types with a p95 execution delay greater than 2m", which feels more principled, as it's directly tied to impact. At the same time, it requires selecting multiple "magic numbers": the percentile, the rate window width, the badness threshold, and the trigger fraction.

Despite the inherent complexity, the latter feels like the right approach, and it's what I'll continue to pursue. Either approach (with appropriate values) would have detected T378385 and T379462.
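
For context, this is roughly how that expression could be wrapped into a Prometheus alerting rule. The trigger fraction (0.3), the for: duration, the severity, and the annotation text here are placeholders for illustration, not what actually landed in operations/alerts:

groups:
  - name: cpjobqueue
    rules:
      - alert: JobQueueLowTrafficConsumerWidespreadHighLatency
        # Fraction of low-traffic job rules whose p95 processing delay exceeds 2m.
        expr: |
          count(
            histogram_quantile(0.95, sum(rate(cpjobqueue_normal_rule_processing_delay_bucket{rule=~"low-traffic-jobs-mediawiki-job-.*"}[30m])) by (rule, le)) > 120
          ) / count(
            cpjobqueue_normal_rule_processing_count{rule=~"low-traffic-jobs-mediawiki-job-.*"}
          ) > 0.3
        for: 30m
        labels:
          severity: warning  # placeholder
        annotations:
          summary: Widespread high processing latency for low-traffic cpjobqueue jobs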

Response

At a high level, the response is to isolate the antagonist to a dedicated consumer. That gets a little complicated, though: simply adding a new rule to the high_traffic_jobs_config in [0] will be effective in mitigating the impact on low-traffic jobs, but it does nothing for the likely large unprocessed backlog of events on the antagonist job's topic.

Specifically, when the new consumer group appears, it will start at the latest offset on the associated topic (i.e., changeprop correctly sets auto.offset.reset to "largest"), but what we need is to pick up from the current offset for that topic on the cpjobqueue-low_traffic_jobs consumer group.

Instead, the procedure would look like:

  1. Add the antagonist to a new rule in high_traffic_jobs_config with enabled: false and deploy (see the config sketch after this list). At this point, impact stops, but we've also paused processing of the antagonist job.
  2. Use the standard kafka-consumer-groups.sh tool to query the current offset for cpjobqueue-low_traffic_jobs on the antagonist topic (--describe) and pre-create the new consumer group at that offset for that topic (--reset-offsets).
  3. Switch to enabled: true and deploy. At this point, the new rule will pick up processing at the appropriate offset.
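
As a rough sketch of step #1 (illustrative only; copy the exact structure and tuning fields from an existing entry in values.yaml [0], as the concurrency value here is just a placeholder):

high_traffic_jobs_config:
  AntagonistJob:
    enabled: false   # excluded from the shared low-traffic rule, but not yet consumed by the new one
    concurrency: 10  # placeholder; tune per job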

I've tested something like this procedure locally, and I believe it should work, though I need to do a bit more research.

Importantly, #1 is sufficient for mitigation, while #2 and #3 put us in a recovered (operating normally) state.

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/changeprop-jobqueue/values.yaml

Change #1091797 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] Add JobQueueLowTrafficRuleWidespreadHighLatency

https://gerrit.wikimedia.org/r/1091797

Change #1091797 merged by jenkins-bot:

[operations/alerts@master] Add JobQueueLowTrafficConsumerWidespreadHighLatency

https://gerrit.wikimedia.org/r/1091797

To spell out step #2 of the procedure described in T378609#10325210 more explicitly:

Suppose the antagonist is AntagonistJob and the current primary DC is codfw, in which case the relevant topic is going to be codfw.mediawiki.job.AntagonistJob.

The first part is to fetch the current offset on the cpjobqueue-low_traffic_jobs consumer group. From any kafka-main broker host in codfw:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-low_traffic_jobs --describe

For a low-traffic job type, we should only ever see a single partition on the associated topic (0), and we'll take note of its CURRENT-OFFSET.

With that in hand, we can initialize the offset for the soon-to-be-used (step #3) consumer group, where $OFFSET is the current offset on the old consumer group. We can test this out with:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-AntagonistJob --topic codfw.mediawiki.job.AntagonistJob --reset-offsets --to-offset $OFFSET --dry-run

Then, if no issues are reported (e.g., due to a mistyped topic name), run that again with --execute instead of --dry-run.
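
For completeness, the execute form is the same command with the final flag swapped:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-AntagonistJob --topic codfw.mediawiki.job.AntagonistJob --reset-offsets --to-offset $OFFSET --execute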

That would be the "fully manual" v1.0 version of how to achieve step #2.

Can we automate this?

Naively, we'd write a tool to use some equivalent of the listConsumerGroupOffsets and alterConsumerGroupOffsets Admin API methods.

Interestingly, it looks like we have support for something sort of like this in spicerack already: Kafka.transfer_consumer_position [0].

That takes the perhaps unusual approach of using ephemeral consumers to retrieve the offset on the old group and then commit it on the new group. I suspect this approach was used due to the lack of full Admin API support in kafka-python, but I'm not entirely sure (e.g., it may also be a workaround to perform this operation remotely).
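
As a rough illustration of that ephemeral-consumer technique, here is a minimal kafka-python sketch with placeholder names and assumed exact call signatures; it is not the actual spicerack implementation:

from kafka import KafkaConsumer
from kafka.structs import OffsetAndMetadata, TopicPartition

BROKERS = 'localhost:9092'                   # any kafka-main broker in the primary DC
TOPIC = 'codfw.mediawiki.job.AntagonistJob'  # placeholder antagonist topic
OLD_GROUP = 'cpjobqueue-low_traffic_jobs'
NEW_GROUP = 'cpjobqueue-AntagonistJob'       # placeholder new consumer group

tp = TopicPartition(TOPIC, 0)  # low-traffic job topics only ever have partition 0

# Ephemeral consumer on the old group: it only fetches the committed offset and
# never subscribes or polls, so it does not join the group or trigger a rebalance.
old = KafkaConsumer(bootstrap_servers=BROKERS, group_id=OLD_GROUP, enable_auto_commit=False)
offset = old.committed(tp)
old.close()

# Ephemeral consumer on the new group (still empty, since the rule is disabled):
# commit the same offset so the new rule resumes where the shared consumer left off.
new = KafkaConsumer(bootstrap_servers=BROKERS, group_id=NEW_GROUP, enable_auto_commit=False)
new.assign([tp])
new.commit({tp: OffsetAndMetadata(offset, '')})
new.close()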

While I need to give this a bit more thought, I think this approach would work for us. The reason I'm hesitant is that the cpjobqueue-low_traffic_jobs consumer group is still in active use, so some care is needed to avoid disrupting the existing consumers. AFAICT, though, the spicerack code does not perform any operations that would trigger a rebalance (e.g., it never subscribes or polls).

[0] https://gerrit.wikimedia.org/g/operations/software/spicerack/+/a2c4953850272a048ac1c746c8fbb714a3173b58/spicerack/kafka.py#303

Change #1092906 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] Move low-traffic consumer latency alert to critical

https://gerrit.wikimedia.org/r/1092906

Change #1092906 merged by jenkins-bot:

[operations/alerts@master] Move low-traffic consumer latency alert to critical

https://gerrit.wikimedia.org/r/1092906

Change #1083904 abandoned by Scott French:

[operations/alerts@master] Add JobQueueLowTrafficRuleProcessingRateTooHigh

Reason:

With the latency-impact alert now live, it's preferable to hold off on this unless we discover it's needed as a fallback.

https://gerrit.wikimedia.org/r/1083904

The latency-impact alert is now live and, based on initial testing with a short trigger duration, appears to work as expected.

I have a couple of ideas for further tuning of the low-traffic consumer to mitigate "throughput starvation" issues, which I'll follow up on separately.