Selectively disable changeprop functionality that is no longer used
Open, Needs Triage, Public

Description

Intro

Per T353876, we are at a point in the RESTBase migration where we only rely on changeprop for lint table updates. This is documented in T361013. The purpose of this task is to evaluate and start turning off the changeprop feature flags that call out to RESTBase. The intent is manifold:

  • Having fewer moving parts makes operating the platform easier
  • There are fewer code paths being exercised, so less room for errors, bugs, attacks, etc.
  • Selectively turning off parts of changeprop gives us more control over the process (compared to one big turn-off), with easier rollback steps.
  • We won't be needlessly stressing various APIs across the infrastructure any more

Tracking

  • summary_definition_rerender
  • rerendered_pcs_endpoints
  • purge_varnish
  • mw_purge_null_edit
  • page_create
  • page_delete
  • page_edit
  • page_move
  • page_restore
  • page_images_summary
  • page_images_mobile
  • revision_visibility_change
  • on_transclusion_update
  • on_backlinks_update
  • ores_cache
  • ores precache (older ORES precaching mechanism, non functional already)
  • wikidata_description_on_edit
  • wikidata_description_on_undelete
  • liftwing_models

Event Timeline

Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for

  • summary_definition_rerender
  • rerendered_pcs_endpoints
  • purge_varnish
  • mw_purge_null_edit
  • page_create
  • page_delete
  • page_edit
  • page_move
  • page_restore
  • page_images_summary
  • page_images_mobile
  • revision_visibility_change
  • on_transclusion_update
  • on_backlinks_update
  • ores_cache
  • wikidata_description_on_edit
  • wikidata_description_on_undelete
  • liftwing_models

Adding ORES and Lift-Wing as tags.

We'll need to figure out the best order to minimize repercussions and cause the least amount of pain to users. On a happy note, it now appears far more plausible that, in the case of a big incident, we'll be able to stop changeprop temporarily.

Hello!

The ores_cache job should be defined but disabled in the running config. We don't use it anymore, and IIRC it is no longer running in ChangeProp (lemme know otherwise).

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

You are correct. I'll post a patch to remove it, then. Thanks!

This is probably something we want to move away from Changeprop, then, and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think there is no code that is specific to LiftWing, just standard reactions to events on Kafka.

No problem for me! I can only see one issue, and it is not specific to our topics: if we start another job in cp-jobqueue, the Kafka consumer offset will be reset to whatever is the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal, since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over, we'd need to sync with them and figure out whether a "hole" in the stream is acceptable. Otherwise, the only thing that I can think of is:

  • stop the changeprop rule for the lift wing topic that Search uses.
  • write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
  • create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
  • add the rule to cp-jobqueue and check if it works.
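
The four steps above amount to carrying committed offsets over from one consumer group to another. What follows is a minimal sketch in plain Python that models only the bookkeeping, with no Kafka client involved. The function names, the topic name, and the sample offsets are all hypothetical and chosen purely for illustration; how the offsets would actually be read and written is not covered here.

```python
from typing import Dict, Tuple

# (topic, partition) -> offset: a simplified model of per-group committed offsets.
Offsets = Dict[Tuple[str, int], int]


def events_skipped(start: Offsets, committed: Offsets) -> int:
    """Count events a new consumer never sees if it starts at `start`
    while the old consumer had only processed up to `committed`."""
    return sum(max(start[tp] - committed.get(tp, 0), 0) for tp in start)


def seed_from_old_group(committed: Offsets) -> Offsets:
    """Third step of the list: start the new group where the old one stopped."""
    return dict(committed)


# Hypothetical snapshot written down after stopping the changeprop rule.
old_group_committed: Offsets = {("example_topic", 0): 1200, ("example_topic", 1): 950}
# Hypothetical end-of-topic offsets by the time the cp-jobqueue rule is added.
log_end: Offsets = {("example_topic", 0): 1260, ("example_topic", 1): 990}

# Starting at the end of the topic leaves a hole; seeding from the old group does not.
hole_if_latest = events_skipped(log_end, old_group_committed)
hole_if_seeded = events_skipped(seed_from_old_group(old_group_committed), old_group_committed)
print(hole_if_latest, hole_if_seeded)
```

In this made-up scenario the size of the "hole" is simply the number of events produced between stopping the old rule and starting the new one, which is why the offset has to be written down before the new rule is deployed.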

Change #1016391 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] changeprop: Remove ORES functionality from chart

https://gerrit.wikimedia.org/r/1016391

Change #1016391 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Remove ORES functionality from chart

https://gerrit.wikimedia.org/r/1016391

Next up: mobile-sections. It has been deprecated for a long time now, per T328036. I'll remove the rules updating mobile-sections endpoints. That should be fine for external users: for many months now we have been returning 403 to almost everyone (exceptions are still in place for Kiwix and Wikiwand, T340036).

Hmmm, can these endpoints receive the same request twice? I see that all changeprop does is a POST to https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic> with a body that contains event: '{{globals.message}}'
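
To make the question concrete, here is a rough sketch of what such a request might look like when built twice for the same event. It only constructs the requests and never sends them. The URL is a deliberately fake placeholder, the JSON shape of the body is an assumption based on the `event` key mentioned above, and `build_score_request` is a hypothetical helper, not something taken from changeprop.

```python
import json
import urllib.request

# Placeholder endpoint; the real path segments are not filled in here.
EXAMPLE_URL = "https://inference.example.invalid/v1/models/placeholder"


def build_score_request(url: str, event: dict) -> urllib.request.Request:
    """Build (but do not send) a POST whose body wraps the event under an `event` key."""
    body = json.dumps({"event": event}, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical event; a real message would carry many more fields.
sample_event = {"rev_id": 123}

# Two installations reacting to the same event would produce identical requests.
first = build_score_request(EXAMPLE_URL, sample_event)
second = build_score_request(EXAMPLE_URL, sample_event)
print(first.get_method(), first.data == second.data)
```

Since both requests carry the same bytes, whether a double delivery is harmless depends entirely on how the receiving endpoint and its downstream consumers treat a repeated event.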

And the rules apparently make it pretty easy to run them from both changeprop and jobqueue simultaneously. That way we could run both for a while (a couple of days?) and then shut down the changeprop side, leaving jobqueue to continue as normal.
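
If both installations run at once, every event would be delivered twice for the overlap period. The sketch below shows one way a downstream consumer could tolerate that, by remembering recently seen keys and dropping repeats. It is purely illustrative: the `SeenRecently` class, the choice of key, and the capacity are assumptions, and nothing here says that any existing consumer works this way.

```python
from collections import OrderedDict
from typing import Hashable


class SeenRecently:
    """Bounded memory of recently handled event keys."""

    def __init__(self, capacity: int = 10000) -> None:
        self._capacity = capacity
        self._keys: "OrderedDict[Hashable, None]" = OrderedDict()

    def first_time(self, key: Hashable) -> bool:
        """Return True the first time a key is seen, False for repeats."""
        if key in self._keys:
            self._keys.move_to_end(key)  # refresh recency of a repeated key
            return False
        self._keys[key] = None
        if len(self._keys) > self._capacity:
            self._keys.popitem(last=False)  # evict the oldest key
        return True


# Hypothetical keys: each event arrives once per installation during the overlap.
seen = SeenRecently(capacity=3)
deliveries = ["rev-1", "rev-1", "rev-2", "rev-2", "rev-3"]
handled = [key for key in deliveries if seen.first_time(key)]
print(handled)
```

The memory is bounded, so a duplicate arriving after its key has been evicted would be handled again; the capacity would need to cover the expected delay between the two deliveries.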

Change #1017054 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] changeprop: Remove all MCS endpoints

https://gerrit.wikimedia.org/r/1017054

Sorry for the lag! Yes, I think it is doable without any problem, but IIRC the Search team relies on those streams to update ES indexes, so I'd involve them before proceeding, to validate that everything looks OK. I would also announce the move on Wikitech-l so other folks can take action if needed (I don't recall any other clients at the moment, but better safe than sorry). I can help with anything, lemme know :)

Hi @dcausse @EBernhardson, I just wanted to check with you whether it is acceptable to lose some events in the streams for eqiad.mediawiki_page_outlink_topic_prediction_change_v1 and eqiad.mediawiki_revision_score_drafttopic when we transition from changeprop to cp-jobqueue. If I recall correctly, Search uses these streams to update Elasticsearch. I checked the consumer groups on the dashboards (outlink, drafttopic) and cirrus-streaming-updater-producer-eqiad was there. :)

@achou Except for expert search users explicitly searching for topics (which I suspect are rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few events (<100) are lost, this would hardly impact anything. If you suspect that more might be lost, perhaps having duplicates is better, if that is an option for you.