⚓ T361483 Selectively disable changeprop functionality that is no longer used

	Subject	Repo	Branch	Lines +/-
	changeprop: Remove all MCS endpoints	operations/deployment-charts	master	+2 -69
	changeprop: Remove ORES functionality from chart	operations/deployment-charts	master	+1 -52

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTBase Sunset
Open	None	T361483 Selectively disable changeprop functionality that is no longer used

akosiaris created this task. Apr 1 2024, 4:26 PM

Restricted Application added a subscriber: Aklapper. Apr 1 2024, 4:26 PM

akosiaris added a parent task: T262315: <CORE TECHNOLOGY> API Migration & RESTBase Sunset. Apr 1 2024, 4:27 PM

Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for

summary_definition_rerender
rerendered_pcs_endpoints
purge_varnish
mw_purge_null_edit
page_create
page_delete
page_edit
page_move
page_restore
page_images_summary
page_images_mobile
revision_visibility_change
on_transclusion_update
on_backlinks_update
ores_cache
wikidata_description_on_edit
wikidata_description_on_undelete
liftwing_models
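As a rough illustration of working through a flag list like the one above, here is a minimal Python sketch. It assumes a flat name-to-boolean mapping, which is a simplification and not necessarily how the chart's values are laid out; the `retire` helper and the sample values are hypothetical.

```python
from typing import Dict, Iterable


def retire(flags: Dict[str, bool], unused: Iterable[str]) -> Dict[str, bool]:
    """Return a copy of `flags` with every name in `unused` switched off.

    Unknown names raise KeyError so a typo does not silently do nothing.
    """
    updated = dict(flags)
    for name in unused:
        if name not in updated:
            raise KeyError(f"unknown feature flag: {name}")
        updated[name] = False
    return updated


# Sample values are invented; only the flag names come from the list above.
flags = {"ores_cache": True, "purge_varnish": True, "liftwing_models": True}
after = retire(flags, ["ores_cache"])
still_enabled = sorted(name for name, on in after.items() if on)
```

Retiring one flag at a time and checking what is still enabled keeps each step small and easy to revert.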

Adding ORES and Lift-Wing as tags.

We'll need to figure out the best order to minimize repercussions and cause the least pain to users. On a happy note, it now appears far more plausible that, in the case of a big incident, we'll be able to stop changeprop temporarily.

Restricted Application added a project: Machine-Learning-Team. Apr 1 2024, 4:42 PM

Hello!

The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is no longer running in ChangeProp (lemme know otherwise).

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

In T361483#9679703, @elukey wrote:

Hello!

The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is no longer running in ChangeProp (lemme know otherwise).

You are correct. I'll post a patch then to remove it. Thanks!

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

This is probably something we want to move away from Changeprop and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think there is no code that is specific to Lift Wing, just standard reactions to events on Kafka.

In T361483#9680024, @akosiaris wrote:

You are correct. I'll post a patch then to remove it. Thanks!

This is probably something we want to move away from Changeprop and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think there is no code that is specific to Lift Wing, just standard reactions to events on Kafka.

No problem for me! I can only see one issue, and it is not specific to our topics: if we start another job in cp-jobqueue, the Kafka consumer offset will be reset to the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal, since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over, we'd need to sync with them and figure out whether a "hole" in the stream is acceptable. Otherwise, the only thing that I can think of is:

stop the changeprop rule for the lift wing topic that Search uses.
write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
add the rule to cp-jobqueue and check if it works.
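The hand-over steps above can be illustrated with a toy model of consumer-group offsets. This is a minimal sketch in plain Python, not the Kafka API: the partition numbers, offsets, and function names are invented for the example.

```python
from typing import Dict

# Hypothetical snapshot type: partition number -> offset.
Offsets = Dict[int, int]


def events_skipped(old_committed: Offsets, new_start: Offsets) -> int:
    """Count events the new consumer group never sees.

    An event is skipped when its offset is at or past the old group's
    committed offset (so the old group has not processed it) but below
    the offset where the new group starts reading.
    """
    return sum(max(0, new_start[p] - old_committed[p]) for p in old_committed)


def seed_from_old_group(old_committed: Offsets) -> Offsets:
    """Start the new group exactly where the stopped group left off."""
    return dict(old_committed)


# Invented numbers: the old group stopped here while the topic kept growing.
old_committed = {0: 1200, 1: 980}
log_end = {0: 1250, 1: 1000}

# Starting at the end of the topic leaves a hole of 50 + 20 events.
hole_if_latest = events_skipped(old_committed, log_end)

# Copying the committed offsets leaves no hole.
hole_if_seeded = events_skipped(old_committed, seed_from_old_group(old_committed))
```

The size of the hole depends only on how far the topic advances between stopping the old rule and starting the new one, which is why recording the offsets right after the stop matters.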

MSantos subscribed. Apr 2 2024, 3:50 PM

Change #1016391 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] changeprop: Remove ORES functionality from chart

https://gerrit.wikimedia.org/r/1016391

gerritbot added a project: Patch-For-Review. Apr 2 2024, 3:55 PM

Change #1016391 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Remove ORES functionality from chart

https://gerrit.wikimedia.org/r/1016391

akosiaris updated the task description. Apr 4 2024, 8:11 AM

Next up: mobile-sections. It has been deprecated per T328036 for a long time now. I'll remove the rules updating mobile-sections endpoints. That should be fine for external users; for many months now we have been returning 403 to almost everyone (exceptions are still in place for Kiwix and Wikiwand, T340036).

In T361483#9680093, @elukey wrote:

No problem for me! I can only see one issue, and it is not specific to our topics: if we start another job in cp-jobqueue, the Kafka consumer offset will be reset to the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal, since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over, we'd need to sync with them and figure out whether a "hole" in the stream is acceptable. Otherwise, the only thing that I can think of is:

stop the changeprop rule for the lift wing topic that Search uses.

write down the offset of the related consumer group using the kafka api (IIRC it should be possible)

create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).

add the rule to cp-jobqueue and check if it works.

Hmmm, can these endpoints receive the same request twice? I see that all changeprop does is a POST to https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic> with a body that contains event: '{{globals.message}}'

The rules apparently make it pretty easy to run them from both changeprop and jobqueue simultaneously. That way we could run both for a while (a couple of days?) and then shut down the changeprop side, leaving jobqueue to continue as normal.
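Running both installations in parallel means every event is delivered twice for a while. A minimal Python sketch of a receiver that tolerates this is shown below; the `id` field and the `deliver` helper are assumptions made for the example and do not describe how Lift Wing or its consumers actually handle duplicates.

```python
from typing import Dict, Iterable, List, Set


def deliver(events: Iterable[Dict[str, str]], seen: Set[str]) -> List[str]:
    """Process events once each, returning the ids that were handled.

    `seen` holds the ids already processed, so a second delivery of the
    same event is skipped instead of being handled again.
    """
    processed = []
    for event in events:
        event_id = event["id"]  # hypothetical unique id field
        if event_id in seen:
            continue
        seen.add(event_id)
        processed.append(event_id)
    return processed


# The same three events arrive once from each installation.
stream = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
seen: Set[str] = set()
from_changeprop = deliver(stream, seen)
from_jobqueue = deliver(stream, seen)
```

If the receiving side cannot deduplicate like this, the overlap period produces duplicate requests, so it is worth confirming that the endpoints and downstream consumers are fine with that.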

akosiaris mentioned this in T328036: MCS decommission (2023). Apr 4 2024, 1:08 PM

Change #1017054 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] changeprop: Remove all MCS endpoints

https://gerrit.wikimedia.org/r/1017054

hnowlan subscribed. Apr 4 2024, 1:38 PM

SLopes-WMF moved this task from Backlog to Tracking on the Content-Transform-Team board. Apr 4 2024, 2:10 PM

SLopes-WMF removed a project: Parsoid. Apr 4 2024, 2:19 PM

isarantopoulos moved this task from Unsorted to Watching on the Machine-Learning-Team board. Apr 9 2024, 2:41 PM

In T361483#9688445, @akosiaris wrote:

Hmmm, can these endpoints receive the same request twice? I see that all changeprop does is a POST to https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic> with a body that contains event: '{{globals.message}}'

The rules apparently make it pretty easy to run them from both changeprop and jobqueue simultaneously. That way we could run both for a while (a couple of days?) and then shut down the changeprop side, leaving jobqueue to continue as normal.

Sorry for the lag! Yes, I think it is doable without any problem, but IIRC the Search team relies on those streams to update ES indexes, so I'd involve them before proceeding to validate that everything looks OK. I would also announce the move on Wikitech-l so other folks can take action if needed (I don't recall any other clients at the moment, but better safe than sorry). I can help with anything, lemme know :)

Hi @dcausse @EBernhardson, I just wanted to check with you whether it is acceptable to lose some events in the streams eqiad.mediawiki_page_outlink_topic_prediction_change_v1 and eqiad.mediawiki_revision_score_drafttopic when we transition from changeprop to cp-jobqueue. If I recall correctly, Search uses these streams to update Elasticsearch. I checked the consumer groups on the dashboards (outlink, drafttopic) and cirrus-streaming-updater-producer-eqiad was there. :)

@achou Apart from expert search users explicitly searching for topics (which I suspect are rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few events (<100) are lost, this would hardly impact anything. If you suspect that more might be lost, perhaps having duplicates is better, if this is an option for you.
