Kubernetes CI Policy: define metrics/reports that allow us to track whether the situation is getting better #18785
Comments
"We are experiencing pain" - kubernetes/kubernetes#92937
|
"Theories as to why the pain is being experienced"
|
I would really like to be able to track rates of https://prow.k8s.io/?state=error. We currently do have some metrics that basically track what you'd find on these pages, but they are based on the currently existing prowjob CRs, so they're influenced by the GC logic rather than being a gauge of the actual rate. |
Right, this plank dashboard is close to what @BenTheElder wants, but
I tried hacking up a version that filters just on the "error" state. Compare this to the tide pool metric: they both grow at about the same time, but it takes a lot longer for the "error" jobs to fall off than it does for the tide pool size. |
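(A minimal sketch of the kind of per-state counting discussed above, written against Deck's JSON listing of ProwJobs. The prowjobs.js path and the items/status/state response shape are assumptions and should be checked against the deployed Deck; and as noted above, counting the CRs that still exist only reflects what GC has kept around, so a real metric would track state transitions instead.)

```go
// Sketch: count ProwJobs by state from Deck's JSON listing.
// ASSUMPTION: https://prow.k8s.io/prowjobs.js returns {"items": [{"status": {"state": ...}}, ...]};
// verify the real path and schema before relying on this.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type prowJobList struct {
	Items []struct {
		Status struct {
			State string `json:"state"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	resp, err := http.Get("https://prow.k8s.io/prowjobs.js")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list prowJobList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}

	// Tally jobs per state ("success", "failure", "error", "pending", ...).
	counts := map[string]int{}
	for _, pj := range list.Items {
		counts[pj.Status.State]++
	}
	fmt.Println(counts)
}
```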
nit: AIUI it's actually 48 hours OR the most recent result (periodic etc.) |
What data sources are available to us?
|
How often are people spamming /retest or /test for kubernetes/kubernetes in the last 90d?
https://k8s.devstats.cncf.io/d/5/bot-commands-repository-groups?orgId=1&var-period=d7&var-repogroup_name=Kubernetes&var-commands=%22%2Fretest%22&var-commands=%22%2Ftest%22&var-commands=%22%2Ftest%20all%22&from=now-90d&to=now
Thoughts:
- /test could be valid manual triggering, but could also be more representative of humans sitting on a PR being impatient
- this could just be a proxy for PR traffic; is there some way of tracking the amount per PR, or normalizing for open PRs?
|
There are also substantial flakes that are not related to infrastructure
health.
We need to avoid conflating all test flakes with infrastructure health.
…On Wed, Aug 12, 2020 at 5:46 PM Aaron Crickenberger < ***@***.***> wrote:
How often are people spamming /retest or /test for kubernetes/kubernetes
in the last 90d?
https://k8s.devstats.cncf.io/d/5/bot-commands-repository-groups?orgId=1&var-period=d7&var-repogroup_name=Kubernetes&var-commands=%22%2Fretest%22&var-commands=%22%2Ftest%22&var-commands=%22%2Ftest%20all%22&from=now-90d&to=now
[image: Screen Shot 2020-08-12 at 5 42 47 PM]
<https://user-images.githubusercontent.com/49258/90081921-5cb01b00-dcc3-11ea-9912-1592e716c627.png>
Thoughts:
- /test could be valid manual triggering, but could also be more
representative of humans sitting on a PR being impatient
- this could just be a proxy for PR traffic; is there some way of
tracking the amount per PR, or normalizing for open PRs?
|
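(A rough sketch of the normalization idea raised in the email above: divide a /retest comment count for some window by the number of open PRs. The open-PR count comes from GitHub's search API; the retestCount value is a hypothetical placeholder standing in for a number read off the devstats dashboard linked above.)

```go
// Sketch: normalize a /retest comment count by the number of open PRs.
// The open-PR count comes from GitHub's search API; retestCount is a
// HYPOTHETICAL placeholder standing in for a number exported from devstats.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func openPRCount(repo string) (int, error) {
	resp, err := http.Get("https://api.github.com/search/issues?q=repo:" + repo + "+is:pr+is:open")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var result struct {
		TotalCount int `json:"total_count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return 0, err
	}
	return result.TotalCount, nil
}

func main() {
	const retestCount = 1200 // placeholder: weekly /retest count read off devstats

	open, err := openPRCount("kubernetes/kubernetes")
	if err != nil {
		panic(err)
	}
	fmt.Printf("~%.2f /retest comments per open PR\n", float64(retestCount)/float64(open))
}
```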
The screenshot I posted in #18785 (comment) was from a prototype of #19007; once that merges, we'll be able to drill down and filter a little bit more in https://monitoring.prow.k8s.io/d/e1778910572e3552a935c2035ce80369/plank-dashboard?orgId=1 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Anecdotally, we could really use some better visibility into common cases of jobs failing to schedule (or just failing).
I suspect we are already occasionally hitting peaks of many PRs that cause the cluster to reach maximum capacity. |
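(To make "failing to schedule" concrete, a sketch that counts Pending pods in a build cluster using client-go. The "test-pods" namespace and the default kubeconfig path are assumptions and may not match the actual prow build clusters.)

```go
// Sketch: count Pending (possibly unschedulable) pods in a prow build cluster.
// ASSUMPTIONS: the default kubeconfig grants access to the build cluster, and
// build pods live in the "test-pods" namespace; adjust both as needed.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods stuck in Pending; a sustained spike suggests capacity problems.
	pods, err := client.CoreV1().Pods("test-pods").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d pods pending\n", len(pods.Items))
}
```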
/sig testing |
FWIW in the current state: https://prow.k8s.io/?state=error -- a quick sampling suggests these are largely due to nodepool exhaustion (transiently) |
/milestone v1.23
It seems like kubernetes/kubernetes#103512 was the biggest culprit, which came down to port exhaustion |
/kind feature |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@ameukam: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Part of #18551
We were experiencing a lot of obvious pain as humans when kubernetes/kubernetes#92937 was opened.
There are a number of theories as to why that pain was being experienced, and we're now acting based on some of those theories.
What we are lacking is: metrics / reports that let us track whether the situation is actually getting better.
This issue is intended to cover brainstorming, exploring and implementing metrics / reports that help guide us in the right direction.
Some suggestions / questions I'm pulling up from below:
- /retest spam?