
Kubernetes CI Policy: define metrics/reports that allow us to track whether the situation is getting better #18785

Open · spiffxp opened this issue on Aug 11, 2020 · 28 comments
Labels
area/deflake · area/jobs · area/metrics · area/prow · kind/feature · lifecycle/frozen · priority/important-longterm · sig/k8s-infra · sig/testing
Milestone: someday
Comments

@spiffxp (Member) commented Aug 11, 2020

Part of #18551

We were experiencing a lot of obvious pain as humans when kubernetes/kubernetes#92937 was opened.

There are a number of theories as to why that pain was being experienced, and we're now acting based on some of those theories.

What we are lacking is:

  • metrics that show that pain is being experienced
  • metrics that prove the theories as to why the pain was being experienced
  • metrics/reports that prove the action we are taking is having positive/negative impact on overall CI health

This issue is intended to cover brainstorming, exploring and implementing metrics / reports that help guide us in the right direction.

Some suggestions / questions I'm pulling up from the comments below:

  • can we / does it make sense to implement an alert for when nothing has merged into kubernetes/kubernetes for a while even though something should have (i.e. the tide pool is non-empty)?
  • can we / does it make sense to implement an alert for when the kubernetes/kubernetes non-release-branch tide pool grows above a certain threshold? what should that threshold be? (see the alerting sketch below)
  • can we identify which job runs were due to new commits vs. /retest spam?
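
To make the second question concrete, here's a minimal sketch of what such a check could look like, assuming the Prometheus instance behind monitoring.prow.k8s.io is queryable over its HTTP API and that tide exports a per-pool gauge. The metric name `pooledprs`, its labels, the URL path, and the threshold are all assumptions on my part, not confirmed against the deployment:

```python
# Rough sketch, not a working alert: poll Prometheus and flag an oversized
# kubernetes/kubernetes tide pool. The Prometheus URL, the `pooledprs`
# metric name/labels, and the threshold are assumptions to verify against
# what prow actually exports.
import requests

PROM_URL = "https://monitoring.prow.k8s.io/prometheus"  # assumed path
THRESHOLD = 50  # "pool is suspiciously large" cutoff, to be tuned
QUERY = 'sum(pooledprs{org="kubernetes", repo="kubernetes", branch!~"release-.*"})'

def query_scalar(promql: str) -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    size = query_scalar(QUERY)
    status = "ALERT" if size > THRESHOLD else "ok"
    print(f"{status}: non-release-branch tide pool size for k/k is {size:.0f}")
```

The "nothing has merged for a while" alert would additionally need a merge counter (or the GitHub API) to confirm the pool stayed non-empty without any merges over the window.
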
@spiffxp (Member, Author) commented Aug 11, 2020

"We are experiencing pain" - kubernetes/kubernetes#92937

  • PRs weren't merging into the kubernetes/kubernetes main branch for too long
    • can we have an alert for this?
    • we'd need to account for periods when there's nothing to merge (e.g. weekends, holidays)
  • one way we could measure/observe PRs not merging is by looking at the tide pool size for kubernetes/kubernetes master
  • PRs weren't merging because jobs were failing
    • these failures were occurring in tide's batch jobs
    • these failures were occurring on each PR in presubmit (i.e. trying to get "all green checkmarks" before tide would try merging)
    • do we have any flake / failure data over time for each of these?
    • can we say "this is the job that's been causing batches to fail the most"?
    • etc.
  • jobs were failing due to (flakes? scheduling errors? resource contention?)
    • jobs that fail due to a test failure (flake or legitimate) end up in the "failure" state
    • jobs that fail to schedule end up in the "error" state (see the sketch below for tracking the two separately)
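
To separate those last two buckets, something along these lines could work against the prow Prometheus, assuming plank exports a `prowjobs` gauge with `job_name` and `state` labels (the metric name, labels, and URL below are assumptions):

```python
# Rough sketch: count current k/k presubmit prowjobs by state so that
# "failure" (tests failed or flaked) and "error" (never scheduled/ran) can be
# tracked as separate series. Metric name/labels and the URL are assumptions.
import requests

PROM_URL = "https://monitoring.prow.k8s.io/prometheus"  # assumed path
QUERY = 'sum by (state) (prowjobs{job_name=~"pull-kubernetes-.*"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    state = series["metric"].get("state", "unknown")
    print(f'{state:>10}: {float(series["value"][1]):.0f}')
```

Graphed over time (or captured via a recording rule), the ratio of "error" to "failure" would answer the last bullet directly.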

@spiffxp (Member, Author) commented Aug 11, 2020

"Theories as to why the pain is being experienced"

  • we suspect we may have encountered resource contention
    • can we see what the capacity of our build cluster is? (see the capacity sketch below)
    • can we see when it goes over capacity?
  • we suspect resource contention may have happened due to a higher-than-usual volume of traffic
    • were we dealing with more PRs than we have historically?
    • were we using more resources per PR than we have historically?
    • are more resources being consumed by non-k/k jobs than there have been historically?
  • we suspect there was resource contention due to people spamming /retest
    • can we identify how many jobs a given PR triggered?
    • can we tell which job executions were due to a /retest vs. a push?
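
For the capacity questions, a starting point could look like the sketch below, assuming kube-state-metrics series for the build cluster land somewhere we can query. The metric names are standard kube-state-metrics names, but whether they are scraped for k8s-infra-prow-build, and the Prometheus URL, are assumptions:

```python
# Rough sketch: how close is the build cluster to CPU capacity?
# Compares CPU requested by pods against node allocatable CPU using standard
# kube-state-metrics series; whether those series are collected here, and the
# Prometheus URL, are assumptions.
import requests

PROM_URL = "https://monitoring.prow.k8s.io/prometheus"  # assumed path

def query_scalar(promql: str) -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

requested = query_scalar('sum(kube_pod_container_resource_requests{resource="cpu"})')
allocatable = query_scalar('sum(kube_node_status_allocatable{resource="cpu"})')
if allocatable:
    print(f"CPU requested: {requested:.0f} / {allocatable:.0f} cores "
          f"({100 * requested / allocatable:.0f}% committed)")
```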

@BenTheElder (Member):
I would really like to be able to track rates of https://prow.k8s.io/?state=error

currently we do have some metrics that basically track what you'd find on those pages, but they reflect the currently existing prowjob CRs, so they're influenced by the GC logic rather than tracking actual event rates.
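
As a stopgap, something like this could be run on a schedule to snapshot what that page shows, assuming Deck keeps serving the current ProwJob list as JSON at /prowjobs.js (the response shape assumed below is {"items": [...]}); it has the same GC caveat, it just makes the sampling explicit:

```python
# Rough sketch: snapshot the prowjobs Deck knows about and count them by
# state. This still only sees CRs sinker hasn't garbage-collected, so it's a
# point-in-time sample rather than a rate; the endpoint and response shape
# are assumptions.
from collections import Counter
import requests

resp = requests.get("https://prow.k8s.io/prowjobs.js", timeout=60)
resp.raise_for_status()
states = Counter(pj.get("status", {}).get("state", "unknown")
                 for pj in resp.json().get("items", []))
for state, count in states.most_common():
    print(f"{state:>10}: {count}")
```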

@spiffxp (Member, Author) commented Aug 11, 2020

Right, this plank dashboard is close to what @BenTheElder wants, but

  • it's not based on "events" but on the pool of ProwJob CRs created by plank, which stick around for 48h before being cleaned up by sinker
  • it's for all repos, not just kubernetes/kubernetes
  • you need to manually click on the "error" series to single it out

I tried hacking up a version that filters on just the "error" state:
[screenshot: Screen Shot 2020-08-11 at 1.19.36 PM]

Compare this to the tide pool metric:
[screenshot: Screen Shot 2020-08-11 at 1.26.01 PM]

They both grow at about the same time, but it takes a lot longer for the "error" jobs to fall off than it does for the tide pool size to shrink.

@BenTheElder (Member):
it's not based on "events" but on the pool of ProwJob CRs created by plank, which stick around for 48h before being cleaned up by sinker

nit: AIUI it's actually 48 hours OR the most recent result (for periodics, etc.)

@spiffxp (Member, Author) commented Aug 11, 2020

What data sources are available to us?

  • (google.com only) Cloud Monitoring of k8s-prow-builds
  • (google.com only) Cloud Logging of k8s-prow
  • Cloud Monitoring of k8s-infra-prow-build (available to [email protected])
  • https://monitoring.prow.k8s.io (specifically whatever metrics the prometheus instance currently collects)
    • the prometheus instance has only recently had its retention raised from 30d to 1y, so we don't have much in the way of pre-"pain" data
  • GCS (specifically all of the artifacts that end up in gs://kubernetes-jenkins)
    • not everything is going to land in here; for example, jobs that fail to schedule and instead hit the "error" state won't
    • walking GCS and scraping things can be pretty time intensive, so we have kettle do that for us and populate...
  • https://console.cloud.google.com/bigquery?project=k8s-gubernator (the k8s-gubernator:build.all bigquery dataset as populated by kettle; see the query sketch below)
    • there are things that end up in here that aren't run by prow.k8s.io
  • GitHub's API
    • we could scrape PRs and try to reconstruct what happened based on events in PRs
  • go.k8s.io/triage
    • this is based on data that ends up in the k8s-gubernator:build.all dataset, but perhaps the right set of regexes to include/exclude certain jobs or tests could give us a feel for how things are failing/flaking now vs. two weeks ago
  • devstats.cncf.io
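
As one concrete example of what the kettle-populated dataset could answer, a query roughly like this might chart per-job failure rates over time; the column names (`job`, `result`, `started` as epoch seconds) are my reading of the kettle schema and should be double-checked, and you'd need a GCP project you can bill queries to:

```python
# Rough sketch: weekly failure rate per k/k presubmit job from the
# kettle-populated BigQuery dataset. Column names/types are assumptions
# about the k8s-gubernator:build.all schema.
from google.cloud import bigquery

QUERY = """
SELECT
  job,
  TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(started), WEEK) AS week,
  COUNTIF(result != 'SUCCESS') / COUNT(*) AS failure_rate,
  COUNT(*) AS runs
FROM `k8s-gubernator.build.all`
WHERE job LIKE 'pull-kubernetes-%'
  AND started > UNIX_SECONDS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY))
GROUP BY job, week
HAVING runs > 50
ORDER BY failure_rate DESC
LIMIT 20
"""

client = bigquery.Client()  # uses application-default credentials
for row in client.query(QUERY).result():
    print(f"{row.week.date()}  {row.job:<50} {row.failure_rate:.1%} of {row.runs}")
```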

@spiffxp (Member, Author) commented Aug 13, 2020

How often have people been spamming /retest or /test on kubernetes/kubernetes in the last 90d?
https://k8s.devstats.cncf.io/d/5/bot-commands-repository-groups?orgId=1&var-period=d7&var-repogroup_name=Kubernetes&var-commands=%22%2Fretest%22&var-commands=%22%2Ftest%22&var-commands=%22%2Ftest%20all%22&from=now-90d&to=now
[screenshot: Screen Shot 2020-08-12 at 5.42.47 PM]

Thoughts:

  • /test could be valid manual triggering, but could also be more representative of impatient humans sitting on a PR
  • this could just be a proxy for PR traffic; is there some way of tracking the amount per PR, or normalizing by the number of open PRs? (see the sketch below)
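
One way to get per-PR counts rather than a repo-wide total would be something like the sketch below, using the public githubarchive tables in BigQuery; the table naming and payload layout are assumptions on my part, and the GitHub issue-comments API would work too, just more slowly:

```python
# Rough sketch: count /retest and /test comments per PR for one example day
# using the public githubarchive dataset in BigQuery. Table naming
# (githubarchive.day.YYYYMMDD) and the payload JSON layout are assumptions.
from google.cloud import bigquery

QUERY = r"""
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.issue.number') AS pr,
  COUNT(*) AS retests
FROM `githubarchive.day.20200812`
WHERE type = 'IssueCommentEvent'
  AND repo.name = 'kubernetes/kubernetes'
  AND REGEXP_CONTAINS(JSON_EXTRACT_SCALAR(payload, '$.comment.body'), r'^/(retest|test)\b')
GROUP BY pr
ORDER BY retests DESC
LIMIT 20
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"PR #{row.pr}: {row.retests} retest/test comments")
```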

@BenTheElder (Member) commented Aug 13, 2020 via email

@spiffxp (Member, Author) commented Aug 27, 2020

The screenshot I posted in #18785 (comment) was from a prototype of #19007; once that merges, we'll be able to drill down and filter a bit more in https://monitoring.prow.k8s.io/d/e1778910572e3552a935c2035ce80369/plank-dashboard?orgId=1

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 25, 2020
@ameukam (Member) commented Nov 25, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Nov 25, 2020
@spiffxp (Member, Author) commented Feb 5, 2021

Anecdotally, we could really use some better visibility into common cases of jobs failing to schedule (or just failing):

  • how often are pods evicted due to cluster maintenance / upgrades?
  • how often are pods unable to schedule because the build cluster is at maximum capacity?
  • (for very peaky situations: how long would we need to wait for peak load to smooth out?)

I suspect we are already occasionally hitting peaks of many PRs that push the cluster to maximum capacity; a rough sketch of how we might watch for this follows.
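
A sketch of the kind of signals I mean, assuming kube-state-metrics for the build clusters is scraped into a Prometheus we can reach (the URL, the `test-pods` namespace filter, and whether `kube_pod_status_reason` is enabled are all assumptions):

```python
# Rough sketch: two coarse signals for "why didn't my job run" -- pods stuck
# Pending (cluster at capacity) and pods terminated with reason Evicted
# (maintenance, upgrades, node pressure). Metric names are standard
# kube-state-metrics series; whether they're scraped for the build clusters
# is an assumption. If kube_pod_status_reason isn't available, Kubernetes
# events with reason=Evicted are an alternative source.
import requests

PROM_URL = "https://monitoring.prow.k8s.io/prometheus"  # assumed path

def query_scalar(promql: str) -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

pending = query_scalar('sum(kube_pod_status_phase{namespace="test-pods", phase="Pending"})')
evicted = query_scalar('sum(kube_pod_status_reason{reason="Evicted"})')
print(f"pods pending in test-pods: {pending:.0f}")
print(f"pods currently marked Evicted: {evicted:.0f}")
```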

spiffxp added the area/deflake, area/jobs, area/metrics, and area/prow labels on Feb 5, 2021
@spiffxp (Member, Author) commented Feb 5, 2021

/sig testing
/wg k8s-infra
/priority important-longterm

k8s-ci-robot added the sig/testing, wg/k8s-infra, and priority/important-longterm labels on Feb 5, 2021
@BenTheElder (Member):
FWIW, in the current state: https://prow.k8s.io/?state=error -- a quick sampling suggests these are largely due to (transient) nodepool exhaustion.
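
A slightly more systematic version of that sampling, assuming the same Deck /prowjobs.js endpoint as above and that error-state ProwJobs carry a human-readable status.description (both assumptions):

```python
# Rough sketch: group error-state prowjobs by their status description to see
# how many look like scheduling / node-pool exhaustion vs. other infra
# errors. Field names (.status.state, .status.description) are assumptions
# about the ProwJob shape as exposed by Deck.
from collections import Counter
import requests

resp = requests.get("https://prow.k8s.io/prowjobs.js", timeout=60)
resp.raise_for_status()
reasons = Counter(
    pj.get("status", {}).get("description", "(no description)")
    for pj in resp.json().get("items", [])
    if pj.get("status", {}).get("state") == "error"
)
for reason, count in reasons.most_common(10):
    print(f"{count:4d}  {reason}")
```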

k8s-ci-robot removed the lifecycle/rotten label on Jul 8, 2021
@spiffxp (Member, Author) commented Jul 13, 2021

/milestone v1.23
I feel like the fun we recently had during v1.22 Code Freeze demonstrates that we're still not capable of visualizing all resources that could potentially be exhausted.

It seems like kubernetes/kubernetes#103512 was the biggest culprit, which came down to port exhaustion.

k8s-ci-robot added this to the v1.23 milestone on Jul 13, 2021
k8s-ci-robot added the sig/k8s-infra label and removed the wg/k8s-infra label on Sep 29, 2021
@spiffxp (Member, Author) commented Oct 1, 2021

/kind feature

k8s-ci-robot added the kind/feature label on Oct 1, 2021
@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 30, 2021
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 29, 2022
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closing this issue.

In response to this:


/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ameukam (Member) commented Feb 28, 2022

/reopen
/lifecycle frozen

@k8s-ci-robot (Contributor):

@ameukam: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot reopened this on Feb 28, 2022
k8s-ci-robot removed the lifecycle/rotten label on Feb 28, 2022
k8s-ci-robot added the lifecycle/frozen label on Feb 28, 2022
BenTheElder modified the milestone from v1.23 to someday on Apr 19, 2022