Kubernetes CI Policy: define metrics/reports that allow us to track whether the situation is getting better #18785
Comments
"We are experiencing pain" - kubernetes/kubernetes#92937
|
"Theories as to why the pain is being experienced"
|
I would really like to be able to track rates of https://prow.k8s.io/?state=error. We currently do have some metrics that basically track what you'd find on these pages, but they are based on the currently existing prowjob CRs, so they're influenced by the GC logic rather than being a gauge of the actual rate. |
Right, this plank dashboard is close to what @BenTheElder wants, but
I tried hacking up a version that filters just on the "error" state. Compare this to the tide pool metric: they both grow at about the same time, but it takes a lot longer for the "error" jobs to fall off than it does for the tide pool size. |
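(A minimal sketch of the kind of per-state counting discussed above, written against Deck's JSON listing of ProwJobs. The prowjobs.js path and the items/status/state response shape are assumptions and should be checked against the deployed Deck; and as noted above, counting the CRs that still exist only reflects what GC has kept around, so a real metric would track state transitions instead.)

```go
// Sketch: count ProwJobs by state from Deck's JSON listing.
// ASSUMPTION: https://prow.k8s.io/prowjobs.js returns {"items": [{"status": {"state": ...}}, ...]};
// verify the real path and schema before relying on this.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type prowJobList struct {
	Items []struct {
		Status struct {
			State string `json:"state"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	resp, err := http.Get("https://prow.k8s.io/prowjobs.js")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list prowJobList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}

	// Tally jobs per state ("success", "failure", "error", "pending", ...).
	counts := map[string]int{}
	for _, pj := range list.Items {
		counts[pj.Status.State]++
	}
	fmt.Println(counts)
}
```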
nit: AIUI it's actually 48 hours OR the most recent result (periodic etc.) |
What data sources are available to us?
|
How often are people spamming /retest or /test for kubernetes/kubernetes in the last 90d?
https://k8s.devstats.cncf.io/d/5/bot-commands-repository-groups?orgId=1&var-period=d7&var-repogroup_name=Kubernetes&var-commands=%22%2Fretest%22&var-commands=%22%2Ftest%22&var-commands=%22%2Ftest%20all%22&from=now-90d&to=now
Thoughts:
- /test could be valid manual triggering, but could also be more representative of humans sitting on a PR being impatient
- this could just be a proxy for PR traffic; is there some way of tracking the amount per PR, or normalizing for open PRs?
|
There are also substantial flakes that are not related to infrastructure
health.
We need to avoid conflating all test flakes with infrastructure health.
…On Wed, Aug 12, 2020 at 5:46 PM Aaron Crickenberger < ***@***.***> wrote:
How often are people spamming /retest or /test for kubernetes/kubernetes
in the last 90d?
https://k8s.devstats.cncf.io/d/5/bot-commands-repository-groups?orgId=1&var-period=d7&var-repogroup_name=Kubernetes&var-commands=%22%2Fretest%22&var-commands=%22%2Ftest%22&var-commands=%22%2Ftest%20all%22&from=now-90d&to=now
[image: Screen Shot 2020-08-12 at 5 42 47 PM]
<https://user-images.githubusercontent.com/49258/90081921-5cb01b00-dcc3-11ea-9912-1592e716c627.png>
Thoughts:
- /test could be valid manual triggering, but could also be more
representative of humans sitting on a PR being impatient
- this could just be a proxy for PR traffic; is there some way of
tracking the amount per PR, or normalizing for open PRs?
|
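(A rough sketch of the normalization idea raised in the email above: divide a /retest comment count for some window by the number of open PRs. The open-PR count comes from GitHub's search API; the retestCount value is a hypothetical placeholder standing in for a number read off the devstats dashboard linked above.)

```go
// Sketch: normalize a /retest comment count by the number of open PRs.
// The open-PR count comes from GitHub's search API; retestCount is a
// HYPOTHETICAL placeholder standing in for a number exported from devstats.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func openPRCount(repo string) (int, error) {
	resp, err := http.Get("https://api.github.com/search/issues?q=repo:" + repo + "+is:pr+is:open")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var result struct {
		TotalCount int `json:"total_count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return 0, err
	}
	return result.TotalCount, nil
}

func main() {
	const retestCount = 1200 // placeholder: weekly /retest count read off devstats

	open, err := openPRCount("kubernetes/kubernetes")
	if err != nil {
		panic(err)
	}
	fmt.Printf("~%.2f /retest comments per open PR\n", float64(retestCount)/float64(open))
}
```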
The screenshot I posted in #18785 (comment) was from a prototype of #19007; once that merges, we'll be able to drill down and filter a little bit more in https://monitoring.prow.k8s.io/d/e1778910572e3552a935c2035ce80369/plank-dashboard?orgId=1 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Anecdotally, we could really use some better visibility into common cases of jobs failing to schedule (or just failing).
I suspect we are already occasionally hitting peaks of many PRs that cause the cluster to reach maximum capacity. |
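(To make "failing to schedule" concrete, a sketch that counts Pending pods in a build cluster using client-go. The "test-pods" namespace and the default kubeconfig path are assumptions and may not match the actual prow build clusters.)

```go
// Sketch: count Pending (possibly unschedulable) pods in a prow build cluster.
// ASSUMPTIONS: the default kubeconfig grants access to the build cluster, and
// build pods live in the "test-pods" namespace; adjust both as needed.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods stuck in Pending; a sustained spike suggests capacity problems.
	pods, err := client.CoreV1().Pods("test-pods").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d pods pending\n", len(pods.Items))
}
```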
/sig testing |
FWIW in the current state: https://prow.k8s.io/?state=error -- a quick sampling suggests these are largely due to nodepool exhaustion (transiently) |
/milestone v1.23
It seems like kubernetes/kubernetes#103512 was the biggest culprit, which came down to port exhaustion |
/kind feature |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@ameukam: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Part of #18551
We were experiencing a lot of obvious pain as humans when kubernetes/kubernetes#92937 was opened.
There are a number of theories as to why that pain was being experienced, and we're now acting based on some of those theories.
What we are lacking is: metrics / reports that let us track whether the situation is actually getting better.
This issue is intended to cover brainstorming, exploring and implementing metrics / reports that help guide us in the right direction.
Some suggestions / questions I'm pulling up from below:
- /retest spam?