This document helps you troubleshoot observability issues in Google Distributed Cloud. If you experience any of these issues, review the suggested fixes and workarounds.
If you need additional assistance, reach out to Cloud Customer Care.
Cloud Audit Logs aren't collected
Cloud Audit Logs are enabled by default unless the disableCloudAuditLogging flag is set under the clusterOperations section of the cluster config.
If Cloud Audit Logs are enabled, missing permissions are the most common reason that logs aren't collected. In this scenario, permission denied error messages appear in the logs of the Cloud Audit Logs proxy container.
The Cloud Audit Logs proxy container runs as a DaemonSet in all Google Distributed Cloud clusters. If you see permission errors, follow the steps to troubleshoot and resolve permission issues.
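As a rough way to confirm permission problems, you can count the permission denied lines in the proxy container's logs. The following sketch defines a small shell helper that reads log text on standard input. The helper name count_permission_errors is hypothetical, and the Pod name, namespace, and container name in the usage comment are placeholders that you need to look up in your own cluster.

```shell
# Hypothetical helper (not part of any Google tool): reads container logs on
# stdin and prints how many lines mention a permission denied error.
# Matches "permission denied", "permission_denied", and "PermissionDenied".
count_permission_errors() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep -Eci 'permission[ _]?denied' || true
}

# Example usage (POD_NAME, NAMESPACE, and CONTAINER_NAME are placeholders):
#   kubectl logs POD_NAME --namespace NAMESPACE --container CONTAINER_NAME \
#     | count_permission_errors
```

A count greater than zero suggests that the permission troubleshooting steps are the right place to start.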
kube-state-metrics metrics aren't collected
kube-state-metrics (KSM) runs as a single replica Deployment in the cluster and generates metrics on almost all resources in the cluster. When KSM and the gke-metrics-agent run on the same node, there's a greater risk of outage among metrics agents on all nodes.
KSM metrics have names that follow the pattern kube_<ResourceKind>, like kube_pod_container_info. Metrics that start with kube_onpremusercluster_ are from the on-premises cluster controller, not from KSM.
If KSM metrics are missing, review the following troubleshooting steps:
- In Cloud Monitoring, check the CPU, memory, and restart count of KSM using the summary API metrics like kubernetes.io/anthos/container/... . This pipeline is separate from KSM. Confirm that the KSM Pod isn't limited by insufficient resources.
  - If these summary API metrics aren't available for KSM, gke-metrics-agent on the same node probably has the same issue.
- In the cluster, check the status and logs of the KSM Pod and the gke-metrics-agent Pod on the same node as KSM.
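To find the gke-metrics-agent Pod that shares a node with KSM, one option is to list Pod names with their nodes and filter the result. The following sketch assumes that both components run in the kube-system namespace and that their Pod names start with kube-state-metrics and gke-metrics-agent. Verify both assumptions in your cluster. The helper name pods_on_ksm_node is hypothetical.

```shell
# Hypothetical helper: reads "POD_NAME NODE_NAME" pairs on stdin and prints
# the KSM Pod plus any gke-metrics-agent Pod scheduled on the same node.
pods_on_ksm_node() {
  awk '
    { name[NR] = $1; node[NR] = $2 }
    $1 ~ /^kube-state-metrics/ { ksm_node = $2 }
    END {
      if (ksm_node == "") exit
      for (i = 1; i <= NR; i++)
        if (node[i] == ksm_node && name[i] ~ /^(kube-state-metrics|gke-metrics-agent)/)
          print name[i]
    }'
}

# Example usage (the kube-system namespace is an assumption):
#   kubectl get pods --namespace kube-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | pods_on_ksm_node
```

You can then pass each printed Pod name to kubectl describe pod and kubectl logs to review its status and logs.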
kube-state-metrics crash looping
Symptom
No metrics from kube-state-metrics (KSM) are available from Cloud Monitoring.
Cause
This scenario is more likely to occur in large clusters, or clusters with large amounts of resources. KSM runs as a single replica Deployment and lists almost all resources in the cluster like Pods, Deployments, DaemonSets, ConfigMaps, Secrets, and PersistentVolumes. Metrics are generated on each of these resource objects. If any of the resources has many objects, like a cluster with over 10,000 Pods, KSM potentially runs out of memory.
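To judge whether a cluster is in this range, you can count the objects of each resource kind that KSM lists. The following is a minimal sketch. The helper name count_objects and the KUBECTL override variable are hypothetical conveniences, not part of kubectl.

```shell
# Hypothetical helper: prints the number of objects of one resource kind
# across all namespaces. Set KUBECTL to substitute another command (for
# example, a wrapper that adds --kubeconfig).
count_objects() {
  # stderr is discarded so that "No resources found" doesn't clutter output.
  # grep -c . counts non-empty lines; "|| true" covers a count of 0.
  ${KUBECTL:-kubectl} get "$1" --all-namespaces --no-headers 2>/dev/null \
    | grep -c . || true
}

# Example usage for the resource kinds named above:
#   for kind in pods deployments daemonsets configmaps secrets persistentvolumes; do
#     printf '%s: %s\n' "$kind" "$(count_objects "$kind")"
#   done
```

Counts in the thousands for any single kind are a hint that KSM memory usage deserves a closer look.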
Affected versions
This issue could be experienced in any version of Google Distributed Cloud.
The default CPU and memory limits have been increased in the last few Google Distributed Cloud versions, so these resource issues should be less common.
Fix and workaround
To check whether out of memory problems are the cause, review the following steps:
- Use kubectl describe pod or kubectl get pod -o yaml and check the error status message.
- Check the memory consumption and utilization metric for KSM and confirm if it's reaching the limit before getting restarted.
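The first check can be scripted. When a container is terminated for exceeding its memory limit, Kubernetes records OOMKilled as the termination reason in the Pod status. The following sketch reads those reasons and reports whether any of them is OOMKilled. The helper name was_oom_killed is hypothetical, and POD_NAME and the kube-system namespace in the usage comment are assumptions to verify in your cluster.

```shell
# Hypothetical helper: reads container termination reasons on stdin and
# succeeds (exit status 0) if any of them is OOMKilled.
was_oom_killed() {
  grep -q 'OOMKilled'
}

# Example usage (POD_NAME is a placeholder; the namespace is an assumption):
#   kubectl get pod POD_NAME --namespace kube-system \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}' \
#     | was_oom_killed && echo "a container was last terminated by OOMKilled"
```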
If you confirm that out of memory problems are the issue, use either one of the following solutions:
- Increase the memory request and limit for KSM.
  - For Google Distributed Cloud versions 1.16.0 or later, Google Cloud Observability manages KSM. To update KSM, see Overriding the default CPU and memory requests and limits for a Stackdriver component.
  - For versions earlier than 1.16.0, to adjust the CPU and memory of KSM, use the Stackdriver custom resource's resourceOverride for kube-state-metrics.
- Reduce the number of metrics from KSM.
  - For Google Distributed Cloud 1.13, KSM exposes only a smaller set of metrics, called Core Metrics, by default. This behavior means that resource usage is lower than in previous versions, but the same procedure can be followed to further reduce the number of KSM metrics.
  - For Google Distributed Cloud versions earlier than 1.13, KSM uses the default flags. This configuration exposes a large number of metrics.
gke-metrics-agent crash looping
If gke-metrics-agent only experiences out of memory issues on the node where kube-state-metrics exists, the cause is a large number of kube-state-metrics metrics. To mitigate this issue, scale down stackdriver-operator and modify KSM to expose a small set of needed metrics as detailed in the previous section. Remember to scale back up stackdriver-operator after the cluster is upgraded to Google Distributed Cloud 1.13 where KSM by default exposes a smaller number of Core Metrics.
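Scaling the operator down and back up can be sketched as a single helper. This example assumes that stackdriver-operator is a Deployment in the kube-system namespace. Confirm the name and namespace in your cluster before you run it. The helper name and the KUBECTL override variable are hypothetical.

```shell
# Hypothetical helper: scales the stackdriver-operator Deployment to the
# given replica count. Set KUBECTL to substitute another command, such as
# "echo" for a dry run that only prints the arguments.
scale_stackdriver_operator() {
  ${KUBECTL:-kubectl} scale deployment stackdriver-operator \
    --namespace kube-system --replicas="$1"
}

# Example usage:
#   scale_stackdriver_operator 0   # before you modify KSM
#   scale_stackdriver_operator 1   # after the cluster upgrade
```

While the operator is scaled down, it doesn't reconcile manual changes, so scaling it back up is what restores the managed configuration.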
You can adjust CPU and memory for all gke-metrics-agent Pods by adding the resourceAttrOverride field to the Stackdriver custom resource.
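One way to add the field is a merge patch against the Stackdriver custom resource. The following sketch only builds the patch body for a memory override. It assumes that resourceAttrOverride sits under spec and is keyed by a POD_NAME/CONTAINER_NAME string, and that the custom resource is named stackdriver in the kube-system namespace. Treat the key format, the resource name, and the 1Gi value as assumptions and check them against the documentation for your version. The helper name build_override_patch is hypothetical.

```shell
# Hypothetical helper: prints a JSON merge patch that sets the memory
# request and limit for one component under spec.resourceAttrOverride.
#   $1: override key (assumed format: POD_NAME/CONTAINER_NAME)
#   $2: memory quantity, used for both the request and the limit
build_override_patch() {
  printf '{"spec":{"resourceAttrOverride":{"%s":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}}}\n' \
    "$1" "$2" "$2"
}

# Example usage (key, value, and resource name are assumptions):
#   kubectl patch stackdriver stackdriver --namespace kube-system --type merge \
#     -p "$(build_override_patch gke-metrics-agent/gke-metrics-agent 1Gi)"
```

The sketch sets the request equal to the limit for simplicity. Use separate values, or add CPU entries, if your workload needs them.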
stackdriver-metadata-agent crash looping
Symptom
No system metadata label is available when filtering metrics in Cloud Monitoring.
Cause
The most common cause of stackdriver-metadata-agent crash looping is out of memory events. This behavior is similar to kube-state-metrics. Although stackdriver-metadata-agent isn't listing all resources, it still lists all objects for the relevant resource types like Pods, Deployments, and NetworkPolicy. The agent runs as a single replica Deployment, which increases the risk of out of memory events if the number of objects is too great.
Affected versions
This issue could be experienced in any version of Google Distributed Cloud.
The default CPU and memory limits have been increased in the last few Google Distributed Cloud versions, so these resource issues should be less common.
Fix and workaround
To check whether out of memory problems are the cause, review the following steps:
- Use kubectl describe pod or kubectl get pod -o yaml and check the error status message.
- Check the memory consumption and utilization metric for stackdriver-metadata-agent and confirm if it's reaching the limit before getting restarted.
If you confirm that out of memory problems are the issue, increase the memory limit in the resourceAttrOverride field of the Stackdriver custom resource.
metrics-server crash looping
Symptom
Horizontal Pod Autoscaler and kubectl top don't work in your cluster.
Cause and affected versions
This issue isn't very common, but is caused by out of memory errors in large clusters or in clusters with high Pod density.
This issue could be experienced in any version of Google Distributed Cloud.
Fix and workaround
Increase metrics server resource limits.
In Google Distributed Cloud version 1.13 and later, the namespace of metrics-server and its config has been moved from kube-system to gke-managed-metrics-server.
Scale down metrics-server-operator and manually change the metrics-server Pod.
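Because the namespace depends on the cluster version, scripts that inspect metrics-server need to select it first. The following sketch maps a version string to the namespace described in this section. The helper name metrics_server_namespace is hypothetical, and it expects a version in MAJOR.MINOR or MAJOR.MINOR.PATCH form.

```shell
# Hypothetical helper: prints the metrics-server namespace for a given
# Google Distributed Cloud version, based on the 1.13 namespace move.
metrics_server_namespace() {
  major="${1%%.*}"       # text before the first dot
  rest="${1#*.}"         # text after the first dot
  minor="${rest%%.*}"    # minor version number
  if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 13 ]; }; then
    echo gke-managed-metrics-server
  else
    echo kube-system
  fi
}

# Example usage:
#   kubectl get pods --namespace "$(metrics_server_namespace 1.16.2)"
```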
What's next
If you need additional assistance, reach out to Cloud Customer Care.