Troubleshoot Prometheus Issues

Kubernetes cluster monitors are in a DOWN state

When viewing targets in the Prometheus console, you may see that some Kubernetes cluster monitors are down (for example, kube-etcd and kube-proxy). This is most likely caused by the configuration of the Kubernetes cluster itself; depending on the type of cluster, certain metrics may be disabled by default. Enabling these metrics is cluster dependent; for details, refer to the documentation for your cluster type.
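To view the targets, you can open the Prometheus console by port-forwarding to the Prometheus pod. The following is a minimal sketch; it assumes the StatefulSet name that appears later in this guide and the default Prometheus port, 9090.

$ kubectl port-forward statefulset/prometheus-prometheus-operator-kube-p-prometheus 9090:9090 -n verrazzano-monitoring

Then browse to http://localhost:9090/targets to see the target status.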

For example, to enable kube-proxy metrics on kind clusters, edit the kube-proxy ConfigMap.

$ kubectl edit cm/kube-proxy -n kube-system

Replace the metricsBindAddress value with the following and save the ConfigMap.

metricsBindAddress: 0.0.0.0:10249
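For reference, the setting is located under the config.conf key of the kube-proxy ConfigMap. The following excerpt is illustrative only; other fields in the configuration are omitted and should be left unchanged.

data:
  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # ...other fields unchanged
    metricsBindAddress: 0.0.0.0:10249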

Then, restart the kube-proxy pods.

$ kubectl delete pod -l k8s-app=kube-proxy -n kube-system
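To confirm that the restart completed, check that the kube-proxy pods are running again. This uses the same k8s-app=kube-proxy label as the previous command.

$ kubectl get pods -l k8s-app=kube-proxy -n kube-system

After the pods are ready, the kube-proxy targets in the Prometheus console should transition to UP within a few scrape intervals.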

For more information, see this GitHub issue.

Metrics Trait Service Monitor not discovered

Metrics Traits use Service Monitors, which require a Service to collect metrics. If your OAM workload is created with a Metrics Trait but no Ingress Trait, a Service might not be generated for the workload; in that case, you must create one manually.

This troubleshooting example uses the hello-helidon application.

Verify a Service Monitor exists for your application workload.

$ kubectl get servicemonitors -n hello-helidon

Verify a Service exists for your application workload.

$ kubectl get services -n hello-helidon
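If a Service already exists but the target still is not discovered, compare the labels on the Service with the selector in the Service Monitor; with the Prometheus Operator, a Service Monitor discovers only Services whose labels match its spec.selector. The following commands are a sketch of that comparison.

$ kubectl get servicemonitors -n hello-helidon -o yaml
$ kubectl get services -n hello-helidon --show-labels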

If no Service exists, create one manually. The following example Service targets the hello-helidon application on port 8080.

apiVersion: v1
kind: Service
metadata:
  name: hello-helidon-service
  namespace: hello-helidon
spec:
  selector:
    app: hello-helidon
  ports:
    - name: tcp-hello-helidon
      port: 8080
      protocol: TCP
      targetPort: 8080
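Save the manifest to a file and apply it to the cluster. The file name here is only illustrative.

$ kubectl apply -f hello-helidon-service.yaml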

After you’ve completed these steps, verify that metrics collection is succeeding.
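One way to check, assuming the port-forward shown earlier and that the generated scrape configuration attaches a namespace label to the targets, is to query the up metric for the application namespace; a value of 1 indicates the target is being scraped successfully.

$ curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{namespace="hello-helidon"}'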

Metrics queries no longer return metrics

If Prometheus storage reaches capacity, then metrics queries will no longer return results. Check the Prometheus logs.

$ kubectl logs -l app.kubernetes.io/instance=prometheus-operator-kube-p-prometheus -n verrazzano-monitoring

If the logs indicate that the disk is full, then you must either expand the storage or free disk space. If the default storage class supports volume expansion, you can expand the volume.

Check if the default storage class allows volume expansion.

$ kubectl get storageclass
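The ALLOWVOLUMEEXPANSION column in the output shows whether expansion is supported. You can also read the field directly from a specific storage class; the storage class name below is only an example, so substitute the default one from your cluster.

$ kubectl get storageclass standard -o jsonpath='{.allowVolumeExpansion}'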

If the default storage class allows expansion, then update both the persistent volume claim and the storage request in the Prometheus resource to the larger size.

For example, to increase the storage to 100Gi:

$ kubectl patch pvc prometheus-prometheus-operator-kube-p-prometheus-db-prometheus-prometheus-operator-kube-p-prometheus-0 -n verrazzano-monitoring \
   --type=merge -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

$ kubectl patch prometheus prometheus-operator-kube-p-prometheus -n verrazzano-monitoring \
   --type=merge -p '{"spec":{"storage":{"volumeClaimTemplate":{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}}}}'
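After patching, you can confirm that the claim reflects the larger size and watch its events for the expansion to complete.

$ kubectl get pvc -n verrazzano-monitoring

$ kubectl describe pvc prometheus-prometheus-operator-kube-p-prometheus-db-prometheus-prometheus-operator-kube-p-prometheus-0 -n verrazzano-monitoring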

Alternatively, free disk space by deleting the Prometheus write-ahead log (WAL) data in the Prometheus pods. Note that this discards recently collected metrics that have not yet been compacted into storage blocks.

$ kubectl exec statefulset.apps/prometheus-prometheus-operator-kube-p-prometheus -n verrazzano-monitoring -- rm -fr /prometheus/wal

$ kubectl rollout restart statefulset.apps/prometheus-prometheus-operator-kube-p-prometheus -n verrazzano-monitoring
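To confirm that the restart completed and the pod is ready again, you can watch the rollout.

$ kubectl rollout status statefulset.apps/prometheus-prometheus-operator-kube-p-prometheus -n verrazzano-monitoring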

For information on how to configure Prometheus data retention settings to avoid filling up persistent storage in the Prometheus pods, see Configure data retention settings.