Monitoring OpenShift pod restarts with Prometheus/AlertManager and kube-state-metrics

Prometheus is emerging as the go-to solution for monitoring OpenShift. This article does not cover how to set up Prometheus on OpenShift, because other articles already do. You can check this git repository on how to install Prometheus on OpenShift with Grafana dashboards and Alert Manager enabled.

When installed on OpenShift, Prometheus can run as a single pod that pulls ("scrapes", in Prometheus terminology) metrics from different providers ("exporters", in Prometheus wording). In this git repository, node-exporter is set up as a provider so that Prometheus can collect node metrics, with alerts and Grafana dashboards to monitor them. It also comes with some basic alerts that check each node's filesystem and CPU usage.

When you run OpenShift, it is very valuable to monitor pod restarts, because frequent restarts are often a sign of a malfunction. To do so, we deploy another exporter that exposes a convenient set of metrics from the Kubernetes API. Fortunately, there is a Kubernetes project named kube-state-metrics that exposes these metrics.

kube-state-metrics needs to be deployed as a DeploymentConfig and exposed as a service. Then annotate this service so that it can be scraped by Prometheus:

oc create -f - << EOF
apiVersion: v1
kind: DeploymentConfig
metadata:
  namespace: monitoring
  name: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v0.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
EOF
oc expose dc kube-state-metrics --port=8080 -n monitoring
oc annotate svc kube-state-metrics prometheus.io/scrape='true' -n monitoring
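
Before defining alerts, it can help to confirm that the exporter really serves the restart counter. The snippet below is only a sketch: restart_samples is a hypothetical helper name, and the URL in the comment assumes the usual in-cluster DNS name for a service called kube-state-metrics in the monitoring namespace, so adjust it to your setup.

```shell
# Hypothetical helper: reads Prometheus exposition text on stdin and keeps
# only the restart-counter samples, dropping "# HELP" / "# TYPE" comment
# lines and every other metric.
restart_samples() {
  grep '^kube_pod_container_status_restarts{'
}

# Assumed usage from a pod inside the cluster (the service URL is an assumption):
#   curl -s http://kube-state-metrics.monitoring.svc:8080/metrics | restart_samples
```

Each line that comes back is one container, labelled with its namespace and pod, followed by its restart count so far.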

Then you can define the following alert, and you will be notified whenever a pod restarts more than once within 5 minutes:

pod-restart.rules: |
    ALERT PodRestartingTooMuch
      IF rate(kube_pod_container_status_restarts[5m]) > 1/(5*60)
      FOR 1m
      LABELS {
        severity="warning"
      }
      ANNOTATIONS {
        SUMMARY = "Pod {{$labels.namespace}}/{{$labels.pod}} restarting too much.",
        DESCRIPTION = "Pod {{$labels.namespace}}/{{$labels.pod}} restarting too much."
      }
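
To see what the 1/(5*60) threshold means: rate() yields a per-second average, so the threshold is one restart per 300 seconds, roughly 0.0033 restarts per second. The sketch below redoes that comparison with plain arithmetic. exceeds_threshold is a hypothetical helper, not part of Prometheus, and it ignores the extrapolation that rate() applies to real samples.

```shell
# exceeds_threshold RESTARTS WINDOW_SECONDS
# Exits 0 when RESTARTS spread over WINDOW_SECONDS is a strictly higher
# per-second rate than one restart per five minutes, and 1 otherwise.
exceeds_threshold() {
  awk -v n="$1" -v w="$2" 'BEGIN { exit !((n / w) > (1 / (5 * 60))) }'
}

# Two restarts in 300 seconds is above the threshold; one is not.
if exceeds_threshold 2 300; then echo "2 restarts / 300s: alert condition met"; fi
if ! exceeds_threshold 1 300; then echo "1 restart / 300s: below threshold"; fi
```

The same count measured over a shorter window gives a higher rate, so the range selector in the rule matters as much as the threshold itself.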
