Monitoring OpenShift pod restarts with Prometheus/AlertManager and kube-state-metrics

Prometheus is emerging as the solution of choice for monitoring OpenShift. This article does not cover how to set up Prometheus on OpenShift, because other articles already exist on this topic. You can check this git repository to see how to install Prometheus on OpenShift with Grafana dashboards and Alert Manager enabled.

When installed on OpenShift, Prometheus can run as a single pod, and it will pull (or scrape, in Prometheus terminology) metrics from different providers (or exporters, in Prometheus wording). In this git repository, we set up node-exporter as a provider for Prometheus to get metrics on nodes, with alerts and Grafana dashboards to monitor them. It also comes with some basic alerts that check a node's filesystem or CPU usage.

When you run OpenShift, it is very valuable to monitor your pods' restarts, because frequent restarts are often a sign of a malfunction. To do so, we deploy another exporter that exposes a convenient set of metrics from the Kubernetes API. Fortunately, there is a Kubernetes project named kube-state-metrics which exposes these metrics.

kube-state-metrics needs to be deployed as a DeploymentConfig and exposed as a service. Then, annotate this service so that it can be scraped by Prometheus:

oc create -f - << EOF
apiVersion: v1
kind: DeploymentConfig
metadata:
  namespace: monitoring
  name: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v0.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
EOF
oc expose dc kube-state-metrics --port=8080 -n monitoring
oc annotate svc kube-state-metrics prometheus.io/scrape='true' -n monitoring
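
Before writing alerts, it can be useful to check that the exporter really serves the restart metric. The snippet below is only a sketch: metric_restarts is a hypothetical helper name, and it assumes the plain-text Prometheus exposition format, with the sample value as the last field of each line.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the kube_pod_container_status_restarts samples whose value is > 0.
# Assumes the sample value is the last whitespace-separated field.
metric_restarts() {
  awk '/^kube_pod_container_status_restarts[{ ]/ && $NF > 0'
}
```

From a pod inside the cluster, and assuming the service is reachable as kube-state-metrics.monitoring.svc on port 8080 and that curl is available, it could be used like this: curl -s http://kube-state-metrics.monitoring.svc:8080/metrics | metric_restarts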

Then, you can define the following alert, and you will be notified whenever a container's restart counter grows faster than one restart per 5 minutes (that is, 1/(5*60) restarts per second) for at least 1 minute:

pod-restart.rules: |
    ALERT PodRestartingTooMuch
      IF  rate(kube_pod_container_status_restarts[1m]) > 1/(5*60)
      FOR 1m
      LABELS {
        severity="warning"
      }
      ANNOTATIONS {
        SUMMARY = "Pod {{$labels.namespace}}/{{$labels.pod}} restarting too much.",
        DESCRIPTION = "Pod {{$labels.namespace}}/{{$labels.pod}} restarting too much."
      }
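
To make the threshold in the IF clause more concrete: rate() returns a per-second value, and 1/(5*60) is one restart per 300 seconds. The sketch below reproduces only that comparison with plain arithmetic. restart_rate_check is a hypothetical helper, not part of Prometheus, and the real rate() function also extrapolates over the window and handles counter resets, which this ignores.

```shell
# Hypothetical helper: approximates the comparison made by the alert expression.
# $1 = number of restarts observed in the window
# $2 = window length in seconds (must not be 0)
# Prints "firing" when restarts/window is above 1/(5*60), "ok" otherwise.
restart_rate_check() {
  awk -v restarts="$1" -v window="$2" 'BEGIN {
    rate = restarts / window       # per-second rate, the unit rate() uses
    threshold = 1 / (5 * 60)       # one restart per 5 minutes
    if (rate > threshold) print "firing"; else print "ok"
  }'
}
```

For example, restart_rate_check 1 60 prints "firing" (one restart in a 1-minute window is well above the threshold), while restart_rate_check 0 60 prints "ok".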

OpenShift cheat sheet for beginners

Here is a simple cheat sheet for OpenShift beginners that will help you inspect some basic settings of your projects, applications, and pods in order to debug them or get information about how they behave.

Listing all your projects

oc get projects

This will give you the list of all the projects that you can work on and highlight the current project.

Setting the current project

oc project my-project

This will switch your current project to my-project. This setting is saved in your ~/.kube/config file, so if several people are using oc simultaneously with the same user, take care not to override each other's selection.
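
One possible way to avoid stepping on each other is to give each shell session its own copy of the configuration file through the KUBECONFIG environment variable, which oc honors. This is only a sketch: use_private_kubeconfig is a hypothetical helper name, and it assumes KUBECONFIG, if already set, points to a single file rather than a list of files.

```shell
# Hypothetical helper: give this shell session its own copy of the kubeconfig,
# so that `oc project` here does not change the project seen by other sessions.
use_private_kubeconfig() {
  src="${KUBECONFIG:-$HOME/.kube/config}"
  private="$(mktemp)" || return 1
  # Copy the existing settings and credentials if the source file exists;
  # otherwise start from an empty file.
  if [ -f "$src" ]; then
    cp "$src" "$private"
  fi
  chmod 600 "$private"
  export KUBECONFIG="$private"
}
```

After calling use_private_kubeconfig, running oc project my-project only affects the current session. The temporary file is not removed automatically.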

Listing the existing pods (applications)

oc get pods

This will list all the pods (a pod is a wrapper for containers, even if generally 1 pod = 1 container) and show you the status of each of them.
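
Since the listing includes a RESTARTS column, it can also be filtered to spot pods that restart a lot. The following is a minimal sketch, assuming the default column layout (NAME, READY, STATUS, RESTARTS, AGE); pods_with_restarts is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get pods` output on stdin and prints
# "<pod> <restarts>" for every pod whose RESTARTS column is at least $1
# (default 1). Assumes RESTARTS is the 4th column and skips the header line.
pods_with_restarts() {
  awk -v min="${1:-1}" 'NR > 1 && ($4 + 0) >= min { print $1, $4 }'
}
```

It could be used like this: oc get pods | pods_with_restarts 3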

Checking the status of a pod

oc describe pod <pod_id>

This will display information about the pod lifecycle: the node on which it has been scheduled, the status of the Docker image on the node (image existing, pulling, or failed to be pulled), the readiness and liveness status, and whether the pod is started or stopped.

Watching pod logs

oc logs -f <pod_id>

The -f option is for follow, just like for the tail command. This will display the logs sent to stdout and stderr by the container.
If the container has crashed or has stopped, you will not be able to see its logs unless you specify the -p (for --previous) option, which shows the logs of the previous container instance.

oc logs -p <pod_id>

Watching events on a project

oc get events -w

This will show you all the OpenShift events occurring in the current project and keep watching them (-w for --watch). The events include pod scheduling, pod startup, etc.

Hope that this will help every beginner.