Exporter Review: RabbitMQ

In this edition of our exporter review series, we introduce the RabbitMQ exporter, one of the best-fit exporters for monitoring metrics used by NexClipper. Read on to find out the exporter’s most important metrics, recommended alert rules, and the related Grafana dashboard and Helm chart.

About RabbitMQ

RabbitMQ is a widely adopted open source message broker. A message broker is software that enables applications, systems, and services to communicate with each other and exchange information.

RabbitMQ is lightweight, easy to deploy on premises and in the cloud, and able to handle millions of users and transactions. It can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements. It supports multiple messaging protocols, including AMQP 0-9-1, AMQP 1.0, MQTT, and STOMP.

Since it is a mission-critical piece of software that binds applications together, monitoring is a must. A RabbitMQ exporter is required to monitor and expose the RabbitMQ metrics. It queries RabbitMQ, scrapes the data, and exposes the metrics on a Kubernetes service endpoint that Prometheus can then scrape to ingest the time series data. To monitor RabbitMQ we use an external Prometheus exporter, which is maintained by the Prometheus Community. Once deployed, this exporter scrapes a sizable set of metrics from RabbitMQ and helps users get crucial information about the message broker that is difficult to get from RabbitMQ directly.

For this setup, we are using the Bitnami RabbitMQ Helm chart to start the cluster.

RabbitMQ has a built-in Prometheus plugin as well as an official Prometheus exporter – below we explain the setup of both.

RabbitMQ with Prometheus Exporter

How do you set up an exporter for Prometheus?

With the latest version of Prometheus (2.33 as of February 2022), there are three ways to set up a Prometheus exporter: 

Method 1 – Native

This method has been supported by Prometheus since the beginning.
To set up an exporter the native way, the Prometheus configuration needs to be updated to add the target.
A sample configuration:

   # scrape_config job
   - job_name: rabbitmq-staging
     scrape_interval: 45s
     scrape_timeout: 30s
     metrics_path: "/metrics"
     static_configs:
     - targets:
       - <RabbitMQ endpoint>

Method 2 – Service Discovery

This method is applicable to Kubernetes deployments only.
A default scrape config can be added to the prometheus.yaml file, and annotations can be added to the exporter service. Prometheus will then automatically start scraping the data from the annotated services at the specified path.

prometheus.yaml:

     - job_name: kubernetes-services
       scrape_interval: 15s
       scrape_timeout: 10s
       kubernetes_sd_configs:
       - role: service
       relabel_configs:
       # Example relabel to scrape only services that have the
       # prometheus.io/scrape: "true" annotation.
       - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
         action: keep
         regex: true
       # Example relabel to set the metrics path from the
       # prometheus.io/path: "/scrape/path" annotation.
       - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
         action: replace
         target_label: __metrics_path__
         regex: (.+)
       # Example relabel to set the scrape port from the
       # prometheus.io/port: "80" annotation.
       - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
         action: replace
         target_label: __address__
         regex: (.+)(?::\d+);(\d+)
         replacement: $1:$2
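
To make the last relabel rule easier to follow, here is a minimal Python sketch of what it does to a single target. It assumes the usual Prometheus relabeling behaviour (source label values joined with ";", the regex anchored at both ends, and the label left unchanged when the regex does not match). The service address and port values are hypothetical.

```python
import re

# Regex copied from the relabel rule; Prometheus anchors relabel regexes
# at both ends, which re.fullmatch() reproduces here.
ADDRESS_PORT_RE = re.compile(r"(.+)(?::\d+);(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the __address__ relabel step for one discovered service."""
    # Source label values are joined with ";" before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_PORT_RE.fullmatch(joined)
    if match is None:
        # A "replace" action leaves the target label untouched on no match.
        return address
    # Equivalent of the "$1:$2" replacement.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    # Hypothetical service address and prometheus.io/port annotation value.
    print(rewrite_address("rabbitmq-exporter.monitor.svc:15672", "9419"))
```

With a prometheus.io/port annotation present, the discovered port is swapped for the annotated one. Without it, the address is left as discovered.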

Exporter service:

 annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"

Method 3 – Prometheus Operator

Setting up a service monitor
The Prometheus operator supports an automated way of scraping data from the exporters by setting up a service monitor Kubernetes object. A sample service monitor for RabbitMQ can be found here. These are the necessary steps:

Step 1

Add or update the Prometheus operator’s selectors. By default, the Prometheus operator comes with empty selectors, which select every service monitor available in the cluster for scraping.

To check your Prometheus configuration:

kubectl get prometheus -n <namespace> -o yaml

A sample output will look like this.

    ruleNamespaceSelector: {}
    ruleSelector:
      matchLabels:
        app: kube-prometheus-stack
        release: kps
    scrapeInterval: 1m
    scrapeTimeout: 10s
    securityContext:
      fsGroup: 2000
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 1000
    serviceAccountName: kps-kube-prometheus-stack-prometheus
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector:
      matchLabels:
        release: kps

Here you can see that this Prometheus configuration selects all the service monitors with the label release = kps.

So if you modify the default Prometheus operator configuration for service monitor scraping, make sure you use the right labels in your service monitor as well.

Step 2

Add a service monitor and make sure it has a matching label and namespace for the Prometheus service monitor selectors (serviceMonitorNamespaceSelector & serviceMonitorSelector).

Sample configuration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: rabbitmq-exporter
    meta.helm.sh/release-namespace: monitor
  creationTimestamp: "2022-04-04T10:22:52Z"
  generation: 1
  labels:
    app: prometheus-rabbitmq-exporter
    app.kubernetes.io/managed-by: Helm
    chart: prometheus-rabbitmq-exporter-1.1.0
    heritage: Helm
    release: kps
  name: rabbitmq-exporter-prometheus-rabbitmq-exporter
  namespace: monitor
  resourceVersion: "86677099"
  uid: 55943299-a8ed-4553-9cdb-cc784176aea8
spec:
  endpoints:
  - interval: 15s
    port: rabbitmq-exporter
  selector:
    matchLabels:
      app: prometheus-rabbitmq-exporter
      release: rabbitmq-exporter

Here you can see that the service monitor carries the label release = kps, which matches the selector specified in the Prometheus operator scraping configuration.
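
The label matching described in Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration of matchLabels semantics (every key/value pair in the selector must be present on the object, and an empty selector matches everything). It does not cover matchExpressions, and the helper name is our own.

```python
from typing import Dict


def selector_matches(match_labels: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Return True when every matchLabels pair is present on the object.

    An empty selector ({}) matches every object.
    """
    return all(labels.get(key) == value for key, value in match_labels.items())


# serviceMonitorSelector.matchLabels from the Prometheus object in Step 1.
service_monitor_selector = {"release": "kps"}

# A subset of the labels from the sample service monitor in Step 2.
service_monitor_labels = {
    "app": "prometheus-rabbitmq-exporter",
    "chart": "prometheus-rabbitmq-exporter-1.1.0",
    "release": "kps",
}

if __name__ == "__main__":
    print(selector_matches(service_monitor_selector, service_monitor_labels))
```

If the service monitor carried release = rabbitmq-exporter instead, the selector would not match and Prometheus would not pick it up.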

Metrics

The following handpicked metrics give insight into RabbitMQ operations.

  1. Server is up
    As the name suggests, this metric exposes the state of the RabbitMQ process, that is, whether it is up or down.
    ➡ The key of the exporter metric is “rabbitmq_up”.
    ➡ The value of the metric is a boolean – 1 or 0, which indicates whether RabbitMQ is up or down respectively.
  2. Overflowing queue
    Queues are a fundamental component of any message broker. All messages that are pushed to or read from RabbitMQ must belong to one of the queues.
    Users never want to choke a queue. If a queue is filled up to its maximum capacity, it can no longer accept new messages.
    The following metric gives the total number of ready messages in the queue.
    ➡ The metric key is “rabbitmq_queue_messages_ready_total”.
    ➡ The value is the number of messages, e.g. “rabbitmq_queue_messages_ready_total 157”.
  3. Too many connections
    RabbitMQ acts as a broker between a publisher and a subscriber. Every client of a queue opens a connection with RabbitMQ. Each new connection requires resources from the underlying machine and puts a burden on the hardware as well as the software. Therefore, the number of connections to RabbitMQ should be limited to avoid degrading the service.
    ➡ The metric “rabbitmq_connectionsTotal” gives the total number of active connections on RabbitMQ.
    ➡ The acceptable limit should be calculated based on the resources allocated to the RabbitMQ service.
  4. Active queue
    As the name suggests, this metric gives insight into how many active queues in RabbitMQ are handling data.
    A message can be enqueued (added) and dequeued (removed). It is important to monitor the active queues.
    ➡ The metric “rabbitmq_queuesTotal” exposes the number of active queues.
  5. Total number of consumers
    As the name suggests, this metric provides insight into how many consumers a queue has. Consumers in RabbitMQ are the targets that consume messages from the queue.
    ➡ The metric “rabbitmq_consumersTotal” exposes the total number of active consumers on a queue.
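
As an illustration of how these metrics look once exposed, the following Python sketch parses a small sample in the Prometheus text format and picks out the values discussed above. The sample values are made up, and the parser is deliberately simplified: it assumes one "name value" pair per line without timestamps.

```python
from typing import Dict

# Illustrative sample in the Prometheus text format; all values are made up.
SAMPLE_METRICS = """\
# sample output of a hypothetical exporter endpoint
rabbitmq_up 1
rabbitmq_queue_messages_ready_total 157
rabbitmq_connectionsTotal 12
rabbitmq_queuesTotal 4
rabbitmq_consumersTotal 7
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name value' lines, skipping blank lines and '#' comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE_METRICS)
    print("up:", metrics["rabbitmq_up"] == 1)
    print("ready messages:", metrics["rabbitmq_queue_messages_ready_total"])
```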

Alerting

Now that we have covered the most valuable metrics, this section explains how to set up critical alerts.

PromQL is a query language for the Prometheus monitoring system. It is designed for building powerful yet simple queries for graphs, alerts, or derived time series (aka recording rules). PromQL was designed from scratch and has little in common with other query languages used in time series databases, such as SQL in TimescaleDB, InfluxQL, or Flux. More details can be found here.

Prometheus works with Alertmanager, a companion component that is responsible for sending alerts (by email, Slack, or any other supported channel) when any of the trigger conditions is met. Alerting rules allow users to define alerts based on Prometheus query expressions. They are defined based on the available metrics scraped from the exporter. Click here for a good source of community-defined alerts.

A general alert looks as follows:

- alert: (Alert Name)
  expr: (Metric exported from exporter) >/</==/<=/>= (Value)
  for: (wait for a certain duration between first encountering a new expression output vector element and counting an alert as firing for this element)
  labels: (allows specifying a set of additional labels to be attached to the alert)
  annotations: (specifies a set of informational labels that can be used to store longer additional information)
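
The effect of the "for" field is easiest to see with a small simulation. The Python sketch below is a simplified model, not Prometheus code: it assumes the rule is evaluated at a fixed interval and that an alert stays pending until its condition has held continuously for the "for" duration. The sample values, threshold, and interval are hypothetical.

```python
from typing import List, Optional


def alert_states(values: List[float], threshold: float,
                 for_seconds: int, eval_interval: int) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states: List[str] = []
    active_since: Optional[int] = None  # when the condition first became true
    for step, value in enumerate(values):
        now = step * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # The condition stopped holding, so the timer is reset.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical queue depths sampled every 60s, with "expr: ... > 1000"
    # and "for: 2m".
    depths = [900, 1200, 1500, 1100, 800, 1300]
    print(alert_states(depths, threshold=1000, for_seconds=120, eval_interval=60))
```

A short spike above the threshold therefore never reaches the firing state, which is why a non-zero "for" is useful for noisy metrics such as queue depth.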

Some of the recommended RabbitMQ alerts are:

  1. Alert – RabbitMQ is down
  - alert: RabbitmqDown
    expr: rabbitmq_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Rabbitmq down (instance {{ $labels.instance }})
      description: "RabbitMQ node down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  2. Alert – Too many messages in the queue
  - alert: RabbitmqTooManyMessagesInQueue
    expr: rabbitmq_queue_messages_ready_total > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Rabbitmq too many messages in queue (instance {{ $labels.instance }})
      description: "Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  3. Alert – RabbitMQ running out of memory
  - alert: RabbitmqOutOfMemory
    expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Rabbitmq out of memory (instance {{ $labels.instance }})
      description: "Memory available for RabbitMQ is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  4. Alert – Too many connections
  - alert: RabbitmqTooManyConnections
    expr: rabbitmq_connections > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Rabbitmq too many connections (instance {{ $labels.instance }})
      description: "The total connections of a node is too high\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  5. Alert – Too many consumers
  - alert: Too_many_consumers
    expr: rabbitmq_consumersTotal > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: High number of consumers (instance {{ $labels.instance }})
      description: "Consumers are exceeding the limit (> 1000)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
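
The out-of-memory expression is plain arithmetic on two metrics, which the following Python sketch reproduces. The function names and the sample numbers are our own, and the sketch assumes a memory limit greater than zero.

```python
def memory_used_percent(mem_used: float, mem_limit: float) -> float:
    """Equivalent of rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100."""
    return mem_used / mem_limit * 100


def out_of_memory(mem_used: float, mem_limit: float,
                  threshold_percent: float = 90.0) -> bool:
    """True when usage exceeds the threshold, i.e. less than 10% is left."""
    return memory_used_percent(mem_used, mem_limit) > threshold_percent


if __name__ == "__main__":
    # Hypothetical node: 1900 units used out of a 2000 unit limit.
    print(out_of_memory(1900, 2000))
    # Hypothetical node at half of its limit.
    print(out_of_memory(1000, 2000))
```

Alerting on the ratio rather than on an absolute value keeps the rule valid when the memory limit of the RabbitMQ node is changed.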

Dashboard

Graphs are easier to understand and more user-friendly than a row of numbers. For this purpose, users can plot their time series data in visualized format using Grafana.

Grafana is an open-source dashboarding tool used for visualizing metrics with the help of customizable and illustrative charts and graphs. It connects very well with Prometheus and makes monitoring easy and informative. Dashboards in Grafana are made up of panels, with each panel running a PromQL query to fetch metrics from Prometheus.
Grafana supports community-driven dashboards for most widely used software, which can be imported directly from the Grafana community.

NexClipper uses a community dashboard for RabbitMQ, which is widely accepted and has a lot of useful panels.

What is a Panel?

Panels are the most basic component of a dashboard and can display information in various ways, such as gauge, text, bar chart, graph, and so on. They provide information in a very interactive way. Users can view every panel separately and check the value of metrics within a specific time range. 
The values on the panel are queried using PromQL, which is Prometheus Query Language. PromQL is a simple query language used to query metrics within Prometheus. It enables users to query data, aggregate and apply arithmetic functions to the metrics, and then further visualize them on panels.

Here is an example panel:

Showing system up/down with other consumer-related information

Helm Chart

The exporter, alert rules, and dashboard can be deployed in Kubernetes using a Helm chart. The Helm chart used for deployment is taken from the Prometheus community and can be found here. To deploy this Helm chart, users can either follow the steps in the above link or refer to the ones outlined below:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm repo update

helm install [RELEASE_NAME] prometheus-community/prometheus-rabbitmq-exporter

Some of the common parameters that must be changed in the values file include:

rabbitmq.url: The RabbitMQ listening URL.
rabbitmq.user: The RabbitMQ connection user.
rabbitmq.password: The RabbitMQ password.

Additional parameters can be changed based on individual needs, such as include_queues, skip_queues, output format, timeouts, etc. All these parameters can be tuned via the values.yaml file here.

  capabilities: bert,no_sort
  include_queues: ".*"
  include_vhost: ".*"
  skip_queues: "^$"
  skip_verify: "false"
  skip_vhost: "^$"
  exporters: "exchange,node,overview,queue"
  output_format: "TTY"
  timeout: 30
  max_queues: 0
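
To give a feel for how include_queues and skip_queues interact, here is a minimal Python sketch. It assumes a queue is exported when its name matches the include pattern and does not match the skip pattern, with unanchored regex matching. The queue names are hypothetical, and the exporter's actual matching rules should be checked in its documentation.

```python
import re
from typing import Iterable, List


def select_queues(names: Iterable[str], include: str = ".*",
                  skip: str = "^$") -> List[str]:
    """Keep queues that match `include` and do not match `skip`."""
    include_re = re.compile(include)
    skip_re = re.compile(skip)
    return [
        name for name in names
        if include_re.search(name) and not skip_re.search(name)
    ]


if __name__ == "__main__":
    queues = ["orders", "payments", "rpc-reply-1"]  # hypothetical names
    # Defaults from the values file: include everything, skip nothing.
    print(select_queues(queues))
    # Skip short-lived RPC reply queues.
    print(select_queues(queues, skip="^rpc-"))
```

With the default values (".*" and "^$") every queue is kept, because "^$" only matches an empty name.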

In addition to the native way of setting up Prometheus monitoring, a service monitor can be deployed (if the Prometheus operator is being used) to tell Prometheus which exporter service to scrape. With this approach, multiple RabbitMQ instances can be scraped without altering the Prometheus configuration, since every RabbitMQ exporter comes with its own service monitor.

In the above-mentioned chart, a service monitor can be deployed by turning it on from the values.yaml file here.

# or use the service monitor
prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: kps
    interval: 15s
    namespace: []
  rules:
    enabled: true
    additionalLabels:
      release: kps
      app: kube-prometheus-stack

A sample reference values file:

rabbitmq:
  url: http://ncmq-rabbitmq-hana.nc.svc.cluster.local:15672
  user: guest
  password: guest
  # If existingPasswordSecret is set then password is ignored
  existingPasswordSecret: ~
  existingPasswordSecretKey: password
  capabilities: bert,no_sort
  include_queues: ".*"
  include_vhost: ".*"
  skip_queues: "^$"
  skip_verify: "false"
  skip_vhost: "^$"
  exporters: "exchange,node,overview,queue"
  output_format: "TTY"
  timeout: 30
  max_queues: 0

## Additional labels to set in the Deployment object. Together with standard labels from
## the chart
additionalLabels: {}

podLabels: {}


# Either use Annotation
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "9419"


# or use the service monitor
prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: kps
    interval: 15s
    namespace: []
  rules:
    enabled: true
    additionalLabels:
      release: kps
      app: kube-prometheus-stack

Update the annotation section here if not using Prometheus operator.

annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"

Now let’s move on to the setup of the second option.

RabbitMQ with built-in Prometheus plugin
(no exporter needed)

Additionally, RabbitMQ can be monitored using its built-in Prometheus plugin. Our recommendation is to use both options.

How to install plugin, choose official metrics, and set alerts

RabbitMQ version 3.8.0 and above includes a built-in Prometheus metrics plugin that exposes all RabbitMQ metrics in Prometheus format on an endpoint that Prometheus can scrape, either through auto-discovery or through a service monitor. To enable the RabbitMQ plugin via the Helm chart, set metrics.enabled to “true”.

helm install <release name> bitnami/rabbitmq --set metrics.enabled=true

More details about the plugin can be found here.

In the case of standard Prometheus installation, once the plugin is enabled in RabbitMQ, annotations need to be added to RabbitMQ (if you are using the RabbitMQ chart it will be added automatically). Here are the annotations:

annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"

These annotations should be added on the pod level. Now Prometheus will automatically start scraping the data if the pod discovery is enabled.
Prometheus configuration for pod discovery:

- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod

In the case of the Prometheus Operator, once the plugin is enabled in RabbitMQ, the service monitor needs to be enabled. For this, run the following command:

helm upgrade --install <release name> bitnami/rabbitmq --set metrics.enabled=true --set metrics.serviceMonitor.enabled=true

Once the service monitor is created, the Prometheus Operator will start scraping the metrics automatically in the default configuration.

Some important metrics

  1. Server is up
    As the name suggests, this metric exposes the state of the RabbitMQ process, that is, whether it is up or down.
    ➡ The key of the exporter metric is “rabbitmq_up”.
    ➡ The value of the metric is a boolean – 1 or 0, which indicates whether RabbitMQ is up or down respectively.
  2. Cluster down
    This metric exposes the state of the RabbitMQ cluster.
    ➡ The key of the exporter metric is “rabbitmq_running”.
    ➡ The value of the metric is a number that indicates how many nodes are running in the RabbitMQ cluster.
  3. Out of memory
    The memory status of RabbitMQ is exposed through these metrics.
    ➡ The keys of the exporter metrics are “rabbitmq_node_mem_used” and “rabbitmq_node_mem_limit”.
    ➡ The values are numbers representing the memory used and the memory limit, from which the available memory can be derived.
  4. Too many connections
    RabbitMQ acts as a broker between a publisher and a subscriber. Every client of a queue opens a connection with RabbitMQ. Each new connection requires resources from the underlying machine and puts a burden on the hardware as well as the software. Therefore, the number of connections to RabbitMQ should be limited to avoid degrading the service.
    ➡ The metric “rabbitmq_connectionsTotal” gives the total number of active connections on RabbitMQ.
    ➡ The acceptable limit should be calculated based on the resources allocated to the RabbitMQ service.
  5. Cluster partitions
    This metric exposes the RabbitMQ partition status.
    ➡ The key of the exporter metric is “rabbitmq_partitions”.
    ➡ The value of the metric is a number that indicates how many network partitions have been created.

Some critical alerts

  1. Alert – RabbitMQ down
  - alert: RabbitmqDown
    expr: rabbitmq_up{service="{{ template "rabbitmq.fullname" . }}"} == 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: Rabbitmq down (instance {{ "{{ $labels.instance }}" }})
      description: RabbitMQ node down
  2. Alert – RabbitMQ cluster down
  - alert: ClusterDown
    expr: |
      sum(rabbitmq_running{service="{{ template "rabbitmq.fullname" . }}"})
      < {{ .Values.replicaCount }}
    for: 5m
    labels:
      severity: error
    annotations:
      summary: Cluster down (instance {{ "{{ $labels.instance }}" }})
      description: |
          Less than {{ .Values.replicaCount }} nodes running in RabbitMQ cluster
          VALUE = {{ "{{ $value }}" }}
  3. Alert – RabbitMQ partition
  - alert: ClusterPartition
    expr: rabbitmq_partitions{service="{{ template "rabbitmq.fullname" . }}"} > 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: Cluster partition (instance {{ "{{ $labels.instance }}" }})
      description: |
          Cluster partition
          VALUE = {{ "{{ $value }}" }}
  4. Alert – RabbitMQ is out of memory
  - alert: OutOfMemory
    expr: |
      rabbitmq_node_mem_used{service="{{ template "rabbitmq.fullname" . }}"}
      / rabbitmq_node_mem_limit{service="{{ template "rabbitmq.fullname" . }}"}
      * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Out of memory (instance {{ "{{ $labels.instance }}" }})
      description: |
          Memory available for RabbitMQ is low (< 10%)
          VALUE = {{ "{{ $value }}" }}
          LABELS: {{ "{{ $labels }}" }}
  5. Alert – Too many connections
  - alert: TooManyConnections
    expr: rabbitmq_connectionsTotal{service="{{ template "rabbitmq.fullname" . }}"} > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Too many connections (instance {{ "{{ $labels.instance }}" }})
      description: |
          RabbitMQ instance has too many connections (> 1000)
          VALUE = {{ "{{ $value }}" }}
          LABELS: {{ "{{ $labels }}" }}
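
The cluster-down and partition rules boil down to two simple comparisons, sketched below in Python. The sketch assumes that rabbitmq_running reports 1 for a running node and 0 otherwise, so that the sum is the number of running nodes. The node names and values are hypothetical.

```python
from typing import Dict


def cluster_down(running_by_node: Dict[str, float], replica_count: int) -> bool:
    """Equivalent of sum(rabbitmq_running) < replicaCount."""
    return sum(running_by_node.values()) < replica_count


def has_partition(partitions_by_node: Dict[str, float]) -> bool:
    """Equivalent of rabbitmq_partitions > 0 on at least one node."""
    return any(count > 0 for count in partitions_by_node.values())


if __name__ == "__main__":
    # Hypothetical three-node cluster with one node stopped.
    running = {"rabbit@node-0": 1, "rabbit@node-1": 1, "rabbit@node-2": 0}
    print(cluster_down(running, replica_count=3))
    # No node reports a network partition.
    print(has_partition({"rabbit@node-0": 0, "rabbit@node-1": 0}))
```

Comparing against replicaCount means the rule follows the chart values automatically when the cluster is scaled.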

Alerts can be enabled, disabled, altered, or added using the helm chart here.

Dashboard

This is the dashboard that has been used.

This concludes our discussion of the RabbitMQ exporter! If you have any questions, you can reach our team via support@nexclipper.io. Stay tuned for further exporter reviews and tips coming soon.