Monitoring a Multi-Node Docker Swarm Stack with Grafana, Prometheus, and 100 Lines of Python

6 min read · Nov 19, 2022

One year after I joined the PrioBike team, our backend infrastructure consists of over 100 Docker microservices distributed across staging and production deployments. A few weeks ago, we migrated our deployment from Docker Compose to Docker Swarm, which lets us scale services across multiple virtual machines as the number of users grows. However, running multiple nodes in a Docker overlay network posed significant new challenges.

With multiple nodes running various kinds of containers, SSHing into each virtual machine to check container status becomes time-consuming. This is why Docker Swarm provides node-agnostic monitoring with commands such as docker stack services.

❯ docker stack services stack
ID             NAME                        MODE         REPLICAS               IMAGE                  PORTS
f5om9jmvtclc   stack_dummy                 replicated   0/4 (max 1 per node)   alpine:latest
w7kojfmex7ig   stack_stack-monitor         global       0/0                    stack-monitor:latest
s10gx86vispj   stack_stack-monitor-nginx   global       0/0                    nginx:alpine           *:80->80/tcp

In the command line output, we can see how many replicas of each service are currently running. This tells us whether containers have crashed, whether a service is still starting, or whether it is unhealthy. It doesn’t matter which machines the service runs on, as long as the command is executed on a manager node of the Docker Swarm.

❯ docker node ls
ID                            HOSTNAME         STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
h69co8eexw0nv7128h9gbrihp *   docker-desktop   Ready     Active         Leader           20.10.12

With docker node ls, we can list all nodes currently connected to the Docker Swarm together with their status. If a node in our deployment crashes or is restarted due to updates, we can see it here.
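
To make this output machine-readable, the Docker CLI can print one JSON object per node via docker node ls --format '{{json .}}'. The following is a minimal sketch of turning such lines into Prometheus-style metrics. It assumes the JSON keys Hostname, Status, and Availability; the metric names node_is_ready and node_is_active are made up for this example and are not part of the original stack monitor.

```python
import json


def node_metrics(node_ls_output):
    """Turn JSON lines from `docker node ls --format '{{json .}}'` into metric lines."""
    metrics = []
    for line in node_ls_output.splitlines():
        if not line.strip():
            continue  # Skip empty lines.
        node = json.loads(line)
        hostname = node["Hostname"]
        # Hypothetical metric names, chosen for this example only.
        is_ready = int(node["Status"] == "Ready")
        is_active = int(node["Availability"] == "Active")
        metrics.append(f'node_is_ready{{node="{hostname}"}} {is_ready}')
        metrics.append(f'node_is_active{{node="{hostname}"}} {is_active}')
    return metrics


# Sample input mirroring the single-node output shown above.
sample = '{"Hostname": "docker-desktop", "Status": "Ready", "Availability": "Active"}'
print("\n".join(node_metrics(sample)))
```

The function takes the raw command output as a string, so it can be tried without a running Swarm.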

❯ docker stack ps stack
ID             NAME                                                      IMAGE                  NODE             DESIRED STATE   CURRENT STATE                      ERROR                              PORTS
kpqa6bxjjdrs   stack_dummy.1                                             alpine:latest          docker-desktop   Ready           Ready less than a second ago
lbfnd41g40lk    \_ stack_dummy.1                                         alpine:latest          docker-desktop   Shutdown        Complete less than a second ago
vqqxh520s2ch    \_ stack_dummy.1                                         alpine:latest          docker-desktop   Shutdown        Complete 16 seconds ago
itior6vgaiao    \_ stack_dummy.1                                         alpine:latest          docker-desktop   Shutdown        Complete 32 seconds ago
zlxup3rrmb8r    \_ stack_dummy.1                                         alpine:latest          docker-desktop   Shutdown        Failed 48 seconds ago              "No such container: stack_dumm…"
...
7hh3m9ar5ljg   stack_stack-monitor-nginx.h69co8eexw0nv7128h9gbrihp       nginx:alpine           docker-desktop   Running         Running 42 seconds ago
j51qtc46ew95    \_ stack_stack-monitor-nginx.h69co8eexw0nv7128h9gbrihp   nginx:alpine           docker-desktop   Shutdown        Failed 48 seconds ago              "task: non-zero exit (255)"
9o2hiv19hqh4    \_ stack_stack-monitor-nginx.h69co8eexw0nv7128h9gbrihp   nginx:alpine           docker-desktop   Shutdown        Failed 48 seconds ago              "task: non-zero exit (255)"
gv9a3im00ccc   stack_stack-monitor.h69co8eexw0nv7128h9gbrihp             stack-monitor:latest   docker-desktop   Ready           Preparing less than a second ago
pykfjfnnnjji    \_ stack_stack-monitor.h69co8eexw0nv7128h9gbrihp         stack-monitor:latest   docker-desktop   Shutdown        Failed less than a second ago      "task: non-zero exit (1)"
8jk79f5vd27f    \_ stack_stack-monitor.h69co8eexw0nv7128h9gbrihp         stack-monitor:latest   docker-desktop   Shutdown        Failed 6 seconds ago               "task: non-zero exit (1)"
5k1rucq3qrun    \_ stack_stack-monitor.h69co8eexw0nv7128h9gbrihp         stack-monitor:latest   docker-desktop   Shutdown        Failed 12 seconds ago              "task: non-zero exit (1)"
3xbrevh4uovf    \_ stack_stack-monitor.h69co8eexw0nv7128h9gbrihp         stack-monitor:latest   docker-desktop   Shutdown        Failed 18 seconds ago              "task: non-zero exit (1)"

Finally, with docker stack ps we can find out in detail which containers were stopped in the past and on which node they are or were running. With this information, we can tell whether containers crash repeatedly and whether the crashes cluster on a specific virtual machine.
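
As a rough illustration of that last point, the sketch below counts failed tasks per node. It assumes the output of docker stack ps stack --format '{{json .}}' with the keys Node and CurrentState, where CurrentState starts with the state name as in the listing above. The metric name task_failures is hypothetical.

```python
import json
from collections import Counter


def failed_tasks_by_node(stack_ps_output):
    """Count failed tasks per node from JSON lines of `docker stack ps`."""
    failures = Counter()
    for line in stack_ps_output.splitlines():
        if not line.strip():
            continue  # Skip empty lines.
        task = json.loads(line)
        # CurrentState looks like "Failed 48 seconds ago" or "Running 42 seconds ago".
        if task["CurrentState"].startswith("Failed"):
            failures[task["Node"]] += 1
    return dict(failures)


# Sample input modeled on the listing above (shortened to the keys we need).
sample = "\n".join([
    '{"Name": "stack_dummy.1", "Node": "docker-desktop", "CurrentState": "Failed 48 seconds ago"}',
    '{"Name": "stack_stack-monitor-nginx.1", "Node": "docker-desktop", "CurrentState": "Running 42 seconds ago"}',
    '{"Name": "stack_stack-monitor.1", "Node": "docker-desktop", "CurrentState": "Failed 6 seconds ago"}',
])

for node, count in failed_tasks_by_node(sample).items():
    # Hypothetical metric name, chosen for this example only.
    print(f'task_failures{{node="{node}"}} {count}')
```

A node whose count keeps growing while the others stay flat is a good candidate for a closer look.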

Transforming this into a Grafana monitoring solution

Grafana is a self-hostable tool for visualizing metrics. We use it to visualize statistics of IoT MQTT endpoints, such as the five thousand traffic lights that send us real-time observations. To do this, each service to be monitored needs a metrics endpoint. Prometheus scrapes this endpoint periodically and stores the samples, and Grafana queries Prometheus to display them.

Grafana is our central monitoring platform.

Now, how can we transform the command line output of docker stack services, docker stack ps, or docker node ls into statistics that are displayed in Grafana?

Our solution to this problem is a Python microservice that executes and parses the Docker commands via the command line interface. We wrap the Python script into a simple Alpine Linux Docker image and bind the Docker socket as a Docker volume into the microservice’s container.

version: '3.9'

services:
  # The stack monitor.
  stack-monitor:
    image: stack-monitor
    environment:
      # Monitor a stack named "stack".
      - STACK=stack
      # This is where the stack monitor will output the Prometheus metrics.
      - OUTPUT=/usr/share/nginx/html/metrics.txt
    volumes:
      # Mount a volume under the shared nginx dir to serve static files.
      - stack_monitor_staticfiles:/usr/share/nginx/html/
      # Mount the Docker socket to be able to query the Docker API.
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager

  # The nginx proxy that serves the stack monitor metrics.
  stack-monitor-nginx:
    image: nginx:alpine
    hostname: stack-monitor-nginx # For Prometheus.
    volumes:
      # Mount a volume under the shared nginx dir to serve static files.
      - stack_monitor_staticfiles:/usr/share/nginx/html/
    ports:
      - 80:80 # Not strictly necessary, only for accessing via localhost.
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
  
  # ... Here come your other Swarm services 🐳
  # TODO: Paste your Grafana and Prometheus service configuration here.

volumes:
  stack_monitor_staticfiles:
    name: stack_monitor_staticfiles

Then we can run the Docker commands via Python’s subprocess module and parse their output. To achieve this, we use Docker’s --format option to print one JSON object per line and parse each line with Python’s json module. After we parse the relevant parameters, we generate Prometheus-style metrics and write them to a file such as metrics.txt.

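import json
import logging
import os
import subprocess

# Without this, logging.info() messages are dropped (the default level is WARNING).
logging.basicConfig(level=logging.INFO)

# Get the name of the deployed Docker Swarm container stack from the environment.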
stack_name = os.environ.get('STACK', None)
if stack_name is None:
    raise Exception('STACK environment variable not set')
# Get the location of the output metrics file.
metrics_file = os.environ.get('OUTPUT', None)
if metrics_file is None:
    raise Exception('OUTPUT environment variable not set')

prometheus_metrics = []

# Lookup the state of each service.
services_fmt = '{"id": "{{.ID}}", "name": "{{.Name}}", "mode": "{{.Mode}}", "replicas": "{{.Replicas}}", "image": "{{.Image}}", "ports": "{{.Ports}}"}'
services_cmd = f"docker stack services {stack_name} --format '{services_fmt}'"
logging.info(f'Running: {services_cmd}')
services = subprocess.run(services_cmd, shell=True, capture_output=True, text=True)
services_output = services.stdout.strip()
if services.returncode != 0:
    raise Exception(f"Error running command: {services_cmd} - {services.stderr}")
services_data = [json.loads(line) for line in services_output.splitlines()]

for service in services_data:
    replicas = service['replicas']
    # Remove the part after x/y, like (max 1 per node) or (global).
    # We only want the first part since that is the number of replicas.
    if ' ' in replicas:
        replicas = replicas.split(' ')[0]
    # Split the x/y into x and y
    replicas = replicas.split('/')
    name = service['name']
    n_replicas = replicas[0]
    n_desired_replicas = replicas[1]
    is_down = int(n_replicas) == 0
    is_partially_running = int(n_replicas) < int(n_desired_replicas) and int(n_replicas) > 0
    is_up = int(n_replicas) == int(n_desired_replicas) and int(n_replicas) > 0
    prometheus_metrics.append(f'service_replicas{{service="{name}"}} {n_replicas}')
    prometheus_metrics.append(f'service_desired_replicas{{service="{name}"}} {n_desired_replicas}')
    prometheus_metrics.append(f'service_is_partially_running{{service="{name}"}} {int(is_partially_running)}')
    prometheus_metrics.append(f'service_is_down{{service="{name}"}} {int(is_down)}')
    prometheus_metrics.append(f'service_is_up{{service="{name}"}} {int(is_up)}')

logging.info('Successfully generated prometheus metrics.')

# Write the metrics to the output file.
# The Prometheus text format expects the last line to end with a newline.
with open(metrics_file, 'w') as f:
    f.write('\n'.join(prometheus_metrics) + '\n')
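
The f-strings above interpolate the service name directly into the label. That works for typical service names, but the Prometheus text format requires backslashes, double quotes, and newlines inside label values to be escaped. A small helper along the following lines could take care of this; escape_label_value and format_metric are hypothetical names that do not appear in the original script.

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (
        str(value)
        .replace("\\", "\\\\")  # Backslash first, so later escapes are not doubled.
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_metric(name, labels, value):
    """Render one sample line from a metric name, a dict of labels, and a value."""
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{label_str}}} {value}"


print(format_metric("service_replicas", {"service": "stack_dummy"}, 0))
# prints: service_replicas{service="stack_dummy"} 0
```

With such a helper, each prometheus_metrics.append(...) call could pass the name, labels, and value separately instead of formatting the line by hand.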

The metrics file is written to /usr/share/nginx/html/metrics.txt, which lies in the directory shared with the NGINX service. With this configuration, we can directly access the generated metrics file at localhost/metrics.txt! Finally, we can scrape the metrics file with the following Prometheus configuration.

global:
  scrape_interval: '5s'
  evaluation_interval: '5s'

scrape_configs:
  - job_name: 'stack-monitor-nginx'
    metrics_path: /metrics.txt
    static_configs:
      - targets: ['stack-monitor-nginx:80']
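
With a scrape interval of 5 seconds, Prometheus may request metrics.txt at the moment the monitor is rewriting it, and open(metrics_file, 'w') truncates the file before the new content arrives. One possible way to avoid serving a half-written file is to write to a temporary file in the same directory and then swap it into place. This is a sketch under that assumption, not part of the original script; write_metrics_atomically is a hypothetical helper name.

```python
import os
import tempfile


def write_metrics_atomically(metrics_file, lines):
    """Write metric lines to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(metrics_file))
    # The temp file lives next to the target so both are on the same filesystem,
    # which os.replace() needs in order to swap the file atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(lines) + "\n")
        # mkstemp creates the file with mode 0600; make it readable for the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, metrics_file)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "metrics.txt")
    write_metrics_atomically(demo_file, ['service_is_up{service="stack_dummy"} 0'])
    with open(demo_file) as f:
        print(f.read(), end="")
```

Readers of the file then see either the previous complete version or the new complete version, never a partial one.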

That’s it. Assuming that Grafana and Prometheus are set up correctly, we can now access the Docker container statistics in Grafana and monitor all our services and nodes at a glance. Monitoring our Swarm deployment is now easily 100 times more convenient 🚀

The full example with Grafana and Prometheus is available here.

Written by Philipp