Monitoring a Multi-Node Docker Swarm Stack with Grafana, Prometheus, and 100 Lines of Python

One year after joining the PrioBike team, our backend infrastructure now consists of over 100 Docker microservices distributed across staging and production deployments. A few weeks ago, we migrated our deployment from Docker Compose to Docker Swarm, allowing us to scale services across multiple virtual machines and support a virtually unlimited number of users. However, running multiple nodes in a Docker overlay network posed significant new challenges.

With multiple nodes running various kinds of containers, SSH-ing into each virtual machine to check container status becomes a time-consuming task. This is why Docker Swarm provides node-agnostic monitoring with commands such as docker stack services.

In the command line output, we can see how many replicas of each service are currently running. This tells us whether containers have crashed, whether a service is currently starting, and whether it is unhealthy. It doesn't matter on which machine a service runs, as long as the command is executed on a manager node of the Docker Swarm.

With docker node ls, we can list all nodes currently connected to the Docker Swarm, together with their status. If a node in our deployment crashes or is restarted for updates, we can see it here.

Finally, with docker stack ps, we can find out in detail which containers were stopped in the past and on which node they are or were running. With this information, we can tell whether containers crash repeatedly and whether the crashes are concentrated on a specific virtual machine.

Grafana is a self-hostable tool that makes metrics easily accessible. We use it to record and visualize statistics from IoT MQTT endpoints, such as the five thousand traffic lights that send us real-time observations. To achieve this, each service to be monitored needs a metrics endpoint. Prometheus scrapes this endpoint periodically and provides the scraped metrics to the Grafana service.

Grafana is our central monitoring platform.

Now, how can we transform the command line output of docker stack services, docker stack ps, or docker node ls into statistics that are displayed in Grafana?

Our solution to this problem is a Python microservice that executes the Docker commands via the command line interface and parses their output. We wrap the Python script into a simple Alpine Linux Docker image and bind-mount the Docker socket into the microservice's container.
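To illustrate, a stack file for such a microservice could look roughly like the following sketch. The service name, image name, and volume name are assumptions, not the actual PrioBike configuration:

```yaml
version: "3.8"
services:
  docker-metrics:
    image: example/docker-metrics:latest  # assumed image name
    volumes:
      # Bind-mount the host's Docker socket so the Python script
      # inside the container can run Docker CLI commands.
      - /var/run/docker.sock:/var/run/docker.sock
      # Shared volume with the NGINX service that later serves metrics.txt.
      - metrics:/usr/share/nginx/html
    deploy:
      placement:
        constraints:
          # Swarm-wide commands like `docker stack services`
          # only work on manager nodes.
          - node.role == manager
volumes:
  metrics:
```

The manager-node placement constraint matters because, as noted above, the Swarm-level commands must run on a manager node.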

Then we can run the Docker commands via Python's subprocess package and parse their output. To achieve this, we use Docker's --format option and parse the resulting JSON with Python's json package. From the relevant parameters, we generate Prometheus-style metrics and write them to a file such as metrics.txt.
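A minimal sketch of this idea is shown below. The metric names, the stack name, and the output path are assumptions for illustration, not the actual PrioBike code:

```python
import json
import subprocess

METRICS_PATH = "/usr/share/nginx/html/metrics.txt"  # assumed shared NGINX directory


def services_to_metrics(json_lines):
    """Turn the JSON lines emitted by
    `docker stack services <stack> --format '{{json .}}'`
    into Prometheus-style gauge metrics."""
    out = []
    for raw in json_lines:
        svc = json.loads(raw)
        # The "Replicas" field looks like "2/3": 2 running out of 3 desired.
        running, desired = (int(n) for n in svc["Replicas"].split("/"))
        label = f'service="{svc["Name"]}"'
        out.append(f"swarm_service_replicas_running{{{label}}} {running}")
        out.append(f"swarm_service_replicas_desired{{{label}}} {desired}")
    return "\n".join(out) + "\n"


def collect(stack="example-stack"):  # stack name is an assumption
    # Run the Docker CLI and capture its JSON-formatted output.
    result = subprocess.run(
        ["docker", "stack", "services", stack, "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    with open(METRICS_PATH, "w") as f:
        f.write(services_to_metrics(result.stdout.splitlines()))


if __name__ == "__main__":
    collect()
```

The same pattern applies to docker node ls and docker stack ps; only the parsed fields and metric names change.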

This metrics file is written to /usr/share/nginx/html/metrics.txt, a directory that is shared with the NGINX service. With this configuration, we can access the generated metrics file directly at localhost/metrics.txt! Finally, we let Prometheus scrape the metrics file.
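A minimal Prometheus scrape configuration for this setup might look like the following sketch; the job name, scrape interval, and NGINX target are assumptions:

```yaml
scrape_configs:
  - job_name: "docker-swarm-status"  # assumed job name
    scrape_interval: 30s
    metrics_path: /metrics.txt       # the static file served by NGINX
    static_configs:
      - targets: ["nginx:80"]        # assumed NGINX service name in the overlay network
```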

That’s it. Assuming that Grafana and Prometheus are set up correctly, we can now access the Docker container statistics in Grafana and monitor all our services and nodes at a glance. Monitoring our Swarm deployment is now easily 100 times more convenient 🚀

The full example with Grafana and Prometheus is available here.


