Building a Complete Docker Monitoring Stack with Prometheus & Grafana: A Hands-On Guide

How to set up enterprise-grade monitoring for your Docker infrastructure with GPU support

Aziz Hamadi
#Docker #Prometheus #Grafana #Monitoring #DevOps #MLOps #GPU

Introduction

In today's cloud-native world, visibility into your infrastructure is not just a luxury—it's a necessity. Whether you're running containerized applications, managing ML workloads with GPU resources, or simply ensuring your services stay healthy, having comprehensive monitoring in place can mean the difference between catching issues early and facing production outages.

In this article, I'll walk you through building a complete monitoring stack for Docker environments using industry-standard tools: Prometheus for metrics collection and Grafana for visualization. What makes this setup special is its comprehensive coverage—from container-level metrics to system-wide performance, and even GPU monitoring for ML/AI workloads.

The Challenge: Why Monitor Docker Environments?

Running applications in Docker containers brings numerous benefits: isolation, scalability, and portability. However, this containerization also introduces new complexities:

  • Limited visibility: Traditional monitoring tools often can't peek inside containers
  • Resource isolation: Understanding which containers are consuming resources requires specialized tools
  • GPU workloads: ML/AI applications need GPU monitoring, which traditional tools don't handle well
  • Fragmented metrics: Different components expose metrics in different ways

This is where a purpose-built monitoring stack becomes essential. By combining the right tools, we can achieve complete observability across our entire Docker infrastructure.

The Solution: A Complete Monitoring Stack

Our monitoring stack brings together five powerful components that work in harmony:

1. Prometheus - The Metrics Database

Prometheus is an open-source monitoring and alerting toolkit that's become the de facto standard for cloud-native monitoring. It collects metrics by scraping HTTP endpoints and stores them in a time-series database. Its powerful query language (PromQL) allows you to analyze metrics and create complex alerts.

Key features:

  • Pull-based metric collection
  • Powerful query language
  • Efficient time-series storage
  • Built-in service discovery

2. Grafana - The Visualization Layer

Grafana provides beautiful, customizable dashboards that transform raw metrics into actionable insights. With its extensive library of pre-built dashboards, you can get started monitoring within minutes.

Key features:

  • Rich visualization options (graphs, gauges, heatmaps)
  • Pre-built dashboard library
  • Alerting capabilities
  • Multi-data source support

3. cAdvisor - Container Metrics

Google's cAdvisor (Container Advisor) provides detailed metrics about running containers. It automatically discovers all containers on a host and collects resource usage and performance statistics.

Metrics collected:

  • CPU usage per container
  • Memory consumption
  • Network I/O statistics
  • Filesystem usage
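cAdvisor, like every exporter in this stack, serves its metrics in the Prometheus text exposition format. A self-contained sketch of what a response looks like, and a quick way to pull one value out on the command line (the metric name is real; the values and container names are made up):

```shell
# Two sample lines in the Prometheus text exposition format,
# as cAdvisor would serve them on /metrics (values are illustrative)
cat <<'EOF' > /tmp/cadvisor_sample.txt
# HELP container_memory_usage_bytes Current memory usage in bytes
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{name="grafana"} 157286400
container_memory_usage_bytes{name="prometheus"} 524288000
EOF

# Pull out the current memory usage of the "grafana" container
awk '/name="grafana"/ {print $2}' /tmp/cadvisor_sample.txt   # prints 157286400
```

In practice you would pipe `curl http://localhost:8080/metrics` through the same filter instead of a sample file.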

4. Node Exporter - Host System Metrics

The Prometheus Node Exporter exposes hardware and OS metrics of the host machine. It's essential for understanding overall system health and resource utilization.

Metrics collected:

  • CPU, memory, disk, and network utilization
  • System load averages
  • Filesystem statistics
  • Hardware temperature (where available)

5. DCGM Exporter - GPU Monitoring (Optional but Powerful)

For ML/AI workloads running on NVIDIA GPUs, the DCGM Exporter provides critical insights into GPU utilization, memory usage, and performance metrics.

Metrics collected:

  • GPU utilization percentage
  • Memory usage (used/total)
  • Temperature and power consumption
  • Performance statistics

Architecture Overview

The architecture is beautifully simple yet powerful:

┌─────────────┐
│  cAdvisor   │──┐
│  (Port 8080)│  │
└─────────────┘  │
                 │
┌─────────────┐  │    ┌─────────────┐    ┌─────────────┐
│Node Exporter│──┼───▶│ Prometheus  │───▶│   Grafana   │
│  (Port 9100)│  │    │ (Port 9090) │    │ (Port 3000) │
└─────────────┘  │    └─────────────┘    └─────────────┘
                 │
┌─────────────┐  │
│DCGM Exporter│──┘
│  (Port 9400)│
└─────────────┘

Each exporter collects metrics from its respective domain and exposes them via HTTP endpoints. Prometheus scrapes these endpoints at regular intervals (every 15 seconds by default) and stores the metrics. Grafana connects to Prometheus as a data source and visualizes the metrics through dashboards.
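That default interval comes from the global section of prometheus.yaml; a minimal fragment if you want to set it explicitly:

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated
```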

Getting Started: Installation Guide

Prerequisites

Before diving in, ensure you have:

  • Docker (v20.10+) installed
  • Docker Compose (v2.0+) installed
  • (Optional) NVIDIA Container Toolkit for GPU monitoring

Step 1: Set Up the Stack

The entire stack is defined in a single docker-compose.yaml file, making deployment straightforward. The configuration includes:

services:
  grafana:
    image: docker.io/grafana/grafana-oss:12.1.1
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  prometheus:
    image: docker.io/prom/prometheus:v3.6.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
      - --storage.tsdb.retention.time=7d
      - --storage.tsdb.retention.size=2GB

volumes:
  grafana-data:
  prometheus-data:
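The snippet above shows only Grafana and Prometheus; the exporters join the same file as additional entries under services:. A minimal sketch (the image tags, mounts, and GPU device reservation are illustrative assumptions, so adjust them to your environment):

```yaml
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1   # assumed tag
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node-exporter:
    image: docker.io/prom/node-exporter:v1.8.2   # assumed tag
    restart: unless-stopped
    ports:
      - "9100:9100"

  dcgm_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04   # assumed tag
    restart: unless-stopped
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```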

Key configuration points:

  • Data persistence: Both Grafana and Prometheus use Docker volumes to persist data
  • Retention policy: Prometheus is configured to retain 7 days of data or 2GB, whichever comes first
  • Auto-restart: All services are configured to restart automatically unless stopped

Step 2: Configure Prometheus

Prometheus needs to know which endpoints to scrape. This is configured in prometheus.yaml:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'dcgm_exporter'
    static_configs:
      - targets: ['dcgm_exporter:9400']

Each job represents a different source of metrics, and Prometheus will scrape them at the configured interval (15 seconds by default).

Step 3: Deploy and Verify

Bring the stack up with docker compose up -d. Once everything is running, you can verify the deployment:

  1. Check Prometheus targets: Navigate to http://localhost:9090/targets to see all configured endpoints. All should show as "UP" (green).

Figure 1: Prometheus Targets Status

  2. Test metrics endpoints: You can directly query each exporter:
    curl http://localhost:8080/metrics  # cAdvisor
    curl http://localhost:9100/metrics  # Node Exporter
    curl http://localhost:9400/metrics  # DCGM Exporter (if enabled)
  3. Verify service health: Check that all containers are running, and restart Prometheus if any target stays down.

Figure 2: Portainer monitoring stack

Visualizing Metrics with Grafana

With Prometheus collecting metrics, Grafana brings them to life with beautiful dashboards.

Initial Setup

  1. Access Grafana at http://localhost:3000
  2. Default credentials: admin/admin (you'll be prompted to change the password)
  3. Add Prometheus as a data source:
    • Navigate to Connections → Data Sources
    • Click Add new data source
    • Select Prometheus
    • Set URL to http://prometheus:9090
    • Click Save & Test
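If you'd rather not click through the UI, Grafana can also provision the data source at startup; a minimal sketch of a provisioning file (mount it into the container under /etc/grafana/provisioning/datasources/, the filename itself is arbitrary):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://prometheus:9090   # service name on the compose network
    isDefault: true
```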

Import Pre-built Dashboards

The Grafana community has created excellent dashboards for each exporter. Here are the recommended ones:

| Dashboard            | Purpose          | Dashboard ID |
|----------------------|------------------|--------------|
| Container Monitoring | cAdvisor metrics | 19792        |
| Node Exporter Full   | System metrics   | 1860         |
| NVIDIA DCGM Exporter | GPU metrics      | 12239        |

Importing is straightforward:

  1. Click + → Import
  2. Enter the Dashboard ID
  3. Select your Prometheus data source
  4. Click Import

Dashboard Highlights

The dashboards provide comprehensive views into your infrastructure:

Figure 3: Grafana Container Dashboard

Container Metrics Dashboard shows:

  • CPU usage per container
  • Memory consumption trends
  • Network I/O statistics
  • Container restart counts

Figure 4: Grafana System Dashboard

System Metrics Dashboard provides:

  • Overall CPU, memory, and disk utilization
  • Network traffic patterns
  • Load averages
  • Filesystem usage

Figure 5: Grafana GPU Dashboard

GPU Metrics Dashboard (for ML workloads) displays:

  • GPU utilization percentages
  • Memory usage across all GPUs
  • Temperature monitoring
  • Power consumption

Real-World Benefits

This monitoring stack provides several immediate benefits:

1. Proactive Issue Detection

By continuously monitoring metrics, you can detect anomalies before they become critical. For example:

  • Sudden spikes in memory usage might indicate a memory leak
  • Increased container restart counts signal application instability
  • GPU temperature increases could indicate cooling issues
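Prometheus can watch for conditions like these on its own through alerting rules; a hedged sketch of a rule file (the threshold and duration are illustrative, and the file must be referenced via rule_files in prometheus.yaml):

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: ContainerHighMemory
        # Fires when any named container stays above ~2 GB for 5 minutes
        expr: container_memory_usage_bytes{name!=""} > 2e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} memory is above 2 GB'
```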

2. Resource Optimization

Understanding resource consumption helps you:

  • Right-size containers based on actual usage
  • Identify over-provisioned resources
  • Plan capacity more accurately

3. Performance Insights

The detailed metrics help you:

  • Identify performance bottlenecks
  • Understand application behavior under load
  • Make data-driven optimization decisions

4. Cost Management

For cloud deployments, visibility into resource usage directly translates to cost awareness:

  • Identify unused or underutilized resources
  • Track resource trends over time
  • Make informed scaling decisions

Advanced Configuration Options

Custom Scrape Intervals

You can customize how often Prometheus scrapes each target:

scrape_configs:
  - job_name: 'high-frequency-metrics'
    scrape_interval: 5s  # Scrape every 5 seconds
    
  - job_name: 'low-priority-metrics'
    scrape_interval: 60s  # Scrape every minute

Retention Policies

Adjust data retention based on your needs:

# In docker-compose.yaml
command:
  - --storage.tsdb.retention.time=30d  # Keep 30 days
  - --storage.tsdb.retention.size=10GB  # Or 10GB
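To pick sensible numbers, a rough capacity estimate helps. Prometheus is commonly cited at roughly 1–2 bytes per sample on disk after compression; a back-of-envelope calculation in shell (every figure here is illustrative):

```shell
targets=4          # exporters being scraped
series=1500        # rough active series per target
interval=15        # scrape interval in seconds
bytes_per_sample=2 # conservative on-disk cost per sample
days=30

samples_per_sec=$(( targets * series / interval ))
bytes=$(( samples_per_sec * bytes_per_sample * 86400 * days ))
echo "$(( bytes / 1024 / 1024 )) MiB over ${days} days"
```

Swap in your own series counts from the Status → TSDB Stats page of the Prometheus UI to get a figure for your environment.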

Custom Dashboards

While pre-built dashboards are great for getting started, you'll likely want to create custom dashboards tailored to your specific needs. Grafana's query builder makes it straightforward to create visualizations from Prometheus metrics.

Troubleshooting Tips

If something isn't working as expected:

  1. Check container logs: docker logs <container-name>
  2. Validate the Prometheus config: promtool check config prometheus.yaml (promtool ships alongside Prometheus)
  3. Verify network connectivity: Ensure containers can communicate (they're on the same Docker network by default)
  4. Test endpoints directly: Use curl to verify exporters are responding
  5. Check Prometheus targets page: This is your single source of truth for exporter health

Security Considerations

For production deployments, consider:

  • Authentication: Set up proper authentication for Grafana and Prometheus
  • Network isolation: Use Docker networks to limit access
  • HTTPS: Configure TLS for production environments
  • Secret management: Use Docker secrets or environment files for sensitive configuration
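As one concrete example, Grafana's admin password can be injected instead of being left at the default; a sketch using Grafana's `__FILE` environment-variable convention with a Docker secret (the secret name and path are made up):

```yaml
services:
  grafana:
    environment:
      # Grafana reads the password from the file named by the __FILE variant
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password

secrets:
  grafana_admin_password:
    file: ./secrets/grafana_admin_password.txt
```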

Conclusion

Setting up a comprehensive monitoring stack doesn't have to be complicated. With Docker Compose and well-configured open-source tools, you can achieve enterprise-grade observability in minutes rather than days.

This stack provides the foundation for understanding your infrastructure's behavior, optimizing resource usage, and catching issues before they impact users. Whether you're running simple web applications or complex ML workloads, visibility is key to maintaining healthy systems.

The best part? All the tools are open-source and battle-tested at scale. Companies from startups to Fortune 500s rely on Prometheus and Grafana for monitoring, so you're using tools with proven track records.

Next Steps

  • Explore Prometheus's alerting capabilities to get notified of issues
  • Create a custom metric exporter for your own REST API
  • Add distributed tracing (Jaeger/Tempo) for complete observability

Happy monitoring! 🚀

If you found this guide helpful, feel free to check out the complete setup on my GitHub. The repository includes all configuration files and documentation to get you started quickly.