Building a Complete Docker Monitoring Stack with Prometheus & Grafana: A Hands-On Guide

How to set up enterprise-grade monitoring for your Docker infrastructure with GPU support

Aziz Hamadi
#Docker #Prometheus #Grafana #Monitoring #DevOps #MLOps #GPU

Introduction

In today's cloud-native world, visibility into your infrastructure is not just a luxury—it's a necessity. Whether you're running containerized applications, managing ML workloads with GPU resources, or simply ensuring your services stay healthy, having comprehensive monitoring in place can mean the difference between catching issues early and facing production outages.

In this article, I'll walk you through building a complete monitoring stack for Docker environments using industry-standard tools: Prometheus for metrics collection and Grafana for visualization. What makes this setup special is its comprehensive coverage—from container-level metrics to system-wide performance, and even GPU monitoring for ML/AI workloads.

The Challenge: Why Monitor Docker Environments?

Running applications in Docker containers brings numerous benefits: isolation, scalability, and portability. However, this containerization also introduces new complexities:

  • Limited visibility: Traditional monitoring tools often can't peek inside containers
  • Resource isolation: Understanding which containers are consuming resources requires specialized tools
  • GPU workloads: ML/AI applications need GPU monitoring, which traditional tools don't handle well
  • Fragmented metrics: Different components expose metrics in different ways

This is where a purpose-built monitoring stack becomes essential. By combining the right tools, we can achieve complete observability across our entire Docker infrastructure.

The Solution: A Complete Monitoring Stack

Our monitoring stack brings together five powerful components that work in harmony:

1. Prometheus - The Metrics Database

Prometheus is an open-source monitoring and alerting toolkit that's become the de facto standard for cloud-native monitoring. It collects metrics by scraping HTTP endpoints and stores them in a time-series database. Its powerful query language (PromQL) allows you to analyze metrics and create complex alerts.

Key features:

  • Pull-based metric collection
  • Powerful query language
  • Efficient time-series storage
  • Built-in service discovery

2. Grafana - The Visualization Layer

Grafana provides beautiful, customizable dashboards that transform raw metrics into actionable insights. With its extensive library of pre-built dashboards, you can get started monitoring within minutes.

Key features:

  • Rich visualization options (graphs, gauges, heatmaps)
  • Pre-built dashboard library
  • Alerting capabilities
  • Multi-data source support

3. cAdvisor - Container Metrics

Google's cAdvisor (Container Advisor) provides detailed metrics about running containers. It automatically discovers all containers on a host and collects resource usage and performance statistics.

Metrics collected:

  • CPU usage per container
  • Memory consumption
  • Network I/O statistics
  • Filesystem usage
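cAdvisor, like every exporter in this stack, serves its metrics in the Prometheus text exposition format. A self-contained sketch of what a response looks like, and a quick way to pull one value out on the command line (the metric name is real; the values and container names are made up):

```shell
# Two sample lines in the Prometheus text exposition format,
# as cAdvisor would serve them on /metrics (values are illustrative)
cat <<'EOF' > /tmp/cadvisor_sample.txt
# HELP container_memory_usage_bytes Current memory usage in bytes
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{name="grafana"} 157286400
container_memory_usage_bytes{name="prometheus"} 524288000
EOF

# Pull out the current memory usage of the "grafana" container
awk '/name="grafana"/ {print $2}' /tmp/cadvisor_sample.txt   # prints 157286400
```

In practice you would pipe `curl http://localhost:8080/metrics` through the same filter instead of a sample file.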

4. Node Exporter - Host System Metrics

The Prometheus Node Exporter exposes hardware and OS metrics of the host machine. It's essential for understanding overall system health and resource utilization.

Metrics collected:

  • CPU, memory, disk, and network utilization
  • System load averages
  • Filesystem statistics
  • Hardware temperature (where available)

5. DCGM Exporter - GPU Monitoring (Optional but Powerful)

For ML/AI workloads running on NVIDIA GPUs, the DCGM Exporter provides critical insights into GPU utilization, memory usage, and performance metrics.

Metrics collected:

  • GPU utilization percentage
  • Memory usage (used/total)
  • Temperature and power consumption
  • Performance statistics

Architecture Overview

The architecture is beautifully simple yet powerful:

┌─────────────┐
│  cAdvisor   │──┐
│  (Port 8080)│  │
└─────────────┘  │
                 │
┌─────────────┐  │    ┌─────────────┐    ┌─────────────┐
│Node Exporter│──┼───▶│ Prometheus  │───▶│   Grafana   │
│  (Port 9100)│  │    │ (Port 9090) │    │ (Port 3000) │
└─────────────┘  │    └─────────────┘    └─────────────┘
                 │
┌─────────────┐  │
│DCGM Exporter│──┘
│  (Port 9400)│
└─────────────┘

Each exporter collects metrics from its respective domain and exposes them via HTTP endpoints. Prometheus scrapes these endpoints at regular intervals (every 15 seconds by default) and stores the metrics. Grafana connects to Prometheus as a data source and visualizes the metrics through dashboards.
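That default interval comes from the global section of prometheus.yaml; a minimal fragment if you want to set it explicitly:

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated
```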

Getting Started: Installation Guide

Prerequisites

Before diving in, ensure you have:

  • Docker (v20.10+) installed
  • Docker Compose (v2.0+) installed
  • (Optional) NVIDIA Container Toolkit for GPU monitoring

Step 1: Set Up the Stack

The entire stack is defined in a single docker-compose.yaml file, making deployment straightforward. The configuration includes:

services:
  grafana:
    image: docker.io/grafana/grafana-oss:12.1.1
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  prometheus:
    image: docker.io/prom/prometheus:v3.6.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yaml
      - --storage.tsdb.retention.time=7d
      - --storage.tsdb.retention.size=2GB

volumes:
  grafana-data:
  prometheus-data:
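The snippet above shows only Grafana and Prometheus; the exporters join the same file as additional entries under services:. A minimal sketch (the image tags, mounts, and GPU device reservation are illustrative assumptions, so adjust them to your environment):

```yaml
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1   # assumed tag
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node-exporter:
    image: docker.io/prom/node-exporter:v1.8.2   # assumed tag
    restart: unless-stopped
    ports:
      - "9100:9100"

  dcgm_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04   # assumed tag
    restart: unless-stopped
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```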

Key configuration points:

  • Data persistence: Both Grafana and Prometheus use Docker volumes to persist data
  • Retention policy: Prometheus is configured to retain 7 days of data or 2GB, whichever comes first
  • Auto-restart: All services are configured to restart automatically unless stopped

Step 2: Configure Prometheus

Prometheus needs to know which endpoints to scrape. This is configured in prometheus.yaml:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'dcgm_exporter'
    static_configs:
      - targets: ['dcgm_exporter:9400']

Each job represents a different source of metrics, and Prometheus will scrape them at the configured interval (15 seconds by default).

Step 3: Deploy and Verify

Bring the stack up with docker compose up -d. Once everything is running, you can verify the deployment:

  1. Check Prometheus targets: Navigate to http://localhost:9090/targets to see all configured endpoints. All should show as "UP" (green).

Figure 1: Prometheus Targets Status

  2. Test metrics endpoints: You can directly query each exporter:
    curl http://localhost:8080/metrics  # cAdvisor
    curl http://localhost:9100/metrics  # Node Exporter
    curl http://localhost:9400/metrics  # DCGM Exporter (if enabled)
  3. Verify service health: Check that all containers are running, and restart Prometheus if any target stays down.

Figure 2: Portainer monitoring stack

Visualizing Metrics with Grafana

With Prometheus collecting metrics, Grafana brings them to life with beautiful dashboards.

Initial Setup

  1. Access Grafana at http://localhost:3000
  2. Default credentials: admin/admin (you'll be prompted to change the password)
  3. Add Prometheus as a data source:
    • Navigate to Connections → Data Sources
    • Click Add new data source
    • Select Prometheus
    • Set URL to http://prometheus:9090
    • Click Save & Test
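If you'd rather not click through the UI, Grafana can also provision the data source at startup; a minimal sketch of a provisioning file (mount it into the container under /etc/grafana/provisioning/datasources/, the filename itself is arbitrary):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://prometheus:9090   # service name on the compose network
    isDefault: true
```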

Import Pre-built Dashboards

The Grafana community has created excellent dashboards for each exporter. Here are the recommended ones:

| Dashboard            | Purpose          | Dashboard ID |
|----------------------|------------------|--------------|
| Container Monitoring | cAdvisor metrics | 19792        |
| Node Exporter Full   | System metrics   | 1860         |
| NVIDIA DCGM Exporter | GPU metrics      | 12239        |

Importing is straightforward:

  1. Click + → Import
  2. Enter the Dashboard ID
  3. Select your Prometheus data source
  4. Click Import

Dashboard Highlights

The dashboards provide comprehensive views into your infrastructure:

Figure 3: Grafana Container Dashboard

Container Metrics Dashboard shows:

  • CPU usage per container
  • Memory consumption trends
  • Network I/O statistics
  • Container restart counts

Figure 4: Grafana System Dashboard

System Metrics Dashboard provides:

  • Overall CPU, memory, and disk utilization
  • Network traffic patterns
  • Load averages
  • Filesystem usage

Figure 5: Grafana GPU Dashboard

GPU Metrics Dashboard (for ML workloads) displays:

  • GPU utilization percentages
  • Memory usage across all GPUs
  • Temperature monitoring
  • Power consumption

Real-World Benefits

This monitoring stack provides several immediate benefits:

1. Proactive Issue Detection

By continuously monitoring metrics, you can detect anomalies before they become critical. For example:

  • Sudden spikes in memory usage might indicate a memory leak
  • Increased container restart counts signal application instability
  • GPU temperature increases could indicate cooling issues
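Prometheus can watch for conditions like these on its own through alerting rules; a hedged sketch of a rule file (the threshold and duration are illustrative, and the file must be referenced via rule_files in prometheus.yaml):

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: ContainerHighMemory
        # Fires when any named container stays above ~2 GB for 5 minutes
        expr: container_memory_usage_bytes{name!=""} > 2e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} memory is above 2 GB'
```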

2. Resource Optimization

Understanding resource consumption helps you:

  • Right-size containers based on actual usage
  • Identify over-provisioned resources
  • Plan capacity more accurately

3. Performance Insights

The detailed metrics help you:

  • Identify performance bottlenecks
  • Understand application behavior under load
  • Make data-driven optimization decisions

4. Cost Management

For cloud deployments, visibility into resource usage directly translates to cost awareness:

  • Identify unused or underutilized resources
  • Track resource trends over time
  • Make informed scaling decisions

Advanced Configuration Options

Custom Scrape Intervals

You can customize how often Prometheus scrapes each target:

scrape_configs:
  - job_name: 'high-frequency-metrics'
    scrape_interval: 5s  # Scrape every 5 seconds
    
  - job_name: 'low-priority-metrics'
    scrape_interval: 60s  # Scrape every minute

Retention Policies

Adjust data retention based on your needs:

# In docker-compose.yaml
command:
  - --storage.tsdb.retention.time=30d  # Keep 30 days
  - --storage.tsdb.retention.size=10GB  # Or 10GB
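To pick sensible numbers, a rough capacity estimate helps. Prometheus is commonly cited at roughly 1–2 bytes per sample on disk after compression; a back-of-envelope calculation in shell (every figure here is illustrative):

```shell
targets=4          # exporters being scraped
series=1500        # rough active series per target
interval=15        # scrape interval in seconds
bytes_per_sample=2 # conservative on-disk cost per sample
days=30

samples_per_sec=$(( targets * series / interval ))
bytes=$(( samples_per_sec * bytes_per_sample * 86400 * days ))
echo "$(( bytes / 1024 / 1024 )) MiB over ${days} days"
```

Swap in your own series counts from the Status → TSDB Stats page of the Prometheus UI to get a figure for your environment.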

Custom Dashboards

While pre-built dashboards are great for getting started, you'll likely want to create custom dashboards tailored to your specific needs. Grafana's query builder makes it straightforward to create visualizations from Prometheus metrics.

Troubleshooting Tips

If something isn't working as expected:

  1. Check container logs: docker logs <container-name>
  2. Validate the Prometheus config: promtool check config prometheus.yaml (promtool ships alongside Prometheus)
  3. Verify network connectivity: Ensure containers can communicate (they're on the same Docker network by default)
  4. Test endpoints directly: Use curl to verify exporters are responding
  5. Check Prometheus targets page: This is your single source of truth for exporter health

Security Considerations

For production deployments, consider:

  • Authentication: Set up proper authentication for Grafana and Prometheus
  • Network isolation: Use Docker networks to limit access
  • HTTPS: Configure TLS for production environments
  • Secret management: Use Docker secrets or environment files for sensitive configuration
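As one concrete example, Grafana's admin password can be injected instead of being left at the default; a sketch using Grafana's `__FILE` environment-variable convention with a Docker secret (the secret name and path are made up):

```yaml
services:
  grafana:
    environment:
      # Grafana reads the password from the file named by the __FILE variant
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password

secrets:
  grafana_admin_password:
    file: ./secrets/grafana_admin_password.txt
```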

Conclusion

Setting up a comprehensive monitoring stack doesn't have to be complicated. With Docker Compose and well-configured open-source tools, you can achieve enterprise-grade observability in minutes rather than days.

This stack provides the foundation for understanding your infrastructure's behavior, optimizing resource usage, and catching issues before they impact users. Whether you're running simple web applications or complex ML workloads, visibility is key to maintaining healthy systems.

The best part? All the tools are open-source and battle-tested at scale. Companies from startups to Fortune 500s rely on Prometheus and Grafana for monitoring, so you're using tools with proven track records.

Next Steps

  • Explore Prometheus's alerting capabilities to get notified of issues
  • Create a custom metric exporter for your own REST API
  • Add distributed tracing (Jaeger/Tempo) for complete observability

Happy monitoring! 🚀

If you found this guide helpful, feel free to check out the complete setup on my GitHub. The repository includes all configuration files and documentation to get you started quickly.