Building a Complete Docker Monitoring Stack with Prometheus & Grafana: A Hands-On Guide
How to set up enterprise-grade monitoring for your Docker infrastructure with GPU support
Introduction
In today's cloud-native world, visibility into your infrastructure is not just a luxury—it's a necessity. Whether you're running containerized applications, managing ML workloads with GPU resources, or simply ensuring your services stay healthy, having comprehensive monitoring in place can mean the difference between catching issues early and facing production outages.
In this article, I'll walk you through building a complete monitoring stack for Docker environments using industry-standard tools: Prometheus for metrics collection and Grafana for visualization. What makes this setup special is its comprehensive coverage—from container-level metrics to system-wide performance, and even GPU monitoring for ML/AI workloads.
The Challenge: Why Monitor Docker Environments?
Running applications in Docker containers brings numerous benefits: isolation, scalability, and portability. However, this containerization also introduces new complexities:
- Limited visibility: Traditional monitoring tools often can't peek inside containers
- Resource isolation: Understanding which containers are consuming resources requires specialized tools
- GPU workloads: ML/AI applications need GPU monitoring, which traditional tools don't handle well
- Fragmented metrics: Different components expose metrics in different ways
This is where a purpose-built monitoring stack becomes essential. By combining the right tools, we can achieve complete observability across our entire Docker infrastructure.
The Solution: A Complete Monitoring Stack
Our monitoring stack brings together five powerful components that work in harmony:
1. Prometheus - The Metrics Database
Prometheus is an open-source monitoring and alerting toolkit that's become the de facto standard for cloud-native monitoring. It collects metrics by scraping HTTP endpoints and stores them in a time-series database. Its powerful query language (PromQL) allows you to analyze metrics and create complex alerts.
Key features:
- Pull-based metric collection
- Powerful query language
- Efficient time-series storage
- Built-in service discovery
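As a small taste of PromQL, here are two queries you could run in the Prometheus UI once the stack is up; the metric names come from Node Exporter and cAdvisor respectively:

```promql
# Per-core CPU usage rate over the last 5 minutes (Node Exporter)
rate(node_cpu_seconds_total{mode!="idle"}[5m])

# Memory working set per container (cAdvisor)
sum by (name) (container_memory_working_set_bytes{name!=""})
```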
2. Grafana - The Visualization Layer
Grafana provides beautiful, customizable dashboards that transform raw metrics into actionable insights. With its extensive library of pre-built dashboards, you can get started monitoring within minutes.
Key features:
- Rich visualization options (graphs, gauges, heatmaps)
- Pre-built dashboard library
- Alerting capabilities
- Multi-data source support
3. cAdvisor - Container Metrics
Google's cAdvisor (Container Advisor) provides detailed metrics about running containers. It automatically discovers all containers on a host and collects resource usage and performance statistics.
Metrics collected:
- CPU usage per container
- Memory consumption
- Network I/O statistics
- Filesystem usage
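A minimal cAdvisor service definition for the Compose file might look like the sketch below; the read-only host mounts follow cAdvisor's documented run command, and the image tag is one current release (pin whatever version you've tested):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.49.1
  ports:
    - "8080:8080"
  volumes:
    - /:/rootfs:ro                      # host filesystem stats
    - /var/run:/var/run:ro              # Docker socket access
    - /sys:/sys:ro                      # cgroup and hardware info
    - /var/lib/docker/:/var/lib/docker:ro
  restart: unless-stopped
```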
4. Node Exporter - Host System Metrics
The Prometheus Node Exporter exposes hardware and OS metrics of the host machine. It's essential for understanding overall system health and resource utilization.
Metrics collected:
- CPU, memory, disk, and network utilization
- System load averages
- Filesystem statistics
- Hardware temperature (where available)
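For completeness, a Node Exporter service could be sketched like this; the `--path.*` flags point the exporter at the host's `/proc` and `/sys` mounted inside the container (image tag illustrative):

```yaml
node-exporter:
  image: docker.io/prom/node-exporter:v1.8.2
  ports:
    - "9100:9100"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - --path.procfs=/host/proc
    - --path.sysfs=/host/sys
    - --path.rootfs=/rootfs
  restart: unless-stopped
```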
5. DCGM Exporter - GPU Monitoring (Optional but Powerful)
For ML/AI workloads running on NVIDIA GPUs, the DCGM Exporter provides critical insights into GPU utilization, memory usage, and performance metrics.
Metrics collected:
- GPU utilization percentage
- Memory usage (used/total)
- Temperature and power consumption
- Performance statistics
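Running the DCGM Exporter in Compose requires handing the container a GPU. With the NVIDIA Container Toolkit installed, a sketch using Compose's device-reservation syntax might look like this (the image tag is an assumption; check NVIDIA's registry for the current one):

```yaml
dcgm_exporter:
  image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
  ports:
    - "9400:9400"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all              # expose every GPU on the host
            capabilities: [gpu]
  restart: unless-stopped
```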
Architecture Overview
The architecture is beautifully simple yet powerful:
```
┌─────────────┐
│  cAdvisor   │──┐
│ (Port 8080) │  │
└─────────────┘  │
                 │
┌─────────────┐  │   ┌─────────────┐    ┌─────────────┐
│Node Exporter│──┼──▶│ Prometheus  │───▶│   Grafana   │
│ (Port 9100) │  │   │ (Port 9090) │    │ (Port 3000) │
└─────────────┘  │   └─────────────┘    └─────────────┘
                 │
┌─────────────┐  │
│DCGM Exporter│──┘
│ (Port 9400) │
└─────────────┘
```
Each exporter collects metrics from its respective domain and exposes them via HTTP endpoints. Prometheus scrapes these endpoints at regular intervals (every 15 seconds by default) and stores the metrics. Grafana connects to Prometheus as a data source and visualizes the metrics through dashboards.
Getting Started: Installation Guide
Prerequisites
Before diving in, ensure you have:
- Docker (v20.10+) installed
- Docker Compose (v2.0+) installed
- (Optional) NVIDIA Container Toolkit for GPU monitoring
Step 1: Set Up the Stack
The entire stack is defined in a single docker-compose.yaml file, making deployment straightforward. The configuration includes:
```yaml
services:
  grafana:
    image: docker.io/grafana/grafana-oss:12.1.1
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  prometheus:
    image: docker.io/prom/prometheus:v3.6.0
    ports:
      - "9090:9090"
    command:
      - --storage.tsdb.retention.time=7d
      - --storage.tsdb.retention.size=2GB
```
Key configuration points:
- Data persistence: Both Grafana and Prometheus use Docker volumes to persist data
- Retention policy: Prometheus is configured to retain 7 days of data or 2GB, whichever comes first
- Auto-restart: All services are configured to restart automatically unless stopped
Step 2: Configure Prometheus
Prometheus needs to know which endpoints to scrape. This is configured in prometheus.yaml:
```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'dcgm_exporter'
    static_configs:
      - targets: ['dcgm_exporter:9400']
```
Each job represents a different source of metrics, and Prometheus will scrape them at the configured interval (15 seconds by default).
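The 15-second default can be made explicit, or changed globally, with a `global` block at the top of `prometheus.yaml`:

```yaml
global:
  scrape_interval: 15s      # How often to scrape each target
  evaluation_interval: 15s  # How often to evaluate alerting/recording rules
```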
Step 3: Deploy and Verify
Once deployed, you can verify everything is working:
- Check Prometheus targets: Navigate to `http://localhost:9090/targets` to see all configured endpoints. All should show as "UP" (green).

Figure 1: Prometheus Targets Status

- Test metrics endpoints: You can query each exporter directly:

```shell
curl http://localhost:8080/metrics  # cAdvisor
curl http://localhost:9100/metrics  # Node Exporter
curl http://localhost:9400/metrics  # DCGM Exporter (if enabled)
```

- Verify service health: Check that all containers are running, and restart Prometheus if any target is down.
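For reference, bringing the stack up and checking on it is only a couple of commands, assuming the `docker-compose.yaml` sits in the current directory:

```shell
docker compose up -d                 # Start all services in the background
docker compose ps                    # Confirm every container shows as running
docker compose logs -f prometheus    # Tail Prometheus logs if a target is down
```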
Figure 2: Portainer monitoring stack
Visualizing Metrics with Grafana
With Prometheus collecting metrics, Grafana brings them to life with beautiful dashboards.
Initial Setup
- Access Grafana at `http://localhost:3000`
- Log in with the default credentials `admin` / `admin` (you'll be prompted to change the password)
- Add Prometheus as a data source:
  - Navigate to Connections → Data Sources
  - Click Add new data source
  - Select Prometheus
  - Set the URL to `http://prometheus:9090`
  - Click Save & Test
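If you'd rather skip the manual clicks, Grafana can provision the data source from a file mounted into `/etc/grafana/provisioning/datasources/`. A minimal sketch (file name illustrative):

```yaml
# datasource.yaml — mount into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # service name on the Compose network
    isDefault: true
```

With this file in place, the data source exists on first boot, which is handy for reproducible setups.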
Import Pre-built Dashboards
The Grafana community has created excellent dashboards for each exporter. Here are the recommended ones:
| Dashboard | Purpose | Dashboard ID |
|---|---|---|
| Container Monitoring | cAdvisor metrics | 19792 |
| Node Exporter Full | System metrics | 1860 |
| NVIDIA DCGM Exporter | GPU metrics | 12239 |
Importing is straightforward:
- Click + → Import
- Enter the Dashboard ID
- Select your Prometheus data source
- Click Import
Dashboard Highlights
The dashboards provide comprehensive views into your infrastructure:
Figure 3: Grafana Container Dashboard
Container Metrics Dashboard shows:
- CPU usage per container
- Memory consumption trends
- Network I/O statistics
- Container restart counts
Figure 4: Grafana System Dashboard
System Metrics Dashboard provides:
- Overall CPU, memory, and disk utilization
- Network traffic patterns
- Load averages
- Filesystem usage
Figure 5: Grafana GPU Dashboard
GPU Metrics Dashboard (for ML workloads) displays:
- GPU utilization percentages
- Memory usage across all GPUs
- Temperature monitoring
- Power consumption
Real-World Benefits
This monitoring stack provides several immediate benefits:
1. Proactive Issue Detection
By continuously monitoring metrics, you can detect anomalies before they become critical. For example:
- Sudden spikes in memory usage might indicate a memory leak
- Increased container restart counts signal application instability
- GPU temperature increases could indicate cooling issues
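To make the first point concrete, here is what a Prometheus alerting rule for high host memory usage might look like; the file name, threshold, and durations are illustrative, and the metrics are standard Node Exporter ones:

```yaml
# alerts.yaml — referenced from prometheus.yaml via rule_files
groups:
  - name: host-health
    rules:
      - alert: HighMemoryUsage
        # Fraction of memory in use, derived from Node Exporter metrics
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Host memory usage above 90% for 10 minutes"
```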
2. Resource Optimization
Understanding resource consumption helps you:
- Right-size containers based on actual usage
- Identify over-provisioned resources
- Plan capacity more accurately
3. Performance Insights
The detailed metrics help you:
- Identify performance bottlenecks
- Understand application behavior under load
- Make data-driven optimization decisions
4. Cost Management
For cloud deployments, visibility into resource usage directly translates to cost awareness:
- Identify unused or underutilized resources
- Track resource trends over time
- Make informed scaling decisions
Advanced Configuration Options
Custom Scrape Intervals
You can customize how often Prometheus scrapes each target:
```yaml
scrape_configs:
  - job_name: 'high-frequency-metrics'
    scrape_interval: 5s   # Scrape every 5 seconds

  - job_name: 'low-priority-metrics'
    scrape_interval: 60s  # Scrape every minute
```
Retention Policies
Adjust data retention based on your needs:
```yaml
# In docker-compose.yaml
command:
  - --storage.tsdb.retention.time=30d   # Keep 30 days
  - --storage.tsdb.retention.size=10GB  # Or 10GB, whichever limit is hit first
```
Custom Dashboards
While pre-built dashboards are great for getting started, you'll likely want to create custom dashboards tailored to your specific needs. Grafana's query builder makes it straightforward to create visualizations from Prometheus metrics.
Troubleshooting Tips
If something isn't working as expected:
- Check container logs: `docker logs <container-name>`
- Verify network connectivity: Ensure containers can communicate (they're on the same Docker network by default)
- Test endpoints directly: Use `curl` to verify exporters are responding
- Check the Prometheus targets page: This is your single source of truth for exporter health
Security Considerations
For production deployments, consider:
- Authentication: Set up proper authentication for Grafana and Prometheus
- Network isolation: Use Docker networks to limit access
- HTTPS: Configure TLS for production environments
- Secret management: Use Docker secrets or environment files for sensitive configuration
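As one small example of the secrets point, Grafana supports a `__FILE` suffix on its environment variables, so the admin password can come from a Docker secret instead of living in the Compose file (password file path illustrative):

```yaml
grafana:
  image: docker.io/grafana/grafana-oss:12.1.1
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    # __FILE tells Grafana to read the value from the secret file
    - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
  secrets:
    - grafana_admin_password

secrets:
  grafana_admin_password:
    file: ./grafana_admin_password.txt
```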
Conclusion
Setting up a comprehensive monitoring stack doesn't have to be complicated. With Docker Compose and well-configured open-source tools, you can achieve enterprise-grade observability in minutes rather than days.
This stack provides the foundation for understanding your infrastructure's behavior, optimizing resource usage, and catching issues before they impact users. Whether you're running simple web applications or complex ML workloads, visibility is key to maintaining healthy systems.
The best part? All the tools are open-source and battle-tested at scale. Companies from startups to Fortune 500s rely on Prometheus and Grafana for monitoring, so you're using tools with proven track records.
Next Steps
- Explore Prometheus's alerting capabilities (and Alertmanager) to get notified of issues
- Create a custom metrics exporter for your own REST API
- Add distributed tracing (Jaeger/Tempo) for complete observability
Happy monitoring! 🚀
If you found this guide helpful, feel free to check out the complete setup on my GitHub. The repository includes all configuration files and documentation to get you started quickly.