Prometheus & Grafana Monitoring
Set up application monitoring and alerting with Prometheus and Grafana dashboards.
Prompt
Set up a complete Prometheus and Grafana monitoring solution with custom metrics, alerts, and dashboards for the following application:
Application Details
Application Overview
- Application Name: [e.g., MyAPI, E-commerce Platform, Microservices]
- Application Type: [REST API / GraphQL / Web App / Microservices / Batch Jobs]
- Tech Stack: [Node.js / Python / Go / Java / .NET / Multiple]
- Deployment: [Kubernetes / Docker / VMs / Serverless / Bare metal]
- Number of Services: [Single service / 3-5 services / 10+ microservices]
Infrastructure
- Environment: [Production / Staging / Development / All]
- Hosting: [AWS / GCP / Azure / On-premise / Hybrid]
- Load Balancer: [ALB / Nginx / HAProxy / Traefik / None]
- Database: [PostgreSQL / MySQL / MongoDB / Redis / Multiple]
- Message Queue: [RabbitMQ / Kafka / SQS / None]
- Cache: [Redis / Memcached / None]
Metrics Requirements
Application Metrics
Define custom metrics to track:
Metric 1: [Name, e.g., HTTP Requests]
- Metric Type: [Counter / Gauge / Histogram / Summary]
- Purpose: [What this metric measures]
- Labels: [method, route, status, etc.]
- Scrape Interval: [15s / 30s / 1m; configured per scrape job, not per metric]
- Retention: [15d / 30d / 90d; applies to the whole Prometheus server]
Metric 2: [Name, e.g., Database Query Duration]
- Metric Type: [Type]
- Purpose: [Purpose]
- Labels: [Labels list]
- Scrape Interval: [Interval]
- Retention: [Retention period]
Metric 3: [Name, e.g., Active Users]
- Metric Type: [Type]
- Purpose: [Purpose]
- Labels: [Labels]
- Scrape Interval: [Interval]
- Retention: [Retention]
[Define 5-15 custom metrics]
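To make the fields above concrete, here is a minimal sketch of how one filled-in metric entry (name, type, labels) maps onto the Prometheus text exposition format served from a `/metrics` endpoint. The `MetricSpec` class is hypothetical and uses only the standard library; a real service would normally use the official client library for its language instead.

```python
from dataclasses import dataclass, field


@dataclass
class MetricSpec:
    """One filled-in 'Metric N' entry (hypothetical structure for illustration)."""
    name: str
    metric_type: str              # "counter" or "gauge" in this sketch
    help_text: str
    label_names: tuple
    samples: dict = field(default_factory=dict)  # label values -> current value

    def inc(self, *label_values, amount=1.0):
        """Add `amount` to the sample identified by the given label values."""
        if len(label_values) != len(self.label_names):
            raise ValueError("label values must match label names")
        self.samples[label_values] = self.samples.get(label_values, 0.0) + amount

    def render(self):
        """Render HELP/TYPE lines plus one line per label combination.

        Simplification: label values are not escaped (no quotes or backslashes).
        """
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} {self.metric_type}"]
        for values, value in sorted(self.samples.items()):
            pairs = ",".join(f'{k}="{v}"' for k, v in zip(self.label_names, values))
            label_part = f"{{{pairs}}}" if pairs else ""
            lines.append(f"{self.name}{label_part} {value:g}")
        return "\n".join(lines) + "\n"


http_requests = MetricSpec("http_requests_total", "counter",
                           "Total HTTP requests.", ("method", "route", "status"))
http_requests.inc("GET", "/users", "200")
http_requests.inc("GET", "/users", "200")
http_requests.inc("POST", "/orders", "500")
print(http_requests.render())
```

Each distinct label combination becomes its own time series, which is why label values should come from small, bounded sets (method, route template, status) rather than user IDs or raw URLs.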
Business Metrics
- Revenue/Transactions: [Track revenue, orders, conversions, etc.]
- User Activity: [Active users, sessions, signups, etc.]
- Feature Usage: [Feature adoption, API usage, etc.]
- SLA Metrics: [Uptime, availability, success rate, etc.]
Infrastructure Metrics
- CPU Usage: [Per service / Per container / Per node]
- Memory Usage: [Heap, RSS, container limits]
- Disk I/O: [Read/write operations, latency]
- Network: [Bandwidth, connections, errors]
- Container Metrics: [If using Docker/Kubernetes]
External Dependencies
- Database Metrics: [Connection pool, query performance, replication lag]
- Cache Metrics: [Hit rate, evictions, memory usage]
- Message Queue: [Queue depth, processing rate, lag]
- Third-party APIs: [Response time, error rate, rate limits]
Scrape Targets
Service 1: [ServiceName]
- Endpoint: [http://service1:3000/metrics]
- Scrape Interval: [15s / 30s / 1m]
- Scrape Timeout: [10s / 30s; must not exceed the scrape interval]
- Metrics Exposed: [List key metrics]
- Service Discovery: [Static / Kubernetes / Consul / EC2]
Service 2: [ServiceName]
- Endpoint: [Metrics endpoint]
- Scrape Interval: [Interval]
- Scrape Timeout: [Timeout]
- Metrics Exposed: [Metrics list]
- Service Discovery: [Discovery method]
[Define 1-20 scrape targets]
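Before turning the target list into scrape configs, it can help to sanity-check it. The sketch below checks two rules Prometheus enforces when loading its configuration: a job's scrape timeout may not be greater than its scrape interval, and job names must be unique. The dictionary keys and service names are assumptions chosen for this example, and the duration parser only handles simple single-unit values such as `15s` or `1m`.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a single-unit duration such as '15s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_targets(targets):
    """Return a list of human-readable problems found in the target list."""
    problems = []
    seen_jobs = set()
    for target in targets:
        job = target["job"]
        if parse_duration(target["scrape_timeout"]) > parse_duration(target["scrape_interval"]):
            problems.append(f"{job}: scrape_timeout exceeds scrape_interval")
        if job in seen_jobs:
            problems.append(f"{job}: duplicate job name")
        seen_jobs.add(job)
    return problems


# Hypothetical targets matching the template fields above.
targets = [
    {"job": "service1", "endpoint": "http://service1:3000/metrics",
     "scrape_interval": "15s", "scrape_timeout": "10s"},
    {"job": "service2", "endpoint": "http://service2:3000/metrics",
     "scrape_interval": "15s", "scrape_timeout": "30s"},
]
for problem in check_targets(targets):
    print(problem)
```

Here `service2` is flagged because a 30s timeout cannot fit inside a 15s interval.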
Exporters
- Node Exporter: [For system metrics on all nodes]
- cAdvisor: [For container metrics if using Docker]
- PostgreSQL Exporter: [If using PostgreSQL]
- Redis Exporter: [If using Redis]
- Custom Exporters: [Any custom exporters needed]
Alert Rules
Critical Alerts (Page immediately)
Alert 1: [Name, e.g., High Error Rate]
- Condition: [sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05]
- Duration: [5m / 10m / 15m]
- Severity: [critical]
- Description: [What this alert means]
- Runbook: [Steps to investigate/resolve]
- Notification: [PagerDuty / Slack / Email / SMS]
Alert 2: [Name, e.g., Service Down]
- Condition: [Alert condition]
- Duration: [Duration]
- Severity: [critical]
- Description: [Description]
- Runbook: [Runbook link or steps]
- Notification: [Notification channel]
[Define 3-8 critical alerts]
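When writing an error-rate condition, note that a threshold like 0.05 can mean either "0.05 errors per second" or "5% of requests", and the two behave very differently under load. The worked arithmetic below uses made-up counter readings taken 5 minutes apart to show the difference; it ignores counter resets, which PromQL's `rate()` handles for you.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter between two samples.

    Simplification: assumes the counter did not reset inside the window.
    """
    return (last - first) / window_seconds


WINDOW = 300  # seconds, i.e. a [5m] range

# Hypothetical counter readings at the start and end of the window.
total_rate = per_second_rate(120_000, 150_000, WINDOW)  # all requests
error_rate = per_second_rate(400, 1_000, WINDOW)        # 5xx responses only
error_ratio = error_rate / total_rate

print(f"requests/s: {total_rate:g}")
print(f"errors/s:   {error_rate:g}")
print(f"error ratio: {error_ratio:.2%}")

THRESHOLD = 0.05
print("absolute condition fires:", error_rate > THRESHOLD)
print("ratio condition fires:   ", error_ratio > THRESHOLD)
```

With these numbers the service handles 100 requests per second with 2 errors per second, so an absolute threshold of 0.05 would page even though only 2% of requests fail. Dividing by the total request rate keeps the alert meaningful at any traffic level.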
Warning Alerts (Notify but don't page)
Alert 1: [Name, e.g., High Memory Usage]
- Condition: [Memory usage > 80%]
- Duration: [15m]
- Severity: [warning]
- Description: [What this means]
- Notification: [Slack / Email]
Alert 2: [Name, e.g., Slow Requests]
- Condition: [P95 latency > 2s]
- Duration: [10m]
- Severity: [warning]
- Description: [Description]
- Notification: [Channel]
[Define 5-10 warning alerts]
Info Alerts (For awareness)
- Deployment Events: [Track deployments]
- Scaling Events: [Auto-scaling triggers]
- Backup Status: [Backup completion/failures]
Grafana Dashboards
Dashboard 1: [Name, e.g., Application Overview]
- Purpose: [High-level application health]
- Refresh Rate: [5s / 10s / 30s / 1m]
- Time Range: [Last 1h / 6h / 24h]
Panels:
- Request Rate
  - Visualization: [Graph / Stat / Gauge]
  - Query: [rate(http_requests_total[5m])]
  - Description: [Requests per second]
- Error Rate
  - Visualization: [Graph / Stat]
  - Query: [Error rate query]
  - Description: [Percentage of failed requests]
- Response Time (P50, P95, P99)
  - Visualization: [Graph]
  - Query: [Histogram quantiles]
  - Description: [Latency percentiles]
- Active Connections
  - Visualization: [Gauge]
  - Query: [Current connections]
  - Description: [Active connections]
[Define 6-12 panels]
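For the latency percentile panels, it is worth knowing that percentiles derived from a Prometheus histogram are estimates: the quantile is located in a bucket and then linearly interpolated between that bucket's bounds. The sketch below illustrates the idea in plain Python with invented bucket counts; it simplifies edge cases and is not a reimplementation of PromQL's `histogram_quantile()`.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds (cumulative counts).
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, duration_buckets)
print(f"estimated P95: {p95:.3f}s")
```

In this example the 95th request falls in the 0.5s to 1.0s bucket, so the estimate lands between those bounds. Bucket boundaries should therefore be chosen close to the latency thresholds you intend to alert on.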
Dashboard 2: [Name, e.g., Infrastructure Metrics]
- Purpose: [System resource monitoring]
- Refresh Rate: [Rate]
- Time Range: [Range]
Panels:
- CPU Usage per service
- Memory Usage per service
- Disk I/O
- Network traffic
- Container metrics (if applicable)
[Define 4-8 panels]
Dashboard 3: [Name, e.g., Business Metrics]
- Purpose: [Business KPIs]
- Refresh Rate: [Rate]
- Time Range: [Range]
Panels:
- Revenue/Transactions
- Active Users
- Conversion Rate
- Feature Usage
[Define 3-6 panels]
[Define 3-6 dashboards total]
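A dashboard specification like the ones above ultimately becomes a Grafana dashboard JSON document. The sketch below builds a small one from the template fields (title, refresh rate, time range, panels with a query each). The `panel` helper and the example queries are assumptions for illustration, and version-dependent fields such as `schemaVersion` and the datasource reference are deliberately left out.

```python
import json


def panel(panel_id, title, expr, x, y, panel_type="timeseries"):
    """Build one panel with a single Prometheus query (hypothetical helper)."""
    return {
        "id": panel_id,
        "type": panel_type,
        "title": title,
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},
        "targets": [{"refId": "A", "expr": expr}],
    }


dashboard = {
    "title": "Application Overview",
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"},
    "panels": [
        panel(1, "Request Rate", "sum(rate(http_requests_total[5m]))", 0, 0),
        panel(2, "Error Rate",
              'sum(rate(http_requests_total{status=~"5.."}[5m]))'
              " / sum(rate(http_requests_total[5m]))", 12, 0),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Generating dashboards from code like this keeps panel layout and queries reviewable in version control, which fits the automated provisioning this template asks for.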
Alerting Channels
Notification Channels
- Slack: [Webhook URL for #alerts channel]
- PagerDuty: [Integration key for critical alerts]
- Email: [Distribution list for warnings]
- Webhook: [Custom webhook for integrations]
- OpsGenie: [If using OpsGenie]
Alert Routing
- Critical Alerts: [PagerDuty + Slack]
- Warning Alerts: [Slack + Email]
- Info Alerts: [Slack only]
- Business Hours: [Different routing for off-hours?]
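The routing table above can be read as an ordered list of label matchers, each mapped to a set of receivers. The sketch below models that in plain Python to make the intended behavior explicit; the receiver names are placeholders, and the first-match rule mirrors how Alertmanager evaluates sibling routes when `continue` is not set.

```python
# Ordered routes: (label matchers, receivers). Names are placeholders.
ROUTES = [
    ({"severity": "critical"}, ["pagerduty", "slack"]),
    ({"severity": "warning"}, ["slack", "email"]),
    ({"severity": "info"}, ["slack"]),
]
DEFAULT_RECEIVERS = ["slack"]


def receivers_for(alert_labels):
    """Return the receivers of the first route whose matchers all match."""
    for matchers, receivers in ROUTES:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receivers
    return DEFAULT_RECEIVERS


print(receivers_for({"alertname": "HighErrorRate", "severity": "critical"}))
print(receivers_for({"alertname": "HighMemoryUsage", "severity": "warning"}))
print(receivers_for({"alertname": "Unlabeled"}))
```

Alerts without a recognized `severity` label fall through to the default, so every alert rule should set that label explicitly.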
Recording Rules
For frequently used queries:
Rule 1: [Name, e.g., Request Rate by Service]
- Query: [sum by (service) (rate(http_requests_total[5m]))]
- Interval: [1m]
- Purpose: [Pre-compute for dashboards]
Rule 2: [Name, e.g., Error Rate]
- Query: [Error rate calculation]
- Interval: [Interval]
- Purpose: [Purpose]
[Define 3-8 recording rules]
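Recording rules are usually named with the `level:metric:operations` convention from the Prometheus documentation, for example `job:http_requests:rate5m`. The sketch below builds rule names that way and renders a rule group as YAML text using only the standard library. The helper functions and the group name are hypothetical, and the output should still be validated with your usual tooling before use.

```python
import json


def recording_rule(level, metric, operations, expr):
    """Build one rule named level:metric:operations."""
    return {"record": f"{level}:{metric}:{operations}", "expr": expr}


def render_group(name, interval, rules):
    """Render a single rule group as YAML text.

    Expressions are written as double-quoted scalars via json.dumps, which
    yields valid YAML string syntax for these inputs.
    """
    lines = ["groups:", f"  - name: {name}", f"    interval: {interval}", "    rules:"]
    for rule in rules:
        lines.append(f"      - record: {rule['record']}")
        lines.append(f"        expr: {json.dumps(rule['expr'])}")
    return "\n".join(lines) + "\n"


request_rate = recording_rule(
    "job", "http_requests", "rate5m",
    "sum by (job) (rate(http_requests_total[5m]))",
)
rules_text = render_group("http_recording_rules", "1m", [request_rate])
print(rules_text)
```

Dashboards and alerts can then query `job:http_requests:rate5m` directly instead of recomputing the aggregation on every refresh.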
Data Retention & Storage
Prometheus Storage
- Retention Period: [15d / 30d / 90d]
- Storage Size: [Estimated size needed]
- Disk Type: [SSD / HDD]
- Backup Strategy: [Snapshots / Remote storage / None]
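To fill in the storage size, a rough estimate follows the formula given in the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample (typically around 1 to 2 bytes after compression). The sketch below applies it; the series count and scrape interval are made-up inputs, and the result is a planning figure, not a guarantee.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough on-disk size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 100,000 active series scraped every 15s, kept for 30 days.
needed = estimate_disk_bytes(active_series=100_000, scrape_interval_s=15,
                             retention_days=30)
print(f"estimated storage: {needed / 1e9:.1f} GB")
```

For this workload the estimate comes to roughly 35 GB. Leave headroom on top of it for the write-ahead log, compaction, and series churn.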
Long-term Storage
- Remote Write: [Thanos / Cortex / VictoriaMetrics / None]
- Retention: [1y / 2y / Indefinite]
- Downsampling: [After 30d / 90d / None]
High Availability
Prometheus HA
- Multiple Instances: [2+ Prometheus instances]
- Federation: [Federate from multiple Prometheus instances]
- Alertmanager Clustering: [Clustered Alertmanager]
Grafana HA
- Load Balancing: [Multiple Grafana instances]
- Shared Database: [PostgreSQL / MySQL for dashboards]
- Session Storage: [Redis / Database]
Security
Authentication
- Prometheus: [Basic auth / OAuth / None]
- Grafana: [OAuth / LDAP / Built-in / SSO]
- Alertmanager: [Basic auth / None]
Network Security
- Firewall Rules: [Restrict access to monitoring stack]
- TLS/SSL: [Enable HTTPS for all components]
- API Keys: [Secure API access]
Code Generation Requirements
Generate a complete monitoring solution including:
- Prometheus Configuration:
  - prometheus.yml with all scrape configs
  - Service discovery configurations
  - Recording rules
  - Alert rules
  - Remote write configuration (if needed)
- Application Instrumentation:
  - Metrics middleware for the application
  - Custom metric definitions (Counter, Gauge, Histogram)
  - Metrics endpoint implementation
  - Proper labeling strategy
  - Language-specific best practices
- Alert Rules:
  - alerts.yml with all alert definitions
  - Proper grouping and severity levels
  - Meaningful annotations and descriptions
  - Runbook links
  - Alert routing configuration
- Alertmanager Configuration:
  - alertmanager.yml with routing rules
  - Notification channel configurations
  - Inhibition rules
  - Grouping and throttling
- Grafana Dashboards:
  - JSON dashboard definitions for all dashboards
  - Proper panel configurations
  - Template variables for filtering
  - Annotations for deployments
  - Alert visualizations
- Exporters Configuration:
  - Node exporter setup
  - Database exporter configurations
  - Custom exporter implementations (if needed)
- Docker Compose / Kubernetes Manifests:
  - Complete deployment configuration
  - Prometheus deployment
  - Grafana deployment
  - Alertmanager deployment
  - Exporter deployments
  - Persistent volume configurations
- Documentation:
  - Setup and installation guide
  - Metrics catalog (all metrics documented)
  - Alert runbooks
  - Dashboard usage guide
  - Troubleshooting guide
- Scripts and Utilities:
  - Backup and restore scripts
  - Health check scripts
  - Alert testing scripts
  - Dashboard provisioning scripts
Output production-ready monitoring infrastructure following best practices with:
- Comprehensive metric coverage (RED/USE methodology)
- Actionable alerts with clear severity levels
- Informative dashboards with proper visualizations
- Proper metric naming conventions
- Efficient recording rules for complex queries
- High availability configuration
- Secure authentication and authorization
- Automated dashboard provisioning
- Alert routing and escalation
- Long-term storage strategy
- Clear documentation and runbooks