Prometheus & Grafana Monitoring

Set up a complete Prometheus and Grafana monitoring solution with custom metrics, alerts, and dashboards for the following application:

Application Details

Application Overview

Application Name: [e.g., MyAPI, E-commerce Platform, Microservices]
Application Type: [REST API / GraphQL / Web App / Microservices / Batch Jobs]
Tech Stack: [Node.js / Python / Go / Java / .NET / Multiple]
Deployment: [Kubernetes / Docker / VMs / Serverless / Bare metal]
Number of Services: [Single service / 3-5 services / 10+ microservices]

Infrastructure

Environment: [Production / Staging / Development / All]
Hosting: [AWS / GCP / Azure / On-premises / Hybrid]
Load Balancer: [ALB / Nginx / HAProxy / Traefik / None]
Database: [PostgreSQL / MySQL / MongoDB / Redis / Multiple]
Message Queue: [RabbitMQ / Kafka / SQS / None]
Cache: [Redis / Memcached / None]

Metrics Requirements

Application Metrics

Define custom metrics to track:

Metric 1: [Name, e.g., HTTP Requests]

Metric Type: [Counter / Gauge / Histogram / Summary]
Purpose: [What this metric measures]
Labels: [method, route, status, etc.]
Scrape Interval: [15s / 30s / 1m; set per scrape job, not per metric]
Retention: [15d / 30d / 90d; set per Prometheus server, not per metric]

Metric 2: [Name, e.g., Database Query Duration]

Metric Type: [Type]
Purpose: [Purpose]
Labels: [Labels list]
Scrape Interval: [Interval]
Retention: [Retention period]

Metric 3: [Name, e.g., Active Users]

Metric Type: [Type]
Purpose: [Purpose]
Labels: [Labels]
Scrape Interval: [Interval]
Retention: [Retention]

[Define 5-15 custom metrics]
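
For orientation, each metric defined above ends up as a few lines on the service's /metrics endpoint in the Prometheus text exposition format. The sketch below renders one counter by hand using only the Python standard library. The function name `render_counter` and the sample values are illustrative assumptions; a real service would normally use an official client library, which also handles label escaping.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to the counter value.
    Label values are not escaped here, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: one series of the "HTTP Requests" metric.
text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("route", "/users"), ("status", "200")): 1027},
)
print(text)
```

Every distinct combination of label values becomes its own time series, so labels with unbounded values (user IDs, raw URLs) are best avoided.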

Business Metrics

Revenue/Transactions: [Track revenue, orders, conversions, etc.]
User Activity: [Active users, sessions, signups, etc.]
Feature Usage: [Feature adoption, API usage, etc.]
SLA Metrics: [Uptime, availability, success rate, etc.]

Infrastructure Metrics

CPU Usage: [Per service / Per container / Per node]
Memory Usage: [Heap, RSS, container limits]
Disk I/O: [Read/write operations, latency]
Network: [Bandwidth, connections, errors]
Container Metrics: [If using Docker/Kubernetes]

External Dependencies

Database Metrics: [Connection pool, query performance, replication lag]
Cache Metrics: [Hit rate, evictions, memory usage]
Message Queue: [Queue depth, processing rate, lag]
Third-party APIs: [Response time, error rate, rate limits]

Scrape Targets

Service 1: [ServiceName]

Endpoint: [http://service1:3000/metrics]
Scrape Interval: [15s / 30s / 1m]
Scrape Timeout: [10s / 30s]
Metrics Exposed: [List key metrics]
Service Discovery: [Static / Kubernetes / Consul / EC2]

Service 2: [ServiceName]

Endpoint: [Metrics endpoint]
Scrape Interval: [Interval]
Scrape Timeout: [Timeout]
Metrics Exposed: [Metrics list]
Service Discovery: [Discovery method]

[Define 1-20 scrape targets]
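
One constraint worth checking when filling in the targets above: Prometheus rejects a scrape job whose scrape timeout is longer than its scrape interval, so a 30s timeout cannot be combined with a 15s interval. The following is a minimal sketch of that check, assuming single-unit durations such as `15s` or `1m`; the helper names are hypothetical.

```python
import re

# Seconds per unit for the single-unit durations this sketch accepts.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(duration):
    """Convert a single-unit duration such as '15s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits(scrape_interval, scrape_timeout):
    """Return True when the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


print(timeout_fits("15s", "10s"))
print(timeout_fits("15s", "30s"))
```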

Exporters

Node Exporter: [For system metrics on all nodes]
cAdvisor: [For container metrics if using Docker]
PostgreSQL Exporter: [If using PostgreSQL]
Redis Exporter: [If using Redis]
Custom Exporters: [Any custom exporters needed]

Alert Rules

Critical Alerts (Page immediately)

Alert 1: [Name, e.g., High Error Rate]

Condition: [sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05]
Duration: [5m / 10m / 15m]
Severity: [critical]
Description: [What this alert means]
Runbook: [Steps to investigate/resolve]
Notification: [PagerDuty / Slack / Email / SMS]

Alert 2: [Name, e.g., Service Down]

Condition: [Alert condition]
Duration: [Duration]
Severity: [critical]
Description: [Description]
Runbook: [Runbook link or steps]
Notification: [Notification channel]

[Define 3-8 critical alerts]
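
When writing error-rate conditions, note that `rate()` on its own returns errors per second, while a threshold such as 0.05 is usually meant as a fraction of all requests. The sketch below works through the arithmetic on two counter samples taken 5 minutes apart. It is a simplification: the real `rate()` function also handles counter resets and extrapolation, and all sample values here are made up.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def error_ratio(errors_first, errors_last, total_first, total_last):
    """Fraction of requests that failed over the sampled window."""
    total = per_second_rate(total_first, total_last)
    if total == 0:
        return 0.0  # no traffic, so no error ratio to report
    return per_second_rate(errors_first, errors_last) / total


# Hypothetical samples over a 300-second window.
ratio = error_ratio((0, 50), (300, 410), (0, 10_000), (300, 16_000))
fires = ratio > 0.05
print(round(ratio, 4), fires)
```

Here the service handled 20 requests per second with 1.2 errors per second, which is a 6% error ratio and exceeds the 5% threshold.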

Warning Alerts (Notify but don't page)

Alert 1: [Name, e.g., High Memory Usage]

Condition: [Memory usage > 80%]
Duration: [15m]
Severity: [warning]
Description: [What this means]
Notification: [Slack / Email]

Alert 2: [Name, e.g., Slow Requests]

Condition: [P95 latency > 2s]
Duration: [10m]
Severity: [warning]
Description: [Description]
Notification: [Channel]

[Define 5-10 warning alerts]
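
Latency conditions such as "P95 latency > 2s" are normally estimated from histogram buckets rather than measured exactly. The sketch below shows the linear interpolation idea behind that estimate, assuming cumulative buckets that end with a +Inf bucket. It is a simplified illustration, not the PromQL implementation, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket. The lowest bucket is
    assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds.
buckets = [(0.5, 800), (1.0, 900), (2.0, 940), (5.0, 990), (math.inf, 1000)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3), p95 > 2.0)
```

Because the result is interpolated within a bucket, its accuracy depends on bucket boundaries; placing a boundary at the alert threshold (2s here) makes the comparison more reliable.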

Info Alerts (For awareness)

Deployment Events: [Track deployments]
Scaling Events: [Auto-scaling triggers]
Backup Status: [Backup completion/failures]

Grafana Dashboards

Dashboard 1: [Name, e.g., Application Overview]

Purpose: [High-level application health]
Refresh Rate: [5s / 10s / 30s / 1m]
Time Range: [Last 1h / 6h / 24h]

Panels:

Request Rate
- Visualization: [Graph / Stat / Gauge]
- Query: [rate(http_requests_total[5m])]
- Description: [Requests per second]
Error Rate
- Visualization: [Graph / Stat]
- Query: [Error rate query]
- Description: [Percentage of failed requests]
Response Time (P50, P95, P99)
- Visualization: [Graph]
- Query: [Histogram quantiles]
- Description: [Latency percentiles]
Active Connections
- Visualization: [Gauge]
- Query: [Current connections]
- Description: [Active connections]

[Define 6-12 panels]

Dashboard 2: [Name, e.g., Infrastructure Metrics]

Purpose: [System resource monitoring]
Refresh Rate: [Rate]
Time Range: [Range]

Panels:

CPU Usage per service
Memory Usage per service
Disk I/O
Network traffic
Container metrics (if applicable)

[Define 4-8 panels]

Dashboard 3: [Name, e.g., Business Metrics]

Purpose: [Business KPIs]
Refresh Rate: [Rate]
Time Range: [Range]

Panels:

Revenue/Transactions
Active Users
Conversion Rate
Feature Usage

[Define 3-6 panels]

[Define 3-6 dashboards total]

Alerting Channels

Notification Channels

Slack: [Webhook URL for #alerts channel]
PagerDuty: [Integration key for critical alerts]
Email: [Distribution list for warnings]
Webhook: [Custom webhook for integrations]
OpsGenie: [If using OpsGenie]

Alert Routing

Critical Alerts: [PagerDuty + Slack]
Warning Alerts: [Slack + Email]
Info Alerts: [Slack only]
Business Hours: [Different routing for off-hours?]
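
The routing above amounts to a lookup from an alert's severity label to a list of channels. A minimal sketch of that decision follows; the channel identifiers and the Slack-only fallback for unknown severities are assumptions for illustration, and in a real deployment this logic would live in the Alertmanager routing configuration.

```python
# Severity-to-channel mapping taken from the routing fields above.
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack", "email"],
    "info": ["slack"],
}


def receivers_for(alert_labels, default=("slack",)):
    """Return the channels for an alert based on its severity label.

    Alerts with a missing or unknown severity fall back to `default`,
    which is an assumption of this sketch.
    """
    severity = alert_labels.get("severity")
    return list(ROUTES.get(severity, default))


print(receivers_for({"alertname": "HighErrorRate", "severity": "critical"}))
print(receivers_for({"alertname": "Unlabelled"}))
```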

Recording Rules

For frequently used queries:

Rule 1: [Name, e.g., Request Rate by Service]

Query: [rate(http_requests_total[5m])]
Interval: [1m]
Purpose: [Pre-compute for dashboards]

Rule 2: [Name, e.g., Error Rate]

Query: [Error rate calculation]
Interval: [Interval]
Purpose: [Purpose]

[Define 3-8 recording rules]
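
Recording rules are conventionally named `level:metric:operations`, with the `_total` suffix dropped from counters once a rate is taken. The sketch below builds one rule entry following that convention. The helper names are hypothetical; `record` and `expr` are the field names used in Prometheus rule files.

```python
def recording_rule_name(level, metric, operations):
    """Build a rule name in the level:metric:operations convention."""
    if metric.endswith("_total"):
        # Counters lose their _total suffix when a rate is applied.
        metric = metric[: -len("_total")]
    return f"{level}:{metric}:{operations}"


def rate_rule(level, metric, window="5m"):
    """Return one recording rule entry aggregating a counter rate by `level`."""
    return {
        "record": recording_rule_name(level, metric, f"rate{window}"),
        "expr": f"sum by ({level}) (rate({metric}[{window}]))",
    }


rule = rate_rule("job", "http_requests_total")
print(rule)
```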

Data Retention & Storage

Prometheus Storage

Retention Period: [15d / 30d / 90d]
Storage Size: [Estimated size needed]
Disk Type: [SSD / HDD]
Backup Strategy: [Snapshots / Remote storage / None]
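
To fill in the storage size, a rough estimate is retention time in seconds multiplied by ingested samples per second and by bytes per sample. The sketch below applies that formula, assuming about 2 bytes per sample after compression and approximating the ingestion rate as active series divided by the scrape interval. The series count is a made-up example, and real usage varies with churn and label cardinality.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough Prometheus disk estimate: retention * samples/s * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical setup: 100,000 active series, 15s interval, 15d retention.
size_bytes = estimate_disk_bytes(100_000, 15, 15)
print(f"{size_bytes / 1e9:.2f} GB")
```

Adding headroom on top of the estimate is sensible, since compaction and the write-ahead log need temporary space.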

Long-term Storage

Remote Write: [Thanos / Cortex / VictoriaMetrics / None]
Retention: [1y / 2y / Indefinite]
Downsampling: [After 30d / 90d / None]

High Availability

Prometheus HA

Multiple Instances: [2+ Prometheus instances]
Federation: [Federate from multiple Prometheus servers]
Alertmanager Clustering: [Clustered Alertmanager]

Grafana HA

Load Balancing: [Multiple Grafana instances]
Shared Database: [PostgreSQL / MySQL for dashboards]
Session Storage: [Redis / Database]

Security

Authentication

Prometheus: [Basic auth / OAuth / None]
Grafana: [OAuth / LDAP / Built-in / SSO]
Alertmanager: [Basic auth / None]

Network Security

Firewall Rules: [Restrict access to monitoring stack]
TLS/SSL: [Enable HTTPS for all components]
API Keys: [Secure API access]

Code Generation Requirements

Generate a complete monitoring solution including:

Prometheus Configuration:
- prometheus.yml with all scrape configs
- Service discovery configurations
- Recording rules
- Alert rules
- Remote write configuration (if needed)
Application Instrumentation:
- Metrics middleware for the application
- Custom metric definitions (Counter, Gauge, Histogram)
- Metrics endpoint implementation
- Proper labeling strategy
- Language-specific best practices
Alert Rules:
- alerts.yml with all alert definitions
- Proper grouping and severity levels
- Meaningful annotations and descriptions
- Runbook links
- Alert routing configuration
Alertmanager Configuration:
- alertmanager.yml with routing rules
- Notification channel configurations
- Inhibition rules
- Grouping and throttling
Grafana Dashboards:
- JSON dashboard definitions for all dashboards
- Proper panel configurations
- Template variables for filtering
- Annotations for deployments
- Alert visualizations
Exporters Configuration:
- Node exporter setup
- Database exporter configurations
- Custom exporter implementations (if needed)
Docker Compose / Kubernetes Manifests:
- Complete deployment configuration
- Prometheus deployment
- Grafana deployment
- Alertmanager deployment
- Exporter deployments
- Persistent volume configurations
Documentation:
- Setup and installation guide
- Metrics catalog (all metrics documented)
- Alert runbooks
- Dashboard usage guide
- Troubleshooting guide
Scripts and Utilities:
- Backup and restore scripts
- Health check scripts
- Alert testing scripts
- Dashboard provisioning scripts

Output production-ready monitoring infrastructure following best practices with:

Comprehensive metric coverage (RED/USE methodology)
Actionable alerts with clear severity levels
Informative dashboards with proper visualizations
Proper metric naming conventions
Efficient recording rules for complex queries
High availability configuration
Secure authentication and authorization
Automated dashboard provisioning
Alert routing and escalation
Long-term storage strategy
Clear documentation and runbooks
