Prometheus & Grafana Monitoring

Set up application monitoring and alerting with Prometheus and Grafana dashboards.

12/8/2025

Prompt

Set up a complete Prometheus and Grafana monitoring solution with custom metrics, alerts, and dashboards for the following application:

Application Details

Application Overview

  • Application Name: [e.g., MyAPI, E-commerce Platform, Microservices]
  • Application Type: [REST API / GraphQL / Web App / Microservices / Batch Jobs]
  • Tech Stack: [Node.js / Python / Go / Java / .NET / Multiple]
  • Deployment: [Kubernetes / Docker / VMs / Serverless / Bare metal]
  • Number of Services: [Single service / 3-5 services / 10+ microservices]

Infrastructure

  • Environment: [Production / Staging / Development / All]
  • Hosting: [AWS / GCP / Azure / On-premise / Hybrid]
  • Load Balancer: [ALB / Nginx / HAProxy / Traefik / None]
  • Database: [PostgreSQL / MySQL / MongoDB / Redis / Multiple]
  • Message Queue: [RabbitMQ / Kafka / SQS / None]
  • Cache: [Redis / Memcached / None]

Metrics Requirements

Application Metrics

Define custom metrics to track:

Metric 1: [Name, e.g., HTTP Requests]

  • Metric Type: [Counter / Gauge / Histogram / Summary]
  • Purpose: [What this metric measures]
  • Labels: [method, route, status, etc.]
  • Scrape Interval: [15s / 30s / 1m, set per scrape job]
  • Retention: [15d / 30d / 90d, set per Prometheus server]

Metric 2: [Name, e.g., Database Query Duration]

  • Metric Type: [Type]
  • Purpose: [Purpose]
  • Labels: [Labels list]
  • Scrape Interval: [Interval]
  • Retention: [Retention period]

Metric 3: [Name, e.g., Active Users]

  • Metric Type: [Type]
  • Purpose: [Purpose]
  • Labels: [Labels]
  • Scrape Interval: [Interval]
  • Retention: [Retention]

[Define 5-15 custom metrics]

Business Metrics

  • Revenue/Transactions: [Track revenue, orders, conversions, etc.]
  • User Activity: [Active users, sessions, signups, etc.]
  • Feature Usage: [Feature adoption, API usage, etc.]
  • SLA Metrics: [Uptime, availability, success rate, etc.]

Infrastructure Metrics

  • CPU Usage: [Per service / Per container / Per node]
  • Memory Usage: [Heap, RSS, container limits]
  • Disk I/O: [Read/write operations, latency]
  • Network: [Bandwidth, connections, errors]
  • Container Metrics: [If using Docker/Kubernetes]

External Dependencies

  • Database Metrics: [Connection pool, query performance, replication lag]
  • Cache Metrics: [Hit rate, evictions, memory usage]
  • Message Queue: [Queue depth, processing rate, lag]
  • Third-party APIs: [Response time, error rate, rate limits]

Scrape Targets

Service 1: [ServiceName]

  • Endpoint: [http://service1:3000/metrics]
  • Scrape Interval: [15s / 30s / 1m]
  • Scrape Timeout: [10s / 30s]
  • Metrics Exposed: [List key metrics]
  • Service Discovery: [Static / Kubernetes / Consul / EC2]

Service 2: [ServiceName]

  • Endpoint: [Metrics endpoint]
  • Scrape Interval: [Interval]
  • Scrape Timeout: [Timeout]
  • Metrics Exposed: [Metrics list]
  • Service Discovery: [Discovery method]

[Define 1-20 scrape targets]
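
As an illustration of how a filled-in scrape target could map onto prometheus.yml, here is a minimal sketch. The job name `service1` is hypothetical, the target reuses the example endpoint above, and static discovery is assumed; Kubernetes, Consul, or EC2 discovery would replace the `static_configs` block.

```yaml
# prometheus.yml (fragment): one statically configured scrape job.
global:
  scrape_interval: 30s   # default for jobs that do not override it
  scrape_timeout: 10s    # must not be longer than the scrape interval

scrape_configs:
  - job_name: service1             # hypothetical job name
    scrape_interval: 15s           # per-job override of the global default
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ["service1:3000"] # host:port taken from the example endpoint
        labels:
          env: production          # assumed extra label, adjust or remove
```

Scrape interval and timeout are properties of the job, so every metric exposed by a target shares them.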

Exporters

  • Node Exporter: [For system metrics on all nodes]
  • cAdvisor: [For container metrics if using Docker]
  • PostgreSQL Exporter: [If using PostgreSQL]
  • Redis Exporter: [If using Redis]
  • Custom Exporters: [Any custom exporters needed]
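
A possible set of scrape jobs for the exporters listed above, as a sketch only. The hostnames are hypothetical; the ports are each exporter's usual default listen port and should be checked against the actual deployment.

```yaml
# prometheus.yml (fragment): scrape jobs for common exporters.
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]      # node_exporter default port

  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]           # cAdvisor default port

  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]  # postgres_exporter default port

  - job_name: redis
    static_configs:
      - targets: ["redis-exporter:9121"]     # redis_exporter default port
```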

Alert Rules

Critical Alerts (Page immediately)

Alert 1: [Name, e.g., High Error Rate]

  • Condition: [sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05]
  • Duration: [5m / 10m / 15m]
  • Severity: [critical]
  • Description: [What this alert means]
  • Runbook: [Steps to investigate/resolve]
  • Notification: [PagerDuty / Slack / Email / SMS]

Alert 2: [Name, e.g., Service Down]

  • Condition: [Alert condition]
  • Duration: [Duration]
  • Severity: [critical]
  • Description: [Description]
  • Runbook: [Runbook link or steps]
  • Notification: [Notification channel]

[Define 3-8 critical alerts]
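
To show how the critical alert fields might translate into a rule file, here is a hedged sketch. It assumes a counter named `http_requests_total` with a `status` label and a scrape job named `service1`; the group name and runbook URLs are placeholders. The error condition is written as a ratio of 5xx requests to all requests so that the 0.05 threshold reads as 5%.

```yaml
# alerts.yml (fragment): two critical alerts.
groups:
  - name: critical-alerts            # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m                      # the "Duration" field of the template
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"
          description: "Error ratio is {{ $value | humanizePercentage }}."
          runbook_url: "https://example.com/runbooks/high-error-rate"  # placeholder

      - alert: ServiceDown
        # up is 0 when Prometheus could not scrape the target.
        expr: up{job="service1"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
          runbook_url: "https://example.com/runbooks/service-down"     # placeholder
```

The `severity` label is what Alertmanager routing would typically match on.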

Warning Alerts (Notify but don't page)

Alert 1: [Name, e.g., High Memory Usage]

  • Condition: [Memory usage > 80%]
  • Duration: [15m]
  • Severity: [warning]
  • Description: [What this means]
  • Notification: [Slack / Email]

Alert 2: [Name, e.g., Slow Requests]

  • Condition: [P95 latency > 2s]
  • Duration: [10m]
  • Severity: [warning]
  • Description: [Description]
  • Notification: [Channel]

[Define 5-10 warning alerts]
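
The two example warning conditions could be expressed roughly as follows. This sketch assumes the application exposes a histogram named `http_request_duration_seconds` (a hypothetical name) and that Node Exporter supplies the memory metrics; the group name is also a placeholder.

```yaml
# alerts.yml (fragment): two warning alerts.
groups:
  - name: warning-alerts             # hypothetical group name
    rules:
      - alert: SlowRequests
        # P95 latency, estimated from histogram buckets aggregated across instances.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          description: "P95 latency has been above 2s for 10 minutes."

      - alert: HighMemoryUsage
        # Fraction of memory in use, based on Node Exporter metrics.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.80
        for: 15m
        labels:
          severity: warning
        annotations:
          description: "Memory usage on {{ $labels.instance }} is above 80%."
```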

Info Alerts (For awareness)

  • Deployment Events: [Track deployments]
  • Scaling Events: [Auto-scaling triggers]
  • Backup Status: [Backup completion/failures]

Grafana Dashboards

Dashboard 1: [Name, e.g., Application Overview]

  • Purpose: [High-level application health]
  • Refresh Rate: [5s / 10s / 30s / 1m]
  • Time Range: [Last 1h / 6h / 24h]

Panels:

  1. Request Rate

    • Visualization: [Graph / Stat / Gauge]
    • Query: [rate(http_requests_total[5m])]
    • Description: [Requests per second]
  2. Error Rate

    • Visualization: [Graph / Stat]
    • Query: [Error rate query]
    • Description: [Percentage of failed requests]
  3. Response Time (P50, P95, P99)

    • Visualization: [Graph]
    • Query: [Histogram quantiles]
    • Description: [Latency percentiles]
  4. Active Connections

    • Visualization: [Gauge]
    • Query: [Current connections]
    • Description: [Active connections]

[Define 6-12 panels]

Dashboard 2: [Name, e.g., Infrastructure Metrics]

  • Purpose: [System resource monitoring]
  • Refresh Rate: [Rate]
  • Time Range: [Range]

Panels:

  • CPU Usage per service
  • Memory Usage per service
  • Disk I/O
  • Network traffic
  • Container metrics (if applicable)

[Define 4-8 panels]

Dashboard 3: [Name, e.g., Business Metrics]

  • Purpose: [Business KPIs]
  • Refresh Rate: [Rate]
  • Time Range: [Range]

Panels:

  • Revenue/Transactions
  • Active Users
  • Conversion Rate
  • Feature Usage

[Define 3-6 panels]

[Define 3-6 dashboards total]
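
Dashboards defined as JSON can be loaded automatically through Grafana's file-based provisioning. The following is a minimal sketch of a dashboard provider file; the provider name, folder, and path are assumptions, and the file would normally sit under Grafana's `provisioning/dashboards` directory.

```yaml
# dashboards.yml: tells Grafana to load every dashboard JSON file found under `path`.
apiVersion: 1

providers:
  - name: default                    # hypothetical provider name
    orgId: 1
    folder: Monitoring               # hypothetical Grafana folder for these dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30        # how often Grafana rescans the directory
    allowUiUpdates: false            # keep the files as the source of truth
    options:
      path: /var/lib/grafana/dashboards   # assumed mount point for the JSON files
```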

Alerting Channels

Notification Channels

  • Slack: [Webhook URL for #alerts channel]
  • PagerDuty: [Integration key for critical alerts]
  • Email: [Distribution list for warnings]
  • Webhook: [Custom webhook for integrations]
  • OpsGenie: [If using OpsGenie]

Alert Routing

  • Critical Alerts: [PagerDuty + Slack]
  • Warning Alerts: [Slack + Email]
  • Info Alerts: [Slack only]
  • Business Hours: [Different routing for off-hours?]
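
One way the routing table above might look in alertmanager.yml, as a sketch rather than a finished configuration. Receiver names, the SMTP host, addresses, and the angle-bracket values are placeholders to be replaced, and the `service` label used for grouping is an assumption about how alerts are labeled.

```yaml
# alertmanager.yml (fragment): route by the severity label.
global:
  smtp_smarthost: "smtp.example.com:587"   # placeholder, required for email receivers
  smtp_from: "alertmanager@example.com"    # placeholder

route:
  receiver: info                     # fallback when no child route matches
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: critical
    - matchers: ['severity="warning"']
      receiver: warning

receivers:
  - name: critical                   # PagerDuty + Slack
    pagerduty_configs:
      - routing_key: "<PagerDuty integration key>"
    slack_configs:
      - api_url: "<Slack webhook URL>"
        channel: "#alerts"
  - name: warning                    # Slack + Email
    slack_configs:
      - api_url: "<Slack webhook URL>"
        channel: "#alerts"
    email_configs:
      - to: "team@example.com"       # placeholder distribution list
  - name: info                       # Slack only
    slack_configs:
      - api_url: "<Slack webhook URL>"
        channel: "#alerts"

inhibit_rules:
  # Suppress the warning-level copy of an alert while the critical one is firing.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ["alertname", "service"]
```

Off-hours routing, if needed, would be layered on top of this with time intervals on the relevant routes.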

Recording Rules

For frequently used queries:

Rule 1: [Name, e.g., Request Rate by Service]

  • Query: [rate(http_requests_total[5m])]
  • Interval: [1m]
  • Purpose: [Pre-compute for dashboards]

Rule 2: [Name, e.g., Error Rate]

  • Query: [Error rate calculation]
  • Interval: [Interval]
  • Purpose: [Purpose]

[Define 3-8 recording rules]
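
The two example rules could be written as below. This is a sketch that assumes the `http_requests_total` counter from the earlier examples; the group name is hypothetical, and the rule names follow the common `level:metric:operations` naming pattern.

```yaml
# recording_rules.yml (fragment): pre-computed series for dashboards and alerts.
groups:
  - name: http-recording-rules       # hypothetical group name
    interval: 1m                     # how often the group is evaluated
    rules:
      # Requests per second, aggregated per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Fraction of requests that returned a 5xx status, per job.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts can then query `job:http_requests:rate5m` directly instead of re-evaluating the `rate()` expression on every refresh.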

Data Retention & Storage

Prometheus Storage

  • Retention Period: [15d / 30d / 90d]
  • Storage Size: [Estimated size needed]
  • Disk Type: [SSD / HDD]
  • Backup Strategy: [Snapshots / Remote storage / None]

Long-term Storage

  • Remote Write: [Thanos / Cortex / VictoriaMetrics / None]
  • Retention: [1y / 2y / Indefinite]
  • Downsampling: [After 30d / 90d / None]
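
Local retention is set with command-line flags rather than in prometheus.yml, while remote write is configured in the file. A minimal sketch follows, assuming a remote endpoint that accepts the Prometheus remote write protocol; the URL, username, and file path are placeholders.

```yaml
# Local retention is controlled by startup flags, for example:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
#
# prometheus.yml (fragment): forward samples to long-term storage.
remote_write:
  - url: "https://metrics.example.com/api/v1/write"    # placeholder endpoint
    basic_auth:
      username: prometheus                             # placeholder
      password_file: /etc/prometheus/remote-write-password
    write_relabel_configs:
      # Optionally drop high-volume series before they leave the server.
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
```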

High Availability

Prometheus HA

  • Multiple Instances: [2+ Prometheus instances]
  • Federation: [Federate from multiple Prometheus instances]
  • Alertmanager Clustering: [Clustered Alertmanager]
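
A common pattern for the HA items above is to run two identical Prometheus replicas that differ only in a `replica` external label, and to point each of them at every Alertmanager instance. The sketch below assumes hypothetical hostnames and label values.

```yaml
# prometheus.yml (fragment) for replica 0; replica 1 differs only in the replica label.
global:
  external_labels:
    cluster: prod                    # hypothetical cluster name
    replica: prometheus-0            # unique per Prometheus instance

alerting:
  alert_relabel_configs:
    # Drop the replica label so both replicas send identical alerts,
    # which Alertmanager can then deduplicate.
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        # List every Alertmanager instance rather than a load balancer address.
        - targets:
            - "alertmanager-0:9093"
            - "alertmanager-1:9093"
```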

Grafana HA

  • Load Balancing: [Multiple Grafana instances]
  • Shared Database: [PostgreSQL / MySQL for dashboards]
  • Session Storage: [Redis / Database]

Security

Authentication

  • Prometheus: [Basic auth / OAuth / None]
  • Grafana: [OAuth / LDAP / Built-in / SSO]
  • Alertmanager: [Basic auth / None]

Network Security

  • Firewall Rules: [Restrict access to monitoring stack]
  • TLS/SSL: [Enable HTTPS for all components]
  • API Keys: [Secure API access]
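
Prometheus can serve its own UI and API over TLS with basic authentication through a separate web configuration file. The following is a sketch under assumed file paths and a placeholder user; the password value must be a bcrypt hash generated separately.

```yaml
# web-config.yml, passed to Prometheus with --web.config.file=web-config.yml
tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt   # assumed certificate path
  key_file: /etc/prometheus/tls/server.key    # assumed key path

basic_auth_users:
  # username: bcrypt hash of the password (never the plain-text password)
  admin: "<bcrypt hash>"
```

Grafana's Prometheus data source would then need the matching HTTPS URL and credentials.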

Code Generation Requirements

Generate a complete monitoring solution including:

  1. Prometheus Configuration:

    • prometheus.yml with all scrape configs
    • Service discovery configurations
    • Recording rules
    • Alert rules
    • Remote write configuration (if needed)
  2. Application Instrumentation:

    • Metrics middleware for the application
    • Custom metric definitions (Counter, Gauge, Histogram)
    • Metrics endpoint implementation
    • Proper labeling strategy
    • Language-specific best practices
  3. Alert Rules:

    • alerts.yml with all alert definitions
    • Proper grouping and severity levels
    • Meaningful annotations and descriptions
    • Runbook links
    • Alert routing configuration
  4. Alertmanager Configuration:

    • alertmanager.yml with routing rules
    • Notification channel configurations
    • Inhibition rules
    • Grouping and throttling
  5. Grafana Dashboards:

    • JSON dashboard definitions for all dashboards
    • Proper panel configurations
    • Template variables for filtering
    • Annotations for deployments
    • Alert visualizations
  6. Exporters Configuration:

    • Node exporter setup
    • Database exporter configurations
    • Custom exporter implementations (if needed)
  7. Docker Compose / Kubernetes Manifests:

    • Complete deployment configuration
    • Prometheus deployment
    • Grafana deployment
    • Alertmanager deployment
    • Exporter deployments
    • Persistent volume configurations
  8. Documentation:

    • Setup and installation guide
    • Metrics catalog (all metrics documented)
    • Alert runbooks
    • Dashboard usage guide
    • Troubleshooting guide
  9. Scripts and Utilities:

    • Backup and restore scripts
    • Health check scripts
    • Alert testing scripts
    • Dashboard provisioning scripts

Output a production-ready monitoring infrastructure that follows best practices, with:

  • Comprehensive metric coverage (RED/USE methodology)
  • Actionable alerts with clear severity levels
  • Informative dashboards with proper visualizations
  • Proper metric naming conventions
  • Efficient recording rules for complex queries
  • High availability configuration
  • Secure authentication and authorization
  • Automated dashboard provisioning
  • Alert routing and escalation
  • Long-term storage strategy
  • Clear documentation and runbooks

Tags

prometheus
grafana
monitoring
observability

Tested Models

gpt-4
claude-3-5-sonnet
