This guide demonstrates how to create a scalable, centralized monitoring solution for your Kubernetes clusters using a push-based architecture implemented through Infrastructure as Code.
We'll use three key components:
Grafana Cloud — an observability platform that stores and visualizes all our telemetry data.
Grafana Alloy (formerly known as Grafana Agent) — a flexible telemetry collection agent designed to discover, collect, process, and forward observability data from various sources.
k8s-monitoring Helm chart — a pre-configured setup for deploying monitoring components in Kubernetes clusters, simplifying the deployment and configuration of Grafana Alloy.
With this architecture, you can collect logs, metrics, traces, and more across multiple Kubernetes clusters and centralize everything in one place:
This diagram shows how metrics can be pulled from applications in a Kubernetes cluster and pushed to Grafana Cloud or its self-hosted alternative
Setting Up Grafana Cloud
Let's start by configuring Grafana Cloud to receive our monitoring data.
Create Access Policies
First, you need to configure access policies in Grafana Cloud so that Terraform can manage resources there. Access policies define which actions can be performed on which resources in Grafana Cloud. Sign up at Grafana Cloud, then go to “My Account” → “Security” → “Access Policies” and create a terraform-access-policy that grants Terraform the permissions needed to create and manage resources in Grafana Cloud:
For production use, you should define more specific permissions.
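The token generated for this policy is then handed to Terraform as an input variable. As a minimal sketch (the variable name simply matches the provider configuration used later in this guide):

variable "grafana_cloud_access_policy_token" {
  description = "Token of the terraform-access-policy created in Grafana Cloud"
  type        = string
  sensitive   = true
}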
Now let's create a Grafana stack, which is a logical grouping of services such as metrics, logs, and traces:
Create Access Policy for Grafana Alloy
We also need to allow Grafana Alloy to send data to our stack. This policy defines the specific permissions (scopes) Alloy needs to push metrics, logs, and traces:
Expose Connection Details
Now we have all the tokens and URLs needed to push data from our Kubernetes clusters into the Grafana Cloud stack via Grafana Alloy. These outputs expose the connection details required by the monitoring setup:
Deploy the k8s-monitoring Helm Chart
With Grafana Cloud configured, we can now set up monitoring in our Kubernetes clusters.
⚠️
To deploy Helm charts with Terraform, you'll need to configure the Helm provider.
See the Terraform Helm provider documentation for setup instructions.
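For reference, a minimal sketch of such a provider configuration, assuming a local kubeconfig (the path and context name are illustrative, and the exact syntax depends on your Helm provider version):

provider "helm" {
  kubernetes {
    # Illustrative: point the provider at the cluster you want to monitor
    config_path    = "~/.kube/config"
    config_context = "my-cluster"
  }
}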
📌
While this guide shows deploying the Helm chart via Terraform for simplicity and to illustrate the concept, in production environments you might prefer using Kustomize or ArgoCD for managing Helm releases (these approaches offer better separation of concerns and more granular control over Kubernetes resources).
Understanding Grafana Alloy
Grafana Alloy uses a configuration language called River (similar in syntax to Terraform's HCL) to define how telemetry data is collected and processed. You could write this configuration by hand, as shown below, but it quickly becomes complex for production environments:
Here's how you would deploy such a configuration using Terraform:
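A minimal sketch of what that could look like, assuming the River configuration shown later in this guide is saved as config.alloy.tpl next to the Terraform code and that the connection details are exposed as input variables (all file, variable, and release names here are illustrative):

resource "helm_release" "alloy" {
  name             = "alloy"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "alloy"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    yamlencode({
      alloy = {
        configMap = {
          # Render the River configuration with the Grafana Cloud connection details
          content = templatefile("${path.module}/config.alloy.tpl", {
            cluster_name               = var.cluster_name
            prometheus_url             = var.prometheus_url
            prometheus_user_id         = var.prometheus_user_id
            loki_url                   = var.loki_url
            loki_user_id               = var.loki_user_id
            grafana_agent_access_token = var.grafana_alloy_access_token
          })
        }
      }
    })
  ]
}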
This approach quickly becomes tedious and time-consuming to write and maintain. Instead of hand-writing such a configuration, we'll use the k8s-monitoring Helm chart.
What the k8s-monitoring Helm Chart Deploys
The k8s-monitoring chart deploys several components that provide comprehensive monitoring of your Kubernetes cluster, with support for:
Prometheus for metrics collection.
Loki for log collection.
Tempo for distributed tracing.
Pyroscope for continuous profiling.
Core Kubernetes monitoring components, including:
kube-state-metrics for exposing Kubernetes object state metrics.
node-exporter for hardware metrics.
kubelet for exposing node and container runtime metrics.
cAdvisor for container resource usage metrics like CPU, memory, network, and disk.
Example values for the k8s-monitoring Helm chart might look like this:
Deploying with Terraform
Here's how to deploy the chart using Terraform:
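A minimal sketch, assuming the chart values shown later in this guide are stored as values.yaml.tpl and the Grafana Cloud connection details are available as input variables (names are illustrative; pin a chart version you have tested):

resource "helm_release" "k8s_monitoring" {
  name             = "k8s-monitoring"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "k8s-monitoring"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    # Render the chart values with the Grafana Cloud connection details
    templatefile("${path.module}/values.yaml.tpl", {
      cluster_name               = var.cluster_name
      prometheus_host            = var.prometheus_url
      prometheus_username        = var.prometheus_user_id
      loki_host                  = var.loki_url
      loki_username              = var.loki_user_id
      grafana_agent_access_token = var.grafana_alloy_access_token
    })
  ]
}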
After deploying, you'll see several pods in your monitoring namespace:
Customizing Your Monitoring Pipeline
When implementing monitoring across different environments and clusters, you'll often need to customize how metrics are collected and labeled. The k8s-monitoring Helm chart allows you to apply these customizations without writing complex River configurations directly. You can implement them by adding extra relabeling rules to your values:
These rules follow the same syntax as River configuration but are applied within the managed Helm chart deployment. This gives you the flexibility of the River language while keeping deployment simple and maintainable.
Managing Costs
Monitoring in the cloud comes with costs, especially when collecting data from large clusters. Here are some strategies to keep expenses under control:
Start small: enable only essential namespaces and components, beginning with critical applications and infrastructure.
Monitor usage: regularly check the usage dashboards in Grafana Cloud to understand how much data you're ingesting and which metrics are taking up the most space.
Filter metrics and logs: drop data you don't need. Use the metricsTuning settings of the k8s-monitoring chart to limit which metrics are collected:
Limit namespaces from which you collect logs:
provider "grafana" {
alias = "cloud"
cloud_access_policy_token = var.grafana_cloud_access_policy_token
}
resource "grafana_cloud_stack" "this" {
name = "Your name for Grafana Cloud Stack"
slug = "yourgcloudstack"
region_slug = "us"
}
# For access to your Grafana stack
resource "grafana_cloud_stack_service_account" "cloud_sa" {
  provider = grafana.cloud

  stack_slug  = grafana_cloud_stack.this.slug
  name        = "cloud service account"
  role        = "Admin"
  is_disabled = false
}

# Token for authenticating the service account
resource "grafana_cloud_stack_service_account_token" "cloud_sa" {
  provider = grafana.cloud

  stack_slug         = grafana_cloud_stack.this.slug
  name               = "terraform service account key"
  service_account_id = grafana_cloud_stack_service_account.cloud_sa.id
}
resource "grafana_cloud_access_policy" "grafana_alloy" {
provider = grafana.cloud
region = resource.grafana_cloud_stack.this[0].region_slug
name = "grafana-alloy-access-policy"
display_name = "Grafana Alloy Access Policy (created by Terraform)"
scopes = [
"metrics:write",
"metrics:import",
"logs:write",
"traces:write",
]
# Define which resources the policy applies to
realm {
type = "stack"
identifier = resource.grafana_cloud_stack.this[0].id
}
}
resource "grafana_cloud_access_policy_token" "grafana_alloy" {
provider = grafana.cloud
region = resource.grafana_cloud_stack.this[0].region_slug
access_policy_id = grafana_cloud_access_policy.grafana_alloy[0].policy_id
name = "grafana-alloy-access-token"
display_name = "Grafana Alloy Access Token (created by Terraform)"
}
output "alloy_access_token" {
description = "Grafana Alloy access token."
value = grafana_cloud_access_policy_token.grafana_alloy[0].token
sensitive = true
}
output "prometheus_url" {
description = "Grafana Cloud Prometheus URL."
value = resource.grafana_cloud_stack.this[0].prometheus_url
}
output "prometheus_user_id" {
description = "Grafana Cloud Prometheus user ID."
value = resource.grafana_cloud_stack.this[0].prometheus_user_id
}
output "loki_url" {
description = "Grafana Cloud Loki URL."
value = resource.grafana_cloud_stack.this[0].logs_url
}
output "loki_user_id" {
description = "Grafana Cloud Loki user ID."
value = resource.grafana_cloud_stack.this[0].logs_user_id
}
This Terraform configuration defines outputs for Prometheus (Mimir) and Loki, providing user IDs and endpoint URLs. While Grafana Cloud also supports Tempo for tracing, k6 for load testing, and more, this setup focuses solely on metrics and logs to keep things simple.
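If the Grafana Cloud resources above live in their own Terraform module (a common layout, though not required), these outputs can be wired into the cluster configuration roughly as follows; the module path and local names are illustrative:

module "grafana_cloud" {
  # Illustrative path to a module containing the resources defined above
  source = "./modules/grafana-cloud"
}

locals {
  # Connection details consumed by the monitoring deployment in the cluster
  prometheus_host            = module.grafana_cloud.prometheus_url
  prometheus_username        = module.grafana_cloud.prometheus_user_id
  loki_host                  = module.grafana_cloud.loki_url
  loki_username              = module.grafana_cloud.loki_user_id
  grafana_alloy_access_token = module.grafana_cloud.alloy_access_token
}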
//@ Prometheus Scraping
// Discover Kubernetes pods to collect metrics from
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.kubernetes
discovery.kubernetes "pod_metrics" {
role = "pod"
// Limit namespaces where pods are discovered
namespaces {
own_namespace = false // whether you want to search for Pods in the Namespace Grafana Agent Flow is running in
names = ["default", "kube-system", "monitoring", "celery"]
}
}
// Expose pod labels as metric labels
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel
discovery.relabel "pod_metrics" {
targets = discovery.kubernetes.pod_metrics.targets
rule {
action = "labelmap"
regex = "__meta_kubernetes_pod_label_(.+)"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node"
}
rule {
action = "replace"
replacement = "${cluster_name}"
target_label = "cluster"
}
}
// Scrape metrics from Kubernetes pods and send to a prometheus.remote_write component
// https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.scrape
prometheus.scrape "pod_metrics" {
targets = discovery.relabel.pod_metrics.output
forward_to = [prometheus.remote_write.default.receiver]
honor_labels = true
}
// Collect and send metrics to a Prometheus remote_write endpoint
// https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.remote_write
prometheus.remote_write "default" {
// If you have more than one endpoint to write metrics to, repeat the endpoint block for additional endpoints
endpoint {
url = "${prometheus_url}"
basic_auth {
username = "${prometheus_user_id}"
password = "${grafana_agent_access_token}"
}
}
}
//@ Logs Scraping
// Discover Kubernetes pods to collect logs from
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.kubernetes
discovery.kubernetes "pod_logs" {
role = "pod"
// Limit namespaces where pods are discovered
namespaces {
own_namespace = false // whether you want to search for Pods in the Namespace Grafana Agent Flow is running in
names = ["celery"]
}
}
// Apply relabeling for log scraping
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel
discovery.relabel "pod_logs" {
targets = discovery.kubernetes.pod_logs.targets
rule {
action = "labelmap"
regex = "__meta_kubernetes_pod_label_(.+)"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "__host__"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_namespace"]
target_label = "namespace"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_name"]
target_label = "pod"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_container_name"]
target_label = "container"
}
rule {
action = "replace"
replacement = "/var/log/pods/*$1/*.log"
separator = "/"
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
target_label = "__path__"
}
}
// Tail logs from Kubernetes containers using the Kubernetes API
// https://grafana.com/docs/agent/latest/flow/reference/components/loki.source.kubernetes/
loki.source.kubernetes "pods" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.write.default.receiver]
}
// Receives log entries from other loki components and sends them over the network using Loki’s logproto format
// https://grafana.com/docs/agent/latest/flow/reference/components/loki.write/
loki.write "default" {
endpoint {
url = "${loki_url}"
basic_auth {
username = "${loki_user_id}"
password = "${grafana_agent_access_token}"
}
}
}
Data collection pipeline that automatically discovers, relabels, and forwards metrics (Prometheus) and logs (Loki) from Kubernetes pods to Grafana Cloud endpoints using Grafana Alloy.
cluster:
# -- The name of this cluster, which will be set in all labels. Required.
name: ${cluster_name}
externalServices:
prometheus:
# -- Prometheus host where metrics will be sent
host: "${prometheus_host}"
basicAuth:
username: "${prometheus_username}"
password: "${grafana_agent_access_token}"
loki:
# -- Loki host where logs and events will be sent
host: "${loki_host}"
basicAuth:
username: "${loki_username}"
password: "${grafana_agent_access_token}"
# Settings related to capturing and forwarding metrics
metrics:
# -- Capture and forward metrics
enabled: true
# -- How frequently to scrape metrics
scrapeInterval: 60s
  # Annotation-based autodiscovery allows metric sources to be discovered based solely on their
  # annotations and does not require any extra configuration.
autoDiscover:
# Enable annotation-based autodiscovery
enabled: true
# -- Annotations that are used to discover and configure metric scraping targets. Add these annotations
# to your services or pods to control how autodiscovery will find and scrape metrics from your service or pod.
annotations:
# -- Annotation for enabling scraping for this service or pod. Value should be either "true" or "false"
scrape: "k8s.grafana.com/scrape"
# -- Annotation for setting or overriding the metrics path. If not set, it defaults to /metrics
metricsPath: "k8s.grafana.com/metrics.path"
# -- Annotation for setting the metrics port by number.
metricsPortNumber: "k8s.grafana.com/metrics.portNumber"
# Metrics from Grafana Alloy
alloy:
enabled: false
# Cluster object metrics from Kube State Metrics
kube-state-metrics:
enabled: true
# Node metrics from Node Exporter
node-exporter:
enabled: true
# Cluster metrics from the Kubelet
kubelet:
enabled: true
# Container metrics from cAdvisor
cadvisor:
enabled: true
# Metrics from the API Server
apiserver:
enabled: false
# Metrics from the Kube Controller Manager
kubeControllerManager:
enabled: false
# Metrics from the Kube Proxy
kubeProxy:
enabled: false
# Metrics from the Kube Scheduler
kubeScheduler:
enabled: false
# Cost related metrics from OpenCost
cost:
enabled: false
# Settings related to capturing and forwarding logs
logs:
enabled: true
# Settings for Kubernetes pod logs
pod_logs:
enabled: true
# Controls the behavior of discovering pods for logs.
# When set to "all", every pod (filtered by the namespaces list below) will have their logs gathered, but you can
# use the annotation to remove a pod from that list.
# When set to "annotation", only pods with the annotation set to true will be gathered.
# Possible values: "all" "annotation"
discovery: "all"
# The annotation to control the behavior of gathering logs from this pod. If you put this annotation on to your pod,
# it will either enable or disable auto gathering of logs from this pod.
annotation: "k8s.grafana.com/logs.autogather"
# Settings for scraping Kubernetes cluster events
cluster_events:
enabled: false
# Settings related to capturing and forwarding traces
traces:
enabled: false
# Settings for the Node Exporter deployment
# You can use this section to make modifications to the Node Exporter deployment.
# See https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter for available values.
prometheus-node-exporter:
# -- Should this helm chart deploy Node Exporter to the cluster.
# Set this to false if your cluster already has Node Exporter, or if you do
# not want to scrape metrics from Node Exporter.
enabled: true
tolerations:
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
# Settings for the Grafana Alloy instance that gathers pod logs.
# See https://github.com/grafana/alloy/tree/main/operations/helm/charts/alloy for available values.
# @ignored -- This skips including these values in README.md
alloy-logs:
controller:
type: daemonset
nodeSelector:
kubernetes.io/os: linux
# Allow this pod to be scheduled on GPU nodes
tolerations:
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
Monitoring components in the monitoring namespace, including Grafana Alloy (metrics and logs), kube-state-metrics, and node exporter.
# Settings related to capturing and forwarding metrics
metrics:
# -- Rule blocks to be added to the discovery.relabel component for all metric sources.
# See https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel/#rule-block
extraRelabelingRules: |-
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node"
}
rule { // NB(khaykingleb): to differentiate the environments of pods in the GPU cluster
action = "replace"
source_labels = ["__meta_kubernetes_pod_label_Env"]
target_label = "env"
}
This configuration assigns a node label based on the pod’s Kubernetes node and an env label to differentiate environments in a GPU cluster, enhancing observability in Grafana Alloy.
# Cluster object metrics from Kube State Metrics
kube-state-metrics:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from Kube State Metrics to a useful, minimal set. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-kube-state-metrics
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics:
- kube_pod_status_qos_class
- kube_namespace_created
- kube_deployment_status_replicas_unavailable
- kube_pod_container_status_restarts_total
- kube_node_labels
# -- Metrics to drop. Can use regex.
excludeMetrics:
- kube_lease_owner
- kube_lease_renew_time
- kube_pod_tolerations
- kube_pod_status_ready
- kube_pod_status_scheduled
- kube_pod_owner
- kube_pod_start_time
- kube_pod_container_state_started
- kube_node_status_allocatable
- kube_node_spec_unschedulable
- kube_pod_created
- kube_pod_ips
- kube_pod_restart_policy
- kube_pod_service_account
- kube_pod_status_initialized_time
- kube_pod_status_scheduled_time
- kube_pod_status_container_ready_time
- kube_pod_status_ready_time
- kube_horizontalpodautoscaler_status_condition
- kube_replicaset_created
- kube_replicaset_metadata_generation
- kube_replicaset_spec_replicas
- kube_replicaset_status_fully_labeled_replicas
- kube_replicaset_status_observed_generation
- kube_replicaset_status_ready_replicas
- kube_replicaset_status_replicas
- kube_daemonset_created
- kube_daemonset_metadata_generation
- kube_deployment_status_condition
- kube_namespace_status_phase
- kube_endpoint_ports
- kube_endpoint_address
- kube_secret_created
- kube_secret_metadata_resource_version
# Node metrics from Node Exporter
node-exporter:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from Node Exporter to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-node-exporter
useDefaultAllowList: true
# -- Filter the list of metrics from Node Exporter to the minimal set required for Kubernetes Monitoring as well as the Node Exporter integration.
useIntegrationAllowList: false
# -- Metrics to keep. Can use regex.
includeMetrics:
- node_uname_info
- node_time_seconds
- node_boot_time_seconds
- node_cpu_core_throttles_total
- node_load1
- node_context_switches_total
- node_filefd_maximum
- node_timex_estimated_error_seconds
- node_network_receive_bytes_total
- node_network_receive_errs_total
- node_network_receive_packets_total
- node_network_receive_drop_total
- node_netstat_Tcp_CurrEstab
- node_nf_conntrack_entries
- node_disk_read_bytes_total
- node_disk_written_bytes_total
- node_disk_reads_completed_total
- node_disk_writes_completed_total
- node_disk_io_now
# -- Metrics to drop. Can use regex.
excludeMetrics:
- node_filesystem_readonly
- node_filesystem_free_bytes
- node_scrape_collector_duration_seconds
- node_scrape_collector_success
- node_cpu_guest_seconds_total
# Cluster metrics from the Kubelet
kubelet:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from the Kubelet to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-kubelet
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics: []
# -- Metrics to drop. Can use regex.
excludeMetrics:
- kubelet_pod_worker_duration_seconds_bucket
- kubelet_cgroup_manager_duration_seconds_bucket
- kubelet_runtime_operations_total
- kubelet_pleg_relist_duration_seconds_bucket
- kubelet_pleg_relist_interval_seconds_bucket
- kubelet_pod_start_duration_seconds_bucket
- rest_client_requests_total
- storage_operation_duration_seconds_count
- volume_manager_total_volumes
# Container metrics from cAdvisor
cadvisor:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from cAdvisor to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-cadvisor
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics:
- machine_cpu_cores
- container_oom_events_total
- container_network_receive_errors_total
- container_cpu_cfs_throttled_seconds_total
# -- Metrics to drop. Can use regex.
excludeMetrics:
- container_memory_cache
- container_memory_swap
- container_fs_reads_total
- container_fs_writes_total
- container_fs_reads_bytes_total
- container_fs_writes_bytes_total
logs:
pod_logs:
# -- Only capture logs from pods in these namespaces (`[]` means all namespaces)
namespaces:
- celery
- supabase