This guide demonstrates how to create a scalable, centralized monitoring solution for your Kubernetes clusters using a push-based architecture implemented through Infrastructure as Code.
We'll use three key components:
Grafana Cloud — an observability platform that stores and visualizes all our telemetry data.
Grafana Alloy (formerly known as Grafana Agent) — a flexible telemetry collection agent designed to discover, collect, process, and forward observability data from various sources.
k8s-monitoring Helm chart — a pre-configured setup for deploying monitoring components in Kubernetes clusters, simplifying the deployment and configuration of Grafana Alloy.
With this architecture, you can collect logs, metrics, traces, and more across multiple Kubernetes clusters and centralize everything in one place:
This diagram shows how metrics can be pulled from applications in a Kubernetes cluster and pushed to Grafana Cloud or its self-hosted alternative
Setting Up Grafana Cloud
Let's start by configuring Grafana Cloud to receive our monitoring data.
Create Access Policies
First, you need to configure access policies in Grafana Cloud so that Terraform can manage resources there. Access policies define which actions can be performed on which resources in Grafana Cloud. Sign up at Grafana Cloud, then go to “My Account” → “Security” → “Access Policies” and create a terraform-access-policy that grants Terraform the permissions needed to create and manage resources in Grafana Cloud:
For production use, you should define more specific permissions.
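The token generated for this policy is then handed to Terraform as an input variable. As a minimal sketch (the variable name simply matches the provider configuration used later in this guide):

variable "grafana_cloud_access_policy_token" {
  description = "Token of the terraform-access-policy created in Grafana Cloud"
  type        = string
  sensitive   = true
}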
Now let's create a Grafana stack, which is a logical grouping of services such as metrics, logs, and traces:
Create Access Policy for Grafana Alloy
We also need to allow Grafana Alloy to send data to our stack. This policy defines the specific permissions (scopes) Alloy needs to push metrics, logs, and traces:
Expose Connection Details
Now we have all the tokens and URLs needed to push data from our Kubernetes clusters into the Grafana Cloud stack via Grafana Alloy. These outputs expose the connection details required by the monitoring setup:
Deploy the k8s-monitoring Helm Chart
With Grafana Cloud configured, we can now set up monitoring in our Kubernetes clusters.
⚠️
To deploy Helm charts with Terraform, you'll need to configure the Helm provider.
See the Terraform Helm provider documentation for setup instructions.
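For reference, a minimal sketch of such a provider configuration, assuming a local kubeconfig (the path and context name are illustrative, and the exact syntax depends on your Helm provider version):

provider "helm" {
  kubernetes {
    # Illustrative: point the provider at the cluster you want to monitor
    config_path    = "~/.kube/config"
    config_context = "my-cluster"
  }
}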
📌
While this guide shows deploying the Helm chart via Terraform for simplicity and to illustrate the concept, in production environments you might prefer using Kustomize or ArgoCD for managing Helm releases (these approaches offer better separation of concerns and more granular control over Kubernetes resources).
Understanding Grafana Alloy
Grafana Alloy uses a configuration language called River (similar in syntax to Terraform's HCL) to define how telemetry data is collected and processed. You could write this configuration by hand, as shown below, but it quickly becomes complex for production environments:
Here's how you would deploy such a configuration using Terraform:
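A minimal sketch of what that could look like, assuming the River configuration shown later in this guide is saved as config.alloy.tpl next to the Terraform code and that the connection details are exposed as input variables (all file, variable, and release names here are illustrative):

resource "helm_release" "alloy" {
  name             = "alloy"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "alloy"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    yamlencode({
      alloy = {
        configMap = {
          # Render the River configuration with the Grafana Cloud connection details
          content = templatefile("${path.module}/config.alloy.tpl", {
            cluster_name               = var.cluster_name
            prometheus_url             = var.prometheus_url
            prometheus_user_id         = var.prometheus_user_id
            loki_url                   = var.loki_url
            loki_user_id               = var.loki_user_id
            grafana_agent_access_token = var.grafana_alloy_access_token
          })
        }
      }
    })
  ]
}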
This approach quickly becomes tedious and time-consuming to write and maintain. Instead of hand-writing such a configuration, we'll use the k8s-monitoring Helm chart.
What the k8s-monitoring Helm Chart Deploys
The k8s-monitoring chart deploys several components that provide comprehensive monitoring of your Kubernetes cluster, with support for:
Prometheus for metrics collection.
Loki for log collection.
Tempo for distributed tracing.
Pyroscope for continuous profiling.
Core Kubernetes monitoring components, including:
kube-state-metrics for exposing Kubernetes object state metrics.
node-exporter for hardware metrics.
kubelet for exposing node and container runtime metrics.
cAdvisor for container resource usage metrics like CPU, memory, network, and disk.
Example values for the k8s-monitoring Helm chart might look like this:
Deploying with Terraform
Here's how to deploy the chart using Terraform:
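A minimal sketch, assuming the chart values shown later in this guide are stored as values.yaml.tpl and the Grafana Cloud connection details are available as input variables (names are illustrative; pin a chart version you have tested):

resource "helm_release" "k8s_monitoring" {
  name             = "k8s-monitoring"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "k8s-monitoring"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    # Render the chart values with the Grafana Cloud connection details
    templatefile("${path.module}/values.yaml.tpl", {
      cluster_name               = var.cluster_name
      prometheus_host            = var.prometheus_url
      prometheus_username        = var.prometheus_user_id
      loki_host                  = var.loki_url
      loki_username              = var.loki_user_id
      grafana_agent_access_token = var.grafana_alloy_access_token
    })
  ]
}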
After deploying, you'll see several pods in your monitoring namespace:
Customizing Your Monitoring Pipeline
When implementing monitoring across different environments and clusters, you'll often need to customize how metrics are collected and labeled. The k8s-monitoring Helm chart allows you to apply these customizations without writing complex River configurations directly. You can implement them by adding extra relabeling rules to your values:
These rules follow the same syntax as River configuration but are applied within the managed Helm chart deployment. This gives you the flexibility of the River language while keeping deployment simple and maintainable.
Managing Costs
Monitoring in the cloud comes with costs, especially when collecting data from large clusters. Here are some strategies to keep expenses under control:
Start small: enable only essential namespaces and components, beginning with critical applications and infrastructure.
Monitor usage: regularly check the usage dashboards in Grafana Cloud to understand how much data you're ingesting and which metrics are taking up the most space.
Filter metrics and logs: drop data you don't need. Use the metricsTuning settings of the k8s-monitoring chart to limit which metrics are collected:
Limit namespaces from which you collect logs:
provider "grafana" {
alias = "cloud"
cloud_access_policy_token = var.grafana_cloud_access_policy_token
}
resource "grafana_cloud_stack" "this" {
name = "Your name for Grafana Cloud Stack"
slug = "yourgcloudstack"
region_slug = "us"
}
# For access to your Grafana stack
resource "grafana_cloud_stack_service_account" "cloud_sa" {
  provider = grafana.cloud

  stack_slug  = grafana_cloud_stack.this.slug
  name        = "cloud service account"
  role        = "Admin"
  is_disabled = false
}

# Token for authenticating the service account
resource "grafana_cloud_stack_service_account_token" "cloud_sa" {
  provider = grafana.cloud

  stack_slug         = grafana_cloud_stack.this.slug
  name               = "terraform service account key"
  service_account_id = grafana_cloud_stack_service_account.cloud_sa.id
}
resource "grafana_cloud_access_policy" "grafana_alloy" {
provider = grafana.cloud
region = resource.grafana_cloud_stack.this[0].region_slug
name = "grafana-alloy-access-policy"
display_name = "Grafana Alloy Access Policy (created by Terraform)"
scopes = [
"metrics:write",
"metrics:import",
"logs:write",
"traces:write",
]
# Define which resources the policy applies to
realm {
type = "stack"
identifier = resource.grafana_cloud_stack.this[0].id
}
}
resource "grafana_cloud_access_policy_token" "grafana_alloy" {
provider = grafana.cloud
region = resource.grafana_cloud_stack.this[0].region_slug
access_policy_id = grafana_cloud_access_policy.grafana_alloy[0].policy_id
name = "grafana-alloy-access-token"
display_name = "Grafana Alloy Access Token (created by Terraform)"
}
output "alloy_access_token" {
description = "Grafana Alloy access token."
value = grafana_cloud_access_policy_token.grafana_alloy[0].token
sensitive = true
}
output "prometheus_url" {
description = "Grafana Cloud Prometheus URL."
value = resource.grafana_cloud_stack.this[0].prometheus_url
}
output "prometheus_user_id" {
description = "Grafana Cloud Prometheus user ID."
value = resource.grafana_cloud_stack.this[0].prometheus_user_id
}
output "loki_url" {
description = "Grafana Cloud Loki URL."
value = resource.grafana_cloud_stack.this[0].logs_url
}
output "loki_user_id" {
description = "Grafana Cloud Loki user ID."
value = resource.grafana_cloud_stack.this[0].logs_user_id
}
This Terraform configuration defines outputs for Prometheus (Mimir) and Loki, providing user IDs and endpoint URLs. While Grafana Cloud also supports Tempo for tracing, k6 for load testing, and more, this setup focuses solely on metrics and logs to keep things simple.
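If the Grafana Cloud resources above live in their own Terraform module (a common layout, though not required), these outputs can be wired into the cluster configuration roughly as follows; the module path and local names are illustrative:

module "grafana_cloud" {
  # Illustrative path to a module containing the resources defined above
  source = "./modules/grafana-cloud"
}

locals {
  # Connection details consumed by the monitoring deployment in the cluster
  prometheus_host            = module.grafana_cloud.prometheus_url
  prometheus_username        = module.grafana_cloud.prometheus_user_id
  loki_host                  = module.grafana_cloud.loki_url
  loki_username              = module.grafana_cloud.loki_user_id
  grafana_alloy_access_token = module.grafana_cloud.alloy_access_token
}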
//@ Prometheus Scraping
// Discover Kubernetes pods to collect metrics from
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.kubernetes
discovery.kubernetes "pod_metrics" {
role = "pod"
// Limit namespaces where pods are discovered
namespaces {
own_namespace = false // whether you want to search for Pods in the Namespace Grafana Agent Flow is running in
names = ["default", "kube-system", "monitoring", "celery"]
}
}
// Expose pod labels as metric labels
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel
discovery.relabel "pod_metrics" {
targets = discovery.kubernetes.pod_metrics.targets
rule {
action = "labelmap"
regex = "__meta_kubernetes_pod_label_(.+)"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node"
}
rule {
action = "replace"
replacement = "${cluster_name}"
target_label = "cluster"
}
}
// Scrape metrics from Kubernetes pods and send to a prometheus.remote_write component
// https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.scrape
prometheus.scrape "pod_metrics" {
targets = discovery.relabel.pod_metrics.output
forward_to = [prometheus.remote_write.default.receiver]
honor_labels = true
}
// Collect and send metrics to a Prometheus remote_write endpoint
// https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.remote_write
prometheus.remote_write "default" {
// If you have more than one endpoint to write metrics to, repeat the endpoint block for additional endpoints
endpoint {
url = "${prometheus_url}"
basic_auth {
username = "${prometheus_user_id}"
password = "${grafana_agent_access_token}"
}
}
}
//@ Logs Scraping
// Discover Kubernetes pods to collect logs from
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.kubernetes
discovery.kubernetes "pod_logs" {
role = "pod"
// Limit namespaces where pods are discovered
namespaces {
own_namespace = false // whether you want to search for Pods in the Namespace Grafana Agent Flow is running in
names = ["celery"]
}
}
// Apply relabeling for log scraping
// https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel
discovery.relabel "pod_logs" {
targets = discovery.kubernetes.pod_logs.targets
rule {
action = "labelmap"
regex = "__meta_kubernetes_pod_label_(.+)"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "__host__"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_namespace"]
target_label = "namespace"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_name"]
target_label = "pod"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_container_name"]
target_label = "container"
}
rule {
action = "replace"
replacement = "/var/log/pods/*$1/*.log"
separator = "/"
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
target_label = "__path__"
}
}
// Tail logs from Kubernetes containers using the Kubernetes API
// https://grafana.com/docs/agent/latest/flow/reference/components/loki.source.kubernetes/
loki.source.kubernetes "pods" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.write.default.receiver]
}
// Receives log entries from other loki components and sends them over the network using Loki’s logproto format
// https://grafana.com/docs/agent/latest/flow/reference/components/loki.write/
loki.write "default" {
endpoint {
url = "${loki_url}"
basic_auth {
username = "${loki_user_id}"
password = "${grafana_agent_access_token}"
}
}
}
Data collection pipeline that automatically discovers, relabels, and forwards metrics (Prometheus) and logs (Loki) from Kubernetes pods to Grafana Cloud endpoints using Grafana Alloy.
cluster:
# -- The name of this cluster, which will be set in all labels. Required.
name: ${cluster_name}
externalServices:
prometheus:
# -- Prometheus host where metrics will be sent
host: "${prometheus_host}"
basicAuth:
username: "${prometheus_username}"
password: "${grafana_agent_access_token}"
loki:
# -- Loki host where logs and events will be sent
host: "${loki_host}"
basicAuth:
username: "${loki_username}"
password: "${grafana_agent_access_token}"
# Settings related to capturing and forwarding metrics
metrics:
# -- Capture and forward metrics
enabled: true
# -- How frequently to scrape metrics
scrapeInterval: 60s
  # Annotation-based autodiscovery allows metric sources to be discovered based solely on their
  # annotations and does not require any extra configuration.
autoDiscover:
# Enable annotation-based autodiscovery
enabled: true
# -- Annotations that are used to discover and configure metric scraping targets. Add these annotations
# to your services or pods to control how autodiscovery will find and scrape metrics from your service or pod.
annotations:
# -- Annotation for enabling scraping for this service or pod. Value should be either "true" or "false"
scrape: "k8s.grafana.com/scrape"
# -- Annotation for setting or overriding the metrics path. If not set, it defaults to /metrics
metricsPath: "k8s.grafana.com/metrics.path"
# -- Annotation for setting the metrics port by number.
metricsPortNumber: "k8s.grafana.com/metrics.portNumber"
# Metrics from Grafana Alloy
alloy:
enabled: false
# Cluster object metrics from Kube State Metrics
kube-state-metrics:
enabled: true
# Node metrics from Node Exporter
node-exporter:
enabled: true
# Cluster metrics from the Kubelet
kubelet:
enabled: true
# Container metrics from cAdvisor
cadvisor:
enabled: true
# Metrics from the API Server
apiserver:
enabled: false
# Metrics from the Kube Controller Manager
kubeControllerManager:
enabled: false
# Metrics from the Kube Proxy
kubeProxy:
enabled: false
# Metrics from the Kube Scheduler
kubeScheduler:
enabled: false
# Cost related metrics from OpenCost
cost:
enabled: false
# Settings related to capturing and forwarding logs
logs:
enabled: true
# Settings for Kubernetes pod logs
pod_logs:
enabled: true
# Controls the behavior of discovering pods for logs.
# When set to "all", every pod (filtered by the namespaces list below) will have their logs gathered, but you can
# use the annotation to remove a pod from that list.
# When set to "annotation", only pods with the annotation set to true will be gathered.
# Possible values: "all" "annotation"
discovery: "all"
# The annotation to control the behavior of gathering logs from this pod. If you put this annotation on to your pod,
# it will either enable or disable auto gathering of logs from this pod.
annotation: "k8s.grafana.com/logs.autogather"
# Settings for scraping Kubernetes cluster events
cluster_events:
enabled: false
# Settings related to capturing and forwarding traces
traces:
enabled: false
# Settings for the Node Exporter deployment
# You can use this section to make modifications to the Node Exporter deployment.
# See https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter for available values.
prometheus-node-exporter:
# -- Should this helm chart deploy Node Exporter to the cluster.
# Set this to false if your cluster already has Node Exporter, or if you do
# not want to scrape metrics from Node Exporter.
enabled: true
tolerations:
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
# Settings for the Grafana Alloy instance that gathers pod logs.
# See https://github.com/grafana/alloy/tree/main/operations/helm/charts/alloy for available values.
# @ignored -- This skips including these values in README.md
alloy-logs:
controller:
type: daemonset
nodeSelector:
kubernetes.io/os: linux
# Allow this pod to be scheduled on GPU nodes
tolerations:
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
Monitoring components in the monitoring namespace, including Grafana Alloy (metrics and logs), kube-state-metrics, and node exporter.
# Settings related to capturing and forwarding metrics
metrics:
# -- Rule blocks to be added to the discovery.relabel component for all metric sources.
# See https://grafana.com/docs/agent/latest/flow/reference/components/discovery.relabel/#rule-block
extraRelabelingRules: |-
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node"
}
rule { // NB(khaykingleb): to differentiate the environments of pods in the GPU cluster
action = "replace"
source_labels = ["__meta_kubernetes_pod_label_Env"]
target_label = "env"
}
This configuration assigns a node label based on the pod’s Kubernetes node and an env label to differentiate environments in a GPU cluster, enhancing observability in Grafana Alloy.
# Cluster object metrics from Kube State Metrics
kube-state-metrics:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from Kube State Metrics to a useful, minimal set. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-kube-state-metrics
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics:
- kube_pod_status_qos_class
- kube_namespace_created
- kube_deployment_status_replicas_unavailable
- kube_pod_container_status_restarts_total
- kube_node_labels
# -- Metrics to drop. Can use regex.
excludeMetrics:
- kube_lease_owner
- kube_lease_renew_time
- kube_pod_tolerations
- kube_pod_status_ready
- kube_pod_status_scheduled
- kube_pod_owner
- kube_pod_start_time
- kube_pod_container_state_started
- kube_node_status_allocatable
- kube_node_spec_unschedulable
- kube_pod_created
- kube_pod_ips
- kube_pod_restart_policy
- kube_pod_service_account
- kube_pod_status_initialized_time
- kube_pod_status_scheduled_time
- kube_pod_status_container_ready_time
- kube_pod_status_ready_time
- kube_horizontalpodautoscaler_status_condition
- kube_replicaset_created
- kube_replicaset_metadata_generation
- kube_replicaset_spec_replicas
- kube_replicaset_status_fully_labeled_replicas
- kube_replicaset_status_observed_generation
- kube_replicaset_status_ready_replicas
- kube_replicaset_status_replicas
- kube_daemonset_created
- kube_daemonset_metadata_generation
- kube_deployment_status_condition
- kube_namespace_status_phase
- kube_endpoint_ports
- kube_endpoint_address
- kube_secret_created
- kube_secret_metadata_resource_version
# Node metrics from Node Exporter
node-exporter:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from Node Exporter to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-node-exporter
useDefaultAllowList: true
# -- Filter the list of metrics from Node Exporter to the minimal set required for Kubernetes Monitoring as well as the Node Exporter integration.
useIntegrationAllowList: false
# -- Metrics to keep. Can use regex.
includeMetrics:
- node_uname_info
- node_time_seconds
- node_boot_time_seconds
- node_cpu_core_throttles_total
- node_load1
- node_context_switches_total
- node_filefd_maximum
- node_timex_estimated_error_seconds
- node_network_receive_bytes_total
- node_network_receive_errs_total
- node_network_receive_packets_total
- node_network_receive_drop_total
- node_netstat_Tcp_CurrEstab
- node_nf_conntrack_entries
- node_disk_read_bytes_total
- node_disk_written_bytes_total
- node_disk_reads_completed_total
- node_disk_writes_completed_total
- node_disk_io_now
# -- Metrics to drop. Can use regex.
excludeMetrics:
- node_filesystem_readonly
- node_filesystem_free_bytes
- node_scrape_collector_duration_seconds
- node_scrape_collector_success
- node_cpu_guest_seconds_total
# Cluster metrics from the Kubelet
kubelet:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from the Kubelet to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-kubelet
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics: []
# -- Metrics to drop. Can use regex.
excludeMetrics:
- kubelet_pod_worker_duration_seconds_bucket
- kubelet_cgroup_manager_duration_seconds_bucket
- kubelet_runtime_operations_total
- kubelet_pleg_relist_duration_seconds_bucket
- kubelet_pleg_relist_interval_seconds_bucket
- kubelet_pod_start_duration_seconds_bucket
- rest_client_requests_total
- storage_operation_duration_seconds_count
- volume_manager_total_volumes
# Container metrics from cAdvisor
cadvisor:
# Adjustments to the scraped metrics to filter the amount of data sent to storage.
metricsTuning:
# -- Filter the list of metrics from cAdvisor to the minimal set required for Kubernetes Monitoring. See https://github.com/grafana/k8s-monitoring-helm/blob/85d679e23af4e79eeaae5f2207237e30fef06ff8/charts/k8s-monitoring/README.md#allow-list-for-cadvisor
useDefaultAllowList: true
# -- Metrics to keep. Can use regex.
includeMetrics:
- machine_cpu_cores
- container_oom_events_total
- container_network_receive_errors_total
- container_cpu_cfs_throttled_seconds_total
# -- Metrics to drop. Can use regex.
excludeMetrics:
- container_memory_cache
- container_memory_swap
- container_fs_reads_total
- container_fs_writes_total
- container_fs_reads_bytes_total
- container_fs_writes_bytes_total
logs:
pod_logs:
# -- Only capture logs from pods in these namespaces (`[]` means all namespaces)
namespaces:
- celery
- supabase