
Cluster metrics

We're going to explore how to enable CloudWatch Container Insights metrics for an EKS cluster with the ADOT Collector. The first thing we'll need to do is create a collector in our cluster to gather metrics related to various aspects of the cluster such as nodes, pods and containers.

You can view the full collector manifest below, then we'll break it down.

~/environment/eks-workshop/modules/observability/container-insights/adot/opentelemetrycollector.yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-container-ci
  namespace: other
spec:
  image: public.ecr.aws/aws-observability/aws-otel-collector:v0.40.0
  mode: daemonset
  serviceAccount: adot-collector-ci
  config:
    receivers:
      awscontainerinsightreceiver:
        add_full_pod_name_metric_label: true

    processors:
      batch/metrics:
        timeout: 60s

    exporters:
      awsemf/performance:
        namespace: ContainerInsights
        log_group_name: "/aws/containerinsights/${EKS_CLUSTER_NAME}/performance"
        log_stream_name: "{NodeName}"
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_working_set
              - node_memory_limit

          # pod metrics
          - dimensions:
              [
                [FullPodName, PodName, Namespace, ClusterName],
                [PodName, Namespace, ClusterName],
                [Service, Namespace, ClusterName],
                [Namespace, ClusterName],
                [ClusterName],
              ]
            metric_name_selectors:
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          - dimensions:
              [
                [FullPodName, PodName, Namespace, ClusterName],
                [PodName, Namespace, ClusterName],
                [ClusterName],
              ]
            metric_name_selectors:
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
          - dimensions:
              [
                [FullPodName, PodName, Namespace, ClusterName],
                [PodName, Namespace, ClusterName],
              ]
            metric_name_selectors:
              - pod_number_of_container_restarts

          # container metrics
          - dimensions:
              [
                [FullPodName, PodName, Namespace, ClusterName, ContainerName],
                [PodName, Namespace, ClusterName, ContainerName],
                [Namespace, ClusterName, ContainerName],
                [ClusterName, ContainerName],
              ]
            metric_name_selectors:
              - container_cpu_utilization
              - container_memory_utilization
              - number_of_container_restarts

          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count

          # service metrics
          - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - service_number_of_running_pods

          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization

          # namespace metrics
          - dimensions: [[Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - namespace_number_of_running_pods

    extensions:
      health_check: {}

    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf/performance]
      extensions: [health_check]

  securityContext:
    runAsUser: 0
    runAsGroup: 0

  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: HOST_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: K8S_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: "K8S_POD_NAME"
      valueFrom:
        fieldRef:
          fieldPath: "metadata.name"
  volumeMounts:
    - name: rootfs
      mountPath: /rootfs
      readOnly: true
    - name: dockersock
      mountPath: /var/run/docker.sock
      readOnly: true
    - name: containerdsock
      mountPath: /run/containerd/containerd.sock
    - name: varlibdocker
      mountPath: /var/lib/docker
      readOnly: true
    - name: sys
      mountPath: /sys
      readOnly: true
    - name: devdisk
      mountPath: /dev/disk
      readOnly: true
  volumes:
    - name: rootfs
      hostPath:
        path: /
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
    - name: varlibdocker
      hostPath:
        path: /var/lib/docker
    - name: containerdsock
      hostPath:
        path: /run/containerd/containerd.sock
    - name: sys
      hostPath:
        path: /sys
    - name: devdisk
      hostPath:
        path: /dev/disk/

We can review this in several parts to make better sense of it.

  image: public.ecr.aws/aws-observability/aws-otel-collector:v0.40.0
  mode: daemonset

The OpenTelemetry collector can run in several different modes depending on the telemetry it is collecting. In this case we'll run it as a DaemonSet so that a pod runs on each node in the EKS cluster. This allows us to collect telemetry from the node and container runtime.
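Because the collector runs as a DaemonSet, you should eventually see one collector pod per node. As an optional point of reference (not a required workshop step), you can check how many nodes the cluster has:

# optional: the collector pod count shown later should match the node count
~$ kubectl get nodes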

Next we can start to break down the collector configuration itself.

  config:
    receivers:
      awscontainerinsightreceiver:
        add_full_pod_name_metric_label: true

First we'll configure the AWS Container Insights Receiver to collect metrics from the node.

    processors:
      batch/metrics:
        timeout: 60s

Next we'll use a batch processor to reduce the number of API calls to CloudWatch, buffering metrics for up to 60 seconds before they are flushed.

    exporters:
      awsemf/performance:
        namespace: ContainerInsights
        log_group_name: "/aws/containerinsights/${EKS_CLUSTER_NAME}/performance"

And now we'll use the AWS CloudWatch EMF Exporter for OpenTelemetry Collector to convert the OpenTelemetry metrics to AWS CloudWatch Embedded Metric Format (EMF) and send them directly to CloudWatch Logs using the PutLogEvents API. The log entries will be sent to the CloudWatch Logs log group shown, and the metrics will appear in the ContainerInsights namespace. The rest of this exporter configuration is too long to show in full here, but you can review it in the complete manifest above.
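Once the collector has been deployed and has flushed its first batch of metrics (we'll do that below), you can optionally inspect the raw EMF records written to this log group. This is a sketch that assumes AWS CLI v2 (for the logs tail command) and that EKS_CLUSTER_NAME is exported in your shell, as it is elsewhere in this workshop:

# optional: assumes AWS CLI v2 and EKS_CLUSTER_NAME set in the environment
~$ aws logs tail /aws/containerinsights/$EKS_CLUSTER_NAME/performance --since 10m --format short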

      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf/performance]

Finally we need to use an OpenTelemetry pipeline to combine our receiver, processor and exporter.
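If metrics don't appear later, a reasonable first troubleshooting step is to check the collector logs for receiver or exporter errors once the DaemonSet (created below) is running, for example:

# optional troubleshooting: prints logs from one pod of the DaemonSet
~$ kubectl logs -n other daemonset/adot-container-ci-collector | grep -iE 'error|warn'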

We'll use the managed IAM policy CloudWatchAgentServerPolicy to provide the collector, via IAM Roles for Service Accounts, with the IAM permissions it needs to send the metrics to CloudWatch:

~$ aws iam list-attached-role-policies \
    --role-name eks-workshop-adot-collector-ci | jq .
{
  "AttachedPolicies": [
    {
      "PolicyName": "CloudWatchAgentServerPolicy",
      "PolicyArn": "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
    }
  ]
}
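If you're curious how this role is tied to the collector's ServiceAccount, you can also inspect the role's trust policy, which for IAM Roles for Service Accounts references the cluster's OIDC provider. This is an optional check using the same role name as above:

# optional: view the IRSA trust relationship for the collector role
~$ aws iam get-role --role-name eks-workshop-adot-collector-ci \
    --query 'Role.AssumeRolePolicyDocument' | jq .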

This IAM role will be added to the ServiceAccount for the collector:

~/environment/eks-workshop/modules/observability/container-insights/adot/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adot-collector-ci
  annotations:
    eks.amazonaws.com/role-arn: ${ADOT_IAM_ROLE_CI}

Create the resources we've explored above:

~$ kubectl kustomize ~/environment/eks-workshop/modules/observability/container-insights/adot \
    | envsubst | kubectl apply -f- && sleep 5
~$ kubectl rollout status -n other daemonset/adot-container-ci-collector --timeout=120s
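Optionally, you can confirm that the role annotation from serviceaccount.yaml was applied; the output should be the IAM role ARN substituted for ${ADOT_IAM_ROLE_CI}:

# optional: assumes the ServiceAccount was created in the other namespace
~$ kubectl get serviceaccount adot-collector-ci -n other \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'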

We can confirm that our collector is running by inspecting the Pods created by the DaemonSet:

~$ kubectl get pod -n other -l app.kubernetes.io/name=adot-container-ci-collector
NAME                               READY   STATUS    RESTARTS   AGE
adot-container-ci-collector-5lp5g  1/1     Running   0          15s
adot-container-ci-collector-ctvgs  1/1     Running   0          15s
adot-container-ci-collector-w4vqs  1/1     Running   0          15s

This shows the collector is running and collecting metrics from the cluster. To view the metrics, first open the CloudWatch console and navigate to Container Insights:

tip

Please note that:

  1. It may take a few minutes for data to start appearing in CloudWatch
  2. It is expected that some metrics are missing since they are provided by the CloudWatch agent with enhanced observability
Open CloudWatch console

[Screenshot: Container Insights in the CloudWatch console]

You can take some time to explore around the console to see the various ways that metrics are presented such as by cluster, namespace or pod.
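If you prefer the CLI, you can also list a sample of the metrics the exporter has published to the ContainerInsights namespace. This is an optional sketch that assumes EKS_CLUSTER_NAME is set in your shell; expect an empty result for the first few minutes after deployment:

# optional: assumes EKS_CLUSTER_NAME is set in the environment
~$ aws cloudwatch list-metrics --namespace ContainerInsights \
    --dimensions Name=ClusterName,Value=$EKS_CLUSTER_NAME --max-items 10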