Observability

Metrics, logs, and traces — the three pillars of observability

This project implements the three pillars of observability, following the pyramid model where each layer builds on the previous.

The Observability Pyramid

| Layer | Signal | Tool | Question it answers |
| --- | --- | --- | --- |
| Top | Traces | Grafana Tempo + OpenTelemetry | Why is it slow? Where did it fail? |
| Middle | Logs | Loki + Grafana Alloy | What happened and when? |
| Base | Metrics | Prometheus + Grafana | Is something wrong? |

The pyramid reflects signal volume and cost: metrics are cheap and always-on; logs are richer but larger; traces are the most detailed and selectively sampled.

Coverage

| Pillar | Tool(s) | What it answers | Status |
| --- | --- | --- | --- |
| Metrics | Prometheus + Grafana + Node Exporter + kube-state-metrics | Is anything wrong? What are the SLIs? | ✅ Enabled by default |
| Logs | Loki 3.x + Grafana Alloy | What happened and when? Which pod crashed? | ✅ Enabled with monitoring |
| Traces | Grafana Tempo + OpenTelemetry Collector | Why is a request slow? Which service failed? | ✅ Enabled with tracing: otel-tempo |
| Mesh Visibility | Kiali | How are services connected? What is the error rate between them? | ✅ Enabled with monitoring |

Cross-Signal Correlation (Grafana)

| From | To | How |
| --- | --- | --- |
| Log line with traceID= | Trace in Tempo | Click derived field link |
| Trace span | Logs for that service | “Logs” button in Tempo UI |
| Trace span | RED metrics in Prometheus | service.name tag link |
| Service Map (Tempo) | Prometheus rate/error/duration | Click node in graph |

Monitoring (Prometheus + Grafana)

Components

| Component | Description |
| --- | --- |
| Prometheus Operator | Manages Prometheus instances |
| Prometheus | Metrics collection and storage |
| Grafana | Visualization and dashboards |
| Node Exporter | Host metrics (CPU, memory, disk) |
| kube-state-metrics | Kubernetes object metrics |

Accessing Grafana

open https://grafana.local

Credentials:

  • Username: admin
  • Password: from Vault → vault kv get -field=grafana_admin_password secret/k8s-provisioner/api-keys

    If Vault is disabled: admin123

Accessing Prometheus

open https://prometheus.local

Accessing Alertmanager

open https://alertmanager.local

Import from grafana.com: Dashboards → Import → Enter ID → Load

Kubernetes

| ID | Dashboard | Description |
| --- | --- | --- |
| 15760 | Kubernetes / Views / Global | Cluster overview |
| 15757 | Kubernetes / Views / Namespaces | Per namespace metrics |
| 15759 | Kubernetes / Views / Pods | Pod details |
| 10000 | Kubernetes Cluster | Complete cluster view |

Node

| ID | Dashboard | Description |
| --- | --- | --- |
| 1860 | Node Exporter Full | Detailed host metrics |

Java/JVM

| ID | Dashboard | Description |
| --- | --- | --- |
| 4701 | JVM Micrometer | Spring Boot with Micrometer |
| 8563 | JVM Dashboard | JMX Exporter metrics |
| 11955 | JVM Metrics | Heap, GC, Threads |
| 14430 | JVM Overview | Complete JVM view |

Java apps need to expose metrics via Micrometer or the JMX Exporter.

Go

| ID | Dashboard | Description |
| --- | --- | --- |
| 10826 | Go Processes | Go runtime metrics |
| 6671 | Go Metrics | Goroutines, GC, Memory |
| 14061 | Go Runtime | Detailed runtime |

Go apps need to expose metrics via prometheus/client_golang.

cert-manager

| ID | Dashboard | Description |
| --- | --- | --- |
| 20842 | cert-manager | Certificate expiry, renewal, and issuer status |

Logging (Loki + Grafana Alloy)

Components

| Component | Description |
| --- | --- |
| Loki 3.x | Log aggregation and storage (TSDB schema v13) |
| Grafana Alloy | Log collector DaemonSet; replaces Promtail, collects pod logs and Kubernetes events |

Storage by Environment

| Environment | Storage | Reason |
| --- | --- | --- |
| Lab/Dev (this project) | NFS dynamic | Simple, zero configuration |
| On-premise production | Ceph / Longhorn | Distributed block storage, HA |
| Cloud production | S3 / GCS / Azure Blob | Native object storage support |
| Hybrid production | MinIO (S3-compatible) | Self-hosted S3 API |

Accessing Logs

  1. Open https://grafana.local
  2. Go to Explore → select Loki as datasource

LogQL Examples

# All logs from a namespace
{namespace="default"}

# Search for errors
{namespace="default"} |= "error"

# Filter by pod name pattern
{pod=~"nginx.*"}

# Parse JSON logs
{namespace="default"} | json | level="error"

# Count errors per pod over 5 minutes
sum by (pod) (count_over_time({namespace="default"} |= "error" [5m]))

| ID | Dashboard | Description |
| --- | --- | --- |
| 13639 | Loki Dashboard | General log overview |
| 12611 | Loki & Alloy | Logs with Alloy stats |
| 15141 | Loki Logs | Simple log viewer |
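
The LogQL examples above can also be run outside Grafana through Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint, with the time range given as Unix timestamps in nanoseconds. The base URL `http://loki.monitoring.svc:3100` and the function name are assumptions for illustration, not something this project defines; adjust them to wherever Loki is reachable in your cluster.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster address; 3100 is Loki's default HTTP port.
LOKI_BASE_URL = "http://loki.monitoring.svc:3100"


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for Loki's range-query endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# A fixed five-minute window keeps the example deterministic.
end_ns = 1_700_000_000 * 1_000_000_000
start_ns = end_ns - 5 * 60 * 1_000_000_000
url = build_query_range_url(
    LOKI_BASE_URL, '{namespace="default"} |= "error"', start_ns, end_ns
)
print(url)
```

The query string is percent-encoded by `urlencode`, so braces, quotes, and the pipe in the LogQL expression are safe to pass as-is.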

Tracing (OpenTelemetry + Grafana Tempo)

Enabled via components.tracing: "otel-tempo".

Components

| Component | Description | Port |
| --- | --- | --- |
| Grafana Tempo | Trace storage and query backend | 3200 (HTTP), 4317 (OTLP gRPC), 4318 (OTLP HTTP) |
| OpenTelemetry Collector | DaemonSet that receives and forwards traces | 4317/4318 (hostPort) |

Automatic Trace Injection (Zero Code Changes)

Layer 1 — Istio Mesh Tracing (always active)

The Istio sidecar proxy (Envoy) forwards HTTP/gRPC traces to the OTel Collector automatically for every namespace with istio-injection: enabled.

App Pod → Envoy Sidecar (auto-injected) → OTel Collector (DaemonSet) → Grafana Tempo

What you get for free:

  • Inbound/outbound HTTP spans with latency, status code, method, URL
  • Service-to-service call graph (Service Map in Grafana)
  • 100% sampling rate (configurable in Telemetry resource)

Layer 2 — OTel Operator (deep instrumentation, opt-in)

For code-level spans (database queries, function calls, custom business logic), annotate your pod:

annotations:
  # Keep only the line that matches your application's runtime
  instrumentation.opentelemetry.io/inject-java: "true"
  instrumentation.opentelemetry.io/inject-nodejs: "true"
  instrumentation.opentelemetry.io/inject-python: "true"
  instrumentation.opentelemetry.io/inject-go: "true"

Sending Traces Manually (SDK)

# From inside the cluster
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4317   # gRPC
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4318   # HTTP

# From any node (via DaemonSet hostPort)
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node-ip>:4317
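
The two in-cluster lines above are alternatives: only one endpoint is set at a time, and the transport has to match the port. The sketch below pairs each endpoint with the standard `OTEL_EXPORTER_OTLP_PROTOCOL` environment variable based on the port numbers listed in the Components table. The helper name `otlp_env` and the port-to-protocol mapping as a lookup table are illustrative assumptions; SDK defaults for the protocol differ by language, so treat this as a starting point rather than the project's own configuration.

```python
from urllib.parse import urlsplit

# Ports from the Components table: 4317 is OTLP gRPC, 4318 is OTLP HTTP.
PROTOCOL_BY_PORT = {4317: "grpc", 4318: "http/protobuf"}


def otlp_env(endpoint: str) -> dict:
    """Return the exporter environment variables for a collector endpoint."""
    port = urlsplit(endpoint).port
    if port not in PROTOCOL_BY_PORT:
        raise ValueError(f"unexpected OTLP port in {endpoint!r}: {port}")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": PROTOCOL_BY_PORT[port],
    }


for name, value in otlp_env("http://otel-collector.monitoring.svc:4317").items():
    print(f"{name}={value}")
```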

Viewing Traces in Grafana

  1. Open https://grafana.local
  2. Go to Explore → select Tempo as datasource
  3. Search by TraceID, service name, or use the Service Graph

Instrumenting a New Application

For full trace → log correlation (the “Logs for this span” button in Tempo), your application must satisfy three requirements:

1. Pod label app: <name>

Alloy automatically extracts the app label from pod labels and adds it to every log stream in Loki. Without it, Tempo cannot find the matching logs.

spec:
  template:
    metadata:
      labels:
        app: my-app   # required

2. Log format with traceID=<hex>

Include the active trace ID in every log line using logfmt key traceID. Loki’s derived fields use the regex traceID=(\w+) to create a clickable link to Tempo.

// Go
traceID := span.SpanContext().TraceID().String()
log.Printf("traceID=%s level=info msg=\"handled request\"", traceID)

# Python
import logging
from opentelemetry import trace

# trace_id is an int; format it as 32 lowercase hex characters
trace_id = trace.get_current_span().get_span_context().trace_id
logging.info(f"traceID={trace_id:032x} level=info msg=handled_request")

// Java (SLF4J + OTel SDK)
import io.opentelemetry.api.trace.Span;

String traceId = Span.current().getSpanContext().getTraceId();
logger.info("traceID={} level=info msg=handled_request", traceId);
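
To see what the derived field actually captures, the regex quoted above can be applied to a sample log line. This is a minimal sketch using only the Python standard library; the helper names and the sample trace ID are made up for illustration.

```python
import re
from typing import Optional

# The same pattern this document gives for the Loki derived field.
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 lowercase hex characters."""
    return f"{trace_id:032x}"


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the value captured by the derived-field regex, or None."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


sample_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736  # arbitrary example value
line = f"traceID={format_trace_id(sample_id)} level=info msg=handled_request"
print(extract_trace_id(line))
```

Note that `\w+` also matches decimal digits, so logging the integer trace ID without hex formatting would still produce a link, just to a trace that does not exist. That is why the Python snippet formats the ID with `:032x`.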

3. Set service.name matching the app label

When initialising the OTel tracer, set the resource attribute service.name to the same value as your pod’s app label. Tempo uses this to query Loki with {app="my-app"}.

resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceName("my-app"),   // must match pod label app=my-app
)

Full checklist

| Requirement | Why |
| --- | --- |
| Pod label app: my-app | Alloy adds it as Loki stream label |
| Log line contains traceID=<hex> | Loki derived field links log → trace |
| OTel resource service.name=my-app | Tempo queries {app="my-app"} in Loki |
| Namespace has istio-injection: enabled | Istio auto-injects sidecar for mesh tracing |
| OTLP endpoint otel-collector.monitoring.svc:4317 | Collector forwards to Tempo |
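
The first four rows of the checklist can be expressed as a small pre-deployment check. The sketch below is hypothetical: the function name, its arguments, and the stricter 32-hex-character pattern (OTel trace IDs are 16 bytes) are assumptions for illustration, and it works on plain dictionaries rather than querying the cluster.

```python
import re

# Stricter than the derived-field regex: exactly 32 lowercase hex characters.
TRACE_ID_IN_LOG = re.compile(r"traceID=[0-9a-f]{32}\b")


def correlation_problems(pod_labels, namespace_labels, service_name, sample_log_line):
    """Return a list of checklist violations; an empty list means all checks pass."""
    problems = []
    app = pod_labels.get("app")
    if not app:
        problems.append("pod has no app label")
    elif app != service_name:
        problems.append(
            f"service.name {service_name!r} does not match app label {app!r}"
        )
    if not TRACE_ID_IN_LOG.search(sample_log_line):
        problems.append("log line has no traceID=<32 hex chars> field")
    if namespace_labels.get("istio-injection") != "enabled":
        problems.append("namespace is not labelled istio-injection=enabled")
    return problems


ok = correlation_problems(
    pod_labels={"app": "my-app"},
    namespace_labels={"istio-injection": "enabled"},
    service_name="my-app",
    sample_log_line="traceID=4bf92f3577b34da6a3ce929d0e0e4736 level=info",
)
print(ok)
```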

See examples/otel-demo-app/ for a complete working reference.

Verifying Installation

kubectl get pods -n monitoring -l app=tempo
kubectl get pods -n monitoring -l app=otel-collector

# Check that Tempo reports ready (port-forward blocks, so run curl in a second terminal)
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
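
Tempo can take a moment to report ready after the pod starts, so a single curl may fail even on a healthy install. The sketch below polls the same `/ready` URL a few times using only the standard library. The function names, retry count, and delay are illustrative assumptions, and it expects the port-forward above to be running.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx answer, e.g. while Tempo is still starting
    except (urllib.error.URLError, OSError):
        return 0  # nothing listening yet, e.g. port-forward not up


def wait_until_ready(url, attempts=10, delay=1.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it answers 200; return False after the last failed attempt."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage, with the port-forward running in another terminal:
# print(wait_until_ready("http://localhost:3200/ready"))
```

The `fetch` and `sleep` parameters exist so the retry logic can be exercised without a live cluster. A 200 from `/ready` only confirms that Tempo is up; to confirm that traces are arriving, search for one in Grafana as described under Viewing Traces in Grafana.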