Observability

Metrics, logs, and traces — the three pillars of observability

This project implements the three pillars of observability, following the pyramid model where each layer builds on the previous.

The Observability Pyramid

Layer	Signal	Tool	Question it answers
Top	Traces	Grafana Tempo + OpenTelemetry	Why is it slow? Where did it fail?
Middle	Logs	Loki + Grafana Alloy	What happened and when?
Base	Metrics	Prometheus + Grafana	Is something wrong?

The pyramid reflects signal volume and cost: metrics are cheap and always-on; logs are richer but larger; traces are the most detailed and selectively sampled.

Coverage

Pillar	Tool(s)	What it answers	Status
Metrics	Prometheus + Grafana + Node Exporter + kube-state-metrics	Is anything wrong? What are the SLIs?	✅ Enabled by default
Logs	Loki 3.x + Grafana Alloy	What happened and when? Which pod crashed?	✅ Enabled with monitoring
Traces	Grafana Tempo + OpenTelemetry Collector	Why is a request slow? Which service failed?	✅ Enabled with `tracing: otel-tempo`
Mesh Visibility	Kiali	How are services connected? What is the error rate between them?	✅ Enabled with monitoring

Cross-Signal Correlation (Grafana)

From	To	How
Log line with `traceID=`	Trace in Tempo	Click derived field link
Trace span	Logs for that service	“Logs” button in Tempo UI
Trace span	RED metrics in Prometheus	`service.name` tag link
Service Map (Tempo)	Prometheus rate/error/duration	Click node in graph

Monitoring (Prometheus + Grafana)

Components

Component	Description
Prometheus Operator	Manages Prometheus instances
Prometheus	Metrics collection and storage
Grafana	Visualization and dashboards
Node Exporter	Host metrics (CPU, memory, disk)
kube-state-metrics	Kubernetes object metrics

Accessing Grafana

open https://grafana.local

Credentials:

Username: admin
Password: from Vault → vault kv get -field=grafana_admin_password secret/k8s-provisioner/api-keys
If Vault is disabled: admin123

Accessing Prometheus

open https://prometheus.local

Accessing Alertmanager

open https://alertmanager.local

Recommended Dashboards

Import from grafana.com: Dashboards → Import → Enter ID → Load

Kubernetes

ID	Dashboard	Description
`15760`	Kubernetes / Views / Global	Cluster overview
`15757`	Kubernetes / Views / Namespaces	Per namespace metrics
`15759`	Kubernetes / Views / Pods	Pod details
`10000`	Kubernetes Cluster	Complete cluster view

Node

ID	Dashboard	Description
`1860`	Node Exporter Full	Detailed host metrics

Java/JVM

ID	Dashboard	Description
`4701`	JVM Micrometer	Spring Boot with Micrometer
`8563`	JVM Dashboard	JMX Exporter metrics
`11955`	JVM Metrics	Heap, GC, Threads
`14430`	JVM Overview	Complete JVM view

Java apps need to expose metrics via Micrometer or JMX Exporter for these dashboards to show data.

Go

ID	Dashboard	Description
`10826`	Go Processes	Go runtime metrics
`6671`	Go Metrics	Goroutines, GC, Memory
`14061`	Go Runtime	Detailed runtime

Go apps need to expose metrics via prometheus/client_golang for these dashboards to show data.

cert-manager

ID	Dashboard	Description
`20842`	cert-manager	Certificate expiry, renewal, and issuer status

Logging (Loki + Grafana Alloy)

Components

Component	Description
Loki 3.x	Log aggregation and storage (TSDB schema v13)
Grafana Alloy	Log collector DaemonSet — replaces Promtail, collects pod logs and Kubernetes events

Storage by Environment

Environment	Storage	Reason
Lab/Dev (this project)	NFS dynamic	Simple, zero configuration
On-premise production	Ceph / Longhorn	Distributed block storage, HA
Cloud production	S3 / GCS / Azure Blob	Native object storage support
Hybrid production	MinIO (S3-compatible)	Self-hosted S3 API

Accessing Logs

Open https://grafana.local
Go to Explore → select Loki as datasource

LogQL Examples

# All logs from a namespace
{namespace="default"}

# Search for errors
{namespace="default"} |= "error"

# Filter by pod name pattern
{pod=~"nginx.*"}

# Parse JSON logs
{namespace="default"} | json | level="error"

# Count errors per pod over 5 minutes
sum by (pod) (count_over_time({namespace="default"} |= "error" [5m]))
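The same queries can be sent to Loki's HTTP API instead of Grafana Explore. The following is a minimal sketch that only builds the request URL for the `/loki/api/v1/query_range` endpoint; the Loki address `http://loki.monitoring.svc:3100` and the helper names are assumptions for illustration, not values taken from this project.

```python
from urllib.parse import urlencode

# Assumption: Loki's in-cluster address; adjust to your deployment.
LOKI_BASE_URL = "http://loki.monitoring.svc:3100"


def errors_per_pod(namespace, window="5m"):
    """Build the 'count errors per pod' LogQL query from the examples above."""
    return (
        f'sum by (pod) (count_over_time({{namespace="{namespace}"}} |= "error" [{window}]))'
    )


def build_query_range_url(logql, start_ns, end_ns, limit=100, base_url=LOKI_BASE_URL):
    """Return a URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"
```

Building the URL separately from sending it keeps the query easy to inspect or paste into `curl` before automating anything.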

Recommended Log Dashboards

ID	Dashboard	Description
`13639`	Loki Dashboard	General log overview
`12611`	Loki & Alloy	Logs with Alloy stats
`15141`	Loki Logs	Simple log viewer

Tracing (OpenTelemetry + Grafana Tempo)

Enabled via components.tracing: "otel-tempo".

Components

Component	Description	Port
Grafana Tempo	Trace storage and query backend	3200 (HTTP), 4317 (OTLP gRPC), 4318 (OTLP HTTP)
OpenTelemetry Collector	DaemonSet that receives and forwards traces	4317/4318 (hostPort)

Automatic Trace Injection (Zero Code Changes)

Layer 1 — Istio Mesh Tracing (always active)

The Istio sidecar proxy (Envoy) automatically forwards HTTP/gRPC traces to the OTel Collector for every namespace labeled `istio-injection: enabled`.

App Pod → Envoy Sidecar (auto-injected) → OTel Collector (DaemonSet) → Grafana Tempo

What you get for free:

Inbound/outbound HTTP spans with latency, status code, method, URL
Service-to-service call graph (Service Map in Grafana)
100% sampling rate (configurable in Telemetry resource)

Layer 2 — OTel Operator (deep instrumentation, opt-in)

For code-level spans (database queries, function calls, custom business logic), annotate your pod:

annotations:
  # Typically set only the annotation that matches your workload's language.
  instrumentation.opentelemetry.io/inject-java: "true"
  # instrumentation.opentelemetry.io/inject-nodejs: "true"
  # instrumentation.opentelemetry.io/inject-python: "true"
  # instrumentation.opentelemetry.io/inject-go: "true"

Sending Traces Manually (SDK)

# From inside the cluster (pick one; both lines set the same variable)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4317   # OTLP gRPC
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4318   # OTLP HTTP

# From any node (via DaemonSet hostPort)
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node-ip>:4317
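The gRPC and HTTP endpoints are alternatives, so an application should end up with exactly one of them. As a rough sketch, the helper below pairs each transport with its port and with the standard `OTEL_EXPORTER_OTLP_PROTOCOL` variable; the function name `otlp_env` is hypothetical, and the host and ports are assumed to be the ones listed above.

```python
# Assumption: the collector Service name and ports are the ones listed above.
COLLECTOR_HOST = "otel-collector.monitoring.svc"
OTLP_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_env(protocol="grpc", host=COLLECTOR_HOST):
    """Return the OTel SDK environment variables for one OTLP transport."""
    if protocol not in OTLP_PORTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    return {
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{OTLP_PORTS[protocol]}",
    }
```

For the hostPort route, pass the node address as `host`, for example `otlp_env("grpc", host="<node-ip>")`.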

Viewing Traces in Grafana

Open https://grafana.local
Go to Explore → select Tempo as datasource
Search by TraceID, service name, or use the Service Graph

Instrumenting a New Application

For full trace → log correlation (the “Logs for this span” button in Tempo), your application must satisfy three requirements:

1. Pod label `app: <name>`

Alloy automatically extracts the app label from pod labels and adds it to every log stream in Loki. Without it, Tempo cannot find the matching logs.

spec:
  template:
    metadata:
      labels:
        app: my-app   # required

2. Log format with `traceID=<hex>`

Include the active trace ID in every log line using the logfmt key `traceID`. Loki’s derived fields use the regex `traceID=(\w+)` to create a clickable link to Tempo.

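// Go (assumes ctx carries the active span; trace is go.opentelemetry.io/otel/trace)
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
log.Printf("traceID=%s level=info msg=\"handled request\"", traceID)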

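# Python
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO)  # the root logger drops INFO by default
trace_id = trace.get_current_span().get_span_context().trace_id
logging.info(f"traceID={trace_id:032x} level=info msg=handled_request")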

// Java (SLF4J + OTel SDK)
import io.opentelemetry.api.trace.Span;

String traceId = Span.current().getSpanContext().getTraceId();
logger.info("traceID={} level=info msg=handled_request", traceId);
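Formatting the trace ID by hand in every call is easy to forget. One possible alternative, sketched here with only the Python standard library, is a logging filter that stamps each record with the ID so the formatter emits the `traceID=` field automatically. `TraceIDFilter`, `build_logger`, and the `get_trace_id` callable are hypothetical names; in a real service the callable would wrap the OTel lookup shown in the Python snippet above, and the fixed ID below is only a stand-in.

```python
import io
import logging
import re

# The derived-field regex quoted above.
TRACE_ID_RE = re.compile(r"traceID=(\w+)")


class TraceIDFilter(logging.Filter):
    """Attach a 32-character hex trace ID to every record."""

    def __init__(self, get_trace_id):
        super().__init__()
        # Hypothetical callable returning the 128-bit trace ID as an int.
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.traceID = format(self._get_trace_id(), "032x")
        return True


def build_logger(stream, get_trace_id, name="my-app"):
    """Return a logger that writes logfmt-style lines with a traceID field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter('traceID=%(traceID)s level=%(levelname)s msg="%(message)s"')
    )
    handler.addFilter(TraceIDFilter(get_trace_id))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


# Demonstration with a fixed stand-in trace ID instead of a live span.
buf = io.StringIO()
log = build_logger(buf, lambda: 0x4BF92F3577B34DA6A3CE929D0E0E4736)
log.info("handled request")
line = buf.getvalue().strip()
match = TRACE_ID_RE.search(line)
```

Here `line` holds the emitted log line and `match.group(1)` is the trace ID the derived-field regex would extract. Note that the standard formatter prints the level in upper case (`INFO`).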

3. Set `service.name` matching the `app` label

When initialising the OTel tracer, set the resource attribute service.name to the same value as your pod’s app label. Tempo uses this to query Loki with {app="my-app"}.

resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceName("my-app"),   // must match pod label app=my-app
)

Full checklist

Requirement	Why
Pod label `app: my-app`	Alloy adds it as Loki stream label
Log line contains `traceID=<hex>`	Loki derived field links log → trace
OTel resource `service.name=my-app`	Tempo queries `{app="my-app"}` in Loki
Namespace has `istio-injection: enabled`	Istio auto-injects sidecar for mesh tracing
OTLP endpoint `otel-collector.monitoring.svc:4317`	Collector forwards to Tempo
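The checklist can also be expressed as a small pre-deployment check. This is an illustrative sketch only: `AppConfig` and `check_correlation` are hypothetical names, and the values are assumed to be collected by hand or by a script of your own rather than read from the cluster.

```python
import re
from dataclasses import dataclass

TRACE_ID_RE = re.compile(r"traceID=(\w+)")
EXPECTED_ENDPOINT = "otel-collector.monitoring.svc:4317"


@dataclass
class AppConfig:
    """Hypothetical summary of one workload's settings."""
    pod_labels: dict
    namespace_labels: dict
    service_name: str
    otlp_endpoint: str
    sample_log_line: str


def check_correlation(cfg):
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    app = cfg.pod_labels.get("app")
    if not app:
        problems.append("pod is missing the app label")
    elif cfg.service_name != app:
        problems.append(
            f"service.name {cfg.service_name!r} does not match app label {app!r}"
        )
    if not TRACE_ID_RE.search(cfg.sample_log_line):
        problems.append("log line has no traceID=<hex> field")
    if cfg.namespace_labels.get("istio-injection") != "enabled":
        problems.append("namespace is not labeled istio-injection=enabled")
    if EXPECTED_ENDPOINT not in cfg.otlp_endpoint:
        problems.append(f"OTLP endpoint does not point at {EXPECTED_ENDPOINT}")
    return problems
```

The endpoint check mirrors the gRPC row of the table; relax it if your application exports over OTLP HTTP on port 4318.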

See examples/otel-demo-app/ for a complete working reference.

Verifying Installation

kubectl get pods -n monitoring -l app=tempo
kubectl get pods -n monitoring -l app=otel-collector

# Check that Tempo is ready (run the port-forward in a separate terminal)
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
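Right after installation Tempo may need a moment before `/ready` responds successfully. A minimal polling sketch, assuming the port-forward above is running, could look like the following; `http_status` and `wait_until_ready` are hypothetical helper names, and the fetch and sleep functions are injectable so the loop can be exercised without a cluster.

```python
import time
import urllib.error
import urllib.request

# Assumption: the port-forward from the commands above is running.
TEMPO_READY_URL = "http://localhost:3200/ready"


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(url=TEMPO_READY_URL, attempts=10, delay=1.0,
                     fetch=http_status, sleep=time.sleep):
    """Poll url until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults polls the local port-forward up to ten times, one second apart. A ready endpoint only shows that Tempo is up; to confirm traces are arriving, search for one in Grafana as described under Viewing Traces in Grafana.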
