Observability
This project implements the three pillars of observability, following the pyramid model where each layer builds on the previous.
The Observability Pyramid
| Layer | Signal | Tool | Question it answers |
|---|---|---|---|
| Top | Traces | Grafana Tempo + OpenTelemetry | Why is it slow? Where did it fail? |
| Middle | Logs | Loki + Grafana Alloy | What happened and when? |
| Base | Metrics | Prometheus + Grafana | Is something wrong? |
The pyramid reflects signal volume and cost: metrics are cheap and always-on; logs are richer but larger; traces are the most detailed and selectively sampled.
Coverage
| Pillar | Tool(s) | What it answers | Status |
|---|---|---|---|
| Metrics | Prometheus + Grafana + Node Exporter + kube-state-metrics | Is anything wrong? What are the SLIs? | ✅ Enabled by default |
| Logs | Loki 3.x + Grafana Alloy | What happened and when? Which pod crashed? | ✅ Enabled with monitoring |
| Traces | Grafana Tempo + OpenTelemetry Collector | Why is a request slow? Which service failed? | ✅ Enabled with tracing: otel-tempo |
| Mesh Visibility | Kiali | How are services connected? What is the error rate between them? | ✅ Enabled with monitoring |
Cross-Signal Correlation (Grafana)
| From | To | How |
|---|---|---|
| Log line with traceID= | Trace in Tempo | Click derived field link |
| Trace span | Logs for that service | “Logs for this span” button in the Grafana trace view |
| Trace span | RED metrics in Prometheus | service.name tag link |
| Service Map (Tempo) | Prometheus rate/error/duration | Click node in graph |
Monitoring (Prometheus + Grafana)
Components
| Component | Description |
|---|---|
| Prometheus Operator | Manages Prometheus instances |
| Prometheus | Metrics collection and storage |
| Grafana | Visualization and dashboards |
| Node Exporter | Host metrics (CPU, memory, disk) |
| kube-state-metrics | Kubernetes object metrics |
Accessing Grafana
open https://grafana.local
Credentials:
- Username: admin
- Password: from Vault → vault kv get -field=grafana_admin_password secret/k8s-provisioner/api-keys
If Vault is disabled: admin123
Accessing Prometheus
open https://prometheus.local
Accessing Alertmanager
open https://alertmanager.local
Recommended Dashboards
Import from grafana.com: Dashboards → Import → Enter ID → Load
Kubernetes
| ID | Dashboard | Description |
|---|---|---|
| 15760 | Kubernetes / Views / Global | Cluster overview |
| 15757 | Kubernetes / Views / Namespaces | Per namespace metrics |
| 15759 | Kubernetes / Views / Pods | Pod details |
| 10000 | Kubernetes Cluster | Complete cluster view |
Node
| ID | Dashboard | Description |
|---|---|---|
| 1860 | Node Exporter Full | Detailed host metrics |
Java/JVM
| ID | Dashboard | Description |
|---|---|---|
| 4701 | JVM Micrometer | Spring Boot with Micrometer |
| 8563 | JVM Dashboard | JMX Exporter metrics |
| 11955 | JVM Metrics | Heap, GC, Threads |
| 14430 | JVM Overview | Complete JVM view |
Java apps need to expose metrics via Micrometer or JMX Exporter.
Go
| ID | Dashboard | Description |
|---|---|---|
| 10826 | Go Processes | Go runtime metrics |
| 6671 | Go Metrics | Goroutines, GC, Memory |
| 14061 | Go Runtime | Detailed runtime |
Go apps need to expose metrics via prometheus/client_golang.
cert-manager
| ID | Dashboard | Description |
|---|---|---|
| 20842 | cert-manager | Certificate expiry, renewal, and issuer status |
Logging (Loki + Grafana Alloy)
Components
| Component | Description |
|---|---|
| Loki 3.x | Log aggregation and storage (TSDB schema v13) |
| Grafana Alloy | Log collector DaemonSet — replaces Promtail, collects pod logs and Kubernetes events |
Storage by Environment
| Environment | Storage | Reason |
|---|---|---|
| Lab/Dev (this project) | NFS dynamic | Simple, zero configuration |
| On-premise production | Ceph / Longhorn | Distributed block storage, HA |
| Cloud production | S3 / GCS / Azure Blob | Native object storage support |
| Hybrid production | MinIO (S3-compatible) | Self-hosted S3 API |
Accessing Logs
- Open https://grafana.local
- Go to Explore → select Loki as datasource
LogQL Examples
# All logs from a namespace
{namespace="default"}
# Search for errors
{namespace="default"} |= "error"
# Filter by pod name pattern
{pod=~"nginx.*"}
# Parse JSON logs
{namespace="default"} | json | level="error"
# Count errors per pod over 5 minutes
sum by (pod) (count_over_time({namespace="default"} |= "error" [5m]))
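The same queries can also be run outside Grafana through Loki's HTTP API. The following is a minimal sketch that only builds a `query_range` request URL. It assumes Loki is reachable at a hypothetical `http://localhost:3100` (for example through a port-forward); the service name and port are not given in this document, and `build_query_range_url` is an illustrative helper.

```python
import time
from urllib.parse import urlencode

# Assumption: Loki is reachable here, e.g. through a kubectl port-forward.
LOKI_URL = "http://localhost:3100"


def build_query_range_url(logql, minutes=5, limit=100, now_ns=None, base_url=LOKI_URL):
    """Build a Loki query_range URL covering the last `minutes` minutes."""
    end_ns = now_ns if now_ns is not None else time.time_ns()
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    params = {
        "query": logql,      # any LogQL expression from the examples above
        "start": start_ns,   # nanosecond Unix epoch
        "end": end_ns,
        "limit": limit,      # maximum number of log lines returned
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


url = build_query_range_url('{namespace="default"} |= "error"')
print(url)
```

Fetching the printed URL with `curl` or `urllib.request.urlopen` should return a JSON document with the matching streams, provided the assumed address is correct for your setup.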
Recommended Log Dashboards
| ID | Dashboard | Description |
|---|---|---|
| 13639 | Loki Dashboard | General log overview |
| 12611 | Loki & Alloy | Logs with Alloy stats |
| 15141 | Loki Logs | Simple log viewer |
Tracing (OpenTelemetry + Grafana Tempo)
Enabled via components.tracing: "otel-tempo".
Components
| Component | Description | Port |
|---|---|---|
| Grafana Tempo | Trace storage and query backend | 3200 (HTTP), 4317 (OTLP gRPC), 4318 (OTLP HTTP) |
| OpenTelemetry Collector | DaemonSet that receives and forwards traces | 4317/4318 (hostPort) |
Automatic Trace Injection (Zero Code Changes)
Layer 1 — Istio Mesh Tracing (always active)
The Istio sidecar proxy (Envoy) forwards HTTP/gRPC traces to the OTel Collector automatically for every namespace with istio-injection: enabled.
What you get for free:
- Inbound/outbound HTTP spans with latency, status code, method, URL
- Service-to-service call graph (Service Map in Grafana)
- 100% sampling rate (configurable in the Telemetry resource)
Layer 2 — OTel Operator (deep instrumentation, opt-in)
For code-level spans (database queries, function calls, custom business logic), add the annotation that matches your application's language to the pod. Only one is needed per workload:
annotations:
  # pick the line for your runtime
  instrumentation.opentelemetry.io/inject-java: "true"
  instrumentation.opentelemetry.io/inject-nodejs: "true"
  instrumentation.opentelemetry.io/inject-python: "true"
  instrumentation.opentelemetry.io/inject-go: "true"
Sending Traces Manually (SDK)
# From inside the cluster (set one of the two, depending on the exporter protocol)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4317 # gRPC
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc:4318 # HTTP
# From any node (via DaemonSet hostPort)
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node-ip>:4317
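For a quick smoke test without an SDK, a single span can be posted directly to the OTLP/HTTP port. The sketch below assumes the Collector's OTLP HTTP receiver accepts JSON at the standard `/v1/traces` path. The service name `my-app`, the span name, and the helpers `build_span_payload` and `send_span` are illustrative and not part of this project.

```python
import json
import secrets
import time
import urllib.request

# In-cluster OTLP/HTTP endpoint from the section above.
OTLP_HTTP_ENDPOINT = "http://otel-collector.monitoring.svc:4318"


def build_span_payload(service_name, span_name, duration_ms=50):
    """Return an OTLP/JSON request body containing a single span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes -> 32 hex chars
        "spanId": secrets.token_hex(8),    # 8 random bytes -> 16 hex chars
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    resource = {
        "attributes": [
            {"key": "service.name", "value": {"stringValue": service_name}}
        ]
    }
    scope_spans = [{"scope": {"name": "manual-smoke-test"}, "spans": [span]}]
    return {"resourceSpans": [{"resource": resource, "scopeSpans": scope_spans}]}


def send_span(payload, endpoint=OTLP_HTTP_ENDPOINT):
    """POST the payload to the collector and return the HTTP status code."""
    req = urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_span_payload("my-app", "smoke-test")
# Print the trace ID so the span can be searched for in Tempo afterwards.
print(payload["resourceSpans"][0]["scopeSpans"][0]["spans"][0]["traceId"])
```

Calling `send_span(payload)` from a pod inside the cluster sends the span. If the receiver accepts it, the trace should be searchable in Tempo by the printed trace ID.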
Viewing Traces in Grafana
- Open https://grafana.local
- Go to Explore → select Tempo as datasource
- Search by TraceID, service name, or use the Service Graph
Instrumenting a New Application
For full trace → log correlation (the “Logs for this span” button on a trace in Grafana), your application must satisfy three requirements:
1. Pod label app: <name>
Alloy automatically extracts the app label from pod labels and adds it to every log stream in Loki. Without it, Tempo cannot find the matching logs.
spec:
template:
metadata:
labels:
app: my-app # required
2. Log format with traceID=<hex>
Include the active trace ID in every log line using logfmt key traceID. Loki’s derived fields use the regex traceID=(\w+) to create a clickable link to Tempo.
// Go
span := trace.SpanFromContext(ctx) // package go.opentelemetry.io/otel/trace
traceID := span.SpanContext().TraceID().String()
log.Printf("traceID=%s level=info msg=\"handled request\"", traceID)
# Python
import logging
from opentelemetry import trace
# trace_id is a 128-bit integer; format it as 32 hex characters
trace_id = trace.get_current_span().get_span_context().trace_id
logging.info(f"traceID={trace_id:032x} level=info msg=handled_request")
// Java (SLF4J + OTel SDK)
import io.opentelemetry.api.trace.Span;
String traceId = Span.current().getSpanContext().getTraceId();
logger.info("traceID={} level=info msg=handled_request", traceId);
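To illustrate why the hex format matters, the sketch below formats a 128-bit trace ID the way the snippets above do and then applies the derived-field regex quoted in this section. The sample trace ID is made up, and Python's `re` module only approximates what Grafana evaluates.

```python
import re

# Regex quoted above for the Loki derived field.
DERIVED_FIELD = re.compile(r"traceID=(\w+)")


def format_log_line(trace_id, msg):
    """Render a logfmt line with the trace ID as 32 lowercase hex chars."""
    return f'traceID={trace_id:032x} level=info msg="{msg}"'


def extract_trace_id(line):
    """Return the trace ID captured by the derived-field regex, or None."""
    match = DERIVED_FIELD.search(line)
    return match.group(1) if match else None


trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736  # illustrative value
line = format_log_line(trace_id, "handled request")
print(line)
print(extract_trace_id(line))
```

Note that logging the integer without `:032x` would still match `\w+`, but the captured decimal string would not be a trace ID that Tempo can look up.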
3. Set service.name matching the app label
When initialising the OTel tracer, set the resource attribute service.name to the same value as your pod’s app label. Tempo uses this to query Loki with {app="my-app"}.
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("my-app"), // must match pod label app=my-app
)
Full checklist
| Requirement | Why |
|---|---|
| Pod label app: my-app | Alloy adds it as Loki stream label |
| Log line contains traceID=<hex> | Loki derived field links log → trace |
| OTel resource service.name=my-app | Tempo queries {app="my-app"} in Loki |
| Namespace has istio-injection: enabled | Istio auto-injects sidecar for mesh tracing |
| OTLP endpoint otel-collector.monitoring.svc:4317 | Collector forwards to Tempo |
See examples/otel-demo-app/ for a complete working reference.
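The first three checklist items can be checked mechanically before deploying. The sketch below is a hypothetical pre-flight helper, not part of this project: `check_correlation` and `loki_selector` are illustrative names, and the 32-hex-character pattern is an assumption that is stricter than the derived-field regex.

```python
import re

# Assumption: trace IDs are logged as exactly 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"traceID=([0-9a-f]{32})\b")


def check_correlation(pod_labels, service_name, sample_log_line):
    """Return a list of problems that would break trace-to-log links."""
    problems = []
    app = pod_labels.get("app")
    if not app:
        problems.append("pod has no 'app' label")
    elif app != service_name:
        problems.append(f"service.name {service_name!r} != app label {app!r}")
    if not TRACE_ID_RE.search(sample_log_line):
        problems.append("log line has no traceID=<32 hex chars>")
    return problems


def loki_selector(service_name):
    """Build the LogQL stream selector described in the checklist."""
    return f'{{app="{service_name}"}}'


issues = check_correlation(
    {"app": "my-app"},
    "my-app",
    "traceID=4bf92f3577b34da6a3ce929d0e0e4736 level=info msg=ok",
)
print(issues, loki_selector("my-app"))
```

An empty list means the label, service name, and log format line up. The namespace label and OTLP endpoint from the checklist still need to be verified against the cluster.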
Verifying Installation
kubectl get pods -n monitoring -l app=tempo
kubectl get pods -n monitoring -l app=otel-collector
# Check that Tempo reports ready (run the port-forward in a separate terminal)
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
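Tempo can take a short while to report ready after startup, so a single `curl` may fail spuriously. A minimal polling sketch, assuming the port-forward above is active on `localhost:3200`; `http_ok` and `wait_until_ready` are illustrative helpers, and the retry counts are arbitrary:

```python
import time
import urllib.request


def http_ok(url, timeout=3.0):
    """Return True if `url` answers with HTTP 200, False on any I/O error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError (e.g. 503) and socket errors
        return False


def wait_until_ready(probe, attempts=10, delay=2.0):
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With the port-forward running, `wait_until_ready(lambda: http_ok("http://localhost:3200/ready"))` returns `True` once the endpoint answers 200. Passing the probe as a callable keeps the retry loop reusable for other endpoints.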