Observability — OpenTelemetry Across Services
Status: Implemented — all standard ASGI services conformant. server/observability.py deployed to rest, sentinel, parallax, cortex, and oracle. All gate on OTEL_ENABLED; Helm configmap vars in place. oracle updated to add ElasticsearchInstrumentor, Resource.create({SERVICE_NAME: ...}), instance-method instrumentor calls, and set_logging_packages=True. rest updated to use Resource.create({SERVICE_NAME: ...}).
Depends on: platform.service-contract (service contract), platform.axonis-core (axonis-core)
Milestone: P2 (cross-cutting; rolled out per service)
Purpose
Every Axonis service must emit OpenTelemetry traces for incoming requests, outgoing HTTP, Elasticsearch queries, and logs. This spec standardizes the bootstrap pattern, dependency set, environment configuration, and rollout plan.
Current adoption:
| Service | State |
|---|---|
| rest | Conformant — Flask + Elasticsearch instrumentation (reference implementation) |
| sentinel | Conformant — Elasticsearch + Redis instrumentation |
| parallax | Conformant — Elasticsearch instrumentation |
| cortex | Conformant — Elasticsearch + Redis + HTTPX instrumentation; Helm uses cortex.config.otel.* keys (maps to same OTEL_* env vars) |
| oracle | Conformant — Elasticsearch + Redis + HTTPX instrumentation |
| titan, xanadu, beacon | Out of scope — different service shapes (see §6) |
This spec defines the baseline all standard ASGI services must follow.
Required Behavior
A spec-conformant service:
- Provides a `server/observability.py` module that exports `instrument(asgi_app, fastapi_app=None)`.
- Calls `instrument(...)` from `server/__main__.create_app()` after the Starlette app is constructed and before the OAuth middleware wraps it.
- Reads `OTEL_ENABLED` (default `false`). The bootstrap is a no-op when the variable is unset, so dev runs are unaffected.
- When enabled, configures a `TracerProvider` with a `service.name` resource attribute, attaches a `BatchSpanProcessor` with `OTLPSpanExporter` (HTTP), and instruments at minimum: Starlette, FastAPI, logging, requests, aiohttp-client.
- Adds tech-specific instrumentors when the service uses the corresponding library (Elasticsearch, Redis, Flask, etc.).
- Declares every OTel package it imports as a `pyproject.toml` dependency: no runtime imports without declared dependencies.
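The call-site ordering required above can be sketched as follows. This is a minimal, dependency-free illustration only: `FakeApp`, `OAuthMiddleware`, and the stubbed `instrument` are hypothetical stand-ins, not the real Starlette app, the platform's OAuth middleware, or `server/observability.py`.

```python
"""Sketch of the create_app() ordering: build the app, instrument, then wrap."""


class FakeApp:
    """Hypothetical stand-in for the Starlette application object."""

    def __init__(self):
        self.instrumented = False


class OAuthMiddleware:
    """Hypothetical stand-in for the OAuth middleware that wraps the app."""

    def __init__(self, app):
        self.app = app


def instrument(asgi_app, fastapi_app=None):
    """Stub with the same signature as server/observability.instrument."""
    asgi_app.instrumented = True
    return asgi_app


def create_app():
    asgi_app = FakeApp()              # 1. construct the Starlette app
    asgi_app = instrument(asgi_app)   # 2. instrument the bare app
    return OAuthMiddleware(asgi_app)  # 3. only then wrap it with OAuth


wrapped = create_app()
```

One likely reason for this order is that the instrumentor needs the Starlette object itself rather than a wrapper around it, so the wrapping step comes last.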
Canonical instrument() Reference
This is the atlas implementation (atlas/server/observability.py). All services should mirror this shape; the only customisation is the instrumentor list at the bottom.
"""OpenTelemetry bootstrap. Off unless OTEL_ENABLED=true.
Callers pass the ASGI app; we return it untouched if OTel is disabled,
so the import of instrumentation packages stays cheap when unused.
"""
from server.config import config


def instrument(asgi_app, fastapi_app=None):
    if not config.otel_enabled:
        return asgi_app

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.aiohttp_client import AioHttpClientInstrumentor
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.logging import LoggingInstrumentor
    from opentelemetry.instrumentation.requests import RequestsInstrumentor
    from opentelemetry.instrumentation.starlette import StarletteInstrumentor
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    resource = Resource.create({SERVICE_NAME: config.service_name})
    trace.set_tracer_provider(TracerProvider(resource=resource))
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))

    StarletteInstrumentor().instrument_app(asgi_app)
    if fastapi_app is not None:
        FastAPIInstrumentor().instrument_app(fastapi_app)
    LoggingInstrumentor().instrument(set_logging_packages=True)
    RequestsInstrumentor().instrument()
    AioHttpClientInstrumentor().instrument()

    return asgi_app
Rationale for this shape:
- Lazy imports: when OTel is off, none of the OTel packages are imported. This saves import time and lets tests ignore them.
- Returns the same app: keeps the call site readable, as in `asgi_app = instrument(asgi_app, fastapi_app)`.
- Single function: easier to grep for OTel setup and to test.
- No singleton state: the module keeps no globals of its own. OTel's `set_tracer_provider` takes effect only once per process (a later call logs a warning and is ignored), so `instrument()` should run once per process. A dev reload that restarts the process starts from a clean slate.
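The lazy-import and same-app properties can be checked with a small test along these lines. This is a sketch under stated assumptions: `config` is a hypothetical stand-in for `server.config.config`, and the enabled branch is trimmed to a single import because it is never reached here.

```python
"""Sketch: the disabled path returns the app untouched and imports nothing."""
import sys
from types import SimpleNamespace

# Hypothetical stand-in for server.config.config with OTel switched off.
config = SimpleNamespace(otel_enabled=False, service_name="example-service")


def instrument(asgi_app, fastapi_app=None):
    if not config.otel_enabled:
        return asgi_app
    # Lazy import: only reached when OTel is enabled.
    from opentelemetry import trace  # noqa: F401
    return asgi_app


def otel_modules():
    """Names of the opentelemetry modules imported so far."""
    return {name for name in sys.modules if name.split(".")[0] == "opentelemetry"}


before = otel_modules()
app = object()
result = instrument(app)
after = otel_modules()
```

Comparing the module sets before and after the call, rather than asserting that `opentelemetry` is absent, keeps the check valid even when another test has already imported the SDK.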
Required Dependencies
Every service's pyproject.toml must include:
"opentelemetry-api>=1.24.0,<2",
"opentelemetry-sdk>=1.24.0,<2",
"opentelemetry-exporter-otlp>=1.24.0,<2",
"opentelemetry-instrumentation-fastapi>=0.45b0,<1",
"opentelemetry-instrumentation-starlette>=0.45b0,<1",
"opentelemetry-instrumentation-logging>=0.45b0,<1",
"opentelemetry-instrumentation-requests>=0.45b0,<1",
"opentelemetry-instrumentation-aiohttp-client>=0.45b0,<1",
Optional Per-Service Additions
| Library used | Add this dependency | Add this instrumentation call |
|---|---|---|
| Elasticsearch | opentelemetry-instrumentation-elasticsearch>=0.45b0,<1 | `ElasticsearchInstrumentor().instrument()` |
| Redis | opentelemetry-instrumentation-redis>=0.45b0,<1 | `RedisInstrumentor().instrument()` |
| Flask (legacy/Connexion) | opentelemetry-instrumentation-flask>=0.45b0,<1 | `FlaskInstrumentor().instrument_app(flask_app)` |
| SQLAlchemy | opentelemetry-instrumentation-sqlalchemy>=0.45b0,<1 | `SQLAlchemyInstrumentor().instrument()` |
| HTTPX | opentelemetry-instrumentation-httpx>=0.45b0,<1 | `HTTPXClientInstrumentor().instrument()` |
Service-specific add-ons:
- rest: Elasticsearch + Flask (Connexion legacy)
- oracle, cortex: Elasticsearch + Redis (sessions, cache) + HTTPX
- sentinel: Elasticsearch + Redis
- parallax: Elasticsearch
- xanadu: currently no HTTP-style ASGI instrumentation needed; trace propagation across RabbitMQ is open work (see §7)
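One possible way to keep the per-service instrumentor list declarative is a small lookup table. This is a sketch, not the code deployed in any service: `OPTIONAL_INSTRUMENTORS` and `instrument_extras` are hypothetical names, and only the module paths and class names come from the OpenTelemetry contrib packages listed above. Flask is left out because it takes the app object via `instrument_app(flask_app)`.

```python
"""Sketch: table-driven registration of optional instrumentors."""
import importlib

# Library key -> (module path, instrumentor class name).
OPTIONAL_INSTRUMENTORS = {
    "elasticsearch": ("opentelemetry.instrumentation.elasticsearch", "ElasticsearchInstrumentor"),
    "redis": ("opentelemetry.instrumentation.redis", "RedisInstrumentor"),
    "sqlalchemy": ("opentelemetry.instrumentation.sqlalchemy", "SQLAlchemyInstrumentor"),
    "httpx": ("opentelemetry.instrumentation.httpx", "HTTPXClientInstrumentor"),
}


def instrument_extras(libraries, import_module=importlib.import_module):
    """Instrument each named library and return the class names used.

    The import is deferred so that packages for unused libraries are never
    loaded. `import_module` is injectable to keep the function testable.
    """
    enabled = []
    for library in libraries:
        module_path, class_name = OPTIONAL_INSTRUMENTORS[library]
        module = import_module(module_path)
        getattr(module, class_name)().instrument()
        enabled.append(class_name)
    return enabled
```

Under this arrangement, a service such as oracle would call `instrument_extras(["elasticsearch", "redis", "httpx"])` at the bottom of `instrument()`.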
Configuration
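Standard OTEL env vars are honored as-is (no Axonis-specific names). The one exception is the OTEL_ENABLED master switch, which is a platform convention rather than a variable defined by the OpenTelemetry SDK. All services must read the same set: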
| Var | Default | Purpose |
|---|---|---|
| `OTEL_ENABLED` | `false` | Master switch. When false, `instrument()` is a no-op. |
| `OTEL_SERVICE_NAME` | repo name | Sets `service.name` resource attribute. Falls back to `config.service_name` when unset. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | unset | OTLP HTTP collector URL (e.g. http://otel-collector:4318). Required when enabled. |
| `OTEL_EXPORTER_OTLP_HEADERS` | unset | Extra headers (e.g. auth). Standard OTel format: key1=value1,key2=value2. |
| `OTEL_TRACES_SAMPLER` | `parentbased_always_on` | Sampler. Use `parentbased_traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1` for 10% sampling in production. |
| `OTEL_RESOURCE_ATTRIBUTES` | unset | Extra resource attrs (e.g. deployment.environment=development.axonis.ai). |
Services must NOT introduce custom env vars for OTEL behavior. Use the standard names so a single platform-wide configuration block in helm values applies to every chart.
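The rules in the table can be checked at startup or in CI with a small validator. The following is a sketch only: `validate_otel_env` is a hypothetical helper, and it treats only the literal value `true` as enabled, whereas the real parsing is whatever `getenv_bool` in axonis-core implements.

```python
"""Sketch: fail fast on an inconsistent OTEL_* environment."""


def validate_otel_env(env):
    """Return a list of problems found in an environment mapping."""
    problems = []
    enabled = env.get("OTEL_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return problems  # nothing else is required while OTel is off

    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is required when OTEL_ENABLED=true")

    sampler_arg = env.get("OTEL_TRACES_SAMPLER_ARG")
    if env.get("OTEL_TRACES_SAMPLER") == "parentbased_traceidratio" and sampler_arg is not None:
        try:
            ratio = float(sampler_arg)
        except ValueError:
            ratio = -1.0  # force the range check below to fail
        if not 0.0 <= ratio <= 1.0:
            problems.append("OTEL_TRACES_SAMPLER_ARG must be a number between 0 and 1")
    return problems
```

Passing `os.environ` as the mapping checks the live process; passing a plain dict keeps the function easy to unit test.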
Service Configuration File
Add this to each service's server/config.py:
import os
from dataclasses import dataclass, field

from axonis.env import getenv_bool

@dataclass
class Config:
    ...
    otel_enabled: bool = field(default_factory=lambda: getenv_bool("OTEL_ENABLED", default=False))
    service_name: str = field(default_factory=lambda: os.getenv("OTEL_SERVICE_NAME", "<repo-name>"))
Use axonis.env.getenv_bool (in axonis-core; split out of the legacy misc.py per the 2026-05-07 namespace restructure).
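The `default_factory` form means the environment is read when `Config()` is instantiated, not when the module is imported. A self-contained sketch of that behavior follows. The local `getenv_bool` is a hypothetical stand-in for `axonis.env.getenv_bool` (its accepted truthy strings are an assumption), and `example-service` is a placeholder for the repo name.

```python
"""Sketch: default_factory defers the environment lookup to instantiation."""
import os
from dataclasses import dataclass, field


def getenv_bool(name, default=False):
    """Hypothetical stand-in for axonis.env.getenv_bool; real parsing may differ."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class Config:
    otel_enabled: bool = field(default_factory=lambda: getenv_bool("OTEL_ENABLED", default=False))
    service_name: str = field(default_factory=lambda: os.getenv("OTEL_SERVICE_NAME", "example-service"))


# With both variables unset, the defaults apply.
os.environ.pop("OTEL_ENABLED", None)
os.environ.pop("OTEL_SERVICE_NAME", None)
defaults = Config()

# Values set after import are still picked up by a new instance.
os.environ["OTEL_ENABLED"] = "true"
os.environ["OTEL_SERVICE_NAME"] = "oracle"
overridden = Config()

# Leave the process environment as we found it.
os.environ.pop("OTEL_ENABLED", None)
os.environ.pop("OTEL_SERVICE_NAME", None)
```

This is what lets tests toggle `OTEL_ENABLED` per test case by constructing a fresh `Config` rather than reloading the module.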
Helm Chart Conventions
Each service's charts/<service>/values.yaml must expose an observability block, defaulting to disabled:
observability:
  enabled: false
  endpoint: ""
  sampler: parentbased_traceidratio
  samplerArg: "0.1"
  resourceAttributes:
    deployment.environment: development
The deployment template translates this to OTEL_* env vars on the pod. Do not invent chart-local env var names.
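The translation itself lives in the chart's deployment template, not in Python. The sketch below only illustrates the intended mapping from the values block to pod environment variables; `observability_env` is a hypothetical function, and the key-to-variable mapping is an assumption inferred from the names used in this spec.

```python
"""Sketch: how an observability values block could map to OTEL_* env vars."""


def observability_env(values):
    """Translate the chart's observability block into pod environment variables."""
    env = {"OTEL_ENABLED": "true" if values.get("enabled") else "false"}
    if values.get("endpoint"):
        env["OTEL_EXPORTER_OTLP_ENDPOINT"] = values["endpoint"]
    if values.get("sampler"):
        env["OTEL_TRACES_SAMPLER"] = values["sampler"]
    if values.get("samplerArg"):
        env["OTEL_TRACES_SAMPLER_ARG"] = str(values["samplerArg"])
    attrs = values.get("resourceAttributes") or {}
    if attrs:
        # Standard OTel format: comma-separated key=value pairs.
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in sorted(attrs.items())
        )
    return env


chart_defaults = {
    "enabled": False,
    "endpoint": "",
    "sampler": "parentbased_traceidratio",
    "samplerArg": "0.1",
    "resourceAttributes": {"deployment.environment": "development"},
}
pod_env = observability_env(chart_defaults)
```

With the chart defaults, the empty `endpoint` produces no `OTEL_EXPORTER_OTLP_ENDPOINT` entry, which is harmless while `OTEL_ENABLED` is `false`.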
Rollout Plan
All standard ASGI services are now conformant. Completed in order:
| Service | Status |
|---|---|
| atlas | Done — reference implementation |
| sentinel | Done |
| parallax | Done |
| cortex | Done |
| oracle | Done — added ElasticsearchInstrumentor, fixed Resource.create, instance-method calls, set_logging_packages=True |
| rest | Done — fixed Resource.create({SERVICE_NAME: ...}) |
For new services forked from atlas, conformance is automatic. The checklist for new services:
1. Add OTel dependencies to pyproject.toml (copy from atlas)
2. Copy server/observability.py from atlas; add service-specific instrumentors
3. Call instrument(asgi_app, fastapi_app) in server/__main__.create_app
4. Add otel_enabled and service_name to server/config.py
5. Add observability: block to charts/<service>/values.yaml
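Step 1 of the checklist can be verified mechanically. The sketch below assumes the caller has already loaded the dependency strings from `pyproject.toml`; `missing_otel_packages` and its helpers are hypothetical names, and the package list is copied from the Required Dependencies section.

```python
"""Sketch: check checklist step 1 against a list of dependency strings."""
import re

REQUIRED_OTEL_PACKAGES = (
    "opentelemetry-api",
    "opentelemetry-sdk",
    "opentelemetry-exporter-otlp",
    "opentelemetry-instrumentation-fastapi",
    "opentelemetry-instrumentation-starlette",
    "opentelemetry-instrumentation-logging",
    "opentelemetry-instrumentation-requests",
    "opentelemetry-instrumentation-aiohttp-client",
)


def normalize(name):
    """Normalize a distribution name as PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_names(dependencies):
    """Extract the distribution name from each requirement string."""
    names = set()
    for requirement in dependencies:
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
        if match:
            names.add(normalize(match.group(1)))
    return names


def missing_otel_packages(dependencies):
    """Return the required OTel packages that are not declared."""
    declared = declared_names(dependencies)
    return [pkg for pkg in REQUIRED_OTEL_PACKAGES if normalize(pkg) not in declared]
```

The check compares names only; it does not validate the version bounds, which would need a proper requirement parser.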
Service Shapes That Differ
These services do not match the standard ASGI pattern:
| Service | Notes |
|---|---|
| titan | No server/. Once added (per component.titan.runtime open item), it conforms. Until then, OTEL via library calls only. |
| xanadu | RabbitMQ-based, no HTTP request entry point. Trace propagation across RMQ requires aio-pika instrumentation (opentelemetry-instrumentation-aio-pika); experimental as of this spec. Tracked separately. |
| beacon | Angular SPA + FastAPI proxy. Backend follows this spec; frontend traces are out of scope. |
| conduit | Newly added; follow this spec from the start. |
Open Items
- xanadu trace context propagation over RabbitMQ. The `aio-pika` instrumentor is in OTel contrib but not stable; requires evaluation before adoption.
- OTEL metrics (counters, histograms) are out of scope for this revision. This spec covers traces and logs only. A follow-up will add `MeterProvider` setup once a metrics backend is chosen.
- Centralized log enrichment: `LoggingInstrumentor` injects `trace_id`/`span_id` into log records, but the platform's structured-log fields are not yet aligned with that envelope. Coordinate with axonis-core's `logger.py` owners to settle on a single shape.
Depends on: platform.axonis-core, platform.service-contract
Required by: component.oracle.apollo, component.postern.proxy, platform.devops-cicd