Observability — OpenTelemetry Across Services

Status: Implemented; all standard ASGI services are conformant. server/observability.py is deployed to rest, sentinel, parallax, cortex, and oracle. All gate on OTEL_ENABLED; Helm configmap vars are in place. oracle was updated to add ElasticsearchInstrumentor, Resource.create({SERVICE_NAME: ...}), instance-method instrumentor calls, and set_logging_packages=True. rest was updated to use Resource.create({SERVICE_NAME: ...}).

Depends on: platform.service-contract (service contract), platform.axonis-core (axonis-core)

Milestone: P2 (cross-cutting; rolled out per service)

Purpose

Every Axonis service must emit OpenTelemetry traces for incoming requests, outgoing HTTP calls, and Elasticsearch queries, and must correlate its log records with those traces. This spec standardizes the bootstrap pattern, dependency set, environment configuration, and rollout plan.

Current adoption:

| Service | State |
| --- | --- |
| rest | Conformant: Flask + Elasticsearch instrumentation (reference implementation) |
| sentinel | Conformant: Elasticsearch + Redis instrumentation |
| parallax | Conformant: Elasticsearch instrumentation |
| cortex | Conformant: Elasticsearch + Redis + HTTPX instrumentation; Helm uses cortex.config.otel.* keys (these map to the same OTEL_* env vars) |
| oracle | Conformant: Elasticsearch + Redis + HTTPX instrumentation |
| titan, xanadu, beacon | Out of scope: different service shapes (see "Service Shapes That Differ") |

This spec defines the baseline all standard ASGI services must follow.

Required Behavior

A spec-conformant service:

  1. Provides a server/observability.py module that exports instrument(asgi_app, fastapi_app=None).
  2. Calls instrument(...) from server/__main__.create_app() after the Starlette app is constructed and before the OAuth middleware wraps it.
  3. Reads OTEL_ENABLED (default false) — the bootstrap is a no-op when unset, so dev runs are unaffected.
  4. When enabled, configures a TracerProvider with service.name resource attribute, attaches a BatchSpanProcessor with OTLPSpanExporter (HTTP), and instruments at minimum: Starlette, FastAPI, logging, requests, aiohttp-client.
  5. Adds tech-specific instrumentors when the service uses the corresponding library (Elasticsearch, Redis, Flask, etc).
  6. Declares every OTel package it imports in pyproject.toml: no runtime imports without declared dependencies.
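
The ordering required by step 2 can be sketched as follows. This is a minimal illustration rather than the atlas code: OAuthMiddleware and inner_app are hypothetical stand-ins, and instrument() is stubbed as a pass-through so the snippet runs without any OTel package installed.

```python
"""Sketch of the create_app() ordering: build the app, instrument it, then wrap it."""


def instrument(asgi_app, fastapi_app=None):
    # Stand-in for server.observability.instrument; a pass-through here.
    return asgi_app


class OAuthMiddleware:
    # Hypothetical placeholder for the service's real OAuth middleware.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        await self.app(scope, receive, send)


async def inner_app(scope, receive, send):
    # Stands in for the Starlette application object.
    await send({"type": "http.response.start", "status": 204, "headers": []})
    await send({"type": "http.response.body", "body": b""})


def create_app():
    asgi_app = inner_app               # 1. construct the app
    asgi_app = instrument(asgi_app)    # 2. instrument the bare app
    return OAuthMiddleware(asgi_app)   # 3. only then add the OAuth wrapper


app = create_app()
```

A likely reason for this order is that the Starlette instrumentor needs the Starlette object itself; once the app is wrapped, only an opaque ASGI callable is visible from the outside.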

Canonical instrument() Reference

This is the atlas implementation (atlas/server/observability.py). All services should mirror this shape; the only customisation is the instrumentor list at the bottom.

"""OpenTelemetry bootstrap. Off unless OTEL_ENABLED=true.

Callers pass the ASGI app; we return it untouched if OTel is disabled,
so the import of instrumentation packages stays cheap when unused.
"""
from server.config import config


def instrument(asgi_app, fastapi_app=None):
    if not config.otel_enabled:
        return asgi_app

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.aiohttp_client import AioHttpClientInstrumentor
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.logging import LoggingInstrumentor
    from opentelemetry.instrumentation.requests import RequestsInstrumentor
    from opentelemetry.instrumentation.starlette import StarletteInstrumentor
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Tag every span with this service's name.
    resource = Resource.create({SERVICE_NAME: config.service_name})
    trace.set_tracer_provider(TracerProvider(resource=resource))
    # Export in batches over OTLP/HTTP; the endpoint comes from the OTEL_EXPORTER_OTLP_* env vars.
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))

    # Incoming requests.
    StarletteInstrumentor().instrument_app(asgi_app)
    if fastapi_app is not None:
        FastAPIInstrumentor().instrument_app(fastapi_app)
    # Logs and outgoing HTTP. Service-specific instrumentors go below this line.
    LoggingInstrumentor().instrument(set_logging_packages=True)
    RequestsInstrumentor().instrument()
    AioHttpClientInstrumentor().instrument()

    return asgi_app

Rationale for this shape:

  • Lazy imports — when OTEL is off, none of the OTel packages are imported. Saves import time and lets tests ignore them.
  • Returns the same app — keeps the call site readable: asgi_app = instrument(asgi_app, fastapi_app).
  • Single function — easier to grep for OTEL setup and to test.
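  • No singleton state of its own: instrument() keeps no module-level state. Note that OTel's set_tracer_provider takes effect only once per process (later calls log a warning and are ignored), so instrument() should be called once per process. A dev reload that restarts the process starts with a fresh provider.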
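
The lazy-import and pass-through properties listed above are easy to check in a unit test. The sketch below uses a trimmed copy of instrument() and a SimpleNamespace stand-in for server.config.config (both are assumptions made so the snippet is self-contained); it records which modules the disabled path imports.

```python
import sys
from types import SimpleNamespace

# Stand-in for server.config.config with OTel switched off.
config = SimpleNamespace(otel_enabled=False, service_name="example-service")


def instrument(asgi_app, fastapi_app=None):
    if not config.otel_enabled:
        return asgi_app
    # Reached only when enabled; the real module imports the full set here.
    from opentelemetry import trace  # noqa: F401
    return asgi_app


async def asgi_app(scope, receive, send):
    """Placeholder ASGI callable; never invoked in this check."""


modules_before = set(sys.modules)
result = instrument(asgi_app)
newly_imported = set(sys.modules) - modules_before

# True when the disabled path pulled in no OTel modules.
otel_untouched = not any(
    name.split(".")[0] == "opentelemetry" for name in newly_imported
)
```

Comparing sys.modules before and after the call keeps the check valid even if some other test has already imported OTel in the same process.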

Required Dependencies

Every service's pyproject.toml must include:

"opentelemetry-api>=1.24.0,<2",
"opentelemetry-sdk>=1.24.0,<2",
"opentelemetry-exporter-otlp>=1.24.0,<2",
"opentelemetry-instrumentation-fastapi>=0.45b0,<1",
"opentelemetry-instrumentation-starlette>=0.45b0,<1",
"opentelemetry-instrumentation-logging>=0.45b0,<1",
"opentelemetry-instrumentation-requests>=0.45b0,<1",
"opentelemetry-instrumentation-aiohttp-client>=0.45b0,<1",

Optional Per-Service Additions

| Library used | Add this dependency | Add this instrumentation call |
| --- | --- | --- |
| Elasticsearch | opentelemetry-instrumentation-elasticsearch>=0.45b0,<1 | ElasticsearchInstrumentor().instrument() |
| Redis | opentelemetry-instrumentation-redis>=0.45b0,<1 | RedisInstrumentor().instrument() |
| Flask (legacy/Connexion) | opentelemetry-instrumentation-flask>=0.45b0,<1 | FlaskInstrumentor().instrument_app(flask_app) |
| SQLAlchemy | opentelemetry-instrumentation-sqlalchemy>=0.45b0,<1 | SQLAlchemyInstrumentor().instrument() |
| HTTPX | opentelemetry-instrumentation-httpx>=0.45b0,<1 | HTTPXClientInstrumentor().instrument() |

Service-specific add-ons:

  • rest: Elasticsearch + Flask (Connexion legacy)
  • oracle, cortex: Redis (sessions, cache) + Elasticsearch + HTTPX
  • sentinel: Elasticsearch + Redis
  • parallax: Elasticsearch
  • xanadu: currently no HTTP-style ASGI instrumentation needed; trace propagation across RabbitMQ is open work (see "Open Items")
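
As an illustration of the add-ons, the tail of instrument() for a service that uses Elasticsearch, Redis, and HTTPX might look like the sketch below. The helper name instrument_service_extras is hypothetical (the spec places these calls directly at the bottom of instrument()); the import paths are those of the corresponding OTel contrib packages.

```python
def instrument_service_extras():
    """Hypothetical helper: the service-specific tail of instrument()."""
    # Imports stay inside the function so the disabled path never loads them.
    from opentelemetry.instrumentation.elasticsearch import ElasticsearchInstrumentor
    from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
    from opentelemetry.instrumentation.redis import RedisInstrumentor

    ElasticsearchInstrumentor().instrument()
    RedisInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
```

Each call patches the client library globally, so it only needs to run once, after the tracer provider has been set.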

Configuration

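Apart from the OTEL_ENABLED master switch, which is an Axonis convention rather than part of the OpenTelemetry specification, services use the standard OTEL_* environment variables as-is, with no Axonis-specific names. All services must read the same set: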

| Var | Default | Purpose |
| --- | --- | --- |
| OTEL_ENABLED | false | Master switch. When false, instrument() is a no-op. |
| OTEL_SERVICE_NAME | repo name | Sets the service.name resource attribute. Read into config.service_name, which defaults to the repo name when the variable is unset. |
| OTEL_EXPORTER_OTLP_ENDPOINT | unset | OTLP HTTP collector URL (e.g. http://otel-collector:4318). Required when enabled. |
| OTEL_EXPORTER_OTLP_HEADERS | unset | Extra headers (e.g. auth). Standard OTel format: key1=value1,key2=value2. |
| OTEL_TRACES_SAMPLER | parentbased_always_on | Sampler. Use parentbased_traceidratio with OTEL_TRACES_SAMPLER_ARG=0.1 for 10% sampling in production. |
| OTEL_RESOURCE_ATTRIBUTES | unset | Extra resource attrs (e.g. deployment.environment=development.axonis.ai). |

Services must NOT introduce custom env vars for OTEL behavior. Use the standard names so a single platform-wide configuration block in helm values applies to every chart.
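
Because the endpoint is required only when OTel is enabled, a startup or CI check can catch a half-configured environment early. The following is a sketch under stated assumptions: otel_config_problems is a hypothetical helper that is not part of this spec, and the set of strings treated as "true" is a guess at how the switch is parsed.

```python
import os

# Assumption: values treated as "enabled"; the real parser may differ.
TRUTHY = {"1", "true", "yes", "on"}


def otel_config_problems(env=None):
    """Return a list of human-readable problems with the OTEL_* variables."""
    env = os.environ if env is None else env
    if env.get("OTEL_ENABLED", "false").strip().lower() not in TRUTHY:
        return []  # disabled: nothing else is consulted

    problems = []
    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append(
            "OTEL_EXPORTER_OTLP_ENDPOINT is required when OTEL_ENABLED is true"
        )

    sampler = env.get("OTEL_TRACES_SAMPLER", "parentbased_always_on")
    if sampler.endswith("traceidratio"):
        try:
            ratio = float(env.get("OTEL_TRACES_SAMPLER_ARG"))
        except (TypeError, ValueError):
            problems.append("OTEL_TRACES_SAMPLER_ARG should be a number in [0, 1]")
        else:
            if not 0.0 <= ratio <= 1.0:
                problems.append("OTEL_TRACES_SAMPLER_ARG should be within [0, 1]")
    return problems


example = otel_config_problems({"OTEL_ENABLED": "true"})
```

Passing a plain dict instead of reading os.environ directly keeps the check easy to test.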

Service Configuration File

Add this to each service's server/config.py:

import os
from dataclasses import dataclass, field

from axonis.env import getenv_bool

@dataclass
class Config:
    ...
    otel_enabled: bool = field(default_factory=lambda: getenv_bool("OTEL_ENABLED", default=False))
    service_name: str = field(default_factory=lambda: os.getenv("OTEL_SERVICE_NAME", "<repo-name>"))

Use axonis.env.getenv_bool (in axonis-core; split out of the legacy misc.py per the 2026-05-07 namespace restructure).
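
For readers without axonis-core at hand, a helper with the same call shape could look like the sketch below. This is not the axonis-core implementation: the accepted spellings and the fallback for unrecognised values are assumptions.

```python
import os

# Assumption: accepted spellings; the real helper may recognise a different set.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def getenv_bool(name, default=False):
    """Read an environment variable as a boolean, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # unrecognised text: keep the caller's default
```

Wrapping the call in field(default_factory=...) means the variable is read when Config() is instantiated rather than at import time, which lets tests set the environment first.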

Helm Chart Conventions

Each service's charts/<service>/values.yaml must expose an observability block, defaulting to disabled:

observability:
  enabled: false
  endpoint: ""
  sampler: parentbased_traceidratio
  samplerArg: "0.1"
  resourceAttributes:
    deployment.environment: development

The deployment template translates this to OTEL_* env vars on the pod. Do not invent chart-local env var names.
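
The translation performed by the deployment template can be modelled in a few lines. The sketch below is only a model of the expected result: the real mapping lives in each chart's Helm template, which this spec does not show, so the key-to-variable pairing here is an assumption based on the values block and the variable table above.

```python
def observability_env(values):
    """Map a chart's `observability` values block to OTEL_* env vars."""
    obs = values.get("observability", {})
    env = {"OTEL_ENABLED": "true" if obs.get("enabled") else "false"}
    if obs.get("endpoint"):
        env["OTEL_EXPORTER_OTLP_ENDPOINT"] = obs["endpoint"]
    if obs.get("sampler"):
        env["OTEL_TRACES_SAMPLER"] = obs["sampler"]
    if obs.get("samplerArg") not in (None, ""):
        env["OTEL_TRACES_SAMPLER_ARG"] = str(obs["samplerArg"])
    attrs = obs.get("resourceAttributes") or {}
    if attrs:
        # Standard OTel format: comma-separated key=value pairs.
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in attrs.items()
        )
    return env


# The default values block from this section.
default_values = {
    "observability": {
        "enabled": False,
        "endpoint": "",
        "sampler": "parentbased_traceidratio",
        "samplerArg": "0.1",
        "resourceAttributes": {"deployment.environment": "development"},
    }
}
default_env = observability_env(default_values)
```

With the defaults, the empty endpoint is omitted and OTEL_ENABLED stays false, so the pod behaves exactly like an uninstrumented one.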

Rollout Plan

All standard ASGI services are now conformant. Completed in order:

| Service | Status |
| --- | --- |
| atlas | Done: reference implementation |
| sentinel | Done |
| parallax | Done |
| cortex | Done |
| oracle | Done: added ElasticsearchInstrumentor, fixed Resource.create, instance-method calls, set_logging_packages=True |
| rest | Done: fixed Resource.create({SERVICE_NAME: ...}) |

For new services forked from atlas, conformance is automatic. The checklist for new services:

  1. Add OTel dependencies to pyproject.toml (copy from atlas).
  2. Copy server/observability.py from atlas; add service-specific instrumentors.
  3. Call instrument(asgi_app, fastapi_app) in server/__main__.create_app.
  4. Add otel_enabled and service_name to server/config.py.
  5. Add an observability: block to charts/<service>/values.yaml.
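
The first checklist item can be verified mechanically. The sketch below takes a service's dependency strings (as they would appear in pyproject.toml) and reports which of the required OTel packages are missing. The function names are hypothetical, and version ranges are deliberately not checked.

```python
import re

# Package names from the "Required Dependencies" list.
REQUIRED_OTEL_PACKAGES = (
    "opentelemetry-api",
    "opentelemetry-sdk",
    "opentelemetry-exporter-otlp",
    "opentelemetry-instrumentation-fastapi",
    "opentelemetry-instrumentation-starlette",
    "opentelemetry-instrumentation-logging",
    "opentelemetry-instrumentation-requests",
    "opentelemetry-instrumentation-aiohttp-client",
)


def declared_names(dependencies):
    """Extract normalized package names from requirement strings."""
    names = set()
    for requirement in dependencies:
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
        if match:
            # Treat runs of '-', '_' and '.' as equivalent and ignore case.
            names.add(re.sub(r"[-_.]+", "-", match.group(1)).lower())
    return names


def missing_otel_packages(dependencies):
    """Return the required OTel packages that are not declared."""
    declared = declared_names(dependencies)
    return [name for name in REQUIRED_OTEL_PACKAGES if name not in declared]


example_dependencies = [
    "starlette>=0.37",
    "opentelemetry-api>=1.24.0,<2",
    "opentelemetry-sdk>=1.24.0,<2",
]
missing = missing_otel_packages(example_dependencies)
```

A check like this could run in CI for each service repository, failing the build when the returned list is non-empty.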

Service Shapes That Differ

These services do not match the standard ASGI pattern:

| Service | Notes |
| --- | --- |
| titan | No server/. Once added (per component.titan.runtime open item), it conforms. Until then, OTEL via library calls only. |
| xanadu | RabbitMQ-based, no HTTP request entry point. Trace propagation across RMQ requires aio-pika instrumentation (opentelemetry-instrumentation-aio-pika); experimental as of this spec. Tracked separately. |
| beacon | Angular SPA + FastAPI proxy. Backend follows this spec; frontend traces are out of scope. |
| conduit | Newly added; follow this spec from the start. |
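
To make the xanadu case concrete: propagating a trace over a message queue means carrying the trace context in message headers, typically as a W3C traceparent entry. The sketch below shows only that wire format using the standard library. It is not what the aio-pika instrumentor does internally, and a real service would more likely rely on the instrumentor or on OTel's propagation API than on hand-rolled helpers like these, whose names are hypothetical.

```python
import re
import secrets

# version "00", 32 hex trace id, 16 hex parent span id, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_traceparent(headers, trace_id, span_id, sampled=True):
    """Write a W3C traceparent entry into a message-headers dict."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"
    return headers


def extract_traceparent(headers):
    """Return (trace_id, span_id, sampled), or None if absent or malformed."""
    match = _TRACEPARENT.match(str(headers.get("traceparent", "")))
    if not match:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2), 16)
    if trace_id == 0 or span_id == 0:
        return None  # all-zero ids are invalid
    sampled = bool(int(match.group(3), 16) & 0x01)
    return trace_id, span_id, sampled


# Producer side: the ids would come from the active span; random here.
trace_id = secrets.randbits(128) or 1
span_id = secrets.randbits(64) or 1
message_headers = inject_traceparent({}, trace_id, span_id)

# Consumer side: recover the context before starting the child span.
recovered = extract_traceparent(message_headers)
```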

Open Items

  • xanadu trace context propagation over RabbitMQ. The aio-pika instrumentor is in OTel contrib but not stable; requires evaluation before adoption.
  • OTEL metrics (counters, histograms) are out of scope for this revision. This spec covers traces and logs only. A follow-up will add MeterProvider setup once a metrics backend is chosen.
  • Centralized log enrichment. LoggingInstrumentor injects trace_id / span_id into log records, but the platform's structured-log fields are not yet aligned with that envelope. Coordinate with axonis-core's logger.py owners to settle on a single shape.
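
For the log-enrichment item, one possible alignment is a formatter that copies the attributes LoggingInstrumentor sets on each record (otelTraceID and otelSpanID) into the structured-log payload. The sketch below is an assumption-laden illustration: the output field names trace_id and span_id are placeholders for whatever shape is eventually agreed, and the record attributes are set by hand so the snippet runs without OTel installed.

```python
import json
import logging


class TraceJsonFormatter(logging.Formatter):
    """Hypothetical JSON formatter that surfaces OTel ids under neutral keys."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "0" mirrors the value used when no span is active.
            "trace_id": getattr(record, "otelTraceID", "0"),
            "span_id": getattr(record, "otelSpanID", "0"),
        }
        return json.dumps(payload)


# Simulate a record as it would look after instrumentation.
record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, "query done", None, None)
record.otelTraceID = "4bf92f3577b34da6a3ce929d0e0e4736"
record.otelSpanID = "00f067aa0ba902b7"

line = TraceJsonFormatter().format(record)
```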

Depends on: platform.axonis-core, platform.service-contract

Required by: component.oracle.apollo, component.postern.proxy, platform.devops-cicd