
DevOps — CI/CD, Helm, and Docker Conventions

Status: Proposed (2026-06-03). Documents the current ci-components catalog (v1.22.5) as the golden baseline and codifies the conventions every component service must conform to. Conformance gaps are to be closed via /sdd-spec-gap against each repo.

Depends on: platform.axonis-core, platform.service-contract (service contract: config file location, chart structure), platform.service-configuration (service configuration: AxonisSettings, env-var naming)

Relates to: platform.ingress-routing (ingress routing: deploy domains), platform.observability (observability)

Milestone: P2 (all component services must conform; net-new services conform from day one)

Purpose

Every Axonis component service ships the same way: lint + test, scan, build a Python wheel, build a Docker image, package a Helm chart, version via conventional commits, and deploy to Kubernetes via ArgoCD. This spec defines the single standard for that pipeline so the nine component services (axonis-core, fedai-rest, oracle, cortex, parallax, sentinel, xanadu, geodex, titan) are built, packaged, and deployed identically.

It has three parts:

  1. CI/CD: the shared GitLab CI components in ci-components are the only sanctioned pipeline building blocks. A service's .gitlab-ci.yml is an inclusion list, not a place to hand-author jobs.
  2. Helm: chart structure and values conventions. Reuse the shared base first, and never prefix a values/env key with the package name (the <SERVICE>_PORT exception aside).
  3. Docker: one multi-stage build pattern off the shared atlasfml-base image, uv with frozen locks, non-root runtime.

This spec applies to both net-new services (follow it from the start) and existing services (migrate to it). Where a service legitimately diverges (titan's GPU lineage, xanadu's wheel build), the divergence must be named here, not invented per-repo.

Scope — the component services

| Service | GitLab path | Deployable image | Helm chart |
| --- | --- | --- | --- |
| axonis-core | product-development/axonis-core | no (library, wheel only) | no |
| fedai-rest | fedai-rest | yes | charts/fedai-rest |
| oracle | product-development/oracle | yes | charts/oracle |
| cortex | axonis-intelligence/cortex | yes | charts/cortex |
| parallax | product-development/parallax | yes | charts/parallax |
| sentinel | product-development/sentinel | yes | charts/sentinel |
| xanadu | product-development/xanadu | yes | chart/xanaqu |
| geodex | product-development/geodex | yes | charts/geodex |
| titan | product-development/atlas-fl-titan | yes (GPU) | n/a |

axonis-core is the upstream: it publishes a wheel, and its release fans out to every downstream service (see #cicd.fanout).


Part 1 — CI/CD via ci-components

The rule: include components, do not hand-author jobs

ci-components (gitlab.axonis.ai/federated-ml-platform/product-development/ci-components, currently v1.22.5) is the catalog. A component service's .gitlab-ci.yml MUST be assembled from these components and MUST NOT redefine their jobs inline. Pin to a released version tag in production lines; @~latest is permitted only on a service's own development branches.

The catalog (golden baseline as of v1.22.5)

| Component | Stage | Purpose | Key inputs |
| --- | --- | --- | --- |
| workflow.yml | (project include) | Shared workflow:rules (.axonis_workflow_rules): skip draft MRs, skip semantic-release bot commits, allow web runs. Must be a project include:, not a component:, because workflow: is not honored from components. | |
| qa | qa | Ruff lint + format check, pytest with coverage (cobertura + JUnit). Defines qa-code-analysis, the job every packaging stage lists under needs:. | source_dirs, python_image, cache_key, sync_extras |
| security | security | Container scanning, gemnasium dependency scanning, semgrep SAST, secret detection, GitLab advanced SAST. | enable_container_scanning |
| package-python | package | Builds + publishes wheels: python-dev (${BASE}.dev${PIPELINE_ID}, MR channel) and python-release (from tag). | python_image, cache_key |
| package-docker | package | Buildx images: docker-branch (MR), docker-staging (main), docker-release (tag → internal + Docker Hub). | base_image_tag |
| package-helm | package | helm-devel (MR, on charts/ change) and helm-stable (main/tag, version-change-gated). | chart_name |
| semantic-release | release | python-semantic-release on conventional commits: bumps project.version, writes CHANGELOG.md, tags v{version}. | python_image |
| deploy-branch | deploy | Per-MR ArgoCD environment at ${CI_COMMIT_REF_SLUG}.axonis.ai, with paired stop- action. | image_registry, helm_image_prefix |
| deploy-staging | deploy | Main-branch rollout to development.axonis.ai (ArgoCD patch + rollout restart). Requires deploy-branch included first (shares .deploy-image). | |
| axonis-core-fanout | downstream | On an axonis-core tag, triggers downstream rebuilds (oracle, cortex, parallax, sentinel, fedai-rest, conduit, prism, forge, geodex, beacon). | needs_job, target_branch |
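
Two names in the catalog are derived rather than configured: the MR-channel wheel version (${BASE}.dev${PIPELINE_ID}) and the per-MR environment host (${CI_COMMIT_REF_SLUG}.axonis.ai). The sketch below shows how those strings come out for a given branch. The function names are hypothetical, and ref_slug only approximates GitLab's documented slug rules (lowercase, anything outside a-z and 0-9 becomes "-", truncated to 63 characters, no leading or trailing "-"); in a real pipeline GitLab supplies the slug itself.

```python
import re


def dev_wheel_version(base: str, pipeline_id: int) -> str:
    """MR-channel wheel version in the ${BASE}.dev${PIPELINE_ID} shape."""
    return f"{base}.dev{pipeline_id}"


def ref_slug(ref_name: str) -> str:
    """Approximation of CI_COMMIT_REF_SLUG (assumption: mirrors GitLab's rules)."""
    slug = re.sub(r"[^a-z0-9]", "-", ref_name.lower())
    return slug[:63].strip("-")


def branch_environment_host(ref_name: str, domain: str = "axonis.ai") -> str:
    """Host that deploy-branch would serve the MR environment on."""
    return f"{ref_slug(ref_name)}.{domain}"


if __name__ == "__main__":
    print(dev_wheel_version("1.4.0", 98765))
    print(branch_environment_host("feature/ABC-12_fix"))
```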

Mandatory pipeline shape

Every deployable component service MUST include, in this dependency order:

stages: [qa, security, package, release, deploy]

include:
  - project: 'federated-ml-platform/product-development/ci-components'
    ref: <pinned-tag>
    file: '/templates/workflow.yml'
  - component: gitlab.axonis.ai/federated-ml-platform/product-development/ci-components/qa@<tag>
    inputs: { source_dirs: "<pkg>/ server/" }
  - component: .../ci-components/security@<tag>
  - component: .../ci-components/package-python@<tag>        # if it ships a wheel
  - component: .../ci-components/package-docker@<tag>
  - component: .../ci-components/package-helm@<tag>
    inputs: { chart_name: "<service>" }
  - component: .../ci-components/semantic-release@<tag>
  - component: .../ci-components/deploy-branch@<tag>
  - component: .../ci-components/deploy-staging@<tag>
  • qa-code-analysis is the universal gate: every package-* job and semantic-release lists it under needs:.
  • axonis-core includes only workflow + qa + security + package-python + semantic-release + axonis-core-fanout (no docker/helm/deploy — it is a library).
  • developers-environment (spec-docs site) includes only workflow + qa + security + package-docker + package-helm (no package-python/semantic-release/deploy-* — a docs image, not a versioned service).
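
The pipeline shape above lends itself to a mechanical check. The following is a minimal sketch, not part of ci-components: check_includes is a hypothetical helper that takes the already-parsed include: list (a list of dicts) and reports a missing workflow.yml project include, missing mandatory components, ~latest references, and ordering problems. package-python is treated as optional because only wheel-shipping services include it.

```python
# Mandatory components for a deployable service, in the order the spec lists them.
MANDATORY = (
    "qa", "security", "package-docker", "package-helm",
    "semantic-release", "deploy-branch", "deploy-staging",
)


def check_includes(includes):
    """Return a list of problems found in a parsed `include:` list."""
    problems = []
    seen = []
    has_workflow = False
    for entry in includes:
        # workflow.yml must arrive as a project include, not as a component.
        if "project" in entry and entry.get("file") == "/templates/workflow.yml":
            has_workflow = True
            continue
        component = entry.get("component")
        if not component:
            continue
        if "@" not in component:
            problems.append(f"{component}: no @<tag> reference")
            continue
        path, ref = component.rsplit("@", 1)
        name = path.rsplit("/", 1)[-1]
        if ref == "~latest":
            problems.append(f"{name}: uses ~latest instead of a pinned tag")
        seen.append(name)

    if not has_workflow:
        problems.append("workflow.yml project include is missing")
    problems += [f"{m}: mandatory component is missing" for m in MANDATORY if m not in seen]

    # Compare the order in which mandatory components appear with the required order.
    present = list(dict.fromkeys(n for n in seen if n in MANDATORY))
    expected = [m for m in MANDATORY if m in seen]
    if present != expected:
        problems.append("mandatory components are not in the required order")
    return problems
```

An empty result means the include list matches the mandatory shape; it says nothing about hand-authored jobs elsewhere in the file, which would need a separate check.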

Versioning contract

  • Version of record is project.version in pyproject.toml.
  • Releases are driven by conventional commits on main (feat → minor, fix/perf → patch).
  • semantic-release commits chore(release): {version} and tags v{version}; the bot commit is skipped by workflow.yml to avoid loops.
  • A chore(release)/bot commit must never itself trigger a release pipeline.
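
As an illustration of the bump rules above, here is a small sketch of how a set of commit subjects maps to the next version. It is not python-semantic-release itself; next_version is a hypothetical function that covers only the mappings stated in this spec (feat → minor, fix/perf → patch) and deliberately leaves out anything the spec does not mention, such as breaking-change handling.

```python
import re

# Conventional-commit subject: type, optional (scope), optional "!", then ": ".
_SUBJECT = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]*)\))?!?: ")
_LEVELS = {"feat": 2, "fix": 1, "perf": 1}  # 2 = minor bump, 1 = patch bump


def next_version(current, subjects):
    """Return the next version string, or None when no commit warrants a release."""
    level = 0
    for subject in subjects:
        match = _SUBJECT.match(subject)
        if match is None:
            continue  # not a conventional commit: ignored
        if match["type"] == "chore" and match["scope"] == "release":
            continue  # the bot's own release commit never triggers a release
        level = max(level, _LEVELS.get(match["type"], 0))
    if level == 0:
        return None
    major, minor, patch = (int(part) for part in current.split("."))
    if level == 2:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    print(next_version("1.22.5", ["fix: retry on 502", "feat(ci): add sync_extras"]))
```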

axonis-core fanout

When axonis-core publishes a release tag, axonis-core-fanout triggers each downstream service's target_branch pipeline. Downstream qa, package-python, and package-docker jobs detect the upstream trigger (CI_PIPELINE_SOURCE == pipeline) and rebuild with uv sync --upgrade-package axonis-core (and a Docker PIPELINE_ID cache-bust), so every service re-pins to the new axonis-core without a manual bump. New downstream services MUST be added to the fanout trigger list.
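
The trigger detection described above amounts to a single branch on CI_PIPELINE_SOURCE. A minimal sketch follows, assuming a hypothetical helper uv_sync_command; the real jobs pass additional flags (extras, dev groups) that are omitted here.

```python
import os


def uv_sync_command(pipeline_source: str) -> list:
    """Pick the uv invocation for a job, based on how the pipeline was started."""
    if pipeline_source == "pipeline":
        # Started by the axonis-core fanout: re-resolve axonis-core only.
        return ["uv", "sync", "--upgrade-package", "axonis-core"]
    # Ordinary push/MR pipeline: install exactly what the lock file says.
    return ["uv", "sync", "--frozen"]


if __name__ == "__main__":
    source = os.environ.get("CI_PIPELINE_SOURCE", "")
    print(" ".join(uv_sync_command(source)))
```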

Registries & auth (informational)

  • Internal PyPI: https://gitlab.axonis.ai/api/v4/projects/405/packages/pypi/simple (axonis-core), auth gitlab-ci-token + CI_JOB_TOKEN.
  • Docker: internal ${CI_REGISTRY} for branch/staging; release double-pushes internal + ${DOCKERIO_REGISTRY} (Docker Hub).
  • Helm: GitLab Helm registry, devel and stable channels.
  • Deploy: ArgoCD via kube context federated-ml-platform/cluster-management:shared-agent; secrets decrypted with SOPS age key.

Part 2 — Helm chart conventions

Chart layout

  • A deployable service's chart lives at charts/<service>/ (canonical). xanadu's chart/xanaqu/ is a known deviation pending rename.
  • Every chart depends on bitnami common (currently 2.31.4) and consumes its globals: global.imageRegistry, global.imagePullSecrets, global.defaultStorageClass.
  • Service-specific values are nested under a single top-level block named for the service (oracle:, cortex:, parallax:), never sprayed at the chart root.

Rule A — reuse the shared base first

A chart MUST consume shared definitions before declaring its own:

  1. Image registry resolves through the global: .Values.<svc>.image.registry | default .Values.global.imageRegistry.
  2. Config field names map to the AxonisSettings env-var names from platform.service-configuration. Shared server fields are emitted unprefixed: HOST, LOG_LEVEL, WORKERS, DEBUG.
  3. Common templates (labels, pod/container security context 1001:1001, /health probes, worker-derived resource requests) come from the shared library helpers, not re-invented per chart.
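
The registry fallback in item 1 can be read as "service value if set, otherwise the global". The sketch below mirrors that lookup over a values dict; resolve_image_registry is a hypothetical name, and the `or` mirrors Helm's default function, which treats an empty string as unset.

```python
def resolve_image_registry(values: dict, service: str) -> str:
    """Mimic `.Values.<svc>.image.registry | default .Values.global.imageRegistry`."""
    svc_block = values.get(service) or {}
    svc_registry = (svc_block.get("image") or {}).get("registry") or ""
    global_registry = (values.get("global") or {}).get("imageRegistry") or ""
    # An empty service-level registry falls through to the global one.
    return svc_registry or global_registry


if __name__ == "__main__":
    values = {
        "global": {"imageRegistry": "registry.axonis.ai"},
        "oracle": {"image": {"registry": ""}},
    }
    print(resolve_image_registry(values, "oracle"))
```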

Gap (P2): no shared Axonis Helm library chart exists today; each service re-declares podSecurityContext, probes, and resource helpers. The golden state is a shared library chart (proposed home: a charts/axonis-common library, or argo-baseline) that every service chart imports. Until it exists, charts must at minimum reuse the bitnami common globals and the platform.service-configuration env names.

Rule B — no <PACKAGE_NAME>_ prefix on keys

Env-var keys rendered into the ConfigMap/Secret MUST NOT be prefixed with the chart's own service name. Concretely:

| Pattern | Verdict | Example |
| --- | --- | --- |
| Shared server field | unprefixed | HOST, LOG_LEVEL, WORKERS, DEBUG |
| This service's port | only sanctioned self-prefix | ORACLE_PORT, CORTEX_PORT, GEODEX_PORT |
| External dependency | prefixed by the dependency, not this service | ELASTIC_HOST, REDIS_PORT, SSO_CLIENT_SECRET |
| Internal subsystem | prefixed by the subsystem, not the service | APOLLO_LLM_PROVIDER, CORTEX_LLM_MODEL |
| Self-prefixed shared field | forbidden | ~~PARALLAX_HOST~~, ~~SENTINEL_LOG_LEVEL~~ |

Rationale: AxonisSettings reads HOST/LOG_LEVEL/WORKERS/DEBUG unprefixed; a <SERVICE>_-prefixed key would simply never be read. <SERVICE>_PORT is the single exception because ports are per-service and the central port registry (platform.service-configuration) keys on <SERVICE>_PORT.

The current charts already conform to Rule B (audited 2026-06-03 — no violations). This spec freezes that as a hard rule and a package-helm lint target.
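
A lint for Rule B only needs to flag the shared server fields when they carry the chart's own service prefix. The following is one possible shape for such a check, not the package-helm implementation: rule_b_violations is a hypothetical name, and deriving the prefix by upper-casing the service name and replacing "-" with "_" (fedai-rest → FEDAI_REST_) is an assumption.

```python
# Shared server fields that AxonisSettings reads unprefixed.
SHARED_FIELDS = ("HOST", "LOG_LEVEL", "WORKERS", "DEBUG")


def rule_b_violations(service: str, env_keys: list) -> list:
    """Return the env keys that self-prefix a shared server field."""
    prefix = service.upper().replace("-", "_") + "_"  # assumed naming scheme
    forbidden = {prefix + field for field in SHARED_FIELDS}
    # <SERVICE>_PORT and dependency/subsystem prefixes are not in `forbidden`,
    # so they pass untouched.
    return [key for key in env_keys if key in forbidden]


if __name__ == "__main__":
    keys = ["HOST", "PARALLAX_PORT", "PARALLAX_HOST", "REDIS_PORT"]
    print(rule_b_violations("parallax", keys))
```

Matching against an explicit forbidden set, rather than rejecting every key that starts with the service name, keeps the port exception and same-named internal subsystems (for example CORTEX_LLM_MODEL) out of the failure set.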

Secrets

Charts use the existingSecret + inline secrets: fallback pattern (per [[feedback_keep_secret_defaults_in_values]] — keep inline defaults; ArgoCD overrides at deploy time). Do not strip inline secret values from values.yaml.
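
To make the fallback concrete, the sketch below shows the decision a chart template effectively makes. It assumes the usual reading of the pattern (a non-empty existingSecret wins; otherwise the chart renders its own Secret from the inline secrets: block). The function name secret_plan and the returned dict shape are hypothetical, and each chart's actual key layout may differ.

```python
def secret_plan(svc_values: dict, default_name: str) -> dict:
    """Decide which Secret a workload references and whether the chart renders it."""
    existing = svc_values.get("existingSecret") or ""
    if existing:
        # An externally managed Secret is referenced as-is; nothing is rendered.
        return {"secret_name": existing, "render_inline": False, "data": {}}
    # Fallback: render a chart-owned Secret from the inline defaults.
    inline = dict(svc_values.get("secrets") or {})
    return {"secret_name": default_name, "render_inline": True, "data": inline}


if __name__ == "__main__":
    print(secret_plan({"secrets": {"SSO_CLIENT_SECRET": "changeme"}}, "oracle"))
    print(secret_plan({"existingSecret": "oracle-prod"}, "oracle"))
```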

Known chart inconsistencies to reconcile (P2)

  • Port placement differs (service.ports.http vs gateway.containerPorts.http vs server.containerPorts.https) — standardize on service.ports.http.
  • global.imageRegistry default differs ("" for most vs registry.axonis.ai for xanadu).
  • Probe verbosity and LLM-config nesting differ across charts — standardize on the explicit probe form and a single llm: block shape.
  • Rename xanadu's chart dir from chart/xanaqu/ to charts/xanadu/.

Part 3 — Docker image build standard

The standard pattern (golden)

oracle, parallax, sentinel, geodex are 100% conformant and define the reference. Every deployable service image MUST:

  1. Base off registry.axonis.ai/axonisai/atlasfml-base:py310-v4.0.0 (the current pinned base), parameterized via ARG BASE / IMAGE / TAG.
  2. Be multi-stage — a builder (alias base) stage that installs deps + strips build toolchain, and a clean final stage that copies /app from the builder.
  3. Install dependencies with uv against a frozen lock: uv sync --frozen --no-dev --extra service, with an axonis-core fanout fallback (uv sync --upgrade-package axonis-core ...) gated on a PIPELINE_ID cache-bust arg. Private axonis-core is fetched via a transient ~/.netrc carrying PYPI_TOKEN, removed in the same layer.
  4. Strip the build toolchain in the builder stage (dnf remove gcc/cmake/GDAL-dev, clean caches).
  5. Run as non-root UID 1001; build steps that need root switch USER 0 then back to 1001.
  6. WORKDIR /app; ENV PATH=/app/.venv/bin:$PATH, PYTHONPATH=/app, PYTHONUNBUFFERED=1.
  7. Ship entrypoint.sh (TLS CA-bundle trust setup) as the ENTRYPOINT; CMD launches python -m server.
  8. EXPOSE <SERVICE_PORT> and define a HEALTHCHECK against /health (enabled, not commented).
  9. Ship a .dockerignore (exclude .venv, .git, tests/, *.md, charts/, examples).
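
Several of the requirements above can be spot-checked from the Dockerfile text alone. The sketch below is a line-based heuristic, not a Dockerfile parser: dockerfile_findings is a hypothetical helper that checks only the multi-stage, final-user, workdir, and healthcheck items, and it does not follow line continuations.

```python
def dockerfile_findings(text: str) -> list:
    """Return conformance findings for a Dockerfile given as a string."""
    instructions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # commented-out instructions do not count
        keyword, _, arg = line.partition(" ")
        instructions.append((keyword.upper(), arg.strip()))

    findings = []
    # Item 2: at least a builder stage and a final stage.
    if sum(1 for kw, _ in instructions if kw == "FROM") < 2:
        findings.append("not multi-stage")
    # Item 5: the last USER instruction decides the runtime user.
    users = [arg for kw, arg in instructions if kw == "USER"]
    if not users or users[-1].split(":")[0] != "1001":
        findings.append("final USER is not 1001")
    # Item 6: WORKDIR /app.
    if ("WORKDIR", "/app") not in instructions:
        findings.append("WORKDIR /app missing")
    # Item 8: HEALTHCHECK must be present and not commented out.
    if not any(kw == "HEALTHCHECK" for kw, _ in instructions):
        findings.append("HEALTHCHECK missing or commented out")
    return findings
```

An empty list means none of the four checked items failed; the remaining items (frozen uv install, toolchain cleanup, entrypoint, .dockerignore) still need review.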

Image tags are set by package-docker: branch slug (MR), branch slug (staging), CI_COMMIT_TAG (release). The Dockerfile does not embed a version.

Sanctioned deviations (must stay documented here)

| Service | Deviation | Reason |
| --- | --- | --- |
| titan | base atlas-fl-athena:staging; skips toolchain cleanup; /fedai workdir; ports 5000/9000; CPU-vs-GPU torch split | GPU/ML serving lineage. titan is the only service that pulls GPU torch ([[project_training_lives_in_titan]]); others use the pytorch-cpu index. |
| xanadu | base federated-ml-platform/atlasfml-base:py310-v3.0.9; uv pip install wheel build; /fedai workdir; no healthcheck | Federation infra, wheel-distribution model. Should converge to v4.0.0 base + /app + healthcheck. |
| parallax examples/ | python:3.10-slim, pip, single-stage | Local dev/test only; .dockerignored out of the production image. Never a production target. |
| spec-docs (developers-environment) | base python:3.12-slim both stages; no atlasfml-base | Docs site + corpus chat service; final stage is the spec-docs Python service (uvicorn) serving statics + chat, /health probe per chart. |

Conformance gaps to close (P2)

  • fedai-rest, cortex: enable their (currently commented-out) HEALTHCHECK.
  • sentinel, xanadu, titan: add a .dockerignore.
  • xanadu: migrate to atlasfml-base:py310-v4.0.0, /app workdir, uv sync --frozen, healthcheck.
  • Base-image version drift: pin all non-GPU services to the same atlasfml-base tag.

Conformance summary (2026-06-03 baseline)

| Concern | Conformant today | Gap |
| --- | --- | --- |
| Pipeline via ci-components | all 9 | keep fanout list current as services are added |
| Helm Rule B (no self-prefix) | all charts | freeze as package-helm lint |
| Helm Rule A (shared-first) | partial | no shared library chart yet |
| Docker standard | oracle, parallax, sentinel, geodex (100%) | fedai-rest/cortex healthcheck; xanadu base+layout; titan documented exception |

Acceptance criteria

  1. Each deployable component service's .gitlab-ci.yml includes the mandatory component set (#cicd.pipeline-shape) at a pinned tag, with no hand-authored duplicates of catalog jobs.
  2. Each chart conforms to Rule A and Rule B; package-helm fails on a <SERVICE>_-prefixed shared key.
  3. Each deployable image follows the #docker standard or is one of the named, documented deviations.
  4. A shared Axonis Helm library chart exists and is consumed by every service chart (closes the Rule A gap).
  5. New component services satisfy 1–3 on their first MR.

Depends on: platform.axonis-core, platform.ingress-routing, platform.observability, platform.service-configuration, platform.service-contract