DevOps — CI/CD, Helm, and Docker Conventions
Status: Proposed (2026-06-03). Documents the current ci-components catalog (v1.22.5) as the golden baseline and codifies the conventions every component service must conform to. Conformance gaps to be closed via /sdd-spec-gap against each repo.
Depends on: platform.axonis-core, platform.service-contract (service contract — config file location, chart structure), platform.service-configuration (service configuration — AxonisSettings, env-var naming)
Relates to: platform.ingress-routing (ingress routing — deploy domains), platform.observability (observability)
Milestone: P2 (all component services must conform; net-new services conform from day one)
Purpose
Every Axonis component service ships the same way: lint + test, scan, build a Python wheel, build a Docker image, package a Helm chart, version via conventional commits, and deploy to Kubernetes via ArgoCD. This spec defines the single standard for that pipeline so the nine component services (axonis-core, fedai-rest, oracle, cortex, parallax, sentinel, xanadu, geodex, titan) are built, packaged, and deployed identically.
It has three parts:
- CI/CD: the shared GitLab CI components in ci-components are the only sanctioned pipeline building blocks. A service's .gitlab-ci.yml is an inclusion list, not a place to hand-author jobs.
- Helm: chart structure and values conventions. Reuse the shared base first, and never prefix a values/env key with the package name (the SERVICE_PORT exception aside).
- Docker: one multi-stage build pattern off the shared atlasfml-base image, uv with frozen locks, non-root runtime.
This spec applies to both net-new services (follow it from the start) and existing services (migrate to it). Where a service legitimately diverges (titan's GPU lineage, xanadu's wheel build), the divergence must be named here, not invented per-repo.
Scope — the component services
| Service | GitLab path | Deployable image | Helm chart |
|---|---|---|---|
| axonis-core | product-development/axonis-core | no (library — wheel only) | no |
| fedai-rest | fedai-rest | yes | charts/fedai-rest |
| oracle | product-development/oracle | yes | charts/oracle |
| cortex | axonis-intelligence/cortex | yes | charts/cortex |
| parallax | product-development/parallax | yes | charts/parallax |
| sentinel | product-development/sentinel | yes | charts/sentinel |
| xanadu | product-development/xanadu | yes | chart/xanaqu |
| geodex | product-development/geodex | yes | charts/geodex |
| titan | product-development/atlas-fl-titan | yes (GPU) | n/a |
axonis-core is the upstream: it publishes a wheel, and its release fans out to every downstream
service (see #cicd.fanout).
Part 1 — CI/CD via ci-components
The rule: include components, do not hand-author jobs
ci-components (gitlab.axonis.ai/federated-ml-platform/product-development/ci-components, currently
v1.22.5) is the catalog. A component service's .gitlab-ci.yml MUST be assembled from these
components and MUST NOT redefine their jobs inline. Pin to a released version tag in
production lines; @~latest is permitted only on a service's own development branches.
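The pinning rule lends itself to a mechanical check. The following is a minimal sketch, not part of the catalog: the function names are hypothetical, and the pattern for a "released version tag" (v1.22.5 or 1.22.5) is an assumption inferred from the version quoted above.

```python
import re

# Assumption: a released catalog tag looks like "v1.22.5" or "1.22.5".
RELEASE_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def component_ref(component: str) -> str:
    """Return the ref after the last '@' in a component reference."""
    _, sep, ref = component.rpartition("@")
    if not sep:
        raise ValueError(f"component reference has no @<ref>: {component!r}")
    return ref


def ref_allowed(component: str, production: bool) -> bool:
    """Released tags are always allowed; ~latest only outside production."""
    ref = component_ref(component)
    if RELEASE_TAG.match(ref):
        return True
    return ref == "~latest" and not production
```

A check like this would reject a branch name such as main as a ref in every context, since it is neither a released tag nor ~latest.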
The catalog (golden baseline as of v1.22.5)
| Component | Stage | Purpose | Key inputs |
|---|---|---|---|
| workflow.yml (project include) | n/a | Shared workflow: rules (.axonis_workflow_rules): skip draft MRs, skip semantic-release bot commits, allow web runs. Must be a project include:, not a component:, because workflow: is not honored from components. | none |
| qa | qa | Ruff lint + format check, pytest with coverage (cobertura + JUnit). Defines qa-code-analysis, the job every packaging stage needs:. | source_dirs, python_image, cache_key, sync_extras |
| security | security | Container scanning, gemnasium dependency scanning, semgrep SAST, secret detection, GitLab advanced SAST. | enable_container_scanning |
| package-python | package | Builds + publishes wheels. python-dev (${BASE}.dev${PIPELINE_ID}, MR channel) and python-release (from tag). | python_image, cache_key |
| package-docker | package | Buildx images: docker-branch (MR), docker-staging (main), docker-release (tag → internal + Docker Hub). | base_image_tag |
| package-helm | package | helm-devel (MR, on charts/ change) and helm-stable (main/tag, version-change-gated). | chart_name |
| semantic-release | release | python-semantic-release on conventional commits: bumps project.version, writes CHANGELOG.md, tags v{version}. | python_image |
| deploy-branch | deploy | Per-MR ArgoCD environment at ${CI_COMMIT_REF_SLUG}.axonis.ai, with paired stop- action. | image_registry, helm_image_prefix |
| deploy-staging | deploy | Main-branch rollout to development.axonis.ai (ArgoCD patch + rollout restart). Requires deploy-branch included first (shares .deploy-image). | none |
| axonis-core-fanout | downstream | On an axonis-core tag, triggers downstream rebuilds (oracle, cortex, parallax, sentinel, fedai-rest, conduit, prism, forge, geodex, beacon). | needs_job, target_branch |
Mandatory pipeline shape
Every deployable component service MUST include, in this dependency order:
stages: [qa, security, package, release, deploy]
include:
  - project: 'federated-ml-platform/product-development/ci-components'
    ref: <pinned-tag>
    file: '/templates/workflow.yml'
  - component: gitlab.axonis.ai/federated-ml-platform/product-development/ci-components/qa@<tag>
    inputs: { source_dirs: "<pkg>/ server/" }
  - component: .../ci-components/security@<tag>
  - component: .../ci-components/package-python@<tag>   # if it ships a wheel
  - component: .../ci-components/package-docker@<tag>
  - component: .../ci-components/package-helm@<tag>
    inputs: { chart_name: "<service>" }
  - component: .../ci-components/semantic-release@<tag>
  - component: .../ci-components/deploy-branch@<tag>
  - component: .../ci-components/deploy-staging@<tag>
- qa-code-analysis is the universal gate; package-* and semantic-release declare needs: on it.
- axonis-core includes only workflow + qa + security + package-python + semantic-release + axonis-core-fanout (no docker/helm/deploy, because it is a library).
- developers-environment (spec-docs site) includes only workflow + qa + security + package-docker + package-helm (no package-python/semantic-release/deploy-*, because it is a docs image, not a versioned service).
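The mandatory shape for a deployable service can be audited from the component references alone. The sketch below is illustrative only: the helper names are hypothetical, package-python is treated as optional (it applies only when a wheel is shipped), and the workflow.yml project include is not checked because it is not a component.

```python
import re

# Mandatory components for a deployable service, in the order given above.
DEPLOYABLE = [
    "qa", "security", "package-docker", "package-helm",
    "semantic-release", "deploy-branch", "deploy-staging",
]

# Assumption: component references end in ci-components/<name>@<ref>.
COMPONENT_NAME = re.compile(r"ci-components/([a-z-]+)@")


def included_components(refs):
    """Extract component names from a list of component references."""
    names = []
    for ref in refs:
        match = COMPONENT_NAME.search(ref)
        if match:
            names.append(match.group(1))
    return names


def shape_problems(refs, required=DEPLOYABLE):
    """Report missing components and ordering errors (duplicates also trip the order check)."""
    names = included_components(refs)
    problems = [f"missing component: {r}" for r in required if r not in names]
    present = [n for n in names if n in required]
    expected = [r for r in required if r in names]
    if present != expected:
        problems.append("components are not in the mandatory order")
    return problems
```

Passing a different required list (for example the axonis-core or developers-environment sets above) would cover the two reduced shapes with the same function.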
Versioning contract
- Version of record is project.version in pyproject.toml.
- Releases are driven by conventional commits on main (feat → minor, fix/perf → patch).
- semantic-release commits chore(release): {version} and tags v{version}; the bot commit is skipped by workflow.yml to avoid loops.
- A chore(release)/bot commit must never itself trigger a release pipeline.
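The bump rules above can be illustrated with a small sketch. This is not how python-semantic-release is implemented; it only mirrors the mapping stated in this section (feat → minor, fix/perf → patch, everything else including chore(release) → no release). Major bumps are deliberately left out because this spec does not define them, and the function names are hypothetical.

```python
import re

# Conventional-commit header shape: type(scope)!: subject
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?: ")

# Mapping taken from the versioning contract above.
BUMP_FOR_TYPE = {"feat": "minor", "fix": "patch", "perf": "patch"}


def bump_for(commit_subjects):
    """Return 'minor', 'patch', or None for a batch of commit subjects."""
    level = None
    for subject in commit_subjects:
        match = HEADER.match(subject)
        if not match:
            continue
        bump = BUMP_FOR_TYPE.get(match.group("type"))
        if bump == "minor":
            return "minor"  # highest level this sketch models
        if bump == "patch":
            level = "patch"
    return level


def next_version(current, bump):
    """Apply a bump level to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

Because chore is absent from the mapping, a batch containing only the bot's chore(release) commit yields no bump, which matches the loop-avoidance rule.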
axonis-core fanout
When axonis-core publishes a release tag, axonis-core-fanout triggers each downstream service's
target_branch pipeline. Downstream qa, package-python, and package-docker jobs detect the
upstream trigger (CI_PIPELINE_SOURCE == pipeline) and rebuild with
uv sync --upgrade-package axonis-core (and a Docker PIPELINE_ID cache-bust), so every service
re-pins to the new axonis-core without a manual bump. New downstream services MUST be added to the
fanout trigger list.
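The upstream-trigger detection reduces to one branch on CI_PIPELINE_SOURCE. The sketch below shows that decision in isolation; the function name is hypothetical, and using uv sync --frozen as the non-fanout default is an assumption borrowed from the Docker standard in Part 3 rather than something the CI components are confirmed to run.

```python
def uv_sync_args(env):
    """Choose uv sync arguments from a CI environment mapping."""
    if env.get("CI_PIPELINE_SOURCE") == "pipeline":
        # Triggered by the axonis-core fanout: re-resolve axonis-core only.
        return ["uv", "sync", "--upgrade-package", "axonis-core"]
    # Assumed default: install exactly what the lock file pins.
    return ["uv", "sync", "--frozen"]
```

Calling uv_sync_args(os.environ) inside a job would pick up the real pipeline source; passing a plain dict keeps the function easy to test.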
Registries & auth (informational)
- Internal PyPI: https://gitlab.axonis.ai/api/v4/projects/405/packages/pypi/simple (axonis-core), auth gitlab-ci-token + CI_JOB_TOKEN.
- Docker: internal ${CI_REGISTRY} for branch/staging; release double-pushes internal + ${DOCKERIO_REGISTRY} (Docker Hub).
- Helm: GitLab Helm registry, devel and stable channels.
- Deploy: ArgoCD via kube context federated-ml-platform/cluster-management:shared-agent; secrets decrypted with SOPS age key.
Part 2 — Helm chart conventions
Chart layout
- A deployable service's chart lives at charts/<service>/ (canonical). xanadu's chart/xanaqu/ is a known deviation pending rename.
- Every chart depends on bitnami common (currently 2.31.4) and consumes its globals: global.imageRegistry, global.imagePullSecrets, global.defaultStorageClass.
- Service-specific values are nested under a single top-level block named for the service (oracle:, cortex:, parallax:), never sprayed at the chart root.
Rule A — reuse the shared base first
A chart MUST consume shared definitions before declaring its own:
- Image registry resolves through the global: .Values.<svc>.image.registry | default .Values.global.imageRegistry.
- Config field names map to the AxonisSettings env-var names from platform.service-configuration. Shared server fields are emitted unprefixed: HOST, LOG_LEVEL, WORKERS, DEBUG.
- Common templates (labels, pod/container security context 1001:1001, /health probes, worker-derived resource requests) come from the shared library helpers, not re-invented per chart.
Gap (P2): there is today no shared Axonis Helm library chart; each service re-declares podSecurityContext, probes, and resource helpers. The golden state is a shared library chart (proposed home: a charts/axonis-common library, or argo-baseline) that every component-service chart imports. Until it exists, charts must at minimum reuse bitnami common globals and the platform.service-configuration env names.
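To make the registry fallback in Rule A concrete, the sketch below mimics the template expression .Values.<svc>.image.registry | default .Values.global.imageRegistry on a plain values dictionary. It is an illustration of the lookup order, not chart code; the function name and the mirror.example.test hostname in the usage note are hypothetical.

```python
def resolve_image_registry(values, service):
    """Return the service registry, falling back to the global one."""
    service_registry = values.get(service, {}).get("image", {}).get("registry")
    global_registry = values.get("global", {}).get("imageRegistry", "")
    # Like Helm's `default`, an empty or missing value falls through.
    return service_registry or global_registry
```

With values of {"global": {"imageRegistry": "registry.axonis.ai"}, "oracle": {"image": {"registry": ""}}}, the oracle chart resolves to the global registry; setting oracle.image.registry to mirror.example.test would override it for that service only.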
Rule B — no <PACKAGE_NAME>_ prefix on keys
Env-var keys rendered into the ConfigMap/Secret MUST NOT be prefixed with the chart's own service name. Concretely:
| Pattern | Verdict | Example |
|---|---|---|
| Shared server field | unprefixed | HOST, LOG_LEVEL, WORKERS, DEBUG |
| This service's port | only sanctioned self-prefix | ORACLE_PORT, CORTEX_PORT, GEODEX_PORT |
| External dependency | prefixed by the dependency, not this service | ELASTIC_HOST, REDIS_PORT, SSO_CLIENT_SECRET |
| Internal subsystem | prefixed by the subsystem, not the service | APOLLO_LLM_PROVIDER, CORTEX_LLM_MODEL |
| Self-prefixed shared field | forbidden | ~~PARALLAX_HOST~~, ~~SENTINEL_LOG_LEVEL~~ |
Rationale: AxonisSettings reads HOST/LOG_LEVEL/WORKERS/DEBUG unprefixed; a
<SERVICE>_-prefixed key would simply never be read. SERVICE_PORT is the single exception because
ports are per-service and the central port registry (platform.service-configuration) keys on <SERVICE>_PORT.
The current charts already conform to Rule B (audited 2026-06-03 — no violations). This spec freezes
that as a hard rule and a package-helm lint target.
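A package-helm lint for Rule B could look roughly like the sketch below. It is a hypothetical illustration, not the actual lint: it flags only self-prefixed shared fields, lets <SERVICE>_PORT and subsystem or dependency prefixes pass, and assumes a service name maps to its prefix by upper-casing and replacing hyphens with underscores.

```python
# Shared server fields that AxonisSettings reads unprefixed (from the table above).
SHARED_FIELDS = {"HOST", "LOG_LEVEL", "WORKERS", "DEBUG"}


def rule_b_violations(service, keys):
    """Return env-var keys that self-prefix a shared field."""
    # Assumption: "fedai-rest" maps to the prefix "FEDAI_REST_".
    prefix = service.upper().replace("-", "_") + "_"
    violations = []
    for key in keys:
        if key.startswith(prefix) and key[len(prefix):] in SHARED_FIELDS:
            violations.append(key)
    return violations
```

For the parallax chart, the keys HOST, PARALLAX_PORT, and REDIS_PORT pass, while PARALLAX_HOST is reported. CORTEX_LLM_MODEL in the cortex chart also passes, because LLM_MODEL is not one of the shared fields.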
Secrets
Charts use the existingSecret + inline secrets: fallback pattern (per
[[feedback_keep_secret_defaults_in_values]] — keep inline defaults; ArgoCD overrides at deploy time).
Do not strip inline secret values from values.yaml.
Known chart inconsistencies to reconcile (P2)
- Port placement differs (service.ports.http vs gateway.containerPorts.http vs server.containerPorts.https); standardize on service.ports.http.
- global.imageRegistry default differs ("" for most vs registry.axonis.ai for xanadu).
- Probe verbosity and LLM-config nesting differ across charts; standardize on the explicit probe form and a single llm: block shape.
- xanadu chart dir chart/xanaqu/ → charts/xanadu/.
Part 3 — Docker image build standard
The standard pattern (golden)
oracle, parallax, sentinel, geodex are 100% conformant and define the reference. Every deployable service image MUST:
- Base off registry.axonis.ai/axonisai/atlasfml-base:py310-v4.0.0 (the current pinned base), parameterized via ARG BASE / IMAGE / TAG.
- Be multi-stage: a builder (alias base) stage that installs deps + strips build toolchain, and a clean final stage that copies /app from the builder.
- Install dependencies with uv against a frozen lock: uv sync --frozen --no-dev --extra service, with an axonis-core fanout fallback (uv sync --upgrade-package axonis-core ...) gated on a PIPELINE_ID cache-bust arg. Private axonis-core is fetched via a transient ~/.netrc carrying PYPI_TOKEN, removed in the same layer.
- Strip the build toolchain in the builder stage (dnf remove gcc/cmake/GDAL-dev, clean caches).
- Run as non-root UID 1001; build steps that need root switch USER 0 then back to 1001.
- WORKDIR /app; ENV PATH=/app/.venv/bin:$PATH, PYTHONPATH=/app, PYTHONUNBUFFERED=1.
- Ship entrypoint.sh (TLS CA-bundle trust setup) as the ENTRYPOINT; CMD launches python -m server.
- EXPOSE <SERVICE_PORT> and define a HEALTHCHECK against /health (enabled, not commented).
- Ship a .dockerignore (exclude .venv, .git, tests/, *.md, charts/, examples).
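Several of the requirements above can be verified by reading the Dockerfile text. The sketch below is a rough, hypothetical audit helper covering only a subset (multi-stage, final USER 1001, WORKDIR /app, an enabled HEALTHCHECK, an ENTRYPOINT). It does a naive line scan and does not parse line continuations or ARG substitution.

```python
def dockerfile_problems(text: str) -> list[str]:
    """Report deviations from a subset of the standard pattern."""
    # Keep instruction lines only; comments (including a commented-out
    # HEALTHCHECK) are dropped here.
    instructions = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            instructions.append(line.split())

    def args_of(name):
        return [tokens[1:] for tokens in instructions if tokens[0].upper() == name]

    problems = []
    if len(args_of("FROM")) < 2:
        problems.append("not multi-stage")
    users = args_of("USER")
    if not users or users[-1] != ["1001"]:
        problems.append("final USER is not 1001")
    if ["/app"] not in args_of("WORKDIR"):
        problems.append("WORKDIR /app missing")
    if not args_of("HEALTHCHECK"):
        problems.append("HEALTHCHECK missing or commented out")
    if not args_of("ENTRYPOINT"):
        problems.append("ENTRYPOINT missing")
    return problems
```

Run against a service's Dockerfile contents, an empty result means none of the five checks failed; it says nothing about the remaining requirements, such as the base image tag or the .dockerignore.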
Image tags are set by package-docker: branch slug (MR), branch slug (staging), CI_COMMIT_TAG
(release). The Dockerfile does not embed a version.
Sanctioned deviations (must stay documented here)
| Service | Deviation | Reason |
|---|---|---|
| titan | base atlas-fl-athena:staging; skips toolchain cleanup; /fedai workdir; ports 5000/9000; CPU-vs-GPU torch split | GPU/ML serving lineage. titan is the only service that pulls GPU torch ([[project_training_lives_in_titan]]); others use the pytorch-cpu index. |
| xanadu | base federated-ml-platform/atlasfml-base:py310-v3.0.9; uv pip install wheel build; /fedai workdir; no healthcheck | Federation infra, wheel-distribution model. Should converge to v4.0.0 base + /app + healthcheck. |
| parallax examples/ | python:3.10-slim, pip, single-stage | Local dev/test only; .dockerignored out of the production image. Never a production target. |
| spec-docs (developers-environment) | base python:3.12-slim both stages; no atlasfml-base | Docs site + corpus chat service; final stage is the spec-docs Python service (uvicorn) serving statics + chat, /health probe per chart. |
Conformance gaps to close (P2)
- fedai-rest, cortex: enable their (currently commented-out) HEALTHCHECK.
- sentinel, xanadu, titan: add a .dockerignore.
- xanadu: migrate to atlasfml-base:py310-v4.0.0, /app workdir, uv sync --frozen, healthcheck.
- Base-image version drift: pin all non-GPU services to the same atlasfml-base tag.
Conformance summary (2026-06-03 baseline)
| Concern | Conformant today | Gap |
|---|---|---|
| Pipeline via ci-components | all 9 | keep fanout list current as services are added |
| Helm Rule B (no self-prefix) | all charts | freeze as package-helm lint |
| Helm Rule A (shared-first) | partial | no shared library chart yet |
| Docker standard | oracle, parallax, sentinel, geodex (100%) | fedai-rest/cortex healthcheck; xanadu base+layout; titan documented exception |
Acceptance criteria
1. Each deployable component service's .gitlab-ci.yml includes the mandatory component set (#cicd.pipeline-shape) at a pinned tag, with no hand-authored duplicates of catalog jobs.
2. Each chart conforms to Rule A and Rule B; package-helm fails on a <SERVICE>_-prefixed shared key.
3. Each deployable image follows the #docker standard or is one of the three named, documented deviations.
4. A shared Axonis Helm library chart exists and is consumed by every component-service chart (closes the Rule A gap).
5. New component services satisfy 1–3 on their first MR.
Depends on: platform.axonis-core, platform.ingress-routing, platform.observability, platform.service-configuration, platform.service-contract