
TrainedModel

Definition

A TrainedModel is the trained artifact plus the training-run record produced by training a Model on a Dataset. It realizes a Model: on create it reads the referenced Model and copies its library / framework / type, then (for federated models) deploys a training job to Kubernetes/titan and tracks it to completion. The clean distinction is Model = config/architecture; TrainedModel = the run plus the resulting weights (trained_model.py:48, where it reads Model().read(uid=source['model'])). The TrainedModel is the object that carries live training status, versioned weights in git, and the federated job handle.

Defined in axonis-core/axonis/ml_userspace/trained_model.py:26 (class TrainedModel(UDS), alias Schema.TRAINED_MODEL = "trainedmodel"). Persistence: the Elasticsearch userspace index (metadata/status), plus Gitea (weights/versions) and Redis (live status pub/sub).

Lifecycle

Status codes are defined in Schema.MODEL_STATUS (schema.py:139): 0 RETRIEVING DATASET → 1 TRAINING → 2 ANALYZING PERFORMANCE → 3 STORING MODEL → 4 TRAINING COMPLETE, plus 5 ERROR, 6 QUEUED, 7 STOPPED, 8 OPTIMIZE (mirrored as the TrainingStatus enum). On create, initialize() deploys the job and sets QUEUED (trained_model.py:133); titan's training entrypoint runs the job and calls update_train_status(uid, 4, …) on success or update_train_status(uid, 5, …) on failure (training.py:110). update(state='start'|'stop') toggles the k8s pod via initialize / finalize: finalize stamps training_endtime and tears down the pod, and delete removes the k8s deployment. Each status update republishes to the Redis channel trainedmodels:{uid} and recomputes model_versions from the git branches.
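The status codes above can be written down as an integer enum. This is a minimal sketch, not the actual TrainingStatus definition: the member names are assumed from the labels in the list, and status_channel and reported_by_entrypoint are hypothetical helpers added only for illustration.

```python
from enum import IntEnum


class TrainingStatus(IntEnum):
    """Integer codes as listed for Schema.MODEL_STATUS (member names are assumed)."""

    RETRIEVING_DATASET = 0
    TRAINING = 1
    ANALYZING_PERFORMANCE = 2
    STORING_MODEL = 3
    TRAINING_COMPLETE = 4
    ERROR = 5
    QUEUED = 6
    STOPPED = 7
    OPTIMIZE = 8


def status_channel(uid: str) -> str:
    """Name of the Redis channel that status updates are republished to."""
    return f"trainedmodels:{uid}"


def reported_by_entrypoint(status: TrainingStatus) -> bool:
    """True for the two codes the titan entrypoint reports: success (4) or failure (5)."""
    return status in (TrainingStatus.TRAINING_COMPLETE, TrainingStatus.ERROR)
```

A subscriber would listen on status_channel(uid) and can map the integer it receives back to a label with TrainingStatus(code).name.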

Journey through the code

REST create target=trainedmodel → rest userspace → generic UDS create in the REST process (trainedmodel is also not in the rest USERSPACE map) → federation propagates → on the federate, TrainedModel.create runs: it resolves the Model metadata, writes the ES record, then if send_to_federation(): initialize() deploys via axonis.core.deploy.training.Training(...).deploy(entrypoint='titan.modeling.training') → the titan pod titan/titan/modeling/training.py:49 entrypoint_training, which downloads the model and checkpoint and branches on framework:

  • federated: federated_model_runner (coded) / federated_model_runner_v2.nocode_start (nocode) builds a FATE PipeLine (Reader → DataTransform → [Intersection ecdh, if vertical] → operations → DataSplit → model component → Evaluation), compile()s and fit()s it, submitting to FATE-Flow via flow_sdk.client.FlowClient (federated_model_runner.py:367). The federated_job_id is stored on the object; stop calls FlowClient.job.stop.
  • simple/advanced: simple_model_runner / advanced_model_runner (pytorch/tf/xgboost).
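The branch on framework described above can be pictured as a lookup table from framework name to runner. The sketch below is illustrative only: the three run_* functions are hypothetical stand-ins for the real runner modules, and the literal keys 'federated', 'simple' and 'advanced' are assumed from the bullet labels rather than confirmed against training.py.

```python
from typing import Callable, Dict


def run_federated(trained_model: dict) -> str:
    # Stand-in for the FATE pipeline path (coded or nocode).
    return "federated"


def run_simple(trained_model: dict) -> str:
    # Stand-in for simple_model_runner.
    return "simple"


def run_advanced(trained_model: dict) -> str:
    # Stand-in for advanced_model_runner.
    return "advanced"


# Hypothetical dispatch table; the real entrypoint may use if/elif instead.
RUNNERS: Dict[str, Callable[[dict], str]] = {
    "federated": run_federated,
    "simple": run_simple,
    "advanced": run_advanced,
}


def dispatch(trained_model: dict) -> str:
    """Pick a runner from the record's framework field and invoke it."""
    framework = trained_model.get("framework")
    try:
        runner = RUNNERS[framework]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework!r}") from None
    return runner(trained_model)
```

Failing loudly on an unknown framework is a design choice of this sketch; how the real entrypoint handles that case is not stated here.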

Artifacts and versions are stored in Gitea via axonis.core.storage.Storage (per-uid repo, branch = version); checkpoints are base64-stored on the object. The consume path is titan's userspace/predictor.py:48, which reads the TrainedModel, its source Model, and the Dataset for inference.

Data shape

Seeded on create: model (FK → Model uid), parameters (with prediction_threshold defaulted to 0.5), datasets (e.g. {training: <dataset_uid>}), version (default main), modeldir / exportdir ('saved_model'), checkpoint_exists = 0, graph_version, serving = []. Copied from the Model: library, framework, type. Runtime: status, status_message, training_starttime / training_endtime / training_duration, model_versions, federated_job_id, transform (e.g. PCA), checkpoint blobs, nocode. Storage: ES userspace index (metadata/status) + Gitea (weights/versions) + Redis (status pub/sub on trainedmodels:{uid}).
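To make the seeded fields concrete, here is a minimal sketch of how a create payload might be filled in. seed_trained_model is a hypothetical helper, not a function in trained_model.py; it assumes that caller-supplied values win over defaults and that modeldir and exportdir both default to 'saved_model'. graph_version is left out because its default is not stated above.

```python
import copy


def seed_trained_model(source: dict, model_meta: dict) -> dict:
    """Return a copy of the create payload with the documented defaults applied."""
    record = copy.deepcopy(source)

    # parameters always exists; prediction_threshold defaults to 0.5.
    params = record.get("parameters") or {}
    params.setdefault("prediction_threshold", 0.5)
    record["parameters"] = params

    record.setdefault("version", "main")
    record.setdefault("modeldir", "saved_model")   # assumed default
    record.setdefault("exportdir", "saved_model")  # assumed default
    record.setdefault("checkpoint_exists", 0)
    record.setdefault("serving", [])

    # library / framework / type are copied from the referenced Model.
    for field in ("library", "framework", "type"):
        record[field] = model_meta[field]
    return record
```

The input payload is deep-copied so that the caller's dictionary is not mutated; whether the real create path mutates its source in place is not covered here.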

Invariants

  • parameters always exists (indexed unconditionally, trained_model.py:32).
  • A non-nocode, non-pretrained TrainedModel references a resolvable Model (trained_model.py:47).
  • library = 'federated' is rejected on an EdgeNode (trained_model.py:135).
  • training_endtime is set once, never overwritten (trained_model.py:166); every status update republishes to trainedmodels:{uid} and recomputes model_versions.
  • product.model — the TrainedModel realizes a Model (source['model']), copying its library/framework/type.
  • product.dataset — trained on a Dataset (datasets.training); a Dataset create/update also spawns a system quality-estimator training run.
  • product.fusion — the federated training path submits to FATE-Flow; fusion scoring can reference titan-trained ranker/LLM models by registry key.
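Three of the invariants above (parameters always present, no federated library on an EdgeNode, training_endtime written once) can be expressed as small guards. This is a sketch under stated assumptions: check_deployable, stamp_endtime and InvariantError are hypothetical names, the is_edge_node flag stands in for however the real code detects an EdgeNode, and the ISO-8601 timestamp format is assumed.

```python
from datetime import datetime, timezone
from typing import Optional


class InvariantError(ValueError):
    """Raised when a record violates one of the invariants sketched here."""


def check_deployable(record: dict, is_edge_node: bool) -> None:
    """Raise InvariantError if the record must not be deployed for training."""
    if "parameters" not in record:
        raise InvariantError("parameters must always exist")
    if record.get("library") == "federated" and is_edge_node:
        raise InvariantError("library='federated' is rejected on an EdgeNode")


def stamp_endtime(record: dict, now: Optional[datetime] = None) -> dict:
    """Set training_endtime only if it has not been set before."""
    if not record.get("training_endtime"):
        moment = now or datetime.now(timezone.utc)
        record["training_endtime"] = moment.isoformat()  # format is assumed
    return record
```

Calling stamp_endtime a second time leaves the first timestamp untouched, which is the set-once behavior the invariant describes.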

Open questions

  • Source-of-truth location — the real TrainedModel class ships in the published axonis-core wheel, not a tracked local branch.
  • REST vs federate split: trainedmodel is absent from the rest USERSPACE map, so the rich create/train logic executes only after federation propagation lands on a federate.
  • MODEL_STATUS vs ASSET_STATUS — a parallel Schema.ASSET trained-artifact concept exists in the same userspace index; its relationship to TrainedModel is unconfirmed.
  • FATE result store-back — submission and status updates are confirmed, but where the trained federated artifact is pulled back from FATE and committed to Gitea was not fully traced.

Realized by: component.titan.runtime