Dataset

Definition

A Dataset is a user-defined, federation-aware bundle that describes how to assemble training or analysis data: a set of Elasticsearch queries, an ordered list of feature-engineering operations, and a data_model (the features and labels with their inferred dtypes). It is a recipe, not a static table — on demand it is materialized into a Dask DataFrame cached in Redis by running each query through the DataSpace and each operation through the feature Engineer. Because the platform is federated, a Dataset also records which federate contributes each label, and infers whether the split is horizontal (same columns, different rows per party) or vertical (different columns per party).

Defined at rest/uds/userspace/dataset.py:35 (class Dataset(UDS), alias Schema.DATASET). Wire schema: rest/server/api/schema/objects.yml:168 (DataSet), with a codegen mirror at rest/server/api/models/objects.py:1421. Persistence: the Elasticsearch userspace index (metadata), discriminated by subtype = "dataset", plus Redis (the materialized dataframe). Every UDS write is propagated to peer federates via the gateway on a fire-and-forget basis.

Lifecycle

There is no explicit status enum; state is implicit and progresses as follows:

  • defined: created from queries + operations.
  • materialized: queries and operations executed, dataframe cached in Redis via refresh() (dataset.py:290).
  • analyzed: an async quality estimator job populates quality_metrics and recommended_operations.
  • optional merged / snapshot variants.
  • deleted.

On create, and on any update that changes the recipe, cortex auto-deploys a quality-analysis ("estimator") training job to titan (dataset.py:80, entrypoint titan.system.estimators.estimator). Re-analysis triggers only when queries, query, data_model, operations, or dataset change (ESTIMATOR_TRIGGER_FIELDS); the estimator's own output fields are excluded so that its writes do not retrigger it.
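
The re-analysis rule can be illustrated with a minimal sketch. The field names come from the paragraph above; the function name needs_reanalysis and the dict-based record shape are assumptions made for illustration, not the code at dataset.py:80.

```python
# Hypothetical sketch of the estimator trigger rule described above.
# The set contents are taken from the text; everything else is assumed.
ESTIMATOR_TRIGGER_FIELDS = frozenset(
    {"queries", "query", "data_model", "operations", "dataset"}
)


def needs_reanalysis(stored: dict, updated: dict) -> bool:
    """Return True if any recipe field differs between the two records.

    Estimator output fields (quality_metrics, recommended_operations, ...)
    are not in the trigger set, so writing them back does not retrigger.
    """
    return any(
        stored.get(field) != updated.get(field)
        for field in ESTIMATOR_TRIGGER_FIELDS
    )


stored = {"queries": [{"index": "customers"}], "quality_metrics": None}

# The estimator writing its own results back: no new job.
estimator_write = {**stored, "quality_metrics": {"completeness": 0.97}}
assert needs_reanalysis(stored, estimator_write) is False

# A user adding an operation changes the recipe: a new job is warranted.
recipe_change = {**stored, "operations": [{"type": "normalize"}]}
assert needs_reanalysis(stored, recipe_change) is True
```

Comparing only an allow-list of fields, rather than diffing the whole record, is what keeps the estimator's own writes from looping.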

Journey through the code

REST create/update target=dataset → rest userspace (userspace.py) → server/mcp/commands.py maps to uds.userspace.dataset.Dataset → ES write + Redis materialize → estimator job to titan. The refresh() path runs DataSpace().query per query and Engineer().create per operation to build the dataframe. The load/consume path is DatasetObject (rest/uds/userspace/dataset_object.py:26), which reads the ES record and yields a Dask DataFrame (load, load_limited, get_dataframe). Training consumes it: titan's federated_model_runner.nocode_start calls Dataset().read(dataset_id) (federated_model_runner.py:357).

Data shape

Key fields (objects.yml:168): name, description, dataset (id of the source/input dataset in Redis), queries[] (a legacy single query is auto-promoted to queries[0]), operations[], query_size, avg_record_size_in_bytes (default 5000), split_type (horizontal | vertical, inferred).

data_model: features[], labels[], feature_types[] / label_types[] (inferred Dask dtypes), label_federates (label → contributing platform ids).

Computed / provenance: merge_metadata (ds_id_a, ds_id_b, how, join keys), result_preview, quality_metrics, recommended_operations, quality_refs, is_snapshot, username.

Storage: ES userspace index (metadata) + Redis (materialized dataframe keyed by the dataset id).
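
As an illustration of the record shape, here is a small sketch of the legacy query promotion and the avg_record_size_in_bytes default. The function normalize_recipe, the example values, and the shape of the query dict are hypothetical. Whether the real code drops the legacy query key after promotion is not stated in the source, so the sketch assumes it does.

```python
from copy import deepcopy

DEFAULT_AVG_RECORD_SIZE = 5000  # default stated in the field list above


def normalize_recipe(record: dict) -> dict:
    """Return a copy of `record` with legacy fields normalized.

    Assumptions (not confirmed by the source): an existing non-empty
    queries[] wins over a legacy `query`, and the legacy key is dropped.
    """
    out = deepcopy(record)
    legacy = out.pop("query", None)
    if not out.get("queries"):
        # Promote the legacy single query to queries[0], if there is one.
        out["queries"] = [legacy] if legacy is not None else []
    out.setdefault("operations", [])
    out.setdefault("avg_record_size_in_bytes", DEFAULT_AVG_RECORD_SIZE)
    return out


# Hypothetical record; the query body is an arbitrary placeholder.
example = normalize_recipe(
    {
        "name": "churn-train",
        "query": {"index": "customers", "body": {"match_all": {}}},
        "data_model": {"features": ["age", "tenure"], "labels": ["churned"]},
    }
)
assert example["queries"][0]["index"] == "customers"
assert "query" not in example
assert example["avg_record_size_in_bytes"] == 5000
```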

Invariants

  • Every feature/label in data_model must exist as a column in the materialized data, else DataException (dataset.py:410).
  • A merged Dataset cannot be merged again (dataset.py:269).
  • split_type = vertical exactly when at least one feature/label column is filled by a single federate (dataset.py:380).
  • Delete cascades to its DatasetMetrics (dataset.py:42).
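
The first three invariants above can be written as plain checks. This is a sketch under stated assumptions, not the code in dataset.py. DataException here is a local stand-in for the platform's exception. The column_federates mapping (column name to contributing platform ids) generalizes label_federates to features for illustration. Detecting a merged Dataset through a non-empty merge_metadata is also an assumption.

```python
class DataException(Exception):
    """Local stand-in for the platform's DataException."""


def check_data_model(data_model: dict, columns) -> None:
    """Raise if a declared feature or label is not a materialized column."""
    declared = list(data_model.get("features", [])) + list(
        data_model.get("labels", [])
    )
    available = set(columns)
    missing = [name for name in declared if name not in available]
    if missing:
        raise DataException(f"columns missing from materialized data: {missing}")


def infer_split_type(column_federates: dict) -> str:
    """Apply the rule literally as stated above: vertical when at least one
    column is filled by exactly one federate, horizontal otherwise."""
    single_owner = any(
        len(set(federates)) == 1 for federates in column_federates.values()
    )
    return "vertical" if single_owner else "horizontal"


def check_mergeable(record: dict) -> None:
    """Reject merging a Dataset that is itself the result of a merge
    (assumed here to be signalled by a non-empty merge_metadata)."""
    if record.get("merge_metadata"):
        raise DataException("a merged Dataset cannot be merged again")


# Both parties fill every column: same columns, different rows.
assert infer_split_type({"age": ["p1", "p2"], "churned": ["p1", "p2"]}) == "horizontal"
# Only p1 contributes the label: different columns per party.
assert infer_split_type({"age": ["p1", "p2"], "churned": ["p1"]}) == "vertical"
```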

Relationships

  • product.trained-model: a TrainedModel is trained on a Dataset (datasets.training); the Dataset supplies the features/labels and the federated split that drive the training pipeline.
  • product.model: the Model is the architecture a TrainedModel realizes; the Dataset is its data.
  • product.data-pipeline: ingest pipelines land the indices a Dataset's queries read.
  • product.fusion: federated entity resolution operates over the same per-party data substrate.

Open questions

  • Source-of-truth location. The rich Dataset/Model/TrainedModel classes ship in the published axonis-core wheel and are imported at runtime; they are not in any tracked branch of the local axonis-core checkout. The canonical source path needs pinning.
  • No explicit status field — should the spec formalize a Dataset state enum, or keep it inferred from quality_metrics / merge_metadata / is_snapshot?