Dataset
Definition
A Dataset is a user-defined, federation-aware bundle that describes how to assemble training or analysis data: a set
of Elasticsearch queries, an ordered list of feature-engineering operations, and a data_model (the features and labels
with their inferred dtypes). It is a recipe, not a static table — on demand it is materialized into a Dask DataFrame
cached in Redis by running each query through the DataSpace and each operation through the feature Engineer. Because the
platform is federated, a Dataset also records which federate contributes each label, and infers whether the split is
horizontal (same columns, different rows per party) or vertical (different columns per party).
Defined in rest/uds/userspace/dataset.py:35 (class Dataset(UDS), alias Schema.DATASET). Wire schema:
rest/server/api/schema/objects.yml:168 (DataSet), codegen mirror rest/server/api/models/objects.py:1421.
Persistence: the Elasticsearch userspace index (metadata), discriminated by subtype = "dataset", plus Redis (the
materialized dataframe). Every UDS write is propagated to peer federates via the gateway, fire-and-forget.
Lifecycle
There is no explicit status enum; state is implicit. defined (created from queries + operations) → materialized
(queries and ops executed, dataframe cached in Redis via refresh(), dataset.py:290) → analyzed (an async quality
estimator job populates quality_metrics and recommended_operations) → optional merged / snapshot variants →
deleted. On create — and on any update that changes the recipe — cortex auto-deploys a quality-analysis ("estimator")
training job to titan (dataset.py:80, entrypoint titan.system.estimators.estimator). Re-analysis is triggered only when
queries, query, data_model, operations, or dataset change (ESTIMATOR_TRIGGER_FIELDS); the estimator's own
output fields are excluded so that its write-back does not retrigger it.
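The re-analysis rule can be sketched as a field-level diff between the stored record and an update. This is a minimal illustration, not the code in dataset.py: the field names come from the list above, while needs_reanalysis, the plain-dict record shape, and the sample values are hypothetical.

```python
# Field names are taken from the text; everything else here is an assumption.
ESTIMATOR_TRIGGER_FIELDS = frozenset(
    {"queries", "query", "data_model", "operations", "dataset"}
)


def needs_reanalysis(stored: dict, update: dict) -> bool:
    """Return True if the update changes any recipe field.

    Only keys present in `update` are compared, so a partial update that
    omits a trigger field is not treated as clearing it (an assumption).
    """
    return any(
        field in update and update[field] != stored.get(field)
        for field in ESTIMATOR_TRIGGER_FIELDS
    )


stored = {"name": "churn", "operations": [], "quality_metrics": None}

# A rename, or the estimator writing back its own output, touches no trigger field.
print(needs_reanalysis(stored, {"name": "churn-v2"}))                # False
print(needs_reanalysis(stored, {"quality_metrics": {"score": 0.9}}))  # False
# Editing the recipe does.
print(needs_reanalysis(stored, {"operations": [{"type": "normalize"}]}))  # True
```

Keeping the estimator's output fields out of the trigger set is what breaks the cycle: its write-back is an update like any other, but it never matches a trigger field.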
Journey through the code
REST create/update target=dataset → rest userspace (userspace.py) → server/mcp/commands.py maps to
uds.userspace.dataset.Dataset → ES write + Redis materialize → estimator job to titan. The refresh() path runs
DataSpace().query per query and Engineer().create per operation to build the dataframe. The load/consume path is
DatasetObject (rest/uds/userspace/dataset_object.py:26), which reads the ES record and yields a Dask DataFrame
(load, load_limited, get_dataframe). Training consumes it: titan's federated_model_runner.nocode_start calls
Dataset().read(dataset_id) (federated_model_runner.py:357).
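The shape of the refresh() path (every query first, then the operations in order) can be sketched with injected callables. This is an assumption-laden outline rather than the real implementation: refresh_sketch, run_query, and apply_operation are hypothetical stand-ins for DataSpace().query and Engineer().create, rows are plain dicts instead of a Dask DataFrame, and the row-wise concatenation of query results is a guess.

```python
from typing import Any, Callable, Iterable

Row = dict[str, Any]


def refresh_sketch(
    queries: Iterable[Any],
    operations: Iterable[Any],
    run_query: Callable[[Any], list[Row]],
    apply_operation: Callable[[list[Row], Any], list[Row]],
) -> list[Row]:
    """Run every query, then apply the operations in their listed order."""
    rows: list[Row] = []
    for query in queries:
        # Assumption: per-query results are concatenated row-wise.
        rows.extend(run_query(query))
    for operation in operations:
        # Operations are ordered, so each one sees the previous one's output.
        rows = apply_operation(rows, operation)
    return rows


# Toy stand-ins so the sketch runs without Elasticsearch, Dask, or Redis.
fake_index = {"q1": [{"age": 30}], "q2": [{"age": 41}]}


def add_flag(rows: list[Row], column: str) -> list[Row]:
    """Hypothetical operation: add a boolean column named `column`."""
    return [{**row, column: True} for row in rows]


frame = refresh_sketch(["q1", "q2"], ["seen"], fake_index.__getitem__, add_flag)
print(frame)  # [{'age': 30, 'seen': True}, {'age': 41, 'seen': True}]
```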
Data shape
Key fields (objects.yml:168): name, description, dataset (id of the source/input dataset in Redis), queries[]
(a legacy single query is auto-promoted to queries[0]), operations[], query_size,
avg_record_size_in_bytes (default 5000), split_type (horizontal | vertical, inferred). data_model: features[],
labels[], feature_types[] / label_types[] (inferred Dask dtypes), label_federates (label → contributing
platform ids). Computed / provenance: merge_metadata (ds_id_a, ds_id_b, how, join keys), result_preview,
quality_metrics, recommended_operations, quality_refs, is_snapshot, username. Storage: ES userspace index
(metadata) + Redis (materialized dataframe keyed by the dataset id).
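The legacy query promotion can be illustrated on a plain dict. This is a sketch under stated assumptions: promote_legacy_query is a hypothetical helper, the query payload is invented for the example, and the source does not say what happens when both query and queries are set, so the sketch promotes only when queries is empty or missing and leaves the original query key in place.

```python
import copy


def promote_legacy_query(record: dict) -> dict:
    """Return a copy of `record` whose `queries` list includes a legacy `query`.

    Assumptions: the legacy value is promoted only when `queries` is empty or
    missing, and the original `query` key is left untouched.
    """
    normalized = copy.deepcopy(record)
    queries = list(normalized.get("queries") or [])
    legacy = normalized.get("query")
    if legacy is not None and not queries:
        queries = [legacy]
    normalized["queries"] = queries
    return normalized


legacy_record = {
    "name": "example",
    "query": {"index": "events"},  # hypothetical query payload
    "avg_record_size_in_bytes": 5000,  # documented default
}
print(promote_legacy_query(legacy_record)["queries"])  # [{'index': 'events'}]
```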
Invariants
- Every feature/label in data_model must exist as a column in the materialized data, else DataException (dataset.py:410).
- A merged Dataset cannot be merged again (dataset.py:269).
- split_type = vertical exactly when at least one feature/label column is filled by a single federate (dataset.py:380).
- Delete cascades to its DatasetMetrics (dataset.py:42).
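The column-existence and split_type invariants can be sketched as two small checks. These are illustrative only: DataException here is a local stand-in for the platform's exception, check_data_model and infer_split_type are hypothetical names, and the column-to-federates mapping is an assumed input shape (the text documents label_federates for labels only). The split rule is implemented literally as stated, so a deployment with a single federate would come out as vertical; the real code may treat that case differently.

```python
class DataException(Exception):
    """Local stand-in for the platform's DataException."""


def check_data_model(columns, data_model):
    """Raise DataException if a declared feature or label is not a column."""
    available = set(columns)
    declared = list(data_model.get("features", [])) + list(data_model.get("labels", []))
    missing = [name for name in declared if name not in available]
    if missing:
        raise DataException(f"data_model columns missing from data: {missing}")


def infer_split_type(column_federates):
    """column_federates maps a column name to the set of federate ids filling it.

    Literal reading of the invariant: vertical as soon as any column is
    filled by exactly one federate, horizontal otherwise.
    """
    single_source = any(len(feds) == 1 for feds in column_federates.values())
    return "vertical" if single_source else "horizontal"


data_model = {"features": ["age", "income"], "labels": ["churned"]}
check_data_model(["age", "income", "churned"], data_model)  # passes silently

shared = {"age": {"fed-a", "fed-b"}, "churned": {"fed-a", "fed-b"}}
partial = {"age": {"fed-a", "fed-b"}, "churned": {"fed-b"}}
print(infer_split_type(shared))   # horizontal
print(infer_split_type(partial))  # vertical
```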
Related products
- product.trained-model: a TrainedModel is trained on a Dataset (datasets.training); the Dataset supplies the features/labels and the federated split that drive the training pipeline.
- product.model: the Model is the architecture a TrainedModel realizes; the Dataset is its data.
- product.data-pipeline: ingest pipelines land the indices a Dataset's queries read.
- product.fusion: federated entity resolution operates over the same per-party data substrate.
Open questions
- Source-of-truth location. The rich Dataset / Model / TrainedModel classes ship in the published axonis-core wheel and are imported at runtime; they are not in any tracked branch of the local axonis-core checkout. The canonical source path needs pinning.
- No explicit status field. Should the spec formalize a Dataset state enum, or keep it inferred from quality_metrics / merge_metadata / is_snapshot?