Sentinel — Alerting and Monitoring Service

Status: Implemented. The sentinel/ repo is active with a FastAPI server, MCP mount, and alerting domain objects (AlertEvent, Notification, Sensor, Subscriber) migrated from rest's userspace.
Package: sentinel (Python pkg server/)
Depends on: platform.axonis-core, platform.service-contract
Milestone: P2 (after axonis-core is published)

Repo-local spec. Cross-cutting service mechanics (auth, /health + /service-info, Helm chart, CI/CD, uv/hatchling packaging, ruff) are not covered here — they follow platform.service-contract and its Cross-Cutting Requirements. This spec covers what makes the alerting service unique.

Who & why

Trigger: As an operator monitoring federated sites, I want a dedicated alerting service that ingests alert events from any source, evaluates thresholds, and routes notifications, so that the alert lifecycle (acknowledge → resolve → escalate) is managed in one place with domain-specific tools instead of generic CRUD.

Current pain. Alerting objects (AlertEvent, Notification, Sensor, Subscriber) lived in the alerts Elasticsearch index and were reached only through fedai-rest's generic userspace CRUD. There was no alert-specific workflow (no acknowledge/resolve/escalate), no threshold evaluation, and no uniform way to record where an alert came from — so cross-source alert handling and routing had to be reimplemented per caller.

Job to be done. Own the full alert lifecycle across heterogeneous sources (sensors, rules engines, ML models, humans, external systems) under one source-identity model, with first-class workflow tools and notification routing.

Out of scope. Raw data ingestion (Conduit, component.conduit.service); device writes back to sensors (Relay, platform.relay); and all cross-cutting service scaffolding (auth middleware, health/service-info contract, Helm, CI/CD, packaging) which platform.service-contract already mandates for every service.

Purpose

Sentinel is the alerting and monitoring microservice. It owns all alert lifecycle management: event ingestion, threshold evaluation, notification dispatch, sensor management, and subscriber routing.

Previously, alerting objects (AlertEvent, Notification, Sensor, Subscriber) were stored in the alerts Elasticsearch index and accessed via fedai-rest's generic userspace CRUD. Sentinel extracts these into a dedicated service with domain-specific tools for alert management workflows.

Scope: generic alerting, sensors as one source type

Sentinel is a generic alerting service. Alerts may originate from many kinds of sources, of which sensors are one common case. Other valid sources include:

  • Rules engines (transaction-monitoring rule trips)
  • ML models emitting an anomaly score
  • Scheduled audit jobs finding violations
  • Humans manually escalating events
  • External systems posting alerts via the API

Every AlertEvent carries a source identity consisting of:

  • source_type — one of sensor | rule | model | human | external
  • source_uid — the UID of the specific origin within that type (e.g. the sensor's UID when source_type=sensor)

Source-specific resources (e.g. alerts://sensors/{sensor_id}/history) filter alerts where source_type matches the resource scheme and source_uid matches the parameter. Other source types may add their own resources without changing the core alert model.

Sensor objects remain the registry of configured sensor sources (their thresholds, sites, types). Non-sensor sources are not registered in the Sensor index — they identify themselves only at alert-creation time via source_type + source_uid.
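
The source-identity model above can be sketched as a small value object plus the filter a source-specific resource would apply. This is a minimal illustration, not Sentinel's actual code: SourceIdentity and history_for are hypothetical names, and alerts are assumed to be plain dicts.

```python
from dataclasses import dataclass

# The five source kinds named in this spec.
SOURCE_TYPES = frozenset({"sensor", "rule", "model", "human", "external"})


@dataclass(frozen=True)
class SourceIdentity:
    """Hypothetical value object pairing source_type with source_uid."""
    source_type: str
    source_uid: str

    def __post_init__(self):
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source_type: {self.source_type!r}")
        if not self.source_uid:
            raise ValueError("source_uid must be non-empty")


def history_for(alerts, source_type, source_uid):
    """Filter alert dicts the way a source-specific resource would."""
    return [
        a for a in alerts
        if a.get("source_type") == source_type and a.get("source_uid") == source_uid
    ]
```

Under this sketch, alerts://sensors/{sensor_id}/history would correspond to history_for(alerts, "sensor", sensor_id); a new source type would only need its own resource calling the same filter.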

Domain Objects (from axonis-core userspace)

| Object | Schema constant | ES index | Purpose |
|---|---|---|---|
| AlertEvent | Schema.ALERT_EVENT | alerts | Triggered alert with severity, site, status; identifies its origin via source_type + source_uid |
| Notification | Schema.NOTIFICATION | alerts | Dispatched notification record |
| Sensor | Schema.SENSOR | alerts | One kind of alert source: registers a configured sensor (threshold, type, site). Non-sensor sources self-identify at alert-create time and are not registered here. |
| Subscriber | Schema.SUBSCRIBER | alerts | Alert routing target (user, channel, webhook) |
| AlertSubscriber | Schema.ALERT_SUBSCRIBER | alerts | Subscription binding (source → subscriber) |
| AlertThreshold | Schema.THRESHOLD | alerts | Configurable threshold definition |

Code Structure

server/
  __init__.py
  __main__.py                  # Starlette: /agentspace, /api/v1, /health, /service-info
  api/
    __init__.py
    routes.py                  # FastAPI REST endpoints
    schema/
      alerting_objects.yml     # OpenAPI component schemas
  mcp/
    __init__.py
    server.py                  # FastMCP tools + resources
    commands.py                # Command layer (shared by REST + MCP)

Port and API Version

Port: 8005
API version: /api/v1

MCP Tools (18)

CRUD (12)

| Tool | Purpose |
|---|---|
| alert_list(summary, status, severity, site_id) | List alerts with optional filters |
| alert_get(uid) | Get alert by UID |
| alert_create(body) | Create alert event |
| alert_update(uid, body) | Update alert (e.g., change status) |
| sensor_list(summary, sensor_type) | List sensors with optional type filter |
| sensor_get(uid) | Get sensor by UID |
| sensor_create(body) | Create sensor definition |
| sensor_update(uid, body) | Update sensor |
| subscriber_list(summary) | List subscribers |
| subscriber_get(uid) | Get subscriber by UID |
| subscriber_create(body) | Create subscriber |
| notification_list(summary, alert_id) | List notifications, optionally by alert |

Workflow (4)

| Tool | Purpose |
|---|---|
| alert_acknowledge(uid, acknowledged_by, note) | Acknowledge an alert; sets status to ACKNOWLEDGED |
| alert_resolve(uid, resolved_by, resolution) | Resolve an alert; sets status to RESOLVED |
| alert_escalate(uid, escalate_to, reason) | Escalate an alert; creates a notification to the escalation target |
| alert_evaluate(sensor_id, value) | Evaluate a value against a sensor's threshold; returns whether an alert should trigger |

Introspection (2)

| Tool | Purpose |
|---|---|
| alert_summary(time_range, group_by) | Summary of alerts by severity/site/status over a time range |
| sensor_status(sensor_id) | Current state of a sensor: last triggered, alert count, health |

MCP Resources (2)

| URI | Purpose |
|---|---|
| alerts://active | List of currently active (unresolved) alerts |
| alerts://sensors/{sensor_id}/history | Recent alert history for a sensor |

REST Endpoints

CRUD

| Method | Path | Maps to MCP tool |
|---|---|---|
| GET | /api/v1/alert | alert_list |
| GET | /api/v1/alert/{uid} | alert_get |
| POST | /api/v1/alert | alert_create |
| POST | /api/v1/alert/{uid} | alert_update |
| GET | /api/v1/sensor | sensor_list |
| GET | /api/v1/sensor/{uid} | sensor_get |
| POST | /api/v1/sensor | sensor_create |
| POST | /api/v1/sensor/{uid} | sensor_update |
| GET | /api/v1/subscriber | subscriber_list |
| GET | /api/v1/subscriber/{uid} | subscriber_get |
| POST | /api/v1/subscriber | subscriber_create |
| GET | /api/v1/notification | notification_list |

Workflow

| Method | Path | Maps to MCP tool |
|---|---|---|
| POST | /api/v1/alert/{uid}/acknowledge | alert_acknowledge |
| POST | /api/v1/alert/{uid}/resolve | alert_resolve |
| POST | /api/v1/alert/{uid}/escalate | alert_escalate |
| POST | /api/v1/sensor/{uid}/evaluate | alert_evaluate |

Introspection

| Method | Path | Maps to MCP tool |
|---|---|---|
| GET | /api/v1/alert/summary | alert_summary |
| GET | /api/v1/sensor/{uid}/status | sensor_status |

Service Info

{
  "name": "sentinel",
  "version": "1.0.0",
  "description": "Alerting and monitoring — event lifecycle, sensors, notifications",
  "mcp_path": "/agentspace",
  "health_path": "/health",
  "api_path": "/api/v1",
  "tools_count": 18,
  "resources_count": 2,
  "capabilities": ["alert", "sensor", "subscriber", "notification", "threshold"]
}

Command Layer

# server/mcp/commands.py

import operator as op

from axonis.userspace.alerting import AlertEvent, AlertThreshold, Subscriber, Notification
from axonis.userspace.intelligence import Memory

memory = Memory()

# Comparison functions keyed by the operator names a threshold may carry.
COMPARATORS = {"gt": op.gt, "lt": op.lt, "eq": op.eq, "gte": op.ge, "lte": op.le}

def acknowledge_alert(uid, acknowledged_by, note=""):
    """Set an alert to ACKNOWLEDGED and record the acknowledgment in memory."""
    alert = AlertEvent().read(uid=uid)
    update = {"status": "ACKNOWLEDGED", "acknowledged_by": acknowledged_by, "note": note}
    AlertEvent().update(update, uid)
    memory.create({
        "content": f"Acknowledged by {acknowledged_by}: {note}",
        "memory_type": "alert_ack",
        "source_conversation_id": uid,
    })
    # Return the alert as read, with the updated fields applied on top.
    return {**alert, **update}

def evaluate_threshold(threshold_id, value):
    """Pure check: report whether value breaches the threshold; creates no alert."""
    threshold = AlertThreshold().read(uid=threshold_id)
    operator = threshold.get("operator", "gt")
    limit = threshold.get("value", 0)
    # An unrecognized operator never triggers.
    compare = COMPARATORS.get(operator)
    triggered = bool(compare and compare(value, limit))
    return {"threshold_id": threshold_id, "value": value, "threshold": limit,
            "operator": operator, "triggered": triggered}

Alert Filters

The alert_list tool supports the following filters:

  • sensor_type — filter by sensor type (only meaningful when source_type=sensor)
  • severity — filter by severity level
  • site_id — filter by site/location
  • status — filter by alert status (ACTIVE, ACKNOWLEDGED, RESOLVED, ESCALATED)
  • source_type — filter by origin kind (sensor | rule | model | human | external)
  • source_uid — filter by specific origin UID

Filters are passed as query parameters in REST and as tool arguments in MCP. The first four are pushed down to ES via Schema.ALERT_FILTERS; source_type and source_uid are applied in Python until axonis-core's ALERT_FILTERS set is extended (tracked separately).
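
The two-stage filtering described above can be sketched as a split between filters handed to the index query and filters applied to the returned rows. This is an illustrative sketch only: split_filters and apply_post_filters are hypothetical helpers, and the real pushdown goes through Schema.ALERT_FILTERS, which is not modeled here.

```python
# Assumption: mirrors the four filters this spec says are pushed down to ES.
PUSHDOWN_FILTERS = ("sensor_type", "severity", "site_id", "status")
# Applied in Python until the core filter set is extended.
POST_FILTERS = ("source_type", "source_uid")


def split_filters(params):
    """Split request params into (pushdown, post) dicts, dropping None and unknown keys."""
    pushdown = {k: params[k] for k in PUSHDOWN_FILTERS if params.get(k) is not None}
    post = {k: params[k] for k in POST_FILTERS if params.get(k) is not None}
    return pushdown, post


def apply_post_filters(alerts, post):
    """Keep only alerts whose fields equal every post-filter value."""
    return [a for a in alerts if all(a.get(k) == v for k, v in post.items())]
```

One consequence of filtering after the query is that page sizes and counts reflect the pushed-down filters only, so a caller combining source_type with pagination may see short pages.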

Alert Lifecycle Pipeline

Beyond the request-scoped CRUD/workflow tools above, Sentinel owns an end-to-end evaluation pipeline that turns raw source readings into notifications and signals. The stages are source-agnostic; sensors are the canonical case but the same flow applies to any source_type.

Trigger flow

Source reading (sensor poll, rule trip, model score, external POST)
    |
    | 1. Normalize to standard reading schema, persist
    v
Threshold evaluation
    | 1. Resolve effective threshold (per-source override, else default)
    | 2. Evaluate reading vs threshold (alert_evaluate semantics)
    | 3. Check cooldown (see #cooldown) — skip if within min report interval
    | 4. If exceeded AND not in cooldown:
    |    a. Query matching subscribers
    |    b. Filter by min_severity and quiet hours
    |    c. Dispatch notifications (per channel)
    |    d. Write AlertEvent (status=triggered) — also acts as cooldown marker
    |    e. Write Notification record(s)
    |    f. Emit Signal to cortex via signal_create (see #signal-integration)
    v
Subscriber notified  +  ADI shows Signal in Monitor

Clear flow

When readings return below threshold, the evaluator transitions the matching AlertEvent (sets cleared_at and status) and emits a resolved Signal to cortex (signal_create). Per #invariants, the AlertEvent is status-transitioned, never deleted.
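
A minimal sketch of the clear transition, assuming alerts are plain dicts and that the cleared state is named "cleared" (the name used in the Alert → Signal status mapping). clear_alert and resolved_signal_for are hypothetical helpers; the actual call to cortex's signal_create is not shown.

```python
from datetime import datetime, timezone


def clear_alert(alert, now=None):
    """Return a copy of the alert transitioned to 'cleared'; the input is not mutated."""
    if alert.get("status") == "cleared":
        return dict(alert)  # already cleared: the transition is idempotent
    now = now or datetime.now(timezone.utc)
    cleared = dict(alert)
    cleared["status"] = "cleared"
    # Assumes `now` is a UTC datetime, matching the Z-suffixed timestamps in this spec.
    cleared["cleared_at"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return cleared


def resolved_signal_for(alert):
    """Hypothetical payload for the resolved Signal submitted to cortex."""
    return {"signal_id": alert.get("signal_id"), "status": "resolved"}
```

The record is copied and transitioned rather than removed, which keeps the sketch consistent with the append-only invariant.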

Threshold Configuration

Thresholds resolve from two layers, override taking precedence:

  1. Defaults — base threshold definitions per source/sensor type (platform-level configuration). New source types are onboarded by adding a section here; no code change required.
  2. Per-source overrides — stored as AlertThreshold records (Schema.THRESHOLD, alerts index) keyed by source/site; take precedence over defaults.

Resolution order: per-source override checked first, then the type default.
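
The resolution order can be sketched as follows. The shape of the defaults mapping (sensor type → threshold name → definition) is an assumption made for illustration; the override fields follow the AlertThreshold example in Object Schemas. resolve_threshold is a hypothetical helper, and it treats an override as a patch on top of the type default rather than a full replacement, which is also an assumption.

```python
def resolve_threshold(defaults, overrides, sensor_type, site_id, threshold_name):
    """Return the effective threshold: the type default with any matching override applied."""
    base = defaults.get(sensor_type, {}).get(threshold_name)
    if base is None:
        return None  # no default defined for this type/threshold
    effective = dict(base)  # copy so the defaults are never mutated
    for o in overrides:
        if (o.get("enabled", True)
                and o.get("site_id") == site_id
                and o.get("sensor_type") == sensor_type
                and o.get("threshold_name") == threshold_name):
            # Override fields win over the default where present.
            if o.get("override_value") is not None:
                effective["value"] = o["override_value"]
            if o.get("min_report_interval_sec") is not None:
                effective["min_report_interval_sec"] = o["min_report_interval_sec"]
            break
    return effective
```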

Evaluation modes

A threshold evaluates in one of two modes:

  1. Fixed value — the threshold's value field holds the comparison number (used by evaluate_threshold / alert_evaluate).
  2. Compare field: compare_field references another field in the source reading; the reading's value is compared against that field rather than a constant.
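
The two modes can be illustrated with one evaluator that picks its comparison limit from either the threshold's value or another field of the reading. This is a sketch under assumptions: evaluate_reading is a hypothetical helper, readings and thresholds are plain dicts, and the operator names are the five used by evaluate_threshold.

```python
import operator as op

# Same five operator names used by evaluate_threshold.
COMPARATORS = {"gt": op.gt, "lt": op.lt, "eq": op.eq, "gte": op.ge, "lte": op.le}


def evaluate_reading(reading, threshold, field="value"):
    """Compare reading[field] against a fixed value or against another reading field."""
    compare = COMPARATORS.get(threshold.get("operator", "gt"))
    if compare is None:
        return False  # unknown operator never triggers
    compare_field = threshold.get("compare_field")
    # Compare-field mode reads the limit from the reading; fixed mode from the threshold.
    limit = reading.get(compare_field) if compare_field else threshold.get("value")
    current = reading.get(field)
    if current is None or limit is None:
        return False  # missing data cannot breach a threshold
    return compare(current, limit)
```

For example, a threshold of {"operator": "gt", "compare_field": "flood_stage_ft"} (flood_stage_ft being an invented field name) would trigger only when the reading's value exceeds the flood stage carried in the same reading.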

Severity levels

| Level | Priority | Description |
|---|---|---|
| warning | Low | Initial threshold breach |
| high | Medium | Elevated concern, rapid changes |
| critical | High | Severe threshold breach |
| extreme | Highest | Emergency condition |

Cooldown Logic

A minimum report interval prevents notification storms when a reading oscillates near a threshold boundary. Cooldown is state-derived from the alerts index (no external store): before dispatching, query for the most recent AlertEvent matching (source_type, source_uid, threshold_name) with triggered_at >= now - interval (size=1, sorted triggered_at desc).

  • Found → within cooldown → skip (no notification, no new AlertEvent).
  • Not found → dispatch; the written AlertEvent becomes the cooldown marker for the next evaluation.

The interval is sourced from the effective threshold's min_report_interval_sec (per-source override first, then default — same resolution order as #threshold-config).
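
The cooldown decision can be sketched against an in-memory list standing in for the alerts index query (size=1, sorted by triggered_at descending). in_cooldown is a hypothetical helper; timestamps are assumed to be ISO 8601 strings with a Z suffix, as in the AlertEvent example.

```python
from datetime import datetime, timedelta, timezone


def in_cooldown(alerts, source_type, source_uid, threshold_name, interval_sec, now=None):
    """True if a matching AlertEvent was triggered within the last interval_sec seconds."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=interval_sec)
    key = (source_type, source_uid, threshold_name)
    for a in alerts:
        if (a["source_type"], a["source_uid"], a["threshold_name"]) != key:
            continue
        # fromisoformat does not accept a bare "Z" on older Pythons, so normalize it.
        triggered_at = datetime.fromisoformat(a["triggered_at"].replace("Z", "+00:00"))
        if triggered_at >= cutoff:
            return True  # found a marker inside the window: skip dispatch
    return False
```

Because the marker is the AlertEvent itself, a skipped evaluation writes nothing and therefore does not extend the window.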

Subscriber Routing

Subscribers carry one or more subscriptions and delivery preferences that the pipeline filters against at dispatch time:

  • subscriptions[] — each binds a source_type/sensor_type, a list of site_ids ("*" = all), a min_severity floor, and an active flag. A subscriber is matched when an active subscription's type and site match the alert and the alert severity ≥ min_severity.
  • notification_channels — ordered channels (e.g. sms); each produces a Notification record.
  • quiet_hours — optional {enabled, start, end} window in the subscriber's timezone; alerts falling inside the window are suppressed for that subscriber.

Each dispatched channel produces a Notification record (Schema.NOTIFICATION, immutable per #invariants) capturing channel, destination, provider, provider message id, status, and timestamps.
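
The matching rules can be sketched as below. Several points are assumptions: the severity ordering follows the Severity levels table, subscriptions are matched on sensor_type only, the caller is expected to have already converted the alert time to the subscriber's timezone (local_time), and a quiet-hours window whose start is later than its end is read as wrapping past midnight. Both function names are hypothetical.

```python
from datetime import time

# Rank order taken from the severity table: warning < high < critical < extreme.
SEVERITY_RANK = {"warning": 0, "high": 1, "critical": 2, "extreme": 3}


def in_quiet_hours(quiet_hours, local_time):
    """True if local_time falls in the window; handles windows that wrap midnight."""
    if not quiet_hours or not quiet_hours.get("enabled"):
        return False
    start = time.fromisoformat(quiet_hours["start"])
    end = time.fromisoformat(quiet_hours["end"])
    if start <= end:
        return start <= local_time < end
    return local_time >= start or local_time < end


def subscriber_matches(subscriber, alert, local_time):
    """Apply the active, type, site, min_severity, and quiet-hours checks."""
    if not subscriber.get("active", True):
        return False
    if in_quiet_hours(subscriber.get("quiet_hours"), local_time):
        return False
    for sub in subscriber.get("subscriptions", []):
        sites = sub.get("site_ids", [])
        if (sub.get("active", True)
                and sub.get("sensor_type") == alert.get("sensor_type")
                and ("*" in sites or alert.get("site_id") in sites)
                and SEVERITY_RANK[alert["severity"]] >= SEVERITY_RANK[sub["min_severity"]]):
            return True
    return False
```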

Signal Integration (ADI)

Alerting and the ADI signal surface serve different audiences and are intentionally distinct:

| Concern | Alerting | Signal (ADI) |
|---|---|---|
| Audience | Subscribers (phone/channel targets) | People with roles (accountability packs) |
| Action | Notify immediately | Investigate, decide, attest over hours/days |
| Output | Notification delivered | Auditable decision record with evidence |
| Storage | alerts index | intelligence index |
| Question | "Did we notify?" | "Did we respond properly?" |

Sentinel manages the notification infrastructure; the signal it emits feeds the accountability record consumed by the ADI investigation workflow (Cortex/Beacon).

Dual-path signal ingestion

Signals reach the intelligence index by two complementary, first-class paths producing identical Signal v2 documents:

  • Path 1 (pipeline-emitted). As a side effect of threshold evaluation and notification, Sentinel maps the AlertEvent to a Signal v2 document and submits it to cortex (the owning service) through cortex's signal_create surface. Cortex validates it against the signal governance rules (severity/dedup/status) and persists it to the intelligence index. Sentinel never writes the intelligence index directly; routing through cortex ensures the governance ceremony is applied rather than bypassed by a raw index write. The AlertEvent cross-references the signal_id cortex returns. Source: platform-evaluated thresholds.
  • Path 2 (direct push). External systems push a Signal directly (e.g. PUT /userspace/signal/{signal_id}) without going through the alerting pipeline; they own their threshold/state logic. Source: source-evaluated conditions (webhooks, polling jobs, correlation engines).

Both paths feed the same ADI accountability flow. Cortex loads the user's accountability pack, filters by signal_type and severity, and surfaces the signal in Beacon's role-filtered Signal Queue. There a user opens an Investigation, pins evidence, and selects a decision template; a reviewer then attests (separation of duties) before the edition is frozen and tasks are dispatched.

Alert → Signal mapping

When the pipeline (Path 1) converts an AlertEvent to a Signal:

Severity:

| Alert severity | Signal severity |
|---|---|
| warning | medium |
| high | high |
| critical | critical |
| extreme | critical |

Status:

| Alert status | Signal status |
|---|---|
| triggered | new |
| cleared | resolved |
| acknowledged | acknowledged |

Fields:

| Alert field | Signal field |
|---|---|
| source_type/sensor_type (e.g. water_level) | signal_type (e.g. sensor_water_level) |
| source_uid / site_id | subject.id |
| "sensor" (literal, when source is a sensor) | subject.type |
| location | subject.name |
| message | description |
| triggered_at | detected_at |

The mapped Signal is submitted to cortex's signal_create, which validates and persists it to the intelligence index with subtype: signal; Sentinel records the returned signal_id on the AlertEvent.
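
The mapping tables translate fairly directly into a pure function. The sketch below assumes dict-shaped alerts; alert_to_signal is a hypothetical name, and two behaviors are guesses that the tables do not settle: non-sensor sources fall back to source_type for both signal_type and subject.type, and subject.id prefers source_uid over site_id.

```python
SEVERITY_MAP = {"warning": "medium", "high": "high", "critical": "critical", "extreme": "critical"}
STATUS_MAP = {"triggered": "new", "cleared": "resolved", "acknowledged": "acknowledged"}


def alert_to_signal(alert):
    """Map an AlertEvent dict to a Signal-shaped dict following the mapping tables."""
    sensor_type = alert.get("sensor_type")
    # Sensor alerts become "sensor_<sensor_type>"; the fallback is an assumption.
    signal_type = f"sensor_{sensor_type}" if sensor_type else alert["source_type"]
    return {
        "signal_type": signal_type,
        "severity": SEVERITY_MAP[alert["severity"]],
        "status": STATUS_MAP[alert["status"]],
        "subject": {
            "id": alert.get("source_uid") or alert.get("site_id"),
            "type": alert["source_type"],
            "name": alert.get("location"),
        },
        "description": alert.get("message"),
        "detected_at": alert.get("triggered_at"),
    }
```

Note that the severity mapping is lossy: both critical and extreme alerts arrive in ADI as critical, so the original level is recoverable only through the cross-referenced AlertEvent.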

Object Schemas

Reference field shapes for the alerting objects (the alerts index, see #domain-objects). Cross-cutting envelope fields (uid, timestamps, visibility) follow platform.axonis-core.

AlertEvent

{
  "alert_id": "alert_xyz789",
  "source_type": "sensor",
  "source_uid": "hcfcd_001",
  "sensor_type": "water_level",
  "site_id": "hcfcd_001",
  "threshold_name": "minor_flood",
  "severity": "high",
  "status": "triggered",
  "current_value": 25.3,
  "threshold_value": 24.0,
  "field": "stream_level_current_ft",
  "message": "Minor flooding at hcfcd_001: 25.3 ft (threshold: 24.0 ft)",
  "signal_id": "abc123-def456",
  "triggered_at": "2026-03-01T14:30:00Z",
  "cleared_at": null,
  "notifications_sent": 3,
  "notifications_failed": 0
}

signal_id cross-references the emitted Signal in the intelligence index (see #signal-integration).

Subscriber

{
  "subscriber_id": "sub_abc123",
  "name": "Jane Doe",
  "email": "jane@example.com",
  "phone": "+15551234567",
  "notification_channels": ["sms"],
  "subscriptions": [
    {"sensor_type": "water_level", "site_ids": ["hcfcd_001", "*"], "min_severity": "warning", "active": true}
  ],
  "timezone": "America/Chicago",
  "quiet_hours": {"enabled": false, "start": "22:00", "end": "06:00"},
  "active": true
}

AlertThreshold

{
  "site_id": "hcfcd_001",
  "sensor_type": "water_level",
  "threshold_name": "minor_flood",
  "override_value": 26.0,
  "min_report_interval_sec": 900,
  "enabled": true,
  "created_by": "admin"
}

min_report_interval_sec drives the cooldown window (see #cooldown).

Notification

{
  "notification_id": "notif_def456",
  "alert_id": "alert_xyz789",
  "subscriber_id": "sub_abc123",
  "channel": "sms",
  "destination": "+15551234567",
  "status": "sent",
  "provider": "twilio",
  "provider_message_id": "SM1234567890",
  "message_body": "Minor flooding at hcfcd_001...",
  "sent_at": "2026-03-01T14:30:05Z",
  "delivered_at": "2026-03-01T14:30:08Z",
  "retry_count": 0,
  "error": null
}

Memory Namespace

sentinel — stores alert acknowledgments, resolutions, escalation history.

Migration from fedai-rest

  1. Alert objects currently live in fedai-rest's generic userspace CRUD
  2. Sentinel takes ownership of the alerts ES index
  3. fedai-rest removes alerting targets from its USERSPACE dict
  4. Oracle gateway routes alert tools to sentinel instead of fedai-rest
  5. axonis-client routing maps alerts index → sentinel service URL

Invariants

  1. Alerts are append-only. AlertEvent records are never deleted, only status-transitioned.
  2. Acknowledge/resolve require a user identity. The acknowledged_by / resolved_by fields are mandatory and come from the auth token.
  3. Threshold evaluation is pure. alert_evaluate computes whether a threshold is exceeded but does NOT create an alert. The caller decides whether to act on the result.
  4. Notifications are immutable. Once created, notification records cannot be modified.
  5. Cooldown is state-derived. The cooldown decision is computed from the alerts index alone (most recent matching AlertEvent within the interval); no separate cooldown store exists.

Test Expectations

  • CRUD roundtrip tests for all 6 object types
  • Acknowledge/resolve workflow tests
  • Threshold evaluation tests (all operators)
  • Alert filtering tests (by severity, status, site, sensor_type)
  • Summary aggregation tests
  • Auth tests (token required for all endpoints)
