Causeloop exposes health endpoints for infrastructure probes and a set of operations endpoints for operators. The observability architecture is designed around OpenTelemetry with Prometheus metrics, structured JSON logging, and Grafana dashboards. Most of these components are specified in the backend specs and represent the target state; see the production-readiness notes below for current vs. planned status.

Health endpoints

Two unauthenticated health endpoints are available for infrastructure probes. These are defined in app/routers/ops.py on a public router with no auth dependency.
| Endpoint | Method | Auth | Purpose |
|---|---|---|---|
| GET /health | GET | None | Liveness: is the process alive? |
| GET /ready | GET | None | Readiness: is the app ready to serve traffic? |
Use these for Kubernetes probes, Caddy health checks, and uptime monitors:
# Kubernetes liveness + readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 4000
  initialDelaySeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 4000
  initialDelaySeconds: 5
  timeoutSeconds: 5
For docker-compose, the api service healthcheck calls /health:
healthcheck:
  test: ["CMD", "python", "-c",
         "import urllib.request; urllib.request.urlopen('http://localhost:4000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s

Health response format

{"status": "ok"}
A 503 Service Unavailable response (with body {"status": "unhealthy"}) is returned if the process is shutting down or the event loop is unresponsive.
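The response contract above can be checked from a monitoring script. The following is a minimal sketch using only the Python standard library; the function names `is_healthy` and `probe` are illustrative and not part of Causeloop, and the sketch assumes the body shapes shown above.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only for a 200 response whose JSON body is {"status": "ok"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Fetch a health URL and classify it; network failures count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx responses, so a 503 is handled here.
        return is_healthy(exc.code, exc.read().decode("utf-8", errors="replace"))
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return False
```

For example, `probe("http://localhost:4000/health")` would return `False` both for a 503 response and for a connection that times out, which matches how the docker-compose healthcheck above treats any failed `urlopen` call.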

Deep health check (planned)

The observability spec defines a GET /healthz/deep endpoint (operator-only, not a k8s probe) that performs a live dependency sweep including Postgres, Redis, and LLM providers:
{
  "status": "ok",
  "service": "causeloop-backend",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 4, "critical": true},
    "ai_orch":  {"status": "ok", "latency_ms": 9, "critical": false}
  },
  "uptime_s": 82341,
  "version": "1.4.2"
}
This endpoint is specified in the backend observability spec (docs/backend-specs/17_OBSERVABILITY/) but is not yet implemented in the current codebase.
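Because the endpoint is not implemented yet, the sketch below only illustrates one way the `critical` flag in the example payload could drive the top-level `status`. The `"fail"` check status and the `"degraded"` and `"unhealthy"` overall values are assumptions made for this illustration; the spec excerpt above only shows `"ok"`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    status: str       # "ok", or an assumed non-ok value such as "fail"
    latency_ms: int
    critical: bool    # whether a failure should mark the whole service unhealthy


def overall_status(checks: dict[str, CheckResult]) -> str:
    """Roll per-dependency results up into a single status string."""
    failed = [c for c in checks.values() if c.status != "ok"]
    if any(c.critical for c in failed):
        return "unhealthy"   # assumed value: a critical dependency is down
    if failed:
        return "degraded"    # assumed value: only non-critical dependencies failed
    return "ok"
```

Under these assumptions, a failing `ai_orch` check (non-critical) would report `degraded`, while a failing `postgres` check (critical) would report `unhealthy`.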

Operations endpoints

Protected operations endpoints (require auth) are also defined in app/routers/ops.py on a separate router with the standard auth dependency:
| Endpoint | Auth | Description |
|---|---|---|
| GET /ops/info | Required | Runtime info: version, uptime, configuration summary |
| GET /ops/db-status | Required | Database connectivity status |

Structured logging

The observability architecture specifies structured JSON logging with mandatory fields on every log line:
| Field | Description |
|---|---|
| timestamp | ISO 8601 |
| level | DEBUG, INFO, WARNING, ERROR |
| service | causeloop-backend |
| trace_id | OpenTelemetry trace ID (when available) |
| tenant_id | Workspace ID for the current request |
| request_id | Per-request UUID |
| route | HTTP method + path pattern |
| status | HTTP response status code |
PII policy: no PII or secrets appear in logs. Email addresses, API keys, and encrypted values are never logged.
Centralized log aggregation (Loki, Datadog, Papertrail) is not yet configured out of the box. The application writes structured logs to stdout, which can be forwarded to any log aggregation service via your container runtime or a log collector sidecar. Centralized logging + SIEM is a SOC 2 gap — see SOC 2 readiness.
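As an illustration of the field list above, here is a minimal sketch of a formatter built on Python's standard `logging` module. It is not the application's actual logging setup; the `JsonFormatter` class name and the extra `message` field are assumptions, and request-scoped values are expected to arrive through the `extra` argument of the logging call.

```python
import json
import logging
from datetime import datetime, timezone

# Request-scoped fields from the table above; missing ones are emitted as null.
CONTEXT_FIELDS = ("trace_id", "tenant_id", "request_id", "route", "status")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "causeloop-backend",
            "message": record.getMessage(),  # assumed field, not in the mandatory list
        }
        for field in CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)
```

A handler writing to stdout could be configured with `handler.setFormatter(JsonFormatter())`, and a request log line emitted with `logger.info("request completed", extra={"tenant_id": "ws_01j...", "route": "GET /health", "status": 200})`. Keeping the context fields in an explicit allow-list also makes it harder for PII to reach the log output by accident.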

Metrics — OpenTelemetry

The target observability stack is OpenTelemetry (OTel) with a Gateway Collector exporting to Prometheus. The spec defines RED metrics per route:
| Metric | Type | Labels |
|---|---|---|
| Request rate | Counter | service, http.route, http.method, status_code |
| Error rate | Counter | service, http.route, status_code |
| Request duration | Histogram | service, http.route, http.method |
| LLM router latency | Histogram | provider, model |
| LLM cost/tokens | Counter | provider, model |
| Issues ingested | Counter | workspace_id |
| Pattern matches | Counter | workspace_id |
The OTel Collector configuration (from docs/backend-specs/17_OBSERVABILITY/) uses:
  • Tail sampling: 100% for errors and slow requests, 5% baseline
  • spanmetrics connector for RED metrics derived from spans (pre-sampling, so metrics are unbiased)
  • Grafana dashboards per service + a tenant-health overview
OTel instrumentation and the Prometheus/Grafana stack are specified in the observability backend spec but are not yet wired into the current FastAPI application. This is a SOC 2 gap (CC7.1–7.3). Adding OpenTelemetry involves installing opentelemetry-sdk, opentelemetry-instrumentation-fastapi, and configuring the OTel exporter in app/main.py.
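The tail-sampling policy listed above (keep every error and slow trace, plus a 5% baseline) can be expressed as a small decision function. This is a plain-Python sketch of the policy logic only, not the OTel Collector's implementation or configuration; `keep_trace` is a hypothetical name and the 1000 ms slow threshold is an assumption, since the spec excerpt does not define "slow".

```python
import hashlib

BASELINE_RATE = 0.05       # 5% baseline from the spec
SLOW_THRESHOLD_MS = 1000   # assumption: "slow" is not defined in the excerpt above


def keep_trace(trace_id: str, had_error: bool, duration_ms: float) -> bool:
    """Decide, after a trace completes, whether it should be retained."""
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True  # errors and slow requests are always kept
    # Hash the trace ID into [0, 1) so the baseline decision is stable per trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE
```

Because the decision discards most healthy traces, metrics computed from the retained traces alone would overstate error rates. That is why the spec derives RED metrics from spans before sampling.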

Audit log

The audit log is the primary operational record for compliance and incident investigation. It is:
  • Append-only: an UPDATE/DELETE trigger (0002_audit_trace_append_only.sql) raises an exception if anyone attempts to modify or delete an audit record
  • RLS-scoped: each workspace sees only its own audit trail
  • Retention-configured: governed by workspace_settings.audit_log_retention_days with a compliance floor of ≥ 365 days
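To make the append-only behavior concrete, the sketch below reproduces the idea with an in-memory SQLite database from the Python standard library. It is an analogy only: the real trigger lives in the Postgres migration named above, whose contents are not shown here, and the trigger names, error message, and reduced three-column table are assumptions for the demo.

```python
import sqlite3


def make_append_only_db() -> sqlite3.Connection:
    """Create a reduced audit_log table whose rows cannot be changed or removed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE audit_log (
            id           TEXT PRIMARY KEY,
            workspace_id TEXT NOT NULL,
            action       TEXT NOT NULL
        );
        -- SQLite stand-ins for the Postgres UPDATE/DELETE trigger.
        CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        """
    )
    return conn


def try_statement(conn: sqlite3.Connection, sql: str) -> bool:
    """Run one statement; return False if the database rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False
```

With this setup an `INSERT` succeeds, while any `UPDATE` or `DELETE` against `audit_log` is rejected and the existing rows are left untouched.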

Schema

audit_log (
    id              TEXT PRIMARY KEY,
    workspace_id    TEXT NOT NULL,
    actor_id        TEXT,
    actor_name      TEXT,
    action          TEXT NOT NULL,        -- e.g. 'workspace.created', 'issue.deleted'
    action_category TEXT,                 -- e.g. 'admin', 'product', 'governance'
    target_type     TEXT,
    target_id       TEXT,
    before          JSONB,
    after           JSONB,
    ip_address      TEXT,
    trace_id        TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
)

Querying the audit log

# Via the API (requires audit:read scope)
curl https://api.causeloop.ai/v1/exports/audit-log \
  -H "Authorization: Bearer <token>"

-- Directly in Postgres (as schema owner)
SELECT action, actor_name, target_type, target_id, created_at
FROM audit_log
WHERE workspace_id = 'ws_01j...'
ORDER BY created_at DESC
LIMIT 100;

Alerting (planned)

The SOC 2 audit identifies security alerting as a gap. The target alert set includes:
| Alert | Trigger | SOC 2 criteria |
|---|---|---|
| Auth failure spike | >N failed auth attempts in window | CC7.2 |
| Rate-limit threshold | Sustained rate-limit triggers from one IP | CC7.3 |
| Database connectivity | Postgres health check failing | A1 |
| Error rate | 5xx rate > threshold for 5 minutes | A1, CC7.1 |
| RTBF SLA breach | due_at passed with request not completed | P4 |
| LLM router error rate | Provider errors > threshold | CC7.1 |
These are not yet implemented. Integrating with PagerDuty, OpsGenie, or Slack alerts via the OTel Collector or a Grafana alerting rule is the recommended approach.
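As an illustration of the "auth failure spike" trigger, the following sketch counts failures in a sliding time window. In practice this logic would more likely live in a Grafana alerting rule than in application code; the `FailureSpikeDetector` class is hypothetical, and the values of N and the window length are left to the caller because the table above does not fix them.

```python
from collections import deque


class FailureSpikeDetector:
    """Fire when more than `max_failures` failures fall inside `window_s` seconds."""

    def __init__(self, max_failures: int, window_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._events: deque[float] = deque()  # timestamps of recent failures

    def record_failure(self, now: float) -> bool:
        """Record one failed attempt at time `now`; return True if the alert should fire."""
        self._events.append(now)
        # Drop failures that have aged out of the window.
        while self._events and self._events[0] <= now - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_failures
```

A caller would create one detector per source (for example per IP address or per workspace) and pass `time.monotonic()` as `now` on each failed authentication attempt.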

Operations runbook checklist

For day-to-day operations:
  • Deployment: docker compose pull && docker compose up -d (VPS) or railway up / render deploy
  • Schema migration: make migrate after each deploy that includes new migration files
  • New client: SELECT onboard_client(...) — see Client provisioning
  • Retention purge: call purge_expired(DATABASE_URL) via a scheduled job (daily recommended)
  • Backup: pg_dump "$DATABASE_URL" -Fc -f backup_$(date +%Y%m%d).dump
  • Health check: curl https://api.yourdomain.com/health
  • Log tailing: docker compose logs -f api (VPS) or your platform’s log viewer