Observability & Operations

Causeloop exposes health endpoints for infrastructure probes and a set of operations endpoints for operators. The observability architecture is designed around OpenTelemetry with Prometheus metrics, structured JSON logging, and Grafana dashboards. Most of this is specified in the backend specs and represents the target state; see the production-readiness notes below for what is implemented today and what is planned.

Health endpoints

Two unauthenticated health endpoints are available for infrastructure probes. These are defined in app/routers/ops.py on a public router with no auth dependency.

Endpoint	Method	Auth	Purpose
`GET /health`	GET	None	Liveness — is the process alive?
`GET /ready`	GET	None	Readiness — is the app ready to serve traffic?

Use these for Kubernetes probes, Caddy health checks, and uptime monitors:

# Kubernetes liveness + readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 4000
  initialDelaySeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 4000
  initialDelaySeconds: 5
  timeoutSeconds: 5

For docker-compose, the api service healthcheck calls /health:

healthcheck:
  test: ["CMD", "python", "-c",
         "import urllib.request; urllib.request.urlopen('http://localhost:4000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s

Health response format

{"status": "ok"}

A 503 Service Unavailable response with body {"status": "unhealthy"} is returned while the process is shutting down. If the event loop is unresponsive, the request may not be answered at all, so probes should also treat a timeout as a failure.
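For an uptime monitor or a custom probe script, the response contract above can be checked in a few lines. This is a minimal sketch using only the Python standard library; the function names `is_healthy` and `probe` are illustrative and not part of Causeloop.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """Return True only for a 200 response whose JSON body has status == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Probe a health URL; timeouts and connection errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_healthy(resp.status, resp.read())
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx, so the documented 503 arrives here.
        return is_healthy(exc.code, exc.read())
    except (urllib.error.URLError, TimeoutError, OSError):
        return False
```

Calling `probe("http://localhost:4000/health")` mirrors the docker-compose healthcheck shown earlier, with the extra step of validating the body.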

Deep health check (planned)

The observability spec defines a GET /healthz/deep endpoint (operator-only, not a k8s probe) that performs a live dependency sweep including Postgres, Redis, and LLM providers:

{
  "status": "ok",
  "service": "causeloop-backend",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 4, "critical": true},
    "ai_orch":  {"status": "ok", "latency_ms": 9, "critical": false}
  },
  "uptime_s": 82341,
  "version": "1.4.2"
}

This endpoint is planned but not yet implemented in the current codebase.
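Since the endpoint does not exist yet, the following is only a sketch of how such a sweep could assemble the response shape shown above. The aggregation rule is an assumption: the spec excerpt does not say how the top-level status is derived, so this sketch reports "unhealthy" when a critical check fails and "degraded" when only non-critical checks fail. The names `run_deep_health` and `Check` are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

# Each entry maps a dependency name to (check function, critical flag).
# A check function signals failure by raising an exception.
Check = Tuple[Callable[[], None], bool]


def run_deep_health(checks: Dict[str, Check], service: str, version: str,
                    started_at: float) -> dict:
    """Run every check, time it, and build the deep-health payload."""
    results = {}
    overall = "ok"
    for name, (check_fn, critical) in checks.items():
        t0 = time.monotonic()
        try:
            check_fn()
            status = "ok"
        except Exception:
            status = "error"
            # Assumed rule: critical failure wins over non-critical failure.
            if critical:
                overall = "unhealthy"
            elif overall == "ok":
                overall = "degraded"
        latency_ms = int((time.monotonic() - t0) * 1000)
        results[name] = {"status": status, "latency_ms": latency_ms,
                         "critical": critical}
    return {
        "status": overall,
        "service": service,
        "checks": results,
        # started_at is a time.monotonic() value captured at process start.
        "uptime_s": int(time.monotonic() - started_at),
        "version": version,
    }
```

Marking LLM providers as non-critical, as in the example response, lets the API keep serving non-AI routes while still surfacing the failure to operators.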

Operations endpoints

Operations endpoints require authentication. They are also defined in app/routers/ops.py, on a separate router that carries the standard auth dependency:

Endpoint	Auth	Description
`GET /ops/info`	Required	Runtime info: version, uptime, configuration summary
`GET /ops/db-status`	Required	Database connectivity status

Structured logging

The observability architecture specifies structured JSON logging with mandatory fields on every log line:

Field	Description
`timestamp`	ISO 8601
`level`	`DEBUG`, `INFO`, `WARNING`, `ERROR`
`service`	`causeloop-backend`
`trace_id`	OpenTelemetry trace ID (when available)
`tenant_id`	Workspace ID for the current request
`request_id`	Per-request UUID
`route`	HTTP method + path pattern
`status`	HTTP response status code

PII policy: no PII or secrets appear in logs. Email addresses, API keys, and encrypted values are never logged.
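One way to emit the mandatory fields is a custom formatter on top of the standard `logging` module. The sketch below is illustrative rather than the project's actual logging setup: it assumes request-scoped values arrive through `extra=` (for example from a hypothetical middleware), adds a `message` field that is not in the table above, and performs no PII redaction of its own.

```python
import json
import logging
import sys
from datetime import datetime, timezone

SERVICE_NAME = "causeloop-backend"
# Request-scoped fields; assumed to be supplied via `extra=` by the caller.
CONTEXT_FIELDS = ("trace_id", "tenant_id", "request_id", "route", "status")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "message": record.getMessage(),
        }
        for field in CONTEXT_FIELDS:
            # Missing context is emitted as null so every line has the same keys.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def configure_logging(level: int = logging.INFO) -> None:
    """Send JSON logs to stdout so the container runtime can forward them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

A call such as `logging.getLogger("api").info("issue created", extra={"request_id": "...", "route": "POST /v1/issues", "status": 201})` would then produce a single JSON line. Keeping email addresses and secrets out of the message and `extra` values remains the caller's responsibility.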

Centralized log aggregation (for example Loki, Datadog, or Papertrail) is not configured out of the box. The application writes structured logs to stdout, which your container runtime or a log-collector sidecar can forward to any aggregation service. Centralized logging with SIEM is a SOC 2 gap; see SOC 2 readiness.

Metrics — OpenTelemetry

The target observability stack is OpenTelemetry (OTel) with a Gateway Collector exporting to Prometheus. The spec defines RED metrics per route:

Metric	Type	Labels
Request rate	Counter	`service`, `http.route`, `http.method`, `status_code`
Error rate	Counter	`service`, `http.route`, `status_code`
Request duration	Histogram	`service`, `http.route`, `http.method`
LLM router latency	Histogram	`provider`, `model`
LLM cost/tokens	Counter	`provider`, `model`
Issues ingested	Counter	`workspace_id`
Pattern matches	Counter	`workspace_id`

The planned OTel Collector configuration uses:

Tail sampling: 100% for errors and slow requests, 5% baseline
spanmetrics connector for RED metrics derived from spans (pre-sampling, so metrics are unbiased)
Grafana dashboards per service + a tenant-health overview
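The tail-sampling policy can be read as a simple keep/drop decision made once a trace is complete. The Python sketch below only illustrates that decision; in practice it would be expressed as Collector configuration rather than application code. The slow-request threshold is an assumption (the spec excerpt does not define "slow"), as is the hash-based method for picking the 5% baseline.

```python
import hashlib
from dataclasses import dataclass

SLOW_THRESHOLD_MS = 1000.0   # assumption: "slow" is not defined in the source
BASELINE_RATE = 0.05         # 5% baseline from the planned configuration


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary) -> bool:
    """Keep all error and slow traces, plus a deterministic 5% of the rest."""
    if trace.has_error or trace.duration_ms >= SLOW_THRESHOLD_MS:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the same
    # decision, regardless of which collector instance evaluates it.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE
```

Because sampling drops most healthy traces, deriving RED metrics from the sampled set would overstate error and latency figures, which is why the spanmetrics connector is placed before sampling.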

OTel instrumentation and the Prometheus/Grafana stack are specified in the observability backend spec but are not yet wired into the current FastAPI application. This is a SOC 2 gap (CC7.1–7.3). Adding OpenTelemetry involves installing opentelemetry-sdk, opentelemetry-instrumentation-fastapi, and configuring the OTel exporter in app/main.py.

Audit log

The audit log is the primary operational record for compliance and incident investigation. It is:

Append-only: an UPDATE/DELETE trigger (0002_audit_trace_append_only.sql) raises an exception if anyone attempts to modify or delete an audit record
RLS-scoped: each workspace sees only its own audit trail
Retention-configured: governed by workspace_settings.audit_log_retention_days with a compliance floor of ≥ 365 days
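The retention floor amounts to clamping the configured value before computing a purge cutoff. The sketch below shows that arithmetic only; the helper names are hypothetical, treating an unset value as the floor is an assumption, and the source does not describe how a purge interacts with the append-only trigger.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COMPLIANCE_FLOOR_DAYS = 365  # floor stated in the docs


def effective_retention_days(configured_days: Optional[int]) -> int:
    """Clamp workspace_settings.audit_log_retention_days to the floor."""
    if configured_days is None:
        # Assumption: an unset value falls back to the compliance floor.
        return COMPLIANCE_FLOOR_DAYS
    return max(configured_days, COMPLIANCE_FLOOR_DAYS)


def purge_cutoff(configured_days: Optional[int],
                 now: Optional[datetime] = None) -> datetime:
    """Records created before the returned timestamp are past retention."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=effective_retention_days(configured_days))
```

With this rule, a workspace configured for 90 days still keeps audit records for 365 days, while a workspace configured for 730 days keeps them for the full two years.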

Schema

audit_log (
    id              TEXT PRIMARY KEY,
    workspace_id    TEXT NOT NULL,
    actor_id        TEXT,
    actor_name      TEXT,
    action          TEXT NOT NULL,        -- e.g. 'workspace.created', 'issue.deleted'
    action_category TEXT,                 -- e.g. 'admin', 'product', 'governance'
    target_type     TEXT,
    target_id       TEXT,
    before          JSONB,
    after           JSONB,
    ip_address      TEXT,
    trace_id        TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
)

Querying the audit log

# Via the API (requires audit:read scope)
curl https://api.causeloop.ai/v1/exports/audit-log \
  -H "Authorization: Bearer <token>"

-- Directly in Postgres (as schema owner)
SELECT action, actor_name, target_type, target_id, created_at
FROM audit_log
WHERE workspace_id = 'ws_01j...'
ORDER BY created_at DESC
LIMIT 100;

Alerting (planned)

The SOC 2 audit identifies security alerting as a gap. The target alert set includes:

Alert	Trigger	Criteria
Auth failure spike	>N failed auth attempts in window	CC7.2
Rate-limit threshold	Sustained rate-limit triggers from one IP	CC7.3
Database connectivity	Postgres health check failing	A1
Error rate	5xx rate > threshold for 5 minutes	A1, CC7.1
RTBF SLA breach	`due_at` passed with request not completed	P4
LLM router error rate	Provider errors > threshold	CC7.1

These alerts are not yet implemented. The recommended approach is to route them to PagerDuty, Opsgenie, or Slack through the OTel Collector or a Grafana alerting rule.
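To make the "5xx rate > threshold for 5 minutes" trigger concrete, here is a small sketch of the evaluation logic over one-minute buckets. The 5% default threshold and the `MinuteBucket` type are assumptions for illustration; a real deployment would more likely express this as a Grafana or Prometheus alerting rule.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MinuteBucket:
    total: int        # requests observed in this minute
    errors_5xx: int   # of which returned a 5xx status


def error_rate_alert(buckets: Sequence[MinuteBucket],
                     threshold: float = 0.05,
                     sustained_minutes: int = 5) -> bool:
    """Fire only when the 5xx rate exceeded the threshold in each of the
    last `sustained_minutes` buckets (oldest first in `buckets`)."""
    if len(buckets) < sustained_minutes:
        return False
    recent = buckets[-sustained_minutes:]
    # Minutes with no traffic never count as breaching.
    return all(b.total > 0 and b.errors_5xx / b.total > threshold
               for b in recent)
```

Requiring every bucket in the window to breach keeps a single bad minute from paging anyone, at the cost of a five-minute detection delay.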

Operations runbook checklist

For day-to-day operations:

Deployment: docker compose pull && docker compose up -d (VPS) or railway up / render deploy
Schema migration: make migrate after each deploy that includes new migration files
New client: SELECT onboard_client(...) — see Client provisioning
Retention purge: call purge_expired(DATABASE_URL) via a scheduled job (daily recommended)
Backup: pg_dump "$DATABASE_URL" -Fc -f backup_$(date +%Y%m%d).dump
Health check: curl https://api.yourdomain.com/health
Log tailing: docker compose logs -f api (VPS) or your platform’s log viewer
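For a scheduled job, the backup step can be wrapped in a short script. This is a sketch that assumes `pg_dump` is on the PATH and that `DATABASE_URL` is set in the environment; the function names are illustrative.

```python
import os
import subprocess
from datetime import date
from typing import List


def backup_command(database_url: str, today: date) -> List[str]:
    """Build the same pg_dump invocation as the checklist, as an argv list."""
    filename = f"backup_{today:%Y%m%d}.dump"
    # -Fc selects pg_dump's custom archive format; -f sets the output file.
    return ["pg_dump", database_url, "-Fc", "-f", filename]


def run_backup() -> None:
    """Run the backup and raise CalledProcessError if pg_dump fails."""
    cmd = backup_command(os.environ["DATABASE_URL"], date.today())
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_backup()
```

Passing the arguments as a list avoids shell quoting issues with connection strings that contain special characters.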
