Causeloop exposes health endpoints for infrastructure probes and a set of operations endpoints for operators. The observability architecture is designed around OpenTelemetry with Prometheus metrics, structured JSON logging, and Grafana dashboards. Most of this is specified in the backend specs and represents the target state; see the production-readiness notes below for what exists today and what is planned.
Health endpoints
Two unauthenticated health endpoints are available for infrastructure probes. These are defined in app/routers/ops.py on a public router with no auth dependency.
| Endpoint | Method | Auth | Purpose |
|---|---|---|---|
| /health | GET | None | Liveness: is the process alive? |
| /ready | GET | None | Readiness: is the app ready to serve traffic? |
Use these for Kubernetes probes, Caddy health checks, and uptime monitors:
# Kubernetes liveness + readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 4000
  initialDelaySeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 4000
  initialDelaySeconds: 5
  timeoutSeconds: 5
For docker-compose, the api service healthcheck calls /health:
healthcheck:
  test: ["CMD", "python", "-c",
         "import urllib.request; urllib.request.urlopen('http://localhost:4000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
A 503 Service Unavailable response (with body {"status": "unhealthy"}) is returned if the process is shutting down or the event loop is unresponsive.
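For uptime monitors or smoke tests outside Kubernetes, the same two endpoints can be polled from a small script. The following is a minimal sketch using only the Python standard library. It assumes that a 200 response means healthy (the text above only documents the 503 case), and the helper names `fetch_status` and `probe` are hypothetical, not part of the Causeloop codebase.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (for example 503 during shutdown) land here.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, or timeout (URLError is an OSError).
        return 0


def probe(base_url: str, fetch: Callable[[str], int] = fetch_status) -> Dict[str, bool]:
    """Check /health and /ready; assumes 200 means the check passed."""
    return {
        "live": fetch(f"{base_url}/health") == 200,
        "ready": fetch(f"{base_url}/ready") == 200,
    }
```

Calling `probe("http://localhost:4000")` against a running API returns a dictionary with a `live` and a `ready` flag. The `fetch` parameter exists so the logic can be exercised without a network connection.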
Deep health check (planned)
The observability spec defines a GET /healthz/deep endpoint (operator-only, not a k8s probe) that performs a live dependency sweep including Postgres, Redis, and LLM providers:
{
  "status": "ok",
  "service": "causeloop-backend",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 4, "critical": true},
    "ai_orch": {"status": "ok", "latency_ms": 9, "critical": false}
  },
  "uptime_s": 82341,
  "version": "1.4.2"
}
This endpoint is specified in the backend observability spec (docs/backend-specs/17_OBSERVABILITY/) but is not yet implemented in the current codebase.
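The `critical` flag in the example payload suggests that the top-level `status` is derived from the individual checks. The sketch below shows one plausible aggregation rule. It is an assumption, not the spec: the `degraded` label and the rule itself (a failed critical check makes the service unhealthy, a failed non-critical check only degrades it) are illustrative, and `overall_status` is a hypothetical helper.

```python
from typing import Dict, TypedDict


class Check(TypedDict):
    status: str       # "ok" or any other value for a failing dependency
    latency_ms: int
    critical: bool


def overall_status(checks: Dict[str, Check]) -> str:
    """Fold per-dependency results into one top-level status (assumed rule)."""
    failing = [c for c in checks.values() if c["status"] != "ok"]
    if any(c["critical"] for c in failing):
        return "unhealthy"   # a critical dependency such as Postgres is down
    if failing:
        return "degraded"    # only non-critical dependencies are failing
    return "ok"
```

Under this rule, an LLM orchestrator outage would surface to operators without marking the whole service as down.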
Operations endpoints
Protected operations endpoints (require auth) are also defined in app/routers/ops.py on a separate router with the standard auth dependency:
| Endpoint | Auth | Description |
|---|---|---|
| GET /ops/info | Required | Runtime info: version, uptime, configuration summary |
| GET /ops/db-status | Required | Database connectivity status |
Structured logging
The observability architecture specifies structured JSON logging with mandatory fields on every log line:
| Field | Description |
|---|---|
| timestamp | ISO 8601 |
| level | DEBUG, INFO, WARNING, ERROR |
| service | causeloop-backend |
| trace_id | OpenTelemetry trace ID (when available) |
| tenant_id | Workspace ID for the current request |
| request_id | Per-request UUID |
| route | HTTP method + path pattern |
| status | HTTP response status code |
PII policy: no PII or secrets appear in logs. Email addresses, API keys, and encrypted values are never logged.
Centralized log aggregation (Loki, Datadog, Papertrail) is not yet configured out of the box. The application writes structured logs to stdout, which can be forwarded to any log aggregation service via your container runtime or a log collector sidecar. Centralized logging + SIEM is a SOC 2 gap — see SOC 2 readiness.
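As an illustration of the field list above, here is a minimal sketch of a JSON formatter built on the standard `logging` module that writes one object per line to stdout. It is not the application's actual logging setup: the class name `JsonFormatter`, the `build_logger` helper, and the extra `message` field are assumptions made for the example.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Request-scoped fields from the table; callers pass them via `extra=`.
CONTEXT_FIELDS = ("trace_id", "tenant_id", "request_id", "route", "status")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the mandatory fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "causeloop-backend",
            "message": record.getMessage(),  # not in the table; added for readability
        }
        for field in CONTEXT_FIELDS:
            # Missing context (for example no active trace) is emitted as null.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(name: str = "causeloop") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("request done", extra={"route": "GET /health", "status": 200})` then produces a line that a collector sidecar can forward unchanged. Keeping PII out of the output remains the caller's responsibility; this formatter does no redaction.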
Metrics — OpenTelemetry
The target observability stack is OpenTelemetry (OTel) with a Gateway Collector exporting to Prometheus. The spec defines RED metrics (rate, errors, duration) per route, plus a small set of LLM and domain metrics:
| Metric | Type | Labels |
|---|---|---|
| Request rate | Counter | service, http.route, http.method, status_code |
| Error rate | Counter | service, http.route, status_code |
| Request duration | Histogram | service, http.route, http.method |
| LLM router latency | Histogram | provider, model |
| LLM cost/tokens | Counter | provider, model |
| Issues ingested | Counter | workspace_id |
| Pattern matches | Counter | workspace_id |
The OTel Collector configuration (from docs/backend-specs/17_OBSERVABILITY/) uses:
- Tail sampling: 100% for errors and slow requests, 5% baseline
- spanmetrics connector for RED metrics derived from spans (pre-sampling, so metrics are unbiased)
- Grafana dashboards per service + a tenant-health overview
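The tail-sampling rule in the first bullet can be read as a per-trace decision made once the trace is complete. In the Collector this is expressed as configuration rather than application code, so the Python below is only a sketch of the logic. The slow-request cutoff `SLOW_THRESHOLD_MS` is a hypothetical value (the spec's threshold is not stated here), and hashing the trace ID is one common way to make the 5% baseline deterministic, not necessarily what the Collector does.

```python
import hashlib

BASELINE_RATE = 0.05       # 5% baseline, from the spec
SLOW_THRESHOLD_MS = 1000   # hypothetical cutoff for a "slow" request


def keep_trace(trace_id: str, had_error: bool, duration_ms: float) -> bool:
    """Decide whether a finished trace is retained."""
    # Errors and slow requests are always kept (100%).
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True
    # Everything else: map the trace ID to a stable value in [0, 1)
    # and keep the lowest 5% of that range.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE
```

Because the decision depends only on the trace ID, every replica reaches the same verdict for the same trace. Metrics derived before this step still count every request, which is why the spanmetrics figures stay unbiased.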
OTel instrumentation and the Prometheus/Grafana stack are specified in the observability backend spec but are not yet wired into the current FastAPI application. This is a SOC 2 gap (CC7.1–7.3). Adding OpenTelemetry involves installing opentelemetry-sdk, opentelemetry-instrumentation-fastapi, and configuring the OTel exporter in app/main.py.
Audit log
The audit log is the primary operational record for compliance and incident investigation. It is:
- Append-only: an UPDATE/DELETE trigger (0002_audit_trace_append_only.sql) raises an exception if anyone attempts to modify or delete an audit record
- RLS-scoped: each workspace sees only its own audit trail
- Retention-configured: governed by workspace_settings.audit_log_retention_days with a compliance floor of ≥ 365 days
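To make the retention floor concrete, the sketch below computes the oldest timestamp that must still be kept for a given setting. It assumes the floor is applied by clamping the configured value up to 365 days; the real system may instead reject lower values at write time. Both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

COMPLIANCE_FLOOR_DAYS = 365  # floor stated in the retention policy above


def effective_retention_days(configured_days: int) -> int:
    """Never retain for less than the compliance floor (assumed clamping)."""
    return max(configured_days, COMPLIANCE_FLOOR_DAYS)


def purge_cutoff(configured_days: int, now: datetime) -> datetime:
    """Audit records created before this instant are eligible for purge."""
    return now - timedelta(days=effective_retention_days(configured_days))
```

For example, a workspace configured for 90 days would still keep a full year of audit history under this rule, while one configured for 730 days keeps two years.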
Schema
audit_log (
  id              TEXT PRIMARY KEY,
  workspace_id    TEXT NOT NULL,
  actor_id        TEXT,
  actor_name      TEXT,
  action          TEXT NOT NULL,  -- e.g. 'workspace.created', 'issue.deleted'
  action_category TEXT,           -- e.g. 'admin', 'product', 'governance'
  target_type     TEXT,
  target_id       TEXT,
  before          JSONB,
  after           JSONB,
  ip_address      TEXT,
  trace_id        TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
)
Querying the audit log
# Via the API (requires audit:read scope)
curl https://api.causeloop.ai/v1/exports/audit-log \
  -H "Authorization: Bearer <token>"

-- Directly in Postgres (as schema owner)
SELECT action, actor_name, target_type, target_id, created_at
FROM audit_log
WHERE workspace_id = 'ws_01j...'
ORDER BY created_at DESC
LIMIT 100;
Alerting (planned)
The SOC 2 audit identifies security alerting as a gap. The target alert set includes:
| Alert | Trigger | Criteria |
|---|---|---|
| Auth failure spike | >N failed auth attempts in window | CC7.2 |
| Rate-limit threshold | Sustained rate-limit triggers from one IP | CC7.3 |
| Database connectivity | Postgres health check failing | A1 |
| Error rate | 5xx rate > threshold for 5 minutes | A1, CC7.1 |
| RTBF SLA breach | due_at passed with request not completed | P4 |
| LLM router error rate | Provider errors > threshold | CC7.1 |
These are not yet implemented. Integrating with PagerDuty, OpsGenie, or Slack alerts via the OTel Collector or a Grafana alerting rule is the recommended approach.
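The "for 5 minutes" condition on the error-rate alert means a single bad scrape should not page anyone; the rate has to stay above the threshold for the whole period. The sketch below models that behaviour in plain Python so the semantics are easy to check. The class name `ErrorRateAlert` and the 5% default threshold are hypothetical; in practice this logic would more likely live in a Grafana alerting rule than in application code.

```python
from typing import Optional


class ErrorRateAlert:
    """Fires once the 5xx rate has stayed above threshold for sustain_s seconds."""

    def __init__(self, threshold: float = 0.05, sustain_s: float = 300.0) -> None:
        self.threshold = threshold      # hypothetical default: 5% of requests
        self.sustain_s = sustain_s      # 5 minutes, per the alert table
        self.breach_started_at: Optional[float] = None

    def observe(self, now_s: float, total: int, errors_5xx: int) -> bool:
        """Feed the counts for one interval; return True if the alert should fire."""
        rate = errors_5xx / total if total else 0.0
        if rate <= self.threshold:
            # Recovery resets the clock, so brief spikes never accumulate.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now_s
        return now_s - self.breach_started_at >= self.sustain_s
```

Feeding it one sample per scrape interval is enough: the alert fires on the first sample taken at least five minutes after the breach began, and any healthy sample in between resets the timer.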
Operations runbook checklist
For day-to-day operations:
- Deployment: docker compose pull && docker compose up -d (VPS) or railway up / render deploy
- Schema migration: make migrate after each deploy that includes new migration files
- New client: SELECT onboard_client(...) (see Client provisioning)
- Retention purge: call purge_expired(DATABASE_URL) via a scheduled job (daily recommended)
- Backup: pg_dump "$DATABASE_URL" -Fc -f backup_$(date +%Y%m%d).dump
- Health check: curl https://api.yourdomain.com/health
- Log tailing: docker compose logs -f api (VPS) or your platform’s log viewer