[Purpose: unify how errors are classified, shaped, propagated, logged, and monitored]
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message",
    "requestId": "trace-id",
    "timestamp": "ISO-8601"
  }
}
Principles: codes are stable enum values that clients can branch on; messages never contain secrets; every error carries trace info (requestId).
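A minimal TypeScript sketch of the shape above. The code values, the type names, and buildErrorEnvelope are hypothetical and only illustrate the "stable enum" principle; they are not a prescribed catalogue.

```typescript
// Hypothetical code enum: the values are illustrative, not from the guideline.
const ERROR_CODES = ["VALIDATION_FAILED", "NOT_FOUND", "INTERNAL"] as const;
type ErrorCode = (typeof ERROR_CODES)[number];

interface ErrorEnvelope {
  error: {
    code: ErrorCode;    // stable, machine-readable
    message: string;    // human-readable, must not contain secrets
    requestId: string;  // trace id used to correlate logs and traces
    timestamp: string;  // ISO-8601
  };
}

// Builds the response body. `now` is injectable so the output is testable.
function buildErrorEnvelope(
  code: ErrorCode,
  message: string,
  requestId: string,
  now: Date = new Date(),
): ErrorEnvelope {
  return { error: { code, message, requestId, timestamp: now.toISOString() } };
}
```

Because ErrorCode is a closed union, a misspelled or ad hoc code fails at compile time, which is one way to keep codes stable.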
Example pattern:
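try {
  return await useCase();
} catch (e) {
  // Expected domain failure: map it to its stable code.
  if (e instanceof BusinessError) return respondMapped(e);
  // Anything else is unexpected: log the details, return a generic error.
  logError(e);
  return respondInternal();
}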
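A self-contained sketch of the pattern above, assuming a hypothetical BusinessError class that carries its own code and HTTP status. The handle and envelope helpers and the HttpResult type are illustrative names, not part of any framework.

```typescript
// Hypothetical domain error carrying a stable code and an HTTP status.
class BusinessError extends Error {
  constructor(
    public readonly code: string,
    message: string,
    public readonly status: number = 422,
  ) {
    super(message);
    this.name = "BusinessError";
    // Keeps `instanceof` working when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

interface HttpResult {
  status: number;
  body: unknown;
}

function envelope(code: string, message: string, requestId: string) {
  return { error: { code, message, requestId, timestamp: new Date().toISOString() } };
}

// Boundary handler: maps expected errors, hides and logs unexpected ones.
async function handle<T>(
  useCase: () => Promise<T>,
  requestId: string,
  logError: (e: unknown) => void,
): Promise<HttpResult> {
  try {
    return { status: 200, body: await useCase() };
  } catch (e) {
    if (e instanceof BusinessError) {
      return { status: e.status, body: envelope(e.code, e.message, requestId) };
    }
    logError(e); // full details go to the log, never to the client
    return { status: 500, body: envelope("INTERNAL", "Internal error", requestId) };
  }
}
```

The unexpected branch deliberately returns a fixed message, so internal details such as stack traces or connection strings stay in the logs.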
Log: operation, userId (if available), code, message, stack, requestId, minimal context.
Do not log: passwords, tokens, secrets, full PII, full bodies with sensitive data.
Levels: ERROR (failures), WARN (recoverable/edge), INFO (key events), DEBUG (diagnostics).
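One way to enforce the "do not log" rule is to redact context before it reaches the logger. The sketch below assumes a hypothetical deny-list of key names; real services would tune the list, and key-based redaction does not catch secrets embedded in free-text values.

```typescript
// Hypothetical deny-list; matched case-insensitively against object keys.
const SENSITIVE_KEYS = new Set(["password", "token", "secret", "authorization"]);

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Returns a deep copy of `value` with sensitive fields masked.
function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

interface ErrorLogEntry {
  level: "ERROR";
  operation: string;
  requestId: string;
  code: string;
  message: string;
  userId?: string;
  stack?: string;
  context: Json;
}

// Builds a structured ERROR entry with the fields listed above.
function errorLogEntry(
  operation: string,
  requestId: string,
  code: string,
  err: Error,
  context: Json,
  userId?: string,
): ErrorLogEntry {
  return {
    level: "ERROR",
    operation,
    requestId,
    code,
    message: err.message,
    userId,
    stack: err.stack,
    context: redact(context),
  };
}
```

Redacting at entry construction means callers cannot forget it, at the cost of one deep copy per logged error.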
Retry when: the failure is a network error, a timeout, or a transient 5xx AND the operation is idempotent. Do not retry: 4xx, business errors, non-idempotent flows. Strategy: exponential backoff with jitter and a capped number of attempts; require idempotency keys so that a retried write can be deduplicated.
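A sketch of the retry strategy, using the "full jitter" variant (a random delay between 0 and the capped exponential ceiling). RetryOptions, backoffDelay, and retry are hypothetical names; the caller decides what counts as retryable and is assumed to send the same idempotency key on every attempt.

```typescript
interface RetryOptions {
  maxAttempts: number;                     // total attempts, including the first
  baseDelayMs: number;                     // ceiling for the first backoff
  maxDelayMs: number;                      // cap on the backoff ceiling
  isRetryable: (e: unknown) => boolean;    // e.g. network error, timeout, transient 5xx
  sleep?: (ms: number) => Promise<void>;   // injectable for tests
  random?: () => number;                   // injectable for tests, returns [0, 1)
}

// Full jitter: uniform delay in [0, min(maxDelayMs, baseDelayMs * 2^(attempt-1))).
function backoffDelay(
  attempt: number, // 1-based number of the attempt that just failed
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

async function retry<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (e) {
      // Give up when attempts are exhausted or the error is not transient.
      if (attempt >= opts.maxAttempts || !opts.isRetryable(e)) throw e;
      await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs, opts.random));
    }
  }
}
```

Jitter spreads retries from many clients over time, so they do not hit a recovering dependency in synchronized waves.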
Track: error rates by code/category, latency, saturation; alert on spikes/SLI breaches.
Expose health: /health (liveness), /health/ready (readiness). Link errors to traces via requestId.
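A framework-agnostic sketch of the two health endpoints, assuming liveness always answers 200 while the process is running and readiness answers 503 when any dependency check fails. The Check type, the health function, and the response body shape are hypothetical.

```typescript
// A dependency probe, e.g. "can we reach the database?". Hypothetical type.
type Check = () => Promise<boolean>;

interface HealthResult {
  status: number;
  body: { status: "ok" | "unavailable"; failed?: string[] };
}

// Returns undefined for paths that are not health endpoints.
async function health(
  path: string,
  checks: Record<string, Check>,
): Promise<HealthResult | undefined> {
  // Liveness: the process is up and able to answer.
  if (path === "/health") return { status: 200, body: { status: "ok" } };
  if (path !== "/health/ready") return undefined;

  // Readiness: every dependency check must pass; a rejected check counts as failed.
  const names = Object.keys(checks);
  const results = await Promise.all(names.map((name) => checks[name]().catch(() => false)));
  const failed = names.filter((_, i) => !results[i]);
  return failed.length === 0
    ? { status: 200, body: { status: "ok" } }
    : { status: 503, body: { status: "unavailable", failed } };
}
```

Keeping liveness free of dependency checks avoids restarting a healthy process just because a downstream service is down.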
Focus on patterns and decisions. No implementation details or exhaustive lists.