Production API health monitoring & incident management dashboard
Live monitoring dashboard deployed on Railway + Vercel, processing 1,200+ endpoint checks per minute with sub-millisecond aggregate queries and real-time WebSocket event streaming.
Architecture
Data flow from user interface through API layer to persistence and cloud deployment
The problem
Support engineers spend too much time discovering incidents through customer reports rather than internal monitoring. SENTINEL/OPS was built to demonstrate what production-grade API health tooling looks like: real polling, real alerting, real SLA tracking, not a mocked dashboard.
Architecture decisions
The core design decision was to split the pipeline into four discrete BullMQ queues: scheduling, HTTP probing, rule evaluation, and notification. Each stage handles only its own job payload, so the notifier never sees raw HTTP responses and the checker contains no alert logic.
- Scheduler enqueues check jobs at configurable intervals per endpoint
- Checker performs HTTP probe, writes result to TimescaleDB hypertable
- Checker also calls broadcast() to push each result over WebSocket as soon as the probe completes
- Evaluator reads continuous aggregate (not raw rows) for rule evaluation
- Notifier fires Twilio SMS and/or SendGrid email based on alert config
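
The stage boundaries above can be sketched without any infrastructure. This is a minimal illustration, not the project's code: an in-memory `MemoryQueue` stands in for BullMQ, an array stands in for the TimescaleDB hypertable, and every type name, field name, and the idea that the checker enqueues the evaluation job are assumptions made for the example. The probe is injected so the sketch runs offline.

```typescript
// Payload contracts between stages (all names are hypothetical).
interface CheckJob { endpointId: string; url: string }
interface CheckResult { endpointId: string; status: number; latencyMs: number; checkedAt: number }
interface EvalJob { endpointId: string }
interface AlertJob { endpointId: string; message: string }

// In-memory stand-in for a queue; BullMQ would persist jobs in Redis instead.
class MemoryQueue<T> {
  private jobs: T[] = [];
  add(job: T): void { this.jobs.push(job); }
  drain(handler: (job: T) => void): void {
    while (this.jobs.length > 0) handler(this.jobs.shift() as T);
  }
}

type Probe = (url: string) => { status: number; latencyMs: number };

function createPipeline() {
  const checkQueue = new MemoryQueue<CheckJob>();
  const evalQueue = new MemoryQueue<EvalJob>();
  const notifyQueue = new MemoryQueue<AlertJob>();
  const results: CheckResult[] = []; // stand-in for the hypertable
  const sent: string[] = [];         // stand-in for SMS/email delivery

  // Scheduler: only enqueues check jobs.
  function schedule(endpoints: CheckJob[]): void {
    for (const endpoint of endpoints) checkQueue.add(endpoint);
  }

  // Checker: probes and stores the result; it holds no alert rules.
  function runChecker(probe: Probe, now: number): void {
    checkQueue.drain((job) => {
      const { status, latencyMs } = probe(job.url);
      results.push({ endpointId: job.endpointId, status, latencyMs, checkedAt: now });
      evalQueue.add({ endpointId: job.endpointId }); // assumption: checker triggers evaluation
    });
  }

  // Evaluator: reads stored data (the real system reads the aggregate) and applies rules.
  function runEvaluator(maxErrorRate: number): void {
    evalQueue.drain((job) => {
      const rows = results.filter((r) => r.endpointId === job.endpointId);
      const errors = rows.filter((r) => r.status >= 500).length;
      const errorRate = rows.length === 0 ? 0 : errors / rows.length;
      if (errorRate > maxErrorRate) {
        const message = `error rate ${Math.round(errorRate * 100)}% exceeds ${Math.round(maxErrorRate * 100)}%`;
        notifyQueue.add({ endpointId: job.endpointId, message });
      }
    });
  }

  // Notifier: receives only AlertJob payloads, never a raw HTTP response.
  function runNotifier(): void {
    notifyQueue.drain((alert) => sent.push(`${alert.endpointId}: ${alert.message}`));
  }

  return { results, sent, schedule, runChecker, runEvaluator, runNotifier };
}
```

Because `AlertJob` carries only an endpoint id and a message, the type system itself enforces the rule that the notifier cannot depend on probe details.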
TimescaleDB as the backbone
TimescaleDB's continuous aggregates pre-compute 1-minute windowed averages of latency, uptime, and error rates. Dashboard queries read the `check_results_1min` materialized view instead of scanning raw rows, so they stay sub-millisecond as the volume of raw check data grows. The generated `mttr_minutes` column is computed by the database as soon as `resolved_at` is written, so recovery-time (MTTR) figures need no application-layer math.
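
To make concrete what the database pre-computes, here is an in-memory sketch of the same 1-minute bucketing and of the recovery-time arithmetic. It is not the TimescaleDB view definition: the row shape, field names, and both function names are assumptions, and the real work happens inside the database rather than in application code.

```typescript
// Hypothetical shape of one raw check row.
interface CheckRow { checkedAt: Date; latencyMs: number; ok: boolean }

// Hypothetical shape of one row in a 1-minute aggregate.
interface MinuteBucket {
  bucket: string;        // ISO timestamp of the minute start (UTC)
  avgLatencyMs: number;
  uptimePct: number;
  errorRate: number;
  samples: number;
}

const MINUTE_MS = 60000;

// Groups rows by the start of their UTC minute and averages each group.
function aggregateByMinute(rows: CheckRow[]): MinuteBucket[] {
  const groups: Record<string, CheckRow[]> = {};
  for (const row of rows) {
    const start = Math.floor(row.checkedAt.getTime() / MINUTE_MS) * MINUTE_MS;
    const key = String(start);
    if (groups[key]) groups[key].push(row);
    else groups[key] = [row];
  }
  return Object.keys(groups)
    .map(Number)
    .sort((a, b) => a - b) // sort explicitly; key order is not guaranteed to be chronological
    .map((start) => {
      const group = groups[String(start)];
      const okCount = group.filter((r) => r.ok).length;
      const latencySum = group.reduce((sum, r) => sum + r.latencyMs, 0);
      return {
        bucket: new Date(start).toISOString(),
        avgLatencyMs: latencySum / group.length,
        uptimePct: (okCount * 100) / group.length,
        errorRate: (group.length - okCount) / group.length,
        samples: group.length,
      };
    });
}

// Minutes between an incident opening and resolving; null while still open.
function recoveryMinutes(openedAt: Date, resolvedAt: Date | null): number | null {
  return resolvedAt === null ? null : (resolvedAt.getTime() - openedAt.getTime()) / MINUTE_MS;
}
```

A dashboard reading pre-aggregated buckets touches one row per endpoint per minute, which is why query cost tracks the time range requested rather than the number of raw checks stored.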
Testing strategy
The 64-test suite covers the full stack: Vitest unit tests for SLA utility functions, Zustand store action tests, and API route tests that run through supertest against mocked BullMQ and TimescaleDB clients. The goal was to test contract boundaries, not implementation details.
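
As an illustration of the kind of pure SLA utility such unit tests target, the sketch below computes uptime and remaining error budget. The function name `slaStatus`, its parameters, and its return shape are hypothetical and not taken from the project; the point is that the contract (inputs, outputs, rejected values) can be asserted without knowing how the result is computed.

```typescript
// Hypothetical result shape for an SLA check.
interface SlaStatus {
  uptimePct: number;          // observed uptime, 0..100
  budgetRemainingPct: number; // share of allowed failures not yet used, 0..100
  breached: boolean;          // true when observed uptime is below target
}

// Computes SLA status from a target percentage and check counts.
function slaStatus(targetPct: number, totalChecks: number, failedChecks: number): SlaStatus {
  if (targetPct <= 0 || targetPct > 100) {
    throw new RangeError("targetPct must be in (0, 100]");
  }
  if (totalChecks < 0 || failedChecks < 0 || failedChecks > totalChecks) {
    throw new RangeError("check counts are inconsistent");
  }
  // No data yet: treat the SLA as intact rather than dividing by zero.
  if (totalChecks === 0) {
    return { uptimePct: 100, budgetRemainingPct: 100, breached: false };
  }
  const uptimePct = ((totalChecks - failedChecks) * 100) / totalChecks;
  const allowedFailures = (totalChecks * (100 - targetPct)) / 100;
  let budgetRemainingPct: number;
  if (allowedFailures === 0) {
    // A 100% target leaves no budget: any failure exhausts it.
    budgetRemainingPct = failedChecks === 0 ? 100 : 0;
  } else {
    budgetRemainingPct = Math.max(0, (1 - failedChecks / allowedFailures) * 100);
  }
  return { uptimePct, budgetRemainingPct, breached: uptimePct < targetPct };
}
```

A boundary-focused test would assert cases such as "5 failures out of 1,000 against a 99% target leaves half the budget" and "inconsistent counts throw", without referencing any intermediate variable.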