In-house observability
A fully self-hosted monitoring platform built to replace a third-party SaaS — every server, website, database, storage node, and backup job on one dashboard, on infrastructure the company owns.
> The brief
Like most companies, this one — a nationwide e-commerce operation running a hybrid AWS and on-premises environment — relied on a third-party SaaS (Datadog) to monitor its systems. It worked, but it meant a growing monthly bill and a continuous stream of operational telemetry flowing into someone else's cloud. The question was simple: what if we built our own?
The company isn't named here by agreement — it's the environment I run day to day. A deeper walkthrough is available in conversation.
> What was built
- Metrics: Live health of every server (CPU, memory, disk, network), plus load balancers, auto-scaling, CDN, and cloud storage on one screen
- Logging: Fleet-wide logs in one searchable home, so an issue can be traced across systems in seconds
- APM: End-to-end request tracing that pinpoints slow pages and surfaces errors as they happen
- Traffic: Real-time web analytics with geographic and device breakdowns, plus bot detection
- Alerting: Configurable rules watching 24/7, posting to Slack on threshold crossings and recovery
- Security: Continuous automated detection of suspicious activity, with early notification
- Storage: Direct health monitoring of an on-premises storage cluster
- Backups: Every backup run tracked and confirmed, with Slack and email summaries
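To make the alerting behaviour concrete, here is a minimal sketch of a threshold rule that notifies once when a value crosses the limit and once again when it recovers. It is an illustration, not the platform's actual code: the `ThresholdRule` shape, the `evaluate` function, and the rule and metric names are all hypothetical. The only external detail assumed is that a Slack incoming webhook accepts a JSON body with a `text` field; the sketch builds that payload but does not send it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule shape: fires while a metric is above `threshold`."""
    name: str
    metric: str
    threshold: float
    firing: bool = False  # state carried between evaluations


def evaluate(rule: ThresholdRule, value: float) -> Optional[dict]:
    """Return a Slack-style payload on a state change, otherwise None.

    Notifying only on transitions keeps a sustained breach from
    producing a message on every evaluation cycle.
    """
    breached = value > rule.threshold
    if breached and not rule.firing:
        rule.firing = True
        return {"text": f"ALERT {rule.name}: {rule.metric}={value} exceeds {rule.threshold}"}
    if not breached and rule.firing:
        rule.firing = False
        return {"text": f"RECOVERED {rule.name}: {rule.metric}={value} is back within {rule.threshold}"}
    return None


# Four consecutive samples: normal, breach, still breached, recovered.
rule = ThresholdRule(name="high-cpu", metric="cpu_percent", threshold=90.0)
messages = [evaluate(rule, v) for v in (42.0, 95.0, 97.0, 60.0)]
```

Only the second and fourth samples produce a message. A production rule would likely also require the breach to persist for some duration before firing, which this sketch leaves out.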
> Under the hood
Instead of a generic tool bent to fit the stack, the platform is the stack, watching itself — designed around how the environment actually runs.
- Backend: Node.js 22 + Fastify, with Python services for alerting and notifications
- Data: PostgreSQL / TimescaleDB for time-series metrics; ClickHouse for high-volume logs and traces
- Frontend: Dependency-light dashboard in vanilla JS + Apache ECharts, fully self-hosted, no CDNs
- Instrumentation: OpenTelemetry for application tracing
- Integrations: AWS CloudWatch, Slack, the firewall, and the storage cluster
- Reliability: Independent, isolated, self-restarting services, so the monitor is not a single point of failure
- Security-first: Runs entirely inside the private network, VPN-gated, authenticated, with least-privilege read-only integrations
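The "self-restarting" idea can be sketched in a few lines. The source does not say how restarts are actually handled, and in practice a process supervisor or container runtime often does this job, so treat the following as an assumption-laden illustration: `run_with_restart`, its parameters, and the `flaky_collector` task are hypothetical names. It reruns a crashed task with exponential backoff and gives up after a fixed number of restarts.

```python
import time
from typing import Callable


def run_with_restart(
    task: Callable[[], None],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `task`, restarting it after a crash with exponential backoff.

    Returns the number of restarts that were needed. Re-raises the last
    error once `max_restarts` has been used up.
    """
    restarts = 0
    while True:
        try:
            task()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** restarts)
            restarts += 1


# Demo: a collector that crashes twice, then completes.
attempts = {"count": 0}


def flaky_collector() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("collector crashed")


delays: list[float] = []  # record the backoff delays instead of sleeping
restarts = run_with_restart(flaky_collector, sleep=delays.append)
```

Because each service would be wrapped independently, one collector failing and backing off does not stop the others, which is the property the reliability bullet describes.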
> Why it matters
The data stays home. The bill stops growing with every host and log line. The tool fits the stack instead of the other way around. And when something breaks, the team finds out from their own platform — one pane of glass they own outright. It's the same thinking I bring to RedCyfer Systems builds: own your telemetry.