In-House Observability Platform — Case Study

> The brief

Like most companies, this one — a nationwide e-commerce operation running a hybrid AWS and on-premises environment — relied on a third-party SaaS (Datadog) to monitor its systems. It worked, but it meant a growing monthly bill and a continuous stream of operational telemetry flowing into someone else's cloud. The question was simple: what if we built our own?

The company isn't named here by agreement — it's the environment I run day to day. A deeper walkthrough is available in conversation.

> What was built

Metrics: Live health of every server (CPU, memory, disk, network), plus load balancers, auto-scaling, CDN, and cloud storage on one screen
Logging: Fleet-wide logs in one searchable home, so an issue can be traced across systems in seconds
APM: End-to-end request tracing that pinpoints slow pages and surfaces errors as they happen
Traffic: Real-time web analytics with geographic and device breakdowns, plus bot detection
Alerting: Configurable rules watching 24/7, posting to Slack on threshold crossings and recovery
Security: Continuous automated detection of suspicious activity, with early notification
Storage: Direct health monitoring of an on-premises storage cluster
Backups: Every backup run tracked and confirmed, with Slack and email summaries
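
To make the alerting behavior concrete, here is a minimal Python sketch of a rule that notifies once when a threshold is crossed and once on recovery, rather than on every sample. It is an illustration under stated assumptions, not the platform's actual code: the `ThresholdRule` class, the `cpu_percent` metric name, the 90% threshold, and the sample values are all hypothetical, and the messages are printed where a real service would post them to Slack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule shape: alert when a metric rises above `threshold`."""
    name: str
    threshold: float
    firing: bool = False  # remembered state, so repeats are not re-sent

    def evaluate(self, value: float) -> Optional[str]:
        """Return a message only when the state changes, otherwise None."""
        if value > self.threshold and not self.firing:
            self.firing = True
            return f"ALERT {self.name}: {value} > {self.threshold}"
        if value <= self.threshold and self.firing:
            self.firing = False
            return f"RECOVERED {self.name}: {value} <= {self.threshold}"
        return None


def run(rule: ThresholdRule, samples: list[float]) -> list[str]:
    """Feed samples through the rule and collect the messages it would post."""
    messages = []
    for value in samples:
        message = rule.evaluate(value)
        if message is not None:
            messages.append(message)
    return messages


# Invented sample data: CPU climbs above the threshold, stays there, then drops.
cpu_rule = ThresholdRule(name="cpu_percent", threshold=90.0)
notifications = run(cpu_rule, [42.0, 95.0, 97.0, 60.0, 55.0])
for line in notifications:
    print(line)
```

Keeping the `firing` flag on the rule is what turns five samples into two notifications: the second high reading and the second low reading produce nothing, so the channel only sees the crossing and the recovery.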

> Under the hood

Instead of a generic tool bent to fit the stack, the platform is the stack, watching itself — designed around how the environment actually runs.

Backend: Node.js 22 + Fastify, with Python services for alerting and notifications
Data: PostgreSQL / TimescaleDB for time-series metrics; ClickHouse for high-volume logs and traces
Frontend: Dependency-light dashboard in vanilla JS + Apache ECharts, fully self-hosted, no CDNs
Instrumentation: OpenTelemetry for application tracing
Integrations: AWS CloudWatch, Slack, the firewall, and the storage cluster
Reliability: Independent, isolated, self-restarting services, so the monitor is not a single point of failure
Security-first: Runs entirely inside the private network, VPN-gated, authenticated, with least-privilege read-only integrations
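
Dashboards over time-series metrics usually chart rolled-up values rather than every raw sample. In TimescaleDB that roll-up is typically written in SQL with its `time_bucket` function; the sketch below mimics the same idea in plain Python so the mechanics are visible. It is an assumption-laden illustration, not the platform's actual query: the `bucket_average` helper, the 60-second bucket width, and the sample readings are all invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_average(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    grouped = defaultdict(list)
    for ts, value in samples:
        epoch = int(ts.timestamp())
        # Round the timestamp down to the start of its bucket.
        bucket_start = epoch - (epoch % bucket_seconds)
        grouped[bucket_start].append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): sum(vals) / len(vals)
        for start, vals in sorted(grouped.items())
    }


# Invented CPU readings: two fall in the 12:00 minute, one in the 12:01 minute.
readings = [
    (datetime(2024, 1, 1, 12, 0, 10, tzinfo=timezone.utc), 40.0),
    (datetime(2024, 1, 1, 12, 0, 50, tzinfo=timezone.utc), 60.0),
    (datetime(2024, 1, 1, 12, 1, 5, tzinfo=timezone.utc), 80.0),
]
per_minute = bucket_average(readings, bucket_seconds=60)
for bucket, average in per_minute.items():
    print(bucket.isoformat(), average)
```

Rounding each timestamp down to its bucket start is the whole trick; in a database the same grouping runs next to the data, which is the reason to reach for a time-series extension instead of aggregating in application code.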

> Why it matters

The data stays home. The bill stops growing with every host and log line. The tool fits the stack instead of the other way around. And when something breaks, the team finds out from their own platform — one pane of glass they own outright. It's the same thinking I bring to RedCyfer Systems builds: own your telemetry.

