Taming the observability maze


Introduction

Last week we wrote about how, during Black Friday week, tier-1 global retailers hit bi(OS) on Thursday with a load equivalent to two Black Fridays. While we are proud of that achievement, we don’t take our customers’ reliability for granted. We do, however, take a significantly different approach. Read on to learn why observability is a bi(OS) feature and not a product/company/magic-quadrant.

Build your own {Data|ML|DevOps|SecOps …} observability

Our SRE teams have worked with the best observability tools available – Splunk, Google Stackdriver, Prometheus, Grafana, ELK, Datadog, etc. When we looked for the best observability solution for ourselves, it felt like the market provided feature-rich, impact-poor answers. So we asked: why can’t bi(OS) observe itself? We perfected capabilities that deliver impact without breaking the bank.

Goals

When we embarked on this journey, we set three goals for ourselves:

  • Systems, data, and ML observability should follow the same learning curve – we are a small, agile team and can’t afford to die by a thousand cuts…er, tools.
  • Both push and pull modes of ingestion should be supported for various data sources in CSV and JSON formats.
  • The solution should be SRE-friendly and multi-cloud ready, given bi(OS) runs on GCP, AWS, and Azure.
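
To make the second goal concrete, here is a minimal Python sketch of how CSV and JSON payloads, arriving by either push or pull, could be normalized into one record shape. It is an illustration only, not the bi(OS) implementation: the names parse_payload and pull, the fmt argument, and the choice to stringify JSON values are all assumptions made for this example.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]


def parse_payload(payload: str, fmt: str) -> List[Record]:
    """Normalize a CSV or JSON payload into a list of flat records.

    Hypothetical helper: the record shape is an assumption for illustration.
    """
    if fmt == "csv":
        # The first row is treated as the header.
        return [dict(row) for row in csv.DictReader(io.StringIO(payload))]
    if fmt == "json":
        data = json.loads(payload)
        # Accept either a single object or a list of objects.
        rows = data if isinstance(data, list) else [data]
        # Stringify values so both formats yield the same record shape.
        return [{key: str(value) for key, value in row.items()} for row in rows]
    raise ValueError(f"unsupported format: {fmt}")


def pull(fetch: Callable[[], str], fmt: str) -> List[Record]:
    """Pull mode: call a source-specific fetch function, then parse.

    Push mode would call parse_payload directly on the received body.
    """
    return parse_payload(fetch(), fmt)
```

In push mode a receiver would pass the request body straight to parse_payload; in pull mode a scheduler would call pull with a function that reads from the source. Either way, downstream code sees the same list of records.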

The Solution 

We used Fluent Bit as an SRE-friendly way to record and forward metrics from the various bi(OS) components. We also added agentless instrumentation by instrumenting the apps and APIs directly. All system metrics (e.g., CPU, memory, DB), application logs (e.g., errors, info, warnings), and data quality metrics were stored within bi(OS). The results were enlightening.

The key highlights of the solution were:

  • The entire solution was implemented in a day for our primary cluster responsible for serving Tier-1 retailers.
  • We used the same solution for operational and systems metrics, while our customers used the same interface to keep tabs on their own business.
  • The solution didn’t add more than 5% overhead on the existing cluster.

Conclusion

At Isima, observability isn’t a product/feature/tool that is bolted on; it’s built in. True to our Telco-grade promise to our customers, we don’t expect them to buy or integrate separate solutions to observe bi(OS). bi(OS) has already been battle-tested at 6x regular load ahead of Black Friday, with observability built in. Experience features such as Top-N, Cyclical Comparison, X and Y split-by, and Derived metrics with a swipe.