Observability: What is it? – Not An Expert But

Observability, what is it? And why do people keep talking about it?

So, what is observability? Well, that’s a great question! Let’s take a look at what people (*cough* typically platform vendors) are saying!

“Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics” – Red Hat

“Observability is the ability to measure a system’s current state based on the data it generates, recorded as logs, metrics, and traces” — Dynatrace

“Observability is the ability to collect data about programs’ execution, modules’ internal states, and the communication among components” — Wikipedia

I could go on, and on, and on, and on…

What am I trying to say? That these are wrong? No, not at all. The fact is these are all correct and say pretty much exactly the same thing. I would prefer to simplify them, while at the same time overcomplicating things a tad more:

“Observability is the practice of being able to see the current state of a system and understand why it is in that state” – Duncan

It gives the observer the ability to monitor, debug and improve systems.

So why do you need Observability?

The answer, to me, is threefold:

1. Understanding the performance/availability/behaviour of your service/environment
2. Testing/verifying new releases in the development process
3. Understanding user behaviour

Points 1 & 2 are the most commonly cited “why” of Observability. Of course, monitoring your system, being alerted to problems and being able to debug them are the core reasons for Observability. The same applies to the development pipeline. Point 3, however, is used in practice but, I believe, rarely explained well. I plan an in-depth future article on this, because for me it is one of the most exciting parts of observability. The first two meet critical needs; the third is pure golden value.

What does this give you?

This gives you the ability to monitor and improve some crucial metrics for your application(s). These are often grouped under the label MTT(X), where the X stands in for whichever of several “mean time to” measures is meant. In brief, here are some of the more common ones:

MTTD – Mean Time To Detection/Discovery – This is your ability to detect issues and how quickly you do so
MTTI – Mean Time to Investigate – This is how long it takes you or your teams to investigate the issues and find a resolution
MTTR – Mean Time to Resolution/Resolve – This is how long it takes to resolve the issue – scale up resources, restart a service, fix a bug etc.
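As a rough illustration of how these averages might be calculated, here is a minimal Python sketch. The Incident record and its four timestamp fields are hypothetical names invented for this example. It also assumes each metric covers one phase of an incident (start to detection, detection to diagnosis, diagnosis to fix). Teams draw these boundaries differently, and some measure MTTR from the moment the incident began, so treat this as one possible reading rather than the definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for this sketch.
    started: datetime    # when the problem actually began
    detected: datetime   # when an alert fired or someone noticed
    diagnosed: datetime  # when the cause and a fix were identified
    resolved: datetime   # when the fix was in place


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtt_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTI and MTTR (in minutes) over a set of incidents."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTI": mean_minutes([i.diagnosed - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.diagnosed for i in incidents]),
    }


incidents = [
    Incident(
        started=datetime(2024, 1, 1, 10, 0),
        detected=datetime(2024, 1, 1, 10, 10),
        diagnosed=datetime(2024, 1, 1, 10, 40),
        resolved=datetime(2024, 1, 1, 11, 0),
    ),
    Incident(
        started=datetime(2024, 1, 2, 9, 0),
        detected=datetime(2024, 1, 2, 9, 20),
        diagnosed=datetime(2024, 1, 2, 9, 30),
        resolved=datetime(2024, 1, 2, 10, 10),
    ),
]

print(mtt_metrics(incidents))
# {'MTTD': 15.0, 'MTTI': 20.0, 'MTTR': 30.0}
```

Tracking these numbers over time, rather than looking at a single incident, is what shows whether your Observability is actually improving.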

The Core Pillars

Now that we know the why, let me take you back to a time before modern platforms, when vendors fought over smaller spaces such as log analytics, performance monitoring etc. In this time before (modern) Observability, the battles were small and niche. However, teams tasked with monitoring would need a number of these niche products to get the relevant views into their systems. To monitor, they needed many a system.

This was obviously quite clunky, and it triggered the vendors to adapt and sell the idea of a “Single pane of glass”: a platform which would consume all the data of the niche products and alert/provide insights off the back of it.

The initial idea comes from the often-referenced 3 pillars of observability diagram. In early 2017 Peter Bourgon attended a distributed tracing summit and came up with this diagram.

Thus, the 3 pillars were born:

1. Metrics (The oldest of the pillars)
2. Logs
3. Traces

Metrics

The oldest of the pillars. Metrics provide insights into systems via numerical data such as CPU Usage, Memory Usage, API Latency, Requests and Error rates.
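To make that a little more concrete, here is a minimal Python sketch of turning raw request records into two of the metrics mentioned above: an error rate and a latency percentile. The RequestSample type, the sample values, and the choice to count only 5xx responses as errors are assumptions made for illustration. A real metrics system would usually aggregate these continuously rather than over a list held in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    # Hypothetical shape of one recorded API request.
    latency_ms: float
    status: int  # HTTP status code


def error_rate(samples: list[RequestSample]) -> float:
    """Fraction of requests that returned a 5xx status."""
    errors = sum(1 for s in samples if s.status >= 500)
    return errors / len(samples)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of all values at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


samples = [
    RequestSample(12.0, 200),
    RequestSample(15.0, 200),
    RequestSample(11.0, 200),
    RequestSample(250.0, 500),
    RequestSample(14.0, 200),
]

latencies = [s.latency_ms for s in samples]
print(error_rate(samples))        # 0.2
print(percentile(latencies, 50))  # 14.0
print(percentile(latencies, 95))  # 250.0
```

Note how the median looks healthy while the 95th percentile exposes the slow, failing request. This is why percentiles are often watched alongside (or instead of) averages.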

Logs

Records of system events / errors / exceptions that have occurred in an application or in your infrastructure. They provide context in the form of structured or unstructured data.
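The difference between structured and unstructured logs is easiest to see side by side. The sketch below uses Python's standard logging and json modules. The "checkout" logger name, the structured() helper and the event fields are all hypothetical, chosen only for this example.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")  # hypothetical service name


def structured(event: str, **fields) -> str:
    """Serialise an event and its context as a single JSON log line."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


# Unstructured: easy for a human to read, but must be parsed with patterns.
unstructured_line = "Payment failed for order 1234 after 3 retries"
log.info(unstructured_line)

# Structured: the same information as key/value pairs that can be queried.
structured_line = structured("payment_failed", order_id=1234, retries=3)
log.info(structured_line)

# A consumer can read fields straight back out of the structured line.
print(json.loads(structured_line)["order_id"])  # 1234
```

Both lines say the same thing, but only the second lets a log platform filter on order_id or count retries without guessing at the sentence layout.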

Traces

Traces provide representations of requests within or to a system. They are typically described as providing insights into distributed systems; however, they also have a sound use case in monolithic applications.
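A trace is usually modelled as a set of spans that share a trace ID, each pointing at its parent. The following Python sketch shows that idea with a deliberately simplified, hypothetical Span type. It is not the API of any real tracing library, and the span names and timings are made up. The spans here could just as easily be function calls inside a monolith as calls between services.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span; real tracing formats carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render(spans: list[Span], parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Return the spans as an indented call tree, one line per span."""
    lines = []
    children = sorted(
        (s for s in spans if s.parent_id == parent_id),
        key=lambda s: s.start_ms,
    )
    for span in children:
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


# One request, represented as spans that share a trace_id.
trace = [
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "b", "a", "auth.verify_token", 5, 20),
    Span("t1", "c", "a", "db.load_basket", 25, 95),
    Span("t1", "d", "c", "db.query", 30, 90),
]

print("\n".join(render(trace)))
# GET /checkout (120 ms)
#   auth.verify_token (15 ms)
#   db.load_basket (70 ms)
#     db.query (60 ms)
```

Reading the tree top to bottom shows where the request spent its time. In this made-up example, most of the 120 ms sits under the database query.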

These are the foundations of Observability. They fundamentally allow for it to work.

There are 3 areas from which you can collect the data needed for a fully observable view of your solution/platform:

1. Infrastructure (Servers, Containers (Docker, Kubernetes), Serverless etc)
2. Applications
3. Front end (Real User Monitoring or RUM for short)

It is rare that I have had the fortune to see a company implement all 3 of these areas in one system; however, it is possible and I have seen it.

Observability at its core is monitoring (and alerting), analysis of your data and root cause analysis.

How can you drive your company’s or team’s Observability?

This is a hard question, and I say this honestly. Observability is driven by need, and everyone’s needs are different, which leads to many different answers. However, one could try to describe a path that could be followed, which, much like the pirate’s code from a fantastic film, should be treated as guidelines and not necessarily a fixed set of rules.

The general guideline from me is to move from basic monitoring to proactive Observability. That is, don’t wait for something to happen and fix it afterwards; find it first and fix it.

What I hope to achieve with this blog is to provide insights both functional and theoretical into concepts and best practices. I want to give you the tools to ensure your systems are reliable and that you can go from little or no Observability to being able to quickly detect, identify and resolve problems as they happen while also having the ability to proactively prevent incidents before they impact your user or customer base.