What is Observability?

Ben Schmidt
Author

You might hear your engineering team throw around the word observability during standups or post-mortem meetings. It often gets lumped in with general IT jargon or confused with basic monitoring.

But for a founder trying to build a product that lasts, understanding this specific term is vital.

At its simplest level, observability is a measure of how well you can understand the internal state of a system just by looking at its external outputs. The term originates in control theory rather than computer science, and it poses a straightforward engineering question: if you cannot see inside the machine, can you figure out exactly what is happening inside it based on the data it emits?

In the context of a software startup, this translates to the ability to ask questions of your system and get answers without having to ship new code to find them.

It is about moving from guessing why a feature is broken to knowing why it is broken based on the evidence available.

The Core Definition

Think about a mechanical watch. If the hands stop moving, that is an external output. You know it is broken. However, unless you can infer which gear is stuck or which spring is loose just by looking at the hands, the system has low observability. You have to open it up to diagnose it.

Software works the same way. When a user clicks a button and nothing happens, that is the output. A system with high observability provides enough telemetry data—logs, metrics, and traces—that a developer can look at a dashboard and say exactly which database query failed or which microservice timed out.

They do not need to reproduce the bug locally. They do not need to guess. The system told them.

This matters because complexity in software grows faster than your team does. In the early days, you have a monolithic application. One server. One database. If it breaks, you look at the one server. But as you scale, you might split that into ten different services. You might introduce third-party APIs. You might have serverless functions running in the cloud.

Suddenly, a failure in the payment processing flow might actually be caused by a latency spike in an authentication service three layers deep. Observability is the quality of your infrastructure that allows you to trace that line of causality.

Observability vs. Monitoring

This is the most common point of confusion. Founders often ask why they need observability tools if they already have monitoring alerts set up.

The difference lies in the nature of the failure.

Monitoring is for known unknowns. It is reactive. You set up a monitor because you know something might go wrong in a specific way. You ask the system: “Is the server CPU usage above 90 percent?” or “Is the website down?” The monitor answers yes or no. It is a dashboard of red and green lights.

Observability is for unknown unknowns. It is for the problems you never imagined could happen. It allows you to explore the data to find the answer to a question you did not know you needed to ask.

Monitoring tells you that your checkout page is slow. Observability allows you to query the data to discover that the checkout page is slow only for iOS users in Germany who have more than five items in their cart because of a specific database lock contention.

Monitoring is the smoke alarm. Observability is the forensics team that figures out how the fire started.

If you only have monitoring, your team will spend hours or days debugging complex issues. If you have observability, that time is often reduced to minutes.
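
The checkout example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (`platform`, `country`, `cart_items`, `latency_ms`), the sample values, and both function names are hypothetical, and a real system would run this kind of query in an observability backend rather than over an in-memory list. The first function is the monitor: one predefined question with a yes or no answer. The second slices the same events by several dimensions to find which segment is actually slow.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical checkout events; each carries context, not just a latency number.
events = [
    {"platform": "ios", "country": "DE", "cart_items": 7, "latency_ms": 4800},
    {"platform": "ios", "country": "DE", "cart_items": 6, "latency_ms": 5200},
    {"platform": "ios", "country": "DE", "cart_items": 2, "latency_ms": 210},
    {"platform": "android", "country": "DE", "cart_items": 8, "latency_ms": 190},
    {"platform": "ios", "country": "US", "cart_items": 9, "latency_ms": 230},
    {"platform": "web", "country": "FR", "cart_items": 1, "latency_ms": 170},
]


def monitor_alert(events, threshold_ms=1000):
    """Monitoring: a fixed question decided in advance. Is checkout slow on average?"""
    return mean(e["latency_ms"] for e in events) > threshold_ms


def slowest_segment(events):
    """Observability-style query: group by several dimensions, return the slowest group."""
    groups = defaultdict(list)
    for e in events:
        key = (e["platform"], e["country"], e["cart_items"] > 5)
        groups[key].append(e["latency_ms"])
    key, latencies = max(groups.items(), key=lambda kv: mean(kv[1]))
    return key, mean(latencies)


alert = monitor_alert(events)
segment, avg_ms = slowest_segment(events)
print("alert fired:", alert)
print("slowest segment (platform, country, more than 5 items):", segment, avg_ms)
```

The monitor can only say that something is wrong. The grouped query is possible because every event was recorded with its context attached, which is the part that has to be in place before the incident.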

The Three Pillars

[Figure: Reduce the time between failure and fix.]

To achieve this state of transparency, your technical team needs to instrument your code to produce three specific types of data. These are often called the three pillars.

Logs

These are discrete records of events. A log says: “Payment processed at 10:00 AM.” Logs are high fidelity and easy to generate. However, they can be overwhelming in volume and expensive to store if you are not careful.
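
As a rough sketch of what such a record can look like in practice, the following uses Python's standard `logging` module to emit one JSON object per event. The logger name `payments_demo` and the fields `order_id` and `amount_cents` are assumptions made for illustration, not part of any particular product.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed through `extra=` become attributes on the record.
            "order_id": getattr(record, "order_id", None),
            "amount_cents": getattr(record, "amount_cents", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("payments_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for a file or a log shipper
log = build_logger(stream)
log.info("payment processed", extra={"order_id": "A-1001", "amount_cents": 4999})
print(stream.getvalue().strip())
```

Structured fields are what make logs searchable later: a query for one `order_id` is far cheaper than reading free-text lines by eye.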

Metrics

These are numerical representations of data measured over time. Examples include CPU usage, memory consumption, request latency, and number of active users. Metrics are great for spotting trends. They tell you that usage spiked at noon, but they lack the context to tell you exactly what those users were doing.
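
A minimal sketch of a metric, assuming a hypothetical in-memory `LatencyMetric` class that buckets request latency by minute. Real deployments would normally hand this job to a metrics library and a time-series database; the point here is only to show how individual measurements collapse into a few numbers per time window.

```python
from collections import defaultdict


class LatencyMetric:
    """Hypothetical in-memory metric: request latency aggregated per minute."""

    def __init__(self) -> None:
        self._samples = defaultdict(list)

    def observe(self, timestamp_s: int, latency_ms: float) -> None:
        # Bucket each measurement by the minute in which it occurred.
        self._samples[timestamp_s // 60].append(latency_ms)

    def summary(self) -> dict:
        # Reduce each bucket to a few numbers; the individual requests are gone.
        return {
            minute: {
                "count": len(values),
                "avg_ms": sum(values) / len(values),
                "max_ms": max(values),
            }
            for minute, values in sorted(self._samples.items())
        }


metric = LatencyMetric()
for ts, latency in [(0, 40), (10, 60), (65, 50), (70, 950), (110, 500)]:
    metric.observe(ts, latency)

summary = metric.summary()
for minute, stats in summary.items():
    print(minute, stats)
```

The summary shows that the second minute was slower than the first, but nothing in it says which requests were slow or why. That missing context is the limitation described above.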

Traces

This is the most complex but often the most valuable pillar for modern startups. A trace follows a single request as it hops through different parts of your system. If a user logs in, views a product, adds it to the cart, and checks out, tracing ties the backend operations behind each of those steps together. It shows you that the login took 50ms, the product view took 20ms, but the add-to-cart took 5000ms.

Without traces, you are just looking at isolated piles of data. Traces provide the narrative.
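
The idea can be sketched with a hand-rolled data structure, using the timings from the example above. The `Span` and `Trace` classes are hypothetical and exist only for illustration; production systems typically rely on a tracing library such as OpenTelemetry, which also propagates the trace ID across network calls.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str      # shared by every span that belongs to the same trace
    name: str
    duration_ms: int


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, name: str, duration_ms: int) -> None:
        # Every operation is stamped with the same trace ID so it can be joined later.
        self.spans.append(Span(self.trace_id, name, duration_ms))

    def slowest(self) -> Span:
        return max(self.spans, key=lambda span: span.duration_ms)

    def total_ms(self) -> int:
        return sum(span.duration_ms for span in self.spans)


trace = Trace()
trace.record("login", 50)
trace.record("product_view", 20)
trace.record("add_to_cart", 5000)

worst = trace.slowest()
print(f"trace {trace.trace_id}: total {trace.total_ms()} ms")
print(f"slowest step: {worst.name} ({worst.duration_ms} ms)")
```

The shared `trace_id` is the design choice that matters: it is what lets separate services write their own records and still have them stitched into one timeline afterwards.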

When to Prioritize This

There is a cost to observability. There is a financial cost in paying for vendors who store and visualize this data. There is also an engineering cost in setting it up and maintaining it.

In the very early stages, when you are building an MVP and have five users, you probably do not need a complex observability suite. You can look at the server logs manually. You can talk to the users directly.

However, the need for observability creeps up on you silently.

It usually happens when you hire your first engineers who did not write the original code. They do not have the mental map of the system that the founders do. When something breaks, they are flying blind.

It also happens when you start decoupling your architecture. As soon as network calls are involved between different parts of your application, you introduce failures that are hard to replicate.

If you wait until your system is on fire to think about observability, it is too late. You cannot instrument a system after it has crashed. You need the data to be there beforehand.

Questions for Your Team

As a founder, you do not need to know how to configure the tracing agent. But you do need to ask the right questions to ensure your business is resilient.

When the site goes down, does the team know why immediately, or do they spend hours guessing?

Do we know what the experience is like for a specific high-value customer, or do we only look at averages?

If we deployed a bad update, would we know it is bad before the users start complaining on Twitter?

Observability shifts the culture of a startup. It moves the team away from blame and toward curiosity. Instead of asking “Who broke this?” the data allows them to ask “Why did the system react this way?”

It is a fundamental shift in how you operate a digital business. It turns the lights on in a dark room.