=============
== Stefano ==
=============
A personal blog on any topic, with a strong focus on Observability and Software Engineering

Making software systems observable

observability

In the era of “always on” services and cloud platforms, observability practices have become a central enabler of running systems reliably. But what is observability? For starters, observability is not something strictly related to software: in fact, the concept of observability can be defined in much broader terms.

Observability is a property of a system. A system is said to be observable if the internal state of the system can be inferred by the output signals generated by the system.

So the question you should be asking yourself is not whether you are “implementing observability right”, but rather “is this system observable?”.

With the above broad definition in mind, let’s specialise it for the software engineering field. The output signals of a software system traditionally are:

  • Metrics
  • Logs
  • Traces

These have often been referred to as the three pillars of observability. Each of these signals serves a different use case in the domain of observability. Broadly speaking, metrics can be used to detect when something goes wrong, traces help us understand what went wrong, and logs help us understand why. The ability to interoperate them is crucial for these signals to really make systems observable.

The use case for metrics

A metric is an aggregated measurement taken at a given point in time.

The fact that metrics are aggregated measurements is key to keeping the volume of data produced manageable and predictable. When measuring endpoint latency, only aggregations over the selected measuring interval should be reported: classic aggregations are the number of times the endpoint was called, the total time taken to serve all requests, the maximum time taken to serve a request in the interval, and other aggregate statistics like duration percentiles. This is important because it makes the amount of metric data produced independent of the amount of traffic served by the system. Keeping this under control enables affordable long retention times and more responsive queries, a prerequisite for using metrics for alerting purposes.
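As an illustration, here is a minimal Python sketch of interval aggregation. Note that it keeps raw samples only to compute a percentile; real metric clients (e.g. Prometheus client libraries) instead count observations into predefined histogram buckets precisely to avoid storing per-request data.

```python
import math

class LatencyAggregator:
    """Aggregates raw request durations into interval-level statistics,
    so the reported data volume stays constant regardless of traffic."""

    def __init__(self):
        self._reset()

    def _reset(self):
        self.count = 0        # number of requests in the interval
        self.total = 0.0      # total time spent serving requests
        self.maximum = 0.0    # slowest request seen in the interval
        self._samples = []    # kept here only to compute the percentile

    def observe(self, duration_seconds):
        self.count += 1
        self.total += duration_seconds
        self.maximum = max(self.maximum, duration_seconds)
        self._samples.append(duration_seconds)

    def snapshot(self):
        """Report the aggregates for the interval, then reset."""
        ordered = sorted(self._samples)
        p99 = ordered[math.ceil(0.99 * len(ordered)) - 1] if ordered else 0.0
        result = {"count": self.count, "sum": self.total,
                  "max": self.maximum, "p99": p99}
        self._reset()  # start a fresh measuring interval
        return result

agg = LatencyAggregator()
for duration in [0.1, 0.2, 0.3]:
    agg.observe(duration)
print(agg.snapshot())  # four aggregates, however many requests arrived
```

However many requests arrive during the interval, the snapshot is always four numbers: that is the property that keeps metric volume independent of traffic.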

Cardinality explosion

Cardinality is a concept closely related to dimensional metrics: formally, the cardinality of a metric is the number of possible combinations of different values for all the dimensions of the metric. Take for example the metric http_server_request_seconds_count. If this were recorded for two services A and B, we would have a metric with cardinality 2:

http_server_request_seconds_count{service_name="A"} 5
http_server_request_seconds_count{service_name="B"} 5

If we additionally measure this for 3 endpoints which both services serve, say endpoints C, D and E, then we’d have a cardinality of 6:

http_server_request_seconds_count{service_name="A", endpoint="C"} 1
http_server_request_seconds_count{service_name="A", endpoint="D"} 2
http_server_request_seconds_count{service_name="A", endpoint="E"} 2
http_server_request_seconds_count{service_name="B", endpoint="C"} 1
http_server_request_seconds_count{service_name="B", endpoint="D"} 2
http_server_request_seconds_count{service_name="B", endpoint="E"} 2

This is still acceptable because these dimensions have a finite number of values they can assume, since a service exposes a finite number of endpoints. If we add a dimension with an unbounded number of possible values, like a requestId or a userId to track down a specific user, then the cardinality of the metric will explode! A high cardinality for a metric results in a high number of timeseries for the given metric, which means a higher number of data samples to be stored, going against the properties that make metrics affordable.
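The arithmetic behind the explosion is simply the product of distinct values per dimension, as this small sketch (with a hypothetical user_id dimension) shows:

```python
from math import prod

# Dimension values for the example metric above.
dimensions = {
    "service_name": {"A", "B"},
    "endpoint": {"C", "D", "E"},
}

# Cardinality = product of the number of distinct values per dimension.
print(prod(len(values) for values in dimensions.values()))  # 6

# Adding an unbounded dimension such as user_id makes this explode:
dimensions["user_id"] = {f"user-{i}" for i in range(10_000)}
print(prod(len(values) for values in dimensions.values()))  # 60000
```

Every one of those combinations is a separate timeseries the backend has to index and store, which is why unbounded dimensions belong in traces or logs, not metrics.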

The use case for traces

A trace is a structured recording of an ordered series of spans for a given interaction with the system. A span is a representation of a unit of work.

The conceptual difference with metrics is that a trace represents a single specific interaction with the system. We can see how traces are complementary to metrics, giving more fine-grained information about our system. This however also means that the amount of data generated for tracing is directly proportional to the amount of traffic served by our systems: for this reason, sampling strategies are often put in place to keep the amount of data under control while not losing valuable information.

The information a trace gives us is the operations (spans) carried out to fulfil the request, the time taken to fulfil each operation, their outcome and extra custom metadata. Spans are linked together by a common identifier trace_id, and each span also has a parent_id reference to its parent span if present (the root span doesn’t have any parent). This context is passed down the system so that each operation can decorate the reported information correctly.

Given the nature of traces and the way they are stored and queried, the metadata attached to a trace can have high cardinality attributes.

Interoperating metrics and traces

Right now we have metrics that give us an aggregated view of the system, and traces that give us visibility into individual requests. With exemplars our system can enrich the collected metric with one trace_id that is an “example” request that happened during the measurement interval for the given dimensions. This enables the possibility of investigating systems at an aggregated level with metrics and jumping into a finer level of detail while still retaining the same context.
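A sketch of the exemplar idea, assuming a plain counter per label combination: alongside the aggregated value we keep a single trace_id chosen uniformly from the interval’s requests (real implementations, such as Prometheus exemplar support, attach this to the exposed metric sample).

```python
import random

class CounterWithExemplar:
    """An aggregated counter that also remembers one example trace_id
    from the requests observed during the interval."""

    def __init__(self):
        self.value = 0
        self.exemplar_trace_id = None

    def increment(self, trace_id):
        self.value += 1
        # Reservoir sampling with a reservoir of one: every observed
        # trace has an equal 1/n chance of being the kept exemplar.
        if random.randrange(self.value) == 0:
            self.exemplar_trace_id = trace_id

counter = CounterWithExemplar()
for trace_id in ["trace-1", "trace-2", "trace-3"]:
    counter.increment(trace_id)

print(counter.value)              # 3: the aggregate stays an aggregate
print(counter.exemplar_trace_id)  # one of the three observed trace ids
```

The aggregate stays cheap, but the single attached trace_id gives a dashboard a direct jump from "latency spiked here" to one concrete trace from that spike.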

The use case for logs

A log is a timestamped record, either structured or unstructured.

Logs are probably the signal that has been around the longest and the one developers are most familiar with. They offer the finest-grained level of information, and the highest degree of customisability. The current recommendation is to have structured logs, i.e. logs that follow a consistent and machine-parseable format. A popular format for this is JSON, but logfmt is also widely used: choosing between them may depend on which one your log aggregation and log analysis tools work better with.

Contextual propagation with structured logs

Having a defined structure for your logs allows you to enrich them with contextual information: for example, a field trace_id could be used to tag all logs produced by the system for a given interaction, making it possible to jump from a given trace to the even finer-grained log information for the same interaction. To link them with metrics, a few key recorded dimensions, like endpoint, should also be added to the structured log.
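With Python’s standard logging module, this can be sketched as a custom JSON formatter; the trace_id and endpoint fields here are the hypothetical contextual keys discussed above, not a fixed standard:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emits each record as one JSON object carrying trace_id and
    endpoint, so a log line can be joined with traces and metrics."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Contextual fields injected via logging's `extra` mechanism.
            "trace_id": getattr(record, "trace_id", None),
            "endpoint": getattr(record, "endpoint", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same trace_id used by the tracing context tags the log line.
logger.info("order saved",
            extra={"trace_id": "4bf92f3577b34da6", "endpoint": "/orders"})
```

Querying the log store for that trace_id then returns every line the interaction produced, across all services that propagated the context.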

Summary

Hopefully it’s now clear what the purpose of each of the observability signals in software systems is, and how key it is to share context between them to really make your system observable.

  • Use metrics to describe the high-level performance of your system.
  • Use tracing to break down units of work and investigate individual failures. Link traces to metrics with exemplars.
  • Use structured logs to record the finest-grained information, and connect them with metrics and traces through contextual propagation.