
Decoding Metrics: The Universal Language of Observability

Last edited: March 6, 2025

Esperanto. Developed in 1887, it was envisioned as a universal language: everything would be translated into it once, rather than into each individual language. Its creator, L. L. Zamenhof, had high hopes for his creation.

“Were there but an international language, all translations would be made into it alone ... and all nations would be united in a common brotherhood.”

Nearly 140 years later, the world's population has passed 8 billion, and the Esperantists, as they are known, have grown to … about 100,000. Certainly not the unifying language it was hoped to be.

Unless you count logs, metrics, and traces as languages, I am neither a linguist nor a polyglot. Then again, considering how misunderstood metrics are in the world, maybe I am a polyglot after all.

Let’s talk about metrics, what they are, and why they are so damn complicated to understand.

In the world of observability, metrics have become our Esperanto, a universal language that promises clarity but often leads to confusion. As we'll explore, the complexity of metrics poses significant challenges for organizations striving for effective observability.

Organizations struggling with metric management often face increased costs, slower incident resolution times, and difficulty in making data-driven decisions. These challenges can directly impact competitiveness and bottom-line results in today's fast-paced digital environment.

First, though, let’s take a step back. What are metrics, anyway?

What Are Metrics?

A metric is a measurement from a point in time. Sounds simple, right? Not exactly. While the measurement is numeric, that’s where the simplicity ends. Counters go up until they are reset, unless you are using the OpenTelemetry Metrics format. OTLP Metrics uses a monotonic Sum in place of a counter and a non-monotonic Sum in place of a gauge.


In addition to counters that are truly counters and counters that are really gauges (think of your speedometer), we’ve also got timers, gauge deltas, and sets from StatsD; gauges, histograms, and summaries from Prometheus; and OpenTelemetry Metrics bringing exponentialhistograms (yes, all one word!) to the table.
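To make those type distinctions concrete, here is a minimal sketch using the OpenTelemetry Python API (the wire examples later in this post come from the Java agent; Python is used here purely for illustration, it assumes the opentelemetry-api package is installed, and the instrument names such as app.requests and vehicle.speed are invented). Without a configured SDK the calls are no-ops, but they show how each instrument type declares its intent:

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("decoding-metrics-demo")

# Counter: a monotonic sum that only goes up until the process restarts.
requests = meter.create_counter("app.requests", unit="1", description="Requests served")
requests.add(1)

# UpDownCounter: a non-monotonic sum that can rise and fall, like a queue depth.
queue_depth = meter.create_up_down_counter("app.queue_depth", unit="1")
queue_depth.add(5)
queue_depth.add(-2)

# Histogram: a distribution of individual measurements, such as request sizes.
request_size = meter.create_histogram("kafka.consumer.request_size", unit="By")
request_size.record(164.6153846153846)

# Observable gauge: a sampled reading, like a speedometer, collected via callback.
def read_speed(options: CallbackOptions):
    return [Observation(42.0)]

meter.create_observable_gauge("vehicle.speed", callbacks=[read_speed], unit="km/h")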

Confused? You are not alone. Confusion does not help observability, yet observability is built on a foundation of metrics. What could possibly go wrong?!?

Why Are Metrics So Damned Hard?

The best way to answer this question is with a classic xkcd comic about standards.


https://xkcd.com/927/

We’ve already discussed the most common metrics standards. Here’s their order of introduction, from oldest to newest.

StatsD Extended (386 bytes)

kafka.consumer.request_size_avg:164.6153846153846|g|#service:frauddetectionservice,client_id:consumer-frauddetectionservice-1,node_id:node-2147483646,container_id:296ecbc9372a5f6a21c98218fe2696cd3a2de69d1e42128a9401dfde974e8a90,host_name:296ecbc9372a,host_arch:aarch64,os_type:linux,process_pid:1,telemetry_sdk_name:opentelemetry,telemetry_sdk_version:1.31.0,telemetry_sdk_language:java

Prometheus (756 bytes)

# HELP kafka_consumer_request_size_avg Average request size for Kafka consumer
# TYPE kafka_consumer_request_size_avg gauge
kafka_consumer_request_size_avg{container_id="296ecbc9372a5f6a21c98218fe2696cd3a2de69d1e42128a9401dfde974e8a90",docker_cli_cobra_command_path="docker compose",host_arch="aarch64",host_name="296ecbc9372a",os_description="Linux 6.6.26-linuxkit",os_type="linux",process_pid="1",process_runtime_name="OpenJDK Runtime Environment",process_runtime_version="17.0.9+9-Debian-1deb11u1",service_name="frauddetectionservice",telemetry_auto_version="1.31.0",telemetry_sdk_language="java",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.31.0",client_id="consumer-frauddetectionservice-1",node_id="node-2147483646"} 164.6153846153846

OTLP Metrics (1,128 bytes)

{"schema_url":"","scope":{"name":"io.opentelemetry.kafka-clients-0.11","version":"1.31.0-alpha","dropped_attributes_count":0},"container.id":"296ecbc9372a5f6a21c98218fe2696cd3a2de69d1e42128a9401dfde974e8a90","docker.cli.cobra.command_path":"docker compose","host.arch":"aarch64","host.name":"296ecbc9372a","os.description":"Linux 6.6.26-linuxkit","os.type":"linux","process.command_args":"[\"/usr/lib/jvm/java-17-openjdk-arm64/bin/java\",\"-jar\",\"frauddetectionservice-1.0-all.jar\"]","process.executable.path":"/usr/lib/jvm/java-17-openjdk-arm64/bin/java","process.pid":1,"process.runtime.description":"Debian OpenJDK 64-Bit Server VM 17.0.9+9-Debian-1deb11u1","process.runtime.name":"OpenJDK Runtime Environment","process.runtime.version":"17.0.9+9-Debian-1deb11u1","service.name":"frauddetectionservice","telemetry.auto.version":"1.31.0","telemetry.sdk.language":"java","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.31.0","client-id":"consumer-frauddetectionservice-1","node-id":"node-2147483646","_time":1717264158.9800723,"_metric":"kafka.consumer.request_size_avg","_value":164.6153846153846,"flags":0}

Three metric datapoints using different formats to convey the same details, and only because we used the StatsD Extended format. (Yes, another metric format!) I also included the byte counts so that you don’t have to eyeball them and guess which format is larger. We have three widely adopted formats with vastly different layouts, each one capable of telling you that the metric datapoint kafka.consumer.request_size_avg has a value of 164.6153846153846 at 2024-06-01 18:49:18 UTC.

Each format is an open standard and each has its own positives and negatives. How do you pick the format that is right for you?

How We Do Metrics Wrong

Converting between metrics formats like StatsD, Prometheus, and OTLP Metrics is hard. Cribl made translating into OTLP Metrics format easier with our aptly named OTLP Metrics Function, but what about converting the same metric format into a different metric type?
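To see why even the “easy” direction is fiddly, here is a minimal Python sketch (not Cribl’s OTLP Metrics Function; the helper name statsd_to_prometheus is made up) of what converting a StatsD Extended line into Prometheus exposition format involves: the metric name has to be rewritten, the type code mapped, and the #key:value tags reshaped into labels.

# A rough sketch of StatsD Extended -> Prometheus exposition format conversion.
def statsd_to_prometheus(line: str) -> str:
    parts = line.split("|")
    name, value = parts[0].split(":", 1)
    prom_name = name.replace(".", "_")                  # dots are not legal in Prometheus names
    prom_type = {"c": "counter", "g": "gauge"}.get(parts[1], "untyped")
    tag_part = parts[2].lstrip("#") if len(parts) > 2 else ""
    labels = ",".join(f'{k}="{v}"' for k, v in
                      (t.split(":", 1) for t in tag_part.split(",") if t))
    return f"# TYPE {prom_name} {prom_type}\n{prom_name}{{{labels}}} {value}"

print(statsd_to_prometheus(
    "kafka.consumer.request_size_avg:164.6153846153846|g"
    "|#service:frauddetectionservice,node_id:node-2147483646"
))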

What if I have 100 metric datapoints per minute emitted from a set of 10 different container IDs, but everything else is the same? I may want to drop the container.id dimension to reduce cardinality, the number of unique dimension combinations a metric produces. Pretty easy, right? I just use a Function like Eval in Cribl Stream and remove that container.id key. And, done!
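Conceptually, that Eval step boils down to deleting one key from each datapoint. In Cribl Stream you would do this in the Eval Function with no code at all; the Python sketch below (with an invented datapoint) is only to show the shape of the operation.

# What dropping the container.id dimension does to a single datapoint.
datapoint = {
    "_metric": "kafka.consumer.request_size_avg",
    "_value": 164.6153846153846,
    "service.name": "frauddetectionservice",
    "container.id": "296ecbc9372a5f6a21c98218fe2696cd3a2de69d1e42128a9401dfde974e8a90",
}
datapoint.pop("container.id", None)   # remove the dimension driving the cardinality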


I now have datapoints with exactly the same dimensions: each of the 10 containers emits one datapoint every 6 seconds, so every 6-second window contains 10 datapoints whose only difference was the container.id I just removed. Those duplicates need to be aggregated into a single metric datapoint per window, leaving 10 datapoints per minute instead of 100.

Aggregations are where things get really complicated. Aggregate Functions should be applied to specific metric types. Even if the math “works”, calculating the sum() of a gauge and then storing it as a counter will really confuse a metrics datastore that expects counters to be monotonic when they are suddenly non-monotonic. If you toss in a delta counter vs. a cumulative counter, chaos will not be far behind.
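Here is a rough Python sketch (not a Cribl Aggregations Function; the sample values are invented) of type-aware aggregation for this scenario: collapse the duplicate gauge datapoints into one per 6-second window using an average, which preserves the gauge’s meaning, where a sum stored as a counter would not.

from collections import defaultdict
from statistics import mean

datapoints = [
    # (_time, _metric, _value) after container.id has been dropped
    (1717264158.1, "kafka.consumer.request_size_avg", 160.0),
    (1717264158.9, "kafka.consumer.request_size_avg", 164.6),
    (1717264159.4, "kafka.consumer.request_size_avg", 171.2),
]

windows = defaultdict(list)
for ts, name, value in datapoints:
    window_start = int(ts // 6) * 6           # 6-second tumbling window
    windows[(window_start, name)].append(value)

for (window_start, name), values in sorted(windows.items()):
    # avg (or min/max/last) is appropriate for a gauge; sum() would misrepresent it,
    # and storing that sum as a counter hands a non-monotonic series to a store
    # that expects counters to be monotonic.
    print(window_start, name, mean(values))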

At Cribl, we recognize these challenges and have developed solutions to address them head-on. Our OTLP Metrics Function simplifies format conversion, while our advanced aggregation capabilities help manage cardinality without sacrificing data fidelity. These innovations are designed to make metrics more accessible and actionable for organizations of all sizes.

What Does the Future Hold?

Understanding metrics is different from understanding events or traces. Yes, I said traces, because traces are more aligned to structured, ordered logs than we want to believe. (But that’s a blog post for another day!) Where logs and traces (technically, spans in a trace) can be viewed as discrete events, metrics are best understood by pivoting the context from time to the metric name. In our 4.10 release, we added a new visualization option for samples and data captures that automatically pivots to a metric context, including showing the difference between pre- and post-pipeline data sets.
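The visualization does the pivot for you, but the idea itself is simple enough to sketch in a few lines of Python (the captured values below are invented): group datapoints by metric name so each series reads as one row over time, rather than as a scroll of individual time-ordered events.

from collections import defaultdict

capture = [
    {"_time": 1717264158.9, "_metric": "kafka.consumer.request_size_avg", "_value": 164.6},
    {"_time": 1717264158.9, "_metric": "kafka.consumer.records_lag_max", "_value": 0.0},
    {"_time": 1717264164.9, "_metric": "kafka.consumer.request_size_avg", "_value": 171.2},
]

# Pivot: index by metric name instead of by time.
by_metric = defaultdict(list)
for dp in capture:
    by_metric[dp["_metric"]].append((dp["_time"], dp["_value"]))

for metric, series in by_metric.items():
    print(metric, series)   # one row per metric name, values in time order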


The next frontier is interacting with metrics and defining the right Functions (or series of Functions) to summarize, aggregate, drop, etc., across metrics and dimensions. Building metric-specific Functions, because metrics are not event-centric like logs or spans/traces, will help ease the cognitive burden of parsing, managing, and contextualizing them.

This is just the beginning. We see a not-too-distant future where interacting with metrics is as intuitive as working with logs or traces. By developing metric-specific Functions and intelligent aggregation techniques, we're working to reduce the cognitive burden of metric management, allowing teams to focus on deriving insights rather than wrestling with data formats.

Conclusion

Math is hard. Metrics are math by another name. When we can make math and metrics a little easier, we should. Cribl speaks logs, metrics, and traces. We speak StatsD, Prometheus, and OTLP Metrics. Ensuring that we can all speak about metrics without barriers is the Esperanto of metrics.


Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.
