Observability: Distributed Tracing with OpenTelemetry — Part 1

Prabuddha Chakraborty
Published in Licious Technology
Sep 7, 2023 · 9 min read



The Need for Distributed Tracing

As applications grow in complexity, so do the challenges of debugging, optimizing, and maintaining them. Distributed systems, where various services interact to fulfill a user request, amplify these challenges. Imagine trying to trace the journey of a user request as it hops across microservices, databases, and external APIs. This is where distributed tracing shines.

Distributed tracing allows developers and operations teams to visualize and analyze the flow of requests as they traverse through different components of an application. This real-time insight not only helps in diagnosing and resolving issues but also in optimizing performance and resource allocation.

Introducing OpenTelemetry

OpenTelemetry is an open-source project that provides a unified set of APIs, libraries, agents, and instrumentation to enable observability in applications. It supports multiple programming languages and frameworks, making it a versatile choice for modern software development. OpenTelemetry integrates with various observability tools, enabling you to collect, process, and visualize telemetry data, including traces, metrics, and logs.

Key Features of OpenTelemetry

  1. Automatic Instrumentation: OpenTelemetry offers automatic instrumentation for popular libraries and frameworks. This means you can start capturing traces without manually adding code to each service.
  2. Flexibility: You have the freedom to choose the components you want to use. Whether it’s capturing traces, metrics, or logs, OpenTelemetry has you covered.
  3. Vendor-Neutral: OpenTelemetry is designed to be vendor-agnostic, allowing you to send telemetry data to various backend systems like Prometheus, Jaeger, Zipkin, and more.
  4. Context Propagation: Distributed tracing relies on context propagation to maintain the correlation of traces across different services. OpenTelemetry provides mechanisms for seamless context propagation.

Key Components of OpenTelemetry

OpenTelemetry is currently made up of several main components:

APIs and SDKs

OpenTelemetry provides APIs and SDKs for various programming languages like Java, Python, JavaScript, Go, etc. These APIs allow developers to add instrumentation code to their applications to generate traces and metrics. The SDKs include built-in integrations with popular frameworks and libraries.
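
As a quick illustration, here is a minimal sketch using the Python SDK that wires up a tracer and prints finished spans to stdout; the tracer and span names are invented for this example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Install a global tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("sample.instrumentation")

# Each unit of work is recorded as a span.
with tracer.start_as_current_span("process-order"):
    pass  # application logic goes here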

OTLP

The OpenTelemetry Protocol (OTLP) specifies how OpenTelemetry-related components exchange telemetry data. This includes instrumented applications, intermediary services such as the Collector, as well as backends to which the data is eventually sent.
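
For instance, a Python service can be pointed at an OTLP endpoint such as a local Collector. A rough sketch, assuming the Collector listens on the default OTLP/gRPC port 4317:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans over OTLP/gRPC to a Collector (the endpoint is an assumption).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)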

Collector

The OpenTelemetry Collector can collect data from OpenTelemetry SDKs and other sources, and then export this telemetry to any supported backend, such as Jaeger, Prometheus, or Kafka queues.

The OpenTelemetry Collector (image from CNCF)

The OpenTelemetry Collector is built as a processing pipeline with a pluggable architecture, consisting of four main parts (a sample configuration wiring them together appears after the list):

  1. Receivers for ingesting incoming data in various formats and protocols, such as OTLP, Jaeger, and Zipkin. The full list of available receivers is maintained in the opentelemetry-collector and opentelemetry-collector-contrib repositories.
  2. Processors for performing data aggregation, filtering, sampling, and other collector processing logic on the telemetry data. Processors can be chained to produce complex processing logic.
  3. Exporters for emitting the telemetry data to one or more backend destinations (typically analysis tools or higher-order aggregators) in various formats and protocols, such as OTLP, Prometheus, and Jaeger. The full list of available exporters is likewise maintained in those repositories.
  4. Connectors for connecting different pipelines in the same Collector. The connector serves as both an exporter and receiver, so it can consume data as an exporter from one pipeline and emit the data as a receiver into another pipeline.
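
Putting these parts together, a minimal Collector configuration for a traces pipeline might look like the sketch below; the Jaeger endpoint is an assumption (Jaeger can ingest OTLP directly):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317  # assumed Jaeger OTLP/gRPC endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]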

Backend Storage

OpenTelemetry primarily focuses on generating, collecting, and exporting observability data; it doesn't inherently provide storage components for long-term retention. Instead, it relies on integrations with existing observability platforms for the storage, analysis, and visualization of collected data. Here are some common storage components and platforms that can be integrated with OpenTelemetry:

  • Jaeger: Jaeger is an open-source, end-to-end distributed tracing platform that supports the storage and visualization of trace data. OpenTelemetry can export trace data to Jaeger for storage and analysis.
  • Zipkin: Zipkin is another open-source distributed tracing system that can store and display trace data. OpenTelemetry can export trace data to Zipkin-compatible formats.
  • Prometheus: Prometheus is an open-source monitoring system that is commonly used to store and query metric data. OpenTelemetry can export metrics in Prometheus-compatible formats.
  • Custom Data Stores: Organizations can choose to use custom storage solutions using databases like Clickhouse, Druid, MySQL, PostgreSQL, or NoSQL databases like Cassandra or MongoDB.

Distributed Tracing with OpenTelemetry

At the crux of distributed tracing is the trace: a representation of the complete flow of a user request or transaction, from initiation to completion. Each trace is broken down into multiple spans.


Span

In OpenTelemetry, spans represent individual units of work or operations within a distributed trace. Spans contain information about the duration, start and end times, associated attributes, events, and more.

The exact structure of a span varies with the chosen serialization format, but here is a sample JSON representation of a span, loosely following the OTLP Protocol Buffers schema:

{
  "name": "sample_licious_span",
  "trace_id": "0123456789abcdef0123456789abcdef",
  "span_id": "0123456789abcdef",
  "parent_span_id": "fedcba9876543210",
  "start_time_unix_nano": 1678901234567890,
  "end_time_unix_nano": 1678901234568900,
  "status": {
    "code": "OK"
  },
  "attributes": [
    {
      "key": "http.method",
      "value": { "string_value": "GET" }
    },
    {
      "key": "http.url",
      "value": { "string_value": "https://licious.com/api" }
    }
  ],
  "events": [
    {
      "time_unix_nano": 1678901234568000,
      "name": "event_name",
      "attributes": [
        {
          "key": "event.attr",
          "value": { "string_value": "attribute_value" }
        }
      ]
    }
  ]
}

In a trace, the first span is known as the root span; it has no parent and encapsulates the end-to-end latency of the entire request.
A child span is a span initiated by a parent span. Child spans represent operations that are part of the larger task represented by the parent span, and they carry their own set of attributes, events, and timing information.

Consider a basic web server: when the server processes an incoming HTTP request, it might make calls to a database, an external API, and a caching service. Each of these operations can be captured as a child span of the parent span representing the incoming request, letting you understand the performance and behavior of each specific operation within the context of the larger request.

Parent-child relationships between spans are crucial for understanding how requests flow through a distributed system, identifying latency issues, diagnosing errors, and optimizing performance. They create a visual representation of the entire journey of a request, allowing developers and operators to troubleshoot and optimize the system effectively.

In the OpenTelemetry context, the SDK and instrumentation libraries handle the creation and management of spans, including establishing parent-child relationships. As you instrument your code, you can use the provided APIs to create new spans and specify the appropriate parent span for each new span you create.
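
As a rough sketch with the Python SDK (assuming a tracer provider is already configured, as in the earlier examples), nesting start_as_current_span calls is enough to establish the parent-child relationship; the operation names here are invented:

from opentelemetry import trace

tracer = trace.get_tracer("web.server")

# The outermost span becomes the root span for this trace.
with tracer.start_as_current_span("GET /api/orders"):
    # Each downstream operation is captured as a child span.
    with tracer.start_as_current_span("db.query") as db_span:
        db_span.set_attribute("db.system", "mysql")  # illustrative attribute
    with tracer.start_as_current_span("cache.lookup"):
        pass  # cache call goes here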

Context Propagation

Context propagation is a critical aspect of distributed tracing that enables the correlation of trace and span information as requests flow through different services and components within a distributed system. It ensures that related spans are linked together to form a coherent trace, even as the request is processed by different services in different parts of the system. Context propagation typically involves carrying trace context information, such as trace and span IDs, from the parent span to child spans and across service boundaries.

Two common context propagation standards used in distributed tracing are the W3C Trace Context specification and the B3 propagation format.

W3C Trace Context Specification

The W3C Trace Context specification is an industry-standard approach to context propagation in distributed tracing. It defines a set of HTTP headers that are used to convey trace context information between services. The headers specified by W3C Trace Context are:

  1. traceparent: This header contains the essential trace and span IDs, along with trace flags that indicate whether the trace should be sampled. It ensures that the trace context is propagated between services and enables the creation of child spans (an example value is shown after this list).
  2. tracestate: This optional header allows services to include additional context information. It provides flexibility for custom data that might be needed for correlation.
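
For reference, a traceparent value carries four dash-separated fields: version, trace ID, parent span ID, and trace flags. An illustrative, made-up value:

traceparent: 00-0123456789abcdef0123456789abcdef-0123456789abcdef-01

Here the trailing 01 sets the sampled flag.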

B3 Propagation

The B3 propagation format is a simpler context propagation mechanism commonly used by various distributed tracing systems. It originated in the Zipkin project and is supported by multiple tracing solutions. B3 defines three main headers for context propagation (an example set of values follows the list):

  1. X-B3-TraceId: Represents the trace ID for the entire trace. It is a 64-bit or 128-bit value.
  2. X-B3-SpanId: Represents the ID of the current span.
  3. X-B3-Sampled: Indicates whether the current span should be sampled for tracing.
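
Using the same made-up IDs, an illustrative set of B3 headers:

X-B3-TraceId: 0123456789abcdef0123456789abcdef
X-B3-SpanId: 0123456789abcdef
X-B3-Sampled: 1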

Example: How Context Propagation Works

Let’s consider an example of a distributed request that involves two services, A and B, using the W3C Trace Context headers for context propagation:

Service A:

  • Service A receives an incoming request and generates a new trace ID and span ID.
  • It attaches the traceparent header to the outgoing request to Service B, containing the trace and span IDs.

Service B:

  • Service B receives the request from Service A and extracts the trace context information from the traceparent header.
  • It uses the extracted trace and span IDs to continue the trace in its own context.
  • Service B attaches the traceparent header to any subsequent outgoing requests it makes to other services.

By consistently propagating trace context information using the specified headers, the trace correlation remains intact as the request flows through different services. This correlation allows the distributed tracing system to link the spans together and provide a complete view of the request’s journey.

OpenTelemetry integrates seamlessly with both the W3C Trace Context specification and the B3 propagation format. When you use OpenTelemetry’s APIs and SDKs for instrumenting your applications, context propagation is automatically handled according to the chosen standard. This enables you to capture trace context, maintain trace correlation, and seamlessly integrate with existing tracing ecosystems.
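
To make the mechanics concrete, here is a rough Python sketch of both sides; the plain dictionaries stand in for real HTTP request headers, and the service and span names are invented:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("propagation.demo")

# Service A: inject the active span context into outgoing headers.
with tracer.start_as_current_span("call-service-b"):
    headers = {}
    inject(headers)  # adds the traceparent header to the carrier dict
    # http_client.get("http://service-b/api", headers=headers)  # hypothetical call

# Service B: extract the context from incoming headers and continue the trace.
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle-request", context=ctx):
        pass  # this span becomes a child of Service A's span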

For more information, see the W3C Trace Context specification: https://www.w3.org/TR/trace-context/

OpenTelemetry Operator: Streamlining Deployment

Deploying observability tools can be daunting, especially in dynamic environments like Kubernetes. This is where the OpenTelemetry Operator comes to the rescue. The operator abstracts the complexities of setting up and managing OpenTelemetry components, making observability deployment a breeze.

OpenTelemetry Operator data flow

Benefits of the OpenTelemetry Operator

  1. Simplified Deployment: The operator simplifies the deployment process by providing a declarative way to manage OpenTelemetry components as Kubernetes resources (a sample manifest is sketched after this list).
  2. Automated Updates: With the operator in place, you can easily update OpenTelemetry components across your cluster without manual intervention.
  3. Custom Configuration: Tailor OpenTelemetry configurations to your application’s specific needs by defining custom resource specifications.
  4. Scalability: The operator enables seamless scaling of observability components based on application demands, ensuring consistent performance insights.
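
As a rough sketch, a minimal OpenTelemetryCollector resource managed by the operator might look like this; the resource name and the logging exporter choice are assumptions:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector  # hypothetical name
spec:
  mode: deployment  # alternatives include daemonset, sidecar, and statefulset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      batch:
    exporters:
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging]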

Observability at Licious Today

By adopting OpenTelemetry and leveraging its distributed tracing capabilities, Licious has gained real-time visibility into the journey of user requests. This has enabled us to identify bottlenecks, track latency issues, and enhance our overall service quality. The automatic instrumentation enabled by the OpenTelemetry Operator allowed us to integrate tracing into our services without excessive manual intervention, ensuring swift implementation.

Licious Observability Setup

Having a centralized OpenTelemetry Collector allowed us to seamlessly try out several open-source solutions such as SigNoz, Hypertrace, and Uptrace. In future articles, we may compare these solutions.

A glance at the internals of the Licious observability platform, built on OpenTelemetry spans:

Our internal APM platform is built on top of two open-source solutions: Jaeger and SigNoz.

RED (Rate, Errors, Duration) Metrics

Tracking Database Calls Metrics

Tracking External Calls Metrics

Tracing Span across services

Wrapping Up

Distributed tracing plays a pivotal role in observability at Licious, giving developers and operations teams invaluable insight into intricate application architectures. OpenTelemetry simplifies the process of capturing and visualizing traces, and the OpenTelemetry Operator takes this a step further by streamlining the deployment and management of observability components, particularly within dynamic environments like Kubernetes.

By integrating OpenTelemetry and the OpenTelemetry Operator into our development workflow, we can build and maintain highly observable, performance-optimized applications. This not only delivers a better experience for users but also eases the journey for developers. Embrace the capabilities of distributed tracing and elevate your application's observability with OpenTelemetry.
