OpenTelemetry: Trace and instrument your application code

Originally, two open-source projects existed to allow you to implement tracing in your applications: OpenCensus and OpenTracing.

These two projects shared the same goal and decided to merge to form OpenTelemetry, which is now incubated within the CNCF (Cloud Native Computing Foundation).

The objective remains the same: to allow developers to implement distributed tracing on their applications and to have an end-to-end view of what is happening from the user request on a web front-end to the various back-end services that are called.

This makes it much easier to identify problems in production, and to debug them with the precise context of a request at hand.

If you did not set up a tracing stack at the start of your project, you will have to go back through your code to add it, but you can do so gradually: start by adding traces to the critical parts of your code, then complete the picture later with logs and with instrumentation of the various tools you use (databases, HTTP/gRPC APIs, cache, etc.).

How does OpenTelemetry work?

OpenTelemetry Architecture

Overall, you need to install a collector, and your back-end applications are responsible for sending their traces to it. The collector then takes care of exporting the traces and metrics to an external service such as Jaeger, Prometheus, Datadog or any other backend.

The OpenTelemetry collector can be run as a simple binary, or you can install it on your Kubernetes cluster as a sidecar or via a DaemonSet.

For more information on installation, refer to this page.

Adding traces to my code

What we are going to focus on now is how to set up the first traces in your applications.

I will use the Go library in these examples, but refer to the documentation to find out how to use the library for the language of your choice (PHP, Python, Ruby, Rust, Java, JavaScript, ...).

In this example, our code will open a gRPC connection via the otlptracegrpc package to send our traces to the OpenTelemetry collector. So we need to instantiate an exporter:

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

traceExporter, err := otlptracegrpc.New(
    ctx,
    otlptracegrpc.WithInsecure(),
    otlptracegrpc.WithEndpoint("collector-endpoint.svc.local:50052"),
    otlptracegrpc.WithDialOption(
        grpc.WithConnectParams(grpc.ConnectParams{
            Backoff: backoff.Config{
                BaseDelay:  1 * time.Second,
                Multiplier: 1.6,
                MaxDelay:   15 * time.Second,
            },
            MinConnectTimeout: 0,
        }),
    ),
)
if err != nil {
    panic(err)
}

As this example shows, we can also configure a retry policy for connection problems: with a backoff configuration, each new connection attempt waits exponentially longer than the previous one, up to a maximum delay.

We then need to instantiate a TracerProvider which will allow us to create traces in our application:

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
	"go.opentelemetry.io/otel/trace"
)

// ...

tracerProvider := sdktrace.NewTracerProvider(
    sdktrace.WithResource(resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceNameKey.String("my-application"),
        semconv.ServiceVersionKey.String("v1.0.0"),
    )),
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
    sdktrace.WithSpanProcessor(sdktrace.NewBatchSpanProcessor(traceExporter)),
)

otel.SetTracerProvider(tracerProvider)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

When initializing this tracerProvider, we specify a few default attributes: the URL of the OpenTelemetry schema used, an application name (called the service) and a version number. This information will then be available in your final interface, and you will be able to query it.

These default attributes are also attached to every span created in your application.

We also create a BatchSpanProcessor by specifying the traceExporter created previously. Thus, for each new trace, the TracerProvider will send (in batch mode) the telemetry data to our exporter.

Once the tracerProvider is ready to be used, all we have to do is create a first trace in our application.

A couple of concepts before continuing:

  • A span represents a portion of code executed by your application: for example, the HTTP handler of an API route can be a span, and it can have child spans such as a call to a database,
  • A trace is a set of spans: these spans are linked by a trace_id, which is generated automatically when the first span (the one with no parent) is created.
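These two definitions can be pictured with a toy data model. The span type below is hypothetical and far simpler than the SDK's: it only keeps a name and the three identifiers that tie spans together.

```go
package main

import "fmt"

// span is a toy model, not the SDK type.
type span struct {
	Name     string
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span of a trace
}

// groupByTrace gathers the spans that share a trace ID: each group is one trace.
func groupByTrace(spans []span) map[string][]span {
	traces := make(map[string][]span)
	for _, s := range spans {
		traces[s.TraceID] = append(traces[s.TraceID], s)
	}
	return traces
}

// root returns the span of a trace that has no parent, if there is one.
func root(trace []span) (span, bool) {
	for _, s := range trace {
		if s.ParentID == "" {
			return s, true
		}
	}
	return span{}, false
}

func main() {
	spans := []span{
		{Name: "http-server: example handler", TraceID: "t1", SpanID: "a"},
		{Name: "database: retrieve data", TraceID: "t1", SpanID: "b", ParentID: "a"},
		{Name: "http-server: other handler", TraceID: "t2", SpanID: "c"},
	}
	traces := groupByTrace(spans)
	for _, id := range []string{"t1", "t2"} {
		r, _ := root(traces[id])
		fmt.Printf("trace %s: %d span(s), root = %q\n", id, len(traces[id]), r.Name)
	}
}
```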

So let's declare an HTTP handler that calls a database, which gives us a trace with two spans (a parent and a child):

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

// ...

var (
    tracer = tracerProvider.Tracer("http-handler")
)

func ServeHTTP(writer http.ResponseWriter, request *http.Request) {
    ctx, span := tracer.Start(
        request.Context(),
        "http-server: example handler",
        trace.WithSpanKind(trace.SpanKindServer),
        trace.WithAttributes(
            semconv.HTTPMethodKey.String(request.Method),
        ),
    )

    // Handle your request here...
    data := retrieveDataFromDatabase(ctx)
    // Handle the next part of your query here...

    span.End()
}

func retrieveDataFromDatabase(ctx context.Context) interface{} {
    // Start a child span: its parent is read from the context we received.
    ctx, span := tracer.Start(
        ctx,
        "database: retrieve data",
        trace.WithAttributes(
            attribute.String("my-key", "my-value"),
        ),
    )

    // Query database
    data, err := mydatabase.QuerySomething(ctx)
    if err != nil {
        // Mark the span as failed so the error is visible in the trace.
        span.SetStatus(codes.Error, err.Error())
    }

    span.End()
    return data
}

In this example, we have two spans: the first one is named http-server: example handler, and its child, database: retrieve data, covers the call to our database.

Parenthood is established via the context: when we create our first span, the library stores it in the returned context, together with its trace_id and span_id. When a new span is then created from that context, this stored information is used to define the hierarchy.

Note the ability to add specific attributes to these spans. Most of the main attributes you can define are already standardized in the semconv package. Of course you can also define your own attributes.

If an error occurs, the span can be marked as failed with span.SetStatus(codes.Error, ...), as done in the database call above.

Instrumenting third party libraries

The libraries you use in your projects, such as net/http, gRPC clients and servers, database clients, caching tools or SDKs (like the one from AWS, for example), are most likely already instrumented.

You can look in the repositories suffixed with -contrib listed here: https://github.com/open-telemetry?q=contrib.

However, some instrumentations are also available in other GitHub repositories, so don't hesitate to do some research before you start instrumenting.

For example, I had the opportunity to work on the instrumentation of the confluentinc/confluent-kafka-go library here: https://github.com/etf1/opentelemetry-go-contrib.

The advantage of these instrumentations is that they let you quickly set up your first traces on these calls without having to modify much code on your side.

For example, the AWS SDK instrumentation adds traces to a DynamoDB client with a single extra line, the call to otelaws.AppendMiddlewares:

import (
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws"
)

// ...
cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
    panic(err)
}

otelaws.AppendMiddlewares(&cfg.APIOptions, otelaws.WithTracerProvider(tracerProvider))
dynamodbClient := dynamodb.NewFromConfig(cfg)

Instrumenting my own library

If you want to instrument a library, here are the few things you need to know. I will use the instrumentation written for this Kafka producer as an example.

Before going into detail, here is a summary of the steps to perform:

  • Retrieve the trace information stored in the Kafka message (if the message was consumed from an upstream service) and store it in our Go context
  • Create a span from the Go context, which also defines the parent span (again, if there is one)
  • Inject the information of the newly created span from our Go context into the Kafka message
  • Thus, when the Kafka message is later consumed by another application, it will carry the information of the span it originated from

Now let's take a look at the opentelemetry-go library interfaces that we need to adhere to:

type TextMapCarrier interface {
	// Get returns the value associated with the passed key.
	Get(key string) string
	// DO NOT CHANGE: any modification will not be backwards compatible and
	// must never be done outside of a new major release.

	// Set stores the key-value pair.
	Set(key string, value string)
	// DO NOT CHANGE: any modification will not be backwards compatible and
	// must never be done outside of a new major release.

	// Keys lists the keys stored in this carrier.
	Keys() []string
}

This first TextMapCarrier interface allows you to set and get key-value pairs on the object that transports your context: in other words, it is mainly a way to store or retrieve the trace_id and span_id on your object (an HTTP request, a Kafka message, or anything else).
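As an illustration, here is a minimal sketch of a carrier that stores its values in the headers of a message. The Header and Message types are hypothetical stand-ins for the types of a Kafka client, and the interface is copied locally so that the example compiles on its own; the real implementation in the repository mentioned earlier may well differ.

```go
package main

import "fmt"

// TextMapCarrier mirrors the interface quoted above.
type TextMapCarrier interface {
	Get(key string) string
	Set(key string, value string)
	Keys() []string
}

// Header and Message are hypothetical stand-ins for a Kafka client's types.
type Header struct {
	Key   string
	Value []byte
}

type Message struct {
	Topic   string
	Headers []Header
}

// MessageCarrier adapts a Message to the TextMapCarrier interface.
type MessageCarrier struct {
	msg *Message
}

func NewMessageCarrier(msg *Message) MessageCarrier {
	return MessageCarrier{msg: msg}
}

// Get returns the value of the first header named key, or "" if absent.
func (c MessageCarrier) Get(key string) string {
	for _, h := range c.msg.Headers {
		if h.Key == key {
			return string(h.Value)
		}
	}
	return ""
}

// Set overwrites the header named key, or appends it if it does not exist.
func (c MessageCarrier) Set(key, value string) {
	for i, h := range c.msg.Headers {
		if h.Key == key {
			c.msg.Headers[i].Value = []byte(value)
			return
		}
	}
	c.msg.Headers = append(c.msg.Headers, Header{Key: key, Value: []byte(value)})
}

// Keys lists the header names stored in the message.
func (c MessageCarrier) Keys() []string {
	keys := make([]string, 0, len(c.msg.Headers))
	for _, h := range c.msg.Headers {
		keys = append(keys, h.Key)
	}
	return keys
}

// Compile-time check that the carrier satisfies the interface.
var _ TextMapCarrier = MessageCarrier{}

func main() {
	msg := &Message{Topic: "my-topic"}
	carrier := NewMessageCarrier(msg)
	carrier.Set("traceparent", "example-value")
	fmt.Println(carrier.Keys(), carrier.Get("traceparent"))
}
```

Overwriting an existing header in Set, rather than appending a duplicate, keeps the message from carrying two conflicting trace contexts when it passes through several services.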

Once your object is wrapped by your TextMapCarrier implementation, you will then have to use it in the propagator via the following methods:

type TextMapPropagator interface {
	// Inject set cross-cutting concerns from the Context into the carrier.
	Inject(ctx context.Context, carrier TextMapCarrier)
	// DO NOT CHANGE: any modification will not be backwards compatible and
	// must never be done outside of a new major release.

	// Extract reads cross-cutting concerns from the carrier into a Context.
	Extract(ctx context.Context, carrier TextMapCarrier) context.Context
	// DO NOT CHANGE: any modification will not be backwards compatible and
	// must never be done outside of a new major release.

	// Fields returns the keys whose values are set with Inject.
	Fields() []string
}

Concretely, when a request is received, you need to extract the context information with the Extract(...) context.Context method to determine whether a parent span should be used. You then get a context holding the data needed to create your span:

carrier := NewMessageCarrier(message)
ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)

You can then create a span with your context as seen before and then re-inject the updated trace_id and span_id into your Kafka message:

ctx, span := tracer.Start(ctx, "produce")
otel.GetTextMapPropagator().Inject(ctx, carrier)
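With the propagation.TraceContext{} propagator registered earlier, what travels in the carrier is a traceparent entry in the W3C Trace Context format: version, trace ID (32 hex characters), parent span ID (16 hex characters) and flags, separated by dashes. The sketch below formats and parses such a value. The two helper functions are hypothetical and deliberately simplified (version 00 only, no tracestate), so treat them as an illustration of the format rather than a replacement for the propagator.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// formatTraceparent builds a version-00 traceparent value.
func formatTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return "00-" + traceID + "-" + spanID + "-" + flags
}

// isHex reports whether s is exactly n hexadecimal characters.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// parseTraceparent splits a version-00 traceparent value into its parts.
func parseTraceparent(v string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false, errors.New("unsupported traceparent")
	}
	if !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return "", "", false, errors.New("malformed traceparent")
	}
	flags, _ := hex.DecodeString(parts[3]) // validated just above
	return parts[1], parts[2], flags[0]&1 == 1, nil
}

func main() {
	tp := formatTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(tp)
	fmt.Println(parseTraceparent(tp))
}
```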

All that remains is to add some attributes to your span, if you wish:

span.SetAttributes(
    semconv.MessagingSystemKey.String("kafka"),
    semconv.MessagingDestinationKindTopic,
    semconv.MessagingDestinationKey.String(message.Topic),
)

You can now perform the processing you have to do (in our case produce the Kafka message) and then close your span:

err := producer.Produce(message)
if err != nil {
    span.SetStatus(codes.Error, err.Error())
}
span.End()

And there you have it, the first step to instrumenting your libraries.

Associating logs with my traces

Our traces are now being reported: all that is left is to associate our application logs with the corresponding span.

Technically, it's simple: we just need to add two attributes, trace_id and span_id, to the log format. For example in JSON:

{
  "timestamp": 1581385157.14429,
  "message": "My log message",
  "trace_id": "123456789123456789123456",
  "span_id": "1234567891234567"
}

In order for your logs to be associated with your traces, you simply need to retrieve this information from your span as follows:

logger = logger.With(
    zap.String("span_id", span.SpanContext().SpanID().String()),
    zap.String("trace_id", span.SpanContext().TraceID().String()),
)

// ...

logger.Info("Received response from my service", zap.String("data", data))

This will allow the log collector agent to retrieve your application logs and associate them with a span and, by extension, with your application and its version.

Conclusion

OpenTelemetry is a really interesting tool to implement, especially if you have a distributed architecture with several microservices: you will have a better view of the workflow of your application requests.

The implementation on an existing project can be a bit tedious because you have to modify a good portion of the code, but you can proceed step by step:

  • Start by instrumenting the libraries in order to have information on third party calls,
  • Add traces in your code little by little, starting with the critical parts,
  • Link your logs to your traces.

Don't hesitate to contact me if you want to have more information on this subject!