debugging tokio instrumentation

by Hayden Stainsby on Tuesday, 30 January 2024

I contribute to tokio-console. One of the things that I often find myself doing is matching what is shown in the console with the "raw" tracing output that comes from Tokio. However, this is pretty hard to read and doesn't actually contain all the information that I need.

There are a couple of things that I'd like to have. Firstly (and most importantly), I need to see Tracing's internal span::Id for the spans that are emitted. This is something which the fmt::Subscriber (and underlying Layer) don't support. And rightly so - it's internal information. But it's used heavily in the instrumentation in Tokio and to debug, I really need to have it available.

Normally, to get this information I use a patched version of the tracing-subscriber crate. But this is something that can't be checked into the console project, and setting it up each time is a bit tedious.

Secondly, I'd like to be able to visually differentiate the specific spans and events used in Tokio's instrumentation. Unlike the internal span ID, this is entirely domain specific, and has no use outside of this specific use case.

Here's a snippet of output from the fmt::Subscriber showing some of Tokio's instrumentation.

2024-01-31T09:23:11.247690Z  INFO runtime.spawn{kind=task task.name= task.id=18 loc.file="src/bin/initial-example.rs" loc.line=11 loc.col=5}: initial_example: pre-yield fun=true
2024-01-31T09:23:11.247703Z TRACE runtime.spawn{kind=task task.name= task.id=18 loc.file="src/bin/initial-example.rs" loc.line=11 loc.col=5}: tokio::task::waker: op="waker.clone" task.id=2
2024-01-31T09:23:11.247717Z TRACE runtime.spawn{kind=task task.name= task.id=18 loc.file="src/bin/initial-example.rs" loc.line=11 loc.col=5}: tokio::task: exit
2024-01-31T09:23:11.247737Z TRACE tokio::task::waker: op="waker.wake" task.id=2
2024-01-31T09:23:11.247754Z TRACE runtime.spawn{kind=task task.name= task.id=18 loc.file="src/bin/initial-example.rs" loc.line=11 loc.col=5}: tokio::task: enter
2024-01-31T09:23:11.247766Z TRACE runtime.resource{concrete_type="Barrier" kind="Sync" loc.file="src/bin/initial-example.rs" loc.line=9 loc.col=39}: tokio::sync::barrier: enter
2024-01-31T09:23:11.247800Z TRACE runtime.resource{concrete_type="Barrier" kind="Sync" loc.file="src/bin/initial-example.rs" loc.line=9 loc.col=39}:runtime.resource.async_op{source="Barrier::wait" inherits_child_attrs=false}: tokio::util::trace: new

There is a lot of information here, and distinguishing different span types can be complicated (especially when you're scanning through dozens or even hundreds of lines). Additionally, the span::Id is completely absent.

Compare this to the same section of logs, coloured and with the span::Id included right after the span name.

2024-01-31T09:23:39.136879Z  INFO runtime.spawn[2]{kind=task, task.name=, task.id=18, loc.file="src/bin/initial-example.rs", loc.line=18, loc.col=5} initial_example: fun=true pre-yield
2024-01-31T09:23:39.136937Z TRACE runtime.spawn[2]{kind=task, task.name=, task.id=18, loc.file="src/bin/initial-example.rs", loc.line=18, loc.col=5} tokio::task::waker: op="waker.clone", task.id=2
2024-01-31T09:23:39.136995Z TRACE runtime.spawn[2]{kind=task, task.name=, task.id=18, loc.file="src/bin/initial-example.rs", loc.line=18, loc.col=5} exit
2024-01-31T09:23:39.137059Z TRACE tokio::task::waker: op="waker.wake", task.id=2
2024-01-31T09:23:39.137122Z TRACE runtime.spawn[2]{kind=task, task.name=, task.id=18, loc.file="src/bin/initial-example.rs", loc.line=18, loc.col=5} enter
2024-01-31T09:23:39.137212Z TRACE runtime.resource[1]{concrete_type="Barrier", kind="Sync", loc.file="src/bin/initial-example.rs", loc.line=16, loc.col=39} enter
2024-01-31T09:23:39.137296Z TRACE runtime.resource[1]{concrete_type="Barrier", kind="Sync", loc.file="src/bin/initial-example.rs", loc.line=16, loc.col=39} runtime.resource.async_op[274877906945]{source="Barrier::wait", inherits_child_attrs=false} new

Having now justified something I wanted to do anyway, let's build our own custom tracing subscriber!

(actually, it's going to mostly be a Layer)

aside: tracing subscribers and layers

If you're already familiar with tracing, you may wish to skip this section and go straight to ari-subscriber.

In the tracing ecosystem, you need a subscriber to actually do anything other than send your traces into the void: specifically, something that implements the Subscriber trait. A subscriber can do what it wishes with the traces: write them to stdout or to a file, collect them and perform aggregation, or send them to another service (maybe via OpenTelemetry).

The tracing-subscriber crate provides a number of subscriber implementations. From the outside, this mostly looks like different ways to write traces to a file handle. However, the real heart of tracing-subscriber is the registry. The registry is a subscriber which implements a span store and allows multiple layers to connect to it.

What is a Layer? For the longest time I had real trouble understanding conceptually what a layer is. From the documentation, a layer is "a composable abstraction for building Subscribers". However, I struggled to understand how I may wish to compose layers. It's also confusing because layers don't feed into other layers the way that tower layers do (which are like middleware, in that what one layer does affects what the next layer receives).

Instead, think of layers as mini-subscribers. They can take action on some methods on the Layer trait, but can fall back to the default implementation for things that they're not interested in. And Layer has a default implementation for everything.
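To make that concrete, here's a minimal sketch (not part of ari-subscriber, and the PrintEvents name is just for illustration) of a layer that only cares about events and leaves every other Layer method as its default, do-nothing implementation.

use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// A layer that only reacts to events; all other Layer methods keep
// their default implementations.
struct PrintEvents;

impl<S: Subscriber> Layer<S> for PrintEvents {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        println!("event with target: {}", event.metadata().target());
    }
}

If we composed this onto the registry with .with(PrintEvents), it would see every event, while other layers attached to the same registry could do their own thing with the same data.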

Most layers need to store information about spans; this is where the registry comes in (specifically via the LookupSpan trait). Layers can store their own data in the registry in the form of span extensions.

The reason why storing this data in the registry is important may not be immediately obvious.

It's because tracing itself doesn't store this data. Not storing it means tracing never has to allocate for it, which lets it be used in no_std environments as well as on the giant servers and beefy development machines that many of us are accustomed to.

Here's an example for clarity. When a span is created, a Subscriber receives a call to new_span(). This includes the span Attributes, which gives the subscriber access to all the information about that span: its metadata, its field names, and the values of any fields that were set at the time of creation.

This is great, it's everything we could need!

Now let's look at the method that gets called when a span is entered (becomes active): enter(). All it comes with is... a span::Id. No metadata, no field names, and certainly no field values. And this pattern repeats in the trait methods called when a span exits or is closed.

The solution is to use the registry to store whatever data a layer might need about a span later on. This is what allows the fmt::Subscriber to print out the full data for each span in an event's ancestry.
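As an illustration (this is a sketch, not how the fmt::Subscriber actually does it, and the TimingLayer and CreatedAt names are made up), a layer can record something when a span is created and look it up again when the span is entered, even though on_enter() only receives the span::Id.

use tracing::{span, Subscriber};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

// Per-span data that this layer wants to keep around.
struct CreatedAt(std::time::Instant);

struct TimingLayer;

impl<S> Layer<S> for TimingLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        // new_span() gives us everything about the span; stash what we
        // need later in the registry's span extensions.
        let span = ctx.span(id).expect("span must exist in the registry");
        span.extensions_mut().insert(CreatedAt(std::time::Instant::now()));
    }

    fn on_enter(&self, id: &span::Id, ctx: Context<'_, S>) {
        // on_enter() only gives us a span::Id; use it to look the span
        // (and our stored data) up in the registry.
        let span = ctx.span(id).expect("span must exist in the registry");
        if let Some(CreatedAt(instant)) = span.extensions().get::<CreatedAt>() {
            println!("{} entered {:?} after creation", span.name(), instant.elapsed());
        }
    }
}

The extensions_mut() and extensions() calls are backed by the registry's span store; without a registry-like subscriber underneath, there's nowhere for this data to live.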

Now that we understand a bit about what subscribers and layers are, let’s get into implementing some of it!

ari-subscriber

To meet the needs of my use-case, as described above, I've written the ari-subscriber crate. It's currently at version 0.0.1, which indicates that it's probably a bit rough, but so far it's already helped me quickly narrow down the version of Tokio after which yield_now() doesn't get detected as a self wake by Tokio Console.

The “ari” in ari-subscriber is for “async runtime instrumentation”.

The interface is simple: you pass an ari-subscriber layer to the registry:

use tracing_subscriber::prelude::*;
tracing_subscriber::registry()
    .with(ari_subscriber::layer())
    .init();

This will write output to stdout (currently not configurable). And the output will have pretty colours!

Let's look at a simple example of how we can use ari-subscriber. Here's the Rust code we'll be using:

#[tokio::main]
async fn main() {
    // Set up subscriber
    use tracing_subscriber::prelude::*;
    tracing_subscriber::registry()
        .with(ari_subscriber::layer())
        .init();

    // Spawn a task and wait for it to complete
    tokio::spawn(async {
        tracing::info!(fun = true, "pre-yield");
        tokio::task::yield_now().await;
    })
    .await
    .unwrap();
}

We start in an async context (using the #[tokio::main] attribute). First, we set up the ari-subscriber layer with the registry. Then we spawn a task and wait for it to complete. The task emits a tracing event and then returns control to the runtime by calling Tokio's yield_now() function. After that, it ends.

If you've been watching closely (and following all the links I've been sprinkling around), you may have realised that I'm looking at the case described in the issue console#512. What we want to look at is where the wake operation occurs.

We're going to fix our version of Tokio to an old one, where we know that Tokio Console detects awaiting on yield_now() as a self-wake. So let's specify the following in our Cargo.toml:

[dependencies]
ari-subscriber = "0.0.1"
tokio = { version = "=1.22.0", features = ["full", "tracing"] }
tracing = "0.1.40"
tracing-subscriber = "0.3.18"

We set the version of Tokio to =1.22.0; this indicates that we want exactly that version. By default, cargo would take any 1.x version where x is greater than or equal to 22.

Now let's look at the output (truncated a little bit to remove things that we won't be focusing on).

2024-01-30T15:43:24.010351Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} new
2024-01-30T15:43:24.010695Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T15:43:24.010778Z  INFO runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} debugging_tokio_instrumentation: fun=true pre-yield
2024-01-30T15:43:24.010829Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} tokio::task::waker: op="waker.wake_by_ref", task.id=1
2024-01-30T15:43:24.010878Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T15:43:24.010924Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T15:43:24.010962Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T15:43:24.010997Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T15:43:24.011032Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T15:43:24.011065Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} close

Unfortunately it's way too wide to visualise nicely on this website. But let's walk through it.

The date, time, and log level are pretty straightforward. I took the log level colours from the fmt::Subscriber, so those should be familiar.

trace types

All the lines in the output are prefixed with a span named runtime.spawn. Spans with this name instrument tasks, and ari-subscriber colours them green. There are distinct types of instrumentation in Tokio, and they each get their own colour.

  • runtime.spawn spans (green) instrument tasks
  • runtime.resource spans (red) instrument resources
  • runtime.resource.async_op spans (blue) instrument async operations
  • runtime.resource.async_op.poll spans (yellow) instrument the individual polls on async operations
  • tokio::task::waker events (purple) represent discrete waker operations
  • runtime::resource::poll_op events (orange) represent poll state changes
  • runtime::resource::state_update events (pink) represent resource state changes
  • runtime::resource::async_op::state_update events (turquoise) represent async operation state changes

In the case of spans, the value given above is the span name; for events, it is the target.

Describing how each of these traces is used within Tokio and how to interpret them would fill several more posts and I won't go into that topic in more detail here. I already wrote a post on the instrumentation for tasks, which covers the runtime.spawn spans and the tokio::task::waker events. Go read tracing tokio tasks to learn about those!

span events

Now let's get back to the output of ari-subscriber for our test program. The first line ends in new; this is an event representing the creation of a new span. There are equivalent lines for enter, exit, and close, all parts of the span lifecycle. See the span lifecycle section of the post I linked above for a refresher on the lifecycle.

By default, the fmt::Subscriber doesn't output these "span events", but it can be configured to do so with the with_span_events() method on the builder. Currently ari-subscriber always emits these span events, but I may wish to make this configurable in the future to reduce the amount of output.
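For comparison, enabling span events on the fmt::Subscriber looks something like this:

use tracing_subscriber::fmt::format::FmtSpan;

// Ask the fmt::Subscriber to emit events for the whole span lifecycle
// (new, enter, exit, and close).
tracing_subscriber::fmt()
    .with_span_events(FmtSpan::FULL)
    .init();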

analysing wakes

Let's find our wake operation. You'll notice that there is exactly one line at INFO level; this is the one that we added to our spawned task ourselves. After the runtime.spawn span, we see the text:

debugging_tokio_instrumentation: fun=true pre-yield

The first bit (debugging_tokio_instrumentation) is the target, which by default is the same as the module path, so here it's the name of our example application. After the colon come the fields (just one field: fun=true) and finally the message (pre-yield). An event's message is actually just a specially handled field with the name message. This event isn't coloured because it isn't part of the instrumentation that ari-subscriber knows about.
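If you ever need to pull the message out yourself, it's visible through tracing's field visitor API: the message arrives as a field whose name is literally message. A rough sketch (the MessageVisitor type is just for illustration):

use std::fmt;
use tracing::field::{Field, Visit};

// A visitor that separates the `message` field from all the other
// fields recorded on an event.
#[derive(Default)]
struct MessageVisitor {
    message: Option<String>,
    other_fields: Vec<String>,
}

impl Visit for MessageVisitor {
    fn record_debug(&mut self, field: &Field, value: &dyn fmt::Debug) {
        if field.name() == "message" {
            self.message = Some(format!("{value:?}"));
        } else {
            self.other_fields.push(format!("{}={:?}", field.name(), value));
        }
    }
}

Calling event.record(&mut visitor) from inside a layer's on_event() would fill this in.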

The next line is the wake operation (it's purple!). We can see that its target is tokio::task::waker and then it has 2 fields and no message. The fields are op="waker.wake_by_ref" and task.id=1.

Let's start with the second field, task.id=1. This gives the instrumentation ID of the task being woken. The instrumentation ID of a task is not the Tokio task::Id, but rather the tracing span::Id of the span which instruments that task. That value is the one that appears in brackets after the span name runtime.spawn (e.g. [1]). This is a bit confusing, because the runtime.spawn span also has a field called task.id, but that one refers to the Tokio task ID. The important point here is that our span IDs match (both 1), so this operation is being performed from within the task that it is affecting.

The operation wake_by_ref indicates that the task is being woken using a reference to the waker. This operation doesn't consume the waker - which is important when Tokio Console counts the number of wakers for a given task to make sure that it hasn't lost all its wakers.

With this information, we can now manually ascertain that this is a self-wake operation. We are waking a task while running within that task.
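This is, roughly, the shape of the future behind yield_now() in this version of Tokio: wake your own task by reference, return Pending once, and complete on the next poll. A simplified sketch (not Tokio's actual implementation):

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// A simplified yield-once future. On the first poll it wakes its own
// task (without consuming the waker) and returns Pending; on the next
// poll it completes.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Waking from inside the task being polled: a self wake.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}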

what happens next

Let's change our version of Tokio to the latest (at the time of writing), 1.35.1.

tokio = { version = "=1.35.1", features = ["full", "tracing"] }

And now run exactly the same example. The output is below (truncated in the same way as before).

2024-01-30T16:00:09.484496Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} new
2024-01-30T16:00:09.484798Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T16:00:09.484867Z  INFO runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} debugging_tokio_instrumentation: fun=true pre-yield
2024-01-30T16:00:09.484930Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} tokio::task::waker: op="waker.clone", task.id=1
2024-01-30T16:00:09.484998Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T16:00:09.485073Z TRACE tokio::task::waker: op="waker.wake", task.id=1
2024-01-30T16:00:09.485150Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T16:00:09.485208Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T16:00:09.485261Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} enter
2024-01-30T16:00:09.485313Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} exit
2024-01-30T16:00:09.485361Z TRACE runtime.spawn[1]{kind=task, task.name=, task.id=18, loc.file="src/main.rs", loc.line=10, loc.col=5} close

It might not be immediately obvious, but that output is one line longer than the previous one. What jumps out is probably that we can now see a wake operation without scrolling to the right. But first, let's check what happens above that.

Directly below our own fun=true pre-yield event line, we see that there is still a tokio::task::waker event and it is still operating on the same task (the same task that we are currently inside), the one with the task instrumentation ID of 1. However, this isn't a wake operation; instead it has the field value op="waker.clone". Somewhere, the waker for that task is being cloned.

Straight afterwards we see that the span exits - which means that the call to poll on that task has returned. After that, the task is woken. We see that the operation is waker.wake instead of waker.wake_by_ref, which means that the waker is consumed (this makes sense, as it was cloned before). More important than all of that, though, is that this wake operation isn't inside the runtime.spawn span for that task; in fact, it isn't inside any span at all, runtime.spawn or otherwise.

This confirms what could be observed in Tokio Console: the instrumentation indicates that this is not a self wake!

what changed?

The reason for this change is the PR tokio#5223 (in Tokio itself). This PR changes the behaviour of yield_now() to defer the wake. When a task yields to the runtime in this way, it is immediately ready to be polled again (that's the whole point). Any other task which is ready will get precedence to be polled first (except under some specific conditions involving the LIFO slot). However, the scheduler won't necessarily poll the resource drivers; this means that a task that is always ready could starve the resource drivers, despite doing its best to be well behaved by yielding regularly to the runtime.

The PR changes the behaviour to defer waking tasks which call yield_now() until after polling the resource drivers, avoiding the starvation issue.

After some discussion on console#512, we decided that it's OK that Tokio Console can't detect this specific case of self wakes, since the PR on Tokio made them much less likely to result in some performance issue - something which may still occur from other futures self waking.

And that's how I managed to use my very basic subscriber crate to answer a question quicker thanks to pretty colours.

should I use ari-subscriber?

Now that we've had a bit of a look at ari-subscriber, the big question is, should anyone be using it?

The answer is no.

Aside from missing a bunch of useful features, ari-subscriber currently does a lot of things "the easy way", which is not very performant. I know how to make it more performant, but I promised myself I'd write this post before doing any more work on the crate.

Unless you too are trying to debug the instrumentation built into Tokio, you're much better off using the fmt::Subscriber from the tracing-subscriber crate.

If you are debugging that instrumentation, please come and say hi! I'd be really interested to hear what you're doing and I might even be able to help.

thanks

Thanks to Luca Palmieri and One for feedback on the post!

feedback?

Got feedback on this post?

I'd love to hear from you!

Get in touch on Mastodon @hds@hachyderm.io or by email.