hēg denu

yet another site by Hayden Stainsby

how I finally understood async/await in Rust (part 1)

Tuesday, 30 May 2023

Like an increasing number of Rustaceans, I came to Rust after async/await had been stabilised.

(strictly speaking this isn't true, but close enough)

I didn't understand what was happening behind those magic keywords.

async

.await

This made it hard for me to grasp why I had to do things in a certain way.

I wanted to share how I finally understood async Rust.

We'll do this via a series of questions that I had.

We'll explore them in a series of posts.

These were originally going to all go in one post.

But it turned out to be a bit big.

Also, these topics might change order.

Or I might add more.

But first, I'll answer the question on everyone's mind...

why are you writing another beginners guide to async/await?

There are many different beginner guides on async Rust available.

Why am I writing yet another one?

Personally, nothing I read about async/await made everything click.

This is common.

Different people learn things in different ways.

Maybe no one has written the guide that allows you to understand.

Or maybe there's some piece you don't quite get.

Then reading one more guide fills in that gap so that you do get it.

This post describes how I finally understood.

If it helps you get there, great.

If it doesn't, maybe it's a stepping stone on the way.

Some other guides that I particularly liked are:

why doesn’t my task do anything if I don’t await it?

Let's get stuck in.

Perhaps the first hurdle newcomers to async Rust meet is that nothing happens.

Here's an example.

tokio::time::sleep(std::time::Duration::from_millis(100));

This will not sleep.

The compiler will immediately warn you that this won't work.

   = note: futures do nothing unless you `.await` or poll them

Right, let's fix it.

tokio::time::sleep(std::time::Duration::from_millis(100)).await;

Now we're sleeping!
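
If you want to try it yourself, here's a minimal complete program.

(just a sketch, assuming the tokio crate with its time and macros features enabled)

#[tokio::main]
async fn main() {
    // without the `.await`, the sleep future would be constructed and immediately dropped
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    println!("slept for 100 ms");
}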

That's all well and good.

But why?

Normally when I call a function, the contents get executed.

What is so special about that async keyword that all my previous experience must be thrown away?

To answer this question, let's look at perhaps the simplest async function.

(the simplest that does something)

Generally, guides start with an async function that calls some other async function.

This is simple.

But we want simpler.

We want: hello world.

the simplest async function

We're going to start with an async function that doesn't do anything async.

This might seem silly.

But it will help us answer our question.

async fn hello(name: &'static str) {
    println!("hello, {name}!");
}

We can then call this function in a suitable async context.

#[tokio::main]
async fn main() {
    hello("world").await;
}

Our output is as expected.

hello, world!

But what does this function actually do?

We know that if we remove the .await, nothing happens.

But why?

Let's write our own future that does this.

Specifically, we'll be implementing the trait std::future::Future.

What's a future?

aside: futures

A future represents an asynchronous computation.

It is something you can hold on to until the operation completes.

The future will then generally give you access to the result.

It's a common name for this concept in many different programming languages.

The name was proposed in 1977 apparently, so it's not new.

(read Wikipedia for more gory details)

What we need to know is that a future is what you give an async runtime.

You do this by .awaiting it.

Then the runtime gives you back the result.
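
As a tiny sketch, here's an async function that produces a value, and how awaiting it hands that value back.

(add_one is a made-up example, not something from the standard library)

async fn add_one(x: u32) -> u32 {
    x + 1
}

#[tokio::main]
async fn main() {
    // awaiting the future gives us the result of the computation
    let two = add_one(1).await;
    assert_eq!(two, 2);
}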

the simplest future

Let's write our simple async function as a future.

A future generally has multiple states.

In fact, most futures are "basically" state machines.

The state machine is driven through its states by the async runtime.

At a minimum we want 2 states, we'll call them Init and Done.

The Init state is what the future starts in.

The Done state is where the future goes once it's complete.

That's simple.

So we'll model our future as an enum.

In the Init state, we need to keep the parameters that would be passed to the async function.

enum Hello {
    Init { name: &'static str },
    Done,
}

state machine diagram

As I said above, a future is a state machine.

So let's draw a state machine!

State machine of our hello world future.

We'll see this in code shortly in implementing poll.

In short, our Hello enum is constructed into the Init state.

Something calls poll() on it, transitioning it to the Done state.

(more about poll() below)

That's it.

The object can now only be dropped.

(deconstructed)

This might not make sense yet.

But it should help in understanding the code.

This isn't a future yet, so we can't await it.

To fix that, we need to implement the Future trait.

aside: the easy bits of the Future trait

We're just going to look at the easy bits of the std::future::Future trait.

First, let's look at the trait:

pub trait Future {
    type Output;

    // Required method
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

The Future trait has an associated type that defines the output.

This is what an async function would return.

Our hello function doesn't return anything, so let's skip it for now.

There is one required method that takes a mutable reference to self and a context.

The reference to self is pinned.

We don't need to understand pinning for now.

So just think of it like any other &mut self.

We also don't need the context for now, so we'll skip that too.

The poll method returns a std::task::Poll enum.

The method should return Pending if it still has work to do.

(there are other things needed when returning Pending)

(but we can - yep, you guessed it - skip them for now)

When the future has a value to return, it returns Ready(T).

Here, T is the Future's associated type Output.

We don't actually need a value here either, so if you don't understand T, don't worry.
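
For reference, Poll itself is a tiny enum.

(shown here slightly simplified)

pub enum Poll<T> {
    Ready(T),
    Pending,
}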

Skipping over the hard bits makes this easier.

implementing poll

Let's look at the implementation.

use std::{future::Future, pin::Pin, task::{Context, Poll}};

impl Future for Hello {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        match *self {
            Hello::Init { name } => println!("hello, {name}!"),
            Hello::Done => panic!("Please stop polling me!"),
        };

        *self = Hello::Done;
        Poll::Ready(())
    }
}

Let's go through this bit by bit.

Our async function doesn't return a value.

This means that Output is the unit type ().

This is, in reality, also what our async function returns.

Onto the implementation of poll.

We match on *self.

(remember, Hello is an enum)

If we're in the initial state Init then print out hello, {name}!.

This is the body of our async function.

If we're in the Done state, we panic.

(more on this shortly)

After our match statement, we set our state to Done.

Finally, we return Ready(()).

(that means Ready with a unit type as the value)

(remember that a function that doesn't return anything, actually returns the unit type)

In a moment, we'll look at how to use our new future.

But first, we have a couple of topics pending.

(pun absolutely, 100%, intended)

What about Poll::Pending? And what about that panic!?

pending futures

This future is very simple.

It will become ready on the first poll.

But what if that isn't the case?

That's where Poll::Pending is used.

We'll look at how to use Pending at a later date.
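
Just as a taste, here's a minimal sketch of a future that returns Pending exactly once.

(YieldOnce is a name I made up; it wakes itself straight away so the runtime knows to poll it again)

use std::{future::Future, pin::Pin, task::{Context, Poll}};

struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // ask to be polled again, otherwise nothing would ever wake us
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

You would await it like any other future: YieldOnce { yielded: false }.await.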

future panics

Wait!

What about that panic?

A future is a "one shot" object.

Once it completes - returns Ready(T) - it must never be called again.

This is described in the Panics section of the documentation for this trait.

The trait doesn't require that the future panic.

But it's good practice when you start out, as it will quickly catch some logic errors.

using our future

We need to construct our new future to be able to use it.

Let's wrap it up in a function like our async function.

fn hello(name: &'static str) -> impl Future<Output = ()> {
    Hello::Init { name }
}

The first thing we note about this function is that it isn't marked async.

Because we're returning a "hand made" future, we can't use the async keyword.

Instead we return impl Future<Output = ()>.

Translation: an object that implements the Future trait with the associated type Output being the unit type.

We could also expose our custom future and return Hello directly.

(this works the same, because Hello implements the Future trait)
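
That version would look like this.

fn hello(name: &'static str) -> Hello {
    Hello::Init { name }
}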

What about the body of the function?

We construct the Init variant of our enum and return it.

Now it's starting to become clear why an async function doesn't do anything if you don't await it.

We're not doing anything!

Just constructing an object, nothing else gets run.

So let's call our future.

We can't call poll().

We don't have a Context to pass to it.

(we could create a Context, but that's a story for another day)

(remember we want to understand how async/await works for the user, not for the async runtime)

Luckily, the await keyword works just fine on "hand made" futures.

(this is what the compiler creates under the hood, after all)

So let's await our future!

Here's our main function.

(note that it is async, we must always be in async context to await a future)

#[tokio::main]
async fn main() {
    hello("world").await;
}

Well, that's boring.

It's exactly the same as when hello() was an async function.

What about the output?

hello, world!

Also exactly the same.

This might seem like a bit of an anti-climax.

But remember, you've just written your first custom future in Rust!

(or maybe your second future, or your hundredth)

(have you really written 100 futures? nice!)

sequence diagram

Here's a sequence diagram of our Hello future.

Sequence diagram of our hello world async function.

It looks wrong that our async main() function is calling poll() directly.

But remember that main is also being poll()ed!

So it has everything needed to call poll() on Hello.

(specifically the context, but we're not worrying about the context)

I'll be creating similar sequence diagrams for each of the futures we write in this series.

Hopefully that will help tie the different concepts together.

task scheduled time in console

Wednesday, 26 April 2023

Update (2023-05-10): A new version of tokio-console is available including this change!

Last week we merged a small set of cool changes in Tokio Console.

They added support for tracking and displaying per task scheduled time.

For the impatient amongst you, here's a screenshot.

Screenshot of tokio-console displaying the task details for the burn task in the app example.

But what is a task's scheduled time?

Actually, let's first make sure we're all on the same page.

What's a task?

tasks

We're discussing async Rust here.

So when I say task, I'm talking about it in that context.

From the Tokio documentation for tokio::task:

A task is a light weight, non-blocking unit of execution.

A task is like a thread, but lighter-weight.

The asynchronous runtime (e.g. Tokio) is responsible for scheduling tasks across its workers.

We won't go much more into detail than that for now.
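
To make that concrete, here's a small sketch of spawning a task with Tokio.

#[tokio::main]
async fn main() {
    // `spawn` hands this future to the runtime as a new task
    let handle = tokio::spawn(async {
        // this runs concurrently with the rest of `main`
        1 + 1
    });

    // the JoinHandle is itself a future, awaiting it gives us the task's result
    let result = handle.await.unwrap();
    assert_eq!(result, 2);
}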

scheduled time

We often think of a task as having 2 states.

Busy: when it's being executed.

Idle: when it's waiting to be executed.

Let's look at an example of a task moving between these two states.

Time-status diagram showing 1 task in one of 2 states: idle, busy.

We see that the task is either idle or busy.

When a task stops doing work, it yields to the runtime.

This is usually because it is awaiting some other future.

Although it could also voluntarily yield.

(in Tokio this is done by calling tokio::task::yield_now().await)
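
For example, a task doing a lot of CPU work might sprinkle in a voluntary yield.

(a sketch: crunch_numbers and expensive_step are made-up names)

async fn crunch_numbers(items: Vec<u64>) -> u64 {
    let mut total = 0;
    for (i, item) in items.iter().enumerate() {
        total += expensive_step(*item);
        // every so often, give the scheduler a chance to run other tasks
        if i % 1000 == 0 {
            tokio::task::yield_now().await;
        }
    }
    total
}

fn expensive_step(item: u64) -> u64 {
    // stand-in for some CPU heavy work
    item.wrapping_mul(31)
}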

When a task yields to the runtime, it needs to be woken up.

Tasks get woken by a waker.

(tautologies galore)

We're not going to get into the mechanics of wakers today.

Enough to know that when a task is woken, it is ready to work.

But the runtime might not have a worker to place it on.

So there could be some delay between when a task is woken and when it becomes busy.

This is the scheduled time.

Time-status diagram showing 1 task in one of 3 states: idle, scheduled, busy.

Why is this interesting?

Let's have a look.

scheduling delays

Let's look at a case with 2 tasks.

To make things simple, we'll suppose a runtime with only 1 worker.

This means that only a single task can be busy at a time.

Here's a time-status diagram of those 2 tasks.

Time-status diagram showing 2 tasks, each in one of 2 states: idle, busy. There is no point at where both tasks are busy at the same time.

Nothing looks especially wrong.

(there is one thing, but we don't want to get ahead of ourselves).

But perhaps the behaviour isn't everything we want it to be.

Perhaps Task 2 is sometimes taking a long time to respond to messages from a channel.

Why is this?

We don't see it busy for long periods.

Let's include the scheduled time.

Time-status diagram showing 2 tasks, each in one of 3 states: idle, scheduled, busy. There is no point at where both tasks are busy at the same time. There is one moment when task 1 is busy for a long time and during part of that time, task 2 is scheduled for longer than usual.

Now something does jump out at us.

While task 1 is busy, task 2 is scheduled for a lot longer than usual.

That's something to investigate.

It also makes it clear that task 1 is blocking the executor.

That means that it's busy for so long that it doesn't allow other tasks to proceed.

Bad task 1.

That's the thing that a trained eye might have caught before.

But we don't all benefit from trained eyes.

scheduled time in the console

Tokio console doesn't have these pretty time-status diagrams.

Yet, at least.

But you can now see the scheduled time of each task.

Tokio console showing the task list view. There is a column labelled Sched for the scheduled time.

And sort by that column too.

Let's look at the task with the highest scheduled time, task2.

Tokio console showing the task detail view. There are 2 sets of percentiles and histograms. The top one is for poll (busy) times, the bottom one is for scheduled times.

It's quickly clear that task2 spends most of its time "scheduled".

Exactly 61.34% of its time when this screenshot was taken.

We can also see that during most poll cycles, task2 spends more than 1 second scheduled.

And at least once, over 17 seconds!

How about we have a look at a more typical scheduled times histogram?

Let's look at the task details for the burn task that we saw at the beginning.

Tokio console showing the task detail view for the burn task. The scheduled times histogram is more as we'd expect, clustered around the lower end.

Here we see that the scheduled times are more reasonable.

Between 22 and 344 microseconds.

(by the way, this example app is available in the console repo)

Of course, maybe 17 seconds is just fine in your use case.

But with Tokio console, you now have that information easily available.

availability

(updated 2023-05-10)

The scheduled time feature has been released!

To use it, you need at least tokio-console 0.1.8 and console-subscriber 0.1.9.
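
If you haven't wired it up before, the application side is a one-liner with console-subscriber.

(a minimal sketch: check the console docs for the full setup, including building with the tokio_unstable cfg flag)

#[tokio::main]
async fn main() {
    // registers the console-subscriber layer so tokio-console can connect to this app
    console_subscriber::init();

    // ... the rest of your application
}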

topology spread constraints and horizontal pod autoscaling

Monday, 20 March 2023

I'm back with more gotchas involving topology spread constraints.

This post is effectively part 2 of something that was never intended to be a series.

Part 1 explored topology spread constraints and blue/green deployments.

To be really honest, both these posts should include an extra qualifier:

"in kubernetes on statefulsets with attached persistent volumes."

But that makes the title too long, even for me.

What's wrong this time?

the setup

Let's review the setup, in case you haven't read part 1.

(it's recommended, but not required reading.)

(but if you don't know what a topology spread constraint is, then maybe read it.)

(or at least read aside: topology spread constraints.)

The app in question is powered by a statefulset.

Each pod has a persistent volume attached.

(hence the statefulset.)

This time we're talking about production.

In production, we scale the number of pods based on the CPU usage of our web server container.

This is known as horizontal pod autoscaling.

aside: horizontal pod autoscaling

We're going to use kubernetes terminology.

In kubernetes a pod is the minimum scaling factor.

If you're coming from standard cloud world, it's sort of equivalent to an instance.

(but with containers and stuff.)

If you're coming from a data center world (on- or off-site) then it's a server.

(but with virtualization and stuff.)

Traditionally you can scale an application in 2 directions.

Vertically and horizontally.

Vertical scaling means making the pods bigger.

More CPU.

More Memory.

That sort of thing.

Horizontal scaling means adding more pods.

Kubernetes has built in primitives for horizontal pod autoscaling.

The idea is that you measure some metric from each pod, averaged across all pods.

(or more than one, but we won't go into that.)

And then you give Kubernetes a target value for that metric.

Kubernetes will then add and remove pods (according to yet more rules) to meet the target value.

Let's imagine we are scaling on pod CPU.

We have two pods and together their average CPU is well over the target.

Two pods shown with their CPU usage as a bar graph. The target value is shown as a horizontal line. The average of the two pods is shown as another horizontal line which is significantly above the target line.

As the average is significantly above the target, Kubernetes will scale up.

Now we have 3 pods, each with less CPU than before.

Three pods shown with their CPU usage as a bar graph. The target value is shown as a horizontal line. The average of the three pods is shown as another horizontal line which is now below the target line.

The average of the CPU usage of the 3 pods is now a little less than the target.

This is enough to make the horizontal pod autoscaler happy.
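
If you like seeing the arithmetic, the core of the calculation is roughly this.

(a simplified sketch: the real autoscaler also applies tolerances and stabilization windows, and the numbers below are made up)

fn desired_replicas(current_replicas: u32, current_metric: f64, target_metric: f64) -> u32 {
    // scale the replica count by how far the measured average is from the target
    (current_replicas as f64 * (current_metric / target_metric)).ceil() as u32
}

fn main() {
    // 2 pods averaging 90% CPU against a 60% target -> scale up to 3 pods
    assert_eq!(desired_replicas(2, 90.0, 60.0), 3);
}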

Of course, all this assumes that your CPU scales per request and that more pods means fewer requests per pod.

Otherwise, this is a poor metric to scale on.

the problem

We previously had issues with pods in pending state for a long time (see part 1).

So we added monitoring for that!

Specifically an alarm when a pod had been in pending state for 30 minutes or more.

This is much longer than it should be for pods that start up in 8 - 10 minutes.

The new alert got deployed on Friday afternoon.

(Yes, we deploy on Fridays, it's just a day.)

And then the alert was triggering all weekend.

First reaction was to ask how we could improve the alert.

Then we looked into the panel driving the alerts.

There were pods in pending state for 40 minutes.

50 minutes.

65 minutes!

What was going on?

the investigation

Looking at the metrics, we saw a pattern emerging.

A single pod was in pending state for a long period.

Then another pod went into pending state.

Shortly afterwards, both pods were ready.

It looked like the following.

A time/status chart showing pending and ready states (or inactive) for 4 pods. The first 2 pods are ready for the whole time period. The 3rd pod goes into pending for a long time. Then the 4th pod goes into pending, followed shortly by the 3rd pod becoming ready and then the 4th pod becoming ready as well.

(Actually the panel didn't look like that at all.)

(What we have available is somewhat more obtuse.)

(But that is how I wish the panel looked.)

This same pattern appeared several times in the space of a few hours.

Oh right, of course.

It's the topology spread constraints again.

It all has to do with how a statefulset is scheduled.

aside: pod management policy

The pod management policy determines how statefulset pods are scheduled.

It has 2 possible values.

The default is OrderedReady.

This means that each pod waits for all previous pods to be scheduled before it gets scheduled.

A time/status chart showing pending and ready states for 4 pods, numbered 0 to 3. Pod-0 starts in pending and then moves to ready. Each subsequent pod starts pending when the preceding pod is ready, then moves to ready itself some time later.

The other option is Parallel.

In this case, pods are all scheduled together.

A time/status chart showing pending and ready states for 4 pods, numbered 0 to 3. All pods start in pending and then move to ready at more or less the same time.

That's like a deployment.

Some applications require the ordering guarantees of OrderedReady.

However, it makes deployment N times slower if you have N pods.

We have no need of those ordering guarantees.

So we use the Parallel pod management policy.

the answer

Now that we have all the context, let's look at what happens.

Let's start a new statefulset with 4 pods.

We set the maximum skew on our zone topology spread constraints to 1.

At most we can have a difference of 1 between the number of pods in zones.

Our 4 pods are distributed as evenly as possible across our 3 zones.

Three zones containing 4 pods, each pod has a volume with the same index as the pod. Zone A contains pod-0/vol-0 and pod-2/vol-2. Zone B contains pod-1/vol-1. Zone C contains pod-3/vol-3.

So far, so good.

We don't have so much load right now, so let's scale down to 2 pods.

Three zones containing 2 pods, each pod has a volume with the same index as the pod. There are 2 additional volumes without pods. Zone A contains pod-0/vol-0 and vol-2 without a pod. Zone B contains pod-1/vol-1. Zone C contains vol-3 without a pod.

After scaling down, we don't remove the volumes.

The volumes are relatively expensive to set up.

The 8 - 10 minutes that pods take to start is mostly spent preparing the volume.

Downloading data, warming up the on-disk cache.

A pod with a pre-existing volume starts in 1 - 2 minutes instead.

So it's important to keep those volumes.

Now let's suppose that the load on our stateful set has increased.

We need to add another pod.

So let's scale one up.

Three zones containing 2 pods which are ready and 1 which is pending, each pod has a volume with the same index as the pod. There is 1 additional volume without a pod. Zone A contains pod-0/vol-0 and pod-2/vol-2, but pod-2 can't be scheduled and is in pending state. Zone B contains pod-1/vol-1. Zone C contains vol-3 without a pod.

Now we hit our problem.

The next pod, pod-2, has a pre-existing volume in Zone A.

But if a new pod is scheduled in Zone A, we'll have 2 pods in Zone A.

And no pods in Zone C.

That's a skew of 2, greater than our maximum of 1.

So the scheduler basically waits for a miracle.

It knows that it can't schedule pod-2 because it would breach the topology spread constraints.

And it can't schedule a pod in Zones B or C because vol-2 can't be attached.

So it gives up and tries again later.

And fails.

And gives up and tries again later.

And fails.

And this would go on until the end of time.

Except just then, a miracle occurs.

The load on our stateful set increases even more.

We need another pod!

Now we can schedule pod-3 in Zone C.

And schedule pod-2 in Zone A.

(where we've been trying to schedule it for what feels like aeons.)

And our skew is 1!

Three zones containing 2 pods which are ready and 2 which can now move to ready, each pod has a volume with the same index as the pod. Zone A contains pod-0/vol-0 and pod-2/vol-2, the latter has just been successfully scheduled. Zone B contains pod-1/vol-1. Zone C contains pod-3/vol-3 which has just been successfully scheduled.

And this is why we saw a single pod pending for a long time.

And then 2 pods go ready in quick succession.

the solution

Unfortunately I don't have such a happy ending as in part 1.

One solution would be to use the OrderedReady pod management policy.

But that's impractical due to the long per-pod start up time.

So the solution is to loosen the constraint.

Allow the topology spread constraints to be best effort, rather than strict.

the ideal solution

Ideally, I want a new pod management policy.

I'd call it OrderedScheduled.

Each pod would begin scheduling as soon as the previous pod was scheduled.

A time/status chart showing pending and ready states for 4 pods, numbered 0 to 3. Pod-0 starts in pending and then moves to ready. Each subsequent pod starts pending shortly after the preceding pod starts pending, then moves to ready itself some time later. All the pods will coincide in pending state.

That way, pods are scheduled in order.

So scaling up and down won't breach the topology spread constraints.

Of course, this is just an idea.

There are probably many sides that I haven't considered.

topology spread constraints and blue/green deployments

Saturday, 25 February 2023

This week I discovered a flaw in our topology spread constraints.

It’s weird enough (or at least interesting enough) that I think it’s worth sharing.

This is kubernetes by the way.

And I'm talking about the topologySpreadConstraints field.

(Did you know that k8s is just counting the 8 missing letters between the k and the s?)

(I didn’t until I’d been using k8s in production for 2 years already.)

the setup

The app in question is powered by a statefulset.

Each pod has a persistent volume attached.

(Hence the statefulset.)

In our staging environment, we only run 2 pods.

Request volume is minimal and there’s no need to burn cash.

the problem

Since the day before, around 19:00, one of the two pods was always in state Pending.

It couldn’t be created due to a bunch of taint violations and other stuff.

Normal   NotTriggerScaleUp  4m38s (x590 over 109m)  cluster-autoscaler  (combined from similar events): pod didn't trigger scale-up: 1 in backoff after failed scale-up, 6 node(s) had taint {workload: always-on}, that the pod didn't tolerate, 3 node(s) had taint {workload: mog-gpu}, that the pod didn't tolerate, 3 node(s) had taint {workload: mog}, that the pod didn't tolerate, 3 node(s) had taint {workload: default-arm}, that the pod didn't tolerate, 3 node(s) had taint {workload: nvidia-gpu}, that the pod didn't tolerate, 4 node(s) had volume node affinity conflict, 1 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {workload: prometheus}, that the pod didn't tolerate 

(That "code block" is big, you might have to scroll for a while to see the end.)

I don’t fully understand what all the other stuff is, I’m a kubernetes user, not an administrator.

The staging cluster runs on spot instances.

With one pod permanently Pending, we sometimes had the other pod evicted.

This left us with zero pods.

Zero pods is bad for availability.

Our on-call engineer got woken up at 04:00.

They then got woken up again at 09:00.

Because they’d already been woken up at 04:00, so they were trying to sleep in.

Being woken up in the middle of the night for the staging environment is actually the bigger problem here.

But it’s not the interesting problem.

The interesting problem is why only one of two pods could be started.

the investigation

Here we get into the good stuff.

I asked our platform team why one of our pods was always in pending state.

They discovered that it was because the 2 persistent volumes were in the same availability zone.

This is in violation of the topology spread constraints that we had specified.

That's this part of the NotTriggerScaleUp message:

1 node(s) didn't match pod topology spread constraints

aside: topology spread constraints

Topology spread constraints are the current best way to control where your pods are allocated.

Stable in kubernetes 1.24.

(I think, please tell me if this is wrong, couldn't find better information.)

This is better than pod affinity and anti-affinity rules which were already available.

A topology spread constraint allows you to say things like:

"I want my pods to be spread across all availability zones equally."

That phrase isn't precise enough, we need to work on it.

"I want my pods to be spread across all availability zones equally."

To properly define my pods we use labels.

So, my pods would actually be, all the pods with the label app=my_app.

Or you can use multiple labels.

Equally is not always possible.

Instead, I'll ask for a maximum skew=1.

The skew is the difference between the availability zone with the most pods and the one with the fewest.

If I have 5 pods, they could be allocated as A: 1, B: 2, C: 2.

Example of 5 pods allocated across three zones. 1 in zone A, 2 in zone B, and 2 in zone C.

Maximum number of pods per zone is 2, minimum is 1.

The skew is 2 - 1 = 1, which is within our maximum skew=1.

They can't be allocated as A: 1, B: 1, C: 3.

Example of 5 pods allocated across three zones. 1 in zone A, 1 in zone B, and 3 in zone C.

Then the skew is 3 - 1 = 2.

And 2 is greater than our desired skew=1.
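
The calculation itself is trivial; here it is as a tiny sketch, using the allocations from the examples above.

fn skew(pods_per_zone: &[u32]) -> u32 {
    // skew = fullest zone minus emptiest zone
    let max = pods_per_zone.iter().max().copied().unwrap_or(0);
    let min = pods_per_zone.iter().min().copied().unwrap_or(0);
    max - min
}

fn main() {
    assert_eq!(skew(&[1, 2, 2]), 1); // allowed with maximum skew=1
    assert_eq!(skew(&[1, 1, 3]), 2); // not allowed
}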

As well as availability zone, we could use hostname (which means node).

back to the investigation

Now we know about topology spread constraints, we can continue.

The persistent volume for each pod is created new for each deployment.

So how could both persistent volumes be in the same availability zone?

Our topology spread constraint would not have allowed it!

Then another member of the platform team asked:

"With blue/green deployments, I see a topology spread constraint that would be shared between multiple statefulsets?"

"Hm, or maybe not. Those are probably restricted to the single statefulset."

Then came the forehead slapping moment.

And I went searching in our Helm chart to confirm what I already knew.

Our topology spread constraints are like the example above.

They apply to all pods with the label app=my_app.

This makes sense, as in the same namespace we have separate deployments/statefulsets.

So we have topology spread constraints for app=my_app, app=other_app, etc.

But we do blue/green deployments!

aside: blue/green deployments

Blue/green is a zero-downtime deployment technique.

You start with a single current deployment.

This is your blue deployment, all the user requests go there.

Because it's the only deployment.

Blue/green deployment step 1: 2 blue pods receiving user requests.

Now you add another deployment.

This is your green deployment, it isn't receiving any user requests yet.

Blue/green deployment step 2: 2 blue pods receiving user requests. 2 green pods not receiving user requests

Once you're happy that the new (green) deployment is healthy, switch user requests.

Now the green deployment is receiving all the user requests.

Blue/green deployment step 3: 2 blue pods not receiving user requests. 2 green pods receiving user requests

Let's assume everything is going fine.

The green deployment is handling user requests.

Now we can remove our blue deployment.

Blue/green deployment step 4a: 2 green pods receiving user requests.

However, if there is a problem, we can quickly route our user requests back to the blue deployment.

The next image is just step 2 again.

It's that easy.

Blue/green deployment step 2: 2 blue pods receiving user requests. 2 green pods not receiving user requests

Now we can safely work out what's wrong with our new (green) deployment before trying again.

the answer

A pod has to be able to attach to a persistent volume.

We use WaitForFirstConsumer volume claims, so the volume isn't created until after the pod has a node.

And the pod will be allocated so that it conforms to the topology spread constraints.

But during deployment, we have more than just the 2 pods present.

Here's what happened.

We have our current deployment (blue).

Two pods, happily obeying the topology spread constraints.

We have 3 availability zones.

This means that we must have 2 zones with a single pod each, and 1 zone without any.

Blue deployment: 1 blue pod with persistent volume in zone A, another blue pod with persistent volume in zone B. There is nothing in zone C.

Now we add two pods from our new deployment (green).

All pods have the label app=my_app.

So the topology constraints apply to the pods from both deployments together.

This means that we must have 2 zones with a single pod each, and 1 zone with 2 pods.

Which is perfectly legal under our topology spread constraints.

Blue and green deployments: 1 blue pod with persistent volume in zone A, another blue pod with persistent volume in zone B. 2 green pods with respective persistent volumes in zone C.

Then we finish our deployment.

The new deployment (green) becomes the current one.

The previously current deployment is removed.

Leaving us with two pods in a single zone.

Green deployment: 2 green pods with respective persistent volumes in zone C. There is nothing in zones A and B.

This is all fine, until a pod gets evicted.

Our staging cluster runs on spot instances.

So pods get evicted a lot.

This is great to find latent availability problems with your deployments.

Which is how we got here in the first place.

If a pod gets evicted, the volume stays in the same zone.

Green deployment: 1 green pod with persistent volume in zone C, another orphaned persistent volume in zone C without a pod. There is nothing in zones A and B.

Now kubernetes tries to schedule the pod.

It has to go in the zone with the volume it needs to attach to.

That's what this part of the NotTriggerScaleUp message means:

4 node(s) had volume node affinity conflict

But it can't, because we already have the only pod in that zone.

So our current skew is 1 - 0 = 1.

If we put another pod in that zone, our skew will become 2 - 0 = 2.

And a skew of 2 isn't allowed!

the fix

The fix is relatively simple.

Each pod already has a label with the deployment identifier in it, release=depl-123.

So we include this in our topology spread constraints.

Then it will apply to all pods that match both labels app=my_app,release=depl-123.

And the topology spread constraints will only apply to pods across a single deployment.

At the point where both blue and green deployments are active, the pods could be occupying only zones A and B.

Blue and green deployments: 1 blue pod with persistent volume in zone A, another blue pod with persistent volume in zone B. 1 green pod with persistent volume in zone A, another green pod with persistent volume in zone B. Zone C is empty.

Now we remove the blue deployment.

And the green deployment still adheres to the topology spread constraints.

Green deployment: 1 green pod with persistent volume in zone A, another green pod with persistent volume in zone B. Zone C is empty.

We've solved another hairy problem.

Improving the sleep cycle of our on-call engineer.

track caller

Sunday, 29 January 2023

I've recently contributed a little bit to the Tokio project.

The first issue I worked on was #4413: polish: add #[track_caller] to functions that can panic.

Now I'm going to tell you everything you didn't know that you didn't need to know about #[track_caller].

what

Before Rust 1.42.0, error messages from calling unwrap() weren't very useful.

You would get something like this:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /.../src/libcore/macros/mod.rs:15:40

This tells you nothing about where unwrap panicked.

This was improved for Option::unwrap() and Result::unwrap() in Rust 1.42.0.

More interestingly, the mechanism to do this was stabilised in Rust 1.46.0.

What was this mechanism? The track_caller attribute.

how

Where would you use #[track_caller]?

Imagine you're writing a library, it's called track_caller_demo.

Here's the whole thing:

/// This function will return non-zero values passed to it.
/// 
/// ### Panics
/// 
/// This function will panic if the value passed is zero.
pub fn do_not_call_with_zero(val: u64) -> u64 {
    if val == 0 {
        panic!("We told you not to do that");
    }

    val
}

We have been quite clear - you MUST NOT pass zero to this function.

Now along comes a user of your crate and writes this code:

use track_caller_demo::do_not_call_with_zero;

fn code_written_by_crate_user() {
    do_not_call_with_zero(0);
}

When the user runs their code, they'll see the following:

thread 'main' panicked at 'We told you not to do that', .cargo/registry/src/github.com-1ecc6299db9ec823/track_caller_demo-0.1.0/src/lib.rs:8:9

And the user says, "the crate author wrote buggy code!"

But we told them not to pass zero to that function.

We did it in multiple ways.

We don't want the user to see where the code panicked in our crate.

We want to show them their mistake.

So we annotate our function with #[track_caller]:

/// This function will return non-zero values passed to it.
/// 
/// ### Panics
/// 
/// This function will panic if the value passed is zero.
#[track_caller]
pub fn do_not_call_with_zero(val: u64) -> u64 {
    if val == 0 {
        panic!("We told you not to do that");
    }

    val
}

Now the user will see the following error message instead:

thread 'main' panicked at 'We told you not to do that', src/bin/zero.rs:4:5

This shows the location in the user's code where they called our library incorrectly.

Success!

except

There is one caveat.

The track_caller attribute must be on the whole call stack.

Every function from the panic, upwards.

Otherwise it won't work.

Let's add a new function to our library:

/// This function will return non-one values passed to it.
/// 
/// ### Panics
/// 
/// This function will panic if the value passed is one.
#[track_caller]
pub fn do_not_call_with_one(val: u64) -> u64 {
    panic_on_bad_value(val, 1);

    val
}

fn panic_on_bad_value(val: u64, bad: u64) {
    if val == bad {
        panic!("We told you not to provide bad value: {}", bad);
    }
}

We annotate our public function with #[track_caller].

Let's check the output:

thread 'main' panicked at 'We told you not to provide bad value: 1', .cargo/registry/src/github.com-1ecc6299db9ec823/track_caller_demo-0.1.0/src/lib.rs:29:9

The panic is pointing at our perfectly good library code!

To make this work, annotate the whole stack:

/// This function will return non-one values passed to it.
/// 
/// ### Panics
/// 
/// This function will panic if the value passed is one.
#[track_caller]
pub fn do_not_call_with_one(val: u64) -> u64 {
    panic_on_bad_value(val, 1);

    val
}

#[track_caller]
fn panic_on_bad_value(val: u64, bad: u64) {
    if val == bad {
        panic!("We told you not to provide bad value: {}", bad);
    }
}

Now we get:

thread 'main' panicked at 'We told you not to provide bad value: 1', src/bin/one.rs:4:5

Much better!

Most of the work on tokio#4413 was writing tests to ensure this didn't happen.

except except

OK, there's another caveat.

The track_caller attribute doesn't work in some places.

It doesn't work on closures (rust#87417).

But it does now work on async functions (rust#78840).

Although I can't seem to work out which version of Rust that's in.

one more thing

You can also make use of #[track_caller] without a panic!

The same mechanism that panic uses to get the calling location is available for all.

You access it by calling std::panic::Location::caller().

This is used by the unstable tracing feature in Tokio.

It allows console to display the creation location for each task.

A simple example would be:

/// Calls (prints) the `name` together with the calling location.
#[track_caller]
pub fn call_me(name: &str) {
    let caller = std::panic::Location::caller();

    println!(
        "Calling '{name}' from {file}:{line}",
        name = name,
        file = caller.file(),
        line = caller.line(),
    );
}

Because we're using #[track_caller], Location::caller() will give us where call_me was called from.

If we call it twice in succession:

fn main() {
    call_me("Baby");

    call_me("Maybe");
}

We would get the output:

Calling 'Baby' from src/bin/extra.rs:4
Calling 'Maybe' from src/bin/extra.rs:6

And this trick also works multiple layers into your call stack.

As long as you remember to annotate every function on the way down.

Which is pretty cool.

not working

Saturday, 22 October 2022

In 2020, COVID-19 hit Europe.

But unless you're reading this in the far future, you already knew that.

Like many companies with offices in Germany, mine put many of the staff on Kurzarbeit.

Kurzarbeit means short work in German.

A company reduces the working hours of some of its staff; in my case, I went to 4 days a week.

The salary is reduced proportionately, but the German government will fill some of the difference.

In my case, I didn't get a lot filled in, but it was only for 2 months.

And those 2 months were fantastic.

It happened at a fortuitous time for me.

My partner was still on parental leave after our daughter was born.

So I got 3 day weekends with my new family.

But that wasn't the best bit.

Because the German government is paying some of the employees' salaries, they have certain expectations.

Employees shouldn't be working more than their (reduced) contracted hours.

Our company took this very seriously.

We had to fill in time sheets for the first time ever.

Not to ensure that we were working our hours, but to ensure that we weren't working over our hours.

And no moving hours or minutes between days.

Strictly 8 hours a day, not a minute more.

We were given strict instructions to not work outside of the hours on our timesheet.

No opening any work application or anything that may be logged.

No reading email.

Definitely no responding.

No reading instant messages.

Definitely no responding.

No logging onto the VPN, which meant we couldn't do anything else anyway.

And then the entire international company was informed.

Staff in Germany cannot respond to communication outside work hours.

Not in the evening, not on the weekend, not even 1 minute after the timesheet said that they stopped working.

Not working outside work hours wasn't the best part of those 2 months of Kurzarbeit.

It was the expectation by the rest of the organization that I wouldn't be working.

If there's one thing you can instill in your organization's culture that will greatly improve employee life, it's that.

The expectation that employees aren't working outside their working hours.

initial commit

Wednesday, 10 August 2022

The first post on hēg denu.

Here's some code to look at.

#!/usr/bin/perl
chop($_=<>);@s=split/ /;foreach$m(@s){if($m=='*'){$z=pop@t;$x=
pop@t;$a=eval"$x$m$z";push@t,$a;}else{push@t,$m;}}print"$a\n";