One could say I have a bit of an obsession with build times.

I believe having a "tight feedback loop" is extremely valuable: when I work on a large codebase, I want to be able to make small incremental changes and check very often that things are going as expected.

Especially if I'm working on a project that needs to move quickly: say, the product for an early-stage startup, or a side-project for which I only ever get to do 1-hour work bursts at most.

There's a lot about engineering that we need to endure: a lot of problems have inherent complexity. Coordination across humans is hard, and it's lossy. It's really tricky to strike the right balance between "not enough testing" and "ossifying the codebase" (making it harder to change its structure later).

But there's things we don't need to suffer. And one of them is "watching the compiler go brrrr" for excessive amounts of time.

Previously, on this here site

Of course, everyone has their own definition of "excessive". I've been in multiple discussions where I've argued that Rust compile times aren't that bad, and then someone will tell me "okay, but some projects at work take over 2 hours to compile", to which my response is "oh wow, no, that's bonkers actually, how the heck did y'all do that?"

And that's what makes these kinds of things so hard to discuss online: you may have it good. It might work great for you. But someone else might be having a really poor time with it, and might have been for quite a while now, having resigned themselves to "that's how it is, I guess".

It's often proprietary codebases, too, so short of being hired, there are few opportunities for someone to come in, look at the absolute state of their CI pipelines and go "damn, you really live like this?".

So I wrote Why is my Rust build so slow, in which we reviewed the usual suspects for slow builds.

Finally, we found a workaround we could use with the warp crate that made our compile times go down significantly. Splitting crates into smaller crates was also interesting, for increased concurrency.

There's a bunch of other articles that have been written about this topic around the interwebs, but ultimately most of them fall short: what do you do when you've done all of that, but you still suspect you're hitting a pathological case in rustc that makes your build much slower than it ought to be?

What then?

Well, I hit one of these just recently, so, let's look into it.

The offending code

Originally, I hit this performance bug with some tower code. It's a crate used by hyper, itself used by higher-level crates like warp and axum, so it's no coincidence that folks have had build time troubles with all of these.

The crux of the problem is "gee grandma, what big types you have there", because... essentially, tower is a couple of important traits.

Here's one:

Rust code
pub trait Service<Request> {
    type Response;
    type Error;
    type Future: Future<Output = Result<Self::Response, Self::Error>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>>;
    fn call(&mut self, req: Request) -> Self::Future;
}

And the basic idea is "oh shit we don't have async trait methods" (hence the Future associated type) but also "backpressure is love, backpressure is life", by way of the poll_ready method.

What happens is you hand hyper a Service<Request>, and for each new HTTP request it first calls poll_ready until that returns Poll::Ready. It doesn't do that in a loop: it knows when to try again because, in poll_ready, if we weren't ready, we registered ourselves with the Waker hidden inside the Context. So we get polled again when... a socket is ready to be written to / read from, or a timer has elapsed, or a semaphore has more permits available... and that's how async Rust works, more or less!

Wow, most concise explanation yet!

Yes, and completely opaque to anyone who doesn't already know how it works. But that doesn't matter much for this article. Once poll_ready has returned Poll::Ready, then call can be called, and it returns a Future, which is spawned as a background task, and then the request can be handled asynchronously.

Because we don't have "async trait methods", ie. the trait can't just be this:

Rust code
pub trait Service<Request> {
    type Response;
    type Error;

    async fn poll_ready(&mut self) -> Result<(), Self::Error>;
    async fn call(&mut self, req: Request) -> Result<Self::Response, Self::Error>;
}

...we have this Future associated type:

Rust code
    type Future: Future<Output = Result<Self::Response, Self::Error>>;

And most of the time, that's gonna end up being a boxed future, especially if you're using async blocks.

So, often, your service ends up looking something like this:

Rust code
use hyper::{Body, Request, Response};
use std::{
    convert::Infallible,
    future::Future,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};
use tokio::time::sleep;
use tower::Service;

struct MyService;

impl Service<Request<Body>> for MyService {
    type Response = Response<Body>;
    type Error = Infallible;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>>>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        // we don't apply any backpressure here, so always ready
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, req: Request<Body>) -> Self::Future {
        Box::pin(async move {
            // imagine we're fetching data from an API instead
            sleep(Duration::from_millis(250)).await;

            let body = format!("you've requested {:?}", req.uri());
            let res = Response::builder().body(Body::from(body)).unwrap();
            Ok(res)
        })
    }
}

And, ridiculous verbosity aside, the nice thing about this is that you can compose those services. So we might take that "sleeping for 250ms" functionality and move it into another service, like so:

Rust code
use hyper::{Body, Request, Response};
use std::{
    convert::Infallible,
    future::Future,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};
use tokio::time::sleep;
use tower::Service;

struct SleepService<S> {
    inner: S,
}

impl<S, Req> Service<Req> for SleepService<S>
where
    S: Service<Req>,
    <S as Service<Req>>::Future: 'static,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>>>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Req) -> Self::Future {
        let fut = self.inner.call(req);

        Box::pin(async move {
            // imagine we're fetching data from an API instead
            sleep(Duration::from_millis(250)).await;

            fut.await
        })
    }
}

struct MyService;

impl Service<Request<Body>> for MyService {
    type Response = Response<Body>;
    type Error = Infallible;
    // this future is no longer boxed! it's a future that immediately completes.
    type Future = std::future::Ready<Result<Self::Response, Self::Error>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, req: Request<Body>) -> Self::Future {
        let body = format!("you've requested {:?}", req.uri());
        let res = Response::builder().body(Body::from(body)).unwrap();
        std::future::ready(Ok(res))
    }
}

And that's neat because now you can separate concerns cleanly across multiple services: some will rate-limit, some will retry, some will handle errors, etc.

And the end result is that the service you hand to hyper isn't merely MyService, it's something like:

Rust code
type ServiceStack = Trace<
  WebSocketUpgrade<
    UpstreamTimings<
      CompressResponse<
        PrepareRequest<
          HandleErrors<
            HttpsRedirect<
              Balancer<
                Retry<
                  MyService
                >
              >
            >
          >
        >
      >
    >
  >
>;

Thankfully, tower provides another type, Layer, which turns the former into something like:

Rust code
let service = ServiceBuilder::new()
  .layer(TraceLayer)
  .layer(WebSocketUpgradeLayer)
  .layer(UpstreamTimingsLayer)
  .layer(CompressResponseLayer)
  .layer(PrepareRequestLayer)
  .layer(HandleErrorsLayer)
  .layer(HttpsRedirectLayer)
  .layer(BalancerLayer)
  .layer(RetryLayer)
  .service(MyService);

Which is nicer to read, if you can remember that it goes outer -> inner. But it still builds the same gigantic type.

One downside is that the type errors you get with these can be gnarly. Learning to address those is a bit of a special skill I've acquired, and I'm not sure it should be a prerequisite.

And another downside is that these large types tend to hit corner cases in rustc (the Rust compiler) that are... slow. Really slow. Exponentially slower as you add more layers to your service stack, in fact.

And I think we've had quite enough of tower for one article, so I'm going to move on to the small repro case that eddyb made from my already-small repro case.

The repro case

Stolen directly from rust issue 99188, here it is:

Rust code
trait Trait<'a> {
    type A;
    type B;
    type C;
    type D;

    fn method() {}
}

impl<T> Trait<'_> for &'_ T
where
    for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,
{
    type A = ();
    type B = ();
    type C = ();
    type D = ();
}

impl Trait<'_> for () {
    type A = ();
    type B = ();
    type C = ();
    type D = ();
}

pub fn main() {
    #[cfg(depth = "7")]
    <&&&&&&&() as Trait>::method();
    #[cfg(depth = "8")]
    <&&&&&&&&() as Trait>::method();
    #[cfg(depth = "9")]
    <&&&&&&&&&() as Trait>::method();
    #[cfg(depth = "10")]
    <&&&&&&&&&&() as Trait>::method();
}

What's neat about having a self-contained repro case like that (no dependencies, single file) is that we can call rustc directly instead of calling cargo.

So first, let's check that we have pathological behavior thanks to those neat cfg attributes there. We'll pass --emit=metadata to rustc so that it doesn't do full codegen with LLVM. I think this is close to cargo check? Unsure.

Shell session
$ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "7"'
took 0.17s
$ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "8"'
took 0.76s
$ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "9"'
took 3.38s
$ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "10"'
took 15.53s

We're doing this on stable at the time of this writing, btw:

Shell session
$ rustc --version
rustc 1.62.0 (a8314ef7d 2022-06-27)

So yeah, every time you add one level of nesting, it gets 4x slower. So compile time grows like 4^x, where x is the level of nesting: that's exponential growth (the bad kind).

But why? Where is it spending time? What is going on?

Let's find out.

But before we do, to keep things manageable, let's add more cfgs with smaller values, like so:

Rust code
pub fn main() {
    #[cfg(depth = "1")]
    <&() as Trait>::method();
    #[cfg(depth = "2")]
    <&&() as Trait>::method();
    #[cfg(depth = "3")]
    <&&&() as Trait>::method();
    #[cfg(depth = "4")]
    <&&&&() as Trait>::method();
    #[cfg(depth = "5")]
    <&&&&&() as Trait>::method();
    #[cfg(depth = "6")]
    <&&&&&&() as Trait>::method();
    #[cfg(depth = "7")]
    <&&&&&&&() as Trait>::method();
    #[cfg(depth = "8")]
    <&&&&&&&&() as Trait>::method();
    #[cfg(depth = "9")]
    <&&&&&&&&&() as Trait>::method();
    #[cfg(depth = "10")]
    <&&&&&&&&&&() as Trait>::method();
}

Because as you'll see, things can get verbose fast.

Tracing with RUSTC_LOG

As of... some time ago, I don't remember, rustc is using the tracing crate, along with tracing-tree, which means it's very easy for us to turn on some debug logging.

I was excited about this because I generally love tracing, as I've said before, and seeing it used in a compiler caused me great joy.

rustc logging is set up so that you can set RUSTC_LOG to any value accepted by EnvFilter.

So, let's set it to info and see what we get:

Shell session
$ RUSTC_LOG=info rustc src/main.rs --emit=metadata --cfg 'depth = "1"'
 INFO rustc_metadata::creader resolving crate `std`
 INFO rustc_metadata::creader falling back to a load
 INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.so
 INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-a61cdd33cfa8394f.rlib
 INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.rlib
 INFO rustc_metadata::locator rlib reading metadata from: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-a61cdd33cfa8394f.rlib
 INFO rustc_metadata::locator Rejecting via crate name

Okay, so rustc knows about crates! Interesting. Let's skip ahead, there's a ton of those.

Shell session
INFO rustc_metadata::creader resolved crates:
  name: std
  cnum: 1
  hash: 514c814935f60577
  reqd: Explicit
   rlib: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.rlib
  name: core
  cnum: 2
  hash: 40880dfc3076f43e
  reqd: Explicit
   rlib: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-83735dd4dae9b02c.rlib

Neat, it found a bunch of crates (19 in total, not shown here).

Shell session
  INFO rustc_interface::passes 0 parse sess buffered_lints
┐rustc_typeck::check::wfcheck::check_trait 
├─┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=0, value=[Binder(TraitPredicate(<Self as Trait<'a>>, polarity:Positive), [])]

Okay, now it started doing type checking! (You can tell from the crate name: rustc_typeck — rustc too is split across a bunch of crates).

Much later, we see:

Shell session
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b057]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))])
┘
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b057]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))])
┘
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b057]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))])
┘
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b057]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))])

Ah! That seems related to our code.

Let's reformat one of these manually:

Binder(
  ProjectionPredicate(
    ProjectionTy {
      substs: [
        _,
        ReLateBound(
          DebruijnIndex(0),
          BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) }
        )
      ],
      item_def_id: DefId(0:5 ~ main[b057]::Trait::A)
    },
    Ty(())
  ),
  [
    Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))
  ]
)

Looking through rustc's source code a bit, we find:

Binder is a binder for higher-ranked lifetimes or types. It is part of the compiler's representation for things like for<'a> Fn(&'a isize) (which would be represented by the type PolyTraitRef == Binder<'tcx, TraitRef>). Note that when we instantiate, erase, or otherwise "discharge" these bound vars, we change the type from Binder<'tcx, T> to just T (see e.g., liberate_late_bound_regions).

That seems to correspond to this constraint in our code:

Rust code
where
    for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,

In there we have a ProjectionPredicate:

This kind of predicate has no direct correspondent in the syntax, but it roughly corresponds to the syntactic forms:

  1. T: TraitRef<..., Item = Type>
  2. <T as TraitRef<...>>::Item == Type (NYI)

In particular, form #1 is "desugared" to the combination of a normal trait predicate (T: TraitRef<...>) and one of these predicates. Form #2 is a broader form in that it also permits equality between arbitrary types. Processing an instance of Form #2 eventually yields one of these ProjectionPredicate instances to normalize the LHS.

My read here is that this:

Rust code
where
    for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,

Gets translated into this (which is not valid syntax):

Rust code
where
    for<'x> T: Trait<'x>,
    for<'x> <T as Trait<'x>>::A = (),
    for<'x> <T as Trait<'x>>::B = (),
    for<'x> <T as Trait<'x>>::C = (),
    for<'x> <T as Trait<'x>>::D = (),

If you try it you'll run into "equality constraints are not yet supported in where clauses", with a link to this issue.

But that seems to be what the compiler desugars it to, according to the comment. And so, we're looking at something like for<'x> <T as Trait<'x>>::A = ().

Still in that logging output, we see a mention of De Bruijn indices. Named after Nicolaas Govert de Bruijn, they're just a convenient way to "represent terms of lambda calculus without naming the bound variables". Here they're used to represent lifetimes.

That's as much as I can divine from this, but one thing's for sure: these log lines are related to typechecking our code.

Finally, we get a nice summary of... the distribution of type features for types in the interner? It's not "time spent", but it's fun to look at:

 INFO rustc_interface::passes Pre-codegen
Ty interner             total           ty lt ct all
    Adt               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Array             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Slice             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    RawPtr            :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Ref               :     15 25.4%,  5.1%  15.3%  0.0%  0.0%
    FnDef             :      9 15.3%,  0.0%   8.5%  0.0%  0.0%
    FnPtr             :      1  1.7%,  0.0%   0.0%  0.0%  0.0%
    Placeholder       :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Generator         :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    GeneratorWitness  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Dynamic           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Closure           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Tuple             :      1  1.7%,  0.0%   0.0%  0.0%  0.0%
    Bound             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Param             :      2  3.4%,  0.0%   0.0%  0.0%  0.0%
    Infer             :      3  5.1%,  3.4%   0.0%  0.0%  0.0%
    Projection        :     28 47.5%,  0.0%  13.6%  0.0%  0.0%
    Opaque            :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
    Foreign           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
                  total     59         8.5%  37.3%  0.0%  0.0%
InternalSubsts interner: #95
Region interner: #54
Const Allocation interner: #0
Layout interner: #0

And... there's 47.5% in Projection, which I think refers to those <T as Trait>::AssocType = Stuff constraints.

And now I'm really curious how these evolve when we add more nesting, so, let's filter log output so we only have stuff from rustc_interface::passes and do text processing:

Shell session
for i in $(seq 1 9); do echo "============ Level ${i} ============"; RUSTC_LOG=rustc_interface::passes=info rustc src/main.rs --emit=metadata --cfg "depth = \"${i}\"" 2>&1 | grep "Post-codegen" -A 100 | grep -E "Ref|FnDef|Infer|Projection"; done
============ Level 1 ============
    Ref               :     15 25.4%,  5.1%  15.3%  0.0%  0.0%
    FnDef             :      9 15.3%,  0.0%   8.5%  0.0%  0.0%
    Infer             :      3  5.1%,  3.4%   0.0%  0.0%  0.0%
    Projection        :     28 47.5%,  0.0%  13.6%  0.0%  0.0%
============ Level 2 ============
    Ref               :     37 43.5%, 21.2%  31.8%  0.0%  0.0%
    FnDef             :      8  9.4%,  0.0%   4.7%  0.0%  0.0%
    Infer             :      8  9.4%,  8.2%   0.0%  0.0%  0.0%
    Projection        :     28 32.9%,  0.0%   9.4%  0.0%  0.0%
============ Level 3 ============
    Ref               :     65 54.6%, 29.4%  42.9%  0.0%  0.0%
    FnDef             :      9  7.6%,  0.0%   4.2%  0.0%  0.0%
    Infer             :     13 10.9%, 10.1%   0.0%  0.0%  0.0%
    Projection        :     28 23.5%,  0.0%   6.7%  0.0%  0.0%
============ Level 4 ============
    Ref               :    113 62.4%, 40.9%  52.5%  0.0%  0.0%
    FnDef             :      9  5.0%,  0.0%   2.8%  0.0%  0.0%
    Infer             :     27 14.9%, 14.4%   0.0%  0.0%  0.0%
    Projection        :     28 15.5%,  0.0%   4.4%  0.0%  0.0%
============ Level 5 ============
    Ref               :    231 60.9%, 48.3%  55.1%  0.0%  0.0%
    FnDef             :      9  2.4%,  0.0%   1.3%  0.0%  0.0%
    Infer             :    107 28.2%, 28.0%   0.0%  0.0%  0.0%
    Projection        :     28  7.4%,  0.0%   2.1%  0.0%  0.0%
============ Level 6 ============
    Ref               :    652 58.2%, 53.1%  55.9%  0.0%  0.0%
    FnDef             :      9  0.8%,  0.0%   0.4%  0.0%  0.0%
    Infer             :    427 38.1%, 38.0%   0.0%  0.0%  0.0%
    Projection        :     28  2.5%,  0.0%   0.7%  0.0%  0.0%
============ Level 7 ============
    Ref               :   2248 56.3%, 54.6%  55.5%  0.0%  0.0%
    FnDef             :      9  0.2%,  0.0%   0.1%  0.0%  0.0%
    Infer             :   1707 42.7%, 42.7%   0.0%  0.0%  0.0%
    Projection        :     28  0.7%,  0.0%   0.2%  0.0%  0.0%
============ Level 8 ============
    Ref               :   8616 55.6%, 55.2%  55.4%  0.0%  0.0%
    FnDef             :      9  0.1%,  0.0%   0.0%  0.0%  0.0%
    Infer             :   6827 44.1%, 44.1%   0.0%  0.0%  0.0%
    Projection        :     28  0.2%,  0.0%   0.1%  0.0%  0.0%
============ Level 9 ============
    Ref               :  33976 55.4%, 55.3%  55.3%  0.0%  0.0%
    FnDef             :      9  0.0%,  0.0%   0.0%  0.0%  0.0%
    Infer             :  27307 44.5%, 44.5%   0.0%  0.0%  0.0%
    Projection        :     28  0.0%,  0.0%   0.0%  0.0%  0.0%

Ah! It looks like Infer is EXPloding. And so is Ref, but I'm not exactly sure why.

Wait, why are we doing this again? Doesn't rustc have self-profiling? That seems like it could be useful there.

You know what, that's a good point, let's try that.

Self-profiling with -Z self-profile

We'll choose nesting level 8, because it takes a handful of seconds on my machine so it's probably long enough to measure stuff, yet short enough that it doesn't get really annoying.

So, we've learned this in a previous article, let's get moving. As before, and because we're still on stable at this point, we'll use the RUSTC_BOOTSTRAP environment variable to unlock nightly features, which is something you shouldn't normally do:

Shell session
$ RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile
(no output)
# this takes the most recent `*.mm_profdata` file
$ summarize summarize $(ls -t *profdata | head -1) | head
+------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                           | Self time | % of total time | Time     | Item count |
+------------------------------------------------+-----------+-----------------+----------+------------+
| evaluate_obligation                            | 738.19ms  | 98.892          | 738.21ms | 51         |
+------------------------------------------------+-----------+-----------------+----------+------------+
| expand_crate                                   | 2.32ms    | 0.311           | 2.71ms   | 1          |
+------------------------------------------------+-----------+-----------------+----------+------------+
| free_global_ctxt                               | 892.96µs  | 0.120           | 892.96µs | 1          |
+------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_fulfill_obligation                     | 556.41µs  | 0.075           | 738.71ms | 1          |
thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Alright! We have a candidate there for sure.

Here it is in flamegraph form:

Shell session
$ flamegraph $(ls -t *profdata | head -1)
# using the `sfz` crate as a static file server to visualize the SVG in a browser
$ sfz
Files served on http://127.0.0.1:5000
A flamegraph, showing that almost all time is spent in: all, rustc, analysis, MIR_borrow_checking, mir_borrowck, mir_promoted, mir_const, unsafety_check_result, mir_built, resolve_instance, codegen_fulfill_obligation, and evaluate_obligation

And in chrome://tracing form, which has a bunch of useful temporal information the flamegraph lacks:

Shell session
$ crox $(ls -t *profdata | head -1)
# this generated a `chrome_profiler.json` file
A screenshot of chrome://tracing in the Google Chrome browser, showing the same thing the flamegraph showed, but upside-down. Also we now see that there's several evaluate_obligation calls

Here for example, we see that evaluate_obligation is actually done multiple times: the first one takes the longest, the next couple ones are a little bit quicker, and the last few are quicker still.

But which "obligations" is it evaluating? We don't know, because we haven't recorded arguments. Which we can do with another flag, -Z self-profile-events:

Shell session
$ RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile -Z self-profile-events=default,args

The summary looks mostly the same:

Shell session
$ summarize summarize $(ls -t *profdata | head -1) | head
+-------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                            | Self time | % of total time | Time     | Item count |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| evaluate_obligation                             | 774.03ms  | 98.623          | 774.06ms | 51         |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| expand_crate                                    | 2.32ms    | 0.295           | 2.71ms   | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_module_children           | 1.98ms    | 0.253           | 1.98ms   | 298        |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| codegen_fulfill_obligation                      | 737.23µs  | 0.094           | 774.74ms | 1          |
thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Let's look at the chrome tracing profile directly:

The same chrome tracing profile, but this time there's an arg0 that shows which obligation it's checking

This time, we can see which obligation is being checked, and the first one is (manually formatted for your convenience):

Canonical {
  max_universe: U0,
  variables: [],
  value: ParamEnvAnd {
    param_env: ParamEnv {
      caller_bounds: [],
      reveal: UserFacing,
      constness: NotConst
    },
    value: Binder(
      ProjectionPredicate(
        ProjectionTy {
          substs: [
            &&&&&&&&(),
            ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })
          ],
          item_def_id: DefId(0:5 ~ main[b057]::Trait::A)
        },
        Ty(())
      ),
      [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))]
    )
  }
}
Cool bear's hot tip

There's not really a good reason for that #[derive(Debug)]-style output being in there btw. There's machinery in place in rustc to format these nicely, like for<'x> Predicate<'x> but apparently it's not hooked all the way through.

Now that I think about it, that would be a neat first contribution... wink wink.

Did you just say "wink wink" out loud?

You're not the boss of me.

The next three are the same, except the item_def_id is for Trait::B, Trait::C and Trait::D respectively. The next one doesn't have a projection: it's simply the for<'x> T: Trait<'x> obligation.

And then the other ones are similar, but for the inner levels: seven & instead of eight, and so on until we reach the bottom, where we check for<'x> &(): Trait<'x>.

Which... is interesting. You'd think that... once rustc has proven for<'x> &(): Trait<'x>, it could re-use that result to prove that for<'x> &&(): Trait<'x>, using that impl:

Rust code
impl<T> Trait<'_> for &'_ T
where
    for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,

And then it can re-use that to prove for<'x> &&&(): Trait<'x>, and so on until for<'x> &&&&&&&&(): Trait<'x>.

And in fact, it can! If we remove the projections, ie. we make that bound just this:

Rust code
impl<T> Trait<'_> for &'_ T
where
    for<'x> T: Trait<'x>,

Then it completes instantly:

Shell session
$ command time -f "took %Us" env RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile -Z self-profile-events=default,args
took 0.04s

And at that point, most of the time is spent in... expand_crate (macro stuff?) and metadata_decode_entry_module_children (reading rlib files?):

Shell session
$ summarize summarize $(ls -t *profdata | head -1) | head
+-------------------------------------------------+-----------+-----------------+----------+------------+
| Item                                            | Self time | % of total time | Time     | Item count |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| expand_crate                                    | 2.34ms    | 25.009          | 2.74ms   | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| metadata_decode_entry_module_children           | 2.27ms    | 24.295          | 2.27ms   | 298        |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| self_profile_alloc_query_strings                | 449.57µs  | 4.815           | 3.14ms   | 1          |
+-------------------------------------------------+-----------+-----------------+----------+------------+
| <unknown>                                       | 424.47µs  | 4.546           | 5.03ms   | 320        |
thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

So something is definitely up, and we should look further.

I'm sorry, amos, if I could have one second of your time. Are you seriously suggesting folks who write Rust casually have a moral obligation to do all of this?

Oh! Oh no, absolutely not. It's just... if you want to, then you definitely can, and it's a bit under-documented right now, so I wanted to share my notes on how, so that the next time it happens, I can refer to those instead of furiously spamming everyone. And if others do that too maybe we can get to the bottom of these faster, that's all.

Ah, so you're saying not to do this?

I'm saying do whatever you want. I'll show you how, and then you can decide what you want to do with that knowledge.

I see I see. So you're going to want to keep this accessible still, right? It's not that bad so far, but I could see going a little too far w-

So, as you can imagine, at some point, we're gonna want to compile our own copy of rustc.

...what was I just saying.

No no, listen, hear me out. It's not as bad as you think. I was afraid too, once!

But it turns out there's a very nice guide to rustc development available, and it's easy to follow.

Okay but what possible reason could you have for doing this? Are we going to fix the bug in that article?

Oh no, not at all. But we're going to learn more about it.

See, the release build of the stable Rust compiler may have nightly features hidden behind RUSTC_BOOTSTRAP=1, but there's things it doesn't have.

Let's take tracing: say we want more detail than info: maybe we want debug spans and events. If we try it on stable, we get this:

Shell session
$ RUSTC_LOG=debug rustc src/main.rs --emit=metadata --cfg 'depth = "1"'
warning: some trace filter directives would enable traces that are disabled statically
 | `debug` would enable the DEBUG level for all targets
 = note: the static max level is `info`
 = help: to enable DEBUG logging, remove the `max_level_info` feature
(cut)

Oh, it... wait, tracing lets you strip certain levels statically? Like, at build time?

Yes it does! So the debug spans and events we're looking for aren't even compiled into the stable rustc binary.
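For context, that static stripping is done through Cargo features on the tracing crate. Here's a sketch of what a downstream Cargo.toml opting into it might look like (feature names as documented for tracing 0.1; there are also release_max_level_* variants that only apply to release builds):

TOML markup
```toml
# Sketch: cap tracing's verbosity at compile time.
# With `max_level_info` enabled, debug!/trace! macros compile to
# no-ops that no runtime filter can re-enable — which is exactly
# what we're bumping into with the stable rustc binary.
[dependencies]
tracing = { version = "0.1", features = ["max_level_info"] }
```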

Profiling with perf

Here's another thing. Say we record the execution of rustc with the perf profiler:

Shell session
$ perf record rustc src/main.rs --emit=metadata --cfg 'depth = "8"'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.048 MB perf.data (850 samples) ]
Cool bear's hot tip

You have to set up a bunch of procfs variables to be able to do perf record unprivileged, because the kernel is (rightfully) paranoid about perf_events and kernel symbol tables by default on distributions like Ubuntu.

This is not covered here, because as you try it, perf walks you through fixing it.

And then we report:

Shell session
$ perf report --stdio
(cut)
     7.26%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_infer::infer::InferCtxt>::rollback_to
     5.54%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::context::TyCtxt>::intern_substs
     4.31%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_infer::infer::region_constraints::RegionConstraintCollector>::make_eqregion
     4.18%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::context::TyCtxt>::mk_region
     3.57%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_infer::infer::region_constraints::leak_check::LeakCheck>::assign_placeholder_values
     3.57%  rustc    rustc                                [.] free
     2.83%  rustc    librustc_driver-a7bf53640eed303a.so  [.] _RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087
     2.71%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <alloc::rc::Rc<rustc_middle::traits::ObligationCauseCode> as core::ops::drop::Drop>::drop
     2.71%  rustc    rustc                                [.] malloc
     2.58%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::context::CtxtInterners>::intern_predicate
     1.72%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_data_structures::graph::scc::SccsConstruction<rustc_data_structures::graph::vec_graph::VecGraph<rustc_infer::infer::region_constraints::leak_check::LeakCheckNode>, rustc_infer::infer::region_constraints::leak_check::LeakCheckScc>>::inspect_node
     1.48%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_trait_selection::traits::select::SelectionContext>::rematch_impl
     1.48%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_trait_selection::traits::select::SelectionContext>::vtable_impl
     1.35%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::sty::Region as core::ops::deref::Deref>::deref
     1.35%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::subst::SubstFolder as rustc_middle::ty::fold::TypeFolder>::fold_ty
     1.23%  rustc    [kernel.kallsyms]                    [k] do_user_addr_fault

Well... because we don't have debug info, we don't see what we should be seeing, which is...

Uhh you didn't even tell perf to record call stacks?

Ah right! Let's try that again:

Shell session
# note: `-g` is the more confusing equivalent to `--call-graph fp`. we use
# the longer form to avoid any misunderstandings about what `-g` truly does.
# here we're using "frame pointers" to gather stack traces when we sample.
$ perf record --call-graph fp rustc src/main.rs --emit=metadata --cfg 'depth = "8"'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.112 MB perf.data (1306 samples) ]

$ perf report --stdio
(cut)
     8.77%     0.00%  rustc    [unknown]                            [.] 0000000000000000
            |
            ---0
               |          
               |--0.79%--<rustc_infer::infer::InferCtxt>::commit_if_ok::<(), (), rustc_trait_selection::traits::project::assemble_candidates_from_impls::{closure#0}>
               |          
               |--0.67%--pthread_mutex_trylock@GLIBC_2.2.5
               |          
               |--0.56%--_RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087
               |          
               |--0.56%--free
               |          
               |--0.56%--0x7fd7407eb520
               |          
                --0.56%--0x40000
                          <rustc_infer::infer::region_constraints::RegionConstraintCollector>::make_eqregion

     6.64%     6.64%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_infer::infer::InferCtxt>::rollback_to
            |
            ---<rustc_infer::infer::InferCtxt>::rollback_to

     6.19%     6.19%  rustc    librustc_driver-a7bf53640eed303a.so  [.] <rustc_middle::ty::context::TyCtxt>::intern_substs
            |          
             --5.51%--<rustc_middle::ty::context::TyCtxt>::intern_substs

There. That looks wrong.

Sure, but you're using frame pointers here. What about --call-graph dwarf?

Sure, let's try that:

Shell session
$ perf record --call-graph dwarf rustc src/main.rs --emit=metadata --cfg 'depth = "8"'
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 7.733 MB perf.data (955 samples) ]

(As usual, using dwarf makes for a much larger perf.data file since perf just takes a snapshot of the stack and deals with symbolication later)

Running perf report --stdio gives us a bunch of these:

Shell session
addr2line: DWARF error: invalid or unhandled FORM value: 0x23
addr2line: DWARF error: invalid or unhandled FORM value: 0x23
(this repeats for a long while)

And then:

Shell session
    98.53%     0.00%  rustc    [unknown]                            [k] 0xffffffffffffffff
            |
            ---0xffffffffffffffff
               |          
               |--43.14%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          
               |          |--32.77%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          
               |          |          |--17.07%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          |          <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          |          
               |          |          |          |--15.71%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          |          |          <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          |          |          
               |          |          |          |          |--11.94%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          |          |          |          <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          |          |          |          
               |          |          |          |          |          |--6.18%--rustc_trait_selection::traits::project::poly_project_and_unify_type
               |          |          |          |          |          |          rustc_trait_selection::traits::project::project_and_unify_type
               |          |          |          |          |          |          _RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087
               |          |          |          |          |          |          |          
               |          |          |          |          |          |          |--2.20%--<rustc_infer::infer::InferCtxt>::commit_if_ok::<(), (), rustc_trait_selection::traits::project::assemble_candidates_from_impls::{closure#0}>
               |          |          |          |          |          |          |          
               |          |          |          |          |          |          |--1.57%--<rustc_infer::traits::project::ProjectionCache>::try_start
               |          |          |          |          |          |          |          |          
               |          |          |          |          |          |          |          |--0.73%--__memcpy_avx_unaligned_erms (inlined)
               |          |          |          |          |          |          |          |          
               |          |          |          |          |          |          |           --0.52%--_RINvMs5_NtCskqCQpbzSEkT_9hashbrown3rawINtB6_8RawTableTNtNtNtCs7lUMoIymc80_11rustc_infer6traits7project18ProjectionCacheKeyNtBS_20ProjectionCacheEntryEE14reserve_rehashNCINvNtB8_3map11make_hasherBQ_BQ_B1W_INtNtCs8R1TIarAOKy_4core4hash18BuildHasherDefaultNtCs5jvd1lB4BAC_10rustc_hash8FxHasherEE0EBW_.llvm.5086343079582176625
               |          |          |          |          |          |          |          
               |          |          |          |          |          |          |--0.63%--<rustc_middle::ty::sty::ProjectionTy>::trait_ref
               |          |          |          |          |          |          |          
               |          |          |          |          |          |           --0.63%--<alloc::vec::drain_filter::DrainFilter<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>, rustc_trait_selection::traits::project::opt_normalize_projection_type::{closure#0}> as core::ops::drop::Drop>::drop
               |          |          |          |          |          |          
               |          |          |          |          |           --5.76%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          |          |          |                     <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          |          |                     |          
               |          |          |          |          |                      --5.65%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>>
               |          |          |          |          |                                <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively
               |          |          |          |          |                                |          
               |          |          |          |          |                                |--4.29%--rustc_trait_selection::traits::project::poly_project_and_unify_type
               |          |          |          |          |                                |          |          
               |          |          |          |          |                                |          |--2.93%--<&rustc_middle::ty::list::List<rustc_middle::ty::subst::GenericArg> as rustc_middle::ty::fold::TypeFoldable>::try_fold_with::<rustc_middle::ty::fold::BoundVarReplacer>
               |          |          |          |          |                                |          |          |          
               |          |          |          |          |                                |          |          |--2.41%--<rustc_middle::ty::fold::BoundVarReplacer as rustc_middle::ty::fold::TypeFolder>::fold_region
               |          |          |          |          |                                |          |          |          |          
               |          |          |          |          |                                |          |          |           --2.30%--_RNCINvMs5_NtNtCs3KapO9mZVGx_12rustc_middle2ty4foldNtNtBa_7context6TyCtxt18replace_bound_varsNtBa_14TraitPredicateNCINvMs_NtNtCs7lUMoIymc80_11rustc_i
nfer5infer13higher_rankedNtB1X_9InferCtxt36replace_bound_vars_with_placeholdersB1s_E0NCB1P_s_0NCB1P_s0_0E0Cs81r3Jv5jMWm_21rustc_trait_selection.llvm.15196016308777180469 (inlined)
               |          |          |          |          |                                |          |          |                     |          
               |          |          |          |          |                                |          |          |                      --1.78%--<rustc_middle::ty::context::TyCtxt>::mk_region
               |          |          |          |          |                                |          |          |          
               |          |          |          |          |                                |          |           --0.52%--<rustc_middle::ty::context::TyCtxt>::intern_substs
(cut)

Oh, actually that looks okay?

Let's try making flamegraph with inferno:

Shell session
$ perf script | inferno-collapse-perf | inferno-flamegraph > rustc.svg
(lots and lots of DWARF errors)
another flamegraph, showing various internal rustc functions

Yeah huh, this is not as bad as I thought actually.

I have a feeling the DWARF errors are caused by Ubuntu 22.04's addr2line lacking support for newer DWARF features.

Mhh maybe if you tell perf to use another addr2line implementation? The one from LLVM maybe?

Yeah yeah, I've seen the perf patch for that too, but: it doesn't seem to be accepted by perf script on Ubuntu 22.04, and also llvm-addr2line (a wrapper over llvm-symbolizer) is straight up not provided by LLVM 14 packages from their own repository. So uh, yeah.

And the other issue is: if you look closer at the flamegraph above, you'll see that it goes: all, rustc, and then [unknown]. A bunch of the call stack is missing!

That's because perf only captures 8KB of the stack by default. You can bump that up to 65528 bytes with --call-graph dwarf,65528 (yes that's the actual upper limit), but that makes perf.data even larger, and perf report / perf script even slower.

Update: as it turns out, llvm-addr2line is provided by the packages in LLVM's APT repository, it's just named llvm-addr2line-14 for me. Still no --addr2line= flag in perf for me though, so with a quick wrapper in /tmp/addr2line-wrapper/addr2line:

Bash
#!/bin/bash -eu
llvm-addr2line-14 "$@"
Cool bear's hot tip

This trick works (almost) anytime you want to trick a program into executing another program instead of what's naturally in your $PATH.
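A quick illustration of the trick, using a made-up command name ("mytool" is hypothetical, purely to show PATH precedence):

Shell session
```shell
# Make a throwaway directory containing a fake "mytool" wrapper.
mkdir -p /tmp/wrapper-demo
cat > /tmp/wrapper-demo/mytool <<'EOF'
#!/bin/sh
echo "wrapper ran with: $@"
EOF
chmod +x /tmp/wrapper-demo/mytool

# Prepending our directory means our wrapper wins the PATH lookup.
PATH=/tmp/wrapper-demo:$PATH mytool --version
# → wrapper ran with: --version
```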

Let's try using the Firefox profiler to visualize our profile this time, by following the instructions carefully:

Shell session
$ perf record --call-graph dwarf rustc src/main.rs --emit=metadata --cfg 'depth = "8"' 
$ chmod +x /tmp/addr2line-wrapper/addr2line
$ PATH=/tmp/addr2line-wrapper:$PATH perf script -F +pid > test.perf

Then we just drag it onto the web app, and... there!

a screenshot of the Firefox profiler

One extremely neat thing about the Firefox Profiler is that you can share your profile online. Here's the profile above, have a look!

Profiling with not-perf (nperf)

Even though perf is super versatile, as far as CPU sampling profilers go, I prefer something a little more modern, like not-perf.

It produces smaller profiles, does symbolication on the fly (a term I stole from Sentry; it just means "turning addresses into symbols", i.e. what addr2line does), and has built-in flamegraph and chrome-tracing generation facilities.

It's not published on crates.io as far as I can tell, so you'll want to clone it locally and run cargo install --path ./cli or something, and then you can do this in one terminal session:

Shell session
$ nperf record --process rustc --wait

And this in another:

Shell session
$ rustc src/main.rs --emit=metadata --cfg 'depth = "8"'

Our nperf session completes nicely:

Shell session
$ nperf record --process rustc --wait
[2022-07-13T11:38:35Z INFO  nperf_core::ps] Waiting for process named 'rustc'...
[2022-07-13T11:38:44Z INFO  nperf_core::ps] Process 'rustc' found with PID 1438757!
[2022-07-13T11:38:44Z INFO  nperf_core::profiler] Opening "20220713_113844_1438757_rustc.nperf" for writing...
[2022-07-13T11:38:44Z INFO  nperf_core::cmd_record] Opening perf events for process with PID 1438757...
[2022-07-13T11:38:44Z INFO  nperf_core::profiler] Finished output file initialization
[2022-07-13T11:38:44Z INFO  nperf_core::cmd_record] Enabling perf events...
[2022-07-13T11:38:44Z INFO  nperf_core::cmd_record] Running...
[2022-07-13T11:38:45Z INFO  nperf_core::profiler] Collected 457 samples in total!

And now we can generate a flamegraph from it:

Shell session
$ nperf flamegraph $(ls -t *.nperf | head -1) -o rustc.svg
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libm.so.6': 'a8c28af0bfefefcf69ba73f7d4582d82e01f71.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/libLLVM-14-rust-1.62.0-stable.so': 'crti.o'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libc.so.6': 'c3cb85f9e55046776471fed05ec441581d1969.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libz.so.1.2.11': '2e9768303892af62d212ec9357a7f1a0824a91.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/librt.so.1': '0915c138c86b44167c7f53256fc375332fbe47.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libgcc_s.so.1': '3e9a70a55a0249b33d01e903049a7657f5c0ae.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libdl.so.2': '63bd5ea2a58dd8b60ee92a5081f753391db642.debug'
[2022-07-13T11:39:34Z WARN  nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2': '1b0b998999c397062e1016f0c95dc0e8820117.debug'

(As usual, Ubuntu's awful track record with debug symbols means we're missing info for a bunch of system libraries, and good luck finding those!)

A flamegraph generated with nperf. It's much deeper, we have full callstacks here

This time our callstacks are complete, so we can see the whole thing. At this point we already have a good candidate, but I'm going to keep showing tools because I can.

nperf can also generate a chrome_profiler.json file that we can load with chrome://tracing:

Shell session
$ nperf trace-events $(ls -t *.nperf | head -1) -o chrome_profiler.json 
(cut: missing symbols warnings)
[2022-07-13T11:43:48Z INFO  nperf_core::cmd_trace_events] Profiling data frequency: 900
[2022-07-13T11:43:48Z INFO  nperf_core::cmd_trace_events] Trace period for frame collapsing: 1140us

Unfortunately, it's a little buggy at the moment and the generated JSON has an empty record in first position, like so:

JSON
[,{"name": "0x0000000000000000","ph": "B","ts": 102138637276.717,"pid": 1438757,"tid": 1438757}

We can just remove that extra comma and be on our merry way.
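If you'd rather not hand-edit the file, a one-liner does it (a sketch; it assumes the stray comma right after the opening bracket is the only defect, as it was here, and the file below is a stand-in for the real one):

Shell session
```shell
# Create a stand-in file with nperf's stray leading comma...
printf '[,{"name": "demo"}]' > chrome_profiler.json
# ...and strip the comma right after the opening bracket, in place.
sed -i 's/^\[,/[/' chrome_profiler.json
cat chrome_profiler.json
# → [{"name": "demo"}]
```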

A chrome tracing profile, full of colors but really hard to read unless we zoom in

This isn't very readable unless we zoom in a little:

the same profile, zoomed in. We can see resolve_instance calls inner_resolve_instance, codegen_fulfill_obligation, try_execute_query, codegen_fulfill_obligation, a closure, select_all_or_error, process_obligations, process_changed_obligations, predicate_must_hold_considering_regions, evaluate_obligation a couple times, evaluate_root_obligation, then several layers of evaluate_predicate_recursively, which end up in poly_project_and_unify_type, opt_normalize_projection_type, commit_if_ok, select, and in another leaf, there's confirm_candidate, rematch_impl, and make_eqregion

If we load our profile in another tool, like Speedscope, we can get another visualization:

speedscope screenshot, focused on a bunch of evaluate_predicate_recursively calls.

Even though we're in "time order" mode, we can see that the stack goes deep, very deep. Good thing rustc uses the stacker crate internally.

Profiling with callgrind

I wanted to try out another profiling tool: valgrind's own "callgrind".

Cool bear's hot tip

valgrind is "an instrumentation framework for building dynamic analysis tools". It doesn't just detect memory corruption, memory leaks or cache misses, it also comes with a profiler.

But I couldn't. Apparently rustup shims include DWARF5 debug information, which Valgrind doesn't properly support yet:

Shell session
$ valgrind --quiet --trace-children=yes --tool=callgrind rustc src/main.rs --emit=metadata -v
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x25
### unhandled dwarf2 abbrev form code 0x23
==1479608== Valgrind: debuginfo reader: ensure_valid failed:
==1479608== Valgrind:   during call to ML_(img_get)
==1479608== Valgrind:   request for range [3693702622, +4) exceeds
==1479608== Valgrind:   valid image size of 15825920 for image:
==1479608== Valgrind:   "/home/amos/.cargo/bin/rustc"
==1479608== 
==1479608== Valgrind: debuginfo reader: Possibly corrupted debuginfo file.
==1479608== Valgrind: I can't recover.  Giving up.  Sorry.
==1479608== 

Here's the upstream valgrind issue.

Stripping that debug info, or using the "actual" rustc executable, didn't get me much further, even with a custom build of rustc. It seems something goes wrong with LLVM at some point:

Shell session
$ valgrind --quiet --trace-children=yes --tool=callgrind /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v
rustc -Cllvm-args="..." with: Unknown command line argument '-generate-arange-section'.  Try: 'rustc -Cllvm-args="..." with --help'
rustc -Cllvm-args="..." with: Did you mean '--W=enerate-arange-section'?
rustc -Cllvm-args="..." with: Unknown command line argument '-mergefunc-use-aliases'.  Try: 'rustc -Cllvm-args="..." with --help'
rustc -Cllvm-args="..." with: Did you mean '--W=ergefunc-use-aliases'?

And manually adding unstable rustc flags just to get it to stop passing those flags to LLVM didn't fix whatever the problem is. It just failed later:

Shell session
$ valgrind --quiet --trace-children=yes --tool=callgrind /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v -Z no-generate-arange-section -Z merge-functions=disabled
LLVM ERROR: Expected token before separator in datalayout string

I did make sure that rustc was loading its own copy of LLVM:

Shell session
$ valgrind --quiet --trace-children=yes --tool=callgrind env LD_DEBUG=files /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v 2>&1 | grep -i llvm
   1481228:     file=libLLVM-14-rust-1.64.0-nightly.so [0];  needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/librustc_driver-dc663798d2165e22.so [0]
   1481228:     file=libLLVM-14-rust-1.64.0-nightly.so [0];  generating link map
   1481228:     file=librt.so.1 [0];  needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]
   1481228:     file=libdl.so.2 [0];  needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]
   1481228:     file=libpthread.so.0 [0];  needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]
   1481228:     file=libz.so.1 [0];  needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]
   1481228:     calling init: /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so
rustc -Cllvm-args="..." with: Unknown command line argument '-generate-arange-section'.  Try: 'rustc -Cllvm-args="..." with --help'
rustc -Cllvm-args="..." with: Did you mean '--W=enerate-arange-section'?
rustc -Cllvm-args="..." with: Unknown command line argument '-mergefunc-use-aliases'.  Try: 'rustc -Cllvm-args="..." with --help'
rustc -Cllvm-args="..." with: Did you mean '--W=ergefunc-use-aliases'?
   1481228:     calling fini: /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]

So I have no idea what's going on here.

Next!

Self-profiling deeper with jq

So we've seen two ways to generate flamegraphs and chrome-tracing profiles.

One through rustc self-profiling (-Z self-profile), and one through profilers like perf, nperf, or, whenever I can figure out how to fix it, valgrind.

The problem with CPU profilers is that we don't know what arguments are being passed to those functions. We noticed we were spending a lot of time in evaluate_predicate_recursively, but we don't know what predicate is being evaluated recursively.

And the problem with self-profiling (and tracing) is that... it depends on how much code has been instrumented.

That's where building our own rustc comes in. I followed the dev guide and reached a point where I could build my own version of the compiler with a single command!

Here's a "hot" build: that's me changing code in a single rustc crate, not a full, from-scratch build. It's not even a stage2 build, just stage1:

Shell session
$ ./x.py build library
Building rustbuild
   Compiling lzma-sys v0.1.17
   Compiling xz2 v0.1.6
   Compiling bootstrap v0.0.0 (/home/amos/work/rust/src/bootstrap)
    Finished dev [unoptimized] target(s) in 5.16s
Building stage0 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
    Finished release [optimized] target(s) in 0.16s
Copying stage0 std from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
Building stage0 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
   Compiling rustc_llvm v0.0.0 (/home/amos/work/rust/compiler/rustc_llvm)
   Compiling rustc_log v0.0.0 (/home/amos/work/rust/compiler/rustc_log)
   Compiling rustc_trait_selection v0.0.0 (/home/amos/work/rust/compiler/rustc_trait_selection)
   Compiling rustc_lint v0.0.0 (/home/amos/work/rust/compiler/rustc_lint)
   Compiling rustc_ty_utils v0.0.0 (/home/amos/work/rust/compiler/rustc_ty_utils)
   Compiling rustc_const_eval v0.0.0 (/home/amos/work/rust/compiler/rustc_const_eval)
   Compiling rustc_traits v0.0.0 (/home/amos/work/rust/compiler/rustc_traits)
   Compiling rustc_mir_build v0.0.0 (/home/amos/work/rust/compiler/rustc_mir_build)
   Compiling rustc_typeck v0.0.0 (/home/amos/work/rust/compiler/rustc_typeck)
   Compiling rustc_plugin_impl v0.0.0 (/home/amos/work/rust/compiler/rustc_plugin_impl)
   Compiling rustc_mir_transform v0.0.0 (/home/amos/work/rust/compiler/rustc_mir_transform)
   Compiling rustc_borrowck v0.0.0 (/home/amos/work/rust/compiler/rustc_borrowck)
   Compiling rustc_privacy v0.0.0 (/home/amos/work/rust/compiler/rustc_privacy)
   Compiling rustc_codegen_llvm v0.0.0 (/home/amos/work/rust/compiler/rustc_codegen_llvm)
   Compiling rustc_interface v0.0.0 (/home/amos/work/rust/compiler/rustc_interface)
   Compiling rustc_driver v0.0.0 (/home/amos/work/rust/compiler/rustc_driver)
   Compiling rustc_smir v0.0.0 (/home/amos/work/rust/compiler/rustc_smir)
   Compiling rustc-main v0.0.0 (/home/amos/work/rust/compiler/rustc)
    Finished release [optimized + debuginfo] target(s) in 33.73s
Copying stage0 rustc from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
Assembling stage1 compiler (x86_64-unknown-linux-gnu)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
   Compiling compiler_builtins v0.1.73
   Compiling core v0.0.0 (/home/amos/work/rust/library/core)
   Compiling libc v0.2.126
   Compiling cc v1.0.69
   Compiling memchr v2.4.1
   Compiling std v0.0.0 (/home/amos/work/rust/library/std)
   Compiling unwind v0.0.0 (/home/amos/work/rust/library/unwind)
   Compiling rustc-std-workspace-core v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-core)
   Compiling alloc v0.0.0 (/home/amos/work/rust/library/alloc)
   Compiling cfg-if v0.1.10
   Compiling adler v0.2.3
   Compiling rustc-demangle v0.1.21
   Compiling rustc-std-workspace-alloc v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-alloc)
   Compiling panic_unwind v0.0.0 (/home/amos/work/rust/library/panic_unwind)
   Compiling panic_abort v0.0.0 (/home/amos/work/rust/library/panic_abort)
   Compiling gimli v0.25.0
   Compiling miniz_oxide v0.4.0
   Compiling hashbrown v0.12.0
   Compiling std_detect v0.1.5 (/home/amos/work/rust/library/stdarch/crates/std_detect)
   Compiling object v0.26.2
   Compiling addr2line v0.16.0
   Compiling rustc-std-workspace-std v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-std)
   Compiling proc_macro v0.0.0 (/home/amos/work/rust/library/proc_macro)
   Compiling unicode-width v0.1.8
   Compiling getopts v0.2.21
   Compiling test v0.0.0 (/home/amos/work/rust/library/test)
    Finished release [optimized] target(s) in 41.80s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
Build completed successfully in 0:01:21

It "only" took 1 minutes and 20 seconds. Scare quotes because that's not quite the "tight feedback loop" I'm dreaming of, but also, going into rustc development, I expected much, much worse. (But also, big CPU on this machine, etc. I'm not sure how much it actually helps since there's a bunch of discrete steps).

Oh, also, by following more instructions in the dev guide, that build is available through rustup as the "stage1" toolchain.
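For reference, the dev guide's way of doing that is a `rustup toolchain link` pointing at the stage1 build directory (the exact path depends on your host triple; the one below assumes an x86_64 Linux host):

Shell session

```shell
# from the root of the rust checkout
$ rustup toolchain link stage1 build/x86_64-unknown-linux-gnu/stage1
```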

Compare stable with my build:

Shell session
$ rustc --version
rustc 1.62.0 (a8314ef7d 2022-06-27)
$ rustc +stage1 --version
rustc 1.64.0-dev

Anyway, if we do a quick search in rustc's code, we find this function:

Rust code
//  in `rust/compiler/rustc_trait_selection/src/traits/select/mod.rs`

    fn evaluate_predicate_recursively<'o>(
        &mut self,
        previous_stack: TraitObligationStackList<'o, 'tcx>,
        obligation: PredicateObligation<'tcx>,
    ) -> Result<EvaluationResult, OverflowError> {
        // (cut)
    }

And, well, we can just tell the self-profiler to record this activity, with an argument:

Rust code
//  in `rust/compiler/rustc_trait_selection/src/traits/select/mod.rs`

    fn evaluate_predicate_recursively<'o>(
        &mut self,
        previous_stack: TraitObligationStackList<'o, 'tcx>,
        obligation: PredicateObligation<'tcx>,
    ) -> Result<EvaluationResult, OverflowError> {
        let _prof_timer = self.infcx.tcx.prof.generic_activity_with_arg_recorder(
            "evaluate_predicate_recursively",
            |recorder| {
                recorder.record_arg(format!("{:?}", obligation));
            },
        );

        // (cut)
    }

Now we rebuild with ./x.py build library... and run it again with -Z self-profile, making sure to record arguments:

Shell session
$ rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "7"' -Z self-profile -Z self-profile-events=default,args

And instead of opening that up in chrome://tracing, we can do some digging ourselves: crox converts the raw .profdata into chrome_profiler.json, jq pulls out the recorded argument of each evaluate_predicate_recursively span, sed normalizes the depth=N suffix so identical obligations at different depths group together, and sort | uniq -c counts the duplicates:

Shell session
$ crox $(ls -t *profdata | head -1)
$ jq -r '.[] | select(.name == "evaluate_predicate_recursively") | .args.arg0' chrome_profiler.json | sed -E 's/depth=[0-9]+/depth=?/' | sort -n | uniq -c | sort -n
      1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      1 Obligation(predicate=Binder(TraitPredicate(<&&&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?)
      1 Obligation(predicate=Binder(TraitPredicate(<() as std::process::Termination>, polarity:Positive), []), depth=?)
      1 Obligation(predicate=Binder(TraitPredicate(<&&&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
      6 Obligation(predicate=Binder(TraitPredicate(<&&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?)
      6 Obligation(predicate=Binder(TraitPredicate(<&&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
     26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
     26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
     26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
     26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
     26 Obligation(predicate=Binder(TraitPredicate(<&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?)
     26 Obligation(predicate=Binder(TraitPredicate(<&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    106 Obligation(predicate=Binder(TraitPredicate(<&&&() as std::marker::Sized>, polarity:Positive), []), depth=?)
    106 Obligation(predicate=Binder(TraitPredicate(<&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
    426 Obligation(predicate=Binder(TraitPredicate(<&&() as std::marker::Sized>, polarity:Positive), []), depth=?)
    426 Obligation(predicate=Binder(TraitPredicate(<&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   1706 Obligation(predicate=Binder(TraitPredicate(<&() as std::marker::Sized>, polarity:Positive), []), depth=?)
   1706 Obligation(predicate=Binder(TraitPredicate(<&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6826 Obligation(predicate=Binder(TraitPredicate(<() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?)
   6828 Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=?)

And I don't know how you feel about it, but... this is such a cool visualization of the problem. We can see that the top-level obligations (on &&&&&&()) are only evaluated once, while the next level down is evaluated 6 times (1 for Trait itself, 4 for the associated types, and 1 for the implicit Sized bound).

Yeah yeah that's cool and all but... don't we have debug-level traces now that we have our own rustc build?

Oh! Yes we do!

Tracing deeper with debug spans

The problem with just exporting RUSTC_LOG=debug is that we're gonna get a lot of spans. Even with depth = "1":

Shell session
$ RUSTC_LOG="debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "1"' 2>&1 | wc -l
6864

But we already know which span we want!

In fact, I didn't show you earlier (I willingly withheld that information from you), but evaluate_predicate_recursively is already instrumented!

Rust code
    #[instrument(
        level = "debug",
        skip(self, previous_stack),
        fields(previous_stack = ?previous_stack.head())
    )]
    fn evaluate_predicate_recursively<'o>(
        &mut self,
        previous_stack: TraitObligationStackList<'o, 'tcx>,
        obligation: PredicateObligation<'tcx>,
    ) -> Result<EvaluationResult, OverflowError> {
        // (cut)
    }

So instead of using the self-profiling facilities, we can just use tracing. The EnvFilter syntax understood by RUSTC_LOG supports span-scoped directives of the form target[span_name]=level, which only enables events inside the given span.

Like so:

Shell session
$ RUSTC_LOG="rustc_trait_selection::traits::select[evaluate_predicate_recursively]=debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "1"'
┐rustc_trait_selection::traits::select::evaluate_predicate_recursively obligation=Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=0), previous_stack=None
├─┐rustc_trait_selection::traits::select::evaluate_trait_predicate_recursively obligation=Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=0)
(cut)
├─┘
├─0ms DEBUG rustc_trait_selection::traits::select finished: Ok(EvaluatedToOk) from Obligation(predicate=Binder(TraitPredicate(<() as std::process::Termination>, polarity:Positive), []), depth=0)
┘

But even with just that, we have 1427 lines to go through. And that's just at depth 1: depth 2 gives us 7559 lines, depth 3 gives us 38313 lines... we are looking at an exponential problem, after all.

Wouldn't it be convenient if there was a way to visualize all that from the comfort of a web browser?

Say... what if folks working on backend web services had all agreed on a standard, years in the making, to send over spans and events and stuff?

Oh. Oh no. Amos what are you doi-

Enter OpenTelemetry, a "collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior".

That seems perfect. And what do you know, there's a tracing-opentelemetry adapter!

Now all we need is something self-hosted to receive those traces and... I guess Jaeger could do.

Let's look at how rustc sets up logging:

Rust code
// in `rust/compiler/rustc_log/src/lib.rs`

// (most functions and imports cut)

/// In contrast to `init_rustc_env_logger` this allows you to choose an env var
/// other than `RUSTC_LOG`.
pub fn init_env_logger(env: &str) -> Result<(), Error> {
    let filter = match env::var(env) {
        Ok(env) => EnvFilter::new(env),
        _ => EnvFilter::default().add_directive(Directive::from(LevelFilter::WARN)),
    };

    let color_logs = match env::var(String::from(env) + "_COLOR") {
        Ok(value) => match value.as_ref() {
            "always" => true,
            "never" => false,
            "auto" => stderr_isatty(),
            _ => return Err(Error::InvalidColorValue(value)),
        },
        Err(VarError::NotPresent) => stderr_isatty(),
        Err(VarError::NotUnicode(_value)) => return Err(Error::NonUnicodeColorValue),
    };

    let verbose_entry_exit = match env::var_os(String::from(env) + "_ENTRY_EXIT") {
        None => false,
        Some(v) => &v != "0",
    };

    let layer = tracing_tree::HierarchicalLayer::default()
        .with_writer(io::stderr)
        .with_indent_lines(true)
        .with_ansi(color_logs)
        .with_targets(true)
        .with_verbose_exit(verbose_entry_exit)
        .with_verbose_entry(verbose_entry_exit)
        .with_indent_amount(2);
    #[cfg(parallel_compiler)]
    let layer = layer.with_thread_ids(true).with_thread_names(true);

    let subscriber = tracing_subscriber::Registry::default().with(filter).with(layer);
    tracing::subscriber::set_global_default(subscriber).unwrap();

    Ok(())
}

Alright, seems straightforward enough. So now we just-

Amos there is no way this works. Are you going to pull tokio in there too? Async rustc? Is that the sort of chaos you're after?

Well, maybe, let's find out. So first we pull in the dependencies we need:

TOML markup
# in `rust/compiler/rustc_log/Cargo.toml`

[dependencies]
atty = "0.2"
tracing = "0.1.28"
tracing-subscriber = { version = "0.3.3", default-features = false, features = ["fmt", "env-filter", "smallvec", "parking_lot", "ansi"] }
tracing-tree = "0.2.0"
# 👇 new
tracing-opentelemetry = { version = "0.17.4" }
opentelemetry-jaeger = { version = "0.16.0", features = [
	"reqwest_blocking_collector_client",
] }

And then we send our traces to jaeger instead if the RUSTC_JAEGER environment variable is set:

Rust code
// in `rust/compiler/rustc_log/src/lib.rs`

pub fn init_env_logger(env: &str) -> Result<(), Error> {
    let filter = match env::var(env) {
        Ok(env) => EnvFilter::new(env),
        _ => EnvFilter::default().add_directive(Directive::from(LevelFilter::WARN)),
    };

    if env::var("RUSTC_JAEGER").is_ok() {
        let tracer = opentelemetry_jaeger::new_pipeline()
            .with_service_name("rustc")
            .with_collector_endpoint("http://localhost:14268/api/traces")
            .install_simple().unwrap();

        let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

        let subscriber = tracing_subscriber::Registry::default().with(filter).with(telemetry);
        tracing::subscriber::set_global_default(subscriber).unwrap();
    } else {
        let color_logs = match env::var(String::from(env) + "_COLOR") {
            Ok(value) => match value.as_ref() {
                "always" => true,
                "never" => false,
                "auto" => stderr_isatty(),
                _ => return Err(Error::InvalidColorValue(value)),
            },
            Err(VarError::NotPresent) => stderr_isatty(),
            Err(VarError::NotUnicode(_value)) => return Err(Error::NonUnicodeColorValue),
        };

        let verbose_entry_exit = match env::var_os(String::from(env) + "_ENTRY_EXIT") {
            None => false,
            Some(v) => &v != "0",
        };

        let layer = tracing_tree::HierarchicalLayer::default()
            .with_writer(io::stderr)
            .with_indent_lines(true)
            .with_ansi(color_logs)
            .with_targets(true)
            .with_verbose_exit(verbose_entry_exit)
            .with_verbose_entry(verbose_entry_exit)
            .with_indent_amount(2);
        #[cfg(parallel_compiler)]
        let layer = layer.with_thread_ids(true).with_thread_names(true);

        let subscriber = tracing_subscriber::Registry::default().with(filter).with(layer);
        tracing::subscriber::set_global_default(subscriber).unwrap();
    }

    Ok(())
}

Now we can run Jaeger's "all-in-one" docker image in a terminal session somewhere...

Shell session
$ docker kill jaeger ; docker rm jaeger; docker run -d --name jaeger -e SPAN_STORAGE_TYPE=badger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -e COLLECTOR_OTLP_ENABLED=true -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 4317:4317 -p 4318:4318 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:1.36
c69025488ee2384bda5263f10954eb0830d0f41e045dad5b76da5143e6d2e948

It's now running in the background, and the web UI is available at http://localhost:16686/.

Time to send some traces to it!

Shell session
$ RUSTC_JAEGER=1 RUSTC_LOG="rustc_trait_selection::traits::select[evaluate_predicate_recursively]=debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "4"'

By picking "rustc" from the service dropdown and hitting search, we can see a few traces:

Jaeger screenshot: home screen. There's a latency graph showing an operation taking 200ms. Operation names are something like poly_project_and_unify_type, evaluate_predicate_recursively, etc.

See that dot taking 200ms in the latency graph? Let's click on that... and then collapse a few spans... and voila:

A detailed view of an evaluate_predicate_recursively span taking 201.61ms

So! There's 4076 total spans, a depth of 25, and we can see exactly what happens, when, and because of what.

We can expand or collapse individual spans, along with their "tags" (which tracing calls fields, if I recall correctly). Among them: the arguments to our instrumented functions!

So for example the top-level predicate is about what we'd expect:

The top-level span is now expanded, showing tags like busy_ns, code.filepath, code.lineno, code.namespace, idle_ns, internal.span.format, obligation, otel.library.name, otel.library.version, previous_stack, thread.id and thread.name

Here it is, manually reformatted:

Obligation(
  predicate=Binder(
    ProjectionPredicate(
      ProjectionTy {
        substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })],
        item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A)
      },
      Ty(())
    ),
    [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]
  ),
  depth=0
)

Then we move on to poly_project_and_unify_type, which is relatively quick, and then evaluate_predicates_recursively (predicates, plural), with these:

[
  Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2),
  Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2),
  Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2),
  Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2),
  Obligation(predicate=Binder(TraitPredicate(<&&() as std::marker::Sized>, polarity:Positive), []), depth=2), Obligation(predicate=Binder(TraitPredicate(<&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2)
]

If we keep digging, we find another set of poly_project_etc and evaluate_predicate_recursively, with:

Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2)

Then another, with:

Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=4)

Then another, with:

Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=6)

(Notice the number of & decreases, and the depth increases).

And that one is trivially evaluated with no further nesting.

Exploring the rest of the graph shows much of the same. To evaluate any constraint of the form for<'x> <&&() as Trait<'x>>::A = (), we must first evaluate all of these:

And for the first 5 of those, we must evaluate all of the same obligations again, at every deeper level. And somewhere therein lies the bug: the results of those "inner" evaluations should be cached, so we should never recurse that often, but we do.

What now?

Well, now I've run out of knowledge! I'm monitoring rust issue 99188 with interest, and might keep digging to see if I can somehow figure out why these aren't getting cached.

Maybe this little exploration will help solve the actual issue, maybe not! Maybe it'll give someone a taste for diving into rustc internals.

Anyway, that's all I got for you today. As always, thanks for watching, and don't forget to like comment and sub-

Oooh you need a nap.

Yes I do.