When rustc explodes
đ This page was last updated ~2 years ago. Just so you know.
One could say I have a bit of an obsession with build times.
I believe having a "tight feedback loop" is extremely valuable: when I work on a large codebase, I want to be able to make small incremental changes and check very often that things are going as expected.
Especially if I'm working on a project that needs to move quickly: say, the product for an early-stage startup, or a side-project for which I only ever get to do 1-hour work bursts at most.
There's a lot about engineering that we need to endure: a lot of problems have inherent complexity. Coordination across humans is hard, and it's lossy. It's really tricky to strike the right balance between "not enough testing" and "ossifying the codebase" (making it harder to change its structure later).
But there's things we don't need to suffer. And one of them is "watching the compiler go brrrr" for excessive amounts of time.
Previously, on this here site
Of course, everyone has their own definition of "excessive". I've been in multiple discussions where I've argued Rust compile times aren't that bad, and then they'll tell me "okay but some projects at work take over 2 hours to compile", to which my response is "oh wow no that's bonkers actually, how the heck did y'all do that?"
And that's what makes it so hard to discuss these kind of things online: you may have it good. It might work great for you. But someone else might be having a really poor time with it, for quite some time now, and they've just resigned themselves that "that's how it is I guess".
It's often proprietary codebases, too, so short of being hired, there's few opportunities for someone to come in, look at the absolute state of their CI pipelines and go "damn, you really live like this?".
So I wrote Why is my Rust build so slow, in which we reviewed things like:
- What cargo actually does (it invokes
rustc
, takes care of conditional compilation, caching, downloading dependencies, and a thousand other things) - How much time is being spent compiling each crate? What kind of dependencies
are there, and what concurrency do we achieve? (with
--timings
, stable as of Rust 1.60) - Using a faster linker, like LLVM lld, or mold.
- The cost of debug information (
debug = true
ordebug = 2
, vsdebug = 1
ordebug = 0
). - Incremental compilation (
incremental = true
inCargo.toml
), enabled by default for debug builds except those few times where it was broken and thus disabled in stable Rust - The cost of LTO (link-time optimization), and its different flavors: off, thin, "fat".
- rustc self-profiling with
-Z self-profile
, and the various measureme tools:summarize
,flamegraph
, andcrox
- Looking at
crox
output withchrome://tracing
or Perfetto
Finally, we found that there was a workaround we could use with the warp
crate, that made our compile times go down significantly. Splitting crates into
smaller crates was also interesting for increased concurrency.
There's a bunch of other articles that have been written about this topic around the interwebs, but ultimately most of them fall short: what do you do when you've done all of that, but you still suspect you're hitting a pathological case in rustc that makes your build much slower than it ought to be?
What then?
Well, I hit one of these just recently, so, let's look into it.
The offending code
Originally, I hit this performance bug with some tower code. It's a crate used by hyper, itself used by higher-level crates like warp and axum, so it's no coincidence that folks have had build time troubles with all of these.
The crux of the problem is "gee grandma, what big types you have there", because... essentially, tower is a couple of important traits.
Here's one:
pub trait Service<Request> { type Response; type Error; type Future: Future<Output = Result<Self::Response, Self::Error>>; fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>>; fn call(&mut self, req: Request) -> Self::Future; }
And the basic idea is "oh shit we don't have async trait methods" (hence the
Future
associated type) but also "backpressure is love, backpressure is life",
by way of the poll_ready
method.
What happens is you hand hyper
a Service<Request>
, and for each new HTTP
request it first calls poll_ready
until it returns Poll::Ready
. It doesn't
do that in a loop: it knows when to try again because in poll_ready
, if we
weren't ready, we registered ourselves with the
Waker that's
hidden in the Context
, so we know when... a socket is ready to be written to /
read from, or a timer has elapsed, or some semaphore has more permits
available... and that's how async Rust works, more or less!
Wow, most concise explanation yet!
Yes, and completely opaque to anyone who doesn't already know how it works. But
that doesn't matter much for this article. Once poll_ready
has returned
Poll::Ready
, then call
can be called, and it returns a Future
, which is
spawned as a background task, and then the request can be handled asynchronously.
Because we don't have "async trait methods", ie. the trait can't just be this:
pub trait Service<Request> { type Response; type Error; async fn poll_ready(&self) -> Result<(), Self::Error>; async fn call(&self, req: Request) -> Result<Self::Response, Self::Error>; }
...we have this Future
associated type:
type Future: Future<Output = Result<Self::Response, Self::Error>>;
And most of the time, that's gonna end up being a boxed future, especially if you're using async blocks.
So, often, your service ends up looking something like this:
use hyper::{Body, Request, Response}; use std::{ convert::Infallible, future::Future, pin::Pin, task::{Context, Poll}, time::Duration, }; use tokio::time::sleep; use tower::Service; struct MyService; impl Service<Request<Body>> for MyService { type Response = Response<Body>; type Error = Infallible; type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>>>>; fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> { // we don't apply any backpressure here, so always ready Poll::Ready(Ok(())) } fn call(&mut self, req: Request<Body>) -> Self::Future { Box::pin(async move { // imagine we're fetching data from an API instead sleep(Duration::from_millis(250)).await; let body = format!("you've requested {:?}", req.uri()); let res = Response::builder().body(Body::from(body)).unwrap(); Ok(res) }) } }
And, ridiculous verbosity aside, the nice thing about this is that you can compose those services. So we might take that "sleeping for 250ms" functionality and move it into another service, like so:
use hyper::{Body, Request, Response}; use std::{ convert::Infallible, future::Future, pin::Pin, task::{Context, Poll}, time::Duration, }; use tokio::time::sleep; use tower::Service; struct SleepService<S> { inner: S, } impl<S, Req> Service<Req> for SleepService<S> where S: Service<Req>, <S as Service<Req>>::Future: 'static, { type Response = S::Response; type Error = S::Error; type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>>>>; fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> { self.inner.poll_ready(cx) } fn call(&mut self, req: Req) -> Self::Future { let fut = self.inner.call(req); Box::pin(async move { // imagine we're fetching data from an API instead sleep(Duration::from_millis(250)).await; fut.await }) } } struct MyService; impl Service<Request<Body>> for MyService { type Response = Response<Body>; type Error = Infallible; // this future is no longer boxed! it's a future that immediately completes. type Future = std::future::Ready<Result<Self::Response, Self::Error>>; fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> { Poll::Ready(Ok(())) } fn call(&mut self, req: Request<Body>) -> Self::Future { let body = format!("you've requested {:?}", req.uri()); let res = Response::builder().body(Body::from(body)).unwrap(); std::future::ready(Ok(res)) } }
And that's neat because now you can separate concerns cleanly across multiple services: some will rate-limit, some will retry, some will handle errors, etc.
And the end result is that the service you hand to hyper
isn't merely
MyService
, it's something like:
type ServiceStack = Trace< WebSocketUpgrade< UpstreamTimings< CompressResponse< PrepareRequest< HandleErrors< HttpsRedirect< Balancer< Retry< MyService > > > > > > > > >;
Thankfully, tower
provides another type, Layer
, which turns the former
into something like:
let service = ServiceBuilder::new() .layer(TraceLayer) .layer(WebSocketUpgradeLayer) .layer(UpstreamTimingsLayer) .layer(CompressResponseLayer) .layer(PrepareRequestLayer) .layer(HandleErrorsLayer) .layer(HttpsRedirectLayer) .layer(BalancerLayer) .layer(RetryLayer) .service(MyService);
Which is nicer to read, if you can remember that it goes outer -> inner
. But
it still builds the same gigantic type.
One downside is that the type errors you get with these can be gnarly. Learning to address those is a bit of a special skill I've acquired, and I'm not sure it should be a prerequisite.
And another downside is that these large types tend to hit corner cases in rustc (the rust compiler) that are... slow. Really slow. Exponentially slower as you add more layers to your service stack in fact.
And I think we've had quite enough of tower
for one article, so I'm going to
move on to the small repro case that eddyb made from
my already-small repro case.
The repro case
Stolen directly from rust issue 99188, here it is:
trait Trait<'a> { type A; type B; type C; type D; fn method() {} } impl<T> Trait<'_> for &'_ T where for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>, { type A = (); type B = (); type C = (); type D = (); } impl Trait<'_> for () { type A = (); type B = (); type C = (); type D = (); } pub fn main() { #[cfg(depth = "7")] <&&&&&&&() as Trait>::method(); #[cfg(depth = "8")] <&&&&&&&&() as Trait>::method(); #[cfg(depth = "9")] <&&&&&&&&&() as Trait>::method(); #[cfg(depth = "10")] <&&&&&&&&&&() as Trait>::method(); }
What's neat about having a self-contained repro case like that (no dependencies,
single file) is that we can call rustc
directly instead of calling cargo
.
So first, let's check that we have pathological behavior thanks to those neat
cfg
attributes there. We'll pass --emit=metadata
to rustc so that it doesn't
do full codegen with LLVM. I think this is close to cargo check
? Unsure.
$ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "7"' took 0.17s $ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "8"' took 0.76s $ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "9"' took 3.38s $ command time -f 'took %Us' rustc src/main.rs --emit=metadata --cfg 'depth = "10"' took 15.53s
We're doing this on stable at the time of this writing, btw:
$ rustc --version rustc 1.62.0 (a8314ef7d 2022-06-27)
So yeah, every time you add one level of nesting, it gets 4x slower. So compile time is , where is the level of nesting â that's exponential growth (the bad kind).
But why? Where is it spending time? What is going on?
Let's find out.
But before we do, to keep things manageable, let's add more cfg
s with smaller
values, like so:
pub fn main() { #[cfg(depth = "1")] <&() as Trait>::method(); #[cfg(depth = "2")] <&&() as Trait>::method(); #[cfg(depth = "3")] <&&&() as Trait>::method(); #[cfg(depth = "4")] <&&&&() as Trait>::method(); #[cfg(depth = "5")] <&&&&&() as Trait>::method(); #[cfg(depth = "6")] <&&&&&&() as Trait>::method(); #[cfg(depth = "7")] <&&&&&&&() as Trait>::method(); #[cfg(depth = "8")] <&&&&&&&&() as Trait>::method(); #[cfg(depth = "9")] <&&&&&&&&&() as Trait>::method(); #[cfg(depth = "10")] <&&&&&&&&&&() as Trait>::method(); }
Because as you'll see, things can get verbose fast.
Tracing with RUSTC_LOG
As of... some time ago, I don't remember, rustc
is using the
tracing crate, along with
tracing-tree, which means it's very easy
for us to turn on some debug logging.
I was excited about this because I generally love tracing
, as I've said
before, and seeing it used in a compiler
caused me great joy.
rustc logging is set up so that you can set RUSTC_LOG
to any values accepted
by EnvFilter.
So, let's set it to info
and see what we get:
$ RUSTC_LOG=info rustc src/main.rs --emit=metadata --cfg 'depth = "1"' INFO rustc_metadata::creader resolving crate `std` INFO rustc_metadata::creader falling back to a load INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.so INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-a61cdd33cfa8394f.rlib INFO rustc_metadata::locator lib candidate: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.rlib INFO rustc_metadata::locator rlib reading metadata from: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-a61cdd33cfa8394f.rlib INFO rustc_metadata::locator Rejecting via crate name
Okay, so rustc knows about crates! Interesting. Let's skip ahead, there's a ton of those.
INFO rustc_metadata::creader resolved crates: name: std cnum: 1 hash: 514c814935f60577 reqd: Explicit rlib: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-7ca39ac42651c3df.rlib name: core cnum: 2 hash: 40880dfc3076f43e reqd: Explicit rlib: /home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-83735dd4dae9b02c.rlib
Neat, it found a bunch of crates (19 in total, not shown here).
INFO rustc_interface::passes 0 parse sess buffered_lints ârustc_typeck::check::wfcheck::check_trait ââârustc_trait_selection::traits::project::normalize_with_depth_to depth=0, value=[Binder(TraitPredicate(<Self as Trait<'a>>, polarity:Positive), [])]
Okay, now it started doing type checking! (You can tell from the crate name:
rustc_typeck
â rustc too is split across a bunch of crates).
Much later, we see:
ârustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b057]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))]) â ârustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b057]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))]) â ârustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b057]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))]) â ârustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=Binder(ProjectionPredicate(ProjectionTy { substs: [_, ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b057]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))])
Ah! That seems related to our code.
Let's reformat one of these manually:
Binder( ProjectionPredicate( ProjectionTy { substs: [ _, ReLateBound( DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) } ) ], item_def_id: DefId(0:5 ~ main[b057]::Trait::A) }, Ty(()) ), [ Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x)) ] )
Looking through rustc's source code a bit, we find:
Binder is a binder for higher-ranked lifetimes or types. It is part of the compiler's representation for things like
for<'a> Fn(&'a isize)
(which would be represented by the typePolyTraitRef == Binder<'tcx, TraitRef>
). Note that when we instantiate, erase, or otherwise "discharge" these bound vars, we change the type fromBinder<'tcx, T>
to justT
(see e.g.,liberate_late_bound_regions
).
That seems to correspond to this constraint in our code:
where for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,
In there we have a ProjectionPredicate
:
This kind of predicate has no direct correspondent in the syntax, but it roughly corresponds to the syntactic forms:
T: TraitRef<..., Item = Type>
<T as TraitRef<...>>::Item == Type
(NYI)In particular, form #1 is "desugared" to the combination of a normal trait predicate (
T: TraitRef<...>
) and one of these predicates. Form #2 is a broader form in that it also permits equality between arbitrary types. Processing an instance of Form #2 eventually yields one of theseProjectionPredicate
instances to normalize the LHS.
My read here is that this:
where for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,
Gets translated into this (which is not valid syntax):
where for<'x> T: Trait<'x>, for<'x> <T as Trait<'x>>::A = (), for<'x> <T as Trait<'x>>::B = (), for<'x> <T as Trait<'x>>::C = (), for<'x> <T as Trait<'x>>::D = (),
If you try it you'll run into "equality constraints are not yet supported in
where
clauses", with a link to this
issue.
But that seems to be what the compiler desugars it to, according to the comment.
And so, we're looking at something like for<'x> <T as Trait<'x>>::A = ()
.
Still in that logging output, we see a mention of De Bruijn indices, named after Nicolaas Govert de Bruijn, and is just a convenient way to "represent terms of lambda calculus without naming the bound variables". Here it's used to represent lifetimes.
That's as much as I can divine from this, but one thing's for sure: these log lines are related to typechecking our code.
Finally, we get a nice summary of... the distribution of type features for types in the interner? It's not "time spent", but it's fun to look at:
INFO rustc_interface::passes Pre-codegen Ty interner total ty lt ct all Adt : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Array : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Slice : 0 0.0%, 0.0% 0.0% 0.0% 0.0% RawPtr : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Ref : 15 25.4%, 5.1% 15.3% 0.0% 0.0% FnDef : 9 15.3%, 0.0% 8.5% 0.0% 0.0% FnPtr : 1 1.7%, 0.0% 0.0% 0.0% 0.0% Placeholder : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Generator : 0 0.0%, 0.0% 0.0% 0.0% 0.0% GeneratorWitness : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Dynamic : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Closure : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Tuple : 1 1.7%, 0.0% 0.0% 0.0% 0.0% Bound : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Param : 2 3.4%, 0.0% 0.0% 0.0% 0.0% Infer : 3 5.1%, 3.4% 0.0% 0.0% 0.0% Projection : 28 47.5%, 0.0% 13.6% 0.0% 0.0% Opaque : 0 0.0%, 0.0% 0.0% 0.0% 0.0% Foreign : 0 0.0%, 0.0% 0.0% 0.0% 0.0% total 59 8.5% 37.3% 0.0% 0.0% InternalSubsts interner: #95 Region interner: #54 Const Allocation interner: #0 Layout interner: #0
And... there's 47.5% in Projection
, which I think refers to those <T as Trait>::AssocType = Stuff
constraints.
And now I'm really curious how these evolve when we add more nesting, so, let's
filter log output so we only have stuff from rustc_interface::passes
and do
text processing:
for i in $(seq 1 9); do echo "============ Level ${i} ============"; RUSTC_LOG=rustc_interface::passes=info rustc src/main.rs --emit=metadata --cfg "depth = \"${i}\"" 2>&1 | grep "Post-codegen" -A 100 | grep -E "Ref|FnDef|Infer|Projection"; done ============ Level 1 ============ Ref : 15 25.4%, 5.1% 15.3% 0.0% 0.0% FnDef : 9 15.3%, 0.0% 8.5% 0.0% 0.0% Infer : 3 5.1%, 3.4% 0.0% 0.0% 0.0% Projection : 28 47.5%, 0.0% 13.6% 0.0% 0.0% ============ Level 2 ============ Ref : 37 43.5%, 21.2% 31.8% 0.0% 0.0% FnDef : 8 9.4%, 0.0% 4.7% 0.0% 0.0% Infer : 8 9.4%, 8.2% 0.0% 0.0% 0.0% Projection : 28 32.9%, 0.0% 9.4% 0.0% 0.0% ============ Level 3 ============ Ref : 65 54.6%, 29.4% 42.9% 0.0% 0.0% FnDef : 9 7.6%, 0.0% 4.2% 0.0% 0.0% Infer : 13 10.9%, 10.1% 0.0% 0.0% 0.0% Projection : 28 23.5%, 0.0% 6.7% 0.0% 0.0% ============ Level 4 ============ Ref : 113 62.4%, 40.9% 52.5% 0.0% 0.0% FnDef : 9 5.0%, 0.0% 2.8% 0.0% 0.0% Infer : 27 14.9%, 14.4% 0.0% 0.0% 0.0% Projection : 28 15.5%, 0.0% 4.4% 0.0% 0.0% ============ Level 5 ============ Ref : 231 60.9%, 48.3% 55.1% 0.0% 0.0% FnDef : 9 2.4%, 0.0% 1.3% 0.0% 0.0% Infer : 107 28.2%, 28.0% 0.0% 0.0% 0.0% Projection : 28 7.4%, 0.0% 2.1% 0.0% 0.0% ============ Level 6 ============ Ref : 652 58.2%, 53.1% 55.9% 0.0% 0.0% FnDef : 9 0.8%, 0.0% 0.4% 0.0% 0.0% Infer : 427 38.1%, 38.0% 0.0% 0.0% 0.0% Projection : 28 2.5%, 0.0% 0.7% 0.0% 0.0% ============ Level 7 ============ Ref : 2248 56.3%, 54.6% 55.5% 0.0% 0.0% FnDef : 9 0.2%, 0.0% 0.1% 0.0% 0.0% Infer : 1707 42.7%, 42.7% 0.0% 0.0% 0.0% Projection : 28 0.7%, 0.0% 0.2% 0.0% 0.0% ============ Level 8 ============ Ref : 8616 55.6%, 55.2% 55.4% 0.0% 0.0% FnDef : 9 0.1%, 0.0% 0.0% 0.0% 0.0% Infer : 6827 44.1%, 44.1% 0.0% 0.0% 0.0% Projection : 28 0.2%, 0.0% 0.1% 0.0% 0.0% ============ Level 9 ============ Ref : 33976 55.4%, 55.3% 55.3% 0.0% 0.0% FnDef : 9 0.0%, 0.0% 0.0% 0.0% 0.0% Infer : 27307 44.5%, 44.5% 0.0% 0.0% 0.0% Projection : 28 0.0%, 0.0% 0.0% 0.0% 0.0%
Ah! It looks like Infer
is EXPloding. And so is Ref
, but I'm not exactly
sure why.
Wait, why are we doing this again? Doesn't rustc have self-profiling? That seems like it could be useful there.
You know what, that's a good point, let's try that.
Self-profiling with -Z self-profile
We'll choose nesting level 8, because it takes a handful of seconds on my machine so it's probably long enough to measure stuff, yet short enough that it doesn't get really annoying.
So, we've learned this in a previous
article, let's get moving. As before,
and because we're still on stable at this point, we'll use the RUSTC_BOOTSTRAP
environment variable to unlock nightly features, which is something you
shouldn't normally do:
$ RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile (no output) # this takes the most recent `*.mm_profdata` file $ summarize summarize $(ls -t *profdata | head -1) | head +------------------------------------------------+-----------+-----------------+----------+------------+ | Item | Self time | % of total time | Time | Item count | +------------------------------------------------+-----------+-----------------+----------+------------+ | evaluate_obligation | 738.19ms | 98.892 | 738.21ms | 51 | +------------------------------------------------+-----------+-----------------+----------+------------+ | expand_crate | 2.32ms | 0.311 | 2.71ms | 1 | +------------------------------------------------+-----------+-----------------+----------+------------+ | free_global_ctxt | 892.96”s | 0.120 | 892.96”s | 1 | +------------------------------------------------+-----------+-----------------+----------+------------+ | codegen_fulfill_obligation | 556.41”s | 0.075 | 738.71ms | 1 | thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Alright! We have a candidate there for sure.
Here it is in flamegraph form:
$ flamegraph $(ls -t *profdata | head -1) # using the `sfz` crate as a static file server to visualize the SVG in a browser $ sfz Files served on http://127.0.0.1:5000
And in chrome://tracing
form, which has a bunch of useful temporal information
the flamegraph lacks:
$ crox $(ls -t *profdata | tail -1) # this generated a `chrome_profiler.json` file
Here for example, we see that evaluate_obligation
is actually done multiple
times: the first one takes the longest, the next couple ones are a little bit
quicker, and the last few are quicker still.
But which "obligations" is it evaluating? We don't know, because we haven't
recorded arguments. Which we can do with another flag, -Z self-profile-events
:
$ RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile -Z self-profile-events=default,args
The summary looks mostly the same:
$ summarize summarize $(ls -t *profdata | head -1) | head +-------------------------------------------------+-----------+-----------------+----------+------------+ | Item | Self time | % of total time | Time | Item count | +-------------------------------------------------+-----------+-----------------+----------+------------+ | evaluate_obligation | 774.03ms | 98.623 | 774.06ms | 51 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | expand_crate | 2.32ms | 0.295 | 2.71ms | 1 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | metadata_decode_entry_module_children | 1.98ms | 0.253 | 1.98ms | 298 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | codegen_fulfill_obligation | 737.23”s | 0.094 | 774.74ms | 1 | thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Let's look at the chrome tracing profile directly:
This time, we can see which obligation is being checked, and the first one is (manually formatted for your convenience):
Canonical { max_universe: U0, variables: [], value: ParamEnvAnd { param_env: ParamEnv { caller_bounds: [], reveal: UserFacing, constness: NotConst }, value: Binder( ProjectionPredicate( ProjectionTy { substs: [ &&&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x) }) ], item_def_id: DefId(0:5 ~ main[b057]::Trait::A) }, Ty(()) ), [Region(BrNamed(DefId(0:12 ~ main[b057]::{impl#0}::'x), 'x))] ) } }
There's not really a good reason for that #[derive(Debug)]
-style output being
in there btw. There's machinery in place in rustc to format these nicely, like
for<'x> Predicate<'x>
but apparently it's not hooked all the way through.
Now that I think about it, that would be a neat first contribution... wink wink.
Did you just say "wink wink" out loud?
You're not the boss of me.
The next three are the same, except the item_def_id
is for Trait::B
,
Trait::C
and Trait::D
respectively. The next one doesn't have a projection,
it's simply the for<'x> T: Trait<'x>
obligation.
And then the other ones are similar, but for the inner levels: seven &
instead
of eight, and so on until we reach the bottom, where we check for<'x> &() Trait<'x>
.
Which... is interesting. You'd think that... once rustc has proven that for<'x> &() Trait<'x>
,
it can re-use that result to prove that for<'x> &&(): Trait<'x>
, using that impl
:
impl<T> Trait<'_> for &'_ T where for<'x> T: Trait<'x, A = (), B = (), C = (), D = ()>,
And then it can re-use that to prove for<'x> &&&(): Trait<'x>
, and so on until
for<'x> &&&&&&&&(): Trait<'x>
.
And in fact, it can! If we remove the projections, ie. we make that bound just this:
impl<T> Trait<'_> for &'_ T where for<'x> T: Trait<'x>,
Then it completes instantly:
$ command time -f "took %Us" env RUSTC_BOOTSTRAP=1 rustc src/main.rs --emit=metadata --cfg 'depth = "8"' -Z self-profile -Z self-profile-events=default,args took 0.04s
And at that point, most of the time is spent in.. expand_crate
(macro stuff?)
and metadata_decode_entry_module_children
(reading rlib files?):
$ summarize summarize $(ls -t *profdata | head -1) | head +-------------------------------------------------+-----------+-----------------+----------+------------+ | Item | Self time | % of total time | Time | Item count | +-------------------------------------------------+-----------+-----------------+----------+------------+ | expand_crate | 2.34ms | 25.009 | 2.74ms | 1 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | metadata_decode_entry_module_children | 2.27ms | 24.295 | 2.27ms | 298 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | self_profile_alloc_query_strings | 449.57”s | 4.815 | 3.14ms | 1 | +-------------------------------------------------+-----------+-----------------+----------+------------+ | <unknown> | 424.47”s | 4.546 | 5.03ms | 320 | thread 'main' panicked at 'Cannot print table to standard output : Broken pipe (os error 32)', /home/amos/.cargo/registry/src/github.com-1ecc6299db9ec823/prettytable-rs-0.8.0/src/lib.rs:194:23 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
So something is definitely up, and we should look further.
I'm sorry, amos, if I could have one second of your time. Are you seriously suggesting folks who write Rust casually have a moral obligation to do all of this?
Oh! Oh no, absolutely not. It's just... if you want to, then you definitely can, and it's a bit under-documented right now, so I wanted to share my notes on how, so that the next time it happens, I can refer to those instead of furiously spamming everyone. And if others do that too maybe we can get to the bottom of these faster, that's all.
Ah, so you're saying not to do this?
I'm saying do whatever you want. I'll show you how, and then you can decide what you want to do with that knowledge.
I see I see. So you're going to want to keep this accessible still, right? It's not that bad so far, but I could see going a little too far w-
So, as you can imagine, at some point, we're gonna want to compile our own copy of rustc.
...what was I just saying.
No no, listen, hear me out. It's not as bad as you think. I was afraid too, once!
But it turns out there's a a very nice guide to rustc development available and it's easy to follow.
Okay but what possible reason could you have for doing this? Are we going to fix the bug in that article?
Oh no, not at all. But we're going to learn more about it.
See, the release build of the stable Rust compiler may have nightly features
hidden behind RUSTC_BOOTSTRAP=1
, but there's things it doesn't have.
Let's take tracing: say we want more detail than info
: maybe we want debug
spans and events. If we try it on stable, we get this:
$ RUSTC_LOG=debug rustc src/main.rs --emit=metadata --cfg 'depth = "1"' warning: some trace filter directives would enable traces that are disabled statically | `debug` would enable the DEBUG level for all targets = note: the static max level is `info` = help: to enable DEBUG logging, remove the `max_level_info` feature (cut)
Oh, it.. wait, tracing
lets you strip certain levels statically? Like, at
build time?
Yes it does!. So the debug spans and events we're looking for are not even in the stable rust binary.
Profiling with perf
Here's another thing. Say we record the execution of rustc
with the perf
profiler:
$ perf record rustc src/main.rs --emit=metadata --cfg 'depth = "8"' [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.048 MB perf.data (850 samples) ]
You have to set up a bunch of procfs variables to be able to do perf record
unprivileged, because the kernel is (rightfully) paranoid about perf_events
and kernel symbol tables by default on distributions like Ubuntu.
This is not covered here, because as you try it, perf
walks you through fixing
it.
And then we report:
$ perf report --stdio (cut) 7.26% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_infer::infer::InferCtxt>::rollback_to 5.54% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::context::TyCtxt>::intern_substs 4.31% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_infer::infer::region_constraints::RegionConstraintCollector>::make_eqregion 4.18% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::context::TyCtxt>::mk_region 3.57% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_infer::infer::region_constraints::leak_check::LeakCheck>::assign_placeholder_values 3.57% rustc rustc [.] free 2.83% rustc librustc_driver-a7bf53640eed303a.so [.] _RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087 2.71% rustc librustc_driver-a7bf53640eed303a.so [.] <alloc::rc::Rc<rustc_middle::traits::ObligationCauseCode> as core::ops::drop::Drop>::drop 2.71% rustc rustc [.] malloc 2.58% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::context::CtxtInterners>::intern_predicate 1.72% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_data_structures::graph::scc::SccsConstruction<rustc_data_structures::graph::vec_graph::VecGraph<rustc_infer::infer::region_constraints::leak_check::LeakCheckNode>, rustc_infer::infer::region_constraints::leak_check::LeakCheckScc>>::inspect_node 1.48% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_trait_selection::traits::select::SelectionContext>::rematch_impl 1.48% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_trait_selection::traits::select::SelectionContext>::vtable_impl 1.35% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::sty::Region as core::ops::deref::Deref>::deref 1.35% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::subst::SubstFolder as rustc_middle::ty::fold::TypeFolder>::fold_ty 1.23% rustc [kernel.kallsyms] [k] do_user_addr_fault
Well... because we don't have debug info, we don't see what we should be seeing, which is...
Uhh you didn't even tell perf
to record call stacks?
Ah right! Let's try that again:
# note: `-g` is the more confusing equivalent to `--call-graph fp`. we use # the longer form to avoid any misunderstandings about what `-g` truly does. # here we're using "frame pointers" to gather stack traces when we sample. $ perf record --call-graph fp rustc src/main.rs --emit=metadata --cfg 'depth = "8"' [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.112 MB perf.data (1306 samples) ] $ perf report --stdio (cut) 8.77% 0.00% rustc [unknown] [.] 0000000000000000 | ---0 | |--0.79%--<rustc_infer::infer::InferCtxt>::commit_if_ok::<(), (), rustc_trait_selection::traits::project::assemble_candidates_from_impls::{closure#0}> | |--0.67%--pthread_mutex_trylock@GLIBC_2.2.5 | |--0.56%--_RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087 | |--0.56%--free | |--0.56%--0x7fd7407eb520 | --0.56%--0x40000 <rustc_infer::infer::region_constraints::RegionConstraintCollector>::make_eqregion 6.64% 6.64% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_infer::infer::InferCtxt>::rollback_to | ---<rustc_infer::infer::InferCtxt>::rollback_to 6.19% 6.19% rustc librustc_driver-a7bf53640eed303a.so [.] <rustc_middle::ty::context::TyCtxt>::intern_substs | --5.51%--<rustc_middle::ty::context::TyCtxt>::intern_substs
There. That looks wrong.
Sure, but you're using frame pointers here. What about --call-graph dwarf
?
Sure, let's try that:
$ perf record --call-graph dwarf rustc src/main.rs --emit=metadata --cfg 'depth = "8"' [ perf record: Woken up 31 times to write data ] [ perf record: Captured and wrote 7.733 MB perf.data (955 samples) ]
(As usual, using dwarf
makes for a much larger perf.data
file since perf
just takes a snapshot of the stack and deals with symbolication later)
Running perf report --stdio
gives us a bunch of these:
addr2line: DWARF error: invalid or unhandled FORM value: 0x23 addr2line: DWARF error: invalid or unhandled FORM value: 0x23 (this repeats for a long while)
And then:
98.53% 0.00% rustc [unknown] [k] 0xffffffffffffffff | ---0xffffffffffffffff | |--43.14%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | |--32.77%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | |--17.07%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | | | |--15.71%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | | | | | |--11.94%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | | | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | | | | | | | |--6.18%--rustc_trait_selection::traits::project::poly_project_and_unify_type | | | | | | rustc_trait_selection::traits::project::project_and_unify_type | | | | | | _RNvNtNtCs81r3Jv5jMWm_21rustc_trait_selection6traits7project29opt_normalize_projection_type.llvm.158142954743480087 | | | | | | | | | | | | | |--2.20%--<rustc_infer::infer::InferCtxt>::commit_if_ok::<(), (), rustc_trait_selection::traits::project::assemble_candidates_from_impls::{closure#0}> | | | | | | | | | | | | | |--1.57%--<rustc_infer::traits::project::ProjectionCache>::try_start | | | | | | | | | | | | | | | |--0.73%--__memcpy_avx_unaligned_erms (inlined) | | | | | | | | | | | | | | | --0.52%--_RINvMs5_NtCskqCQpbzSEkT_9hashbrown3rawINtB6_8RawTableTNtNtNtCs7lUMoIymc80_11rustc_infer6traits7project18ProjectionCacheKeyNtBS_20ProjectionCacheEntryEE14reserve_rehashNCINvNtB8_3map11make_hasherBQ_BQ_B1W_INtNtCs8R1TIarAOKy_4core4hash18BuildHasherDefaultNtCs5jvd1lB4BAC_10rustc_hash8FxHasherEE0EBW_.llvm.5086343079582176625 | | | | | | | | | | | | | |--0.63%--<rustc_middle::ty::sty::ProjectionTy>::trait_ref | | | | | | | | | | | | | --0.63%--<alloc::vec::drain_filter::DrainFilter<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>, rustc_trait_selection::traits::project::opt_normalize_projection_type::{closure#0}> as core::ops::drop::Drop>::drop | | | | | | | | | | | --5.76%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | | | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | | | | | | | --5.65%--<rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicates_recursively::<alloc::vec::Vec<rustc_infer::traits::Obligation<rustc_middle::ty::Predicate>>> | | | | | <rustc_trait_selection::traits::select::SelectionContext>::evaluate_predicate_recursively | | | | | | | | | | | |--4.29%--rustc_trait_selection::traits::project::poly_project_and_unify_type | | | | | | | | | | | | | |--2.93%--<&rustc_middle::ty::list::List<rustc_middle::ty::subst::GenericArg> as rustc_middle::ty::fold::TypeFoldable>::try_fold_with::<rustc_middle::ty::fold::BoundVarReplacer> | | | | | | | | | | | | | | | |--2.41%--<rustc_middle::ty::fold::BoundVarReplacer as rustc_middle::ty::fold::TypeFolder>::fold_region | | | | | | | | | | | | | | | | | --2.30%--_RNCINvMs5_NtNtCs3KapO9mZVGx_12rustc_middle2ty4foldNtNtBa_7context6TyCtxt18replace_bound_varsNtBa_14TraitPredicateNCINvMs_NtNtCs7lUMoIymc80_11rustc_i nfer5infer13higher_rankedNtB1X_9InferCtxt36replace_bound_vars_with_placeholdersB1s_E0NCB1P_s_0NCB1P_s0_0E0Cs81r3Jv5jMWm_21rustc_trait_selection.llvm.15196016308777180469 (inlined) | | | | | | | | | | | | | | | | | --1.78%--<rustc_middle::ty::context::TyCtxt>::mk_region | | | | | | | | | | | | | | | --0.52%--<rustc_middle::ty::context::TyCtxt>::intern_substs (cut)
Oh, actually that looks okay?
Let's try making flamegraph with inferno:
$ perf script | inferno-collapse-perf | inferno-flamegraph > rustc.svg (lots and lots of DWARF errors)
Yeah huh, this is not as bad as I thought actually.
I have a feeling the DWARF errors are caused by the lack of support for newer DWARF features in Ubuntu 22.04.
Mhh maybe if you tell perf
to use another addr2line
implementation? The one
from LLVM maybe?
Yeah yeah, I've seen the perf patch for
that
too, but: it doesn't seem to be accepted by perf script
on Ubuntu 22.04, and also
llvm-addr2line
(a wrapper over llvm-symbolize
) is straight up not provided by
LLVM 14 packages from their own repository. So uh, yeah.
And the other issue is: if you look closer at the flamegraph above, you'll see that
it goes: all
, rustc
, and then [unknown]
. A bunch of the call stack is missing!
That's because perf only captures 8KB of the stack by default. You can bump that
up to 65528 bytes with --call-graph dwarf,65528
(yes that's the actual upper
limit), but that makes perf.data
even larger, and perf report
/ perf script
even slower.
Update: as it turns out, llvm-addr2line
is provided by the packages in
LLVM's APT repository, it's just named llvm-addr2line-14
for me. Still no
--addr2line=
flag in perf
for me though, so with a quick wrapper in
/tmp/addr2line-wrapper/addr2line
#!/bin/bash -eu llvm-addr2line-14 "$@"
This trick works (almost) anytime you want to trick a program into executing
another program instead of what's naturally in your $PATH
.
Let's try using the Firefox profiler to visualize our profile this time, by following the instructions carefully:
$ perf record --call-graph dwarf rustc src/main.rs --emit=metadata --cfg 'depth = "8"' $ chmod +x /tmp/addr2line-wrapper/addr2line $ PATH=/tmp/addr2line-wrapper:$PATH perf script -F +pid > test.perf
Then we just drag it onto the web app, and... there!
One extremely neat thing about Firefox profiler is you can share your profile online. Here's the profile above, have a look!
Profiling with not-perf
(nperf
)
Even though perf
is super versatile, as far as CPU sampling profilers go, I prefer
something a little more modern, like not-perf.
It produces smaller profiles, does symbolication on the fly (I stole this term
from Sentry, it just means "turning addresses into symbols", what addr2line
does), and has built-in flamegraph and chrome-tracing generation facilities.
It's not published on crates.io as far as I can tell, so you'll want to clone it
locally and run cargo install --path ./cli
or something, and then you can do this
in one terminal session:
$ nperf record --process rustc --wait
And this in another:
$ rustc src/main.rs --emit=metadata --cfg 'depth = "8"'
Our nperf session completes nicely:
$ nperf record --process rustc --wait [2022-07-13T11:38:35Z INFO nperf_core::ps] Waiting for process named 'rustc'... [2022-07-13T11:38:44Z INFO nperf_core::ps] Process 'rustc' found with PID 1438757! [2022-07-13T11:38:44Z INFO nperf_core::profiler] Opening "20220713_113844_1438757_rustc.nperf" for writing... [2022-07-13T11:38:44Z INFO nperf_core::cmd_record] Opening perf events for process with PID 1438757... [2022-07-13T11:38:44Z INFO nperf_core::profiler] Finished output file initialization [2022-07-13T11:38:44Z INFO nperf_core::cmd_record] Enabling perf events... [2022-07-13T11:38:44Z INFO nperf_core::cmd_record] Running... [2022-07-13T11:38:45Z INFO nperf_core::profiler] Collected 457 samples in total!
And now we can generate a flamegraph from it:
$ nperf flamegraph $(ls -t *.nperf | head -1) -o rustc.svg [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libm.so.6': 'a8c28af0bfefefcf69ba73f7d4582d82e01f71.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/home/amos/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/libLLVM-14-rust-1.62.0-stable.so': 'crti.o' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libc.so.6': 'c3cb85f9e55046776471fed05ec441581d1969.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libz.so.1.2.11': '2e9768303892af62d212ec9357a7f1a0824a91.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/librt.so.1': '0915c138c86b44167c7f53256fc375332fbe47.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libgcc_s.so.1': '3e9a70a55a0249b33d01e903049a7657f5c0ae.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/libdl.so.2': '63bd5ea2a58dd8b60ee92a5081f753391db642.debug' [2022-07-13T11:39:34Z WARN nperf_core::data_reader] Missing external debug symbols for '/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2': '1b0b998999c397062e1016f0c95dc0e8820117.debug'
(As usual, Ubuntu's awful track record with debug symbols means we're missing info for a bunch of system libraries, and good luck finding those!)
This time our callstacks are complete, so we can see the whole thing. At this point we already have a good candidate, but I'm going to keep showing tools because I can.
nperf can also generate a chrome_profiler.json
file that we can load with
chrome://tracing
:
$ nperf trace-events $(ls -t *.nperf | head -1) -o chrome_profiler.json (cut: missing symbols warnings) [2022-07-13T11:43:48Z INFO nperf_core::cmd_trace_events] Profiling data frequency: 900 [2022-07-13T11:43:48Z INFO nperf_core::cmd_trace_events] Trace period for frame collapsing: 1140us
Unfortunately, it's a little buggy at the moment and the generated JSON has an empty record in first position, like so:
[,{"name": "0x0000000000000000","ph": "B","ts": 102138637276.717,"pid": 1438757,"tid": 1438757}
We can just remove that extra comma and be on our merry way.
This isn't very readable unless we zoom in a little:
If we load our profile in another tool, like Speedscope, we can get another visualization:
Even though we're in "time order" mode, we can see that the stack goes deep, very deep. Good thing rustc uses the stacker crate internally.
Profiling with callgrind
I wanted to try out another profiling tool: valgrind's own "callgrind".
valgrind is "an instrumentation framework for building dynamic analysis tools". It doesn't just detect memory corruption, memory leaks or cache misses, it also comes with a profiler.
But I couldn't. Apparently rustup shims include DWARF5 debug information, which Valgrind doesn't properly support yet:
$ valgrind --quiet --trace-children=yes --tool=callgrind rustc src/main.rs --emit=metadata -v ### unhandled dwarf2 abbrev form code 0x25 ### unhandled dwarf2 abbrev form code 0x25 ### unhandled dwarf2 abbrev form code 0x25 ### unhandled dwarf2 abbrev form code 0x23 ==1479608== Valgrind: debuginfo reader: ensure_valid failed: ==1479608== Valgrind: during call to ML_(img_get) ==1479608== Valgrind: request for range [3693702622, +4) exceeds ==1479608== Valgrind: valid image size of 15825920 for image: ==1479608== Valgrind: "/home/amos/.cargo/bin/rustc" ==1479608== ==1479608== Valgrind: debuginfo reader: Possibly corrupted debuginfo file. ==1479608== Valgrind: I can't recover. Giving up. Sorry. ==1479608==
Here's the upstream valgrind issue.
Stripping those, or using the "actual" rustc executable didn't get me much further, even with a custom build of rustc. It seems something goes wrong with LLVM at some point:
$ valgrind --quiet --trace-children=yes --tool=callgrind /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v rustc -Cllvm-args="..." with: Unknown command line argument '-generate-arange-section'. Try: 'rustc -Cllvm-args="..." with --help' rustc -Cllvm-args="..." with: Did you mean '--W=enerate-arange-section'? rustc -Cllvm-args="..." with: Unknown command line argument '-mergefunc-use-aliases'. Try: 'rustc -Cllvm-args="..." with --help' rustc -Cllvm-args="..." with: Did you mean '--W=ergefunc-use-aliases'?
And manually adding unstable rustc flags just to get it to stop passing those flags to LLVM didn't fix whatever the problem is. It just failed later:
$ valgrind --quiet --trace-children=yes --tool=callgrind /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v -Z no-generate-arange-section -Z merge-functions=disabled LLVM ERROR: Expected token before separator in datalayout string
I did make sure that rustc was loading its own copy of LLVM:
$ valgrind --quiet --trace-children=yes --tool=callgrind env LD_DEBUG=files /home/amos/.rustup/toolchains/stage1/bin/rustc src/main.rs --emit=metadata -v 2>&1 | grep -i llvm 1481228: file=libLLVM-14-rust-1.64.0-nightly.so [0]; needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/librustc_driver-dc663798d2165e22.so [0] 1481228: file=libLLVM-14-rust-1.64.0-nightly.so [0]; generating link map 1481228: file=librt.so.1 [0]; needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0] 1481228: file=libdl.so.2 [0]; needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0] 1481228: file=libpthread.so.0 [0]; needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0] 1481228: file=libz.so.1 [0]; needed by /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0] 1481228: calling init: /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so rustc -Cllvm-args="..." with: Unknown command line argument '-generate-arange-section'. Try: 'rustc -Cllvm-args="..." with --help' rustc -Cllvm-args="..." with: Did you mean '--W=enerate-arange-section'? rustc -Cllvm-args="..." with: Unknown command line argument '-mergefunc-use-aliases'. Try: 'rustc -Cllvm-args="..." with --help' rustc -Cllvm-args="..." with: Did you mean '--W=ergefunc-use-aliases'? 1481228: calling fini: /home/amos/work/rust/build/x86_64-unknown-linux-gnu/stage1/bin/../lib/../lib/libLLVM-14-rust-1.64.0-nightly.so [0]
So I have no idea what's going on here.
Next!
Self-profiling deeper with jq
So we've seen two ways to generate flamegraphs and chrome-tracing profiles.
One through rustc self-profiling (-Z self-profile
), and one through profilers
like perf
, nperf
, or, whenever I can figure out how to fix it, valgrind
.
The problem with CPU profilers is that we don't know what arguments are being
passed to those functions. We noticed we were spending a lot of time in
evaluate_predicate_recursively
, but we don't know what predicate is being
evaluated recursively.
And the problem with self-profiling (and tracing) is that... it depends on how much code has been instrumented.
That's where building our own rustc comes in. I followed the dev guide and reached a point where I could build my own version of the compiler with a single command!
Here's a "hot" build: that's me changing code in a single rustc
crate, not a
full, from-scratch build. It's not even a stage2
build, just stage1
:
$ ./x.py build library Building rustbuild Compiling lzma-sys v0.1.17 Compiling xz2 v0.1.6 Compiling bootstrap v0.0.0 (/home/amos/work/rust/src/bootstrap) Finished dev [unoptimized] target(s) in 5.16s Building stage0 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu) Finished release [optimized] target(s) in 0.16s Copying stage0 std from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu) Building stage0 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu) Compiling rustc_llvm v0.0.0 (/home/amos/work/rust/compiler/rustc_llvm) Compiling rustc_log v0.0.0 (/home/amos/work/rust/compiler/rustc_log) Compiling rustc_trait_selection v0.0.0 (/home/amos/work/rust/compiler/rustc_trait_selection) Compiling rustc_lint v0.0.0 (/home/amos/work/rust/compiler/rustc_lint) Compiling rustc_ty_utils v0.0.0 (/home/amos/work/rust/compiler/rustc_ty_utils) Compiling rustc_const_eval v0.0.0 (/home/amos/work/rust/compiler/rustc_const_eval) Compiling rustc_traits v0.0.0 (/home/amos/work/rust/compiler/rustc_traits) Compiling rustc_mir_build v0.0.0 (/home/amos/work/rust/compiler/rustc_mir_build) Compiling rustc_typeck v0.0.0 (/home/amos/work/rust/compiler/rustc_typeck) Compiling rustc_plugin_impl v0.0.0 (/home/amos/work/rust/compiler/rustc_plugin_impl) Compiling rustc_mir_transform v0.0.0 (/home/amos/work/rust/compiler/rustc_mir_transform) Compiling rustc_borrowck v0.0.0 (/home/amos/work/rust/compiler/rustc_borrowck) Compiling rustc_privacy v0.0.0 (/home/amos/work/rust/compiler/rustc_privacy) Compiling rustc_codegen_llvm v0.0.0 (/home/amos/work/rust/compiler/rustc_codegen_llvm) Compiling rustc_interface v0.0.0 (/home/amos/work/rust/compiler/rustc_interface) Compiling rustc_driver v0.0.0 (/home/amos/work/rust/compiler/rustc_driver) Compiling rustc_smir v0.0.0 (/home/amos/work/rust/compiler/rustc_smir) Compiling rustc-main v0.0.0 (/home/amos/work/rust/compiler/rustc) Finished release [optimized + debuginfo] target(s) in 33.73s Copying stage0 rustc from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu) Assembling stage1 compiler (x86_64-unknown-linux-gnu) Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu) Compiling compiler_builtins v0.1.73 Compiling core v0.0.0 (/home/amos/work/rust/library/core) Compiling libc v0.2.126 Compiling cc v1.0.69 Compiling memchr v2.4.1 Compiling std v0.0.0 (/home/amos/work/rust/library/std) Compiling unwind v0.0.0 (/home/amos/work/rust/library/unwind) Compiling rustc-std-workspace-core v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-core) Compiling alloc v0.0.0 (/home/amos/work/rust/library/alloc) Compiling cfg-if v0.1.10 Compiling adler v0.2.3 Compiling rustc-demangle v0.1.21 Compiling rustc-std-workspace-alloc v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-alloc) Compiling panic_unwind v0.0.0 (/home/amos/work/rust/library/panic_unwind) Compiling panic_abort v0.0.0 (/home/amos/work/rust/library/panic_abort) Compiling gimli v0.25.0 Compiling miniz_oxide v0.4.0 Compiling hashbrown v0.12.0 Compiling std_detect v0.1.5 (/home/amos/work/rust/library/stdarch/crates/std_detect) Compiling object v0.26.2 Compiling addr2line v0.16.0 Compiling rustc-std-workspace-std v1.99.0 (/home/amos/work/rust/library/rustc-std-workspace-std) Compiling proc_macro v0.0.0 (/home/amos/work/rust/library/proc_macro) Compiling unicode-width v0.1.8 Compiling getopts v0.2.21 Compiling test v0.0.0 (/home/amos/work/rust/library/test) Finished release [optimized] target(s) in 41.80s Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu) Build completed successfully in 0:01:21
It "only" took 1 minutes and 20 seconds. Scare quotes because that's not quite the "tight feedback loop" I'm dreaming of, but also, going into rustc development, I expected much, much worse. (But also, big CPU on this machine, etc. I'm not sure how much it actually helps since there's a bunch of discrete steps).
Oh, also, by following more instructions in the dev guide, that build is available through rustup as the "stage1" toolchain.
Compare stable, and my build:
$ rustc --version rustc 1.62.0 (a8314ef7d 2022-06-27) $ rustc +stage1 --version rustc 1.64.0-dev
Anyway, if we do a quick search in rustc's code, we find this function:
// in `rust/compiler/rustc_trait_selection/src/traits/select/mod.rs` fn evaluate_predicate_recursively<'o>( &mut self, previous_stack: TraitObligationStackList<'o, 'tcx>, obligation: PredicateObligation<'tcx>, ) -> Result<EvaluationResult, OverflowError> { // (cut) }
And, well, we can just tell the self-profiler to record this activity, with an argument:
// in `rust/compiler/rustc_trait_selection/src/traits/select/mod.rs` fn evaluate_predicate_recursively<'o>( &mut self, previous_stack: TraitObligationStackList<'o, 'tcx>, obligation: PredicateObligation<'tcx>, ) -> Result<EvaluationResult, OverflowError> { let _prof_timer = self.infcx.tcx.prof.generic_activity_with_arg_recorder( "evaluate_predicate_recursively", |recorder| { recorder.record_arg(format!("{:?}", obligation)); }, ); // (cut) }
Now we rebuild with ./x.py build library
... and run it again with -Z self-profile
, making sure to record arguments:
$ rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "7"' -Z self-profile -Z self-profile-events=default,args
And instead of opening that up in chrome://tracing
, we can do some digging
ourselves with the generated JSON:
$ crox $(ls -t *profdata | tail -1) $ jq -r '.[] | select(.name == "evaluate_predicate_recursively") | .args.arg0' chrome_profiler.json | sed -E 's/depth=[0-9]+/depth=?/' | sort -n | uniq -c | sort -n 1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1 Obligation(predicate=Binder(TraitPredicate(<&&&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?) 1 Obligation(predicate=Binder(TraitPredicate(<() as std::process::Termination>, polarity:Positive), []), depth=?) 1 Obligation(predicate=Binder(TraitPredicate(<&&&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6 Obligation(predicate=Binder(TraitPredicate(<&&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?) 6 Obligation(predicate=Binder(TraitPredicate(<&&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 26 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 26 Obligation(predicate=Binder(TraitPredicate(<&&&&() as std::marker::Sized>, polarity:Positive), []), depth=?) 26 Obligation(predicate=Binder(TraitPredicate(<&&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 106 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 106 Obligation(predicate=Binder(TraitPredicate(<&&&() as std::marker::Sized>, polarity:Positive), []), depth=?) 106 Obligation(predicate=Binder(TraitPredicate(<&&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 426 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 426 Obligation(predicate=Binder(TraitPredicate(<&&() as std::marker::Sized>, polarity:Positive), []), depth=?) 426 Obligation(predicate=Binder(TraitPredicate(<&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1706 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 1706 Obligation(predicate=Binder(TraitPredicate(<&() as std::marker::Sized>, polarity:Positive), []), depth=?) 1706 Obligation(predicate=Binder(TraitPredicate(<&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6826 Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6826 Obligation(predicate=Binder(TraitPredicate(<() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=?) 6828 Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=?)
And I don't know how you feel about it but... this is such a cool
visualization of the problem. We can see that the top-level obligations (on
&&&&&&()
) are only evaluated once, the next level are evaluated 6 times (1 for
Trait
, 4 associated types, and 1 for the implicit Sized
trait).
Yeah yeah that's cool and all but... don't we have debug
-level traces now that
we have our own rustc build?
Oh! Yes we do!
Tracing deeper with debug spans
The problem with just exporting RUSTC_LOG=debug
is that we're gonna get a
lot of spans. Even with depth = "1"
:
$ RUSTC_LOG="debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "1"' 2>&1 | wc -l 6864
But we already know which span we want!
In fact, I didn't show you, I willingly withheld that information from you, but
evaluate_predicate_recursively
is actually already instrumented!
#[instrument( level = "debug", skip(self, previous_stack), fields(previous_stack = ?previous_stack.head()) )] fn evaluate_predicate_recursively<'o>( &mut self, previous_stack: TraitObligationStackList<'o, 'tcx>, obligation: PredicateObligation<'tcx>, ) -> Result<EvaluationResult, OverflowError> { // (cut) }
So instead of using the self-profiling facilities, we can just use tracing.
Like so:
$ RUSTC_LOG="rustc_trait_selection::traits::select[evaluate_predicate_recursively]=debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "1"' ârustc_trait_selection::traits::select::evaluate_predicate_recursively obligation=Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=0), previous_stack=None ââârustc_trait_selection::traits::select::evaluate_trait_predicate_recursively obligation=Obligation(predicate=Binder(TraitPredicate(<() as std::marker::Sized>, polarity:Positive), []), depth=0) (cut) âââ ââ0ms DEBUG rustc_trait_selection::traits::select finished: Ok(EvaluatedToOk) from Obligation(predicate=Binder(TraitPredicate(<() as std::process::Termination>, polarity:Positive), []), depth=0) â
But even with just that, we have 1427 lines to go through. And that's just at depth 1. At depth 2 we have 7559 lines, depth 3 gives us 38313 lines... we are looking into an exponential problem after all.
Wouldn't it be convenient if there was a way to visualize all that from the comfort of a web browser?
Say... what if folks working on backend web services had all agreed on a standard, years in the making, to send over spans and events and stuff?
Oh. Oh no. Amos what are you doi-
Enter OpenTelemetry, a "collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your softwareâs performance and behavior".
That seems perfect. And what do you know, there's a tracing-opentelemetry adapter!
Now all we need is something self-hosted to receive those traces and... I guess jaeger could do.
Let's look at how rustc sets up logging:
// in `rust/compiler/rustc_log/src/lib.rs` // (most functions and imports cut) /// In contrast to `init_rustc_env_logger` this allows you to choose an env var /// other than `RUSTC_LOG`. pub fn init_env_logger(env: &str) -> Result<(), Error> { let filter = match env::var(env) { Ok(env) => EnvFilter::new(env), _ => EnvFilter::default().add_directive(Directive::from(LevelFilter::WARN)), }; let color_logs = match env::var(String::from(env) + "_COLOR") { Ok(value) => match value.as_ref() { "always" => true, "never" => false, "auto" => stderr_isatty(), _ => return Err(Error::InvalidColorValue(value)), }, Err(VarError::NotPresent) => stderr_isatty(), Err(VarError::NotUnicode(_value)) => return Err(Error::NonUnicodeColorValue), }; let verbose_entry_exit = match env::var_os(String::from(env) + "_ENTRY_EXIT") { None => false, Some(v) => &v != "0", }; let layer = tracing_tree::HierarchicalLayer::default() .with_writer(io::stderr) .with_indent_lines(true) .with_ansi(color_logs) .with_targets(true) .with_verbose_exit(verbose_entry_exit) .with_verbose_entry(verbose_entry_exit) .with_indent_amount(2); #[cfg(parallel_compiler)] let layer = layer.with_thread_ids(true).with_thread_names(true); let subscriber = tracing_subscriber::Registry::default().with(filter).with(layer); tracing::subscriber::set_global_default(subscriber).unwrap(); Ok(()) }
Alright, seem straight-forward enough. So now we just-
Amos there is no way this works. Are you going to pull tokio in there too? Async rustc? Is that the sort of chaos you're after?
Well, maybe, let's find out. So first we pull in the dependencies we need:
# in `rust/compiler/rustc_log/Cargo.toml` [dependencies] atty = "0.2" tracing = "0.1.28" tracing-subscriber = { version = "0.3.3", default-features = false, features = ["fmt", "env-filter", "smallvec", "parking_lot", "ansi"] } tracing-tree = "0.2.0" # đ new tracing-opentelemetry = { version = "0.17.4" } opentelemetry-jaeger = { version = "0.16.0", features = [ "reqwest_blocking_collector_client", ] }
And then we send our traces to jaeger instead if the RUSTC_JAEGER
environment
variable is set:
// in `rust/compiler/rustc_log/src/lib.rs` pub fn init_env_logger(env: &str) -> Result<(), Error> { let filter = match env::var(env) { Ok(env) => EnvFilter::new(env), _ => EnvFilter::default().add_directive(Directive::from(LevelFilter::WARN)), }; if env::var("RUSTC_JAEGER").is_ok() { let tracer = opentelemetry_jaeger::new_pipeline() .with_service_name("rustc") .with_collector_endpoint("http://localhost:14268/api/traces") .install_simple().unwrap(); let telemetry = tracing_opentelemetry::layer().with_tracer(tracer); let subscriber = tracing_subscriber::Registry::default().with(filter).with(telemetry); tracing::subscriber::set_global_default(subscriber).unwrap(); } else { let color_logs = match env::var(String::from(env) + "_COLOR") { Ok(value) => match value.as_ref() { "always" => true, "never" => false, "auto" => stderr_isatty(), _ => return Err(Error::InvalidColorValue(value)), }, Err(VarError::NotPresent) => stderr_isatty(), Err(VarError::NotUnicode(_value)) => return Err(Error::NonUnicodeColorValue), }; let verbose_entry_exit = match env::var_os(String::from(env) + "_ENTRY_EXIT") { None => false, Some(v) => &v != "0", }; let layer = tracing_tree::HierarchicalLayer::default() .with_writer(io::stderr) .with_indent_lines(true) .with_ansi(color_logs) .with_targets(true) .with_verbose_exit(verbose_entry_exit) .with_verbose_entry(verbose_entry_exit) .with_indent_amount(2); #[cfg(parallel_compiler)] let layer = layer.with_thread_ids(true).with_thread_names(true); let subscriber = tracing_subscriber::Registry::default().with(filter).with(layer); tracing::subscriber::set_global_default(subscriber).unwrap(); } Ok(()) }
Now we can run jaeger's "all-in-one" docker image in a terminal session somewhere...
$ docker kill jaeger ; docker rm jaeger; docker run -d --name jaeger -e SPAN_STORAGE_TYPE=badger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -e COLLECTOR_OTLP_ENABLED=true -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 4317:4317 -p 4318:4318 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:1.36 c69025488ee2384bda5263f10954eb0830d0f41e045dad5b76da5143e6d2e948
It's now running in the background, and the web UI is available at http://localhost:16686/.
Time to send some traces to it!
$ RUSTC_JAEGER=1 RUSTC_LOG="rustc_trait_selection::traits::select[evaluate_predicate_recursively]=debug" rustc +stage1 src/main.rs --emit=metadata --cfg 'depth = "4"'
By picking "rustc" from the service dropdown and hitting search, we can see a few traces:
See that dot taking 200ms in the latency graph? Let's click on that... and then collapse a few spans... and voila:
So! There's 4076 total spans, a depth of 25, and we can see exactly what happens, when, and because of what.
We can expand or collapse individual spans, along with their "tags" (which tracing calls attributes, if I recall correctly). Among them, the arguments to our instrumented functions!
So for example the top-level predicate is about what we'd expect:
Here is it manually formatted again:
Obligation( predicate=Binder( ProjectionPredicate( ProjectionTy { substs: [&&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(()) ), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))] ), depth=0 )
Then we move onto poly_project_and_unify_type
, which is relatively quick, and
then evaluate_predicates_recursively
(predicates plural), with these:
[ Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2), Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:6 ~ main[b0d6]::Trait::B) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2), Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:7 ~ main[b0d6]::Trait::C) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2), Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:8 ~ main[b0d6]::Trait::D) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2), Obligation(predicate=Binder(TraitPredicate(<&&() as std::marker::Sized>, polarity:Positive), []), depth=2), Obligation(predicate=Binder(TraitPredicate(<&&() as Trait<'x>>, polarity:Positive), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2) ]
If we keep digging, we find another set of poly_project_etc
and
evaluate_predicate_recursively
, with:
Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=2)
Then another, with:
Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [&(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=4)
Then another, with:
Obligation(predicate=Binder(ProjectionPredicate(ProjectionTy { substs: [(), ReLateBound(DebruijnIndex(0), BoundRegion { var: 0, kind: BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x) })], item_def_id: DefId(0:5 ~ main[b0d6]::Trait::A) }, Ty(())), [Region(BrNamed(DefId(0:12 ~ main[b0d6]::{impl#0}::'x), 'x))]), depth=6)
(Notice the number of &
decreases, and the depth
increases).
And that one is trivially evaluated with no further nesting.
Exploring the rest of the graph shows much of the same. To evaluate any
constraint of the form for<'x> <&&() as Trait<'x>>::A = ()
, we must first
evaluate all of these:
for<'x> &(): Trait<'x>
for<'x> <&() as Trait<'x>>::A = ()
for<'x> <&() as Trait<'x>>::B = ()
for<'x> <&() as Trait<'x>>::C = ()
for<'x> <&() as Trait<'x>>::D = ()
for<'x> &(): Sized
And for the first 5 of these, we must evaluate all of these for every deeper level. And somewhere therein lies the bug: the result of those "inner" evaluations should be cached and we should never recurse that often, but we do.
What now?
Well, now I've run out of knowledge! I'm monitoring rust issue 99188 with interest, and might keep digging to see if I can somehow figure out why these aren't getting cached.
Maybe this little exploration will help solve the actual issue, maybe not! Maybe it'll give someone a taste for diving into rustc internals.
Anyway, that's all I got for you today. As always, thanks for watching, and don't forget to like comment and sub-
Oooh you need a nap.
Yes I do.
Thanks to my sponsors: Foxie Solutions, Julian Schmid, Alejandro Angulo, Marcus Brito, Brandon Piña, MatÄj Volf, Ernest French, Zoran Zaric, Cass, Chris Biscardi, Valentin Mariette, Alex Doroshenko, Raine Tingley, Evan Relf, Tom Forbes, James Brown, anichno, Leo Shimonaka, Scott Steele, Geoffroy Couprie and 227 more
If you liked what you saw, please support my work!
Here's another article just for you:
I use the draw.io desktop app to
make diagrams for my website. I run it on an actual desktop, like Windows or
macOS, but the asset pipeline that converts .drawio
files, to .pdf
, to
.svg
, and then to .svg
again (but smaller) runs on Linux.
So I have a Rust program somewhere that opens headless chromium, and loads just the HTML/JS/CSS part of draw.io I need to render my diagrams, and then use Chromium's "print to PDF" functionality to save a PDF.