May 02, 2022 29 min #ddos · #cloudflare · #postmortem · #rust

I won free load testing

👋 This page was last updated ~3 years ago. Just so you know.

Thanks to my sponsors: Benjamin Röjder Delnavaz, Mario Fleischhacker, xales, Ahmad Alhashemi, Mark, Justy, Marie Janssen, Beat Scherrer, Beth Rennie, Victor Song, jer, Marty Penner, Nyefan, Andronik, Ian McLinden, Laine Taffin Altman, Evan Relf, Mattia Valzelli, budrick, Daniel Papp and 266 more

Long story short: a couple of my articles got really popular on a bunch of sites, and someone, somewhere, went “well, let’s see how much traffic that smart-ass can handle”, and suddenly I was on the receiving end of a couple DDoS attacks.

It really doesn’t matter what the articles were about — the attack is certainly not representative of how folks on either side of any number of debates generally behave.

My assumption is that it’s a small group (maybe a Discord?) with a botnet, who wanted to have fun.

And, friends: fun was had.

The main attack

My main site (“the blog”) received about 34M requests over 72h - in three spikes. It’s behind Cloudflare at the time of this writing, so, here’s a pretty graph (the granularity / bucket size is 1 hour):

8 million requests spike the morning of saturday the 30th, and an 11 million requests spike in the evening.

To give you an idea of scale, the article that “blew up” on multiple news aggregators only got around 130K hits. This is what organic traffic looks like (with a long tail and everything):

a 9 thousand requests spike on the afternoon of friday the 29th, that slowly withers down to 500 requests over the next two days

In some ways, that attack was unsophisticated: it hit a single route, / (the front page), and two thirds of the requests used a single user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36

Which is just the user-agent of Chrome’s current version on Linux 64-bit. The traffic doesn’t look like something like headless Chrome was used, (otherwise there’d be a lot more requests for assets like fonts, stylesheets, images etc.), but it could’ve been.

On the other hand, the attack was pretty well distributed:

4 million requests from China, 3.5 million from India, 3 million from Brazil, 2M from the US, then Indonesia, the Philippines, Tor, Hong Kong, the UK, Thailand, everything else is under 800K requests

Here are the top 15 AS the traffic came from:

AS4134: China Telecom (China, backbone)
AS61317: Heficed (cloud provider)
AS398465: Rackdog (cloud provider)
AS18209: Actcorp (India, telecom)
AS14061: DigitalOcean (cloud provider)
AS17639: Converge ICT (Philippines, telecom)
AS9829: Bharat Sanchar Nigam Limited (India, backbone)
AS7713: Telkom Indonesia (Indonesia, telecom)
AS208294: CIA Triad LLC
- IP prefix is in the US
- Phone number has Australian prefix (+61)?
- Associated with Zwiebelfreunde who provide Tor exit nodes
AS8452: TE-AS (Egypt, telecom)
AS205100: F3 Netze, (Germany, associated with Freifunk Franken)
AS9299: PLDT (Philippines, telecom)
AS36947: Algérie Telecom (Algeria, telecom)
AS141995: Contabo Asia (Singapore ops of a German cloud provider)
AS27699: Telefonica (Brazil, telecom)

So, you know. The flip side of compute and bandwidth becoming a very affordable commodity worldwide is that… it’s also affordable for bad actors.

Also, I’m sure Heficed, Rackdog, DigitalOcean and Contabo all say in their Terms of Service “don’t use us to DDoS someone else”, but it appears you can definitely do it for a few hours without them noticing (whether it was through compromised instances or not).

The secondary attack

The other target was my recently-launched video platform, which is a separate service, and runs on fly.io.

Here’s that attack as seen from fly.io metrics (that I see as a user):

a data transfer graph, showing a peak at 3.1GB/s

Here is the same attack as seen from Honeycomb, where data is self-reported (from multiple instances of my app) and sampled (not all traces are sent to Honeycomb):

a honeycomb heatmap, showing request duration shooting up to 11 minutes!

Why yes, I do see something that stands out.

Let’s use BubbleUp on that whole area there:

the honeycomb bubbleup feature, showing that 100% of requests in my selection are for the same path, for the I’m in ur address space video, a file named av1_2160p.mp4

Single biggest asset (4K@60 video asset for a 50 minute video), makes sense.

They even left me a note! How thoughtful.

(That’s a reference to one of the articles that got popular this week.)

I don’t track which ranges were requested, but they did do some range requests, and some full requests. Not sure if that was random or just two individuals going at it:

That attack was a lot less distributed: it came from a bunch of Vultr IPs, and it stopped after banning the whole AS (just for my app).

How effective were the attacks?

To evaluate whether or not the attacks were successful, we need to discuss why people on the internet perform them.

Arguably, the only reason is “for the lulz” (for laughs), but the lulz has multiple tiers: the primary goal is to “deny service” - to overload the server(s) so that some content is not accessible anymore.

Then there’s secondary goals: because providers typically bill for bandwidth, if it costs the target some money, that’s even more fun. And if said providers decide that actually, they don’t want a customer who’s getting targeted like that, and boot them off their platform, then that’s maximum fun.

And then of course there’s the power trip, the idea that you have control over someone else, the ability to “punish” them for something they did. Just like any other form of harassment really.

So, were those goals met?

The primary goal definitely was: the site was inaccessible for a few hours over the course of Saturday, April 30th. As far as HTTP status codes go, I’m counting:

11M “499 Client Closed Request”
6.5M “503 Service Unavailable”
6M “403 Forbidden”
4M “200 OK”
3M “524 Origin Timeout”
2M “522 Origin Connection time-out”
1.5M “429 Too Many Requests”
500K “520 Origin Error”
100K “521 Origin Down”

A few people reached out to let me know about it, bummed they couldn’t read the article for the time being. But I mean, as long as The Internet Archive is up, is anything really down? News aggregator users reflexively save articles that become somewhat popular, expecting them to go down, or in case they need receipts.

As for the secondary objectives: none of them were met.

It didn’t cost me a single dollar:

Hetzner doesn’t bill for bandwidth, it’s a fixed monthly price kind of deal
Cloudflare eats bigger attacks for breakfast, and they don’t charge for it either.

As for fly.io, well, I work there, so, they pay me.

It didn’t get me in trouble either. On the contrary, every time something like that happens, I make new friends and hear back from old friends.

And honestly? Everyone brought popcorn and took notes. An attack like that is both entertaining and informative. It kinda gave me a kick in the butt to address some issues with my website. And it gave everyone ideas on how to better protect against this kind of thing.

Just like this article right here is good intel for anyone who is looking to DDoS me. But for me, sharing that info is part of the fun, and I don’t really believe you can really achieve resiliency through obscurity.

Why the attack worked

A service “going down” can mean many things: being temporarily suspended by a cloud provider is one failure mode, for example. Having resource utilization increase so much that a machine is constantly swapping and everything is 1000x slower is another.

Here, there was a single point of failure: my origin, a single Hetzner dedicated server, running (some guessed it) some Rust code.

It’s not exactly a secret: I outlined how my website is built back in 2020, and although I’ve made multiple incremental improvements since, it still works essentially the same way.

Before I wrote my own server, I was using static site generators (nanoc, Hugo, etc.), and deploying the result directly to an S3 bucket, configured for static site serving, behind Cloudflare. That was an easy, reliable setup.

My site wouldn’t have gone down if I was still using that setup. However, I would also be looking at a rather large AWS bill right about now. (And I’d rather talk to AWS friends about anything other than billing).

Wait, a large AWS bill?

I thought you just said it was behind Cloudflare?

I know, I was surprised too. I’m fairly sure it wasn’t always that way, but if you consult the Cloudflare docs, it clearly says:

Cloudflare only caches based on file extension and not by MIME type. The Cloudflare CDN does not cache HTML by default. Additionally, Cloudflare caches a website’s robots.txt.

And then, further down:

To cache additional content, see Page Rules to create a rule to cache everything.

Over the years, Cloudflare has saved me a lot of bandwidth costs (well, it would’ve if I was paying for it), but “only” for images, CSS, JS, etc.

Not video, see Section 2.8 of their Self-Serve Subscription Agreement:

2.8 Limitation on Serving Non-HTML Content

The Services are offered primarily as a platform to cache and serve web pages and websites. Unless explicitly included as part of a Paid Service purchased by you, you agree to use the Services solely for the purpose of (i) serving web pages as viewed through a web browser or other functionally equivalent applications, including rendering Hypertext Markup Language (HTML) or other functional equivalents, and (ii) serving web APIs subject to the restrictions set forth in this Section 2.8.

Use of the Services for serving video or a disproportionate percentage of pictures, audio files, or other non-HTML content is prohibited, unless purchased separately as part of a Paid Service or expressly allowed under our Supplemental Terms for a specific Service. If we determine you have breached this Section 2.8, we may immediately suspend or restrict your use of the Services, or limit End User access to certain of your resources through the Services.

(So, pissing off a kid with a botnet will not get you booted off of Cloudflare, but building a video platform on top of it will. They want you to use their product for that)

And, as it turns out, not for HTML.

a graph showing cache status for requests over the past 72 hours there’s large orange spikes, and larger still grey spikes

Both top colors there are uncached requests: the orange spikes are the requests my origin did manage to serve, and the grey ones are the ones Cloudflare generated for me when the origin struggled too much (or when I stopped it altogether).

The first thing I tried to do was add a cache-control: public, max-age=120 response header. Still, the cache status was “dynamic”, so every request went to the origin.

I already had the cache-control header set for static assets (images, stylesheets, scripts), just not for HTML, because in two years of repeatedly hitting the front page of various sites, it was never a performance concern.

Creating a page rule didn’t immediately fix everything (but I’m convinced I was holding it wrong, since that’s the thing they want you to do), but even if it had, at least on the Free and Pro plans, you can only match by URL: so, this would break the website for any logged-in users (who support me financially, and have access to some articles in advance).

Cool Bear's hot tip

On the Business and Pro plans, there’s a “Bypass Cache on Cookie” feature, which even lets you specify regex patterns, like PHPSESSID=.*.

However, it wouldn’t take a very motivated attacker to figure out that they can hammer your origin with a well-formed cookie: there’s no way for Cloudflare’s edge to know whether the session cookie is actually valid or not, so it only helps for legitimate requests, not for attacks.

Long story short: without upgrading to the next plan over ($200/month), I can’t use Cloudflare to cache HTML in a way that doesn’t break my site. And that wouldn’t help protect against attacks.

Meanwhile, back in my origin server…

$ sudo ss | wc -l
352950

Oh, 350K connections.

That’s a lot of connections

That’s certainly more connections than there should EVER be between Cloudflare and my server. But Cloudflare expects origin servers to signal when they’re struggling.

Origins should return 429, or 503, or start refusing connections; they should do something, anything other than just accept a ridiculous number of concurrent connections and let them just sit there — that’s just a bad deal for everybody.

And my site didn’t do that! Because I half-assed that bit, and it worked for two years straight anyway. Almost as if… it was possible to move fast, delivering value, leaving some problems for later, even with Rust…

Amos…

Right, right, sorry.

So: the lack of caching wasn’t actually that surprising to me. I do consider my HTML content dynamic, just like Cloudflare does: it is different for logged-in users, there’s random bits (“what to read/watch next”), there’s a full-text search engine (really just SQLite’s, nothing fancy there yet).

However, I did expect Cloudflare’s DDoS protection, their #1 selling point, to kick in.

And it mostly didn’t.

Although Cloudflare did block/present a challenge for some fraction of the incoming traffic, it mostly did so when I turned on the I’m Under Attack mode, and even then, definitely not enough to let my origin serve legitimate requests again.

cloudflare security events graph, showing 10M total actions, 3.5M challenges, 2.5M legacy captchas, 2M js challenges, and 1.7M blocks

By the time the second wave happened, I was already in touch with an engineer at Cloudflare, who helped some, and let me know something interesting: the reason protection wasn’t kicking in was that the attacker was staying just below their detection threshold.

Which seems to indicate the attacker knew what they were doing. But then why use a fixed user-agent? And hit just the one endpoint?

Because that’s all it took?

Eh, fair enough.

Which brings us to another interesting point: the amount of requests is impressive in total, and very spikey, but it was still under Cloudflare’s detection threshold most of the time, which means… my server probably should’ve handled it like a champ!

And just to be clear: it didn’t, right?

Oh no, not at all. Memory usage was quite alright (hardly went past 5%), but all eight CPU cores were nearly maxed out, and, well, I couldn’t get a single 200 OK out of it during the attack, even locally.

But then again I didn’t sit there patiently for minutes waiting for my connection to be accepted.

See, at that time, my server was quite busy, doing a lot of stuff. Here’s a representative of what was going on, obtained with perf (the exact command is just sudo perf top):

a view of where the system was spending the most time, with functions from sqlite’s btree table, malloc, pthread_mutex_lock, lol_html for html rewrite, some syscall stuff, some utf-8 decoding stuff

As I mentioned, my website is mostly static content, but not just. And as I described in 2020, I’m not optimizing for “maximum server performance”, I’m optimizing for “maximum content authoring convenience” (and also nice features for readers).

And so, Rust maximalism be damned, most of my website is actually powered by liquid templates and SQL queries.

Only the very base: file watching, hashing, DB management, the HTTP stack, and some very useful custom liquid filters (to render markdown, including my custom extensions, etc.), is actually written in Rust.

The front page, for example, runs several SQL queries to retrieve the latest videos, articles, and series. It then runs several liquid filters, not to process markdown (I cache that), but to truncate the HTML body so I can show just an excerpt with a “Read more” button.

For the curious, it looks something like this:

{% include "html/prologue.html" %}

{% capture sql_start %}
    SELECT pages.*
    FROM pages
    JOIN revision_routes ON revision_routes.hapa = pages.hapa
    WHERE revision_routes.revision = ?1
{% endcapture %}

{% capture sql_end %}
    {% unless config.drafts %}
        AND NOT pages.draft
    {% endunless %}
    {% unless config.future %}
        AND datetime(pages.date) < datetime('now')
    {% endunless %}
    ORDER BY date DESC
{% endcapture %}

{% capture articles_sql %}
    {{sql_start}}
    AND revision_routes.parent_route_path = 'articles'
    {{sql_end}}
    LIMIT 3
{% endcapture %}
{% assign articles = articles_sql | query: revision %}

Later on, the page_listing partial is invoked, which does stuff like:

<div class="page-section post-list">
    {% for page in pages %}
        {% assign page_html = page | page_markup %}

        <div class="post-summary markup-container">
            {{ page_html | truncate_html: max: 120 }}
        </div>
    {% endfor %}

(A lot of code was removed - this is a very noisy template).

Doing it that way is awesome. Liquid/Rust is my Python/C. It lets me prototype a ton of stuff really quickly, without recompiling anything, and it’s absolutely been fast enough for two years of going viral.

The Rust codebase has no idea what the structure of content should be, nor does it care: all it knows is that there’s .md assets, and those can be rendered with pulldown-cmark, and some custom template markup processor, and KaTeX, and, and - those parts are built-in.

But series live entirely in the templates / SQL queries themselves: it’s all path matching (with regexps, which I add to sqlite as a Rust function).

So, at the time of the attack, the P95 (95th percentile latency) for my index page was 168ms.

That’s not fast: it’s the ping between France and Brazil.

But it’s also not slow: PageSpeed Insights seems very happy about a First Contentful Paint (FCP) of 1 second. There’s more to the performance of a website than how long it takes to generate HTML, and I’ve spent a fair bit of effort taking care of everything else.

For example, I’ve recently created font subsets for Iosevka, which cut down several megabytes per visit, even though the font assets were always WOFF2.

Less recently, I’ve added support for avif and webp for images, instead of always serving png or jpg (depending on the kind of images).

And it was never a problem whenever one of my articles got popular, since the P95 for any single article was more like 30ms (and only involved a couple of very fast SQL queries, and no HTML rewriting/truncation whatsoever).

Fixing my origin

Adding cache-control headers was a bust…

audible groan

…but I did make several other changes that should help next time, and I thought it’d be neat to show what they look like in code.

There’s a whole bunch of Rust code there, if you want to skip it, search for “After the storm”.

Yes, yes, I know, I should add anchor links for headers.

First, I added a limit to the maximum number of in-flight requests. My website uses warp as an HTTP framework (whereas my video platform uses axum), but they’re both based on hyper, which means I can use standard tower layers with it (think “middleware”).

Before in-flight requests limit:

async fn serve() {
    // (omitted: everything else)

    let addr: SocketAddr = config.address.parse()?;
    // this turns a warp `Filter` into a tower `Service`
    let svc = warp::service(all_routes.with(access_log));
    let make_svc = hyper::service::make_service_fn(|_: _| {
        // this gets called whenever a new connection is accepted: it gets its
        // own clone of our HTTP service (this mostly increases reference counts)
        async move { Ok::<_, Infallible>(svc.clone()) }
    });

    let server = hyper::Server::bind(&addr).serve(make_svc);
    server.await?;
}

After in-flight requests limit:

use tower::{limit::GlobalConcurrencyLimitLayer, ServiceBuilder}

async fn serve() {
    // (omitted: everything else)

    let addr: SocketAddr = config.address.parse()?;
    let svc = warp::service(all_routes.with(access_log));
    let limit = GlobalConcurrencyLimitLayer::new(512);
    let make_svc = hyper::service::make_service_fn(|_: _| {
        // This increases reference counts for `GlobalConcurrencyLimitLayer`'s
        // internal semaphore: the limit is
        let svc = ServiceBuilder::new().layer(limit.clone()).service(svc.clone());
        async move { Ok::<_, Infallible>(svc) }
    });
}

I deployed this between two attack waves, and it did not help at all.

It does exactly what it says on the tin, but during an attack, the abnormally high volume means requests just keep piling up, taking up resources, and utilization is still as high as… well, as it can go.

Adding a load shed layer would’ve been smarter, I just didn’t think of it at the time: it’s one more line of code, and now, if there’s more than 512 requests in-flight, the others just immediately error out.

Which feels like a bad thing at first, but once you’ve determined the maximum volume you can gracefully serve, that’s what you want. Especially when there’s an edge between you and the user-agent (in this case, Cloudflare) that’s capable of backing off, retrying, etc.

The next thing I did was limit concurrent connections: I know Cloudflare has a lot of PoPs, but 350K connections still feels excessive.

I’ve covered this in Request coalescing in async Rust before, but who’s got time for that, here’s some code:

use std::convert::Infallible;

use futures::future::{ready, Ready};
use hyper::server::conn::AddrStream;
use tokio::sync::OwnedSemaphorePermit;
use tokio_util::sync::PollSemaphore;
use tower::Service;

pub struct ServiceFactory<S> {
    pub inner: S,
    pub semaphore: PollSemaphore,
    pub permit: Option<OwnedSemaphorePermit>,
}

impl<S> Service<&AddrStream> for ServiceFactory<S>
where
    S: Clone,
{
    type Response = PermitService<S>;
    type Error = Infallible;
    type Future = Ready<Result<Self::Response, Self::Error>>;

    fn poll_ready(
        &mut self,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Result<(), Self::Error>> {
        if self.permit.is_none() {
            self.permit = Some(futures::ready!(self.semaphore.poll_acquire(cx)).unwrap());
        }
        Ok(()).into()
    }

    fn call(&mut self, _req: &AddrStream) -> Self::Future {
        let permit = self.permit.take().expect(
            "you didn't drive me to readiness did you? you know that's a tower crime right?",
        );
        ready(Ok(PermitService {
            inner: self.inner.clone(),
            _permit: permit,
        }))
    }
}

pub struct PermitService<S> {
    pub inner: S,
    pub _permit: OwnedSemaphorePermit,
}

impl<S, R> Service<R> for PermitService<S>
where
    S: Service<R>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(
        &mut self,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: R) -> Self::Future {
        self.inner.call(req)
    }
}

And it’s used like this:

    let conns_limit = Semaphore::new(128);
    let svc = ServiceBuilder::new()
        .layer(GlobalConcurrencyLimitLayer::new(512))
        .service(warp::service(all_routes.with(access_log)));

    let factory = ServiceFactory {
        inner: svc,
        semaphore: PollSemaphore::new(Arc::new(conns_limit)),
        permit: None,
    };

    let server = hyper::Server::bind(&addr).serve(factory);
    info!("listening on {}", addr);
    server.await?;

That one’s significantly more verbose, but it’s also very re-usable, that’s kind of tower’s deal. Which means it probably exists in a crate somewhere and I’m an idiot for rewriting it in every project. Or I should just publish my own crate.

But the elegance here is that… it’s a Service that takes an &AddrStream and returns a Service. That delicious meta bit can get confusing at times, but it turns out to be super convenient.

Shortly after I deployed that change, things broke: the site appeared unavailable again. The attack hadn’t resumed yet though — the available 512 connection slots had simply filled up, were idle, and Cloudflare edge nodes weren’t able to establish any more connections.

My solution was to simply enforce idle read and write timeouts: any connection that hasn’t been read from or written to in a few seconds gets reset.

Because I brought in a custom “acceptor” for that, it was also a good occasion to close another hole: my server was listening on 0.0.0.0:80. If you could guess the IP (which, there’s entire sites dedicated to that), you could hammer it directly, without proxying through Cloudflare.

That didn’t happen, but it sure could have.

Instead, we want the origin to only accept connections from Cloudflare IP ranges.

So, here’s an acceptor that does both, and updates the IP ranges every now and then:

use std::{
    collections::HashSet,
    io,
    net::{IpAddr, SocketAddr},
    pin::Pin,
    str::FromStr,
    sync::Arc,
    time::Duration,
};

use arc_swap::ArcSwap;
use futures::TryFutureExt;
use hyper::server::accept::{from_stream, Accept};
use ipnet::IpNet;
use tokio::net::TcpStream;
use tokio_io_timeout::{TimeoutReader, TimeoutWriter};
use tracing::{debug, info, warn};

pub const IPS_V4: &str = include_str!("ips-v4.txt");
pub const IPS_V6: &str = include_str!("ips-v6.txt");
pub const IPS_LOCAL: &str = "127.0.0.1/8";

pub fn timeout_acceptor(
    addr: SocketAddr,
) -> impl Accept<Conn = Pin<Box<TimeoutWriter<TimeoutReader<TcpStream>>>>, Error = io::Error> {
    let ip_nets = parse_ip_nets(&[IPS_V4, IPS_V6, IPS_LOCAL]).unwrap();
    info!("Loaded {} ip nets", ip_nets.len());
    let ip_nets = Arc::new(ArcSwap::from_pointee(ip_nets));

    tokio::spawn({
        let ip_nets = ip_nets.clone();
        async move {
            loop {
                if let Err(e) = update_ip_nets(&ip_nets).await {
                    warn!("Could not update IP nets: {e}");
                };
            }
        }
    });

    let localhost = IpAddr::from([127, 0, 0, 1]);

    from_stream(
        async move {
            let ln = tokio::net::TcpListener::bind(addr).await?;
            let stream = async_stream::stream! {
                loop {
                    let (stream, addr) = ln.accept().await?;

                    if let Some(net) = ip_nets.load()
                        .iter()
                        .find(|net| net.contains(&addr.ip()))
                    {
                        debug!("Allowing {} through net {net}", addr.ip());
                    } else {
                        debug!("Disallowing {}", addr.ip());
                        continue;
                    }

                    let should_timeout = addr.ip() != localhost;

                    let mut stream = TimeoutReader::new(stream);
                    if should_timeout {
                        stream.set_timeout(Some(Duration::from_secs(5)));
                    }
                    let mut stream = TimeoutWriter::new(stream);
                    if should_timeout {
                        stream.set_timeout(Some(Duration::from_secs(5)));
                    }
                    yield Ok(Box::pin(stream))
                }
            };
            Ok(stream)
        }
        .try_flatten_stream(),
    )
}

fn parse_ip_nets(sources: &[&str]) -> color_eyre::Result<HashSet<IpNet>> {
    let mut set: HashSet<IpNet> = Default::default();
    for input in sources {
        for line in input.lines() {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                continue;
            }
            let ip_net = IpNet::from_str(line).unwrap();
            set.insert(ip_net);
        }
    }
    Ok(set)
}

async fn update_ip_nets(ip_nets_handle: &ArcSwap<HashSet<IpNet>>) -> color_eyre::Result<()> {
    let client = reqwest::Client::new();
    tokio::time::sleep(Duration::from_secs(15)).await;

    let mut sources = vec![IPS_LOCAL.to_string()];
    for url in [
        "https://www.cloudflare.com/ips-v4",
        "https://www.cloudflare.com/ips-v6",
    ] {
        sources.push(client.get(url).send().await?.text().await?);
    }
    let sources: Vec<_> = sources.iter().map(|x| x.as_str()).collect();
    let ip_nets = parse_ip_nets(&sources[..])?;
    info!(
        "Loaded {} ip nets (had {})",
        ip_nets.len(),
        ip_nets_handle.load().len()
    );
    ip_nets_handle.store(Arc::new(ip_nets));

    Ok(())
}

Featured here: reqwest for easy async HTTP requests, arc-swap to avoid using an Arc<Mutex<T>> or Arc<RwLock<T>>, async-stream to be able to create an async Stream using generator syntax (yield etc.).

It’s pretty naive code, but I think it looks pretty neat.

A better way to only accept connection from Cloudflare IPs would be to set up firewall rules (and update them periodically). Then any disallowed connection could be simply refused (thus not taking space in the accept queue), or dropped (thus wasting the attacker’s time).

But hey, gotta leave stuff to do for later!

Also, using it is a breeze:

    let acceptor = timeout_acceptor(addr);
    // instead of `Server::bind`:
    let server = hyper::Server::builder(acceptor).serve(factory);
    info!("listening on {}", addr);
    server.await?;

Because it doesn’t return an AddrStream but instead a Pin<Box<TimeoutWriter<TimeoutReader<TcpStream>>>>…

Gesundheit.

…it broke ServiceFactory with a gnarly error message, but I’ve been doing this long enough that I knew exactly to turn:

impl<S> Service<&AddrStream> for ServiceFactory<S> {
    // etc.
}

Into:

impl<S, A> Service<&A> for ServiceFactory<S> {
    // etc.
}

So that ServiceFactory works for any acceptor (it doesn’t care what the actual IO type is).

I added some user-agent banning on my side too (in case I ever move away from Cloudflare), same tower boilerplate here:

use std::{collections::HashSet, sync::Arc};

use futures::future::{ready, Either, Ready};
use http::{Request, Response};
use tower::{Layer, Service};

// Layer
pub struct BanUserAgents {
    banned: Arc<HashSet<String>>,
}

impl BanUserAgents {
    pub fn new() -> Self {
        let mut banned = HashSet::new();
        banned.insert("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36".to_string());
        Self {
            banned: Arc::new(banned),
        }
    }
}

impl Default for BanUserAgents {
    fn default() -> Self {
        Self::new()
    }
}

impl<S> Layer<S> for BanUserAgents {
    type Service = BanUserAgentsService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        BanUserAgentsService {
            inner,
            banned: self.banned.clone(),
        }
    }
}

#[derive(Clone)]
pub struct BanUserAgentsService<S> {
    inner: S,
    banned: Arc<HashSet<String>>,
}

impl<S, B> Service<Request<B>> for BanUserAgentsService<S>
where
    S: Service<Request<B>, Response = Response<hyper::Body>>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = Either<S::Future, Ready<Result<S::Response, S::Error>>>;

    fn poll_ready(
        &mut self,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Request<B>) -> Self::Future {
        let user_agent = req
            .headers()
            .get("user-agent")
            .and_then(|h| h.to_str().ok())
            .unwrap_or_default();
        if self.banned.contains(user_agent) {
            let res = Response::builder()
                .status(403)
                .body(hyper::Body::empty())
                .unwrap();
            Either::Right(ready(Ok(res)))
        } else {
            Either::Left(self.inner.call(req))
        }
    }
}

There’s no good reason to code up an entire tower layer by hand to do this, I should’ve used something like ServiceExt::filter and it would’ve been only a couple lines.

Next up, I configured Sentry and (after checking with them that I’d be okay if the attack resumed) Honeycomb for the main site. I had both of these set up for the video platform, but never needed to look too closely to the main site up until now.

    let config = Config::read(&args.path)?;

    // this reports panics
    let _guard = sentry::init((
        SENTRY_DSN,
        sentry::ClientOptions {
            release: sentry::release_name!(),
            environment: Some(config.env.to_string().into()),
            ..Default::default()
        },
    ));

    let service_name = "futile";
    // this allows tracking spans across services through known HTTP headers
    opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new());

    // this'll error out locally unless I spin up an otlp-exporter. in
    // production, this points to an exporter I run myself, off-box.
    let otlp_endpoint = config.otlp_endpoint.as_deref().unwrap_or("localhost:4317");
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(
            opentelemetry::sdk::trace::config()
                // all traces will have the same service name
                .with_resource(Resource::new(vec![KeyValue::new(
                    opentelemetry_semantic_conventions::resource::SERVICE_NAME,
                    service_name,
                )]))
                // send 10% of traces
                .with_sampler(Sampler::TraceIdRatioBased(0.1)),
        )
        .with_exporter(
            // use GRPC to talk to the exporter
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        // send all of that asynchronously, using tokio tasks
        .install_batch(opentelemetry::runtime::Tokio)?;
    info!(%otlp_endpoint, "Setting up telemetry");

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

    // filter printed-out log statements according to the RUST_LOG environment
    // variable, default to info
    let rust_log_var = std::env::var("RUST_LOG").unwrap_or_else(|_| "info".to_string());
    let log_filter = Targets::from_str(&rust_log_var)?;
    // different filter for traces sent to honeycomb
    let trace_filter = Targets::from_str("futile=info")?;

    Registry::default()
        .with(
            tracing_subscriber::fmt::layer()
                .with_ansi(true)
                .with_filter(log_filter),
        )
        .with(
            ErrorLayer::default()
                .and_then(telemetry)
                .with_filter(trace_filter),
        )
        .init();

And started instrumenting some functions with tracing, one of my favorite things, since it’s so easy.

I re-used an IncomingHttpSpanLayer I had made for the video platform:

//! Make an http span

// TODO: deduplicate with tube

use std::{
    future::Future,
    pin::Pin,
    task::{Context, Poll},
};

use http::{Request, Response};
use hyper::Body;
use tower::{Layer, Service};
use tower_request_id::RequestId;
use tracing::{field, info_span, instrument::Instrumented, Instrument, Span};

/// Layer for [IncomingHttpSpanService]
#[derive(Default)]
pub struct IncomingHttpSpanLayer {}

impl<S> Layer<S> for IncomingHttpSpanLayer
where
    S: Service<Request<Body>> + Clone + Send + 'static,
{
    type Service = IncomingHttpSpanService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        IncomingHttpSpanService { inner }
    }
}

/// Extracts opentelemetry context from HTTP headers
#[derive(Clone)]
pub struct IncomingHttpSpanService<S>
where
    S: Service<Request<Body>> + Clone + Send + 'static,
{
    inner: S,
}

impl<S, B> Service<Request<Body>> for IncomingHttpSpanService<S>
where
    S: Service<Request<Body>, Response = Response<B>> + Clone + Send + 'static,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = PostFuture<Instrumented<S::Future>, B, S::Error>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Request<Body>) -> Self::Future {
        let user_agent = req
            .headers()
            .get("user-agent")
            .and_then(|s| s.to_str().ok())
            .unwrap_or("");

        let host = req
            .headers()
            .get("host")
            .and_then(|s| s.to_str().ok())
            .unwrap_or("");

        let sec_ch_ua_mobile = req
            .headers()
            .get("sec-ch-ua-mobile")
            .and_then(|s| s.to_str().ok())
            .unwrap_or("");

        let sec_ch_ua_platform = req
            .headers()
            .get("sec-ch-ua-platform")
            .and_then(|s| s.to_str().ok())
            .unwrap_or("");

        let span = info_span!(
            "http request",
            otel.name = %req.uri().path(),
            otel.kind = "server",
            http.method = %req.method(),
            http.url = %req.uri(),
            http.status_code = field::Empty,
            http.user_agent = &user_agent,
            http.host = &host,
            http.sec_ch_ua_mobile = &sec_ch_ua_mobile,
            http.sec_ch_ua_platform = &sec_ch_ua_platform,
            request_id = field::Empty,
            user_id = field::Empty,
        );

        if let Some(id) = req.extensions().get::<RequestId>() {
            span.record("request_id", &id.to_string().as_str());
        }

        let fut = {
            let _guard = span.enter();
            self.inner.call(req)
        };
        PostFuture {
            inner: fut.instrument(span.clone()),
            span,
        }
    }
}

pin_project_lite::pin_project! {
    /// Future that records http status code
    pub struct PostFuture<F, B, E>
    where
        F: Future<Output = Result<Response<B>, E>>,
    {
        #[pin]
        inner: F,
        span: Span,
    }
}

impl<F, B, E> Future for PostFuture<F, B, E>
where
    F: Future<Output = Result<Response<B>, E>>,
{
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.project();
        let res = futures::ready!(this.inner.poll(cx));
        if let Ok(res) = &res {
            this.span.record("http.status_code", &res.status().as_u16());
        }
        res.into()
    }
}

…so that all HTTP requests have their own info-level span, and show up in Honeycomb like so:

Hey, those latencies don’t line up with the figures you mentioned earlier.

Haha, one thing at a time.

That lets me answer questions like “what RSS readers (that aren’t browsers) is my audience using?”

I’m not sure what to do with that particular answer, but hey, now we know.

Edit: by popular demand, here’s the list including user-agents that contain “Mozilla/5.0”:

honeycomb screenshot showing just top user agents: there’s now Firefox 99, miniflux.app, Safari, Explorer (not internet explorer), etc.

Also incredibly valuable, because spans are being sent instead of simple log messages, we get these nice visualizations of where we’re actually spending time.

This is for a request to the index:

honeycomb screenshot showing where time is spent: serve_template calls liquid.render, which calls sql a bunch of times, along with page_markup and truncate_html

As I explained earlier, there’s a few DB queries involved, it’s fetching page markup (which is pre-rendered, but also needs a DB query), and then it truncates, which takes… forever.

So, immediately, I jump back to the code and find a ton of actionable information there: for example, the size of my “connection pool” is only 10 r2d2-sqlite’s default: under load, that’s not enough.

How do I know? I can see it waiting for a checkout:

And this is all it took:

            let conn = info_span!("sql.checkout").in_scope(|| self.content_pool.get())?;

I also noticed a couple other things, like:

I’m running blocking code (SQLite queries) in an async context. I should be using spawn_blocking for that. (There’s no good lint for that)
I’m not caching prepared statements. This doesn’t show up as a hotspot, but switching from prepare to prepare_cached is trivial, so why not?
I don’t have any indexes in my SQLite database! What a great freebie I kept for myself.

And finally, I’m spending way more time on truncate_html than I thought.

It is non-trivial:

impl Filter for TruncateHtmlFilter {
    #[instrument(name = "truncate_html", skip(self, input, runtime), fields(html_len))]
    fn evaluate(
        &self,
        input: &dyn ValueView,
        runtime: &dyn liquid_core::Runtime,
    ) -> liquid_core::Result<liquid_core::Value> {
        let span = Span::current();

        convert_errors(|| {
            let kinput = input.to_kstr();
            let input = kinput.as_str();
            span.record("html_len", &input.len());

            let mut output: Vec<u8> = Vec::new();
            let output_sink = |c: &[u8]| {
                output.write_all(c).unwrap();
            };

            let max: u64 = self
                .args
                .max
                .as_ref()
                .and_then(|p| p.try_evaluate(runtime))
                .and_then(|l| l.as_scalar().and_then(|s| s.to_integer().map(|x| x as u64)))
                .unwrap_or(180);

            let char_count = AtomicU64::new(0);
            let mut skip: bool = false;

            let mut rewriter = HtmlRewriter::new(
                Settings {
                    element_content_handlers: vec![
                        element!("*", |el| {
                            if char_count.load(Ordering::SeqCst) > max && el.tag_name() == "p" {
                                skip = true;
                            }

                            if skip {
                                el.remove();
                            }
                            Ok(())
                        }),
                        text!("*", |txt| {
                            if matches!(txt.text_type(), TextType::Data) {
                                char_count.fetch_add(txt.as_str().len() as u64, Ordering::SeqCst);
                            }

                            Ok(())
                        }),
                    ],
                    ..Settings::default()
                },
                output_sink,
            );

            rewriter
                .write(input.as_bytes())
                .map_err(|e| liquid_core::Error::with_msg(format!("rewriting error: {:?}", e)))?;
            drop(rewriter);

            Ok(to_value(&std::str::from_utf8(&output[..])?)?)
        })
    }
}

…but also, it feeds the whole article through, well past the initial 120 characters of text it’s trying to keep. That’s wasteful, and it’s a very predictable transform that could be cached, just like I cache rendered markdown.

So what I did next…

You fixed all of these, right?

Absolutely not! I ignored all of these, and jumped straight to implementing caching.

Latency was always “fine” when the site wasn’t under attack. Knowing about these bothers me, but they can wait. What I really needed was some form of whole-page caching.

I’ve put that off for so long, and it was so easy.

I simply plugged in moka, a “fast, concurrent cache library for Rust”, that supports async, and time-based expiry.

So, boom, build a cache:

    let server_state = ServerState {
        config: Arc::clone(&config),
        revholder,
        content_pool,
        users_pool,
        broadcast_rev,
        // this is the only new field
        rendered_templates_cache: Cache::builder()
            // TTL: 5 minutes
            .time_to_live(Duration::from_secs(5 * 60))
            // TTI: 1 minute
            .time_to_idle(Duration::from_secs(60))
            .build(),
    };

And the serve_template function becomes:

    #[instrument(skip(self, globals), fields(cache.status))]
    async fn serve_template(
        self,
        template_name: &str,
        mut globals: Object,
        content_type: &'static str,
    ) -> Result<Box<dyn Reply + 'static>, Report> {
        let span = Span::current();

        let cache_key = format!("{}?{}", self.path, self.raw_query.as_str());

        let (creds, session_cookies) =
            FutileCredentials::load_from_cookies(self.config(), &self.cookies).await;
        let has_creds = creds.is_some();

        let cache_control = if has_creds {
            "no-cache"
        } else {
            "private, max-age=120"
        };

        let mut body: Option<Bytes> = None;
        let should_cache = self.state.config.env.is_prod() && !has_creds;
        if should_cache {
            // try to hit the cache first
            if let Some(cached_body) = self.state.rendered_templates_cache.get(&cache_key) {
                body = Some(cached_body);
                span.record("source", &"hit");
            } else {
                span.record("source", &"miss");
            }
        } else {
            span.record("cache.status", &"dynamic");
        }

        let body = if let Some(body) = body {
            body
        } else {
            let rev = self.state.revholder.get();
            let template = rev
                .template_repo
                .templates
                .get(template_name)
                .ok_or_else(|| Error::TemplateNotFound(template_name.into()))?;

            // omitted: insert a ton of "globals" my templates expect, stuff
            // like url params, the config (minus secrets), a friend URL, user
            // info (if logged in), user settings (ligatures enabled, theme),
            // etc.

            let rendered = {
                let render_span = info_span!("liquid.render");
                let _guard = render_span.enter();
                template
                    .render(&globals)
                    .map_err(|e| e.context("template", template_name.to_string()))?
            };
            // oh yeah, I forgot: I have live-reload on my website. Adding an
            // idle timeout on all connections broke it, so I just disabled it
            // for the development environment.
            let rendered = inject_livereload(&self.state.config, &rendered);

            let body = Bytes::copy_from_slice(rendered.as_ref().as_bytes());
            // an earlier version of this article (and my site) inserted pages
            // rendered _with credentials_ into the cache. woops!
            if should_cache {
                self.state
                    .rendered_templates_cache
                    .insert(cache_key, body.clone())
                    .await;
            }

            body
        };

        let res = Response::builder()
            .status(StatusCode::OK)
            .header("cache-control", cache_control)
            .header("content-type", content_type)
            .apply_session_cookies(session_cookies)
            .body(body);
        let res: Box<dyn Reply + 'static> = Box::new(res);
        Ok(res)
    }

Cloning a Bytes, which happens on cache hit, simply increments a reference count, so everything is nice and cheap.

The cache-control header here is technically wrong, since I don’t actually ever want Cloudflare to cache hit, but since it already doesn’t… I’ll fix it later.

With that little change (it rendered unconditionally beforehand), all pages for logged-out users are cached for 1m-5m, depending on whether they’re being requested a lot.

And that makes the difference between this:

$ oha -z 5s http://localhost -H '(valid cookies omitted)'
Summary:
  Success rate: 1.0000
  Total:        5.0011 secs
  Slowest:      1.0898 secs
  Fastest:      0.0791 secs
  Average:      0.5314 secs
  Requests/sec: 87.9810

  Total data:   14.61 MiB
  Size/request: 34.00 KiB
  Size/sec:     2.92 MiB

Response time histogram:
  0.171 [20] |■■■■■■■
  0.263 [18] |■■■■■■
  0.355 [48] |■■■■■■■■■■■■■■■■■
  0.447 [50] |■■■■■■■■■■■■■■■■■
  0.539 [89] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.630 [82] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.722 [63] |■■■■■■■■■■■■■■■■■■■■■■
  0.814 [37] |■■■■■■■■■■■■■
  0.906 [21] |■■■■■■■
  0.998 [7]  |■■
  1.090 [5]  |■

Latency distribution:
  10% in 0.2837 secs
  25% in 0.3926 secs
  50% in 0.5350 secs
  75% in 0.6570 secs
  90% in 0.7618 secs
  95% in 0.8596 secs
  99% in 0.9995 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0006 secs, 0.0001 secs, 0.0011 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs

Status code distribution:
  [200] 440 responses

(87 requests per second, painfully — P95 of 860ms)

And this:

$ oha -z 5s http://localhost
Summary:
  Success rate: 1.0000
  Total:        5.0017 secs
  Slowest:      0.0077 secs
  Fastest:      0.0002 secs
  Average:      0.0015 secs
  Requests/sec: 34073.2242

  Total data:   5.53 GiB
  Size/request: 34.00 KiB
  Size/sec:     1.10 GiB

Response time histogram:
  0.001 [7190]  |■■■■
  0.001 [22000] |■■■■■■■■■■■■■■■
  0.001 [40329] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [46719] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [32402] |■■■■■■■■■■■■■■■■■■■■■■
  0.002 [14861] |■■■■■■■■■■
  0.003 [4874]  |■■■
  0.003 [1364]  |
  0.004 [450]   |
  0.004 [130]   |
  0.004 [106]   |

Latency distribution:
  10% in 0.0008 secs
  25% in 0.0011 secs
  50% in 0.0014 secs
  75% in 0.0018 secs
  90% in 0.0022 secs
  95% in 0.0024 secs
  99% in 0.0029 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0008 secs, 0.0001 secs, 0.0025 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
  [200] 170425 responses

(34K requests per second, easily — P95 of 2.4ms)

It’s important to note that this code correctly identifies valid session cookies. These are signed, so if you want to hammer an uncached endpoint, you now need to either:

Somehow reverse how cookies are signed + find the secret key (which is super easy to swap, it’ll just log out everyone once. No biggie)
Become a subscriber so you get a legitimate cookie (which will show up in traces, easy to block)

Is this pay to play?

Which leaves cached endpoints as durable attack targets: but because the maximum RPS (for the costliest page on the site) went from ~90 to 34K (a 37677% increase, for those keeping track), an attack would need to be above Cloudflare’s threshold to even make a dent, at which point their DDoS protection would kick in.

As for the video platform: I made no changes. None. It already had good observability, and I had planned for this eventuality from the start: the only thing I was trying to prevent wasn’t downtime, it was a huge AWS bill. And I’m happy to report that 100% of the requests that were served, were served from the SSD cache.

Also, most requests were served, since there’s eight separate instances of the video app in eight different regions (Paris, Tokyo, Washington, São Paulo, etc.)

After the storm: collateral damage

Another secondary objective of a DDoS is to generate collateral damage: to force the target to block legitimate traffic in response to the attack, losing out on potential business and/or generally annoying people.

Blocking an entire AS is often excessive (but oh-so-convenient), blocking Tor exit nodes is very tempting, and sometimes you just don’t want to hear from a certain country in a while. But these are all overshoots.

Ever since the attack, I’ve gotten messages from regular readers who could’t access my website, because they happened to be running the latest Google Chrome on Linux.

I tried everything: attack mode has been disabled for a while, I’ve started allowlisting the IPs of my readers, forcing a “managed challenge” for that user-agent (hoping it’ll override the automated block), but nothing seemed to be doing the trick, except for switching to another browser.

Eventually, I realized Cloudflare had nothing to do with it. During the attack, I had started banning that same user-agent (returning a 403) from my origin directly, and I just… forgot to disable that after the attack was over.

The confusing bit was that readers were seeing a Cloudflare error page saying “Sorry, you have been blocked”.

And even after I removed my counter-measure, they’re still seeing it. Living behind someone else’s edge is a mixed blessing.

This is also partly the point: you’re running left and right trying to make things better, you get sloppy. That’s how it’s supposed to work.

Anyway.

Performing a DDoS is technically an “illegal cybercrime”, but realistically, receiving one is just another Saturday.

Until next time, take excellent care of yourselves!

Update: hi again!

Minutes after I posted this article, the attack resumed. Same shit, different AS. Here are the newcomers:

AS45102: Alibaba US (Global, cloud provider)
AS16276: OVH (France, cloud provider)
AS141677: Nathosts Limited (Hong Kong)
AS8075: Microsoft Azure (US/HK/Brazil, cloud provider)
AS328386: Adnexus (South Africa)
AS7303: Telecom Argentina
AS24940: Hetzner (Germany, cloud provider)
AS45758: Triple T Broadband (Thailand, ISP)
AS37963: Alibaba Hangzhou (Global, cloud provider)
AS63949: Linode (US, VPS provider)

At first, the site immediately went down, returning 522 (Origin Connection Time-out).

Turns out a limit of 256 is way too low: Cloudflare has ~275 POPs, and each of them might establish a few connections to my origin. I raised the limit to 2048 (both for connections and in-flight requests), and immediately the 522 line went down, and a 200 line went up!

cloudflare requests graph, 522 spike, 200 spike, 403 spike

Shortly after, Cloudflare caught up with them and they saw a 403 spike. Then they went away. Site’s back up and fast as ever.

Comment on /r/fasterthanlime

(JavaScript is required to see this. Or maybe my stuff broke)

Here's another article just for you:

Understanding Rust futures by going way too deep

Jul 25, 2021

45 min #rust · #async

So! Rust futures! Easy peasy lemon squeezy. Until it’s not. So let’s do the easy thing, and then instead of waiting for the hard thing to sneak up on us, we’ll go for it intentionally.

Cool Bear's hot tip

That’s all-around solid life advice.

Choo choo here comes the easy part 🚂💨

We make a new project:

$ cargo new waytoodeep
     Created binary (application) `waytoodeep` package