Thanks For Breaking cargo-dist! (I Rewrote It)

by Gankra, 28 February 2023

This article is a retrospective on the development of cargo-dist 0.0.3, which introduces better support for complex workspaces and redesigns some interfaces. If you just want details on that, check out:
The changelog (migration guide at the top)
The new cargo-dist book
Introduction
Install
Way-Too-Quickstart
Guide
Reference

At the start of the month we announced the first release of cargo-dist, and wow the reception was better than I expected! People immediately started trying it out on their projects and while it worked well for some folks... it also completely exploded for others!

The initial version of cargo-dist was narrowly targeting simple Cargo Workspaces and would mess up if you had multiple binaries. I didn't highlight that well enough, and probably should have added assertions to disable those partially-developed code paths. I figured most people wouldn't want to touch something that had that many "THIS IS REALLY EXPERIMENTAL" signs on it, but I'm glad I was wrong because my issue tracker very quickly filled up with lots of great stuff!

So I started working through the issues and a couple weeks later...

screenshot of a Github PR "Massively rework task computation/execution while introducing configs" that fixes about 20 issues and touches 3500 lines of code

Oops I Accidentally Rewrote Everything!™️

So for some context here, at this point in my career I've accepted that I can't fit that much in my head at once. I Write All The Docs because if I don't I'll just forget as soon as I move on to something else. In the context of a tool like cargo-dist, this also means that I can't imagine the full scope of the design all at once. To some extent I have to sketch the design out in code to even begin to fathom its shape.

Knowing this about myself, I have to actively avoid trying to look ahead and designing the absolutely correct approach that my experience suggests. I sketch out a few tentative abstractions and then mostly try to write a bunch of really simple straight-line code. Usually this ends up producing a single "do the thing" function that balloons out of control until I feel comfortable with certain details enough to admit that they deserve to be factored out to a dedicated function.

cargo-dist was no different, and that "do the thing" function was gather_work. That function is responsible for looking at your Cargo Workspace and computing a graph of everything cargo-dist should do with it. Even if I couldn't fit everything in my head at once, I knew it was important that the design look something like that. To whatever extent possible, cargo-dist should know everything that will happen before it happens. I don't want the extension of a tarball to be decided in the middle of the code that's actually doing I/O to build it!

By 0.0.2, gather_work was up to 500 lines long and it was... Not Good. You might think the terrible part was a 500 line function, but actually the real issues were:

Bad Abstractions: I had to commit to some abstractions early on, but as predicted they were Incorrect. There was a huge impedance mismatch between the types I had made and the actual implementation I found myself settling into. The code was constantly deciding it wanted to do something, doing part of it, forgetting about it, and then trying to remember it somewhere else.
Not Precomputed Enough: Although I was computing all the most important details in gather_work, I was still putting too much logic in the subsequent steps. For instance I might precompute that something should be a .tar.xz, but then leave it to the tarring code to compute certain file paths. This led to rampant inconsistencies as I added more features which interacted with each other, as they all had to repeat the same logic.

Both of these together made every change a huge cross-cutting mess full of special cases. It's no surprise that the extra complexity of multiple binaries in one workspace was broken! To be clear, I had actually factored multi-binary workspaces into the core of cargo-dist's design, it's just that every feature I added after that made its own simplifying assumptions while I narrowed in on a minimal implementation we could dogfood. Those assumptions needed to be fixed.

Most of the issues were, on the surface, relatively minor fixes. Things like "oops the title is repeated 5 times" or "oops I forgot to prefix installer.sh with the name of the application so if you have multiple applications they clobber each other". In principle I could fix each of those in a few minutes.

But as I triaged the issues and prepared to start fixing them, I couldn't help but see a pattern. The real problem wasn't that the newer features were making incorrect assumptions, the problem was that they were responsible for making those decisions at all! So I started working on computing those facts centrally in gather_work, making all the other parts of the code just do what they're told.
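To make "compute the facts centrally, then just do what you're told" concrete, here's a minimal sketch of the shape. This is not cargo-dist's actual code: ArchivePlan, plan_archive, and describe are hypothetical names, and the "zip on Windows, tar.xz elsewhere" rule is an assumption for illustration. The point is only that the file name, extension, and destination path are all decided in one pure planning step, so nothing downstream has to re-derive them.

```rust
use std::path::PathBuf;

/// Everything later steps need to know about one archive, decided up front.
/// (Hypothetical type for illustration, not cargo-dist's real one.)
#[derive(Debug, Clone, PartialEq)]
pub struct ArchivePlan {
    /// Final file name, extension included.
    pub file_name: String,
    /// Where the archive will be written.
    pub dest_path: PathBuf,
    /// Binaries that go into the archive.
    pub binaries: Vec<String>,
}

/// The planning step: pure computation, no I/O.
pub fn plan_archive(
    app: &str,
    version: &str,
    target: &str,
    dist_dir: &str,
    binaries: &[&str],
) -> ArchivePlan {
    // Assumption for this sketch: zip on Windows, tar.xz everywhere else.
    let ext = if target.contains("windows") { "zip" } else { "tar.xz" };
    let file_name = format!("{app}-v{version}-{target}.{ext}");
    ArchivePlan {
        // The path is derived here, once, from the same name everyone else sees.
        dest_path: PathBuf::from(dist_dir).join(&file_name),
        file_name,
        binaries: binaries.iter().map(|b| b.to_string()).collect(),
    }
}

/// The execution step makes no decisions; here it only describes the work
/// it would do, using nothing but what the plan already says.
pub fn describe(plan: &ArchivePlan) -> String {
    format!(
        "write {} containing [{}]",
        plan.dest_path.display(),
        plan.binaries.join(", ")
    )
}
```

An installer generator that needs the archive's name would read plan.file_name instead of rebuilding the string itself, which is exactly the kind of duplicated logic that was causing the inconsistencies.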

Centralizing more of the logic then exposed another problem: those simplifying assumptions had been dancing around a few fundamental flaws in cargo-dist's design! Trying to compute them properly forced me to reckon with those flaws. At that point I had to take a step back and reconsider more fundamental details of the design, and that +3500/-1500 PR is the result of me working through it. 😅

Despite what it might look like, as far as I'm concerned this is me knocking it out of the park. I knew I couldn't fit everything in my head, so I kept the abstractions minimal and the code simple while I prototyped something that seemed reasonable. This helped me find the Right solution and made it way easier to do the necessary rewrites.

Even when I started to understand what the new design should be, I still couldn't fit it all in my head. I had to write a faux RFC to gather my thoughts coherently. Once I started implementing it, I still couldn't picture everything, and had to do it piece by piece. There was a distressingly long period where I was worried that I was going to hit some detail that broke everything but... it all worked!

I was able to rework the entire codebase in only two weeks! While adding a bunch of features and fixing a ton of bugs! But the biggest reward of all came afterwards: all the follow up PRs went from "changing a bunch of random places that made no sense to anyone" to "added a couple lines in this one spot". That's how I really knew the effort had been worth it, and that the new design was a good one!

Plus once I was done I got to Treat Myself and spend the next week knocking out an entire book of documentation to vent all of that knowledge from my brain! I Write All The Docs, and I like it that way. 😸

Don't get me wrong: I'm fucking exhausted (even working 4 day work weeks), and I wouldn't want to do this kind of refactor all the time. But for my strengths and weaknesses, when starting up a new project with too many unknowns, I stand by this approach. Also I'm writing Rust in VSCode so I can lean on great refactoring tools and compiler diagnostics.

I'm not gonna dig through the code, but I do want to look at some of the high level design issues and how they were addressed. And just to be clear, despite all the overhauls I'm about to describe, the Way-Too-Quickstart is exactly the same: You set up with cargo dist init --ci=github, you can run simple local builds with cargo dist build, and you run CI builds by pushing a git tag to your repo. Now you just have more knobs to turn, and all the knobs make more sense, so more complex situations work.

Many Releases, One Announcement

This is the most fundamental problem, and the primary subject of the faux RFC: I had designed cargo-dist to find all the binaries in your workspace and treat each one as its own "App" that would be handled independently. As a matter of efficiency I would of course have them share a build and other work, but the final output of cargo-dist was basically an array of "Releases" with nothing shared between them.

If you're running cargo-dist locally, that seems like a perfectly reasonable solution, so I didn't think much of it. But a problem arises once you put cargo-dist in CI: you want your CI run to be triggered by pushing a Git Tag, and 1 Git Tag = 1 Github Release = 1 Download URL. Trying to shove 5 completely unrelated applications with 5 completely unrelated versions into that... doesn't really make sense?

Here's one of the nastiest places where that blew up: we knew we needed to provide the ability to generate shell/powershell scripts that would fetch and unpack tarballs from the Github Release (or whatever other infra). This is something Ashley had done a few times before on other projects, so I grabbed one of her more recent examples and quickly added some templating to generate it for arbitrary projects.

I believe these scripts have a long heritage that descends from things like the rustup installer, so they had a bunch of features that allowed you to request a specific version to install. To do this, they would build up the download URL as something like:

{repo_url}/releases/download/v{version}/{app-name}-v{version}-{platform}.tar.xz

That looks super reasonable on the surface, but is Absolutely Incorrect if you're willing to shove multiple Applications with different version numbers into the same Github Release! The tag is completely decoupled from the version, so download/v{version}/ isn't gonna work!
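Here's a small sketch of the mismatch, using the URL template above. The function names and the example repo URL are made up for illustration; the only real-world fact it leans on is that the path segment after releases/download/ in a Github Release asset URL is the Git Tag.

```rust
/// The installer's original scheme: derive the whole URL from the app's version.
pub fn url_from_version(repo_url: &str, app: &str, version: &str, platform: &str) -> String {
    format!("{repo_url}/releases/download/v{version}/{app}-v{version}-{platform}.tar.xz")
}

/// What the URL actually has to be: the directory segment is the Git Tag,
/// which may have nothing to do with this particular app's version.
pub fn url_from_tag(
    repo_url: &str,
    tag: &str,
    app: &str,
    version: &str,
    platform: &str,
) -> String {
    format!("{repo_url}/releases/download/{tag}/{app}-v{version}-{platform}.tar.xz")
}
```

If the release was tagged v0.5.0 but it contains an app at version 1.2.0, url_from_version points at releases/download/v1.2.0/, which doesn't exist. The two functions only agree when the tag happens to be v{version} for that app.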

This kind of problem is what really pushed me to introduce a new major concept in cargo-dist's design: An Announcement. If we're planning to put a bunch of different things under one Github Release or Git Tag, we can't just do that at the last second. We need every part of cargo-dist to be aware of The Announcement, and we need the Announcement to make sense.

Determining The Announcement is now one of the first things cargo-dist does, and it will absolutely refuse to continue if it can't:

A screen shot of running 'cargo dist manifest' on a workspace with multiple binaries and getting an error that it can't be coherently Announced. It then prompts you to pass --tag to specify what you meant, providing examples like v0.5.0 or my-app-v0.5.0

I'm not going to reiterate the docs, read them for more details, but basically your workspace either needs to be boring enough for the git tag to be obvious (everything has the same version), or you need to tell us what tag format you're using (one-tag-per-workspace or one-tag-per-app).
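To give a feel for the two tag formats, here's a rough sketch of how a tag like v0.5.0 or my-app-v0.5.0 could be told apart. This is my own simplification, not the real parser: TagKind, parse_tag, and looks_like_version are hypothetical names, and it only understands plain MAJOR.MINOR.PATCH versions (no prerelease suffixes).

```rust
/// What a tag is asking us to announce. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub enum TagKind {
    /// "v0.5.0": one tag for the whole workspace.
    Workspace { version: String },
    /// "my-app-v0.5.0": one tag for a single app.
    App { name: String, version: String },
}

pub fn parse_tag(tag: &str) -> Option<TagKind> {
    // One-tag-per-workspace: the tag is just "v" followed by a version.
    if let Some(version) = tag.strip_prefix('v') {
        if looks_like_version(version) {
            return Some(TagKind::Workspace { version: version.to_string() });
        }
    }
    // One-tag-per-app: split on the last "-v" so app names may contain dashes.
    let (name, version) = tag.rsplit_once("-v")?;
    if name.is_empty() || !looks_like_version(version) {
        return None;
    }
    Some(TagKind::App { name: name.to_string(), version: version.to_string() })
}

/// Simplified check: exactly three dot-separated runs of ASCII digits.
fn looks_like_version(s: &str) -> bool {
    let parts: Vec<&str> = s.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()))
}
```

Anything that returns None here is the "can't be coherently Announced" case: better to stop and ask than to guess.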

I don't want you hitting these issues in CI, so I'm making you deal with them early and often.

Now the model is 1 Git Tag = 1 cargo-dist Announcement = 1 Github Release. cargo-dist's JSON output now also includes top level fields for announcement details like title and github body, where previously those things were nonsensically squirreled away in the individual app releases.

This model required me to rip out some of those fancy installer features. No more building up URLs from several variables, only exact URLs that gather_work centrally computed and baked into the installer. In the future we might add some of this stuff back, but only when we have a coherent model for it!

What Should I Be Building?

In 0.0.2, there was a hacky wart you might have noticed in the Github CI: there were two different tasks for computing artifacts. One of them was run once for each OS with a Matrix like you'd expect, and the other one just did... "other" stuff, and used the dubious --no-builds flag to uh, build stuff. Yeah.

This weird wart was trying to solve a problem that I hit with the installer scripts. Geez, those installer scripts really made a lot of messes!

Most cargo-dist Artifacts are naturally per-platform. You want to support Linux, macOS, and Windows? Cool I'll make you a zip/tarball for each one. Installer scripts, on the other hand, are unified across all platforms (although maybe only supporting a subset of them). How do you decide what task should build them, and only build and upload them once?

I didn't want a jank solution like "installer scripts only get built as part of the linux build". I wanted to be able to clearly separate these "local" and "global" Artifacts. --no-builds was the first thing I thought of to say "don't run Cargo builds, just generate the other stuff".

This wasn't a terribly well-defined concept, and ultimately didn't make sense.

The installers want to know about all the different builds they can fetch, and what things are included in that build (there could be multiple binaries in one zip!). So I needed cargo-dist to know everything that could happen while still also telling it to do only some of the work. That's two different sets of artifacts: "all artifacts" and "artifacts we're dealing with now".

I increasingly felt like CLI flags alone only lent themselves to talking about one set of artifacts at a time, and --no-builds was a really narrow hack that wasn't going to scale (and was impossible to document).

So my totally internal refactor suddenly blended into a ton of actual user-facing feature work: introduce persistent configuration (which is all generated for you by cargo dist init):

# Config for 'cargo dist'
[workspace.metadata.dist]
# The preferred cargo-dist version to use in CI (Cargo.toml SemVer syntax)
cargo-dist-version = "0.0.3-prerelease12"
# The preferred Rust toolchain to use in CI (rustup toolchain syntax)
rust-toolchain-version = "1.67.1"
# CI backends to support (see 'cargo dist generate-ci')
ci = ["github"]
# The installers to generate for each app
installers = ["shell", "powershell"]
# Target platforms to build apps for (Rust target-triple syntax)
targets = ["x86_64-unknown-linux-gnu", "x86_64-apple-darwin", "x86_64-pc-windows-msvc"]

Again, see the docs for more details, but the basic idea is that if we want everything to coherently agree on what "the full build" is, we should store it persistently in your Cargo.toml. Various CLI flags are then defined to select a subset of that build. That way we can coherently say "build just the installers" while still remembering the things the installers want to refer to.

The way we select those subsets is with the new --artifacts flag, which selects various Artifact Modes. Again, see the docs.
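The "full build plus a selected subset" idea can be sketched roughly like this. To be clear, this is an illustration and not the real implementation: the types are hypothetical, and of the mode names only `all` appears in this post; Local and Global are my assumed names for "per-platform artifacts for this machine" and "artifacts shared across platforms".

```rust
/// Whether an artifact is built per-platform or once for everyone.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Scope {
    Local,
    Global,
}

/// One thing the full build would produce. (Hypothetical type for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub struct Artifact {
    pub name: String,
    pub scope: Scope,
    /// The target triple for per-platform artifacts, None for shared ones.
    pub target: Option<String>,
}

/// Which subset of the full build to work on right now.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArtifactMode {
    Local,
    Global,
    All,
}

/// The full list is always computed; the mode only filters what gets built,
/// so a "global" run can still look up the archives its installers refer to.
pub fn select<'a>(
    all: &'a [Artifact],
    mode: ArtifactMode,
    host_target: &str,
) -> Vec<&'a Artifact> {
    all.iter()
        .filter(|a| match mode {
            ArtifactMode::All => true,
            ArtifactMode::Global => a.scope == Scope::Global,
            ArtifactMode::Local => {
                a.scope == Scope::Local && a.target.as_deref() == Some(host_target)
            }
        })
        .collect()
}
```

Because every run starts from the same full list, each CI task can take its own slice of it while the installer scripts still see every archive they need to point at.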

Artifact Modes also helped me resolve a conflict between the two ways I expect people to use cargo-dist:

Locally, just messing around and testing it
In CI, doing things For Real

If someone runs bare cargo dist on their machine, I want to do something useful for them right away so they can understand the tool and get their project set up properly. Build for their platform, generate global installers, etc. It can be a little magical (for instance, if they're running on a platform that isn't covered by the "targets" array we should still try to build an executable for their platform), but ideally it should be as consistent as possible with the way CI works, and it should be easy to locally check what CI would do.

One of the things I'm very happy with is cargo dist manifest --artifacts=all, which says "hey get me a manifest of what should be produced if you were to build everything". Here let's run it on my Evil Workspace:

A screenshot of running cargo dist manifest --artifacts=all on a really complicated workspace with multiple binary-having-packages and packages-with-multiple-binaries. It starts with a printout of what cargo-dist thinks your apps are, and then concludes with a printout of the announcement tag and all the artifacts that will be built.

That printout needs some UX work, but damn if it doesn't make me happy to be able to debug/test things so easily (and that really is the same work that's done in CI, the only difference is CI uses --output-format=json).

Oh and the more I worked on the persistent config, the more it just made everything else make more sense. For instance, the installers were previously prefixed with "github" (github-shell, github-powershell), because I needed to know to assume Github Releases when generating them (and wanted to be future-proof for other backends). But with ci = ["github"] baked into persistent config, the code could know we're Doing Github even if you're not explicitly talking about it.

The MVP didn't ship with a persistent config because I wanted to focus on Good Defaults, but it caused a lot of problems that I'm glad are gone.

Symbol Struggles

This one is kind of disconnected but the support for gathering up your debuginfo/symbols was simply a bit under-developed. I wanted to include it in the MVP to make sure it worked and to establish it as A Thing. I only included support for pdbs, and as far as I know the pdbs I was generating were solid... but Cargo doesn't really make it easy to enable/disable debuginfo on a per-platform basis, so I couldn't completely isolate it. This caused problems in various situations, so I decided to turn it off for now, pending a rework.

As a result, the default [profile.dist] is now just lto = "thin". As recommended by the migration guide, you should just delete your [profile.dist] and rerun cargo dist init to get the new one.

The old profile notably caused problems for people on linux, because dwp is... Very New and Very Weird.

I'm gonna get this stuff to work eventually, but it needs more time.

Still More Work To Do

So overall, I'm pretty happy with the time I put into this refactor. Everything is a lot more consistent and reliable. Everything makes a lot more sense. I have a proper base to build on. There's definitely still some sharp edges, but I tried to document/check them a lot more. If your project still hits them, and you know what you want to happen in those situations, please let us know!

(Oh and if you're wondering, gather_work is now "only" 400 lines... as the entry point for a 2300 line tasks.rs file. tasks.rs is pretty decently factored and abstracted to make it easy to build up the task graph incrementally. Much nicer to work with. 😸)

So what's next? Now that I'm comfortable with the base, 0.0.4 will be focused on filling in features that are blocking people from adopting cargo-dist: arm64 macOS builds, linux musl builds, generating npm packages that fetch your rust binaries, maybe even code signing?

I also really really need to take some time to properly build out our test suite. There's too many weird workspaces and workflows for me to manually test anymore, and now I actually have a coherent internal model to test them against!

Oh and you might have noticed that I keep linking something that sure isn't a Github Release. You didn't think Github Releases were the final destination of cargo-dist's output, did you? We're not quite ready to announce the tool that's generating those pages, but we're already dogfooding it in cargo-dist's docs, so it's Coming Soon. 😽