
Operational burden is a choice

Time for devlog #5! Last week's was here.

This one isn't really a devlog. It's my thoughts on how software is made, and why I think we need a shift.

Types of software

I don't like being oncall.

The Internet doesn't have opening hours. It's a distributed system that, unfortunately, comes with the expectation of being available 24/7.

As a result, software companies have broadly shifted from selling standalone applications to selling online applications.

I think we should build peer-to-peer applications.

The perils of always online applications

Online applications let businesses observe & control the data the app consumes and produces.

This is not always a bad thing. Some observation & control enables features that help people.

Search and recommendation features often require a decent amount of data storage and processing that is typically performed within the walls of the business.

Recommendations may also require collecting, mixing, and processing data sourced from many different users to infer general user behaviors. This sort of thing is harder to do well with standalone applications.

But online applications come with costs to the business & consequences to users:

GitHub (centralized) ≠ git (decentralized)

Consider GitHub, a business built on the shoulders of a free and open source distributed system: git.

You don't need GitHub to use git. Every repository (by default) contains the full history of changes and can pull and push additional changes to and from any other repository. A git repo is just a bunch of files in a .git/ directory, and they sit inert when not in use. No backend is needed: each repo is self-contained and can freely interoperate with peers as long as there's a bidirectional channel between them.

But there are good reasons to use GitHub: issues, pull requests, and actions make it easy to build software with others.

[Image: GitHub's outage page, featuring an angry-looking unicorn]

But easy features can cause lazy thinking.

Nearly every company I've worked for in the past decade had GitHub in the critical path to production. If GitHub went down (which it does, often), there was an incident—people needed to be oncall.

This is an architectural choice. And this kind of choice leads to a 24/7 oncall support rotation—an operational burden.

If git and a few shell scripts were used in the critical path for deploying to production instead, these companies would be mostly immune to these sorts of outages, and the overall need for oncall would shrink.

GitHub could still be used for the typical workflow, it just wouldn't be in the critical path.

Ironically, GitHub itself has also chosen to have a high operational burden. (I've never worked there, but can make observations about their architecture.)

Many of the business objects that GitHub offers (issues, pull requests, actions, etc.) could themselves be stored as primitives in git's distributed data model (blobs, trees, commits, branches, tags, etc.).

If this choice were made, GitHub would not need to run as many online services to manipulate these remote business objects. They would be stored the same way as ordinary source files tracked in git, and GitHub could build UI to present them in the same pleasant way.
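A minimal Python sketch of what that could look like, assuming an issue is serialized as canonical JSON and stored as an ordinary git blob. The `Issue` record and its fields are hypothetical and say nothing about how GitHub actually stores data. The blob ID is computed the way git names blobs in a default (SHA-1) repository: the SHA-1 of a `blob <size>` header, a NUL byte, then the raw content.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Issue:
    # Hypothetical issue record; the field names are illustrative only.
    number: int
    title: str
    body: str
    labels: list[str] = field(default_factory=list)


def serialize(issue: Issue) -> bytes:
    # Canonical JSON (sorted keys, no extra whitespace) so every peer
    # produces byte-identical content, and therefore the same blob ID.
    return json.dumps(asdict(issue), sort_keys=True, separators=(",", ":")).encode("utf-8")


def git_blob_id(content: bytes) -> str:
    # Git names a blob by SHA-1 over "blob <size>\0" followed by the bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


issue = Issue(
    number=1,
    title="Deploys fail when GitHub is down",
    body="Take GitHub out of the deploy path.",
    labels=["ops"],
)
print(git_blob_id(serialize(issue)))
```

Piping the same bytes into `git hash-object --stdin` in a default SHA-1 repository should print the same ID. To actually share the issue between clones, a commit reachable from some ref (say, a hypothetical `refs/issues/1`) would need to include the blob in its tree, since fetch and push only transfer reachable objects.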

There still would need to be online services, but there'd be fewer of them, and they'd be more limited in scope.

And fewer online services means less operational burden.

They probably don't do this because it would give away a piece of their value proposition: managing this data for you.

In a way, they're exchanging a higher operational burden for your data.

What's the alternative?

I'm not suggesting we go offline and stop building online applications.

Working on the same thing with people who aren't in the same room is table stakes these days.

Instead, we should start building peer-to-peer applications.

Running and maintaining a 24/7 centralized server to handle data and relay communication between parties is a drag. With a peer-to-peer application, each client can act as a server managing its own data.

This can even be done in a web application today.

WebRTC allows web pages to establish peer-to-peer connections without¹ an intermediate server handling the data.

Multiplayer peer-to-peer collaboration

It's hard to add "multiplayer" support to documents.

When I look around for best practices to implement this, I often see folks talk about CRDTs as a solution to manage the complexity of multiple parties manipulating the same document without conflicts.

CRDTs are fascinating, but to me they feel like overengineered "solutions" to fundamentally social problems.

Every time I've worked with people on the same document, there's always the same power structure: one person "hosts" the session, and everyone else is a "guest" acting politely while changing the document.

It can get a bit messy when multiple people try to change the same thing (e.g. everyone adding items to the same list).

But that messiness is social: we end up talking with each other to come up with an ad-hoc strategy for working together (e.g. splitting the list so each person has their own, then merging afterward).

And once the session is over and everyone has left, the host typically tidies up the loose ends.

Build "multiplayer" support that embraces this power structure

Instead of reaching for CRDTs and distributed state synchronization (where all parties are "equal" in ownership and intent), let's reach for the same tools used when a client requests data from a server.

Then there's no difference between a client/server structure and a peer-to-peer multiplayer structure: the client is a guest and the server is the host.

Document coordination could be as simple as:

Event | Response
A guest joins | The host tells them the state of the world.
The host makes a change | The host tells all guests what has changed.
A guest makes a change | The guest asks the host to accept the change. Guests may optimistically update their local view (but be prepared to roll back). The host may automatically or manually accept or reject this change depending on the host's state.
The host accepts a guest's change | The host makes the change and tells all guests what has changed.
The host rejects a guest's change | The host rejects the change, resulting in no state changes.
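The table above can be sketched in Python as a single-process simulation. This is an illustration under stated assumptions, not the author's implementation: the document is a hypothetical list of strings, a change appends one item, the host's accept policy is a plain function (a UI prompt could stand in for "manual"), and messages are direct method calls standing in for whatever transport, such as a WebRTC data channel, a real app would use.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    author: str  # who proposed the change
    seq: int     # per-author counter so identical edits stay distinguishable
    item: str    # hypothetical document model: a change appends one item


class Host:
    def __init__(self, accept: Callable[[Change, list[str]], bool] = lambda change, items: True):
        self.items: list[str] = []   # the authoritative document
        self.guests: list[Guest] = []
        self.accept = accept         # automatic accept/reject policy

    def join(self, guest: Guest) -> None:
        # A guest joins: the host tells them the state of the world.
        self.guests.append(guest)
        guest.on_snapshot(list(self.items))

    def make_change(self, change: Change) -> None:
        # The host makes a change and tells all guests what has changed.
        self.items.append(change.item)
        for guest in self.guests:
            guest.on_change(change)

    def propose(self, guest: Guest, change: Change) -> bool:
        # A guest asks the host to accept a change; the host decides from its own state.
        if self.accept(change, self.items):
            self.make_change(change)  # accepted: apply and broadcast
            return True
        guest.on_reject(change)       # rejected: host state untouched, only the proposer hears back
        return False


class Guest:
    def __init__(self, name: str):
        self.name = name
        self.confirmed: list[str] = []   # state as last reported by the host
        self.pending: list[Change] = []  # optimistic edits awaiting a decision
        self._seq = 0

    @property
    def view(self) -> list[str]:
        # What the guest sees: confirmed state plus its own optimistic edits.
        return self.confirmed + [change.item for change in self.pending]

    def on_snapshot(self, items: list[str]) -> None:
        self.confirmed = items

    def on_change(self, change: Change) -> None:
        self.confirmed.append(change.item)
        if change in self.pending:       # our own edit came back confirmed
            self.pending.remove(change)

    def on_reject(self, change: Change) -> None:
        self.pending.remove(change)      # roll back the optimistic update

    def edit(self, host: Host, item: str) -> None:
        self._seq += 1
        change = Change(self.name, self._seq, item)
        self.pending.append(change)      # optimistic update first...
        host.propose(self, change)       # ...then ask the host


# Example session with a hypothetical "no duplicate items" policy.
host = Host(accept=lambda change, items: change.item not in items)
alice, bob = Guest("alice"), Guest("bob")
host.join(alice)
host.make_change(Change("host", 1, "milk"))
host.join(bob)
alice.edit(host, "eggs")  # accepted and broadcast to everyone
bob.edit(host, "milk")    # rejected as a duplicate, so bob rolls back
print(host.items, alice.view, bob.view)
```

Keeping confirmed state separate from pending edits makes rollback a matter of dropping a pending entry rather than undoing anything. Because the calls here are synchronous, the optimistic window is invisible; over a real network, pending edits would show up in `view` until the host answers.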

Does this actually work?

Probably! I'm not sure, but I'm going to try.

Hopefully I'll soon have an implementation of the above that lets multiple people operate as guests on a document hosted by one of the users, without involving any servers that require heavy maintenance.