JavaScript streams and partitioning

October 12th, 2023


I’ve been building a tool for JavaScript stream processing, and time and time again I have to return to the question of “Why would you use this thing?” Unless you’re building room-temperature superconductors, your target user is coming to your product with the attitude, “Why wouldn’t I just keep using X?” (even if X is just Excel or SQLite or Google Analytics).

Your product will have tradeoffs because otherwise…why didn’t the incumbent just do it? Jira sucks because it’s complicated, and it’s complicated because it’s a one-size-fits-all-departments kind of product. For me, I’ve realized that I’ve been wrestling with CAP. Trying to achieve consistency across partitioned streams is slow. Consistency requires that destination streams check with origin streams on whether it’s safe to process beyond some point X. Whether or not it’s safe depends not only on where the origin’s head is, but also on whether that head is behind the tip. And the answer to the second question changes as new events roll in.
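To make that concrete, here’s a toy sketch of the check a destination would have to run before advancing. Everything here is hypothetical (the types, `fetchStatus`, the positions-as-numbers model); it exists only to show the shape of the round trips:

```ts
interface OriginStatus {
  head: number; // last position the origin has processed
  tip: number;  // last position the origin has received
}

// Stand-in for a real status request to an origin machine.
// In a real system this is a network round trip per origin, per loop.
async function fetchStatus(originId: string): Promise<OriginStatus> {
  return { head: 0, tip: 0 }; // placeholder
}

async function safeToProcessBeyond(
  origins: string[],
  target: number
): Promise<boolean> {
  const statuses = await Promise.all(origins.map(fetchStatus));
  return statuses.every(
    // The second clause is the unstable one: the tip moves as new
    // events roll in, so this answer is stale the moment it arrives.
    (s) => s.head >= target && s.head === s.tip
  );
}
```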

This is a massive coordination problem: if you have thousands of streams, you might be making requests to hundreds of other machines every single processing loop. On top of that, each processed chunk is much smaller, because you have to await the tardiest head. Kafka solves this with join windows, a preset lag on downstream processing that waits for tardy events. In my opinion that’s complicated configuration, and it works best when your streams are largely self-hosted and the recommended architecture is “one big stream.” User-submitted JavaScript is a lot slower than native Java, so streams will be a lot smaller and piping between streams a bigger part of the use case. Especially when you might be piping a small number of events from a large number of streams, waiting becomes quite complicated and…more importantly, not terribly valuable. For the most part, sub-second order doesn’t matter that much.
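For illustration, the join-window idea boils down to a buffer with a preset lag: hold merged events briefly so tardy ones can still slot into place. This is a sketch of the concept, not Kafka’s actual API, and the names are mine:

```ts
interface StreamEvent {
  ts: number;       // producer timestamp (ms since epoch)
  payload: unknown;
}

class JoinWindowBuffer {
  private buffer: StreamEvent[] = [];

  // lagMs is the preset lag: how long we wait for tardy events.
  constructor(private lagMs: number) {}

  push(event: StreamEvent): void {
    this.buffer.push(event);
  }

  // Release, in timestamp order, every event older than `now - lagMs`.
  // Anything newer stays buffered in case a tardier stream catches up.
  flush(now: number = Date.now()): StreamEvent[] {
    const cutoff = now - this.lagMs;
    this.buffer.sort((a, b) => a.ts - b.ts);
    const ready: StreamEvent[] = [];
    while (this.buffer.length > 0 && this.buffer[0].ts <= cutoff) {
      ready.push(this.buffer.shift()!);
    }
    return ready;
  }
}
```

The tradeoff is right there in the constructor: a bigger `lagMs` buys better ordering at the cost of latency on every single event, tardy or not.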

So one choice is to do nothing. First come, first recorded. But you do want some semblance of order when merging streams. If you’re merging two streams and you know approximately the order the events were received on the global timescale, you’d rather have them in that order, even at the cost of some milliseconds of “slippage”. That’s consistency and partition tolerance at the cost of availability.
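The “do nothing” version, for contrast, is a coordination-free merge: whichever stream resolves first gets recorded first. A minimal sketch over async iterables (again, names are mine):

```ts
async function* mergeFirstCome<T>(
  streams: AsyncIterable<T>[]
): AsyncGenerator<T> {
  const iters = streams.map((s) => s[Symbol.asyncIterator]());
  // One in-flight next() per stream, tagged with the stream's index.
  const pending = new Map<
    number,
    Promise<{ i: number; r: IteratorResult<T> }>
  >();
  iters.forEach((it, i) => pending.set(i, it.next().then((r) => ({ i, r }))));
  while (pending.size > 0) {
    // First come, first recorded: no ordering across streams.
    const { i, r } = await Promise.race(pending.values());
    if (r.done) {
      pending.delete(i);
    } else {
      pending.set(i, iters[i].next().then((res) => ({ i, r: res })));
      yield r.value;
    }
  }
}
```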

There’s also the problem of backfill. Say you have streams A and B going into X. Later, you realize you want stream C also piping into X, and you’d like its historical events in there too. What is stream X, then? Is it a historical record, or just notifications going forward? You definitely want it to serve the purpose of a historical record.

What if you recreate (reindex) stream X? What about its downstream effects? I think you can have splice requests for handling any downstream effects. Reindexing assigns a new stream id and rebuilds the order based on create time, not append time. Is create time the original time or the parent’s time? Probably the original, based on the producer’s clock. Once reindexed, the stream also repoints all of its pipes and reducers.
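Under those assumptions, reindexing might look something like this sketch. The types and the `reindex` function are made up for illustration:

```ts
interface StoredEvent {
  createdAt: number;  // producer (create) time; survives reindexing
  appendedAt: number; // when it landed in this stream
  payload: unknown;
}

interface Stream {
  id: string;
  events: StoredEvent[];
  pipes: string[]; // ids of downstream pipes and reducers
}

// Rebuild X, splicing in backfill sources (e.g. stream C's history).
function reindex(old: Stream, backfill: Stream[]): Stream {
  const events = [old, ...backfill]
    .flatMap((s) => s.events)
    // Order is rebuilt on create time, not append time.
    .sort((a, b) => a.createdAt - b.createdAt);
  return {
    id: crypto.randomUUID(), // reindexing assigns a new stream id
    events,
    pipes: [...old.pipes],   // the new stream repoints the old pipes/reducers
  };
}
```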

Why? Going back to the question of “Why would you use this thing?”: you use it for integrating distributed systems, where consistency is already impossible (unless your third-party tool allows for locks), or for simple analytics and onboarding, where order matters much less.