We migrated from Faktory Enterprise to our own background job processing system in August 2024. Our two primary motivations were:
Faktory was beginning to exhibit instability, resulting in brief periods of downtime at the loads we were subjecting it to.
Because Faktory was a single point of failure, any issue had a user-visible impact (delayed feeds, etc.), and Faktory cannot (yet) run in a highly available configuration.

Faktory had served us incredibly well since our adoption of it in January 2022. We're grateful to Mike Perham, who has been pioneering this space since the launch of Sidekiq in 2012. We highly recommend you use a tool like Faktory before building your own job system.
With that said, suppose you've come to the conclusion that you need to build your own job processing solution. The rest of this post discusses factors we considered and lessons learned along the way.
Alternatives Considered
Before making this decision, we surveyed the background job system landscape circa H1 2024. Available options fell into roughly three categories:
SaaS providers: AWS SQS, Google Cloud Tasks, Cloudflare Worker Queues, etc. Rejected because we wanted to minimize vendor lock-in.
Ready-to-go open source frameworks: BullMQ, Celery, etc. We almost went with BullMQ, but ultimately rejected it because it requires different queue names per shard rather than transparently managing shards on your behalf; maintaining your own sharding scheme doesn't seem like the right level of abstraction for a job processing framework to push onto its users.
Build your own framework on top of open source tools such as Redis, RabbitMQ, Kafka, etc.
We opted to build our own solution with Node.js on top of Redis because we felt it would give us the most control and the greatest flexibility going forward, and it would integrate tightly with our existing TypeScript codebase. We were also able to use very mature libraries like ioredis for interacting with Redis from Node.js.
Features
At a high level, our new system supports all the Faktory features that we made heavy use of:
Delayed jobs (schedule a job to start at a specific time)
Periodic jobs defined via cron syntax
Per-job retry configuration logic
Dead letter queue
Queue throttling / rate-limiting
Job uniqueness enforcement
Job expiration
Job execution timeouts
We also added support for a few features specific to our use case (a rough sketch of the combined per-job options follows this list):
Replayable/skippable queues: fast forward or rewind a queue to a specific point in time and re-process or skip jobs as needed
Per-queue dead letter queues: Faktory had a global DLQ, which could sometimes make it difficult to sift through and clean up dead jobs if you didn't want to clear the whole DLQ.
Detailed metrics not available in Faktory Enterprise (e.g. segfaults, OOMKills, etc.)
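To make the feature set concrete, the per-job configuration can be imagined as a single options object. The shape below is a hypothetical sketch in TypeScript; the field names, defaults, and the enqueue call are illustrative, not our actual API.

```typescript
// Hypothetical sketch of per-job options covering the features listed above.
// Field names, defaults, and the enqueue() signature are illustrative only.
interface EnqueueOptions {
  queue: string;        // target queue
  runAt?: Date;         // delayed jobs: schedule a specific start time
  cron?: string;        // periodic jobs defined via cron syntax
  retries?: number;     // per-job retry configuration
  keepDead?: boolean;   // preserve exhausted jobs in the per-queue dead letter queue
  uniqueForMs?: number; // job uniqueness: suppress duplicate enqueues for this window
  expiresAt?: Date;     // job expiration: discard if not started by this time
  timeoutMs?: number;   // job execution timeout
}

// Example usage (hypothetical API):
// await jobs.enqueue('send-notification', { userId: '123' }, {
//   queue: 'notifications',
//   runAt: new Date(Date.now() + 60_000),
//   retries: 5,
//   uniqueForMs: 300_000,
//   timeoutMs: 30_000,
// });
```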
All of these features were implemented using a combination of built-in Redis data structures including:
Data structure | Features
---|---
Sorted sets | Delayed jobs, job uniqueness enforcement, dead letter queues, distributed semaphore
Streams | Replayable/skippable queues
Hashes | Job metadata
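As an example of how one of those rows works, delayed jobs map naturally onto a sorted set: score each job ID by the time it should run, then periodically promote the jobs whose scores are in the past. The sketch below uses ioredis with hypothetical key names; it is a simplification of the idea, not our exact implementation.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Schedule a job by storing its ID in a sorted set scored by its run time.
async function scheduleJob(jobId: string, runAt: Date): Promise<void> {
  await redis.zadd('jobs:delayed', runAt.getTime(), jobId);
}

// Periodically promote jobs whose run time has passed onto a ready list.
async function promoteDueJobs(): Promise<void> {
  const dueJobIds = await redis.zrangebyscore('jobs:delayed', 0, Date.now());
  for (const jobId of dueJobIds) {
    // Only enqueue if we were the one to remove it, so two promoters racing
    // on the same job don't enqueue it twice.
    const removed = await redis.zrem('jobs:delayed', jobId);
    if (removed === 1) {
      await redis.rpush('jobs:ready', jobId);
    }
  }
}
```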
These features ultimately powered a job system where each job had the following possible state transitions:

The difference between retries exhausted and dead is that dead jobs are saved for potential debugging and re-enqueuing (resurrection). The vast majority of exhausted/dead jobs don't need to be preserved, so the default is to simply delete the job when its retries have been exhausted. This is a key difference between Faktory and the new system: Faktory always kept a dead job unless the number of retries was set explicitly to 0.
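A sketch of that decision point, assuming ioredis and using hypothetical key names and options: when a job runs out of retries, it is either deleted outright (the default) or moved into its queue's dead letter queue for later inspection or resurrection.

```typescript
import Redis from 'ioredis';

// Hypothetical handler for a job whose retries have been exhausted.
// The key layout and the keepDead flag are illustrative, not our actual schema.
async function handleRetriesExhausted(
  redis: Redis,
  queue: string,
  jobId: string,
  keepDead: boolean
): Promise<void> {
  if (!keepDead) {
    // Default behavior: delete the job (Faktory would have kept it).
    await redis.del(`job:${jobId}`);
    return;
  }
  // Otherwise preserve it in the per-queue dead letter queue, scored by
  // time of death so old entries can be inspected or trimmed later.
  await redis.zadd(`queue:${queue}:dead`, Date.now(), jobId);
}
```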
Worth it?
Our current job system is fundamentally different from Faktory in that it allows for horizontal scaling and remains highly available through single-node failures. While you could run multiple Faktory servers and partition your job queues among them, this is painful to manage, and each of those Faktory servers is still a single point of failure for its respective queues. Our current system handles all of this for us transparently, and we've resized our cluster multiple times as the size of our workload has increased.

Where we previously had to limit the number of concurrent jobs we processed, we can now scale out to support larger workloads by simply increasing the number of shards in our Redis Cluster: a simple button click in the AWS console. This was not possible with Faktory, since it didn't support clustered Redis.
In short—for our specific use cases and what we needed from our job framework—the effort was worth it. That being said, we learned plenty during the course of implementation.
Lessons Learned
redis-semaphore requires independent/standalone Redis servers, not a Redis Cluster
One of the most important tools we needed was a distributed semaphore to limit concurrent access for features like queue throttling. redis-semaphore is an excellent library that has a variety of working implementations, but we spent a long time debugging a strange issue only to realize that the Redlock algorithm it implements requires independent nodes, which is captured in this comment. Once we switched from running 3 Redis nodes in clustered mode to 3 standalone Redis servers for our distributed semaphore, our mysterious race conditions were resolved.
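For reference, here is roughly what that setup looks like, assuming redis-semaphore's Redlock-based semaphore class (RedlockSemaphore) and its acquire/release methods; the exact constructor arguments may vary by version, so treat this as a hedged sketch rather than exact usage. The host names and concurrency limit are made up.

```typescript
import Redis from 'ioredis';
import { RedlockSemaphore } from 'redis-semaphore';

// Three *standalone* Redis servers, not members of a single Redis Cluster:
// the Redlock algorithm assumes independent nodes.
const nodes = [
  new Redis({ host: 'sem-1.internal' }),
  new Redis({ host: 'sem-2.internal' }),
  new Redis({ host: 'sem-3.internal' }),
];

// Allow at most 5 concurrent holders for this (hypothetical) throttle key.
const semaphore = new RedlockSemaphore(nodes, 'throttle:notifications', 5);

async function runThrottled(job: () => Promise<void>): Promise<void> {
  await semaphore.acquire();
  try {
    await job();
  } finally {
    await semaphore.release();
  }
}
```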
Pipelining commands in Redis is a necessity if you want to increase throughput
Our earliest implementations did not use pipelining so that we could get something working quickly. However, even with sub-millisecond latency between our workers and the Redis cluster, the round-trip time of executing each command serially rather than batching commands in a pipeline quickly becomes noticeable at high volume. Fortunately, the ioredis library has an ergonomic pattern for expressing pipelines that we made heavy use of.
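To illustrate the pattern (the keys and fields here are hypothetical, not our actual schema): commands chained onto a pipeline are buffered client-side and flushed to Redis in a single round trip, and exec() returns one [error, result] pair per command.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Enqueue a job with one round trip instead of three.
async function enqueueJob(queue: string, jobId: string, payload: string): Promise<void> {
  const results = await redis
    .pipeline()
    .hset(`job:${jobId}`, 'payload', payload, 'enqueuedAt', Date.now())
    .rpush(`queue:${queue}:ready`, jobId)
    .incr(`queue:${queue}:enqueued-count`)
    .exec();

  // exec() resolves to an array of [error, result] pairs, one per command.
  for (const [err] of results ?? []) {
    if (err) throw err;
  }
}
```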
Weighted round robin queueing is a simple way to avoid queue starvation
Faktory's queueing system made it possible to configure queues such that one queue would starve the others. For example, with two queues, priority and default, and a single worker, the worker will always pull from the priority queue before the default queue. With weighted round robin queuing, each queue is assigned a weight such that no queue is ever fully starved: you pull jobs from higher-weight queues more often than from lower-weight queues, but you are guaranteed to eventually process jobs from the lower-weight queues.
Redis Streams are incredibly powerful, but require significant up-front thought
Redis Streams give you a high degree of control over how you process a stream of events. With that said, you need to think up front about what kind of delivery semantics you want to implement (at least once, at most once, etc.) so that you aren't surprised during exceptional scenarios (such as a segfault or OOMKill), which can otherwise lead to lost or stuck jobs. Account for failure at every step, and you won't be surprised.
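As an example of what those delivery semantics look like in practice, here is a hedged sketch of at-least-once consumption with a consumer group via ioredis: an entry is only acknowledged after the job succeeds, and entries left pending by a crashed worker (segfault, OOMKill) are reclaimed rather than lost. The stream, group, and consumer names, as well as the idle threshold, are hypothetical.

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const STREAM = 'queue:notifications:stream';
const GROUP = 'workers';
const CONSUMER = `worker-${process.pid}`;

// Read one new entry; it stays in this consumer's pending entries list (PEL)
// until it is XACKed, which is what gives us at-least-once delivery.
async function consumeOnce(): Promise<void> {
  const response = await redis.xreadgroup(
    'GROUP', GROUP, CONSUMER,
    'COUNT', 1,
    'BLOCK', 5000,
    'STREAMS', STREAM, '>'
  );
  if (!response) return;

  const [, entries] = (response as unknown as Array<[string, Array<[string, string[]]>]>)[0];
  for (const [id, fields] of entries) {
    await processJob(fields);             // hypothetical handler; must be idempotent
    await redis.xack(STREAM, GROUP, id);  // acknowledge only after success
  }
}

// Periodically reclaim entries that a crashed consumer never acknowledged,
// so they don't sit stuck in its PEL forever (the 60s idle threshold is arbitrary).
async function reclaimStuckJobs(): Promise<void> {
  await redis.xautoclaim(STREAM, GROUP, CONSUMER, 60_000, '0');
}

async function processJob(fields: string[]): Promise<void> {
  // ... process the flattened field/value pairs for this job
}
```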
any off the shelf message queue solutions for typescript+redis that handle this type of thing? bullmq pro has this but costs $95/month 🫠 https://docs.bullmq.io/bullmq-pro/groups/rate-limiting
what's your use case now?
Sending notifications in a way that won’t get rate limited
We used Faktory for quite a while before rolling our own
hm seems like their OSS tier doesn't support it either
what prompted you running your own and how's it built? chatted with @downshift.eth over him looking into extending/building an opinionated bullmq version end of last year to handle microsub
still what i'm using, but running it myself on a managed Redis instance need to re-write it all in effect
we use qstash from upstash.com
Big reason why we went with our own solution was to have this kind of flexibility (for something that's really not that hard to implement directly in Redis). Unfortunately we don't have a simple-to-extract, open-sourceable solution, otherwise we would offer it. https://paragraph.xyz/new/@sds.eth/building-a-job-processing-system
makes sense. surprised there isn't a defacto open source solution for this already
I've been happy with Upstash Qstash for Yoink notifs
A new year's resolution is to write more technical blog posts about challenges we're working on for Warpcast + Farcaster. Here's a post discussing our migration from Faktory to our own hand-rolled job processing framework, and the lessons learned. https://paragraph.xyz/@sds.eth/building-a-job-processing-system
did you consider temporal at any point? too heavy?
Seemed like overkill for the vast majority of jobs we ran, but we do use Temporal for anything that touches money, generally speaking.
hell yes
Nice! we had something similar at Coffee meets Bagel, sorted sets + LUA scripts to maintain atomicity between commands. Was a ton of fun to build and worked pretty well (especially once Redis added sharding)
love this type of stuff, would read more!
Count on it
Thanks because I feel absolutely smarter, but relatively dumber. 💜💜💜
very cool to read about decision making behind the scenes! pls share more
Did you consider Temporal? Also in OSS land, this is a good one to take inspo from (we're happily using it in production, only dependency is postgres) https://github.com/oban-bg/oban
We use Temporal for a subset of high-value jobs (read: money processing), but it felt like overkill for all jobs, especially at the volume we were doing. If you really care about what the exact state of a job is and it’s not an inherently idempotent operation that can simply be retried, Temporal seems to be a reliable choice. Haven’t heard of oban—thanks for putting it on the radar. In our case, it’s really nice to have the job framework also written in TypeScript, since our entire backend codebase is TypeScript.
Oh absolutely. I just shared for potential code reading to take ideas. It’s a really mature project. So it may save you time in the future- maybe!
very great 😊