We migrated from Faktory Enterprise to our own background job processing system in August 2024. Our two primary motivations were:
Faktory was beginning to exhibit instability, resulting in brief periods of downtime at the loads we were subjecting it to.
Because Faktory was a single point of failure, any issue had a user-visible impact (delayed feeds, etc.), and Faktory cannot (yet) run in a highly available configuration.

Faktory had served us incredibly well since our adoption of it in January 2022. We're grateful to Mike Perham, who has been pioneering this space since the launch of Sidekiq in 2012. We highly recommend you use a tool like Faktory before building your own job system.
With that said, suppose you've come to the conclusion that you need to build your own job processing solution. The rest of this post discusses factors we considered and lessons learned along the way.
Alternatives Considered
Before making this decision, we surveyed the background job system landscape circa H1 2024. Available options fell into roughly three categories:
SaaS providers: AWS SQS, Google Cloud Tasks, Cloudflare Worker Queues, etc. Rejected because we wanted to minimize vendor lock-in.
Ready-to-go open source frameworks: BullMQ, Celery, etc. We almost went with BullMQ, but ultimately rejected it because it requires different queue names per shard rather than transparently managing shards on your behalf; maintaining your own sharding scheme doesn't seem like the right level of abstraction for a job processing framework to push onto its users.
Build your own framework on top of open source tools such as Redis, RabbitMQ, Kafka, etc.
We opted to build our own solution with Node.js on top of Redis because we felt it would give us the most control and the greatest flexibility going forward, and it would integrate tightly with our existing TypeScript codebase. We were also able to use very mature libraries like ioredis for interacting with Redis from Node.js.
Features
At a high level, our new system supports all the Faktory features that we made heavy use of:
Delayed jobs (schedule a job to start at a specific time)
Periodic jobs defined via cron syntax
Per-job retry configuration logic
Dead letter queue
Queue throttling / rate-limiting
Job uniqueness enforcement
Job expiration
Job execution timeouts
We also added support for a few features specific to our use case (a rough sketch of the combined per-job options follows this list):
Replayable/skippable queues: fast forward or rewind a queue to a specific point in time and re-process or skip jobs as needed
Per-queue dead letter queues: Faktory had a global DLQ, which could sometimes make it difficult to sift through and clean up dead jobs if you didn't want to clear the whole DLQ.
Detailed metrics not available in Faktory Enterprise (e.g. segfaults, OOMKills, etc.)
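To make the feature set concrete, the per-job configuration can be imagined as a single options object. The shape below is a hypothetical sketch in TypeScript; the field names, defaults, and the enqueue call are illustrative, not our actual API.

```typescript
// Hypothetical sketch of per-job options covering the features listed above.
// Field names, defaults, and the enqueue() signature are illustrative only.
interface EnqueueOptions {
  queue: string;        // target queue
  runAt?: Date;         // delayed jobs: schedule a specific start time
  cron?: string;        // periodic jobs defined via cron syntax
  retries?: number;     // per-job retry configuration
  keepDead?: boolean;   // preserve exhausted jobs in the per-queue dead letter queue
  uniqueForMs?: number; // job uniqueness: suppress duplicate enqueues for this window
  expiresAt?: Date;     // job expiration: discard if not started by this time
  timeoutMs?: number;   // job execution timeout
}

// Example usage (hypothetical API):
// await jobs.enqueue('send-notification', { userId: '123' }, {
//   queue: 'notifications',
//   runAt: new Date(Date.now() + 60_000),
//   retries: 5,
//   uniqueForMs: 300_000,
//   timeoutMs: 30_000,
// });
```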
All of these features were implemented using a combination of built-in Redis data structures including:
Data structure | Features
---|---
Sorted sets | Delayed jobs, job uniqueness enforcement, dead letter queues, distributed semaphore
Streams | Replayable/skippable queues
Hashes | Job metadata
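As an example of how one of those rows works, delayed jobs map naturally onto a sorted set: score each job ID by the time it should run, then periodically promote the jobs whose scores are in the past. The sketch below uses ioredis with hypothetical key names; it is a simplification of the idea, not our exact implementation.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Schedule a job by storing its ID in a sorted set scored by its run time.
async function scheduleJob(jobId: string, runAt: Date): Promise<void> {
  await redis.zadd('jobs:delayed', runAt.getTime(), jobId);
}

// Periodically promote jobs whose run time has passed onto a ready list.
async function promoteDueJobs(): Promise<void> {
  const dueJobIds = await redis.zrangebyscore('jobs:delayed', 0, Date.now());
  for (const jobId of dueJobIds) {
    // Only enqueue if we were the one to remove it, so two promoters racing
    // on the same job don't enqueue it twice.
    const removed = await redis.zrem('jobs:delayed', jobId);
    if (removed === 1) {
      await redis.rpush('jobs:ready', jobId);
    }
  }
}
```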
These features ultimately powered a job system where each job had the following possible state transitions:

The difference between retries exhausted and dead is that dead jobs are saved for potential debugging and re-enqueuing (resurrection). The vast majority of exhausted/dead jobs don't need to be preserved, so the default is to simply delete the job when its retries have been exhausted. This is a key difference between Faktory and the new system: Faktory always kept a dead job unless the number of retries was set explicitly to 0.
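A sketch of that decision point, assuming ioredis and using hypothetical key names and options: when a job runs out of retries, it is either deleted outright (the default) or moved into its queue's dead letter queue for later inspection or resurrection.

```typescript
import Redis from 'ioredis';

// Hypothetical handler for a job whose retries have been exhausted.
// The key layout and the keepDead flag are illustrative, not our actual schema.
async function handleRetriesExhausted(
  redis: Redis,
  queue: string,
  jobId: string,
  keepDead: boolean
): Promise<void> {
  if (!keepDead) {
    // Default behavior: delete the job (Faktory would have kept it).
    await redis.del(`job:${jobId}`);
    return;
  }
  // Otherwise preserve it in the per-queue dead letter queue, scored by
  // time of death so old entries can be inspected or trimmed later.
  await redis.zadd(`queue:${queue}:dead`, Date.now(), jobId);
}
```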
Worth it?
Our current job system is fundamentally different from Faktory in that it allows for horizontal scaling and remains highly available through single-node failures. While you could run multiple Faktory servers and partition your job queues among them, this is painful to manage, and each of those Faktory servers is still a single point of failure for its respective queues. Our current system handles all of this for us transparently, and we've resized our cluster multiple times as the size of our workload has increased.

Where we previously had to limit the number of concurrent jobs we processed, we can now scale out to support larger workloads by simply increasing the number of shards in our Redis Cluster: a simple button click in the AWS console. This was not possible with Faktory, since it didn't support clustered Redis.
In short—for our specific use cases and what we needed from our job framework—the effort was worth it. That being said, we learned plenty during the course of implementation.
Lessons Learned
redis-semaphore requires independent/standalone Redis servers, not a Redis Cluster
One of the most important tools we needed was a distributed semaphore to limit concurrent access for features like queue throttling. redis-semaphore is an excellent library that has a variety of working implementations, but we spent a long time debugging a strange issue only to realize that the Redlock algorithm it implements requires independent nodes, which is captured in this comment. Once we switched from running 3 Redis nodes in clustered mode to 3 standalone Redis servers for our distributed semaphore, our mysterious race conditions were resolved.
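For reference, here is roughly what that setup looks like, assuming redis-semaphore's Redlock-based semaphore class (RedlockSemaphore) and its acquire/release methods; the exact constructor arguments may vary by version, so treat this as a hedged sketch rather than exact usage. The host names and concurrency limit are made up.

```typescript
import Redis from 'ioredis';
import { RedlockSemaphore } from 'redis-semaphore';

// Three *standalone* Redis servers, not members of a single Redis Cluster:
// the Redlock algorithm assumes independent nodes.
const nodes = [
  new Redis({ host: 'sem-1.internal' }),
  new Redis({ host: 'sem-2.internal' }),
  new Redis({ host: 'sem-3.internal' }),
];

// Allow at most 5 concurrent holders for this (hypothetical) throttle key.
const semaphore = new RedlockSemaphore(nodes, 'throttle:notifications', 5);

async function runThrottled(job: () => Promise<void>): Promise<void> {
  await semaphore.acquire();
  try {
    await job();
  } finally {
    await semaphore.release();
  }
}
```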
Pipelining commands in Redis is a necessity if you want to increase throughput
Our earliest implementations did not use pipelining so that we could get something working quickly. However, even with sub-millisecond latency between our workers and the Redis cluster, the round-trip time of executing each command serially rather than batching commands in a pipeline quickly becomes noticeable at high volume. Fortunately, the ioredis library has an ergonomic pattern for expressing pipelines that we made heavy use of.
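To illustrate the pattern (the keys and fields here are hypothetical, not our actual schema): commands chained onto a pipeline are buffered client-side and flushed to Redis in a single round trip, and exec() returns one [error, result] pair per command.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

// Enqueue a job with one round trip instead of three.
async function enqueueJob(queue: string, jobId: string, payload: string): Promise<void> {
  const results = await redis
    .pipeline()
    .hset(`job:${jobId}`, 'payload', payload, 'enqueuedAt', Date.now())
    .rpush(`queue:${queue}:ready`, jobId)
    .incr(`queue:${queue}:enqueued-count`)
    .exec();

  // exec() resolves to an array of [error, result] pairs, one per command.
  for (const [err] of results ?? []) {
    if (err) throw err;
  }
}
```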
Weighted round robin queueing is a simple way to avoid queue starvation
Faktory's queueing system made it possible to configure queues such that one queue would starve the others. For example, with two queues, priority and default, and a single worker, the worker will always pull from the priority queue before the default queue. With weighted round robin queuing, each queue is assigned a weight such that no queue is ever fully starved: you pull jobs from higher-weight queues more often than from lower-weight queues, but you are guaranteed to eventually process jobs from the lower-weight queues.
Redis Streams are incredibly powerful, but require significant up-front thought
Redis Streams give you a high degree of control over how you process a stream of events. With that said, you need to think up front about what kind of delivery semantics you want to implement (at least once, at most once, etc.) so that you aren't surprised during exceptional scenarios (such as a segfault or OOMKill), which can otherwise lead to lost or stuck jobs. Account for failure at every step, and you won't be surprised.
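As an example of what those delivery semantics look like in practice, here is a hedged sketch of at-least-once consumption with a consumer group via ioredis: an entry is only acknowledged after the job succeeds, and entries left pending by a crashed worker (segfault, OOMKill) are reclaimed rather than lost. The stream, group, and consumer names, as well as the idle threshold, are hypothetical.

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const STREAM = 'queue:notifications:stream';
const GROUP = 'workers';
const CONSUMER = `worker-${process.pid}`;

// Read one new entry; it stays in this consumer's pending entries list (PEL)
// until it is XACKed, which is what gives us at-least-once delivery.
async function consumeOnce(): Promise<void> {
  const response = await redis.xreadgroup(
    'GROUP', GROUP, CONSUMER,
    'COUNT', 1,
    'BLOCK', 5000,
    'STREAMS', STREAM, '>'
  );
  if (!response) return;

  const [, entries] = (response as unknown as Array<[string, Array<[string, string[]]>]>)[0];
  for (const [id, fields] of entries) {
    await processJob(fields);             // hypothetical handler; must be idempotent
    await redis.xack(STREAM, GROUP, id);  // acknowledge only after success
  }
}

// Periodically reclaim entries that a crashed consumer never acknowledged,
// so they don't sit stuck in its PEL forever (the 60s idle threshold is arbitrary).
async function reclaimStuckJobs(): Promise<void> {
  await redis.xautoclaim(STREAM, GROUP, CONSUMER, 60_000, '0');
}

async function processJob(fields: string[]): Promise<void> {
  // ... process the flattened field/value pairs for this job
}
```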
any off the shelf message queue solutions for typescript+redis that handle this type of thing? bullmq pro has this but costs $95/month 🫠 https://docs.bullmq.io/bullmq-pro/groups/rate-limiting
what's your use case now?
Sending notifications in a way that won’t get rate limited
We used Faktory for quite a while before rolling our own
hm seems like their OSS tier doesn't support it either
what prompted you running your own and how's it built? chatted with @downshift.eth over him looking into extending/building an opinionated bullmq version end of last year to handle microsub
still what i'm using, but running it myself on a managed Redis instance need to re-write it all in effect
we use qstash from upstash.com
Big reason why we went with our own solution was to have this kind of flexibility (for something that's really not that hard to implement directly in Redis). Unfortunately we don't have a simple-to-extract, open-sourceable solution, otherwise we would offer it. https://paragraph.xyz/new/@sds.eth/building-a-job-processing-system
makes sense. surprised there isn't a defacto open source solution for this already
I've been happy with Upstash Qstash for Yoink notifs
A new year's resolution is to write more technical blog posts about challenges we're working on for Warpcast + Farcaster. Here's a post discussing our migration from Faktory to our own hand-rolled job processing framework, and the lessons learned. https://paragraph.xyz/@sds.eth/building-a-job-processing-system
did you consider temporal at any point? too heavy?
Seemed like overkill for the vast majority of jobs we ran, but we do use Temporal for anything that touches money, generally speaking.
hell yes
Nice! we had something similar at Coffee meets Bagel, sorted sets + LUA scripts to maintain atomicity between commands. Was a ton of fun to build and worked pretty well (especially once Redis added sharding)
love this type of stuff, would read more!
Count on it
Thanks because I feel absolutely smarter, but relatively dumber. 💜💜💜
very cool to read about decision making behind the scenes! pls share more
Did you consider Temporal? Also in OSS land, this is a good one to take inspo from (we're happily using it in production, only dependency is postgres) https://github.com/oban-bg/oban
We use Temporal for a subset of high-value jobs (read: money processing), but it felt like overkill for all jobs, especially at the volume we were doing. If you really care about what the exact state of a job is and it’s not an inherently idempotent operation that can simply be retried, Temporal seems to be a reliable choice. Haven’t heard of oban—thanks for putting it on the radar. In our case, it’s really nice to have the job framework also written in TypeScript, since our entire backend codebase is TypeScript.
Oh absolutely. I just shared for potential code reading to take ideas. It’s a really mature project. So it may save you time in the future- maybe!
very great 😊