In the last six months, the open source Mastodon platform has attracted millions of new users and made organizations contemplate creating their own servers (called instances, in Mastodon parlance). It’s not hard to set up a Mastodon instance to support a handful of users. However, it is hard to set up a Mastodon server that can handle a lot of traffic because the default configuration leaves much to be desired.
In my previous article, “How to Boost Mastodon Server Performance with Redis,” I noted that one chokepoint in a Mastodon server is its Sidekiq queues, which in turn depend on Redis queues.
The Mastodon tech stack is built using Redis open source (Redis OSS), and it works great for the purpose. The usual way to configure Mastodon is to run it and Redis OSS on the same machine, and to scale that setup with Redis Sentinel if needed.
“Free” is wonderful, as is the support of the open source community. Redis OSS makes sense for ordinary workloads (whatever “ordinary” means to you) on a basic Mastodon instance. But you might want to consider additional options.
Using Redis OSS with Mastodon is free if you only look at licensing costs. It may not be optimal in the larger context of application performance, or even in the context of total cost of ownership. Where should you put your resources — technical, financial and human?
In this article, we explore the practicalities of using Redis Enterprise Cloud to power the queues for Sidekiq. Redis Enterprise Cloud is a fully managed database-as-a-service and offers enterprise capabilities such as Active-Active clustering topology that scales linearly. Since both of us are benchmark writers, and Filipe is the performance guru at Redis, you’re about to see lots of numbers and graphs. Fear not: We explain everything as we go.
Since we learned from other Mastodon administrators’ experience that the job queues often are the bottleneck, exhibiting 100% CPU load in the Redis process during high-traffic periods, we theorized that we could improve the results by removing Redis from the Mastodon server. We realized that we needed to connect Mastodon and Sidekiq to an external Redis instance, most conveniently to a Redis Enterprise Cloud instance.
We discovered along the way that Mastodon wasn’t designed for that plan, although we found a pull request in its GitHub repository to fix the problem. We also discovered that conventional HTTP load testing wouldn’t help much with Mastodon, but we could adapt a Sidekiq benchmark to compare the performance of Redis OSS and Redis Enterprise Cloud.
Connecting Mastodon and Sidekiq to Redis Enterprise Cloud
Our first step was to demonstrate that we could connect Mastodon and its Sidekiq job queue to an external Redis Enterprise Cloud instance. And, of course, we needed to have such an instance to test.
Using Terraform, Filipe created a four-shard Redis Enterprise Cloud cluster in AWS, rated for 100,000 operations per second (ops/sec) with a 10 GB deployment.
At the time we performed these tests, Mastodon didn’t support connecting the Sidekiq queues to anything but a local Redis database. There was a pull request to enable this in the Mastodon GitHub repository, which we applied to our Mastodon instance.
The patch adds Ruby code after line 500 of
# With the environment now reloaded, update Sidekiq to use the Redis config that was provided earlier interactively, in case it differs from the default localhost:6379.
# When the admin user is created, User dispatches an 'account.created' event to Sidekiq, which connects to Redis.
Sidekiq.configure_client do |config|
  new_params = REDIS_SIDEKIQ_PARAMS.dup
  new_params['url'] = "redis://:#{env['REDIS_PASSWORD']}@#{env['REDIS_HOST']}:#{env['REDIS_PORT']}/0"
  config.redis = new_params
end
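The patch builds the connection URL by interpolating the password, host and port directly. As a minimal sketch of the same idea, the helper below assembles a Redis URL from environment-style settings and percent-encodes the password so that characters such as "@" or "/" cannot break the URL. The name redis_url_from and the sample values are hypothetical and not part of Mastodon; the fallback of localhost:6379 mirrors the default mentioned in the patch comment.

```ruby
require "erb"

# Hypothetical helper (not part of Mastodon): builds a Redis URL from a
# hash-like object such as ENV, percent-encoding the password.
def redis_url_from(env, db: 0)
  host = env.fetch("REDIS_HOST", "localhost")
  port = Integer(env.fetch("REDIS_PORT", "6379")) # fails fast on a non-numeric port
  password = env.fetch("REDIS_PASSWORD", "")
  # Omit the userinfo part entirely when no password is configured.
  auth = password.empty? ? "" : ":#{ERB::Util.url_encode(password)}@"
  "redis://#{auth}#{host}:#{port}/#{db}"
end

# Sample settings are invented for illustration only.
sample_env = {
  "REDIS_HOST" => "redis.example.com",
  "REDIS_PORT" => "12345",
  "REDIS_PASSWORD" => "p@ss/word"
}
puts redis_url_from(sample_env)
```

Passing ENV instead of the sample hash would read the real environment, since ENV also responds to fetch.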
We needed to verify that the Sidekiq queues were running against our Redis Enterprise Cloud instance. To do so, we monitored the database.
The following console log was captured on the Redis Enterprise Cloud cluster. The “queue” entries show that the Mastodon/Sidekiq job queues reach the correct database instance rather than running locally on the Mastodon server.
1680648013.303105 [0 184.108.40.206:46734] "brpop" "queue:default" "queue:pull" "queue:ingress" "queue:push" "queue:mailers" "queue:scheduler" "2"
1680648014.047110 [0 220.127.116.11:46898] "evalsha" "f4a8a5467f9f4697a26fdfb839476b9ee52e897c" "1" "retry" "1680648014.0486577"
1680648014.047110 [0 18.104.22.168:46898] "evalsha" "f4a8a5467f9f4697a26fdfb839476b9ee52e897c" "1" "schedule" "1680648014.0491111"
1680648014.047110 [0 22.214.171.124:46898] "scard" "processes"
1680648014.679114 [0 126.96.36.199:46910] "brpop" "queue:default" "queue:mailers" "queue:ingress" "queue:push" "queue:scheduler" "queue:pull" "2"
1680648014.679114 [0 188.8.131.52:46872] "brpop" "queue:mailers" "queue:default" "queue:pull" "queue:ingress" "queue:push" "queue:scheduler" "2"
1680648014.679114 [0 184.108.40.206:46816] "brpop" "queue:default" "queue:ingress" "queue:mailers" "queue:push" "queue:pull" "queue:scheduler" "2"
1680648014.879116 [0 220.127.116.11:46808] "brpop" "queue:mailers" "queue:push" "queue:default" "queue:ingress" "queue:pull" "queue:scheduler" "2"
1680648015.079117 [0 18.104.22.168:46854] "brpop" "queue:push" "queue:ingress" "queue:default" "queue:pull" "queue:mailers" "queue:scheduler" "2"
1680648015.083117 [0 22.214.171.124:46782] "brpop" "queue:pull" "queue:scheduler" "queue:default" "queue:ingress" "queue:mailers" "queue:push" "2"
1680648015.083117 [0 126.96.36.199:46862] "brpop" "queue:push" "queue:ingress" "queue:default" "queue:pull" "queue:mailers" "queue:scheduler" "2"
1680648015.083117 [0 188.8.131.52:45206] "brpop" "queue:ingress" "queue:default" "queue:mailers" "queue:push" "queue:scheduler" "queue:pull" "2"
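Reading such a log by eye works for a dozen lines, but a short script can tally it. The sketch below assumes each line follows the layout seen above (timestamp, database number and client address in brackets, then double-quoted arguments, as the Redis MONITOR command prints them with straight quotes). The function name summarize_monitor and the three-line sample are illustrative choices, not part of any Redis or Sidekiq tooling.

```ruby
# Layout assumed: <timestamp> [<db> <client>] "<command>" "<arg>" ...
MONITOR_LINE = /\A(?<ts>\d+\.\d+) \[(?<db>\d+) (?<client>\S+)\] (?<args>.*)\z/

# Hypothetical helper: counts commands and Sidekiq queue names in log lines.
def summarize_monitor(lines)
  commands = Hash.new(0)
  queues = Hash.new(0)
  lines.each do |line|
    match = MONITOR_LINE.match(line.strip)
    next unless match # skip anything that is not a command line

    args = match[:args].scan(/"((?:[^"\\]|\\.)*)"/).flatten
    next if args.empty?

    commands[args.first.downcase] += 1
    args.drop(1).each { |arg| queues[arg] += 1 if arg.start_with?("queue:") }
  end
  { commands: commands, queues: queues }
end

# Three lines taken from the log above.
SAMPLE = <<~LOG.lines
  1680648013.303105 [0 184.108.40.206:46734] "brpop" "queue:default" "queue:pull" "queue:ingress" "queue:push" "queue:mailers" "queue:scheduler" "2"
  1680648014.047110 [0 220.127.116.11:46898] "evalsha" "f4a8a5467f9f4697a26fdfb839476b9ee52e897c" "1" "retry" "1680648014.0486577"
  1680648014.047110 [0 22.214.171.124:46898] "scard" "processes"
LOG

summary = summarize_monitor(SAMPLE)
puts summary[:commands].inspect
puts summary[:queues].keys.sort.inspect
```

Seeing all six Sidekiq queue names in the tally is the same evidence the log gives by inspection: the workers are polling the remote database.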
The four-shard cluster seemed like overkill, so we tried again with a smaller, single-shard Redis Enterprise Cloud database (rated for 25,000 ops/sec and 5 GB deployment), which showed 15 to 30 ops/sec load from the Sidekiq queues coming from Mastodon.
Modifying the Sidekiq Load Testing Tool
Sidekiq has two benchmarking tools in its repository. We chose the simpler one, which resides at
The Sidekiq load test tool creates 100,000 no-op jobs and drains them as fast as possible. As the code is written, it also uses toxiproxy to simulate network latency against a local instance of Redis. Since we were testing against a remote Redis Enterprise Cloud cluster, we didn’t need toxiproxy; we commented out that code.
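To make the "enqueue, then drain" shape of such a test concrete, here is a minimal in-process analogue. It is not the Sidekiq tool and does not touch Redis: it fills Ruby's thread-safe Queue with no-op jobs, drains it with a pool of worker threads, and times the drain. The function name drain_noop_jobs and the default of 10 workers are assumptions made for illustration.

```ruby
# In-process analogue only: no Sidekiq, no Redis, no network.
def drain_noop_jobs(job_count, concurrency: 10)
  queue = Queue.new
  job_count.times { |i| queue << i }
  queue.close # once closed and empty, pop returns nil instead of blocking

  processed = Array.new(concurrency, 0) # one counter slot per worker
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  workers = Array.new(concurrency) do |slot|
    Thread.new do
      # Each job is a no-op: popping it is all the work there is.
      processed[slot] += 1 while queue.pop
    end
  end
  workers.each(&:join)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  { processed: processed.sum, seconds: elapsed }
end

result = drain_noop_jobs(100_000)
puts "drained #{result[:processed]} jobs in #{result[:seconds].round(3)} s"
```

Because the jobs do nothing, the elapsed time measures queue overhead alone, which is the point of a no-op load test: with a real Redis behind the queue, that overhead is dominated by round trips to the database.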
Then we added the following Ruby code to read the Redis password, port and host from the environment, and we used it to configure the Redis connection for the benchmark.
Sidekiq.configure_server do |config|
  config.options[:concurrency] = 10
  # Read the connection settings from the environment, with local defaults.
  redis_pass = ENV['REDIS_PASSWORD'] || ''
  redis_port = ENV['REDIS_PORT'] || 6380
  redis_host = ENV['REDIS_HOST'] || "127.0.0.1"
  config.redis = { password: redis_pass, host: redis_host, port: redis_port }
  # config.redis = { db: 13, port: 6380, driver: :hiredis }
  config.options[:queues] << "default"
  config.logger.level = Logger::ERROR
  config.average_scheduled_poll_interval = 2
  config.reliable! if defined?(Sidekiq::Pro)
end
Performing Sidekiq Benchmarks against a Single Shard on Redis Enterprise Cloud
Running that (modified) Sidekiq load test showed about 13,000 ops/sec and a latency of 0.06 milliseconds.
In our experiments, we increased the number of jobs from 100,000 to 5 million. As the screenshot illustrates, the throughput is about the same (about 13,000 ops/sec) and the latency is about the same (about 0.06 ms), although the Redis memory usage increased to about 1.3 GB. The increased Redis memory usage from the larger queue is not a surprise.
The load test tool reports that processing 5 million jobs took 400 seconds, so each job took 80 microseconds on average (400 seconds divided by 5 million jobs), very slightly more than with the smaller queue.
Clearly, the bottleneck for the Sidekiq queue is not the Redis Enterprise Cloud shard, which stayed at about 13,000 ops/sec and never reached its rated 25,000 ops/sec capacity.
More Sidekiq Benchmarks Using Redis OSS and Redis Enterprise Cloud
In the previous tests, we load-tested Sidekiq against a single Redis Enterprise Cloud shard. What happens when we test against Redis OSS?
We set up a single Redis OSS database on an m5.large AWS instance. That should be roughly comparable to the single-shard Redis Enterprise Cloud database, even though it lacks the Redis on Flash and Active-Active features. We re-ran the Sidekiq load test with 5 million jobs. This time, the test completed in 427 seconds, meaning that the average time to complete a job was 85 microseconds.
We also set up another four-shard Redis Enterprise Cloud database to the same specifications as the 10 GB, 100,000 ops/sec cluster configuration we first showed you. This configuration completed the 5 million-job Sidekiq load test in 387 seconds, giving us an average time to complete a job of 77 microseconds. It also showed lower latency.
To summarize: Redis OSS is a little slower and has higher latency than a similarly sized single-shard Redis Enterprise Cloud instance, while a four-shard Redis Enterprise Cloud instance is a little faster, has higher capacity and has lower latency.
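The per-job figures quoted for the three runs follow directly from the total run times. The short sketch below reproduces that arithmetic from the numbers reported in this article (5 million jobs; 427, 400 and 387 seconds) and also derives jobs per second. The constant and function names are our own labels for illustration.

```ruby
JOBS = 5_000_000

# Total run times in seconds, as reported in the article.
RUNS = {
  "Redis OSS (m5.large)"             => 427,
  "Redis Enterprise Cloud, 1 shard"  => 400,
  "Redis Enterprise Cloud, 4 shards" => 387
}.freeze

# Average time per job in microseconds, rounded to one decimal place.
def per_job_microseconds(total_seconds, jobs)
  (total_seconds.to_f / jobs * 1_000_000).round(1)
end

# Average throughput in jobs per second, rounded to a whole number.
def jobs_per_second(total_seconds, jobs)
  (jobs / total_seconds.to_f).round
end

RUNS.each do |name, seconds|
  puts format("%-34s %5.1f us/job %7d jobs/sec",
              name,
              per_job_microseconds(seconds, JOBS),
              jobs_per_second(seconds, JOBS))
end
```

The three runs differ by only a few percent in either direction, which is consistent with the observation that the Redis shard itself was not the limiting factor in this benchmark.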
What all that tells us: Removing the Redis database from the virtual machine that runs Mastodon and Sidekiq should make Mastodon handle high loads more gracefully, with fewer stalls and posting failures.
To prove that conclusively, we plan to set up a production Mastodon node with an external Redis Enterprise Cloud cluster to handle the job queues and perhaps the PostgreSQL cache as well, and we will monitor how it scales with lots of users.
If you’d like to get ready to try all this yourself, you should start by exploring Redis Enterprise Cloud. A free tier instance might not be enough to use for your own high-capacity Mastodon server, but it certainly allows you to become familiar with setting up and using the database, and it will only cost you a little time.