As the DevOps Engineer behind toot.community, a Mastodon instance, I’ve had my fair share of challenges in keeping our services running smoothly. The decentralized nature of the fediverse, while an asset in many ways, brought a challenge to our server infrastructure. This post is a dive into how we navigated one of these challenges and eventually found a solution that worked for us.
The Problem: Spikes in Server Demand
Our servers faced a consistent load of about 30 requests per second, but this could spike dramatically – up to 400-500 requests per second – whenever there was significant interaction involving an account on our instance. This is a common scenario in the fediverse, where many independent servers query each instance, leading to unpredictable surges.
Our existing setup, built around the DigitalOcean Cluster Autoscaler, can scale our infrastructure up and down as demand changes, but it struggles to keep up with these abrupt increases. And for financial reasons, maintaining enough capacity to handle peak demand at all times was not feasible.
When zooming into the type of traffic, the problem becomes clear:
In under three minutes, we received over 30,000 requests for one single path. It later turned out that this user had replied to one of George Takei's posts: a popular account on the fediverse. Mr. Takei has nearly half a million followers, and all the instances hosting those followers started requesting information about our user.
The Initial Solution: Cloudflare Page Caching
Back then, we were already behind Cloudflare, which is popular for its affordability and effectiveness as a content delivery network (CDN). We started by trying out their Page Caching solution.
The Vary Header Issue
The `Vary` header in HTTP responses plays a critical role in caching. It tells a caching system which request headers to consider when deciding whether a cached response may be served. The Mastodon software, which powers toot.community, uses this header intelligently, as seen in this example:
```shell
$ curl -vvv https://toot.community/@support/110769156286181747 -H "Accept: application/json" 2>&1 | grep --extended-regexp --ignore-case "cache-control|vary"
< cache-control: max-age=180, public
< vary: Accept, Accept-Language, Cookie
```
Here, responses vary based on the `Accept` header. The issue arose because Cloudflare, our caching solution, did not support the `Vary` header for API and web traffic. This meant that if a JSON response was cached first, subsequent requests expecting HTML content would incorrectly receive the cached JSON, leading to faulty behavior.
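To illustrate the failure mode, here is a minimal sketch of a Vary-aware cache. This is a toy model, not Cloudflare's or Fastly's actual implementation: the cache key is built from the URL plus the request headers listed in the response's `Vary` header, and ignoring `Vary` collapses the HTML and JSON variants into one entry.

```python
class VaryAwareCache:
    """Toy HTTP cache keyed on URL plus the headers named in Vary."""

    def __init__(self):
        self.store = {}

    def _key(self, url, request_headers, vary):
        # Include each request header listed in Vary in the cache key.
        parts = [url]
        for name in vary:
            parts.append(f"{name}={request_headers.get(name, '')}")
        return tuple(parts)

    def put(self, url, request_headers, vary, body):
        self.store[self._key(url, request_headers, vary)] = body

    def get(self, url, request_headers, vary):
        return self.store.get(self._key(url, request_headers, vary))


cache = VaryAwareCache()
url = "/@support/110769156286181747"

# Respecting Vary: a remote instance caches the JSON representation first...
cache.put(url, {"Accept": "application/json"}, ["Accept"], '{"id": "..."}')

# ...and a browser asking for HTML gets a clean miss, not the cached JSON.
assert cache.get(url, {"Accept": "text/html"}, ["Accept"]) is None

# Ignoring Vary (empty list): both requests share one key, so the browser
# is served whichever body happened to be cached first -- the JSON.
cache.put(url, {"Accept": "application/json"}, [], '{"id": "..."}')
assert cache.get(url, {"Accept": "text/html"}, []) == '{"id": "..."}'
```

The second half of the example is exactly what we saw in production: HTML visitors receiving a cached JSON document because the cache key did not account for the `Accept` header.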
The Shift to Fastly
Our search for a more robust caching solution led us to Fastly. Fastly's support for the `Vary` header (Best Practices Using Vary Header) meant that we could cache responses effectively based on the content format requested by the user.
Since implementing Fastly, our backend servers have stabilized significantly. The dynamic spikes in traffic, a common occurrence in the fediverse, are now effectively absorbed by Fastly's caching layer. This has allowed us to maintain a consistent and reliable service for our users.
A Peek Behind the Scenes
For those interested in the nitty-gritty of how I manage our infrastructure, the code is open for you to explore. Check out the toot-community organisation on GitHub for a deeper dive into how we keep toot.community running smoothly.
This experience has highlighted a few key lessons in managing server infrastructure:
- Understand your traffic: The decentralized and dynamic nature of the fediverse means traffic can be unpredictable.
- Choose the right tools: Not all CDNs are equal. It's crucial to understand the limitations and strengths of each option.
- Be flexible in solutions: Be prepared to reevaluate when a chosen solution doesn’t work out as expected.
If you have insights or experiences to share about server scalability in the fediverse, I’d love to hear them. You can find me on Mastodon at https://toot.community/@jorijn