Tackling Scalability Challenges at toot.community: From Cloudflare to Fastly

As the DevOps Engineer behind toot.community, a Mastodon instance, I’ve had my fair share of challenges in keeping our services running smoothly. The decentralized nature of the fediverse, while an asset in many ways, brings its own challenges for our server infrastructure. This post dives into how we navigated one of them and eventually found a solution that worked for us.

The Problem: Spikes in Server Demand

Our server faced a consistent load of about 30 requests per second, but this could spike dramatically – up to 400-500 requests per second – whenever a large account interacted with a post on our instance. This is a common scenario in the fediverse, where many remote servers query the originating instance at the same time, leading to these unpredictable surges.

Screenshot of our monitoring solution, showing these spikes in HTTP traffic.

Our existing setup, a DigitalOcean Cluster Autoscaler, can scale our infrastructure up and down as demand changes, but it struggled to keep up with these abrupt increases. For financial reasons, maintaining enough resources to handle peak demand at all times was not feasible.

When zooming into the type of traffic, the problem becomes clear:

Screenshot of the type of traffic on toot.community.

In under three minutes, we received over 30,000 requests on one single path. Later, it turned out that one of our users had replied to one of George Takei’s posts. Mr. Takei runs a popular account on the fediverse with nearly half a million followers, and all the instances hosting those followers started requesting information about our user.
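
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the burst was spread evenly over the full three minutes, which understates the real peak, and it reuses the baseline of about 30 requests per second mentioned earlier. The variable names are made up for this sketch.

```python
# Back-of-the-envelope estimate of the burst.
# Assumption: requests are spread evenly over the whole window.
BASELINE_RPS = 30          # typical load on the whole instance
BURST_REQUESTS = 30_000    # requests observed on one single path
BURST_SECONDS = 3 * 60     # "under three minutes", so this is an upper bound


def average_rps(requests: int, seconds: int) -> float:
    """Average request rate if the burst were spread evenly."""
    return requests / seconds


burst_rps = average_rps(BURST_REQUESTS, BURST_SECONDS)
ratio = burst_rps / BASELINE_RPS

print(f"average burst rate: at least {burst_rps:.0f} requests per second")
print(f"roughly {ratio:.1f}x the usual load, on one path alone")
```

Even under this generous assumption, a single path generated several times the traffic the whole instance normally sees.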

The Initial Solution: Cloudflare Page Caching

Back then, we were already using Cloudflare, which is popular for its affordable and effective content delivery network (CDN) services. We started by trying their Page Caching solution.

Screenshot showing spikes being served by Cloudflare, instead of our servers.

The Vary Header Issue

The Vary header in HTTP responses plays a critical role in caching. It tells a caching system to consider certain request headers when deciding whether to serve a cached response. The Mastodon software, which powers toot.community, intelligently uses this header, as seen in this example:

$ curl -vvv https://toot.community/@support/110769156286181747 -H "Accept: application/json" 2>&1 | grep --extended-regexp --ignore-case "cache-control|vary"
< cache-control: max-age=180, public
< vary: Accept, Accept-Language, Cookie

Here, responses vary based on the Accept header, among others. The issue was that Cloudflare, our caching solution, did not support the Vary header for API and web traffic. If a JSON response was cached first, subsequent requests expecting HTML would incorrectly receive the cached JSON, leading to faulty behavior.

This limitation of Cloudflare is well-documented (Cloudflare Cache Control, Cloudflare Community Discussion). That ruled out Cloudflare as a viable solution for caching our dynamic content.
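
To make the problem concrete, here is a minimal toy model of a cache that either ignores or honors the Vary header. It is a sketch and not how Cloudflare or any other CDN is actually implemented: the `origin` function, the `ToyCache` class and the response bodies are all hypothetical, and header names are assumed to be lowercase.

```python
from typing import Dict, List, Tuple

Headers = Dict[str, str]  # assumption: header names are lowercase here


def origin(path: str, request_headers: Headers) -> Tuple[Headers, str]:
    """Hypothetical origin that answers with JSON or HTML based on Accept."""
    if "json" in request_headers.get("accept", ""):
        body = '{"type": "Note"}'
    else:
        body = "<html>a post</html>"
    return {"vary": "Accept, Accept-Language, Cookie"}, body


class ToyCache:
    """Toy cache; honor_vary decides whether Vary is part of the cache key."""

    def __init__(self, honor_vary: bool) -> None:
        self.honor_vary = honor_vary
        self.vary_names: Dict[str, List[str]] = {}
        self.store: Dict[Tuple[str, ...], str] = {}

    def _key(self, path: str, request_headers: Headers) -> Tuple[str, ...]:
        if not self.honor_vary:
            return (path,)  # the URL alone identifies the cached object
        names = self.vary_names.get(path, [])
        # Add the value of every header listed in Vary to the key.
        return (path,) + tuple(request_headers.get(n, "") for n in names)

    def get(self, path: str, request_headers: Headers) -> str:
        key = self._key(path, request_headers)
        if key in self.store:
            return self.store[key]  # cache hit
        response_headers, body = origin(path, request_headers)
        vary = response_headers.get("vary", "")
        # Remember which request headers the origin says the response varies on.
        self.vary_names[path] = [n.strip().lower() for n in vary.split(",") if n.strip()]
        self.store[self._key(path, request_headers)] = body
        return body


PATH = "/@support/110769156286181747"
wants_json = {"accept": "application/json"}
wants_html = {"accept": "text/html"}

naive = ToyCache(honor_vary=False)
naive.get(PATH, wants_json)                 # JSON gets cached first
naive_html = naive.get(PATH, wants_html)    # browser is handed the cached JSON

aware = ToyCache(honor_vary=True)
aware.get(PATH, wants_json)
aware_html = aware.get(PATH, wants_html)    # separate cache entry, real HTML

print("ignoring Vary:", naive_html)
print("honoring Vary:", aware_html)
```

The only difference between the two caches is the cache key: once the values of the headers named in Vary are part of it, JSON and HTML responses for the same URL no longer overwrite each other.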

The Shift to Fastly

Our search for a robust caching solution led us to Fastly. Fastly’s support for the Vary header (Best Practices Using Vary Header) meant that we could cache responses effectively based on the content format requested by the user.

Since implementing Fastly, our backend servers have stabilized significantly. The dynamic spikes in traffic, a common occurrence in the Fediverse, are now effectively managed through Fastly’s caching mechanism. This has allowed us to maintain a consistent and reliable service for our users.
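
As a rough illustration of why caching absorbs these spikes, the sketch below replays a burst of 30,000 requests against a single cache that keeps a response for 180 seconds, the max-age from the cache-control header shown earlier. It is a simplified model and not Fastly’s actual behavior: it assumes one cache node, an evenly spread burst and a single response variant. The `TtlCache` class and the example path are hypothetical.

```python
from typing import Dict, Hashable

MAX_AGE = 180  # seconds, taken from "cache-control: max-age=180, public"


class TtlCache:
    """Toy single-node cache that tracks how often the origin is asked."""

    def __init__(self, max_age: float) -> None:
        self.max_age = max_age
        self.expires_at: Dict[Hashable, float] = {}
        self.origin_fetches = 0

    def request(self, key: Hashable, now: float) -> bool:
        """Return True on a cache hit, False when the origin had to be asked."""
        if self.expires_at.get(key, float("-inf")) > now:
            return True
        self.origin_fetches += 1
        self.expires_at[key] = now + self.max_age
        return False


cache = TtlCache(MAX_AGE)
total_requests = 30_000
burst_seconds = 180.0
key = ("/@user/123", "application/json")  # hypothetical path plus Accept value

hits = 0
for i in range(total_requests):
    now = i * burst_seconds / total_requests  # evenly spread timestamps
    if cache.request(key, now):
        hits += 1

print(f"cache hits: {hits}, origin fetches: {cache.origin_fetches}")
```

In this idealized model the origin is asked only when the cached copy is missing or expired. A real CDN has many edge locations and several response variants per URL, so the number of origin fetches is higher in practice, but it stays far below the number of incoming requests.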

Screenshot from Fastly, showing a distinction between cache hits & misses. About 60% of all HTTP traffic is handled by Fastly’s cache.
Our monitoring solution now clearly shows a more steady stream of HTTP traffic over the course of a month.

A Peek Behind the Scenes

For those interested in the nitty-gritty of how I manage our infrastructure, the code is open for you to explore. Check out the toot-community organisation on GitHub for a deeper dive into how we keep toot.community running smoothly.

Lessons Learned

This experience has highlighted a few key lessons in managing server infrastructure:

  1. Understand your traffic: The decentralized and dynamic nature of the fediverse means traffic can be unpredictable.
  2. Choose the right tools: Not all CDN tools are equal. It’s crucial to understand the limitations and strengths of each option.
  3. Be flexible in solutions: Be prepared to reevaluate when a chosen solution doesn’t work out as expected.

If you have insights or experiences to share about server scalability in the fediverse, I’d love to hear them. You can find me on Mastodon at https://toot.community/@jorijn

