Tackling Scalability Challenges at toot.community: A Journey from Cloudflare to Fastly

As the DevOps Engineer behind toot.community, a Mastodon instance, I've had my fair share of challenges in keeping our services running smoothly. The decentralized nature of the fediverse, while an asset in many ways, puts unusual pressure on our server infrastructure. This post describes how we handled one of these challenges and eventually found a solution that worked for us.

The problem: Spikes in server demand

Our server handles a steady load of about 30 requests per second, but this can spike to 400-500 requests per second whenever a large account on our instance is interacted with. This is a common scenario in the fediverse: many remote servers query our instance at roughly the same time, which produces these unpredictable surges.

Our current setup uses the DigitalOcean Cluster Autoscaler, which scales our infrastructure up and down in response to changing demand, but it struggles to keep up with these sudden surges. Because of financial constraints, keeping enough resources online to handle peak demand at all times is not feasible.

The initial solution: Cloudflare page caching

We initially turned to Cloudflare. Cloudflare is well-known for its affordability and effectiveness in content delivery network (CDN) services, making it seem like a promising solution.

The Vary header issue

The Vary header in HTTP responses is critical to caching. It tells a cache which request headers to take into account when deciding whether a stored response can be served for a new request. The Mastodon software, which powers toot.community, makes use of this header, as this example shows:

$ curl -vvv https://toot.community/@support/110769156286181747 -H "Accept: application/json" 2>&1 | grep --extended-regexp --ignore-case "cache-control|vary"
< cache-control: max-age=180, public
< vary: Accept, Accept-Language, Cookie

Here, responses differ based on the Accept header. The problem was that Cloudflare, our caching solution at the time, did not support the Vary header for API and web traffic. If a JSON response was cached first, subsequent requests expecting HTML would incorrectly receive the cached JSON, resulting in faulty behavior.
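To make the failure mode concrete, here is a minimal sketch of a cache that either ignores or honors Vary. It is a toy model, not how Cloudflare or Fastly are implemented: the `origin` function, the `Cache` class, and the response bodies are all hypothetical, and request header names are assumed to be lowercase.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    body: str
    headers: dict[str, str]


def origin(request_headers: dict[str, str]) -> Response:
    """Hypothetical origin: returns JSON or HTML depending on Accept."""
    wants_json = "application/json" in request_headers.get("accept", "")
    body = '{"type": "Note"}' if wants_json else "<html>post</html>"
    return Response(body, {"vary": "Accept, Accept-Language, Cookie"})


@dataclass
class Cache:
    honor_vary: bool
    store: dict[tuple, Response] = field(default_factory=dict)
    # Header names each URL varies on, learned from earlier origin responses.
    vary_names: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def key(self, url: str, request_headers: dict[str, str]) -> tuple:
        # Without Vary support, the URL alone identifies a cached object.
        names = self.vary_names.get(url, ()) if self.honor_vary else ()
        return (url, tuple(request_headers.get(n, "") for n in names))

    def fetch(self, url: str, request_headers: dict[str, str]) -> Response:
        k = self.key(url, request_headers)
        if k in self.store:
            return self.store[k]
        response = origin(request_headers)
        if self.honor_vary:
            vary = response.headers["vary"].split(",")
            self.vary_names[url] = tuple(n.strip().lower() for n in vary)
            k = self.key(url, request_headers)
        self.store[k] = response
        return response


url = "https://toot.community/@support/110769156286181747"

naive = Cache(honor_vary=False)
naive.fetch(url, {"accept": "application/json"})
print(naive.fetch(url, {"accept": "text/html"}).body)  # {"type": "Note"}

aware = Cache(honor_vary=True)
aware.fetch(url, {"accept": "application/json"})
print(aware.fetch(url, {"accept": "text/html"}).body)  # <html>post</html>
```

The only difference between the two caches is the cache key: the first keys on the URL alone, while the second also includes the values of the request headers named in Vary.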

This limitation of Cloudflare is well documented (Cloudflare Cache Control, Cloudflare Community Discussion). Thus, Cloudflare was not a viable solution for caching dynamic content.

The shift to Fastly

Our quest for a robust caching solution led us to Fastly. Fastly's support for the Vary header (Best Practices Using Vary Header) enabled us to cache responses effectively based on the content format requested by the user.

Incorporating Fastly into our infrastructure was made smoother with the help of Terraform, a tool for building, changing, and versioning infrastructure safely and efficiently. We used a Terraform module (mastodon/terraform-fastly-service), designed for Mastodon instances, encapsulating all the necessary settings for our use case.

Since we implemented Fastly, our backend servers have stabilized significantly. The traffic spikes that are so common in the fediverse are now absorbed by Fastly's cache instead of reaching our backend. This improvement has allowed us to provide a consistent and reliable service for our users.
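To give a rough sense of why caching helps so much here, the sketch below counts how many requests would reach the origin for a single cached object with the max-age=180 shown in the curl output earlier. It is a back-of-the-envelope model under simplifying assumptions, not a measurement: one cache node, one URL, one Accept variant, and an origin that answers instantly. The `origin_hits` helper is hypothetical, and a real CDN with many edge locations would see more origin fetches than this.

```python
def origin_hits(request_times, ttl_seconds):
    """Count origin fetches for one cached object.

    request_times: request timestamps in seconds.
    ttl_seconds: how long a fetched response stays fresh in the cache.
    """
    hits = 0
    expires_at = None
    for t in sorted(request_times):
        # A request reaches the origin only if nothing fresh is cached.
        if expires_at is None or t >= expires_at:
            hits += 1
            expires_at = t + ttl_seconds
    return hits


# 500 requests per second for 10 minutes, all for the same post and variant.
spike = [second + i / 500 for second in range(600) for i in range(500)]

print(len(spike))               # 300000 requests reach the cache
print(origin_hits(spike, 180))  # 4 of them reach the origin
```

Under these assumptions, the origin sees one request every three minutes for that object, no matter how large the spike is.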

A peek behind the scenes

For those interested in the nitty-gritty of how I manage our infrastructure, the code is open for you to explore. Check out our GitHub organization, toot-community, for a deeper dive into our efforts to keep toot.community running smoothly.

Lessons learned

This experience has reinforced a few key lessons in managing server infrastructure:

  1. Understand Your Traffic: The decentralized and dynamic nature of the fediverse means traffic can be unpredictable. Knowing your traffic patterns helps you choose the right tools and strategies.
  2. Choose the Right Tools: Not all tools are created equal. It's crucial to recognize each option's limitations and strengths.
  3. Be Flexible in Solutions: Be ready to pivot when a chosen solution doesn’t work out as expected.
  4. Leverage Community Resources: Tools like Terraform and community-developed modules can significantly streamline the implementation of complex solutions.

Toot.community's journey in managing scalability challenges highlights the importance of understanding both the technical and operational aspects of operating a server in the ever-evolving landscape of the fediverse. It's a continuous learning process, but one that is undeniably rewarding.