How Braze Leverages Ruby at Scale

By Zach McCormick, Aug 18, 2022

If you’re an engineer who reads Hacker News, Developer Twitter, or any other similar information sources out there, you’ve almost certainly come across a thousand articles with titles like the “Speed of Rust vs C”, “What Makes Node.js Faster Than Java?”, or “Why You Should Use Golang and How to Get Started.” These articles generally make the case that there’s this one specific language that’s the obvious choice for scalability or speed—and that the only thing for you to do is embrace it.

While I was in college and during my first year or two as an engineer, I would read these articles and immediately spin up a pet project to learn the new language or framework du jour. After all, it was guaranteed to work “at global scale” and “faster than anything you’ve ever seen,” and who can resist that? Eventually I figured out that I didn’t actually need either of these very specific things for most of my projects. And as my career progressed, I came to realize that no language or framework choice would actually give me these things for free.

Instead, I discovered that it’s architecture that is actually the biggest lever when you’re looking to scale systems, not languages or frameworks.

Here at Braze, we operate at an immense global scale. And yes, we use Ruby and Rails as two of our primary tools to do it. However, there’s no “global_scale = true” configuration value that makes all that possible—it’s the result of a well-thought-out architecture spanning deep within applications all the way to deployment topologies. The engineers at Braze are constantly examining scaling bottlenecks and figuring out how to make our system faster, and the answer usually isn’t “move away from Ruby”: It’s almost certainly going to be a change to the architecture.

So let’s take a look at how Braze leverages thoughtful architecture to actually solve for speed and a massive global scale, and where Ruby and Rails fit in (and don’t)!

The Power of Best-in-Class Architecture

A Simple Web Request

Because of the scale we operate at, we know that the devices associated with our customers' user bases will make billions of web requests every single day that will have to be served by some Braze web server. And even in the simplest of websites, you're going to have a relatively complex flow associated with a request from a client to the server and back:

  1. It starts with the client’s DNS resolver (usually their ISP) figuring out what IP address to go to, based on the domain in your website’s URL.

  2. Once the client has an IP address, they’ll send the request to their gateway router, which will send it to the “next hop” router (which may happen several times), until the request makes its way to the destination IP address.

  3. From there, the operating system on the server receiving the request will handle the networking details and notify the web server’s waiting process that an incoming request was received on the socket/port it was listening on.

  4. The web server will write the response (the requested resource, maybe an index.html) to that socket, which will travel backwards through the routers back to the client.
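To make steps 3 and 4 concrete, here is a minimal sketch using Ruby’s standard socket library. It is an illustration only, not how Braze serves traffic: the helper names (serve_once, demo_request_cycle), the loopback address, and the response body are all assumptions made for the example. DNS and routing (steps 1 and 2) are skipped because the client connects straight to a local IP.

```ruby
require "socket"

# Hypothetical helper: accept one connection, read the request, write a response.
def serve_once(listener, body)
  client = listener.accept            # step 3: the OS hands over a queued connection
  request_line = client.gets          # e.g. "GET /index.html HTTP/1.1\r\n"
  loop do                             # skip the headers in this sketch
    line = client.gets
    break if line.nil? || line == "\r\n"
  end
  client.write([
    "HTTP/1.1 200 OK",
    "Content-Type: text/html",
    "Content-Length: #{body.bytesize}",
    "Connection: close",
    "",
    body
  ].join("\r\n"))                     # step 4: write the resource to the socket
  client.close
  request_line
end

# Runs one request/response cycle over the loopback interface and returns
# [request_line_seen_by_server, raw_response_seen_by_client].
def demo_request_cycle
  listener = TCPServer.new("127.0.0.1", 0)   # port 0: let the OS pick a free port
  port = listener.addr[1]
  server = Thread.new { serve_once(listener, "<h1>hello</h1>") }

  socket = TCPSocket.new("127.0.0.1", port)
  socket.write("GET /index.html HTTP/1.1\r\nHost: localhost\r\n\r\n")
  response = socket.read                     # read until the server closes
  socket.close

  [server.value, response]
ensure
  listener&.close
end
```

Calling demo_request_cycle returns the request line the server saw and the raw bytes the client received, which is the whole round trip minus the network in between.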

Pretty complicated stuff for a simple website, no? Luckily, many of these things are taken care of for us (more on that in a second). But our system still has data stores, background jobs, concurrency concerns, and more that it has to deal with! Let’s dive into what that looks like.

The First Systems That Support Scale

DNS and name servers typically don’t require a ton of attention. Your Top-Level Domain name server will probably have a few entries to map “yourwebsite.com” to the name servers for your domain, and if you’re using a service like Amazon Route 53 or Azure DNS, they’ll handle the name servers for your domain (e.g. managing A, CNAME, or other types of records). You usually don’t have to think about scaling this part, since that will be handled automatically by the systems you’re using.

The routing part of the flow can get interesting, however. There are a few different routing algorithms, like Open Shortest Path First or Routing Information Protocol, all of them designed to find the fastest/shortest route from client to server. Because the internet is effectively a giant connected graph (or, alternately, a flow network), there may be multiple paths that can be leveraged, each with a corresponding higher or lower cost. It’d be prohibitive to do the work to find the absolute fastest route, so most algorithms use reasonable heuristics to get an acceptable route. Computers and networks aren’t always reliable, so we rely on Fastly to enhance our clients’ ability to route to our servers more quickly.
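As a small illustration of “multiple paths, each with a cost,” the sketch below runs Dijkstra’s algorithm (the shortest-path computation that OSPF is built on) over a tiny made-up network. The router names and link costs in LINKS are invented for the example, and real internet routing involves far more than this.

```ruby
# Hypothetical network: each key is a node, each value maps neighbors to link cost.
LINKS = {
  "client"    => { "isp" => 1 },
  "isp"       => { "transit_a" => 10, "transit_b" => 3 },
  "transit_a" => { "server" => 2 },
  "transit_b" => { "transit_a" => 2, "server" => 12 },
  "server"    => {}
}.freeze

# Dijkstra's algorithm: returns [total_cost, path] for the cheapest route,
# or nil when the target cannot be reached.
def cheapest_route(links, source, target)
  cost = Hash.new(Float::INFINITY)
  cost[source] = 0
  previous = {}
  unvisited = links.keys

  until unvisited.empty?
    current = unvisited.min_by { |node| cost[node] }
    break if cost[current] == Float::INFINITY || current == target
    unvisited.delete(current)

    links[current].each do |neighbor, link_cost|
      candidate = cost[current] + link_cost
      next unless candidate < cost[neighbor]
      cost[neighbor] = candidate
      previous[neighbor] = current
    end
  end

  return nil if cost[target] == Float::INFINITY

  # Walk backwards from the target to rebuild the path.
  path = [target]
  path.unshift(previous[path.first]) while previous.key?(path.first)
  [cost[target], path]
end

if __FILE__ == $PROGRAM_NAME
  total, path = cheapest_route(LINKS, "client", "server")
  puts "#{path.join(' -> ')} (cost #{total})"
  # prints: client -> isp -> transit_b -> transit_a -> server (cost 8)
end
```

Note that the cheapest route here takes more hops than the most direct one, which is exactly why cost, not hop count alone, drives the choice.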

Fastly works by providing points-of-presence (POPs) all over the world with very fast, reliable connections between them. Think of them as the interstate highway of the Internet. Our domains’ A and CNAME records point to Fastly, which causes our clients’ requests to go directly to the highway. From there, Fastly can route them to the right place.

The Front Door to Braze

Okay, so our client’s request has gone down the Fastly highway and is right at the Braze platform’s front door—what happens next?

In a simple case, that front door would be a single server accepting requests. As you can imagine, that wouldn’t scale very well, so we actually point Fastly to a set of load balancers. There are all kinds of strategies that load balancers can use, but imagine that, in this scenario, Fastly round-robins requests to a pool of load balancers evenly. These load balancers will queue up requests, then distribute those requests to web servers, which we can also imagine are being dealt client requests in a round-robin fashion. (In practice there may be advantages for certain kinds of affinity, but that’s a topic for another time.)
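Round-robin distribution is simple enough to sketch in a few lines. The class below is a toy, assuming a fixed list of backends with made-up names; a production load balancer also tracks health, queues requests, and may apply the kinds of affinity mentioned above.

```ruby
# Hypothetical round-robin balancer: hands each request to the next backend in turn.
class RoundRobinBalancer
  def initialize(backends)
    raise ArgumentError, "need at least one backend" if backends.empty?

    @backends = backends
    @next_index = 0
    @lock = Mutex.new   # keeps the counter consistent if several threads call pick
  end

  # Returns the backend that should handle the next request.
  def pick
    @lock.synchronize do
      backend = @backends[@next_index]
      @next_index = (@next_index + 1) % @backends.size
      backend
    end
  end
end

if __FILE__ == $PROGRAM_NAME
  balancer = RoundRobinBalancer.new(%w[web-1 web-2 web-3])
  puts 6.times.map { balancer.pick }.join(", ")
  # prints: web-1, web-2, web-3, web-1, web-2, web-3
end
```

Scaling out in this model means adding entries to the backend list; the selection logic does not change.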

This allows us to scale up the number of load balancers and the number of web servers depending on the throughput of requests we’re getting and the throughput of requests we can handle. So far, we’ve built an architecture that can handle a giant onslaught of requests without breaking a sweat! It can even handle bursty traffic patterns via the elasticity of load balancers’ request queues—which is awesome!

The Web Servers

Finally, we get to the exciting (Ruby) part: The web server. We use Ruby on Rails, but that’s just a web framework—the actual web server is Unicorn. Unicorn works by starting a number of worker processes on a machine, where each worker process listens on an OS socket for work. It handles process management for us, and defers load balancing of requests to the OS itself. We just need our Ruby code to process the requests as fast as possible; everything else is effectively optimized outside of Ruby for us.
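The pre-fork model described here can be sketched with nothing but fork and a shared listening socket. This is a toy illustration of the idea, not Unicorn’s actual implementation: the function names, the loopback port, and the one-line reply are assumptions, and it needs a platform where fork is available (for example Linux or macOS).

```ruby
require "socket"

# The master opens one listening socket, then forks workers that all call
# accept on it. The kernel decides which waiting worker gets each connection.
def start_prefork_server(worker_count)
  listener = TCPServer.new("127.0.0.1", 0)   # port 0: let the OS pick a free port
  pids = worker_count.times.map do
    fork do
      trap("TERM") { exit!(0) }              # exit immediately when asked to stop
      loop do
        client = listener.accept
        client.gets                          # read the request line
        client.write("handled by worker #{Process.pid}\n")
        client.close
      end
    end
  end
  [listener, pids]
end

# Stops every worker, reaps it, and closes the master's copy of the socket.
def stop_prefork_server(listener, pids)
  pids.each { |pid| Process.kill("TERM", pid) }
  pids.each { |pid| Process.wait(pid) }
  listener.close
end

# Sends one request and returns the single reply line.
def send_request(port)
  socket = TCPSocket.new("127.0.0.1", port)
  socket.write("GET / HTTP/1.1\r\n")
  socket.gets
ensure
  socket&.close
end

if __FILE__ == $PROGRAM_NAME
  listener, pids = start_prefork_server(3)
  begin
    5.times { puts send_request(listener.addr[1]) }
  ensure
    stop_prefork_server(listener, pids)
  end
end
```

Nothing in the Ruby code chooses a worker. Each reply simply reports whichever process the operating system woke up, which is the sense in which load balancing is deferred to the OS.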

Because the majority of requests, whether made by our SDK inside of our customers’ applications or via our REST API, are asynchronous (i.e. we don’t need to wait for the operation to complete to return a specific response to clients), the majority of our API servers are extraordinarily simple: they validate the structure of the request and any API key constraints, then toss the request on a Redis queue and return a 200 response to the client if everything checks out.
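The validate-then-enqueue shape looks roughly like the sketch below. Everything specific in it is an assumption for illustration: an in-memory Queue stands in for Redis, VALID_API_KEYS is a made-up key store, the “events” field is an invented payload shape, and the 400 and 401 status codes are just a plausible choice for rejected requests.

```ruby
require "json"

# In-memory stand-in for a Redis-backed job queue (an assumption for this sketch).
JOB_QUEUE = Queue.new
VALID_API_KEYS = ["demo-key"].freeze   # hypothetical key store

# Validates the request and defers the real work to a background worker.
# Returns an HTTP status code.
def ingest(api_key, raw_body, queue: JOB_QUEUE)
  return 401 unless VALID_API_KEYS.include?(api_key)

  payload = JSON.parse(raw_body)
  return 400 unless payload.is_a?(Hash) && payload["events"].is_a?(Array)

  queue << payload   # enqueue and return without waiting for processing
  200
rescue JSON::ParserError
  400
end
```

For example, ingest("demo-key", '{"events":[{"name":"open"}]}') returns 200 and leaves one job on the queue, while a malformed body returns 400 without enqueueing anything. The handler never touches a database, which is why the request path stays so short.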

This request/response cycle takes roughly 10 milliseconds for Ruby code to process—and a portion of that is spent waiting on Memcached and Redis. Even if we were to rewrite all of this in another language, it’s not really possible to squeeze much more performance out of this. And, ultimately, it’s the architecture of everything you’ve read so far that enables us to scale this data ingestion process to meet our customers’ ever-growing needs.

The Job Queues

This is a topic we’ve explored in the past, so I won’t get into this aspect as deeply. To learn more about our job queueing system, check out my post on Achieving Resiliency With Queues. At a high level, what we do is leverage numerous Redis instances that act as job queues, further buffering work that needs to be done. Similar to our web servers, these instances are split across availability zones (to provide higher availability in the case of an issue in a particular availability zone), and they come in primary/secondary pairs using Redis Sentinel for redundancy. We can also scale these both horizontally and vertically to optimize for both capacity and throughput.

The Workers

This is certainly the most interesting part—how do we get workers to scale?

First and foremost, our workers and queues are segmented by a number of dimensions: Customers, types of work, data stores needed, etc. This allows us to have high availability; for instance, if a particular data store is having difficulties, other functions will continue to work perfectly fine. It also allows us to autoscale worker types independently, depending on any of those dimensions. We end up being able to manage worker capacity in a horizontally scalable way—that is, if we have more of a certain type of work, we can scale up more workers.
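One way to picture this segmentation is as a naming scheme for queues plus a per-queue scaling rule. The sketch below is hypothetical throughout: the two dimensions chosen (work type and data store), the jobs_per_worker rule, and the decision to pause queues whose data store is unhealthy are assumptions for illustration, not Braze’s actual scheme.

```ruby
# Hypothetical queue naming: one queue per (work type, data store) pair.
def queue_name(job)
  "#{job.fetch(:type)}:#{job.fetch(:data_store)}"
end

# Groups pending jobs by the queue they belong to.
def segment(jobs)
  jobs.group_by { |job| queue_name(job) }
end

# Hypothetical autoscaling rule, decided independently per queue: one worker
# per `jobs_per_worker` queued jobs, and zero workers for queues whose data
# store is marked unhealthy so that trouble stays contained.
def desired_workers(queues, jobs_per_worker:, unhealthy_stores: [])
  queues.each_with_object({}) do |(name, jobs), plan|
    store = name.split(":").last
    plan[name] =
      if unhealthy_stores.include?(store)
        0
      else
        (jobs.size.to_f / jobs_per_worker).ceil
      end
  end
end
```

With five “send” jobs against one store and two “ingest” jobs against another, a rule of two jobs per worker asks for three workers on the first queue and one on the second. Marking the second store unhealthy changes only that queue’s number.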

Here’s the place where you might start to see language or framework choice matter. Ultimately, a more efficient worker is going to be able to do more work, more quickly. Compiled languages like C or Rust tend to be far faster at computational tasks than interpreted languages like Ruby, and that may lead to more efficient workers for some workloads. However, I spend a great deal of time looking at traces, and raw CPU processing is a surprisingly small amount of it in the big picture at Braze. Most of our processing time is spent waiting for responses from data stores or from external requests, not crunching numbers; we don’t need heavily optimized C code for that.

The Data Stores

So far, everything we’ve covered is pretty scalable. So let’s take a minute and talk about where our workers spend most of their time—data stores.

Anyone who has ever scaled up web servers or asynchronous workers that use a SQL database has probably run into a specific scale problem: Transactions. You might have an endpoint that takes care of completing an Order, which creates two FulfillmentRequests and a PaymentReceipt. If this doesn’t all happen in a transaction, you can end up with inconsistent data. Executing numerous transactions on a single database simultaneously can result in a lot of time spent on locks, or even deadlock. At Braze, we take that scaling problem head-on with the data models themselves, through object independence and eventual consistency. With these principles, we can squeeze a lot of performance out of our data stores.

Independent Data Objects

We leverage MongoDB heavily at Braze, for very good reasons: Namely, it makes it possible for us to scale horizontally by adding MongoDB shards and get near-linear increases in storage and performance. This works very well for our user profiles because of their independence from one another: there are no JOIN statements or constraint relationships to maintain between user profiles. As each of our customers grows or as we add new customers (or both), we can simply add new databases and new shards to existing databases to increase our capacity. We explicitly avoid features like multi-document transactions to maintain this level of scalability.

Aside from MongoDB, we often utilize Redis as a temporary data store for things like buffering analytics information. Because the source of truth for many of those analytics exists in MongoDB as independent documents for a period of time, we maintain a horizontally scalable pool of Redis instances to act as buffers; under this approach, the hashed document ID is used in a key-based sharding scheme, evenly spreading out the load due to independence. Periodic jobs flush those buffers from one horizontally-scaled data store to another horizontally-scaled data store. Scale achieved!
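Key-based sharding on a hashed document ID can be sketched as follows. Plain Ruby hashes stand in for the Redis instances, and the class name, the choice of MD5, and the modulo mapping are assumptions for illustration rather than a description of the real scheme.

```ruby
require "digest"

# Hypothetical pool of buffer shards; plain hashes stand in for Redis instances.
class ShardedBuffer
  def initialize(shard_count)
    @shards = Array.new(shard_count) { Hash.new(0) }
  end

  # Hash the document ID and map the result onto a shard index.
  def shard_index(document_id)
    Digest::MD5.hexdigest(document_id.to_s).to_i(16) % @shards.size
  end

  # Buffers a counter increment on whichever shard owns the document.
  def increment(document_id, by = 1)
    @shards[shard_index(document_id)][document_id] += by
  end

  # Drains every shard and returns the combined counts, as a periodic
  # flush job might before writing them to the source of truth.
  def flush
    @shards.each_with_object({}) do |shard, totals|
      totals.merge!(shard)
      shard.clear
    end
  end

  # Number of buffered documents per shard, useful for checking the spread.
  def shard_sizes
    @shards.map(&:size)
  end
end
```

Because each document is independent, the same ID always lands on the same shard and no coordination between shards is needed. One trade-off of a plain modulo mapping is that changing the shard count remaps most keys, so a real deployment would need a plan for resizing.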

Furthermore, we utilize Redis Sentinel for these instances just like we do for the job queues mentioned above. We also deploy numerous “types” of these Redis clusters for different purposes, providing us with a controlled failure flow (i.e. if one particular type of Redis cluster has issues, we do not see unrelated features begin to fail concurrently).

Eventual Consistency

Braze also leverages eventual consistency as a tenet for most read operations. This allows us to leverage reading from both primary and secondary members of MongoDB replica sets in most cases, making our architecture more efficient. This principle in our data model allows us to heavily utilize caching all over our stack.

We use a multi-layer approach using Memcached. Basically, when requesting a document from the database, we’ll first check a machine-local Memcached process with a very low time to live (TTL), then check a remote Memcached instance (with a higher TTL), before ever asking the database directly. This helps us cut down dramatically on database reads for common documents, such as customer settings or campaign details. “Eventual” may sound scary, but, in reality, it’s only a few seconds, and taking this approach cuts down an enormous amount of traffic from the source of truth. If you’ve ever taken a computer architecture class, you might recognize how similar this approach is to how a CPU’s L1, L2, and L3 cache system works!
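The lookup order (local cache, then remote cache, then database) can be sketched like this. TtlCache is a minimal in-memory stand-in for a Memcached client, and both class names, the injectable clock, and any TTL values you pass in are assumptions for the example.

```ruby
# Minimal TTL cache; an in-memory stand-in for a Memcached client.
class TtlCache
  def initialize(ttl_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  # Returns the cached value, or nil when it is missing or expired.
  def get(key)
    value, expires_at = @entries[key]
    return nil if expires_at.nil? || @clock.call >= expires_at

    value
  end

  def set(key, value)
    @entries[key] = [value, @clock.call + @ttl]
  end
end

# Checks the local cache, then the remote cache, then the database loader,
# filling the faster layers on the way back.
class LayeredReader
  def initialize(local:, remote:, &load_from_database)
    @local = local
    @remote = remote
    @load_from_database = load_from_database
  end

  def fetch(key)
    value = @local.get(key)
    return value unless value.nil?

    value = @remote.get(key)
    if value.nil?
      value = @load_from_database.call(key)   # only reached when both layers miss
      @remote.set(key, value)
    end
    @local.set(key, value)
    value
  end
end
```

A reader built as LayeredReader.new(local: TtlCache.new(2), remote: TtlCache.new(60)) { |key| load_document(key) }, where load_document is whatever hypothetical function hits the database, reads from the source of truth at most once per remote TTL for a hot key. The cost is that a reader may see data that is up to one TTL old, which is the eventual consistency trade described above.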

With these tricks, we can squeeze a lot of performance out of arguably the slowest part of our architecture, and then horizontally scale it as appropriate when our throughput or capacity needs increase.

Where Ruby and Rails Fit In

Here’s the thing: It turns out, when you spend a lot of effort building out a holistic architecture where each layer horizontally scales well, the speed of the language or runtime is a lot less important than you might think. That means the choices of languages, frameworks, and runtimes are made with an entirely different set of requirements and constraints.

Ruby and Rails had a proven track record of helping teams iterate fast when Braze was started in 2011, and they’re still used by GitHub, Shopify, and other leading brands because they continue to make that possible. They continue to be actively developed by the Ruby and Rails communities, respectively, and they both still have a great set of open-source libraries available for a variety of needs. The pair is a great choice for fast iteration, since they have an immense amount of flexibility, and maintain a significant amount of simplicity for common use cases. We find that to be overwhelmingly true every day we use them.

Now, this is not to say Ruby on Rails is a perfect solution that’s going to work well for everyone. But at Braze, we’ve found that it works very well to power a large part of our data ingestion pipeline, message sending pipeline, and our customer-facing dashboard, all of which require rapid iteration and are central to the success of the Braze platform as a whole.

When We Don’t Use Ruby

But wait! Not everything we do at Braze is in Ruby. There are a few places over the years where we’ve made the call to steer things toward other languages and technologies for a variety of reasons. Let’s take a look at three of them, just to provide some additional insight into when we do and don’t lean on Ruby.

1. Sender Services

As it turns out, Ruby isn’t great at handling a very high degree of concurrent network requests in a single process. That’s an issue because when Braze is sending messages on behalf of our customers, some end-of-the-line service providers might require one request per user. When we have a pile of 100 messages ready to send, we don’t want to wait on each of them to finish before moving on to the next. We’d much rather do all of that work in parallel.

Enter our “Sender Services”—that is, stateless microservices written in Golang. Our Ruby code in the example above can send all 100 messages to one of these services, which will execute all of the requests in parallel, wait for them to finish, then return a bulk response to Ruby. These services are substantially more efficient than what we could do with Ruby when it comes to concurrent networking.
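The pattern itself (fan out, wait for everything, return one bulk result) is easy to show even though the real services are written in Golang. The Ruby sketch below only illustrates the shape: deliver is a hypothetical stand-in that sleeps instead of calling a provider, and one thread per message is an assumption that works for a small batch because Ruby threads can overlap while blocked on I/O or sleep. It says nothing about how efficiently this scales compared with the Go services.

```ruby
# Hypothetical stand-in for one provider call; sleeps to simulate network latency.
def deliver(message, latency_seconds)
  sleep(latency_seconds)
  { id: message[:id], status: "sent" }
end

# Fan out one thread per message, then wait for all of them and return the
# results in the original order as a single bulk response.
def deliver_all(messages, latency_seconds)
  threads = messages.map do |message|
    Thread.new { deliver(message, latency_seconds) }
  end
  threads.map(&:value)   # value joins the thread and returns its result
end
```

With 100 messages at 50 milliseconds each, sending them one after another would take about five seconds, while the fan-out version finishes in roughly the time of the slowest single request plus thread overhead.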

2. Currents Connectors

Our Braze Currents high-volume data export feature allows Braze customers to continuously stream data to one or more of our many data partners. The platform is powered by Apache Kafka, and the streaming is done via Kafka Connectors. You can technically write these in Ruby, but the officially supported way is with Java. And because of the high degree of Java support, writing these connectors is far easier to do in Java than in Ruby.

3. Machine Learning

If you’ve ever done any work in machine learning, you know that the language of choice is Python. The numerous packages and tools for machine learning workloads in Python eclipse the equivalent Ruby support—things like TensorFlow and Jupyter notebooks are instrumental to our team, and those types of tools simply don’t exist or are not well established in the Ruby world. Accordingly, we’ve leaned into Python when it comes to building out elements of our product that leverage machine learning.

When Language Matters

Obviously, we have a few great examples above where Ruby was not the ideal choice. There are many reasons why you might choose a different language—here are a few that we think are particularly useful to consider.

Building New Things without Switching Costs

If you’re going to build an entirely new system, with a new domain model and no tightly-coupled integration with existing functionality, you might have an opportunity to use a different language if you so choose. Especially in cases where your organization is evaluating different opportunities, a smaller, isolated greenfield project could be a great real-world experiment in trying out a new language or framework.

Task-Specific Language Ecosystem and Ergonomics

Some tasks are far easier with a specific language or framework—we particularly like Rails and Grape for development of dashboard functionality, but machine learning code would be an absolute nightmare to write in Ruby, since the open-source tooling just doesn’t exist. You might want to use a specific framework or library to implement some kind of functionality or integration, and sometimes your language choice will be influenced by that, since it will almost certainly result in an easier or faster development experience.

Execution Speed

Occasionally, you need to optimize for raw execution speed, and the language used will heavily influence that. There’s a good reason that a lot of high-frequency trading platforms and autonomous driving systems are written in C++; natively-compiled code can be crazy fast! Our Sender Services exploit Golang’s parallelism/concurrency primitives that simply aren’t available in Ruby for that very reason.

Developer Familiarity

On the other hand, you may be building something isolated, or have a library in mind that you want to use, but your language choice is completely unfamiliar to the rest of your team. Introducing a new project in Scala with a heavy lean toward functional programming might introduce a familiarity barrier to the other developers on your team, which would ultimately result in knowledge isolation or decreased net velocity. We find this to be particularly important at Braze, as we put intense emphasis on fast iteration, so we tend to encourage the usage of tools, libraries, frameworks, and languages that are already in wide use at the organization.

Final Thoughts

If I could go back in time and tell myself one thing about software engineering in giant systems, it would be this: For most workloads, your overall architecture choices will define your scaling limits and speed more than a language choice ever will. That insight is proven every day here at Braze.

Ruby and Rails are incredible tools that, when part of a system that’s architected properly, scale incredibly well. Rails is also a highly mature framework, and it supports our culture at Braze of iterating and producing real customer value quickly. These make Ruby and Rails ideal tools for us, tools that we plan to continue using for years to come.

Interested in working at Braze? We’re hiring for a variety of roles across our Engineering, Product Management, and User Experience teams. Check out our careers page to learn more about our open roles and our culture.


Zach McCormick

Zach is a software engineer in New York City. His passions include working out the bugs in big distributed systems then blogging about it, playing electric guitar, and slowly learning the Hungarian language.
