Distributed ops for distributed apps
My talk at Velocity Amsterdam 2016, on scaling operations for microservices
In the world of microservices, when things are moving fast and constantly breaking, the accepted wisdom is that teams must own the whole stack and operate their services themselves. But how much stack is “the whole stack”? How do we ensure that operational standards are consistent across the organization? What’s the right balance between consistent standards and the ability to move fast and innovate?
Microservices are like mini-companies—they are operated by a single team independently(-ish) of the rest of the organization. Teams often own the whole stack. They make decisions about what programming languages or frameworks to use, what technologies to utilize, and how to operate the service in production.
But what happens when the team moves on to the next project? Or if it’s not sufficiently large to provide adequate on-call support? And won’t owning and being responsible for more of the things lead to less time and energy for product development?
Standardization helps unify and simplify operations. But it might reduce your ability to deliver a new product feature sooner. It is important to know what to standardize and how. As with distributed software systems, organizations must decide on the right balance between consistency and availability of operations.
This talk will cover:
Why standards are important
What to standardize and what not to
How to achieve a high level of consistency in operations without sacrificing the ability to deliver value to your customers
Takeaway for the audience
Attendees will learn how to distribute operations in a consistent and efficient way.
The transcript of the talk
A few years ago, I "metamorphosed" into a manager. What I discovered along the way is that scaling software systems and scaling organizations is remarkably similar. The same math applies, so to speak, whether we're talking about distributed software or distributed organizations.
When I joined SurveyMonkey in 2014, we had about 15 engineering teams and 1 ops team developing and operating approximately 30 Python services.
To deploy those Python services we used a home-grown tool cleverly named Doula. Get it? Continuous Delivery? Doula?
Anyway, Doula allowed teams to build their Python package, and deploy it along with all of its Python dependencies into a Virtualenv on a test or staging server.
Then, another script (6,000 lines of artisanal bash code) was used by the ops team to tar-gzip that Virtualenv from the staging server and rsync it to a handful of production servers.
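The essence of that artisanal bash pipeline can be sketched in a few lines. This is a hypothetical reconstruction, not the real script: the paths, archive name, and host names are all illustrative.

```python
# Sketch of the deploy pipeline described above: tar-gzip the virtualenv
# on the staging server, then rsync it out to each production host.
# All paths and host names are made up for illustration.

def deploy_commands(venv_path, hosts, artifact="/tmp/app-venv.tgz"):
    """Build the list of shell commands the ops script effectively ran."""
    cmds = [f"tar -czf {artifact} -C {venv_path} ."]
    for host in hosts:
        # Ship the archive, then unpack it in place on the prod host.
        cmds.append(f"rsync -az {artifact} {host}:{artifact}")
        cmds.append(
            f"ssh {host} 'mkdir -p {venv_path} && tar -xzf {artifact} -C {venv_path}'"
        )
    return cmds

for cmd in deploy_commands("/srv/app/venv", ["prod-01", "prod-02"]):
    print(cmd)
```

Note what's missing from this picture: nothing here installs system-level dependencies, and everything depends on the staging virtualenv being in a known-good state — which is exactly where the trouble started.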
This process had several problems, as you might imagine.
Virtualenv encapsulates a Python environment reasonably well, as long as you stick to a few simple conventions. But any system-level dependencies, such as database drivers, had to be installed manually on the right servers at just the right time. And because production deployments relied on the staging environment being in a stable state, we had to institute deploy and code freezes, which created bottlenecks and slowed down development.
On average, teams were releasing once or twice a month.
Today, we have just over 100 services. And we're in the process of migrating them to a completely distributed pull-based configuration management system that scales sub-linearly.
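The core idea behind a pull-based system can be shown in a tiny sketch. Everything here is hypothetical — the state store and node names are stand-ins — but it captures why the model scales: each node converges toward its own desired state independently, so adding nodes adds no load on any central coordinator.

```python
# Minimal sketch of pull-based configuration management.
# DESIRED_STATE stands in for a versioned store (e.g. a git repo)
# that every node pulls from; no central server pushes commands.

DESIRED_STATE = {
    "web-01": {"packages": {"nginx", "app"}},
    "web-02": {"packages": {"nginx", "app", "certbot"}},
}

def converge(node, installed):
    """One iteration of a node's agent loop: compare actual state with
    desired state and return the actions this node takes on itself."""
    desired = DESIRED_STATE.get(node, {}).get("packages", set())
    to_install = desired - installed
    to_remove = installed - desired
    return sorted("install " + p for p in to_install) + \
           sorted("remove " + p for p in to_remove)

print(converge("web-01", {"nginx"}))  # -> ['install app']
```

Each agent acts only on itself — a point we'll come back to when we get to Promise Theory.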
Engineering teams own every step of the service delivery—from coding, to provisioning the virtual machines, deploying, responding to alerts and troubleshooting in production.
I want to point out these three books that had the biggest influence on how I think about scaling operations:
Thinking in Promises describes in plain English the Promise Theory developed by a physicist, Mark Burgess. Promise Theory, according to Wikipedia, studies "voluntary cooperation between individual, autonomous actors or agents who publish their intentions to one another in the form of promises." If you're building distributed systems (or scaling an organization) and haven't read this book yet, I highly recommend that you do so as soon as possible.
Turn the Ship Around is the story of a nuclear submarine commander, David Marquet, who succeeded in turning a consistently underperforming ship into one that received the highest scores ever seen in the history of the Navy. And he did that in less than a year by giving away control, and by creating leaders instead of followers out of his crew. It's a fascinating story! If you're going to read only one book on leadership, make it this one.
Team of Teams is a book by an ex-US Army General, Stanley McChrystal. In 2004, he took command of the Joint Special Operations Task Force, where he quickly realized that the Army couldn't fight Al-Qaeda—the highly distributed and decentralized network of guerrilla fighters—with conventional military tactics based on the century-old command-and-control model. McChrystal and his colleagues had to abandon the old ways and embrace transparent communication with decentralized decision-making authority.
Distributed Operations means distributing decision-making authority across a number of small, highly skilled teams directly involved in service development and delivery.
The US Marines developed the concept of Distributed Operations after they recognized that they were no longer capable of fighting a highly distributed and adaptive enemy using the good old centralized command-and-control methods. In the world of tech, we're dealing with our own highly distributed and adaptive "enemy" today—microservices. Microservices come with enormous benefits. Otherwise, we wouldn't be embracing them with such enthusiasm. But they also introduce many challenges that traditional ops are unable to meet.
Why do we want to distribute the decision-making across many teams? Why is it better than centralizing it in a single team that lives and breathes operations and, thanks to training and daily practice, is superb at it?
A lot could be said about the efficiency and consistency of centralized operations. But a lot has also been said about their inability to scale.
Centralized systems simply do not scale!
Because I'm going to be using the ideas from the Promise Theory in this talk, I thought I'd make a brief introduction for those of you who are unfamiliar with it. And at the same time hopefully demonstrate why distributed is better than centralized.
Promise Theory was proposed by Mark Burgess in 2004. It came out of the realization that the "command and control" model of IT management struggles to keep up as our systems grow to a massive scale, often spanning multiple data centers and continents. He suggested that instead of looking at systems from the traditional and familiar point of view of obligations, we look at them as systems of autonomous agents promising each other certain behaviors or outcomes.
The two basic tenets of the Promise Theory:
every agent promises only its own behavior
independently changing parts of the system exhibit separate agency and therefore should be modeled as separate agents
In other words, it studies distributed systems.
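The two tenets can be illustrated with a toy model. All the names here are hypothetical, and this is only a sketch of the idea, not of any real Promise Theory library:

```python
# Toy model of the two tenets of Promise Theory.

class Agent:
    """An autonomous agent that can publish promises about itself."""

    def __init__(self, name):
        self.name = name
        self.promises = []  # promises this agent has published

    def promise(self, body, to):
        # Tenet 1: a promise is always about *this* agent's own behavior;
        # no agent can promise on another agent's behalf.
        p = {"from": self.name, "to": to.name, "body": body}
        self.promises.append(p)
        return p

# Tenet 2: the CI system and a service team change independently,
# so they are modeled as separate agents.
ci = Agent("ci-system")
team = Agent("checkout-team")

ci.promise("build and package every commit", to=team)
team.promise("keep the build green", to=ci)
```

Cooperation emerges from each agent deciding for itself whether the other's promise is being kept — there is no central controller issuing instructions.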
In contrast, the traditional model of obligations is an imperative, command-and-control model, that describes a system that executes each step of a process to achieve a desired result.
One of the biggest problems of this model is that it separates the intent from the implementation. Imagine if two instructors start sending conflicting instructions to the same component. The component doesn't know which set of instructions is the correct one. The instructors are not aware of each other. The outcome of the situation is entirely uncertain.
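Here's a toy illustration of that failure mode, with made-up setting names. The component obediently applies whatever it is told, so the final state depends only on which instruction happens to arrive last:

```python
# Two "instructors" push conflicting configuration at the same component.
# The component has no way to tell which instruction is authoritative.

component = {"max_connections": 100}

def instruct(setting, value):
    # The component obediently applies whatever it is told.
    component[setting] = value

instruct("max_connections", 500)  # instructor A: scale up for a launch
instruct("max_connections", 50)   # instructor B: throttle for maintenance

# Last writer wins: the final state reflects arrival order, which
# neither instructor controls or even observes.
print(component["max_connections"])  # -> 50, purely by accident of ordering
```

Swap the two calls and you get the opposite outcome — which is exactly the uncertainty the obligation model bakes in.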
Those of you who had the dubious privilege of working in traditional ops can probably recognize this pattern.
Multiple teams submit potentially conflicting requests. The ops team doesn't have enough visibility into business priorities, so everything is presented as high priority. They react by dropping any planned work and throwing all available resources at those requests.
And they still miss the deadlines, because it's impossible to plan other people's work. Other teams are frustrated by the slow response, because Ops is now a bottleneck and a Single Point Of Failure.
None of their own projects get completed, technical debt grows, the team becomes overworked, burned out and everybody rage-quits.
The alternative is to distribute operations (i.e., the implementation) to where the intent is—service delivery teams (i.e., the developers).
In the model of Distributed Operations, each team makes their own operational decisions.
But it's important to note that decision-making autonomy does not necessarily mean that the team performs all the tasks on their own. They can still rely on another team to do a specific job. BUT!
Since they cannot make promises on behalf of other teams, to keep their own promises they had better have a contingency plan in case another team breaks its promise: either an agreement with yet another team that performs a similar function, or their own, perhaps scaled-down, version of the service.
For example, at SurveyMonkey we have a CI system that continuously builds, tests and packages artifacts for all teams. However, each team is capable of performing the same steps outside our CI system, in case it becomes unavailable for whatever reason.
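That contingency pattern looks something like this in code. This is a minimal sketch with hypothetical function names — not our actual tooling:

```python
# Sketch of the contingency pattern: try the shared CI system first,
# and fall back to a local build when it is unavailable.

def ci_build():
    # Simulated outage of the shared CI system.
    raise ConnectionError("CI system unreachable")

def local_build():
    # The scaled-down capability each team keeps for itself.
    return "app-1.2.3.tar.gz (built locally)"

def build_with_fallback(primary, fallback):
    """Keep our own promise ("an artifact gets built") even when
    the CI team's promise is broken."""
    try:
        return primary()
    except ConnectionError:
        return fallback()

print(build_with_fallback(ci_build, local_build))
# -> app-1.2.3.tar.gz (built locally)
```

The point isn't the try/except — it's that the team's promise to its consumers doesn't depend on any single other team keeping theirs.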
I like the analogy that Mark Burgess uses to talk about centralized vs. distributed systems.
In his analogy, the brain represents the centralized system: it receives information from numerous sensors through the nervous system, analyses the information and sends commands to various actuators in response.
Centralized systems are typically viewed as "smart", but as they grow the time it takes for signals to reach the brain, get processed and for commands to be sent back goes up, and the system becomes slow. Slow is the opposite of smart, slow is dumb.
On the other hand, society represents a distributed system—it is a number of loosely coupled autonomous agents that cooperate to achieve their goals.
Societies might include centralized components (for example, a library), but the society is capable of surviving even when a centralized component fails.
In distributed systems, the smart behavior emerges at the system level.
One of the examples of a distributed system slash society is our immune system. It's thousands of millions of defender cells—different types of white blood cells—that constantly patrol your body, destroying germs as soon as they enter.
Each individual immune cell is fairly dumb. It "recognizes" the germs it specializes in by matching their shape like a puzzle piece. But it is a blazingly fast process. And when you combine hundreds of millions of dumb autonomous cells, you get a very "fast" and "smart" system.
So, how do you distribute operations at your company?