What happens when your code reaches the end-user just once in a while?
Imagine. It’s a late Friday evening. A developer (still on his three-month trial period) has just finished manually deploying changes to production. He’s alone in the office, with everyone gone for the weekend. He checks the home page. Everything seems fine. He opens up Hacker News, skims a couple of articles and gets ready to head home. But before this developer goes out the door, “All Along the Watchtower” starts playing. That’s his phone’s ringtone and it’s the COO calling.
So let us not talk falsely now,
The hour is getting late.
Apparently, the whole forum is down. Uh-oh. The fix doesn’t seem very complicated. It’s late, he’s nervous, even if on the surface he looks calm and ready. So, it takes him two attempts to get it right. Since deployment is manual and takes roughly half an hour, more than an hour later everything is fine again.
This really happened in December 2012.
That developer was me.
What’s wrong with this picture?
What kind of issues in how we’ve done development and deployment led to that kind of incident?
First, we didn’t have a lot of tests. Our code coverage was at a measly 30%. But even those few tests took more than 10 minutes to run. And their results had no effect on whether we could deploy or not. Even if all of the tests were failing, one could still deploy to production.
The deployment process itself was manual. If one wanted to deploy our Ruby on Rails app, one had to run a modified Capistrano script, which asked a bunch of questions. For example:
- please select an application (we had two).
- please select a portal (multiple countries, each deployed separately).
- do you want to restart nginx?
- do you want to update gems?
- do you want to rebuild indexes?
- please select a release branch.
Our release branches were long-running too. We could keep our biggest portal on an older release for a while.
Such an arrangement of multiple long-running release branches makes deployment and day-to-day development harder than it has to be. What if you need to fix a bug in all countries and deploy it as soon as possible? Not only do you need to port it to every release branch, but you also have to go through the misery of six manual deployments. Each deployment takes half an hour, and you can only deploy one thing at a time, which adds up to three hours in total.
And after you finish deploying? You can take a look at Graylog, an open source log management tool, to figure out if there are any exceptions. But that’s it. No other insights or metrics. The incident described above was not unique and we needed to find a way to improve.
Three problems are evident in the described situation:
- Slow Feedback. What could be more sluggish than someone noticing a bug, contacting the COO and then the COO calling a developer?
- Fragile Releases. What could be more frail than a release that can completely break a whole major feature?
- Cumbersome Process. What could be more inconvenient than manual deployment that takes half an hour?
Our goal became Continuous Deployment. Continuous Deployment is a process where all code is deployed to production as soon as it is accepted and tested. A steady pipeline of changes going into production is created. It could be considered the next step of Continuous Integration and Continuous Delivery.
Developers need to have as much visibility as possible into production services. If something is wrong, they should find out about it immediately. And it’s especially essential in a Continuous Deployment environment.
Monitorable things can be divided into two types:
- Errors, crashes, exceptions. Some part of the system doesn’t function properly. It might affect one user or every user, but there’s no doubt that it’s bad. Whenever their number goes up, it requires attention and you need to find out in detail what’s happening. We log most of them to Graylog. But we don’t want to sit with a Graylog window open all day. When the number of errors exceeds a threshold, a plugin sends an alarm to Slack.
- Business metrics, performance stats. By itself, one data point doesn’t tell a story. It’s neither bad, nor good. Cache miss? It’s fine, we don’t expect 100% cache hit rate. But more of them can provide insights. Cache hit rate got to less than 90%? Uh-oh.
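To make the threshold idea for the first type concrete, here is a minimal sketch of a sliding-window error alarm. It is not the Graylog plugin we use; the class name, the threshold and the window size are all made up for illustration.

```python
from collections import deque


class ErrorRateAlarm:
    """Hypothetical alarm: fires when more than `threshold` errors
    are seen within the last `window_seconds` seconds."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Record one error at time `now` (in seconds).

        Returns True if the alarm should fire."""
        self._timestamps.append(now)
        # Drop errors that have fallen out of the sliding window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


alarm = ErrorRateAlarm(threshold=3, window_seconds=60)
results = [alarm.record_error(t) for t in (0, 10, 20, 30)]
# Only the fourth error within the same minute pushes the count over the threshold.
```

A single error never triggers anything here; only a burst inside the window does, which is the behaviour you want if you don’t plan to watch the logs all day.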
We use a combination of StatsD, Graphite and Grafana to collect and visualize these metrics. You can find some examples of how we use it in this blog post. Sensu is also part of our stack. We’re only starting to tap into its power. Metrics collected in StatsD could be used to create alarms. For example, if the cache hit rate is down, Sensu could send an alarm to #backend in Slack.
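As a rough illustration of what sits underneath such a setup, the sketch below formats counters in the StatsD line protocol (`<name>:<value>|c`, sent over UDP, port 8125 by default) and checks a cache hit rate against the 90% mark mentioned above. The metric name `cache.hit` and the helper functions are assumptions for the example; in a real stack the alarm itself would be Sensu’s job, not application code.

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter increment in the StatsD line protocol."""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP (8125 is the StatsD default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def cache_hit_rate(hits, misses):
    """Share of lookups served from the cache; 1.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 1.0


def needs_alarm(hits, misses, minimum_rate=0.90):
    """True when the hit rate has dropped below the (assumed) 90% mark."""
    return cache_hit_rate(hits, misses) < minimum_rate


packet = statsd_counter("cache.hit")  # hypothetical metric name
# send_metric(packet) would ship it to a StatsD daemon on localhost.
```

One cache miss is still just a data point. Only the aggregated rate says whether somebody should be notified.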
Additionally, we use New Relic for performance stats.
Strengthening our monitoring got us from Slow Feedback to Instant Feedback. But it’s only one part of the solution and of what enables Continuous Deployment. The same results can also be reached with other tools or services. The key is to focus on making the feedback loop as fast as possible.
There are three things that made our releases less fragile:
- Tests. Obviously. When writing new code, we add tests. When modifying old code, we also add tests. Actually, we just always add tests. You don’t have to do it in a TDD way, but just add tests.
- Pull requests. We moved away from our setup of multiple long-running release branches to GitHub Flow. It makes it easier to understand what’s going on and to make changes. We also started using pull requests for everything. Pull requests provide a lightweight mechanism for code reviews. And code reviews, when done right, help share knowledge and increase code quality.
- Feature flags, AB testing, github/scientist. Three different tools that allow releasing code without actually running it for every user. Depending on your needs, you can: test a feature yourself by flipping on a feature flag; enable new functionality for a percentage of users by running an AB test and comparing the results; or run a new code path side by side with the old one in a science experiment and compare the outcomes.
Adding these practices and tools significantly reduced the number of incidents we encounter in production.
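For readers who haven’t used these techniques, here is a small sketch of two of them: a deterministic percentage rollout and a side-by-side experiment. It only mirrors the idea behind github/scientist (a Ruby library) and is not its API. The function names and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id, feature, percentage):
    """Deterministically place a user into one of 100 buckets and
    enable the feature for the first `percentage` of them."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage


def run_experiment(control, candidate, *args):
    """Run the old and the new code path side by side.

    The user always gets the control result; the second return value
    tells whether the candidate produced the same answer."""
    control_result = control(*args)
    try:
        matched = candidate(*args) == control_result
    except Exception:
        # A crashing candidate must never affect the user.
        matched = False
    return control_result, matched


def broken_sum(values):
    raise RuntimeError("new code path is not ready yet")


result, matched = run_experiment(sum, broken_sum, [1, 2, 3])
# The user still gets the old result; the mismatch is only recorded.
```

Because the bucket is derived from a hash of the feature and the user, the same user keeps seeing the same variant across requests, and raising the percentage only ever adds users.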
We didn’t immediately go the fully automatic deployment route. At first, we did everything to make our process as fast as possible: better hardware, parallel execution of tests, splitting up builds into more logical units. Simple improvements that made a huge difference.
Then we moved our deployments to Jenkins CI. The process was still manual, but the number of steps was reduced significantly. Most of the time, it was just one: clicking a button in Jenkins to run a deployment script. Only sometimes would we have to change parameters for the build.
Next, we enabled running deployments from our chat. ChatOps is a term our team is very familiar with. We can just ask our friendly bot:
eve deploy 🇩🇪.
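How the bot handles such a message isn’t covered here, but a minimal sketch of the parsing side could look like the following. It relies on the fact that a flag emoji is a pair of Unicode regional indicator symbols that map directly to the letters of a country code. The function names and the accepted message shape are assumptions.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # code point of the regional indicator symbol for "A"


def flag_to_country_code(flag):
    """Convert a flag emoji (two regional indicator symbols) to a two-letter code."""
    if len(flag) != 2:
        raise ValueError(f"not a flag emoji: {flag!r}")
    letters = []
    for symbol in flag:
        offset = ord(symbol) - REGIONAL_INDICATOR_A
        if not 0 <= offset < 26:
            raise ValueError(f"not a flag emoji: {flag!r}")
        letters.append(chr(ord("A") + offset))
    return "".join(letters)


def parse_command(message, bot_name="eve"):
    """Return (action, country_code) for messages like 'eve deploy <flag>', else None."""
    parts = message.strip().rstrip(".").split()
    if len(parts) != 3 or parts[0] != bot_name or parts[1] != "deploy":
        return None
    return parts[1], flag_to_country_code(parts[2])


command = parse_command("eve deploy \U0001F1E9\U0001F1EA")  # the German flag
```

From there, the bot would only need to hand the country code to the same deployment job that the Jenkins button triggers.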
It took us a while before we gathered up the courage and switched our deploys to be always on during business hours. And we didn’t switch on every country immediately. We started with smaller countries and as our confidence grew in the process, we enabled them one by one.
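The rule described above can be sketched as a small gate function: deploy automatically only if the tests passed, the country has been switched on, and it is currently business hours. The 9-to-17, Monday-to-Friday window and the country codes are assumptions made for the example, not our actual configuration.

```python
from datetime import datetime


def should_auto_deploy(country, enabled_countries, tests_passed, now,
                       start_hour=9, end_hour=17):
    """Decide whether a change may go to production automatically."""
    if not tests_passed:
        # Unlike in the old setup, failing tests block the deployment.
        return False
    if country not in enabled_countries:
        # Countries are switched on one by one as confidence grows.
        return False
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


enabled = {"AT"}  # hypothetical: start with a smaller country
tuesday_morning = datetime(2013, 11, 5, 10, 0)
friday_night = datetime(2013, 11, 8, 22, 0)

ok = should_auto_deploy("AT", enabled, True, tuesday_morning)
late = should_auto_deploy("AT", enabled, True, friday_night)
```

With a gate like this, a late Friday evening deployment simply doesn’t happen on its own; someone has to decide to do it.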
We achieved our goal in less than a year from the aforementioned incident. In late 2013, we were deploying hundreds of times a week. Today, that number has increased to hundreds of times a day. And there’s no way we would go back to less. Instead of a cumbersome process, we have an automated one, always deploying to production. To enable that, we made our releases less frail and improved how we observe production.
This article gave a general overview of how we got to Continuous Deployment. In future articles, we’ll give more details about some of the particulars. If you have a topic you’d like us to expand on, drop us a line on Twitter!