I was chatting recently with a fellow Ops engineer at work about a project he was wrapping up. This colleague had written a large chunk of code for our config management tools to manage a new class of system. However, he was reluctant to merge his branch into the main codebase and release it to production. What if it doesn’t work properly, or behaves in an unexpected way? Just last week, someone pushed a bad change to config management and it nearly took down the whole site! Yes, better to leave the code disabled. If a change needs to be made, it can be done manually and then copied by hand into config management so it is not lost.
His proposal reminded me of the Waterfall model for software delivery and the extreme risk-aversion of traditional Operations teams. In a DevOps world, however, these practices don’t fly. In order to compete, businesses need the ability to put new features and bug fixes in front of users as quickly as possible. Without compromising quality. Development teams figured this out first, and came up with several techniques for achieving these goals. Let’s review a couple, and then look at how Operations can learn from them as well.
In one traditional development model, everyone works on a personal, long-lived “feature branch” which is only merged back into the the main codebase (“trunk”) much later.. Unfortunately, this can lead to all sorts of problems. One developer’s work renames a class that another’s branch relies on. Or each one introduces new contradictory config options. In any case, these integration problems cannot be caught until the end of the release cycle when all of the feature branches are merged. Bugs, crunch time and a frantic scramble to resolve conflicts ensue.
Continuous Integration (CI) is a newer workflow where changes are merged into the trunk very frequently. Some teams even forego the use of branches completely, doing all work on the trunk. You might think this would cause extreme chaos, with many people all working on the same code at the same time. And without new safeguards, you’d be right. Teams practicing CI run a centralized build server such as Jenkins, which performs frequent automated builds and tests. If the build fails, the source tree is closed to further commits until the problems are fixed. In this way, every developer is working on the same code base and any integration problems are caught early. This process is only as robust as the tests themselves, of course, so some up front work writing a battery of useful tests is critical. The payoff comes in the form of quicker releases with fewer defects and lower stress.
Continuous Delivery (CD) takes the concept of CI even further. Once you adopt CI, your code trunk is always in a state where it can be built without errors and pass basic tests. But traditionally, even a team that practices CI might only push a public release once a year. There’s so much packaging and testing to be done, not to mention the Ops work to actually deploy the code. Who has time to do that more often?
In the fast-paced world of web applications and SaaS offerings, the answer better be “you”. Rather than batching up changes for months or years, CD strives to get that code to users as quickly as possible. This is done by relentlessly automating the build and deploy process, working from a foundation of extensive automated tests (seeing a pattern yet?). The same build server that you set up for CI is extended to package up each finished build into a fully deployable release bundle. And, if all tests pass, it will automatically deploy that bundle to a staging environment for final approval–or even deploy straight to production! Building software this way has a number of benefits. When changes are delivered in small, easily understood batches, they’re simpler to debug when problems arise. And because the code is fresh in the developer’s mind, they’ll have less trouble coming up with a fix. It also gets the fruits of your labor out to users sooner. New features and fixes that sit undeployed for a year benefit nobody. With CD, as soon as the work is done, it can be deployed and start making your customers happy.
Putting a little Dev in your Ops
With those ideas in mind, let’s circle back to my conversation at work. My coworker had developed a batch of code in isolation from the rest of the team. It was “done” in his view, but it had not been well tested, to the point where he was afraid to merge it into trunk or deploy it to production. Call me crazy, but I have a hard time calling code “done” if you can’t or won’t run it! What lessons can we apply from CI/CD and how they improve development? We’ll take his concerns one at a time.
“What if it doesn’t work?” I can certainly appreciate not wanting to run untested code in production, but not running it at all is not the solution. CI/CD advocate rapidly iterating in small batches, testing each change as you go. In this way you gain confidence that you’ve built a very solid foundation where every piece works. Test early and often before pushing your changes live. Ideally, you can find a way to automate those tests. Peer review from your team is another great tool.
“What if I make a bad change?” This comes back to testing and confidence again. Development is churning out high quality features at breakneck speed, and it’s Ops’ job to match them. Trust your tests and automation, just as dev does. If those aren’t up to the task, it’s time to start beefing them up. If you’re not comfortable deploying your own code, how can you expect to do it for others?
“I’ll write the code but never merge the branch into trunk or deploy it.” Hoo boy. What do you call the polar opposite of CI/CD? The maxim “incorrect documentation is worse than no documentation” applies here. With missing docs, at least you know where you stand. But bad documentation is actively misleading and can do great harm when you follow it. Unmerged, untested code works the same way. Someone will eventually find and run it–with unpredictable results. You’ve also burned time on work that is not delivering any value. At best, it’s wasted effort. At worst, a time bomb waiting to go off. This configuration living off to the side is like a Waterfall developer’s feature branch. Isolated and unused, it’s just waiting to cause problems if and when it is finally merged and deployed.
“I’ll make sure to mirror every manual change back into config management.” …until you don’t. Nobody is perfect, and you are eventually going to miss something. Your config is now inaccurate, and you won’t find out until the server dies years later. Someone dutifully provisions a new one using the saved config, but now the service is behaving strangely because it is not set up correctly. Good luck tracking down that crucial missing change. This is analogous to a developer refusing to write or run any automated tests because they tested by hand and “it worked on my machine”. I think everyone’s heard that line before. Once again, trust your automation and leave the human error out of it.
Development teams have reinvented themselves with Agile techniques, Continuous Integration and Continuous Delivery. This allows them to write code with unprecedented speed and without compromising quality. Thankfully, many of those same lessons are directly applicable to Ops. Test everything. Once a process is well defined, automate relentlessly to ensure it’s done right every time. Work in small, easily digestible iterations and deploy them frequently. If a process is slow or painful, focus your efforts on that bottleneck first.
Modern system administration is described as “infrastructure as code”, and that’s not just a catch phrase. This type of work closely resembles software development, and there’s a large body of best practices that Ops can leverage to improve the service we deliver. Embrace that knowledge. Maybe even ask your favorite developer over lunch about how and why they use CI/CD. Dev and Ops collaborating… what’s the worst that could happen?
Are you using CI or CD in the field, whether it be in Dev or Ops? How’s it working out for you? I’d love to hear your comments.
If you couldn’t tell, I find this topic fascinating. In a future post I plan to talk in detail about tools and processes for automating tests of your infrastructure and configs.