New Tool: Docker RPM Builder

My employer is a CentOS shop, and we maintain a library of homegrown RPMs. We use these for a number of purposes, from packaging up software that doesn’t ship with its own RPM to rebuilding open source apps with custom patches applied. Historically, though, these… haven’t exactly been built and managed using best practices. RPMs were built ad hoc on any available machine. Source RPMs were usually (but not always) dumped onto a file server, and could be a nightmare to rebuild later if they had many dependencies.

I’ve been looking into tools to help our Ops team come into the 21st century with respect to software building and packaging. Running it all through Jenkins CI is a no-brainer. But what would do the heavy lifting of actually building the RPMs? Conveniently, just as I was starting to explore this, the DevOps Weekly Newsletter came to the rescue! As an aside, DevOps Weekly is awesome and I highly encourage you to check it out.

That week’s issue highlighted a tool called docker-rpm-builder by Alan Franzoni. It leverages Docker to perform the RPM builds. Given a base image and an appropriate Dockerfile, the tool quickly installs all dependencies and runs the build inside a container, spitting out a finished RPM in just a few seconds. This saves you from the joys of managing all of the build dependencies (and their inevitable conflicts), or needing to run dozens of VMs, each customized for one RPM’s specific needs.
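To make the idea concrete, here’s a minimal sketch of the container-based approach. This is not docker-rpm-builder’s actual interface (see its README for that); the image name, spec file, and paths are all invented for illustration. First, a base image that carries the RPM toolchain and the package’s declared build dependencies:

```dockerfile
# Dockerfile (hypothetical) -- CentOS image preloaded with build dependencies
FROM centos:6

# RPM toolchain, plus yum-builddep for resolving the spec's BuildRequires
RUN yum -y install rpm-build yum-utils && \
    mkdir -p /root/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
COPY mypackage.spec /tmp/
RUN yum-builddep -y /tmp/mypackage.spec
```

With that image built once, every subsequent build of the package is a single short-lived container run against your source tree:

```bash
# Build the image once, then reuse it for every build of this package
docker build -t rpmbuild-mypackage .

# Mount the source tree and let rpmbuild drop the finished RPM into ./out
mkdir -p out
docker run --rm -v "$PWD:/src" rpmbuild-mypackage \
    rpmbuild -ba /src/mypackage.spec \
        --define "_sourcedir /src" \
        --define "_rpmdir /src/out"
```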

I’m only just getting started with docker-rpm-builder, but it looks quite slick. As I work with it more seriously, I plan to post some hands-on tutorials and report on how it’s worked out for taming our Ops build environment mess.

If you have any experience with the tool, or have tackled this challenge before, I’d love to hear about it.

What Ops Can Learn from Agile and CI/CD

A Conversation

I was chatting recently with a fellow Ops engineer at work about a project he was wrapping up. This colleague had written a large chunk of code for our config management tools to manage a new class of system. However, he was reluctant to merge his branch into the main codebase and release it to production. What if it doesn’t work properly, or behaves in an unexpected way? Just last week, someone pushed a bad change to config management and it nearly took down the whole site! Yes, better to leave the code disabled. If a change needs to be made, it can be done manually and then copied by hand into config management so it is not lost.

His proposal reminded me of the Waterfall model for software delivery and the extreme risk-aversion of traditional Operations teams. In a DevOps world, however, these practices don’t fly. To compete, businesses need the ability to put new features and bug fixes in front of users as quickly as possible, without compromising quality. Development teams figured this out first, and came up with several techniques for achieving these goals. Let’s review a couple, and then look at how Operations can learn from them as well.

Continuous Integration

In one traditional development model, everyone works on a personal, long-lived “feature branch” which is only merged back into the main codebase (“trunk”) much later. Unfortunately, this can lead to all sorts of problems. One developer’s work renames a class that another’s branch relies on. Or each branch introduces new, contradictory config options. In any case, these integration problems aren’t caught until the end of the release cycle, when all of the feature branches are merged. Bugs, crunch time and a frantic scramble to resolve conflicts ensue.

Continuous Integration (CI) is a newer workflow where changes are merged into the trunk very frequently. Some teams even forgo the use of branches completely, doing all work on the trunk. You might think this would cause extreme chaos, with many people all working on the same code at the same time. And without new safeguards, you’d be right. Teams practicing CI run a centralized build server such as Jenkins, which performs frequent automated builds and tests. If the build fails, the source tree is closed to further commits until the problems are fixed. In this way, every developer is working on the same code base, and any integration problems are caught early. This process is only as robust as the tests themselves, of course, so some up-front work writing a battery of useful tests is critical. The payoff comes in the form of quicker releases with fewer defects and lower stress.
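As a rough illustration, here is the kind of build-and-test script a Jenkins job might run on every commit to trunk. The script name and Makefile targets are hypothetical; the point is simply that a non-zero exit from any step fails the build and flags the offending change immediately:

```bash
#!/bin/bash
# ci-build.sh (hypothetical) -- run by the CI server on every commit to trunk
set -euo pipefail   # abort on the first failing step

make clean build    # compile from a clean slate (assumes a Makefile)
make test           # run the automated test suite
make package        # produce an installable artifact for later stages
```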

Continuous Delivery

Continuous Delivery (CD) takes the concept of CI even further. Once you adopt CI, your code trunk is always in a state where it can be built without errors and pass basic tests. But traditionally, even a team that practices CI might only push a public release once a year. There’s so much packaging and testing to be done, not to mention the Ops work to actually deploy the code. Who has time to do that more often?

In the fast-paced world of web applications and SaaS offerings, the answer had better be “you”. Rather than batching up changes for months or years, CD strives to get that code to users as quickly as possible. This is done by relentlessly automating the build and deploy process, working from a foundation of extensive automated tests (seeing a pattern yet?). The same build server that you set up for CI is extended to package up each finished build into a fully deployable release bundle. And if all tests pass, it will automatically deploy that bundle to a staging environment for final approval, or even deploy straight to production!

Building software this way has a number of benefits. When changes are delivered in small, easily understood batches, they’re simpler to debug when problems arise. And because the code is fresh in the developer’s mind, they’ll have less trouble coming up with a fix. It also gets the fruits of your labor out to users sooner. New features and fixes that sit undeployed for a year benefit nobody. With CD, as soon as the work is done, it can be deployed and start making your customers happy.
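Continuing the hypothetical pipeline above, a delivery stage might look something like this. The hostnames and helper script are invented for the example:

```bash
#!/bin/bash
# cd-deliver.sh (hypothetical) -- runs only after ci-build.sh succeeds
set -euo pipefail

# Tie the release bundle to an exact commit for easy rollback and debugging
VERSION=$(git describe --tags --always)

# Bundle the build output into a deployable release artifact
tar -czf "myapp-${VERSION}.tar.gz" -C build .

# Ship to staging for final approval; production is one more invocation away
scp "myapp-${VERSION}.tar.gz" deploy@staging.example.com:/releases/
ssh deploy@staging.example.com "/opt/deploy/install-release.sh ${VERSION}"
```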

Putting a Little Dev in Your Ops

With those ideas in mind, let’s circle back to my conversation at work. My coworker had developed a batch of code in isolation from the rest of the team. It was “done” in his view, but it had not been well tested, to the point where he was afraid to merge it into trunk or deploy it to production. Call me crazy, but I have a hard time calling code “done” if you can’t or won’t run it! What lessons can we take from CI/CD and the ways they improve development? We’ll address his concerns one at a time.

“What if it doesn’t work?” I can certainly appreciate not wanting to run untested code in production, but not running it at all is no solution. CI and CD advocate rapid iteration in small batches, testing each change as you go. In this way you gain confidence that you’ve built a very solid foundation where every piece works. Test early and often before pushing your changes live. Ideally, you can find a way to automate those tests, as in the sketch below. Peer review from your team is another great tool.
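For config management code, automated testing can start very small. A pre-merge gate might look like the following sketch, which assumes Puppet (the post doesn’t say which tool is in use) and the stock puppet-lint gem:

```bash
#!/bin/bash
# pre-merge-check.sh (hypothetical) -- cheap safety net before code hits trunk
set -euo pipefail

# Catch syntax errors before they ever reach a production node
find manifests -name '*.pp' -print0 | xargs -0 -n1 puppet parser validate

# Flag style problems; inconsistent code is where bugs hide
find manifests -name '*.pp' -print0 | xargs -0 puppet-lint

# Dry-run the catalog: report what would change without changing anything
puppet apply --noop manifests/site.pp
```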

“What if I make a bad change?” This comes back to testing and confidence again. Development is churning out high-quality features at breakneck speed, and it’s Ops’ job to match them. Trust your tests and automation, just as dev does. If those aren’t up to the task, it’s time to start beefing them up. If you’re not comfortable deploying your own code, how can you expect to do it for others?

“I’ll write the code but never merge the branch into trunk or deploy it.” Hoo boy. What do you call the polar opposite of CI/CD? The maxim “incorrect documentation is worse than no documentation” applies here. With missing docs, at least you know where you stand. But bad documentation is actively misleading and can do great harm when you follow it. Unmerged, untested code works the same way. Someone will eventually find and run it, with unpredictable results. You’ve also burned time on work that is not delivering any value. At best, it’s wasted effort. At worst, it’s a time bomb waiting to go off. This configuration living off to the side is just like a Waterfall developer’s feature branch: isolated and unused, waiting to cause problems if and when it is finally merged and deployed.

“I’ll make sure to mirror every manual change back into config management.” …until you don’t. Nobody is perfect, and you are eventually going to miss something. Your config is now inaccurate, and you won’t find out until the server dies years later. Someone dutifully provisions a new one using the saved config, but now the service is behaving strangely because it is not set up correctly. Good luck tracking down that crucial missing change. This is analogous to a developer refusing to write or run any automated tests because they tested by hand and “it worked on my machine”. I think everyone’s heard that line before. Once again, trust your automation and leave the human error out of it.

Wrap Up

Development teams have reinvented themselves with Agile techniques, Continuous Integration and Continuous Delivery, allowing them to ship code with unprecedented speed without compromising quality. Thankfully, many of those same lessons are directly applicable to Ops. Test everything. Once a process is well defined, automate relentlessly to ensure it’s done right every time. Work in small, easily digestible iterations and deploy them frequently. If a process is slow or painful, focus your efforts on that bottleneck first.

Modern system administration is often described as “infrastructure as code”, and that’s not just a catchphrase. This type of work closely resembles software development, and there’s a large body of best practices that Ops can leverage to improve the service we deliver. Embrace that knowledge. Maybe even ask your favorite developer over lunch about how and why they use CI/CD. Dev and Ops collaborating… what’s the worst that could happen?

Are you using CI or CD in the field, whether it be in Dev or Ops? How’s it working out for you? I’d love to hear your comments.

If you couldn’t tell, I find this topic fascinating. In a future post I plan to talk in detail about tools and processes for automating tests of your infrastructure and configs.

Load Balance All The Things

Load Balancing Basics

If you’ve done much work in Operations, you’ve probably encountered a load balancer: a dedicated device (or software service) that sits between clients and pools of servers, spreading the incoming traffic among them to achieve a greater scale than any one server could handle alone. Perhaps the most obvious use case is web servers. A popular web site might get many millions of hits every day, and there’s no way one server, even a very expensive one, could stand up to that. Instead, many inexpensive servers are placed behind the load balancer and the requests are spread evenly among them. In a well-written web application, any server can handle any request, so the whole process is transparent to the user. They simply browse your site as they normally would, with no hint that each page they view might be returned by a different server.

There are other benefits, too. Hardware fails, software has bugs, and human operators make mistakes. These are facts of life in Ops, but load balancers can help. If you “overbuild” your pool with extra servers, your service can survive losing several machines with no impact to the user. Likewise, you could take them down one at a time for security patching or upgrades. Or deploy a new build of your application to only 5% of your servers as a smoke test or “canary” for catastrophic failures before rolling it out site-wide.

If your app needs 5 web servers to handle your peak workload, and you have 6 in the pool, you have one server’s worth of headroom for failure. This is known as “N + 1” redundancy, and it is the bare minimum you should strive for when supporting any production service. Whether you want even more spare capacity depends on the marginal cost of each additional server versus the expense of an outage. In the age of virtual machines, these extra boxes may be very cheap indeed.

There are many options available for load balancing, both hardware and software. On the hardware side, some popular (and often extremely expensive) names are F5 BIG-IP, Citrix NetScaler, and Coyote Point. In software, the best known is probably HAProxy, although nginx and Apache have some limited load balancing features, too. And if you’re a cloud native, Amazon’s Elastic Load Balancer (ELB) product is waiting for you.
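For a flavor of what this looks like in practice, here is a minimal HAProxy configuration sketch for the web pool described above, including a low-weight “canary” server along the lines mentioned earlier. The addresses, health-check URL, and weights are all invented for illustration:

```
# haproxy.cfg (sketch) -- round-robin web pool with a canary server
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend webpool

backend webpool
    balance roundrobin
    # Health checks pull dead servers out of rotation automatically
    option httpchk GET /healthz
    server web1 10.0.0.11:80 check weight 100
    server web2 10.0.0.12:80 check weight 100
    server web3 10.0.0.13:80 check weight 100
    # Canary gets ~5% of requests (15 of 315) for smoke-testing new builds
    server canary 10.0.0.19:80 check weight 15
```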

Load Balancing Internal Services

Load balancing public services is important. However, there are likely many internal services that are equally crucial to your app’s uptime. These are sometimes overlooked. I certainly didn’t think of them as candidates for load balancing at first. But to your users, an outage is an outage. It doesn’t matter whether it was because of a failure on a public web server or an internal DNS server. They needed you, and you were down.

Some examples of services you might load balance are DNS, SMTP for email, Elasticsearch queries and database reads. Any one of these might be able to run on a single machine from a sheer horsepower perspective, but load balancing them still gives you the advantages of redundancy: guarding against failure and allowing for maintenance.
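Internal services that speak plain TCP can be balanced with the same tools as your web tier. As a sketch, assuming HAProxy again and made-up addresses, an Elasticsearch query pool might look like this. (DNS, being mostly UDP, typically relies on other mechanisms, such as anycast or its own built-in redundancy.)

```
# haproxy.cfg fragment (sketch) -- TCP-mode balancing for an internal service
listen elasticsearch
    bind 127.0.0.1:9200         # clients point at the balancer, not a node
    mode tcp
    balance leastconn           # favor the least-busy node for long queries
    server es1 10.0.1.21:9200 check   # "check" does a TCP connect health check
    server es2 10.0.1.22:9200 check
    server es3 10.0.1.23:9200 check
```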

You might even apply these techniques to your company’s internal or enterprise IT systems. If employees need to authenticate against an LDAP directory to do their jobs, it would be wise to load balance several servers to ensure business doesn’t grind to a halt with one failed hard drive.

Takeaway

Load balancing is a powerful tool for improving the performance, resiliency and operability of your services. It’s used as a matter of course on public systems, but give a thought to what it can do for your lower-profile ones, too.

That’s not to say it’s a cure-all. Some services just aren’t suited to it, such as database writes (without special software designed for multiple masters) or batch jobs that pull their work from a central queue. Other applications might not be “stateless”, and may misbehave if the user is routed to a different server on each request. As always, use the right tool for the job!