My Podcast Playlist for 2017

I have a lengthy train commute to work, so podcasts are a lifeline for me on several levels. They give me something to fill the time. But much more importantly, they keep me fresh with what’s going on in technology. If you’re looking for something to put in your ear this year, consider this list. I personally listen to and vouch for every one of them.

Datanauts, from the Packet Pushers network. Their stated mission is to “explore the latest data center innovations including storage, virtualization, networking and convergence”, as well as “bust silos”. And they do a pretty damn good job of it. The hosts are a CCIE and VCDX respectively, so they know their stuff. And yet they are also adept at getting out of the way and letting their equally interesting guests come on and do their thing. A recent episode with Charity Majors was particularly fascinating to me, and I have a ton of tabs open for followup reading. This might actually be my favorite tech podcast right now.

Speaking of the Packet Pushers, I also really enjoy their Network Break podcast. It’s a quick 30 minute hit of the week’s highlights in networking news. Product announcements, trends, acquisitions, etc. Superficial and mostly on the business side. Just the right level for anyone who doesn’t do networking full time but wants to keep up.

Software Engineering Radio is hit or miss for an Ops person. Sometimes it’s geeking out over the best features in the new C++ standard. Other times it’s solid gold discussion of things like salary negotiation, or Apache Spark, or some other new tech you’re going to have to support in production tomorrow. Their guests tend to be The Authority on whatever the subject is (like the inventor of PowerShell or Golang). So I subscribe to the feed, aggressively skip topics, and then listen with rapt attention when something good comes along because they are probably talking to the world’s foremost authority on it.

If you ever touch a Microsoft technology, RunAs Radio should be your very first stop for news. Host Richard Campbell is very plugged into that world, and the caliber of guests he gets every week reflects that.

Arrested DevOps is a great show on the eponymous topic of DevOps. There have been many podcasts in this space, but ADO is one of the last ones standing. And still one of the best. As “DevOps” is a broad and loaded term, the show covers a ton of different topics. Take a look at the episode backlog and see if a few tickle your fancy. And when you’re done with those, listen to the rest anyway!

Finally, I’ll throw out Software Defined Talk. Hosted by the one-and-only Michael Coté with a couple other dudes, it’s a hilarious roundtable of tech news and their takes on it. Plus useful recommendations on Costco deals. Just listen already. It’s highly informative, witty, and far better than I can make it sound.

I follow a few other shows, mostly hoping they return from limbo and post a new episode. I could do a whole other post of dead podcasts whose back catalog is must-hear stuff (RIP The Ship Show, and DevOps Cafe is an all-time-great but on life support). But the above shows are my weekly mandatory listening going forward.

How about you? What podcasts am I missing?

Nginx Load Balancer Improvements to proxy_next_upstream

This change happened in March of 2016, but was still news to me when I stumbled across it recently. So I wanted to share since it’s important but didn’t seem to be loudly broadcast. Nginx is no longer dangerously bad at load balancing!

Among the many features of the outstanding Nginx webserver is the ability to act as a load balancer. Using the built-in upstream module, you can define a pool of backend app servers that should take turns servicing requests. And in theory, you can tell nginx to skip a server if it is down or returning error (HTTP 5xx) responses.
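
For reference, a bare-bones pool looks something like this (the pool name, backend addresses and port are hypothetical; the directives themselves are stock Nginx):

upstream app_pool {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;
}

server {
    listen 80;
    location / {
        # Requests are spread across the pool; unhealthy servers get skipped
        proxy_pass http://app_pool;
    }
}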

In practice, however, Nginx’s handling of downed servers can be very dangerous. As a Hacker News thread on the subject points out, when a server returns an error, Nginx will by default retry the request on another server. This is fine most of the time. But what if the request was “charge $10,000 to my credit card”? Maybe the server correctly applied the charge, but then failed while rendering the confirmation page and returned an error. Well, get ready for some really angry customer support calls. Nginx would have resubmitted that same $10,000 charge over and over until a server responded with an HTTP 200 OK.

For this reason, many admins recommend setting the value proxy_next_upstream off;. This makes a failed backend request simply return an error page instead of retrying it on another server. Definitely not ideal; who wants their users to see error pages? But better than handling a deluge of chargebacks from outraged customers who were billed multiple times. In reality, this often meant admins chose another, specialized tool for their load balancing needs, like HAProxy or an expensive hardware appliance from the likes of F5 or A10.
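
If you did stick with Nginx, that workaround was a single directive in the location that proxies to your pool; a quick sketch:

location / {
    proxy_pass http://app_pool;
    # Fail fast: never replay a failed request against another backend
    proxy_next_upstream off;
}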

But wait! With the release of Nginx 1.9.13, things got better. Nginx will now never retry “non-idempotent” requests unless you explicitly tell it to. An idempotent request is one that has the same effect no matter how many times you perform it, so POST (along with a few more obscure methods like LOCK and PATCH) doesn’t qualify and is no longer retried by default.
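
If you genuinely want the old replay-everything behavior for a particular location, you now have to opt in with the non_idempotent parameter. A sketch (the error/timeout/http_502 conditions are just examples):

location / {
    proxy_pass http://app_pool;
    # Retry on these conditions, and explicitly allow retrying POST and other
    # non-idempotent methods (the pre-1.9.13 behavior). Use with care.
    proxy_next_upstream error timeout http_502 non_idempotent;
}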

So if you’re still running with proxy_next_upstream off; in your config because of those concerns, it’s time to test removing it. Nginx’s load balancing is much safer and saner than it was this time last year.

Rundeck Performance Tuning With MySQL

At my current job, we use a tool called Rundeck to automate a slew of tasks. I initially stood up a test instance on a small VM, so people could kick the tires and decide if it was useful. Before I knew it, five or six dev teams were running dozens of critical jobs out of there, raving about its power, flexibility, and visibility. It had been voted into production whether I liked it or not.

There was just one problem: Web performance was awful. I’m talking “click a link, go get a fresh cup of coffee, come back, and the page is still loading” slow. Literally. It had started out fine, but as popularity grew, it quickly became unusable.

Hours of fruitless troubleshooting later, I came across a GitHub issue mentioning some missing indices on MySQL tables. Hmm, we use MySQL… searching for that table name led to a couple more issues and mailing list threads homing in on one issue: Add these indices, and perf is fixed. I gave it a shot.

Backup your database before running these commands. They should be harmless, but better safe than sorry.
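
Something as simple as a mysqldump will do; I’m assuming here that the Rundeck schema is named rundeck:

mysqldump -u root -p rundeck > rundeck-backup.sql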

ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( workflow_commands_id );
ALTER TABLE workflow_workflow_step ADD INDEX commands_idx ( commands_idx );
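
Afterwards, you can confirm the new indices exist:

SHOW INDEX FROM workflow_workflow_step;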

Performance immediately improved for us. Load times on problem pages went from minutes to a second or less. Again, literally.

Why these indices are not created by default, I couldn’t say. For all I know, there’s a great reason, but it’s never been articulated by the Rundeck developers. There are (ignored) GitHub issues and mailing list threads mentioning them dating back years. Which is a shame, because Rundeck is a great tool. But once you get beyond a handful of frequently running jobs, it borders on unusable until you fix this problem.

That is also, unfortunately, a metaphor for Rundeck itself. Hard-to-find yet crucial tweaks and confusing interfaces challenge operators and users at every turn. I think it’s an A+ idea that suffers from some unfortunate C- documentation, design, and operational decisions. It’s an outstanding tool for fostering a DevOps culture that, in a beautifully ironic twist, resembles the very software a dev team throws over the wall to ops without a care in the world for those who have to operate or use it in production.

End rant. As I said, we really do use Rundeck extensively despite all that. I’ll try to post some more real-world stories of use cases, tweaks, and gotchas on this blog in the future.

DevOps Days Rockies Recap

I was fortunate enough to attend the first ever DevOps Days Rockies event this past week at Denver’s FORTRUST Data Center. It was an amazing two days packed full of insightful talks, open spaces, professional networking, and fun. I wanted to write up a recap of the event from my perspective. This is going to be a long post, so I’ll just hit the highlights that really stood out to me.

Major thanks to Photobucket for sponsoring me, a fellow sysadmin, and one of our DBA’s! I’m the nerd on the right in this photo.

What is DevOps Days anyway?

DevOps Days is a conference series that sprang up in 2009 in Belgium. It has since spread all over the world. And I do mean “all over”. There have been events from Tel Aviv to Bangalore to Chicago to Melbourne and, finally, in my own backyard in Denver! The goal is to spread the love of DevOps and help anyone interested in that topic hone their craft. What DevOps means to me could and probably will be a whole separate post. But for now, let’s just call it breaking down barriers between a company’s development and operations teams, and using learning from one side of the house (whether that be CI/CD, source control, config management, cloud/virtualization/containers, etc) to benefit the other.

DevOps Days are organized by local volunteers. Here, this was mostly members of the Denver DevOps Meetup Group, which I can testify is an awesome bunch of folks. The events span two days, with standard conference presentations each morning. After lunch, there’s a series of “lightning talks” which are quick-hitting 5 minute affairs great for introducing the audience to a new topic or helping a new speaker get their feet wet for the first time. Finally, the afternoon is devoted to “open spaces”, which are the real soul of DevOps Days. Anyone in the audience can submit a topic, and then everyone votes on what they’d like to discuss. Groups break off and form around each topic, and you’ll move through 3-4 of these face-to-face sessions for the rest of the day. Based on a show of hands, this was the first time using this style for 90% of attendees, myself included. I wasn’t sure what to expect, but any worries were quickly put to rest. It’s just you and other like-minded folks sitting down and having a discussion about something you’re all passionate about. Usually with several conference presenters sitting next to you. So it’s an outstanding opportunity to pick their brains and ask questions you might not be able to during their main talk. And to throw out ideas of your own.

Photobucket attendees

(The DevOps Days Rockies 2015 Organizers!)

Day 1

Thursday kicked off with a keynote by the Sober Build Engineer himself, J. Paul Reed. It was a good discussion of how you can’t simply copy the culture or practices of another organization wholesale, but have to select and adjust them to work in your own environment. I also learned that Paul grew up in Fort Collins where I currently live, which was some neat trivia. This was followed up by Matt Stratton from Chef talking about ways to manage your mental stack. This one really struck a chord with me since I love to drink from the learning fire hose, absorbing as many blogs, Twitter feeds, and podcasts as I can. It’s easy to burn so much time learning that you never actually get work done, and Matt had some suggestions for dealing with that problem. Now if only I can shut off Twitter long enough to implement them…

After lunch, we got our first taste of lightning talks. A couple that jumped out from Day 1 were Elizabeth Mintzas discussing DevOps Recruiting, and Joshua Timberman’s “Stop Demonizing curl|bash”.

Finally, we selected and broke out for open spaces. I sat in on a discussion of post-mortems involving Etsy’s Ryan Frantz, Joshua Timberman from Chef and Josh Nichols of GitHub, among others. Lots of really awesome and immediately actionable advice. I moved on to several interesting sessions, including a good conversation about Best Practices in Config Management. The conference’s other token SaltStack user was there, too! High-five, buddy.

The day was capped off by a happy hour–perhaps more like happy 8 hours, based on the Twitter pics coming in until midnight 🙂 As an introvert, I was wiped and skipped this. But that’s easily my biggest regret of the event, and I will suck it up and deal next time. Too much fun and networking to be had.

DevOps Days Rockies Foosball tournament at Wynkoop Brewery
(If the crowd looks thin, consider it was pushing midnight local time!)

Day 2

Friday picked right back up with four excellent presentations. Royce Haynes discussed self-managed and self-organized teams. What that means, why you’d try it, what works, and what doesn’t. Ryan Frantz followed that up with my favorite and most immediately useful talk: The Value of Alert Context. He highlighted Etsy’s open-source nagios-herald tool which does all sorts of cool stuff to embed context and information directly into the alert email. So you can make a snap decision about whether that 3AM page can really wait til morning (spoiler: it usually can). I’m itching to implement something similar for our Zabbix monitoring system. Ryan also demonstrated his out-of-control harmonica skills, playing Mary Had a Little Lamb and Oh Susanna to wild applause. There was also a neat deep dive from Twitter’s Matt Getty on a bare metal provisioning system they wrote. It has a lot of similarities to a system we created here at Photobucket. Twitter’s is certainly slicker but it was fun to see that we encountered the same problems and worked toward a similar solution.

We then broke for lunch, which was provided by three food trucks outside the venue. Very fun, and very delicious. Pictured: @medieval1 and his rad utilikilt leaning on the counter. Not pictured: Biker Jim’s goddam amazing hot dog cart. Want to eat wild boar, or reindeer, or (horror of horrors) a vegan dog? No worries, Jim’s got your back!

DevOps Days Rockies food trucks including Biker Jim's

My highlight of day 2 was the “ChatOps” panel. I thought this was going to be lame, since my past experience with chat bots has been something that sits in IRC and takes requests for Yo Mama jokes. But these guys blew that out of the water. Tons of ideas on how to put a chat bot to work as a self-service interface, with the added benefit of automatic public broadcast of what’s being done and historical records. I can’t wait to get to work implementing this in our company.

DevOps Days Rockies ChatOps panel including GitHub Hubot, Google Errbot and Lita

The conference closed out with a second round of open spaces. I proposed one around using CI/CD for your infrastructure and config management, since I really want to improve that aspect of my own work. This led to a great discussion and lots of useful takeaways. Including that almost no one is doing this as heavily as they’d like, which is either reassuring or slightly terrifying when you consider the brands making this confession 😉 There was also a roundtable on burnout in the DevOps community. What it looks like, and how to reduce it, both from the receiving end and as a manager. This is a hot and very important topic in the community right now, and I was really glad to see it addressed here. Paul Reed took the lead on this one, and guided the conversation in a way that directly addressed issues the session members brought up. Bravo.

Summary

This was easily the best professional conference I’ve attended. Almost all of the talks–full-length and lightning–were excellent, and the open spaces really brought it to a higher level. It’s so cool to be able to sit down and talk with people you admire personally and professionally and have a friendly discussion about stuff you both care about deeply. And “discussion” is definitely the word; I never felt like anyone was preaching or talking at me. They wanted to hear my thoughts, too.

There were a couple moments where I did feel like things went off the rails. One talk was technical in the extreme, to the point of large portions literally being code read aloud off GitHub. The high level ideas were very interesting, but the presentation got a bit too far down in the weeds. And another was very obviously “here is my company’s product and why you should use it”, which really stood out since it was the only such talk of the whole conference. This was in stark contrast to my experience at the Juno OpenStack Summit, which was much, much more commercial. So kudos to the organizers for vetting the speakers so well. If I could make one suggestion for next year, it would be to have fewer, longer open spaces. Almost every single time, the “5 minutes left!” call came just as we were starting to gel as a group and make serious headway.

(What a popular Open Space looks like. Many were much more intimate than this.)

All in all, it was an awesome event from top to bottom. The organizers did an outstanding job, the presentations were interesting, I learned to stop worrying and love the Open Space, and FORTRUST was a gracious host (even if the scowling guards in fatigues made more than a few people wonder if they were entering the right place–thankfully by day 2 they seemed to have gotten the OK to cut loose and have fun with us!). The organizers also did a great job including women and minority speakers, which was awesome to see.

I’d recommend the DevOps Days events to anyone, and DevOps Days Rockies in particular. I can’t wait to come back next year. Maybe I’ll even dip my toe in with a lightning talk?

J. Paul Reed getting taken down by a FORTRUST security guard

(Pic is all in good fun from @soberbuildeng… “First #DevOpsDays I’ve seen w/ an armed presence; they subdue shady characters… (Thanks, @FORTRUST, for hosting!)”)

New Tool: Docker RPM Builder

My employer is a CentOS shop, so we maintain a library of homegrown RPM’s. We use these for a number of purposes, from packaging up software that doesn’t ship with its own RPM to rebuilding open source apps with custom patches applied. Historically, though, these… haven’t exactly been built and managed using best practices. RPM’s were built ad-hoc on any available machine. Source RPM’s were usually (but not always) dumped onto a file server, and could be a nightmare to rebuild in the future if there were a lot of dependencies.

I’ve been looking into tools to help our Ops team come into the 21st century with respect to software building and packaging. Running it all through Jenkins CI is a no-brainer. But what would do the heavy lifting of actually building the RPM’s? Conveniently, just as I was starting to explore this, the DevOps Weekly Newsletter came to the rescue! As an aside, DevOps Weekly is awesome and I highly encourage you to check it out.

That week’s issue highlighted a tool called docker-rpm-builder by Alan Franzoni. It leverages Docker to perform the RPM builds. Given a base image and an appropriate Dockerfile, the tool quickly installs all dependencies and runs the build inside of a container, spitting out a finished RPM in just a few seconds. This saves you from the joys of managing all of the build dependencies (and their inevitable conflicts), or needing to run dozens of VM’s each customized for an RPM’s specific needs.
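
I haven’t yet mapped this to docker-rpm-builder’s own commands, so don’t take the following as its actual interface. But the general pattern it automates looks roughly like this hand-rolled sketch (the spec file name and directory layout are hypothetical):

# Build an RPM inside a throwaway CentOS container so build deps never touch the host
docker run --rm -v "$PWD":/rpmbuild centos:7 bash -c '
  yum install -y rpm-build yum-utils &&
  yum-builddep -y /rpmbuild/SPECS/myapp.spec &&
  rpmbuild --define "_topdir /rpmbuild" -ba /rpmbuild/SPECS/myapp.spec
'

The finished package lands under RPMS/ in the mounted directory, and the container (with all of its build dependencies) disappears when the build is done.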

I’m only just getting started with docker-rpm-builder, but it looks quite slick. As I work with it more seriously, I plan to post some hands-on tutorials and report on how it’s worked out for taming our Ops build environment mess.

If you have any experience with the tool, or have tackled this challenge before, I’d love to hear about it.

What Ops Can Learn from Agile and CI/CD

A Conversation

I was chatting recently with a fellow Ops engineer at work about a project he was wrapping up. This colleague had written a large chunk of code for our config management tools to manage a new class of system. However, he was reluctant to merge his branch into the main codebase and release it to production. What if it doesn’t work properly, or behaves in an unexpected way? Just last week, someone pushed a bad change to config management and it nearly took down the whole site! Yes, better to leave the code disabled. If a change needs to be made, it can be done manually and then copied by hand into config management so it is not lost.

His proposal reminded me of the Waterfall model for software delivery and the extreme risk-aversion of traditional Operations teams. In a DevOps world, however, these practices don’t fly. In order to compete, businesses need the ability to put new features and bug fixes in front of users as quickly as possible. Without compromising quality. Development teams figured this out first, and came up with several techniques for achieving these goals. Let’s review a couple, and then look at how Operations can learn from them as well.

Continuous Integration

In one traditional development model, everyone works on a personal, long-lived “feature branch” which is only merged back into the main codebase (“trunk”) much later. Unfortunately, this can lead to all sorts of problems. One developer’s work renames a class that another’s branch relies on. Or each one introduces new contradictory config options. In any case, these integration problems cannot be caught until the end of the release cycle when all of the feature branches are merged. Bugs, crunch time and a frantic scramble to resolve conflicts ensue.

Continuous Integration (CI) is a newer workflow where changes are merged into the trunk very frequently. Some teams even forego the use of branches completely, doing all work on the trunk. You might think this would cause extreme chaos, with many people all working on the same code at the same time. And without new safeguards, you’d be right. Teams practicing CI run a centralized build server such as Jenkins, which performs frequent automated builds and tests. If the build fails, the source tree is closed to further commits until the problems are fixed. In this way, every developer is working on the same code base and any integration problems are caught early. This process is only as robust as the tests themselves, of course, so some up front work writing a battery of useful tests is critical. The payoff comes in the form of quicker releases with fewer defects and lower stress.

Continuous Delivery

Continuous Delivery (CD) takes the concept of CI even further. Once you adopt CI, your code trunk is always in a state where it can be built without errors and pass basic tests. But traditionally, even a team that practices CI might only push a public release once a year. There’s so much packaging and testing to be done, not to mention the Ops work to actually deploy the code. Who has time to do that more often?

In the fast-paced world of web applications and SaaS offerings, the answer better be “you”. Rather than batching up changes for months or years, CD strives to get that code to users as quickly as possible. This is done by relentlessly automating the build and deploy process, working from a foundation of extensive automated tests (seeing a pattern yet?). The same build server that you set up for CI is extended to package up each finished build into a fully deployable release bundle. And, if all tests pass, it will automatically deploy that bundle to a staging environment for final approval–or even deploy straight to production! Building software this way has a number of benefits. When changes are delivered in small, easily understood batches, they’re simpler to debug when problems arise. And because the code is fresh in the developer’s mind, they’ll have less trouble coming up with a fix. It also gets the fruits of your labor out to users sooner. New features and fixes that sit undeployed for a year benefit nobody. With CD, as soon as the work is done, it can be deployed and start making your customers happy.

Putting a little Dev in your Ops

With those ideas in mind, let’s circle back to my conversation at work. My coworker had developed a batch of code in isolation from the rest of the team. It was “done” in his view, but it had not been well tested, to the point where he was afraid to merge it into trunk or deploy it to production. Call me crazy, but I have a hard time calling code “done” if you can’t or won’t run it! What lessons can we apply from CI/CD and how they improve development? We’ll take his concerns one at a time.

“What if it doesn’t work?” I can certainly appreciate not wanting to run untested code in production, but not running it at all is not the solution. CI/CD advocate rapidly iterating in small batches, testing each change as you go. In this way you gain confidence that you’ve built a very solid foundation where every piece works. Test early and often before pushing your changes live. Ideally, you can find a way to automate those tests. Peer review from your team is another great tool.
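
In a Salt shop, for example, even the built-in dry-run mode goes a long way toward building that confidence before a change is merged (the target glob here is hypothetical):

# Show what the full highstate would change on the staging boxes, without applying anything
salt 'staging-web*' state.highstate test=True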

“What if I make a bad change?” This comes back to testing and confidence again. Development is churning out high quality features at breakneck speed, and it’s Ops’ job to match them. Trust your tests and automation, just as dev does. If those aren’t up to the task, it’s time to start beefing them up. If you’re not comfortable deploying your own code, how can you expect to do it for others?

“I’ll write the code but never merge the branch into trunk or deploy it.” Hoo boy. What do you call the polar opposite of CI/CD? The maxim “incorrect documentation is worse than no documentation” applies here. With missing docs, at least you know where you stand. But bad documentation is actively misleading and can do great harm when you follow it. Unmerged, untested code works the same way. Someone will eventually find and run it–with unpredictable results. You’ve also burned time on work that is not delivering any value. At best, it’s wasted effort. At worst, a time bomb waiting to go off. This configuration living off to the side is like a Waterfall developer’s feature branch. Isolated and unused, it’s just waiting to cause problems if and when it is finally merged and deployed.

“I’ll make sure to mirror every manual change back into config management.” …until you don’t. Nobody is perfect, and you are eventually going to miss something. Your config is now inaccurate, and you won’t find out until the server dies years later. Someone dutifully provisions a new one using the saved config, but now the service is behaving strangely because it is not set up correctly. Good luck tracking down that crucial missing change. This is analogous to a developer refusing to write or run any automated tests because they tested by hand and “it worked on my machine”. I think everyone’s heard that line before. Once again, trust your automation and leave the human error out of it.

Wrap Up

Development teams have reinvented themselves with Agile techniques, Continuous Integration and Continuous Delivery. This allows them to write code with unprecedented speed and without compromising quality. Thankfully, many of those same lessons are directly applicable to Ops. Test everything. Once a process is well defined, automate relentlessly to ensure it’s done right every time. Work in small, easily digestible iterations and deploy them frequently. If a process is slow or painful, focus your efforts on that bottleneck first.

Modern system administration is described as “infrastructure as code”, and that’s not just a catch phrase. This type of work closely resembles software development, and there’s a large body of best practices that Ops can leverage to improve the service we deliver. Embrace that knowledge. Maybe even ask your favorite developer over lunch about how and why they use CI/CD. Dev and Ops collaborating… what’s the worst that could happen?

Are you using CI or CD in the field, whether it be in Dev or Ops? How’s it working out for you? I’d love to hear your comments.

If you couldn’t tell, I find this topic fascinating. In a future post I plan to talk in detail about tools and processes for automating tests of your infrastructure and configs.

Load Balance All The Things

Load Balancing Basics

If you’ve done much work in Operations, you’ve probably encountered a load balancer. This dedicated network device sits between clients and pools of servers, spreading the incoming traffic between them to achieve a greater scale than any one server could handle alone. Perhaps the most obvious use case is web servers. A popular web site might get many millions of hits every day. There’s no way that one server, even a very expensive one, could stand up to that. Instead, many inexpensive servers are placed behind the load balancer and the requests are spread evenly among them. In a well-written web application, any server can handle any request. So this process is transparent to the user. They simply browse your site as they normally would, with no hint that each page they view might be returned by a different server.

There are other benefits, too. Hardware fails, software has bugs, and human operators make mistakes. These are facts of life in Ops, but load balancers can help. If you “overbuild” your pool with extra servers, your service can survive losing several machines with no impact to the user. Likewise, you could take them down one at a time for security patching or upgrades. Or deploy a new build of your application to only 5% of your servers as a smoke test or “canary” for catastrophic failures before rolling it out site-wide.

If your app needs 5 web servers to handle your peak workload, and you have 6 in the pool, you have 1 server worth of headroom for failure. This is known as “N + 1” redundancy, and is the bare minimum you should strive for when supporting any production service. Whether you want even more spare capacity depends on the marginal cost of each additional server vs the expense of an outage. In the age of virtual machines, these extra boxes may be very cheap indeed.

There are many options available for load balancing, both hardware and software. On the hardware side, some popular (and often extremely expensive) names are F5 BIG-IP, Citrix NetScaler, and Coyote Point. In software, the best known is probably HAProxy, although nginx and Apache have some limited load balancing services, too. And if you’re a cloud native, Amazon’s Elastic Load Balancer (ELB) product is waiting for you.

Load Balancing Internal Services

Load balancing public services is important. However, there are likely many internal services that are equally crucial to your app’s uptime. These are sometimes overlooked. I certainly didn’t think of them as candidates for load balancing at first. But to your users, an outage is an outage. It doesn’t matter whether it was because of a failure on a public web server or an internal DNS server. They needed you, and you were down.

Some examples of services you might load balance are DNS, SMTP for email, ElasticSearch queries and database reads. These might be able to run on a single machine from a sheer horsepower perspective, but load balancing them still gives you the advantages of redundancy to guard against failure and allow for maintenance.

You might even apply these techniques to your company’s internal or enterprise IT systems. If employees need to authenticate against an LDAP directory to do their jobs, it would be wise to load balance several servers to ensure business doesn’t grind to a halt with one failed hard drive.
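
To make that concrete, here’s roughly what a TCP listener for a pair of LDAP servers looks like in HAProxy (the addresses and timeouts are invented for illustration):

# Minimal sketch: spread LDAP traffic across two directory servers
listen ldap
    bind *:389
    mode tcp
    balance roundrobin
    option tcp-check
    timeout connect 5s
    timeout client  1m
    timeout server  1m
    server ldap1 10.0.0.21:389 check
    server ldap2 10.0.0.22:389 check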

Takeaway

Load balancing is a powerful tool for improving the performance, resiliency and operability of your services. It’s used as a matter of course on public systems, but give a thought to what it can do for your lower-profile ones, too.

That’s not to say that it’s a cure-all. Some services just aren’t suited to it, such as database writes (without special software designed for multiple masters). Or batch jobs that pull their work from a central queue. Other applications might not be “stateless” and misbehave if the user is routed to a different server on each request. As always, use the right tool for the job!

Making Varnish 4 and SELinux Play Nice

Why do you hate productivity?

When standing up this blog, I chose CentOS 7 as the underlying OS to get some experience with systemd and other new tech in Red Hat’s latest release. With Red Hat, of course, comes the specter of SELinux. There’s an attitude among some Linux admins that SELinux is just a pain in the ass that prevents you from getting work done, and the “fix” is to disable it outright. I get it. It’s extremely confusing when something as simple as trying to access a file you appear to have read permissions on fails with a misleading error message. Or a service fails to start for no apparent reason.

But configured properly, SELinux can give you a real leg up when it comes to security. With a new exploit or high-profile corporate breach in the news every week these days, you don’t need to be a Level 10 UNIX Wizard to see the value in another layer of protection. For 2015, I’ve decided to suck it up, eat my veggies and learn to love (or at least deal with) SELinux.

Configuring SELinux for Varnish 4

I chose to install Varnish as a caching layer in front of Apache, for the day my little blog makes the front page of Reddit. It’s going to happen any minute now. Just watch. And naturally, since I was starting from scratch anyway, I installed the latest version (4.0.2 as of this post). Apparently Varnish 3.x is properly configured for SELinux out of the box, but that is not the case for the new hotness in Varnish 4. You can find some gory details in the Red Hat bug tracker, but basically a code change in Varnish 4 makes it require access to a few new system calls which have not been whitelisted in the CentOS 7 SELinux packages.

This issue shows itself when you attempt to start up Varnish, and it fails. Checking on why, you can see there’s a strange permissions problem. SELinux rears its ugly (sorry, delightful) head.

# systemctl status varnish
<snip>
varnishd[20364]: Failed to set permissions on ./vcl.UdrgPE5O.so: Operation not permitted

You can use the audit2allow tool to parse the SELinux logs in /var/log/audit/ and see not only why something was blocked, but also how to fix it. Here, we’ll use the -M flag, which generates a module file that you can then import into SELinux.

# grep varnishd /var/log/audit/audit.log | audit2allow -M varnishd2
# semodule -i varnishd2.pp
# systemctl restart varnish
# systemctl status varnish
varnish.service - Varnish a high-perfomance HTTP accelerator
   Loaded: loaded (/usr/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Sun 2015-02-01 04:09:15 UTC; 100ms ago

That’s it! If everything went correctly, varnish should now be running. The first command finds all Varnish activities that were blocked, and feeds them to audit2allow with the -M flag. This generates a SELinux module named varnishd2.pp which can be loaded to allow all of the calls that had previously been blocked. I’ve named it varnishd2 because there’s already a varnishd module shipped with CentOS. However, it’s the old Varnish 3.x edition that doesn’t work with version 4.
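
If you want to double-check that the new module actually loaded, semodule can list it:

# semodule -l | grep varnishd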

Was that so bad? (Ok, it was kind of bad). But now that you know this pattern, you can work your way around a lot of SELinux issues quickly the next time they crop up. Hopefully sometime down the line the SELinux packages will be updated, making this step unnecessary for Varnish 4.

Salt Multi-Master Bug in 2014.7

A word of warning about the Salt 2014.7 series if you run in multi-master mode. This past week, I tried rolling out Salt 2014.7.1 (aka Helium) to our production environment at work. The 2014.7 line has a lot of exciting new features and fixes, so we’ve been eager to get it out. Having been bitten by bugs in the past, though, we wanted to wait for the first point release to land. That recently dropped, and after a few days kicking the tires in our test environment I was confident that the upgrade would go smoothly.

Sadly, it was not to be. We run a check in our Zabbix monitoring system periodically to make sure that every master is able to make a simple test.ping connection to every minion. Our production setup uses four masters in multi-master mode for redundancy. Shortly after completing the upgrade of all masters and minions, this check began to fail. Not in a very consistent way, but each master had a different subset of minions it could not reach. Normally when a minion loses touch with a master, it’s fixable by simply restarting the salt-minion service, but that did not work here. It just turned into a game of whack-a-mole where the minion would become unreachable from a different master instead.
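
For the curious, that check essentially boils down to running something like this from each master and flagging any minion that fails to answer:

# Every minion should reply True; silence means trouble
salt --timeout=30 '*' test.ping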

Diving into the Salt issue tracker, I came upon issues 18322 and 19932 which were filed against 2014.7 and sounded very familiar. They both indicate minions failing to respond to commands from masters, seemingly at random. The common thread was use of multi-master mode. One user suggested a workaround of setting multiprocessing: False on the minions. I found that improved matters–fewer minions were randomly failing to respond–but did not fix it completely. It also seems that this issue is fixed in the upcoming 2015.2 “Lithium” release, but that is a long ways off and the fix is not easily backported.
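
For reference, that partial workaround is a one-line change to each minion’s config, followed by a salt-minion restart:

# /etc/salt/minion
multiprocessing: False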

Multi-master mode explained the nagging question of why testing had gone off without a hitch. Our test environment only uses one master, and would not have triggered the bug(s). So, shame on me for that! It’s a best practice to test in a configuration as close to production as possible, and I will be fixing that soon. In a virtualized world, there’s minimal cost to spinning up a second master. And the time spent will certainly be less than I spent chasing down this bug.

In the end, I was forced to roll back the upgrade from all of our servers, which was not a fun job. But at the end of it, Salt was running smoothly once again and our monitoring system was clear. I’ll continue to work with the SaltStack team to find a fix. They’re great folks and very committed to both their product and the community, so I am confident it will happen sooner rather than later.

Please leave a comment if you’ve encountered this issue, or have a workaround.

Introduction to Salt-cloud (Part 2)

In part 1 of this series, we got a 10,000 foot view of salt-cloud. What it is, why you might want to use it, and the pieces that make it up. Now, it’s time to get our hands dirty and boot some VM’s.

The salt-cloud Command

Once you’ve installed the appropriate packages for your operating system, you should have the salt-cloud utility available. This CLI app is your interface to salt-cloud. For some examples of what it can do, check out the abridged version of the help output below (from salt-cloud 2014.7.1 on OS X):

jhenry:~ jhenry$ salt-cloud -h
Usage: salt-cloud

Options:
  -c CONFIG_DIR, --config-dir=CONFIG_DIR
                        Pass in an alternative configuration directory.
                        Default: /etc/salt

  Execution Options:
    -p PROFILE, --profile=PROFILE
                        Create an instance using the specified profile.
    -m MAP, --map=MAP   Specify a cloud map file to use for deployment. This
                        option may be used alone, or in conjunction with -Q,
                        -F, -S or -d.
    -d, --destroy       Destroy the specified instance(s).
    -P, --parallel      Build all of the specified instances in parallel.
    -u, --update-bootstrap
                        Update salt-bootstrap to the latest develop version on
                        GitHub.

  Query Options:
    -Q, --query         Execute a query and return some information about the
                        nodes running on configured cloud providers
    -F, --full-query    Execute a query and return all information about the
                        nodes running on configured cloud providers
    --list-providers    Display a list of configured providers.

  Cloud Providers Listings:
    --list-locations=LIST_LOCATIONS
                        Display a list of locations available in configured
                        cloud providers. Pass the cloud provider that
                        available locations are desired on, aka "linode", or
                        pass "all" to list locations for all configured cloud
                        providers
    --list-images=LIST_IMAGES
                        Display a list of images available in configured cloud
                        providers. Pass the cloud provider that available
                        images are desired on, aka "linode", or pass "all" to
                        list images for all configured cloud providers
    --list-sizes=LIST_SIZES
                        Display a list of sizes available in configured cloud
                        providers. Pass the cloud provider that available
                        sizes are desired on, aka "AWS", or pass "all" to list
                        sizes for all configured cloud providers

I’ve trimmed out some poorly documented options to focus on what we’ll use in this post (dumpster diving through the source code to determine what some of those options do may turn into a future article).

As you can see, most salt-cloud actions require either a profile or a map (remember those from part 1?) to execute. Given nothing but a profile (-p) or map (-m), salt-cloud will attempt to boot the named instance(s) in the associated provider’s cloud. Paired with destroy (-d), it will–wait for it–terminate the instance. With -Q or -F, it will query the provider for running instances that match the profile or map and return information about their state. The final set of --list options may be used to view the various regions, images and instance sizes available from a given provider. Handy if you regularly work with several different vendors and can’t keep them all straight.

Configuring a Provider

Time for some concrete examples. Let’s set up Amazon EC2 as a salt-cloud provider, using a config very much like the one that booted the instance where my blog lives.

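# A provider definition typically lives in its own file, e.g. /etc/salt/cloud.providers.d/ec2.conf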
ec2-dealwithit:
  id: 'Your IAM ID'
  key: 'Your IAM key'
  keyname: centos
  private_key: ~/.ssh/centos.pem
  securitygroup: www
  provider: ec2
  del_root_vol_on_destroy: True
  del_all_vols_on_destroy: True

I’ve stripped out a couple advanced options, but that’s the gist. It’s plain YAML syntax, like all Salt config. To break it down:

ec2-dealwithit: This is an arbitrary ID that serves as the name of your provider. You’ll reference this in other configs, such as profiles (see next section).

id and key: your AWS credentials, specifically an IAM id:key pair. Pretty self explanatory.

keyname and private_key: The name of an SSH keypair you have previously configured at EC2, and the local path to the private key for that same keypair. This is what allows salt-cloud to log into your freshly booted instance and perform some bootstrapping.

securitygroup: controls which security group (sort of a simple edge firewall, if you are not familiar with EC2) your instances should automatically join.

provider maps to one of salt-cloud’s supported cloud vendors, so it knows which API to speak.

del_root_vol_on_destroy and del_all_vols_on_destroy: determine what should happen to any EBS volumes created alongside your instances. In my case, I want them cleaned up when my instances die so I don’t end up paying for them forever. But YMMV, be sure you’re not going to be storing any critical data on these volumes before you configure them to self-destruct! Confusingly, you need to specify both if you want all EBS volumes to be destroyed. Some instances, such as the newer t2.micro, automatically create an EBS root volume on boot. Setting del_all_vols does not destroy this volume. It only destroys any others you may later attach. So again, consider the behavior you want and set these appropriately. The default behavior depends on which AMI you’re using for your instance, so it’s best to set these explicitly.

Configuring a Profile

Armed with your provider config, it’s time to create a profile. This builds on the provider and describes the details of an individual VM.

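# Profiles typically live in e.g. /etc/salt/cloud.profiles.d/web.conf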
ec2-www:
  provider: ec2-dealwithit
  image: ami-96a818fe
  size: t2.micro
  ssh_username:
    - centos
  location: us-east-1
  availability_zone: us-east-1b
  block_device_mappings:
    - DeviceName: /dev/sda1
      Ebs.VolumeSize: 30
      Ebs.VolumeType: gp2

Once again, a fairly straightforward YAML file.

ec2-www: An arbitrary identifier used to reference your profile in other configs or from the CLI.

provider: The name of a provider you’ve previously defined in /etc/salt/cloud.providers.d/. In this case, the one we just set up earlier.

image: An AMI image ID which will be the basis for your VM.

size: The size or “flavor” for your instance. You can print a list of available sizes for a given provider with a command like this: salt-cloud --list-sizes ec2-dealwithit

ssh_username: The user that the salt-bootstrap code should use to connect to your instance, using the SSH keypair you defined earlier in the provider config. This is baked into your AMI image. If you work with several images that use different default users, you can list them all and salt-cloud will try them one by one.

location and availability_zone: The region and AZ where your instance will live (if you care). You can print a list of locations for a provider with salt-cloud --list-locations ec2-dealwithit.

block_device_mappings: Create or modify an EBS volume to attach to your instance. In my case, I’m using a t2.micro instance which comes with a very small (~6GB) root volume. The AWS free tier allows up to 30GB of EBS storage for free, so I opted to resize the disk to take advantage of that. I also used the gp2 (standard SSD) volume type for better performance. You can map as many EBS volumes as you like, or leave it off entirely if it’s not relevant to you.

Configuring a Map

The final config file–which is optional–that I want to touch on is a map. Remember, a map lays out multiple instances belonging to one or more profiles, allowing you to boot a full application stack with one command. Here’s a quick example:

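# Maps typically live in e.g. /etc/salt/cloud.maps.d/demo.map (passed to salt-cloud with -m later on)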
ec2-www:
  - web1
  - web2
  - staging:
      minion:
        master: staging-master.example.com

ec2-www: This is the name of a profile that you’ve previously defined. Here, I’m using the ec2-www profile that we created above.

web1, web2, ...: These are the names of individual instances that will be booted based on the parent profile.

staging: Here, I’m defining an instance and overriding some default settings. Because I can! Specifically, I changed the minion config that salt-bootstrap will drop onto the newly booted host in /etc/salt/minion. For example, you could set up a staging server where you test code before deploying it fully. This server might be pointed at a different salt-master to keep it segregated from production. Nearly any setting from the Core, Provider and Profile level can be overwritten to suit your needs.

Making It Rain

Ok, I had to get one bad cloud joke in. Lighten up. Anyway, now that we’ve laid out our config files, we can go about the business of actually managing our cloud(s).

salt-cloud -p ec2-www web1

Boom! You just booted a VM named web1 based on the ec2-www profile we created earlier. If it seems like it’s taking a long time, that’s because the salt-bootstrap deploy script runs on first boot, loading salt onto the new minion for management. Depending on the log level you’ve configured in the core config (/etc/salt/cloud by default), salt-cloud will either sit silently and eventually report success, or spam your console with excruciating detail about its progress. But either way, when it’s done, you’ll get a nice YAML-formatted report about your new VM.

salt-cloud -a reboot web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be actioned with "reboot":
  web1

Proceed? [N/y] y
... proceeding
[INFO    ] Complete
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            Reboot:
                Complete

In this example, we’re using the -a (action) option to reboot the instance we just created. Salt-cloud loops through all of your providers, querying them for an instance with the name you provide. Once found, it sends the proper API call to the cloud vendor to reboot the instance.

salt-cloud -p ec2-www -d web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be destroyed:
  ec2-www:
    ec2:
      web1

Proceed? [N/y] y
... proceeding
[INFO    ] Destroying in non-parallel mode.
[INFO    ] [{'instanceId': 'i-e7800116', 'currentState': {'code': '48', 'name': 'terminated'}, 'previousState': {'code': '80', 'name': 'stopped'}}]
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            currentState:
                ----------
                code:
                    48
                name:
                    terminated
            instanceId:
                i-e7800116
            previousState:
                ----------
                code:
                    80
                name:
                    stopped

Now that we’re done playing, I’ve deleted the instance we just booted. Easy come, easy go.

salt-cloud -m /etc/salt/cloud.maps.d/demo.map -P

In this last example, we’re booting the map we created earlier. This should bring up 3 VM’s: web1, web2, and staging. The -P option makes this happen in parallel rather than one at a time. The whole point of working in the cloud is speed, so why wait around?

Wrapping Up

That pretty well covers the basics of salt-cloud. What it is, how to configure it, and how to turn those configs into real, live VM’s at your cloud vendor(s) of choice. There’s certainly more to salt-cloud than what I’ve covered so far. The official docs could also stand some improvement, to put it mildly. So I definitely plan to revisit salt-cloud in future posts. I’m already planning one to talk about deploy scripts such as the default salt-bootstrap.

If you’re wondering “why go to all this trouble writing configs just to boot a dang VM?”, it’s a fair point. But there are reasons! One major benefit of salt-cloud is the way it abstracts away vendor details. You write your configs once, and then use the same CLI syntax to manage your VM’s wherever they may live. It also gives you the advantages of infrastructure as code. You can keep these configs in version control systems like git. You can see at a glance what VM’s should exist, and how they should be configured. It gives you a level of consistency and repeatability you don’t get from ad-hoc work at the command line or a web GUI. These are all basic tenets of good, modern system administration.

I hope that this series was helpful! Please feel free to leave a comment with any questions, corrections or discussion.