DevOps Days Rockies Recap

I was fortunate enough to attend the first ever DevOps Days Rockies event this past week at Denver’s FORTRUST Data Center. It was an amazing two days packed full of insightful talks, open spaces, professional networking, and fun. I wanted to write up a recap of the event from my perspective. This is going to be a long post even just hitting the highlights that really stood out to me.

Major thanks to Photobucket for sponsoring me, a fellow sysadmin, and one of our DBAs! I’m the nerd on the right in this photo.

What is DevOps Days anyway?

DevOps Days is a conference series that sprang up in 2009 in Belgium. It has since spread all over the world. And I do mean “all over”. There have been events from Tel Aviv to Bangalore to Chicago to Melbourne and, finally, in my own backyard in Denver! The goal is to spread the love of DevOps and help anyone interested in that topic hone their craft. What DevOps means to me could and probably will be a whole separate post. But for now, let’s just call it breaking down barriers between a company’s development and operations teams, and applying lessons from one side of the house (whether that be CI/CD, source control, config management, cloud/virtualization/containers, etc.) to benefit the other.

DevOps Days are organized by local volunteers. Here, this was mostly members of the Denver DevOps Meetup Group, which I can testify is an awesome bunch of folks. The events span two days, with standard conference presentations each morning. After lunch, there’s a series of “lightning talks” which are quick-hitting five-minute affairs great for introducing the audience to a new topic or helping a new speaker get their feet wet. Finally, the afternoon is devoted to “open spaces”, which are the real soul of DevOps Days. Anyone in the audience can submit a topic, and then everyone votes on what they’d like to discuss. Groups break off and form around each topic, and you’ll move through 3-4 of these face-to-face sessions for the rest of the day. Based on a show of hands, this format was new to 90% of attendees, myself included. I wasn’t sure what to expect, but any worries were quickly put to rest. It’s just you and other like-minded folks sitting down and having a discussion about something you’re all passionate about. Usually with several conference presenters sitting next to you. So it’s an outstanding opportunity to pick their brains and ask questions you might not be able to during their main talk. And to throw out ideas of your own.

Photobucket attendees

(The DevOps Days Rockies 2015 Organizers!)

Day 1

Thursday kicked off with a keynote by the Sober Build Engineer himself, J. Paul Reed. It was a good discussion of how you can’t simply copy the culture or practices of another organization wholesale, but have to select and adjust them to work in your own environment. I also learned that Paul grew up in Fort Collins where I currently live, which was some neat trivia. This was followed up by Matt Stratton from Chef talking about ways to manage your mental stack. This one really struck a chord with me since I love to drink from the learning fire hose, absorbing as many blogs, Twitter feeds, and podcasts as I can. It’s easy to burn so much time learning that you never actually get work done, and Matt had some suggestions for dealing with that problem. Now if only I can shut off Twitter long enough to implement them…

After lunch, we got our first taste of lightning talks. A couple that jumped out from Day 1 were Elizabeth Mintzas discussing DevOps Recruiting, and Joshua Timberman’s “Stop Demonizing curl|bash”.

Finally, we selected and broke out for open spaces. I sat in on a discussion of post-mortems involving Etsy’s Ryan Frantz, Joshua Timberman from Chef and Josh Nichols of GitHub, among others. Lots of really awesome and immediately actionable advice. I moved on to several interesting sessions, including a good conversation about Best Practices in Config Management. The conference’s other token SaltStack user was there, too! High-five, buddy.

The day was capped off by a happy hour–perhaps more like happy 8 hours, based on the Twitter pics coming in until midnight 🙂 As an introvert, I was wiped and skipped this. But that’s easily my biggest regret of the event, and I will suck it up and deal next time. Too much fun and networking to be had.

DevOps Days Rockies Foosball tournament at Wynkoop Brewery
(If the crowd looks thin, consider it was pushing midnight local time!)

Day 2

Friday picked right back up with four excellent presentations. Royce Haynes discussed self-managed and self-organized teams. What that means, why you’d try it, what works, and what doesn’t. Ryan Frantz followed that up with my favorite and most immediately useful talk: The Value of Alert Context. He highlighted Etsy’s open-source nagios-herald tool, which does all sorts of cool stuff to embed context and information directly into the alert email. So you can make a snap decision about whether that 3AM page can really wait until morning (spoiler: it usually can). I’m itching to implement something similar for our Zabbix monitoring system. Ryan also demonstrated his out-of-control harmonica skills, playing Mary Had a Little Lamb and Oh Susanna to wild applause. There was also a neat deep dive from Twitter’s Matt Getty on a bare metal provisioning system they wrote. It has a lot of similarities to a system we created here at Photobucket. Twitter’s is certainly slicker, but it was fun to see that we encountered the same problems and worked toward a similar solution.

We then broke for lunch, which was provided by three food trucks outside the venue. Very fun, and very delicious. Pictured: @medieval1 and his rad utilikilt leaning on the counter. Not pictured: Biker Jim’s goddam amazing hot dog cart. Want to eat wild boar, or reindeer, or (horror of horrors) a vegan dog? No worries, Jim’s got your back!

DevOps Days Rockies food trucks including Biker Jim's

My highlight of day 2 was the “ChatOps” panel. I thought this was going to be lame, since my past experience with chat bots has been something that sits in IRC and takes requests for Yo Mama jokes. But these guys blew that out of the water. Tons of ideas on how to put a chat bot to work as a self-service interface, with the added benefit of automatic public broadcast of what’s being done and historical records. I can’t wait to get to work implementing this in our company.

DevOps Days Rockies ChatOps panel including GitHub Hubot, Google Errbot and Lita

The conference closed out with a second round of open spaces. I proposed one around using CI/CD for your infrastructure and config management, since I really want to improve that aspect of my own work. This led to a great discussion and lots of useful takeaways. Including that almost no one is doing this as heavily as they’d like, which is either reassuring or slightly terrifying when you consider the brands making this confession 😉 There was also a roundtable on burnout in the DevOps community. What it looks like, and how to reduce it, both from the receiving end and as a manager. This is a hot and very important topic in the community right now, and I was really glad to see it addressed here. Paul Reed took the lead on this one, and guided the conversation in a way that directly addressed issues the session members brought up. Bravo.

Summary

All in all, it was easily the best professional conference I’ve attended. Almost all of the talks–full-length and lightning–were excellent, and the open spaces really brought it to a higher level. It’s so cool to be able to sit down and talk with people you admire personally and professionally and have a friendly discussion about stuff you both care about deeply. And “discussion” is definitely the word; I never felt like anyone was preaching or talking at me. They wanted to hear my thoughts, too.

There were a couple moments where I did feel like things went off the rails. One talk was technical in the extreme, to the point of large portions literally being code read aloud off GitHub. The high level ideas were very interesting, but the presentation got a bit too far down in the weeds. And another was very obviously “here is my company’s product and why you should use it”, which really stood out since it was the only such talk of the whole conference. This was in stark contrast to my experience at the Juno OpenStack Summit, which was much, much more commercial. So kudos to the organizers for vetting the speakers so well. If I could make one suggestion for next year, it would be to have fewer, longer open spaces. Almost every single time, the “5 minutes left!” call came just as we were starting to gel as a group and make serious headway.

What a DevOps Days Open Space looks like. Many were much more intimate than this
(What a popular Open Space looks like. Many were much more intimate than this.)

All in all, it was an awesome event from top to bottom. The organizers did an outstanding job, the presentations were interesting, I learned to stop worrying and love the Open Space, and FORTRUST was a gracious host (even if the scowling guards in fatigues made more than a few people wonder if they were entering the right place–thankfully by day 2 they seemed to have gotten the OK to cut loose and have fun with us!). The organizers also did a great job including women and minority speakers, which was awesome to see.

I’d recommend the DevOps Days events to anyone, and DevOps Days Rockies in particular. I can’t wait to come back next year. Maybe I’ll even dip my toe in with a lightning talk?

J. Paul Reed getting taken down by a FORTRUST security guard

(Pic is all in good fun from @soberbuildeng… “First #DevOpsDays I’ve seen w/ an armed presence; they subdue shady characters… (Thanks, @FORTRUST, for hosting!)”)

New Tool: Docker RPM Builder

Being a CentOS shop, we maintain a library of homegrown RPMs at my employer. We use these for a number of purposes, from packaging up software that doesn’t ship with its own RPM to rebuilding open source apps with custom patches applied. Historically, though, these… haven’t exactly been built and managed using best practices. RPMs were built ad hoc on any available machine. Source RPMs were usually (but not always) dumped onto a file server, and could be a nightmare to rebuild in the future if there were a lot of dependencies.

I’ve been looking into tools to help our Ops team come into the 21st century with respect to software building and packaging. Running it all through Jenkins CI is a no-brainer. But what would do the heavy lifting of actually building the RPMs? Conveniently, just as I was starting to explore this, the DevOps Weekly Newsletter came to the rescue! As an aside, DevOps Weekly is awesome and I highly encourage you to check it out.

That week’s issue highlighted a tool called docker-rpm-builder by Alan Franzoni. It leverages Docker to perform the RPM builds. Given a base image and an appropriate Dockerfile, the tool quickly installs all dependencies and runs the build inside of a container, spitting out a finished RPM in just a few seconds. This saves you from the joys of managing all of the build dependencies (and their inevitable conflicts), or needing to run dozens of VMs, each customized for an RPM’s specific needs.
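To give a flavor of the approach, here’s a hand-rolled sketch of the kind of workflow docker-rpm-builder automates. This is not the tool’s actual command line, and the spec file name and directory layout are invented for illustration:

# Assumes ./SPECS/myapp.spec and ./SOURCES/ exist in the current directory;
# finished packages land in ./RPMS/ on the host
docker run --rm -v "$PWD:/rpmbuild" centos:7 \
  bash -c 'yum install -y rpm-build yum-utils && \
           yum-builddep -y /rpmbuild/SPECS/myapp.spec && \
           rpmbuild --define "_topdir /rpmbuild" -ba /rpmbuild/SPECS/myapp.spec'

The container is thrown away afterward, so build dependencies never pile up on the host or conflict with another package’s requirements.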

I’m only just getting started with docker-rpm-builder, but it looks quite slick. As I work with it more seriously, I plan to post some hands-on tutorials and report on how it’s worked out for taming our Ops build environment mess.

If you have any experience with the tool, or have tackled this challenge before, I’d love to hear about it.

What Ops Can Learn from Agile and CI/CD

A Conversation

I was chatting recently with a fellow Ops engineer at work about a project he was wrapping up. This colleague had written a large chunk of code for our config management tools to manage a new class of system. However, he was reluctant to merge his branch into the main codebase and release it to production. What if it doesn’t work properly, or behaves in an unexpected way? Just last week, someone pushed a bad change to config management and it nearly took down the whole site! Yes, better to leave the code disabled. If a change needs to be made, it can be done manually and then copied by hand into config management so it is not lost.

His proposal reminded me of the Waterfall model for software delivery and the extreme risk-aversion of traditional Operations teams. In a DevOps world, however, these practices don’t fly. In order to compete, businesses need the ability to put new features and bug fixes in front of users as quickly as possible. Without compromising quality. Development teams figured this out first, and came up with several techniques for achieving these goals. Let’s review a couple, and then look at how Operations can learn from them as well.

Continuous Integration

In one traditional development model, everyone works on a personal, long-lived “feature branch” which is only merged back into the main codebase (“trunk”) much later. Unfortunately, this can lead to all sorts of problems. One developer’s work renames a class that another’s branch relies on. Or each one introduces new, contradictory config options. In any case, these integration problems cannot be caught until the end of the release cycle when all of the feature branches are merged. Bugs, crunch time and a frantic scramble to resolve conflicts ensue.

Continuous Integration (CI) is a newer workflow where changes are merged into the trunk very frequently. Some teams even forego the use of branches completely, doing all work on the trunk. You might think this would cause extreme chaos, with many people all working on the same code at the same time. And without new safeguards, you’d be right. Teams practicing CI run a centralized build server such as Jenkins, which performs frequent automated builds and tests. If the build fails, the source tree is closed to further commits until the problems are fixed. In this way, every developer is working on the same code base and any integration problems are caught early. This process is only as robust as the tests themselves, of course, so some up front work writing a battery of useful tests is critical. The payoff comes in the form of quicker releases with fewer defects and lower stress.
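As a rough sketch, the job a CI server runs on every commit doesn’t need to be fancy. Something along these lines is enough to get started, with the build and test commands swapped for whatever your project actually uses:

#!/bin/bash
# Minimal CI job sketch, run by the build server on every commit to trunk.
# "make build" and "make test" are placeholders for your real commands.
set -euo pipefail                     # abort (and fail the build) on the first error

git fetch origin
git checkout --force origin/master    # always build the current tip of trunk

make build                            # compile / assemble the code
make test                             # run the automated test suite

echo "Build and tests passed."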

Continuous Delivery

Continuous Delivery (CD) takes the concept of CI even further. Once you adopt CI, your code trunk is always in a state where it can be built without errors and pass basic tests. But traditionally, even a team that practices CI might only push a public release once a year. There’s so much packaging and testing to be done, not to mention the Ops work to actually deploy the code. Who has time to do that more often?

In the fast-paced world of web applications and SaaS offerings, the answer better be “you”. Rather than batching up changes for months or years, CD strives to get that code to users as quickly as possible. This is done by relentlessly automating the build and deploy process, working from a foundation of extensive automated tests (seeing a pattern yet?). The same build server that you set up for CI is extended to package up each finished build into a fully deployable release bundle. And, if all tests pass, it will automatically deploy that bundle to a staging environment for final approval–or even deploy straight to production! Building software this way has a number of benefits. When changes are delivered in small, easily understood batches, they’re simpler to debug when problems arise. And because the code is fresh in the developer’s mind, they’ll have less trouble coming up with a fix. It also gets the fruits of your labor out to users sooner. New features and fixes that sit undeployed for a year benefit nobody. With CD, as soon as the work is done, it can be deployed and start making your customers happy.
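Extending that idea, the same job can package a passing build and push it to a staging host automatically. Again, just a sketch; the hostname, paths and deploy script below are invented for illustration:

#!/bin/bash
# Continuous Delivery sketch: runs only after the CI build and tests pass.
# staging.example.com, /opt/releases and deploy.sh are placeholders.
set -euo pipefail

VERSION=$(git rev-parse --short HEAD)
tar czf "myapp-${VERSION}.tar.gz" build/

# Ship the bundle to staging and trigger the deploy there
scp "myapp-${VERSION}.tar.gz" deploy@staging.example.com:/opt/releases/
ssh deploy@staging.example.com "/opt/releases/deploy.sh myapp-${VERSION}.tar.gz"

# Quick smoke test; a failed request fails the pipeline
curl -fsS http://staging.example.com/healthcheck > /dev/null
echo "Deployed myapp-${VERSION} to staging."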

Putting a little Dev in your Ops

With those ideas in mind, let’s circle back to my conversation at work. My coworker had developed a batch of code in isolation from the rest of the team. It was “done” in his view, but it had not been well tested, to the point where he was afraid to merge it into trunk or deploy it to production. Call me crazy, but I have a hard time calling code “done” if you can’t or won’t run it! What lessons can we apply from CI/CD and how they improve development? We’ll take his concerns one at a time.

“What if it doesn’t work?” I can certainly appreciate not wanting to run untested code in production, but not running it at all is not the solution. CI and CD both advocate rapidly iterating in small batches, testing each change as you go. In this way you gain confidence that you’ve built a very solid foundation where every piece works. Test early and often before pushing your changes live. Ideally, you can find a way to automate those tests. Peer review from your team is another great tool.
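With Salt, for example, a dry run reports exactly what a state would change on a target without touching anything, which is a cheap way to build that confidence before you merge. The state and target names here are stand-ins for your own:

# Show what the 'webserver' state would change on the web boxes, without applying it
salt 'web*' state.sls webserver test=True

# Or run the full highstate in test mode against a single canary minion first
salt 'web01.example.com' state.highstate test=True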

“What if I make a bad change?” This comes back to testing and confidence again. Development is churning out high quality features at breakneck speed, and it’s Ops’ job to match them. Trust your tests and automation, just as dev does. If those aren’t up to the task, it’s time to start beefing them up. If you’re not comfortable deploying your own code, how can you expect to do it for others?

“I’ll write the code but never merge the branch into trunk or deploy it.” Hoo boy. What do you call the polar opposite of CI/CD? The maxim “incorrect documentation is worse than no documentation” applies here. With missing docs, at least you know where you stand. But bad documentation is actively misleading and can do great harm when you follow it. Unmerged, untested code works the same way. Someone will eventually find and run it–with unpredictable results. You’ve also burned time on work that is not delivering any value. At best, it’s wasted effort. At worst, a time bomb waiting to go off. This configuration living off to the side is like a Waterfall developer’s feature branch. Isolated and unused, it’s just waiting to cause problems if and when it is finally merged and deployed.

“I’ll make sure to mirror every manual change back into config management.” …until you don’t. Nobody is perfect, and you are eventually going to miss something. Your config is now inaccurate, and you won’t find out until the server dies years later. Someone dutifully provisions a new one using the saved config, but now the service is behaving strangely because it is not set up correctly. Good luck tracking down that crucial missing change. This is analogous to a developer refusing to write or run any automated tests because they tested by hand and “it worked on my machine”. I think everyone’s heard that line before. Once again, trust your automation and leave the human error out of it.

Wrap Up

Development teams have reinvented themselves with Agile techniques, Continuous Integration and Continuous Delivery. This allows them to write code with unprecedented speed and without compromising quality. Thankfully, many of those same lessons are directly applicable to Ops. Test everything. Once a process is well defined, automate relentlessly to ensure it’s done right every time. Work in small, easily digestible iterations and deploy them frequently. If a process is slow or painful, focus your efforts on that bottleneck first.

Modern system administration is described as “infrastructure as code”, and that’s not just a catch phrase. This type of work closely resembles software development, and there’s a large body of best practices that Ops can leverage to improve the service we deliver. Embrace that knowledge. Maybe even ask your favorite developer over lunch about how and why they use CI/CD. Dev and Ops collaborating… what’s the worst that could happen?

Are you using CI or CD in the field, whether it be in Dev or Ops? How’s it working out for you? I’d love to hear your comments.

If you couldn’t tell, I find this topic fascinating. In a future post I plan to talk in detail about tools and processes for automating tests of your infrastructure and configs.

Load Balance All The Things

Load Balancing Basics

If you’ve done much work in Operations, you’ve probably encountered a load balancer. This dedicated network device sits between clients and pools of servers, spreading the incoming traffic between them to achieve a greater scale than any one server could handle alone. Perhaps the most obvious use case is web servers. A popular web site might get many millions of hits every day. There’s no way that one server, even a very expensive one, could stand up to that. Instead, many inexpensive servers are placed behind the load balancer and the requests are spread evenly among them. In a well-written web application, any server can handle any request. So this process is transparent to the user. They simply browse your site as they normally would, with no hint that each page they view might be returned by a different server.

There are other benefits, too. Hardware fails, software has bugs, and human operators make mistakes. These are facts of life in Ops, but load balancers can help. If you “overbuild” your pool with extra servers, your service can survive losing several machines with no impact to the user. Likewise, you could take them down one at a time for security patching or upgrades. Or deploy a new build of your application to only 5% of your servers as a smoke test or “canary” for catastrophic failures before rolling it out site-wide.

If your app needs 5 web servers to handle your peak workload, and you have 6 in the pool, you have 1 server worth of headroom for failure. This is known as “N + 1” redundancy, and is the bare minimum you should strive for when supporting any production service. Whether you want even more spare capacity depends on the marginal cost of each additional server vs the expense of an outage. In the age of virtual machines, these extra boxes may be very cheap indeed.

There are many options available for load balancing, both hardware and software. On the hardware side, some popular (and often extremely expensive) names are F5 BIG-IP, Citrix NetScaler, and Coyote Point. In software, the best known is probably HAProxy, although nginx and Apache have some limited load balancing services, too. And if you’re a cloud native, Amazon’s Elastic Load Balancer (ELB) product is waiting for you.
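To give a feel for how little it takes, here’s a minimal HAProxy sketch of the web pool described above. The names and addresses are placeholders; each backend server is health-checked, and traffic is spread round-robin across whichever ones are up:

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend http_in
    bind *:80
    default_backend web_pool

backend web_pool
    balance roundrobin
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
    server web3 10.0.0.13:80 check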

Load Balancing Internal Services

Load balancing public services is important. However, there are likely many internal services that are equally crucial to your app’s uptime. These are sometimes overlooked. I certainly didn’t think of them as candidates for load balancing at first. But to your users, an outage is an outage. It doesn’t matter whether it was because of a failure on a public web server or an internal DNS server. They needed you, and you were down.

Some examples of services you might load balance are DNS, SMTP for email, ElasticSearch queries and database reads. These might be able to run on a single machine from a sheer horsepower perspective, but load balancing them still gives you the advantages of redundancy to guard against failure and allow for maintenance.

You might even apply these techniques to your company’s internal or enterprise IT systems. If employees need to authenticate against an LDAP directory to do their jobs, it would be wise to load balance several servers to ensure business doesn’t grind to a halt with one failed hard drive.
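The same idea works for internal services that don’t speak HTTP. Building on the HAProxy sketch above, a TCP-mode section can balance and health-check something like LDAP without needing to understand the protocol at all (addresses are placeholders again):

listen ldap_pool
    bind *:389
    mode tcp
    balance roundrobin
    server ldap1 10.0.1.21:389 check
    server ldap2 10.0.1.22:389 check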

Takeaway

Load balancing is a powerful tool for improving the performance, resiliency and operability of your services. It’s used as a matter of course on public systems, but give a thought to what it can do for your lower-profile ones, too.

That’s not to say that it’s a cure-all. Some services just aren’t suited to it, such as database writes (without special software designed for multiple masters). Or batch jobs that pull their work from a central queue. Other applications might not be “stateless”, and will misbehave if the user is routed to a different server on each request. As always, use the right tool for the job!

Making Varnish 4 and SELinux Play Nice

Why do you hate productivity?

When standing up this blog, I chose CentOS 7 as the underlying OS to get some experience with systemd and other new tech in Red Hat’s latest release. With Red Hat, of course, comes the specter of SELinux. There’s an attitude among some Linux admins that SELinux is just a pain in the ass that prevents you from getting work done, and the “fix” is to disable it outright. I get it. It’s extremely confusing when something as simple as trying to access a file you appear to have read permissions on fails with a misleading error message. Or a service fails to start for no apparent reason.

But configured properly, SELinux can give you a real leg up when it comes to security. With a new exploit or high-profile corporate breach in the news every week these days, you don’t need to be a Level 10 UNIX Wizard to see the value in another layer of protection. For 2015, I’ve decided to suck it up, eat my veggies and learn to love (or at least deal with) SELinux.

Configuring SELinux for Varnish 4

I chose to install Varnish as a caching layer in front of Apache, for the day my little blog makes the front page of Reddit. It’s going to happen any minute now. Just watch. And naturally, since I was starting from scratch anyway, I installed the latest version (4.0.2 as of this post). Apparently Varnish 3.x is properly configured for SELinux out of the box, but that is not the case for the new hotness in Varnish 4. You can find some gory details in the Red Hat bug tracker, but basically a code change in Varnish 4 makes it require access to a few new system calls which have not been whitelisted in the CentOS 7 SELinux packages.

This issue shows itself when you attempt to start up Varnish, and it fails. Checking on why, you can see there’s a strange permissions problem. SELinux rears its ugly (er, delightful) head.

# systemctl status varnish
<snip>
varnishd[20364]: Failed to set permissions on ./vcl.UdrgPE5O.so: Operation not permitted

You can use the audit2allow tool to parse the SELinux logs in /var/log/audit/; it will not only tell you why something was blocked, but also how to fix it. Here, we’ll use the -M flag, which generates a module file that you can then import into SELinux.

# grep varnishd /var/log/audit/audit.log | audit2allow -M varnishd2
# semodule -i varnishd2.pp
# systemctl restart varnish
# systemctl status varnish
varnish.service - Varnish a high-perfomance HTTP accelerator
   Loaded: loaded (/usr/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Sun 2015-02-01 04:09:15 UTC; 100ms ago

That’s it! If everything went correctly, varnish should now be running. The first command finds all Varnish activities that were blocked, and feeds them to audit2allow with the -M flag. This generates a SELinux module named varnishd2.pp which can be loaded to allow all of the calls that had previously been blocked. I’ve named it varnishd2 because there’s already a varnishd module shipped with CentOS. However, it’s the old Varnish 3.x edition that doesn’t work with version 4.

Was that so bad? (Ok, it was kind of bad). But now that you know this pattern, you can work your way around a lot of SELinux issues quickly the next time they crop up. Hopefully sometime down the line the SELinux packages will be updated, making this step unnecessary for Varnish 4.

Salt Multi-Master Bug in 2014.7

A word of warning about the Salt 2014.7 series if you run in multi-master mode. This past week, I tried rolling out Salt 2014.7.1 (aka Helium) to our production environment at work. The 2014.7 line has a lot of exciting new features and fixes, so we’ve been eager to get it out. Having been bitten by bugs in the past, though, we wanted to wait for the first point release to land. That recently dropped, and after a few days kicking the tires in our test environment I was confident that the upgrade would go smoothly.

Sadly, it was not to be. We run a check in our Zabbix monitoring system periodically to make sure that every master is able to make a simple test.ping connection to every minion. Our production setup uses four masters in multi-master mode for redundancy. Shortly after completing the upgrade of all masters and minions, this check began to fail. Not in a very consistent way, but each master had a different subset of minions it could not reach. Normally when a minion loses touch with a master, it’s fixable by simply restarting the salt-minion service, but that did not work here. It just turned into a game of whack-a-mole where the minion would become unreachable from a different master instead.

Diving into the Salt issue tracker, I came upon issues 18322 and 19932 which were filed against 2014.7 and sounded very familiar. They both indicate minions failing to respond to commands from masters, seemingly at random. The common thread was use of multi-master mode. One user suggested a workaround of setting multiprocessing: False on the minions. I found that improved matters–fewer minions were randomly failing to respond–but did not fix it completely. It also seems that this issue is fixed in the upcoming 2015.2 “Lithium” release, but that is a long ways off and the fix is not easily backported.
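For reference, here’s roughly what the relevant minion config looks like in a setup like ours, with the suggested workaround included (hostnames are examples):

# /etc/salt/minion
master:
  - salt-master1.example.com
  - salt-master2.example.com
  - salt-master3.example.com
  - salt-master4.example.com

# Workaround suggested in the issue threads; it reduced, but did not
# eliminate, the random failures for us
multiprocessing: False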

Multi-master mode answered the nagging question of why testing had gone off without a hitch. Our test environment only uses one master, and would not have triggered the bug(s). So, shame on me for that! It’s a best practice to test in a configuration as close to production as possible, and I will be fixing that soon. In a virtualized world, there’s minimal cost to spinning up a second master. And the time spent will certainly be less than I spent chasing down this bug.

In the end, I was forced to roll back the upgrade from all of our servers, which was not a fun job. But at the end of it, Salt was running smoothly once again and our monitoring system was clear. I’ll continue to work with the SaltStack team to find a fix. They’re great folks and very committed to both their product and the community, so I am confident it will happen sooner rather than later.

Please leave a comment if you’ve encountered this issue, or have a workaround.

Introduction to Salt-cloud (Part 2)

In part 1 of this series, we got a 10,000-foot view of salt-cloud. What it is, why you might want to use it, and the pieces that make it up. Now, it’s time to get our hands dirty and boot some VMs.

The salt-cloud Command

Once you’ve installed the appropriate packages for your operating system, you should have the salt-cloud utility available. This CLI app is your interface to salt-cloud. For some examples of what it can do, check out the abridged version of the help output below (from salt-cloud 2014.7.1 on OS X):

jhenry:~ jhenry$ salt-cloud -h
Usage: salt-cloud

Options:
  -c CONFIG_DIR, --config-dir=CONFIG_DIR
                        Pass in an alternative configuration directory.
                        Default: /etc/salt

  Execution Options:
    -p PROFILE, --profile=PROFILE
                        Create an instance using the specified profile.
    -m MAP, --map=MAP   Specify a cloud map file to use for deployment. This
                        option may be used alone, or in conjunction with -Q,
                        -F, -S or -d.
    -d, --destroy       Destroy the specified instance(s).
    -P, --parallel      Build all of the specified instances in parallel.
    -u, --update-bootstrap
                        Update salt-bootstrap to the latest develop version on
                        GitHub.

  Query Options:
    -Q, --query         Execute a query and return some information about the
                        nodes running on configured cloud providers
    -F, --full-query    Execute a query and return all information about the
                        nodes running on configured cloud providers
    --list-providers    Display a list of configured providers.

  Cloud Providers Listings:
    --list-locations=LIST_LOCATIONS
                        Display a list of locations available in configured
                        cloud providers. Pass the cloud provider that
                        available locations are desired on, aka "linode", or
                        pass "all" to list locations for all configured cloud
                        providers
    --list-images=LIST_IMAGES
                        Display a list of images available in configured cloud
                        providers. Pass the cloud provider that available
                        images are desired on, aka "linode", or pass "all" to
                        list images for all configured cloud providers
    --list-sizes=LIST_SIZES
                        Display a list of sizes available in configured cloud
                        providers. Pass the cloud provider that available
                        sizes are desired on, aka "AWS", or pass "all" to list
                        sizes for all configured cloud providers

I’ve trimmed out some poorly documented options to focus on what we’ll use in this post (dumpster diving through the source code to determine what some of those options do may turn into a future article).

As you can see, most salt-cloud actions require either a profile or a map (remember those from part 1?) to execute. Given nothing but a profile (-p) or map (-m), salt-cloud will attempt to boot the named instance(s) in the associated provider’s cloud. Paired with destroy (-d), it will–wait for it–terminate the instance. With -Q or -F, it will query the provider for running instances that match the profile or map and return information about their state. The final set of --list options may be used to view the various regions, images and instance sizes available from a given provider. Handy if you regularly work with several different vendors and can’t keep them all straight.

Configuring a Provider

Time for some concrete examples. Let’s set up Amazon EC2 as a salt-cloud provider, using a config very much like the one that booted the instance where my blog lives.

ec2-dealwithit:
  id: 'Your IAM ID'
  key: 'Your IAM key'
  keyname: centos
  private_key: ~/.ssh/centos.pem
  securitygroup: www
  provider: ec2
  del_root_vol_on_destroy: True
  del_all_vols_on_destroy: True

I’ve stripped out a couple advanced options, but that’s the gist. It’s plain YAML syntax, like all Salt config. To break it down:

ec2-dealwithit: This is an arbitrary ID that serves as the name of your provider. You’ll reference this in other configs, such as profiles (see next section).

id and key: your AWS credentials, specifically an IAM id:key pair. Pretty self-explanatory.

keyname and private_key: The name of an SSH keypair you have previously configured at EC2, and the local path to the private key for that same keypair. This is what allows salt-cloud to log into your freshly booted instance and perform some bootstrapping.

securitygroup: controls which security group (sort of a simple edge firewall, if you are not familiar with EC2) your instances should automatically join.

provider maps to one of salt-cloud’s supported cloud vendors, so it knows which API to speak.

del_root_vol_on_destroy and del_all_vols_on_destroy: determine what should happen to any EBS volumes created alongside your instances. In my case, I want them cleaned up when my instances die so I don’t end up paying for them forever. But YMMV, be sure you’re not going to be storing any critical data on these volumes before you configure them to self-destruct! Confusingly, you need to specify both if you want all EBS volumes to be destroyed. Some instances, such as the newer t2.micro, automatically create an EBS root volume on boot. Setting del_all_vols does not destroy this volume. It only destroys any others you may later attach. So again, consider the behavior you want and set these appropriately. The default behavior depends on which AMI you’re using for your instance, so it’s best to set these explicitly.

Configuring a Profile

Armed with your provider config, you’re ready to create a profile. This builds on the provider and describes the details of an individual VM.

ec2-www:
  provider: ec2-dealwithit
  image: ami-96a818fe
  size: t2.micro
  ssh_username:
    - centos
  location: us-east-1
  availability_zone: us-east-1b
  block_device_mappings:
    - DeviceName: /dev/sda1
      Ebs.VolumeSize: 30
      Ebs.VolumeType: gp2

Once again, a fairly straightforward YAML file.

ec2-www: An arbitrary identifier used to reference your profile in other configs or from the CLI.

provider: The name of a provider you’ve previously defined in /etc/salt/cloud.providers.d/. In this case, the one we just set up earlier.

image: An AMI image ID which will be the basis for your VM.

size: The size or “flavor” for your instance. You can print a list of available sizes for a given provider with a command like this: salt-cloud --list-sizes ec2-dealwithit

ssh_username: The user that the salt-bootstrap code should use to connect to your instance, using the SSH keypair you defined earlier in the provider config. This is baked into your AMI image. If you work with several images that use different default users, you can list them all and salt-cloud will try them one by one.

location and availability_zone: The region and AZ where your instance will live (if you care). You can print a list of locations for a provider with salt-cloud --list-locations ec2-dealwithit.

block_device_mappings: Create or modify an EBS volume to attach to your instance. In my case, I’m using a t2.micro instance which comes with a very small (~6GB) root volume. The AWS free tier allows up to 30GB of EBS storage for free, so I opted to resize the disk to take advantage of that. I also used the gp2 (general purpose SSD) volume type for better performance. You can map as many EBS volumes as you like, or leave it off entirely if it’s not relevant to you.

Configuring a Map

The final config file–which is optional–that I want to touch on is a map. Remember, a map lays out multiple instances belonging to one or more profiles, allowing you to boot a full application stack with one command. Here’s a quick example:

ec2-www:
  - web1
  - web2
  - staging:
      minion:
        master: staging-master.example.com

ec2-www: This is the name of a profile that you’ve previously defined. Here, I’m using the ec2-www profile that we created above.

web1, web2, ...: These are the names of individual instances that will be booted based on the parent profile.

staging: Here, I’m defining an instance and overriding some default settings. Because I can! Specifically, I changed the minion config that salt-bootstrap will drop onto the newly booted host in /etc/salt/minion. For example, you could set up a staging server where you test code before deploying it fully. This server might be pointed at a different salt-master to keep it segregated from production. Nearly any setting from the Core, Provider and Profile levels can be overridden to suit your needs.

Making It Rain

Ok, I had to get one bad cloud joke in. Lighten up. Anyway, now that we’ve laid out our config files, we can go about the business of actually managing our cloud(s).

salt-cloud -p ec2-www web1

Boom! You just booted a VM named web1 based on the ec2-www profile we created earlier. If it seems like it’s taking a long time, that’s because the salt-bootstrap deploy script runs on first boot, loading salt onto the new minion for management. Depending on the log level you’ve configured in the core config (/etc/salt/cloud by default), salt-cloud will either sit silently and eventually report success, or spam your console with excruciating detail about its progress. But either way, when it’s done, you’ll get a nice YAML-formatted report about your new VM.

salt-cloud -a reboot web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be actioned with "reboot":
  web1

Proceed? [N/y] y
... proceeding
[INFO    ] Complete
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            Reboot:
                Complete

In this example, we’re using the -a (action) option to reboot the instance we just created. Salt-cloud loops through all of your providers, querying them for an instance with the name you provide. Once found, it sends the proper API call to the cloud vendor to reboot the instance.

salt-cloud -p ec2-www -d web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be destroyed:
  ec2-www:
    ec2:
      web1

Proceed? [N/y] y
... proceeding
[INFO    ] Destroying in non-parallel mode.
[INFO    ] [{'instanceId': 'i-e7800116', 'currentState': {'code': '48', 'name': 'terminated'}, 'previousState': {'code': '80', 'name': 'stopped'}}]
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            currentState:
                ----------
                code:
                    48
                name:
                    terminated
            instanceId:
                i-e7800116
            previousState:
                ----------
                code:
                    80
                name:
                    stopped

Now that we’re done playing, I’ve deleted the instance we just booted. Easy come, easy go.

salt-cloud -m /etc/salt/cloud.maps.d/demo.map -P

In this last example, we’re booting the map we created earlier. This should bring up three VMs: web1, web2, and staging. The -P option makes this happen in parallel rather than one at a time. The whole point of working in the cloud is speed, so why wait around?

Wrapping Up

That pretty well covers the basics of salt-cloud. What it is, how to configure it, and how to turn those configs into real, live VM’s at your cloud vendor(s) of choice. There’s certainly more to salt-cloud than what I’ve covered so far. The official docs could also stand some improvement, to put it mildly. So I definitely plan to revisit salt-cloud in future posts. I’m already planning one to talk about deploy scripts such as the default salt-bootstrap.

If you’re wondering “why go to all this trouble writing configs just to boot a dang VM?”, it’s a fair point. But there are reasons! One major benefit of salt-cloud is the way it abstracts away vendor details. You write your configs once, and then use the same CLI syntax to manage your VMs wherever they may live. It also gives you the advantages of infrastructure as code. You can keep these configs in version control systems like Git. You can see at a glance what VMs should exist, and how they should be configured. It gives you a level of consistency and repeatability you don’t get from ad-hoc work at the command line or a web GUI. These are all basic tenets of good, modern system administration.

I hope that this series was helpful! Please feel free to leave a comment with any questions, corrections or discussion.

Introduction to Salt-cloud (Part 1)

I’ll come right out with it: I’m a big fan of SaltStack–or Salt, for short. Salt is an open-source configuration management and remote execution tool that plays in the same sandbox as products like Puppet, Chef and Ansible. Written in Python, Salt actually started out as a tool purely for remote execution. Think of the infamous “SSH in a for-loop” that every sysadmin has written to automate repetitive tasks, on steroids. Config management was only added later as demand for those features grew. Because of that heritage, Salt has always excelled at orchestration and administration tasks.

One lesser-known member of the Salt family is salt-cloud, a tool for provisioning new VMs that abstracts away the differences between vendors. This makes it easy to deal with multiple cloud providers without having to stop and learn a new API for each one. Write a short YAML configuration containing your credentials and detailing how many and what type of instances you want to boot, and salt-cloud will make it happen.

This is the first post in a short series on salt-cloud, and assumes some basic familiarity with Salt, such as how to write YAML states and execute simple commands from a CLI. If you need a refresher, the official documentation and tutorials are a great place to start.

Enter Salt-cloud

Salt-cloud is a relative newcomer to the Salt ecosystem, although it has been in development for a couple years now. It started out as a separate project, but was rolled into the main Salt release bundle for version 2014.1, aka “Hydrogen”. Salt-cloud’s humble mission is to take Salt’s config management and execution capabilities and scale them up to managing the instances that make up your cloud infrastructure. Instead of editing files and starting services on individual machines, salt-cloud defines which machines should exist at all, specifies their hardware profile, and lets you boot, reboot or terminate them at will. This takes infrastructure as code to a new level.

Like all Salt tools, salt-cloud runs from a CLI and takes its configuration from simple, YAML-formatted files. This config is made up of “providers”, “profiles”, and optional “maps” and “deploy scripts”. Let’s take a deeper look at each of these components.

Getting Started With Salt-Cloud

To play with salt-cloud, you’ll need a recent build of Salt on your machine. I’m working on Mac OS X, using the excellent Homebrew package manager. So in my case, a simple brew install saltstack was all it took. Several Linux distributions make Salt available out of the box, but it’s typically an ancient version so you will want to use a third-party repo. Ubuntu users can take advantage of SaltStack’s official PPA repo, while RHEL/CentOS folks can get it from EPEL (you may need to enable epel-testing to get the very latest and greatest). salt-cloud has its own package, though it depends on salt-master to function. So you must install both.
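For reference, the installs look roughly like this. Package names and repos drift over time, so treat it as a sketch and check the current install docs for your platform:

# Ubuntu, via the SaltStack PPA
sudo add-apt-repository ppa:saltstack/salt
sudo apt-get update
sudo apt-get install salt-master salt-cloud

# RHEL/CentOS, via EPEL (epel-testing often carries the newest build)
sudo yum install epel-release
sudo yum --enablerepo=epel-testing install salt-master salt-cloud

# OS X, via Homebrew
brew install saltstack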

Configuring Your Cloud

By default, salt-cloud expects to find config files underneath /etc/salt/, although you can point that anywhere you like with the -c parameter. The Linux packages will create this directory by default; Homebrew does not. Because I’d prefer to be able to edit these configs without constantly running sudo, I chose to mirror them in my home directory. You will need to store sensitive credentials in these files, so do what makes sense for your environment.

mkdir -p ~/salt/{cloud.conf.d,cloud.deploy.d,cloud.maps.d,cloud.profiles.d,cloud.providers.d}

There’s a mouthful. Let’s take a minute to chew.

Config Elements

Core Config contains a handful of top-level settings common to all Providers, Profiles and Maps. This is the place to put your default master and minion configs, and miscellaneous customizations like where salt-cloud should write log files. This is read from /etc/salt/cloud and /etc/salt/cloud.conf.d/*.conf by default.

Providers define top-level settings for a given cloud vendor (Amazon, Digital Ocean, OpenStack, Rackspace, and many more). Things like credentials, security groups, and common settings you want to apply to all VMs you create at this provider. Any *.conf files underneath cloud.providers.d/ will automatically be parsed by salt-cloud. That pattern continues for the other config elements below.

Profiles are linked to a provider. They define an individual VM, and include settings such as the instance size, which region the VM should boot into, and what image or template it should be based upon.

Maps are an optional feature that let you string together a number of profiles to build a full-blown application stack. Say you’ve defined a small www profile and a second, beefier db profile. With a map, you can ask for three www servers and one db in Amazon US-East-1, with the same in US-West-2, and then have salt-cloud spin the whole bunch up with one command.

Deploy scripts are another optional piece. By default, Salt loads itself onto any cloud VMs you boot so that you can manage and configure them with no additional work. Which is awesome. This is done using a torturous 5,000-line Bash script (seriously!) named salt-bootstrap. If you need functionality that the built-in script does not provide, you can write your own deploy script instead.
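Pointing a profile at a custom deploy script is only one extra line of YAML. If memory serves, the option is called script and salt-cloud picks up custom scripts from cloud.deploy.d/, but double-check the docs; every name in this snippet is made up for illustration:

# Hypothetical profile snippet
my-web-profile:
  provider: my-ec2-account
  image: ami-12345678
  size: t2.micro
  script: my-custom-bootstrap    # a custom script dropped into cloud.deploy.d/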

Many configuration options can be passed at any of these levels (core, provider, profile, map), which is both a little confusing and very powerful. For example, you can provide a custom minion configuration at the Core level that all of your VMs will automatically boot with. Which you can then override on an individual basis down in a profile or even a map, if you so choose.
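As a quick sketch of what that looks like in practice (hostnames and names invented), a default minion config set once at the core level can be overridden by a single profile:

# /etc/salt/cloud -- core level: every VM salt-cloud boots points at the
# production master by default
minion:
  master: salt.example.com

# cloud.profiles.d/staging.conf -- this profile overrides that default so
# its VMs report to a staging master instead
staging-www:
  provider: my-ec2-account
  image: ami-12345678
  size: t2.micro
  minion:
    master: staging-master.example.com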

That’s Great, But How Do I Actually Use It?

So, there’s an overview of the pieces that make up salt-cloud. In part 2 of this series, we’ll get into some concrete examples of how to actually write a config and boot your cloud.

Welcome to Dealing With IT!

Welcome to Dealing With IT! On this blog, I’ll be writing about topics in system administration and operations. I do not have a specific agenda for this site, but I will be posting tutorials, thoughts, technical deep-dives and random linkdumps as interesting topics come to me.

To kick things off, I’m planning a series of posts on how I used SaltStack and their salt-cloud tool to build and configure this site. Salt is a config management and remote execution framework that plays in the same space as Puppet, Chef, Ansible, and friends. We use it extensively at my employer, and it seemed only natural to put it to work powering this blog. More details can wait for the actual blog posts.

Who am I? Good question. To crib from my own About page, I’m Jon Henry. I’ve been working professionally in IT since 2006, spending most of that time as a sysadmin for startups and small businesses. I started this blog to write about my professional passions, share what I’ve learned, and interact with other practitioners. Those passions include system administration at scale, configuration management, server and app monitoring and analytics, working collaboratively with developers (dare I say it–DevOps?), public and private cloud computing and the field of Operations in general. I’m sure other topics will creep in but my goal is to keep things focused on those areas. I live in Colorado, and spend my working days as the Lead Sysadmin for Photobucket. When I’m not wrangling servers, I enjoy hiking, biking, blasting punk rock and metal, watching terrible amazing old zombie movies, and sampling Colorado’s endless selection of fine craft beers. Don’t tell the Denver locals, but I’m a huge Red Sox and Patriots fan, having grown up outside of Boston.

Thanks for dropping by! Meeting and interacting with other folks working in Ops is at least half the reason I’m starting this blog. So please feel free to leave comments, or hit me up on Twitter at @jdwithit or by email at jhenry@dealingwithit.net. I’m looking forward to getting started.