Nginx Load Balancer Improvements to proxy_next_upstream

This change happened in March of 2016, but was still news to me when I stumbled across it recently. So I wanted to share since it’s important but didn’t seem to be loudly broadcast. Nginx is no longer dangerously bad at load balancing!

Among the many features of the outstanding Nginx webserver is the ability to act as a load balancer. Using the built-in upstream module, you can define a pool of backend app servers that should take turns servicing requests. And in theory, you can tell nginx to skip a server if it is down or returning error (HTTP 5xx) responses.
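
A minimal sketch of such a pool (the upstream name and backend addresses here are placeholders) looks something like this:

upstream app_pool {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;
}

server {
    listen 80;
    location / {
        # Each request is handed to the next backend in round-robin order
        proxy_pass http://app_pool;
    }
}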

In practice, however, Nginx’s handling of downed servers could be very dangerous. As a Hacker News thread on the subject pointed out, when a server returns an error, Nginx will by default retry the request on a second server. This is fine most of the time. But what if the request was “charge $10,000 to my credit card”? Maybe the server correctly applied the charge, but then failed while rendering the confirmation page and returned an error. Well, get ready for some real angry customer support calls. Nginx would have resubmitted that same $10,000 charge over and over until a server responded with an HTTP 200 OK.

For this reason, many admins recommend setting proxy_next_upstream off;. This makes a failed backend request simply return an error page instead of retrying it on another server. Definitely not ideal; who wants their users to see error pages? But it beats handling a deluge of chargebacks from outraged customers who were billed multiple times. In reality, this often meant admins chose another, specialized tool for their load balancing needs, like HAProxy or an expensive hardware appliance from the likes of F5 or A10.

But wait! With the release of Nginx 1.9.13, things got better. Nginx will no longer retry “non-idempotent” requests unless you explicitly tell it to. An idempotent request is one that produces the same result no matter how many times you repeat it. POST (along with the more obscure LOCK and PATCH methods) doesn’t qualify, so those requests are no longer retried by default.
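
To make that concrete, here is a rough sketch of the relevant directives (the app_pool upstream is a placeholder, and the http_500 flag is only needed if you also want retries on 5xx responses):

location / {
    proxy_pass http://app_pool;

    # The old blunt workaround: never retry a failed request anywhere
    #proxy_next_upstream off;

    # Retry on connection errors, timeouts and HTTP 500s; since 1.9.13,
    # non-idempotent methods (POST, LOCK, PATCH) are excluded from retries
    proxy_next_upstream error timeout http_500;

    # Only add non_idempotent if you are certain retrying a POST is safe
    #proxy_next_upstream error timeout http_500 non_idempotent;
}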

So if you’re still running with proxy_next_upstream off; in your config because of those concerns, it’s time to test removing it. Nginx’s load balancing is much safer and saner than it was this time last year.

New Tool: Docker RPM Builder

Being a CentOS shop, at my employer we maintain a library of homegrown RPM’s. We use these for a number of purposes, from packaging up software that doesn’t ship with its own RPM to rebuilding open source apps with custom patches applied. Historically, though, these… haven’t exactly been built and managed using best practices. RPM’s were built ad-hoc on any available machine. Source RPM’s were usually (but not always) dumped onto a file server, and could be a nightmare to rebuild in the future if there were a lot of dependencies.

I’ve been looking into tools to help our Ops team come into the 21st century with respect to software building and packaging. Running it all through Jenkins CI is a no-brainer. But what would do the heavy lifting of actually building the RPM’s? Conveniently, just as I was starting to explore this, the DevOps Weekly Newsletter came to the rescue! As an aside, DevOps Weekly is awesome and I highly encourage you to check it out.

That week’s issue highlighted a tool called docker-rpm-builder by Alan Franzoni. It leverages Docker to perform the RPM builds. Given a base image and an appropriate Dockerfile, the tool quickly installs all dependencies and runs the build inside of a container, spitting out a finished RPM in just a few seconds. This saves you from the joys of managing all of the build dependencies (and their inevitable conflicts), or needing to run dozens of VM’s each customized for an RPM’s specific needs.
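
I haven’t yet dug into the exact conventions docker-rpm-builder expects from its base images, so treat this as a rough illustration only: the image is essentially a container with the RPM build toolchain preinstalled, along these lines:

FROM centos:7

# Basic RPM build toolchain; the spec file's BuildRequires are
# installed on top of this at build time
RUN yum -y install rpm-build rpmdevtools yum-utils && \
    yum clean all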

I’m only just getting started with docker-rpm-builder, but it looks quite slick. As I work with it more seriously, I plan to post some hands-on tutorials and report on how it’s worked out for taming our Ops build environment mess.

If you have any experience with the tool, or have tackled this challenge before, I’d love to hear about it.

What Ops Can Learn from Agile and CI/CD

A Conversation

I was chatting recently with a fellow Ops engineer at work about a project he was wrapping up. This colleague had written a large chunk of code for our config management tools to manage a new class of system. However, he was reluctant to merge his branch into the main codebase and release it to production. What if it doesn’t work properly, or behaves in an unexpected way? Just last week, someone pushed a bad change to config management and it nearly took down the whole site! Yes, better to leave the code disabled. If a change needs to be made, it can be done manually and then copied by hand into config management so it is not lost.

His proposal reminded me of the Waterfall model for software delivery and the extreme risk-aversion of traditional Operations teams. In a DevOps world, however, these practices don’t fly. In order to compete, businesses need the ability to put new features and bug fixes in front of users as quickly as possible. Without compromising quality. Development teams figured this out first, and came up with several techniques for achieving these goals. Let’s review a couple, and then look at how Operations can learn from them as well.

Continuous Integration

In one traditional development model, everyone works on a personal, long-lived “feature branch” which is only merged back into the main codebase (“trunk”) much later. Unfortunately, this can lead to all sorts of problems. One developer’s work renames a class that another’s branch relies on. Or each one introduces new contradictory config options. In any case, these integration problems cannot be caught until the end of the release cycle when all of the feature branches are merged. Bugs, crunch time and a frantic scramble to resolve conflicts ensue.

Continuous Integration (CI) is a newer workflow where changes are merged into the trunk very frequently. Some teams even forego the use of branches completely, doing all work on the trunk. You might think this would cause extreme chaos, with many people all working on the same code at the same time. And without new safeguards, you’d be right. Teams practicing CI run a centralized build server such as Jenkins, which performs frequent automated builds and tests. If the build fails, the source tree is closed to further commits until the problems are fixed. In this way, every developer is working on the same code base and any integration problems are caught early. This process is only as robust as the tests themselves, of course, so some up front work writing a battery of useful tests is critical. The payoff comes in the form of quicker releases with fewer defects and lower stress.

Continuous Delivery

Continuous Delivery (CD) takes the concept of CI even further. Once you adopt CI, your code trunk is always in a state where it can be built without errors and pass basic tests. But traditionally, even a team that practices CI might only push a public release once a year. There’s so much packaging and testing to be done, not to mention the Ops work to actually deploy the code. Who has time to do that more often?

In the fast-paced world of web applications and SaaS offerings, the answer had better be “you”. Rather than batching up changes for months or years, CD strives to get that code to users as quickly as possible. This is done by relentlessly automating the build and deploy process, working from a foundation of extensive automated tests (seeing a pattern yet?). The same build server that you set up for CI is extended to package up each finished build into a fully deployable release bundle. And, if all tests pass, it will automatically deploy that bundle to a staging environment for final approval–or even straight to production!

Building software this way has a number of benefits. When changes are delivered in small, easily understood batches, they’re simpler to debug when problems arise. And because the code is fresh in the developer’s mind, they’ll have less trouble coming up with a fix. It also gets the fruits of your labor out to users sooner. New features and fixes that sit undeployed for a year benefit nobody. With CD, as soon as the work is done, it can be deployed and start making your customers happy.

Putting a little Dev in your Ops

With those ideas in mind, let’s circle back to my conversation at work. My coworker had developed a batch of code in isolation from the rest of the team. It was “done” in his view, but it had not been well tested, to the point where he was afraid to merge it into trunk or deploy it to production. Call me crazy, but I have a hard time calling code “done” if you can’t or won’t run it! What lessons can we apply from CI/CD and how they improve development? We’ll take his concerns one at a time.

“What if it doesn’t work?” I can certainly appreciate not wanting to run untested code in production, but not running it at all is not the solution. CI and CD both advocate rapidly iterating in small batches, testing each change as you go. In this way you gain confidence that you’ve built a very solid foundation where every piece works. Test early and often before pushing your changes live. Ideally, you can find a way to automate those tests. Peer review from your team is another great tool.

“What if I make a bad change?” This comes back to testing and confidence again. Development is churning out high quality features at breakneck speed, and it’s Ops’ job to match them. Trust your tests and automation, just as dev does. If those aren’t up to the task, it’s time to start beefing them up. If you’re not comfortable deploying your own code, how can you expect to do it for others?

“I’ll write the code but never merge the branch into trunk or deploy it.” Hoo boy. What do you call the polar opposite of CI/CD? The maxim “incorrect documentation is worse than no documentation” applies here. With missing docs, at least you know where you stand. But bad documentation is actively misleading and can do great harm when you follow it. Unmerged, untested code works the same way. Someone will eventually find and run it–with unpredictable results. You’ve also burned time on work that is not delivering any value. At best, it’s wasted effort. At worst, a time bomb waiting to go off. This configuration living off to the side is like a Waterfall developer’s feature branch. Isolated and unused, it’s just waiting to cause problems if and when it is finally merged and deployed.

“I’ll make sure to mirror every manual change back into config management.” …until you don’t. Nobody is perfect, and you are eventually going to miss something. Your config is now inaccurate, and you won’t find out until the server dies years later. Someone dutifully provisions a new one using the saved config, but now the service is behaving strangely because it is not set up correctly. Good luck tracking down that crucial missing change. This is analogous to a developer refusing to write or run any automated tests because they tested by hand and “it worked on my machine”. I think everyone’s heard that line before. Once again, trust your automation and leave the human error out of it.

Wrap Up

Development teams have reinvented themselves with Agile techniques, Continuous Integration and Continuous Delivery. This allows them to write code with unprecedented speed and without compromising quality. Thankfully, many of those same lessons are directly applicable to Ops. Test everything. Once a process is well defined, automate relentlessly to ensure it’s done right every time. Work in small, easily digestible iterations and deploy them frequently. If a process is slow or painful, focus your efforts on that bottleneck first.

Modern system administration is described as “infrastructure as code”, and that’s not just a catch phrase. This type of work closely resembles software development, and there’s a large body of best practices that Ops can leverage to improve the service we deliver. Embrace that knowledge. Maybe even ask your favorite developer over lunch about how and why they use CI/CD. Dev and Ops collaborating… what’s the worst that could happen?

Are you using CI or CD in the field, whether it be in Dev or Ops? How’s it working out for you? I’d love to hear your comments.

If you couldn’t tell, I find this topic fascinating. In a future post I plan to talk in detail about tools and processes for automating tests of your infrastructure and configs.

Load Balance All The Things

Load Balancing Basics

If you’ve done much work in Operations, you’ve probably encountered a load balancer. This dedicated network device sits between clients and pools of servers, spreading the incoming traffic between them to achieve a greater scale than any one server could handle alone. Perhaps the most obvious use case is web servers. A popular web site might get many millions of hits every day. There’s no way that one server, even a very expensive one, could stand up to that. Instead, many inexpensive servers are placed behind the load balancer and the requests are spread evenly among them. In a well-written web application, any server can handle any request. So this process is transparent to the user. They simply browse your site as they normally would, with no hint that each page they view might be returned by a different server.

There are other benefits, too. Hardware fails, software has bugs, and human operators make mistakes. These are facts of life in Ops, but load balancers can help. If you “overbuild” your pool with extra servers, your service can survive losing several machines with no impact to the user. Likewise, you could take them down one at a time for security patching or upgrades. Or deploy a new build of your application to only 5% of your servers as a smoke test or “canary” for catastrophic failures before rolling it out site-wide.

If your app needs 5 web servers to handle your peak workload, and you have 6 in the pool, you have 1 server worth of headroom for failure. This is known as “N + 1” redundancy, and is the bare minimum you should strive for when supporting any production service. Whether you want even more spare capacity depends on the marginal cost of each additional server vs the expense of an outage. In the age of virtual machines, these extra boxes may be very cheap indeed.

There are many options available for load balancing, both hardware and software. On the hardware side, some popular (and often extremely expensive) names are F5 BIG-IP, Citrix NetScaler, and Coyote Point. In software, the best known is probably HAProxy, although nginx and Apache have some more limited load balancing capabilities, too. And if you’re a cloud native, Amazon’s Elastic Load Balancer (ELB) product is waiting for you.

Load Balancing Internal Services

Load balancing public services is important. However, there are likely many internal services that are equally crucial to your app’s uptime. These are sometimes overlooked. I certainly didn’t think of them as candidates for load balancing at first. But to your users, an outage is an outage. It doesn’t matter whether it was because of a failure on a public web server or an internal DNS server. They needed you, and you were down.

Some examples of services you might load balance are DNS, SMTP for email, ElasticSearch queries and database reads. These might be able to run on a single machine from a sheer horsepower perspective, but load balancing them still gives you the advantages of redundancy to guard against failure and allow for maintenance.

You might even apply these techniques to your company’s internal or enterprise IT systems. If employees need to authenticate against an LDAP directory to do their jobs, it would be wise to load balance several servers to ensure business doesn’t grind to a halt with one failed hard drive.
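
As an illustration, a bare-bones HAProxy config for a TCP-mode LDAP pool might look roughly like this (the names, addresses and simple TCP health checks are placeholders; global/defaults sections are omitted, and you should tune the checks for your directory servers):

frontend ldap_in
    bind *:389
    mode tcp
    default_backend ldap_servers

backend ldap_servers
    mode tcp
    balance roundrobin
    server ldap1 10.0.10.21:389 check
    server ldap2 10.0.10.22:389 check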

Takeaway

Load balancing is a powerful tool for improving the performance, resiliency and operability of your services. It’s used as a matter of course on public systems, but give a thought to what it can do for your lower-profile ones, too.

That’s not to say that it’s a cure-all. Some services just aren’t suited to it, such as database writes (without special software designed for multiple masters). Or batch jobs that pull their work from a central queue. Other applications might not be “stateless” and misbehave if the user is routed to a different server on each request. As always, use the right tool for the job!

Making Varnish 4 and SELinux Play Nice

Why do you hate productivity?

When standing up this blog, I chose CentOS 7 as the underlying OS to get some experience with systemd and other new tech in Red Hat’s latest release. With Red Hat, of course, comes the specter of SELinux. There’s an attitude among some Linux admins that SELinux is just a pain in the ass that prevents you from getting work done, and the “fix” is to disable it outright. I get it. It’s extremely confusing when something as simple as trying to access a file you appear to have read permissions on fails with a misleading error message. Or a service fails to start for no apparent reason.

But configured properly, SELinux can give you a real leg up when it comes to security. With a new exploit or high-profile corporate breach in the news every week these days, you don’t need to be a Level 10 UNIX Wizard to see the value in another layer of protection. For 2015, I’ve decided to suck it up, eat my veggies and learn to love (or at least deal with) SELinux.

Configuring SELinux for Varnish 4

I chose to install Varnish as a caching layer in front of Apache, for the day my little blog makes the front page of Reddit. It’s going to happen any minute now. Just watch. And naturally, since I was starting from scratch anyway, I installed the latest version (4.0.2 as of this post). Apparently Varnish 3.x is properly configured for SELinux out of the box, but that is not the case for the new hotness in Varnish 4. You can find some gory details in the Red Hat bug tracker, but basically a code change in Varnish 4 makes it require access to a few new system calls which have not been whitelisted in the CentOS 7 SELinux packages.

This issue shows itself when you attempt to start up Varnish, and it fails. Checking on why, you can see there’s a strange permissions problem. SELinux rears its ugly (sorry, delightful) head.

# systemctl status varnish
<snip>
varnishd[20364]: Failed to set permissions on ./vcl.UdrgPE5O.so: Operation not permitted

The audit2allow tool parses the SELinux logs in /var/log/audit/ and can tell you not only why something was blocked, but also how to fix it. Here, we’ll use the -M flag, which generates a module file that you can then import into SELinux.

# grep varnishd /var/log/audit/audit.log | audit2allow -M varnishd2
# semodule -i varnishd2.pp
# systemctl restart varnish
# systemctl status varnish
varnish.service - Varnish a high-perfomance HTTP accelerator
   Loaded: loaded (/usr/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Sun 2015-02-01 04:09:15 UTC; 100ms ago

That’s it! If everything went correctly, varnish should now be running. The first command finds all Varnish activities that were blocked, and feeds them to audit2allow with the -M flag. This generates a SELinux module named varnishd2.pp which can be loaded to allow all of the calls that had previously been blocked. I’ve named it varnishd2 because there’s already a varnishd module shipped with CentOS. However, it’s the old Varnish 3.x edition that doesn’t work with version 4.
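
If you’d like to double-check that the new module is actually loaded, semodule can list it; you should see varnishd2 alongside the stock varnishd module:

# semodule -l | grep varnishd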

Was that so bad? (Ok, it was kind of bad). But now that you know this pattern, you can work your way around a lot of SELinux issues quickly the next time they crop up. Hopefully sometime down the line the SELinux packages will be updated, making this step unnecessary for Varnish 4.

Introduction to Salt-cloud (Part 2)

In part 1 of this series, we got a 10,000 foot view of salt-cloud. What it is, why you might want to use it, and the pieces that make it up. Now, it’s time to get our hands dirty and boot some VM’s.

The salt-cloud Command

Once you’ve installed the appropriate packages for your operating system, you should have the salt-cloud utility available. This CLI app is your interface to salt-cloud. For some examples of what it can do, check out the abridged version of the help output below (from salt-cloud 2014.7.1 on OS X):

jhenry:~ jhenry$ salt-cloud -h
Usage: salt-cloud

Options:
  -c CONFIG_DIR, --config-dir=CONFIG_DIR
                        Pass in an alternative configuration directory.
                        Default: /etc/salt

  Execution Options:
    -p PROFILE, --profile=PROFILE
                        Create an instance using the specified profile.
    -m MAP, --map=MAP   Specify a cloud map file to use for deployment. This
                        option may be used alone, or in conjunction with -Q,
                        -F, -S or -d.
    -d, --destroy       Destroy the specified instance(s).
    -P, --parallel      Build all of the specified instances in parallel.
    -u, --update-bootstrap
                        Update salt-bootstrap to the latest develop version on
                        GitHub.

  Query Options:
    -Q, --query         Execute a query and return some information about the
                        nodes running on configured cloud providers
    -F, --full-query    Execute a query and return all information about the
                        nodes running on configured cloud providers
    --list-providers    Display a list of configured providers.

  Cloud Providers Listings:
    --list-locations=LIST_LOCATIONS
                        Display a list of locations available in configured
                        cloud providers. Pass the cloud provider that
                        available locations are desired on, aka "linode", or
                        pass "all" to list locations for all configured cloud
                        providers
    --list-images=LIST_IMAGES
                        Display a list of images available in configured cloud
                        providers. Pass the cloud provider that available
                        images are desired on, aka "linode", or pass "all" to
                        list images for all configured cloud providers
    --list-sizes=LIST_SIZES
                        Display a list of sizes available in configured cloud
                        providers. Pass the cloud provider that available
                        sizes are desired on, aka "AWS", or pass "all" to list
                        sizes for all configured cloud providers

I’ve trimmed out some poorly documented options to focus on what we’ll use in this post (dumpster diving through the source code to determine what some of those options do may turn into a future article).

As you can see, most salt-cloud actions require either a profile or a map (remember those from part 1?) to execute. Given nothing but a profile (-p) or map (-m), salt-cloud will attempt to boot the named instance(s) in the associated provider’s cloud. Paired with destroy (-d), it will–wait for it–terminate the instance. With -Q or -F, it will query the provider for running instances that match the profile or map and return information about their state. The final set of --list options may be used to view the various regions, images and instance sizes available from a given provider. Handy if you regularly work with several different vendors and can’t keep them all straight.
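
For instance, once you have at least one provider configured (see below), these read-only commands are safe to experiment with:

# What providers does salt-cloud know about?
salt-cloud --list-providers

# Summarize every instance running at those providers
salt-cloud -Q

# What instance sizes can I choose from?
salt-cloud --list-sizes all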

Configuring a Provider

Time for some concrete examples. Let’s set up Amazon EC2 as a salt-cloud provider, using a config very much like the one that booted the instance where my blog lives.

ec2-dealwithit:
  id: 'Your IAM ID'
  key: 'Your IAM key'
  keyname: centos
  private_key: ~/.ssh/centos.pem
  securitygroup: www
  provider: ec2
  del_root_vol_on_destroy: True
  del_all_vols_on_destroy: True

I’ve stripped out a couple advanced options, but that’s the gist. It’s plain YAML syntax, like all Salt config. To break it down:

ec2-dealwithit: This is an arbitrary ID that serves as the name of your provider. You’ll reference this in other configs, such as profiles (see next section).

id and key: your AWS credentials, specifically an IAM id:key pair. Pretty self explanatory.

keyname and private_key: The name of an SSH keypair you have previously configured at EC2, and the local path to the private key for that same keypair. This is what allows salt-cloud to log into your freshly booted instance and perform some bootstrapping.

securitygroup: controls which security group (sort of a simple edge firewall, if you are not familiar with EC2) your instances should automatically join.

provider maps to one of salt-cloud’s supported cloud vendors, so it knows which API to speak.

del_root_vol_on_destroy and del_all_vols_on_destroy: determine what should happen to any EBS volumes created alongside your instances. In my case, I want them cleaned up when my instances die so I don’t end up paying for them forever. But YMMV, be sure you’re not going to be storing any critical data on these volumes before you configure them to self-destruct! Confusingly, you need to specify both if you want all EBS volumes to be destroyed. Some instances, such as the newer t2.micro, automatically create an EBS root volume on boot. Setting del_all_vols does not destroy this volume. It only destroys any others you may later attach. So again, consider the behavior you want and set these appropriately. The default behavior depends on which AMI you’re using for your instance, so it’s best to set these explicitly.

Configuring a Profile

Armed with your provider config, it’s time to create a profile. This builds on the provider and describes the details of an individual VM.

ec2-www:
  provider: ec2-dealwithit
  image: ami-96a818fe
  size: t2.micro
  ssh_username:
    - centos
  location: us-east-1
  availability_zone: us-east-1b
  block_device_mappings:
    - DeviceName: /dev/sda1
      Ebs.VolumeSize: 30
      Ebs.VolumeType: gp2

Once again, a fairly straightforward YAML file.

ec2-www: An arbitrary identifier used to reference your profile in other configs or from the CLI.

provider: The name of a provider you’ve previously defined in /etc/salt/cloud.providers.d/. In this case, the one we just set up earlier.

image: An AMI image ID which will be the basis for your VM.

size: The size or “flavor” for your instance. You can print a list of available sizes for a given provider with a command like this: salt-cloud --list-sizes ec2-dealwithit

ssh_username: The user that the salt-bootstrap code should use to connect to your instance, using the SSH keypair you defined earlier in the provider config. This is baked into your AMI image. If you work with several images that use different default users, you can list them all and salt-cloud will try them one by one (see the short example after this breakdown).

location and availability_zone: The region and AZ where your instance will live (if you care). You can print a list of locations for a provider with salt-cloud --list-locations ec2-dealwithit.

block_device_mappings: Create or modify an EBS volume to attach to your instance. In my case, I’m using a t2.micro instance which comes with a very small (~6GB) root volume. The AWS free tier allows up to 30GB of EBS storage for free, so I opted to resize the disk to take advantage of that. I also used the gp2 (standard SSD) volume type for better performance. You can map as many EBS volumes as you like, or leave it off entirely if it’s not relevant to you.
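
As promised above, the ssh_username fallback list might look like this (these are just common default users for popular AMIs; adjust to match your images):

ssh_username:
  - centos
  - ec2-user
  - ubuntu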

Configuring a Map

The final config file–which is optional–that I want to touch on is a map. Remember, a map lays out multiple instances belonging to one or more profiles, allowing you to boot a full application stack with one command. Here’s a quick example:

ec2-www:
  - web1
  - web2
  - staging:
      minion:
        master: staging-master.example.com

ec2-www: This is the name of a profile that you’ve previously defined. Here, I’m using the ec2-www profile that we created above.

web1, web2, ...: These are the names of individual instances that will be booted based on the parent profile.

staging: Here, I’m defining an instance and overriding some default settings. Because I can! Specifically, I changed the minion config that salt-bootstrap will drop onto the newly booted host in /etc/salt/minion. For example, you could set up a staging server where you test code before deploying it fully. This server might be pointed at a different salt-master to keep it segregated from production. Nearly any setting from the Core, Provider and Profile level can be overridden to suit your needs.

Making It Rain

Ok, I had to get one bad cloud joke in. Lighten up. Anyway, now that we’ve laid out our config files, we can go about the business of actually managing our cloud(s).

salt-cloud -p ec2-www web1

Boom! You just booted a VM named web1 based on the ec2-www profile we created earlier. If it seems like it’s taking a long time, that’s because the salt-bootstrap deploy script runs on first boot, loading salt onto the new minion for management. Depending on the log level you’ve configured in the core config (/etc/salt/cloud by default), salt-cloud will either sit silently and eventually report success, or spam your console with excruciating detail about its progress. But either way, when it’s done, you’ll get a nice YAML-formatted report about your new VM.

salt-cloud -a reboot web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be actioned with "reboot":
  web1

Proceed? [N/y] y
... proceeding
[INFO    ] Complete
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            Reboot:
                Complete

In this example, we’re using the -a (action) option to reboot the instance we just created. Salt-cloud loops through all of your providers, querying them for an instance with the name you provide. Once found, it sends the proper API call to the cloud vendor to reboot the instance.

salt-cloud -p ec2-www -d web1
[INFO    ] salt-cloud starting
The following virtual machines are set to be destroyed:
  ec2-www:
    ec2:
      web1

Proceed? [N/y] y
... proceeding
[INFO    ] Destroying in non-parallel mode.
[INFO    ] [{'instanceId': 'i-e7800116', 'currentState': {'code': '48', 'name': 'terminated'}, 'previousState': {'code': '80', 'name': 'stopped'}}]
ec2-www:
    ----------
    ec2:
        ----------
        web1:
            ----------
            currentState:
                ----------
                code:
                    48
                name:
                    terminated
            instanceId:
                i-e7800116
            previousState:
                ----------
                code:
                    80
                name:
                    stopped

Now that we’re done playing, I’ve deleted the instance we just booted. Easy come, easy go.

salt-cloud -m /etc/salt/cloud.maps.d/demo.map -P

In this last example, we’re booting the map we created earlier. This should bring up 3 VM’s: web1, web2, and staging. The -P option makes this happen in parallel rather than one at a time. The whole point of working in the cloud is speed, so why wait around?

Wrapping Up

That pretty well covers the basics of salt-cloud. What it is, how to configure it, and how to turn those configs into real, live VM’s at your cloud vendor(s) of choice. There’s certainly more to salt-cloud than what I’ve covered so far. The official docs could also stand some improvement, to put it mildly. So I definitely plan to revisit salt-cloud in future posts. I’m already planning one to talk about deploy scripts such as the default salt-bootstrap.

If you’re wondering “why go to all this trouble writing configs just to boot a dang VM?”, it’s a fair point. But there are reasons! One major benefit of salt-cloud is the way it abstracts away vendor details. You write your configs once, and then use the same CLI syntax to manage your VM’s wherever they may live. It also gives you the advantages of infrastructure as code. You can keep these configs in version control systems like git. You can see at a glance what VM’s should exist, and how they should be configured. It gives you a level of consistency and repeatability you don’t get from ad-hoc work at the command line or a web GUI. These are all basic tenets of good, modern system administration.

I hope that this series was helpful! Please feel free to leave a comment with any questions, corrections or discussion.

Introduction to Salt-cloud (Part 1)

I’ll come right out with it: I’m a big fan of SaltStack–or Salt, for short. Salt is an open-source configuration management and remote execution tool that plays in the same sandbox as products like Puppet, Chef and Ansible. Written in Python, Salt actually started out as a tool purely for remote execution. Think of the infamous “SSH in a for-loop” that every sysadmin has written to automate repetitive tasks, on steroids. Config management was only added later as demand for those features grew. Because of that heritage, Salt has always excelled at orchestration and administration tasks.

One lesser-known member of the Salt family is salt-cloud, a tool for provisioning new VM’s that abstracts away the differences between vendors. This makes it easy to deal with multiple cloud providers without having to stop and learn a new API for each one. Write a short YAML configuration containing your credentials and detailing how many and what type of instances you want to boot, and salt-cloud will make it happen.

This is the first post in a short series on salt-cloud, and assumes some basic familiarity with Salt, such as how to write YAML states and execute simple commands from a CLI. If you need a refresher, the official documentation and tutorials are a great place to start.

Enter Salt-cloud

Salt-cloud is a relative newcomer to the Salt ecosystem, although it has been in development for a couple years now. It started out as a separate project, but was rolled into the main Salt release bundle for version 2014.1, aka “Hydrogen”. Salt-cloud’s humble mission is to take Salt’s config management and execution capabilities and scale them up to managing the instances that make up your cloud infrastructure. Instead of editing files and starting services on individual machines, salt-cloud defines which machines should exist at all, specifies their hardware profile, and lets you boot, reboot or terminate them at will. This takes infrastructure as code to a new level.

Like all Salt tools, salt-cloud runs from a CLI and takes its configuration from simple, YAML-formatted files. This config is made up of “providers”, “profiles”, and optional “maps” and “deploy scripts”. Let’s take a deeper look at each of these components.

Getting Started With Salt-Cloud

To play with salt-cloud, you’ll need a recent build of Salt on your machine. I’m working on Mac OS X, using the excellent Homebrew package manager. So in my case, a simple brew install saltstack was all it took. Several Linux distributions make Salt available out of the box, but it’s typically an ancient version so you will want to use a third-party repo. Ubuntu users can take advantage of SaltStack’s official PPA repo, while RHEL/CentOS folks can get it from EPEL (you may need to enable epel-testing to get the very latest and greatest). salt-cloud has its own package, though it depends on salt-master to function. So you must install both.

Configuring Your Cloud

By default, salt-cloud expects to find config files underneath /etc/salt/, although you can point that anywhere you like with the -c parameter. The Linux packages will create this by default; homebrew does not. Because I’d prefer to be able to edit these configs without constantly running sudo, I chose to mirror them in my home directory. You will need to store sensitive credentials in these files, so do what makes sense for your environment.

mkdir -p ~/salt/{cloud.conf.d,cloud.deploy.d,cloud.maps.d,cloud.profiles.d,cloud.providers.d}

That’s a mouthful. Let’s take a minute to chew.
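
One housekeeping note first: if you mirror the configs under your home directory like I did, point salt-cloud at them with the -c flag each time you run it, along these lines:

salt-cloud -c ~/salt --list-providers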

Config Elements

Core Config contains a handful of top-level settings common to all Providers, Profiles and Maps. This is the place to put your default master and minion configs, and miscellaneous customizations like where salt-cloud should write log files. This is read from /etc/salt/cloud and /etc/salt/cloud.conf.d/*.conf by default.

Providers define top-level settings for a given cloud vendor (Amazon, Digital Ocean, OpenStack, Rackspace, and many more). Things like credentials, security groups, and common settings you want to apply to all VM’s you create at this provider. Any *.conf files underneath cloud.providers.d/ will automatically be parsed by salt-cloud. That pattern continues for the other config elements below.

Profiles are linked to a provider. They define an individual VM, and include settings such as the instance size, which region the VM should boot into, and what image or template it should be based upon.

Maps are an optional feature that let you string together a number of profiles to build a full-blown application stack. Say you’ve defined a small www profile and a second, beefier db profile. With a map, you can ask for three www servers and one db in Amazon US-East-1, with the same in US-West-2, and then have salt-cloud spin the whole bunch up with one command.

Deploy scripts are another optional piece. By default, Salt loads itself onto any cloud VM’s you boot so that you can manage and configure them with no additional work. Which is awesome. This is done using a torturous 5000 line Bash script (seriously!) named salt-bootstrap. If you need functionality that the built-in script does not provide, you can write your own deploy script instead.

Many configuration options can be passed at any of these levels (core, provider, profile, map) which is both a little confusing and very powerful. For example, you can provide a custom minion configuration that all of your VM’s will automatically boot with at the Core level. Which you can then override on an individual basis down in a profile or even a map, if you so choose.
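
For example, a minimal core config might look something like this (the master hostname is a placeholder):

# /etc/salt/cloud
# Minion config dropped onto every VM that salt-cloud boots
minion:
  master: salt.example.com

# Where salt-cloud itself should write its log
log_file: /var/log/salt/cloud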

That’s Great, But How Do I Actually Use It?

So, there’s an overview of the pieces that make up salt-cloud. In part 2 of this series, we’ll get into some concrete examples of how to actually write a config and boot your cloud.