Nginx Load Balancer Improvements to proxy_next_upstream

This change happened in March of 2016, but was still news to me when I stumbled across it recently. So I wanted to share since it’s important but didn’t seem to be loudly broadcast. Nginx is no longer dangerously bad at load balancing!

Among the many features of the outstanding Nginx webserver is the ability to act as a load balancer. Using the built-in upstream module, you can define a pool of backend app servers that should take turns servicing requests. And in theory, you can tell Nginx to skip a server if it is down or returning error (HTTP 5xx) responses.
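
For the unfamiliar, a minimal round-robin pool looks something like this (the app_servers name and backend hostnames are placeholders of my own, not anything Nginx mandates):

upstream app_servers {
    server app1.example.com:8080;
    server app2.example.com:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
    }
}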

In practice, however, Nginx’s handling of downed servers can be very dangerous. A Hacker News thread on the subject notes that when a request to a server fails, Nginx will by default retry it on a second server. This is fine most of the time. But what if the request was “charge $10,000 to my credit card”? Maybe the server correctly applied the charge, but then failed while rendering the confirmation page and returned an error. Well, get ready for some real angry customer support calls. Nginx would have resubmitted that same $10,000 charge over and over until a server responded with an HTTP 200 OK.

For this reason, many admins recommend setting proxy_next_upstream off;. This makes a failed backend request simply return an error page instead of retrying it on another server. Definitely not ideal; who wants their users to see error pages? But better than handling a deluge of chargebacks from outraged customers who were billed multiple times. In reality, this often meant admins chose another, specialized tool for their load balancing needs, like HAProxy or an expensive hardware appliance from the likes of F5 or A10.
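
The defensive configuration is a single directive in the proxied location. Continuing the sketch above:

location / {
    proxy_pass http://app_servers;
    # Never retry a failed request on another backend; just return the error.
    proxy_next_upstream off;
}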

But wait! With the release of Nginx 1.9.13, things got better: Nginx will no longer retry “non-idempotent” requests (POST, LOCK, and PATCH) unless you explicitly tell it to. An idempotent request is one that produces the same result no matter how many times you perform it, which is why a GET is safe to retry in a way that a POST is not.
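
The same release also added a non_idempotent parameter to proxy_next_upstream, for the rare case where you really do want the old behavior. A sketch, again using my hypothetical pool from above:

location / {
    proxy_pass http://app_servers;
    # Retry connection errors and timeouts on the next server, and explicitly
    # opt back in to retrying POST/LOCK/PATCH requests. Only do this if your
    # backends can safely handle duplicate submissions!
    proxy_next_upstream error timeout non_idempotent;
}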

So if you’re still running with proxy_next_upstream off; in your config because of those concerns, it’s time to test removing it. Nginx’s load balancing is much safer and saner than it was this time last year.

New Tool: Docker RPM Builder

Being a CentOS shop, at my employer we maintain a library of homegrown RPMs. We use these for a number of purposes, from packaging up software that doesn’t ship with its own RPM to rebuilding open source apps with custom patches applied. Historically, though, these… haven’t exactly been built and managed using best practices. RPMs were built ad-hoc on any available machine. Source RPMs were usually (but not always) dumped onto a file server, and could be a nightmare to rebuild in the future if there were a lot of dependencies.

I’ve been looking into tools to help our Ops team come into the 21st century with respect to software building and packaging. Running it all through Jenkins CI is a no-brainer. But what would do the heavy lifting of actually building the RPMs? Conveniently, just as I was starting to explore this, the DevOps Weekly newsletter came to the rescue! As an aside, DevOps Weekly is awesome and I highly encourage you to check it out.

That week’s issue highlighted a tool called docker-rpm-builder by Alan Franzoni. It leverages Docker to perform the RPM builds. Given a base image and an appropriate Dockerfile, the tool quickly installs all dependencies and runs the build inside of a container, spitting out a finished RPM in just a few seconds. This saves you from the joys of managing all of the build dependencies (and their inevitable conflicts), or needing to run dozens of VMs, each customized for an RPM’s specific needs.
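
I won’t vouch for the tool’s exact CLI from memory, so treat this as a sketch of the underlying technique rather than docker-rpm-builder’s actual interface (centos:7 and SPECS/myapp.spec are illustrative placeholders):

# Build an RPM inside a throwaway container: mount an rpmbuild tree
# (SPECS/ and SOURCES/ under the current directory), install the spec's
# build dependencies, and run the build. The container is discarded after.
docker run --rm -v "$(pwd)":/rpmbuild -w /rpmbuild centos:7 /bin/bash -c '
  yum -y install rpm-build yum-utils &&
  yum-builddep -y SPECS/myapp.spec &&
  rpmbuild --define "_topdir /rpmbuild" -ba SPECS/myapp.spec
'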

I’m only just getting started with docker-rpm-builder, but it looks quite slick. As I work with it more seriously, I plan to post some hands-on tutorials and report on how it’s worked out for taming our Ops build environment mess.

If you have any experience with the tool, or have tackled this challenge before, I’d love to hear about it.

Making Varnish 4 and SELinux Play Nice

Why do you hate productivity?

When standing up this blog, I chose CentOS 7 as the underlying OS to get some experience with systemd and other new tech in Red Hat’s latest release. With Red Hat, of course, comes the specter of SELinux. There’s an attitude among some Linux admins that SELinux is just a pain in the ass that prevents you from getting work done, and the “fix” is to disable it outright. I get it. It’s extremely confusing when something as simple as trying to access a file you appear to have read permissions on fails with a misleading error message. Or a service fails to start for no apparent reason.

But configured properly, SELinux can give you a real leg up when it comes to security. With a new exploit or high-profile corporate breach in the news every week these days, you don’t need to be a Level 10 UNIX Wizard to see the value in another layer of protection. For 2015, I’ve decided to suck it up, eat my veggies and learn to love (or at least deal with) SELinux.

Configuring SELinux for Varnish 4

I chose to install Varnish as a caching layer in front of Apache, for the day my little blog makes the front page of Reddit. It’s going to happen any minute now. Just watch. And naturally, since I was starting from scratch anyway, I installed the latest version (4.0.2 as of this post). Apparently Varnish 3.x is properly configured for SELinux out of the box, but that is not the case for the new hotness in Varnish 4. You can find some gory details in the Red Hat bug tracker, but basically a code change in Varnish 4 makes it require access to a few new system calls which have not been whitelisted in the CentOS 7 SELinux packages.

This issue shows itself when you attempt to start Varnish and it fails. Checking on why, you can see there’s a strange permissions problem. SELinux rears its ugly (sorry: delightful) head.

# systemctl status varnish
<snip>
varnishd[20364]: Failed to set permissions on ./vcl.UdrgPE5O.so: Operation not permitted
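
Before reaching for a fix, you can confirm SELinux is the culprit by checking for recent AVC denials with ausearch:

# ausearch -m avc -ts recent | grep varnishd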

You can use the audit2allow tool to parse the SELinux audit logs in /var/log/audit/; it will not only tell you why something was blocked, but also how to fix it. Here, we’ll use the -M flag, which generates a module file that you can then import into SELinux.

# grep varnishd /var/log/audit/audit.log | audit2allow -M varnishd2
# semodule -i varnishd2.pp
# systemctl restart varnish
# systemctl status varnish
varnish.service - Varnish a high-perfomance HTTP accelerator
   Loaded: loaded (/usr/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Sun 2015-02-01 04:09:15 UTC; 100ms ago

That’s it! If everything went correctly, Varnish should now be running. The first command finds all of the Varnish activities that were blocked and feeds them to audit2allow with the -M flag. This generates an SELinux module named varnishd2.pp, which can be loaded to allow all of the calls that had previously been blocked. I’ve named it varnishd2 because there’s already a varnishd module shipped with CentOS; however, that’s the old Varnish 3.x edition, which doesn’t work with version 4.
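
One extra step worth taking: audit2allow -M also writes a human-readable varnishd2.te source file alongside the binary .pp module. Give it a quick read before running semodule, so you know exactly which calls you’re whitelisting and aren’t granting Varnish more than it needs:

# cat varnishd2.te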

Was that so bad? (Ok, it was kind of bad). But now that you know this pattern, you can work your way around a lot of SELinux issues quickly the next time they crop up. Hopefully sometime down the line the SELinux packages will be updated, making this step unnecessary for Varnish 4.