Nginx Load Balancer Improvements to proxy_next_upstream

This change happened in March of 2016, but was still news to me when I stumbled across it recently. So I wanted to share since it’s important but didn’t seem to be loudly broadcast. Nginx is no longer dangerously bad at load balancing!

Among the many features of the outstanding Nginx webserver is the ability to act as a load balancer. Using the built-in upstream module, you can define a pool of backend app servers that should take turns servicing requests. And in theory, you can tell Nginx to skip a server if it is down or returning error (HTTP 5xx) responses.
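
To make that concrete, here’s a minimal sketch of what such a config looks like; the upstream name, hostnames, and port below are placeholders rather than anything from a real deployment:

upstream app_backend {
    server app1.example.com:8080;
    server app2.example.com:8080;
    server app3.example.com:8080;
}

server {
    listen 80;

    location / {
        # Requests are spread round-robin across the pool defined above.
        proxy_pass http://app_backend;
    }
}

Out of the box, requests are distributed round-robin, and each server line also accepts max_fails and fail_timeout parameters that control when Nginx considers a backend down.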

In practice, however, Nginx’s handling of downed servers can be very dangerous. A Hacker News thread pointed out that when a server returns an error, Nginx will by default always retry the request on a second server. This is fine most of the time. But what if the request was “charge $10,000 to my credit card”? Maybe the server correctly applied the charge, but then failed while rendering the confirmation page and returned an error. Well, get ready for some really angry customer support calls. Nginx would have resubmitted that same $10,000 charge over and over until a server responded with an HTTP 200 OK.

For this reason, many admins recommend setting proxy_next_upstream off;. This makes a failed backend request simply return an error page instead of retrying it on another server. Definitely not ideal; who wants their users to see error pages? But better than handling a deluge of chargebacks from outraged customers who were billed multiple times. In reality, this often meant admins chose a separate, specialized tool for their load balancing needs, like HAProxy or an expensive hardware appliance from the likes of F5 or A10.
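
For reference, that conservative setting just sits next to proxy_pass in the proxied location. A minimal sketch, reusing the hypothetical app_backend pool from the example above:

location / {
    proxy_pass http://app_backend;
    # Never retry a failed request on another backend; return the error as-is.
    proxy_next_upstream off;
}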

But wait! With the release of Nginx 1.9.13, things got better. Nginx will no longer retry “non-idempotent” requests unless you explicitly tell it to. Idempotent means that no matter how many times you perform an action, it has the same result; that rules out POST, along with the more obscure LOCK and PATCH methods.
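
And if you decide a particular app really can tolerate retried POSTs, you can opt back in explicitly with the non_idempotent flag. A sketch, again using the hypothetical app_backend pool; the exact list of retry conditions is up to you:

location / {
    proxy_pass http://app_backend;
    # Retry on connection errors, timeouts, and 500s, and explicitly allow
    # non-idempotent methods (POST, LOCK, PATCH) to be retried as well.
    proxy_next_upstream error timeout http_500 non_idempotent;
}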

So if you’re still running with proxy_next_upstream off; in your config because of those concerns, it’s time to test removing it. Nginx’s load balancing is much safer and saner than it was this time last year.

Rundeck Performance Tuning With MySQL

At my current job, we use a tool called Rundeck to automate a slew of tasks. I initially stood up a test instance on a small VM, so people could kick the tires and decide if it was useful. Before I knew it, five or six dev teams were running dozens of critical jobs out of there, raving about its power, flexibility, and visibility. It had been voted into production whether I liked it or not.

There was just one problem: Web performance was awful. I’m talking “click a link, go get a fresh cup of coffee, come back, and the page is still loading” slow. Literally. It had started out fine, but as popularity grew, it quickly became unusable.

Hours of fruitless troubleshooting later, I came across a GitHub issue mentioning some missing indices on MySQL tables. Hmm, we use MySQL… Searching for that table name led to a couple more issues and mailing list threads, all homing in on the same fix: add these indices, and performance is restored. I gave it a shot.

Back up your database before running these commands. They should be harmless, but better safe than sorry.

ALTER TABLE workflow_workflow_step ADD INDEX workflow_commands_id ( workflow_commands_id );
ALTER TABLE workflow_workflow_step ADD INDEX commands_idx ( commands_idx );

Performance immediately improved for us. Load times on problem pages went from minutes to a second or less. Again, literally.

Why these indices are not created by default, I couldn’t say. For all I know, there’s a great reason, but it’s never been articulated by the Rundeck developers. There are (ignored) GitHub issues and mailing list threads mentioning them dating back years. Which is a shame, because Rundeck is a great tool. But once you get beyond a handful of frequently-running jobs, it borders on unusable until you fix this problem.

That is also, unfortunately, a fitting metaphor for Rundeck itself. Hard-to-find yet crucial tweaks and confusing interfaces challenge operators and users at every turn. I think it’s an A+ idea that suffers from some unfortunate C- documentation, design, and operational decisions. It’s an outstanding tool for fostering a DevOps culture that, in a beautifully ironic twist, feels like software a dev team threw over the wall to ops without a care in the world for those who have to operate or use it in production.

End rant. As I said, we really do use Rundeck extensively despite all that. I’ll try to post some more real-world stories of use cases, tweaks, and gotchas on this blog in the future.