Salt Multi-Master Bug in 2014.7

A word of warning about the Salt 2014.7 series if you run in multi-master mode. This past week, I tried rolling out Salt 2014.7.1 (aka Helium) to our production environment at work. The 2014.7 line has a lot of exciting new features and fixes, so we’ve been eager to get it out. Having been bitten by bugs in the past, though, we wanted to wait for the first point release to land. That recently dropped, and after a few days of kicking the tires in our test environment I was confident that the upgrade would go smoothly.

Sadly, it was not to be. We periodically run a check in our Zabbix monitoring system to make sure that every master can reach every minion with a simple test.ping. Our production setup uses four masters in multi-master mode for redundancy. Shortly after we finished upgrading all of the masters and minions, this check began to fail, and not in any consistent way: each master had a different subset of minions it could not reach. Normally when a minion loses touch with a master, restarting the salt-minion service fixes it, but that did not work here. It just turned into a game of whack-a-mole, with the minion becoming unreachable from a different master instead.
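For illustration, here is roughly what that kind of check looks like as a script run on each master. The timeout, exit codes, and output format here are assumptions for the sketch, not our actual Zabbix item:

```python
#!/usr/bin/env python3
"""Rough sketch of the reachability check described above.

Run on each master, it asks every minion for a test.ping and flags any
that answer with something other than True. A fuller check would also
compare the responders against the full list of expected minions, since
a minion that never answers at all may simply be missing from the output.
"""

import json
import subprocess
import sys

# --out=json --static makes salt wait for all returns and print a single
# JSON document, which keeps the parsing simple. The 15-second timeout
# is an arbitrary choice for this sketch.
proc = subprocess.run(
    ["salt", "--out=json", "--static", "--timeout=15", "*", "test.ping"],
    capture_output=True, text=True,
)

returns = json.loads(proc.stdout or "{}")

if not returns:
    print("no minions returned anything")
    sys.exit(1)

# Anything other than a literal True (for example a "did not return"
# message) counts as unreachable from this master.
bad = sorted(m for m, resp in returns.items() if resp is not True)

if bad:
    print("unreachable from this master: " + ", ".join(bad))
    sys.exit(1)  # non-zero exit so the monitoring check goes into alarm

print("all minions responded")
sys.exit(0)
```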

Diving into the Salt issue tracker, I came upon issues 18322 and 19932, which were filed against 2014.7 and sounded very familiar. They both describe minions failing to respond to commands from masters, seemingly at random. The common thread was use of multi-master mode. One user suggested a workaround of setting multiprocessing: False on the minions. I found that it improved matters (fewer minions randomly failed to respond) but did not fix the problem completely. It also seems that this issue is fixed in the upcoming 2015.2 “Lithium” release, but that is a long way off and the fix is not easily backported.
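For reference, that workaround is a one-line change to the minion configuration (typically /etc/salt/minion), followed by a restart of the salt-minion service. The comment below is my own summary of how it behaved for us, not anything from the upstream docs:

```yaml
# /etc/salt/minion (excerpt)
# Workaround from the issue threads: run the minion single-threaded.
# This reduced, but did not eliminate, the random non-responses for us.
multiprocessing: False
```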

Multi-master mode also answered the nagging question of why testing had gone off without a hitch. Our test environment only uses one master, and so would never have triggered the bug(s). Shame on me for that! It’s a best practice to test in a configuration as close to production as possible, and I will be fixing that soon. In a virtualized world, the cost of spinning up a second master is minimal, and it will certainly be less than the time I spent chasing down this bug.
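For anyone building a similar test environment, the minion side of multi-master mode is just a list of masters in the minion config; the hostnames below are placeholders:

```yaml
# /etc/salt/minion (excerpt)
# Listing more than one master puts the minion in multi-master mode,
# so a test environment exercises the same code paths as production.
master:
  - salt-master-01.example.com
  - salt-master-02.example.com
```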

In the end, I was forced to roll back the upgrade on all of our servers, which was not a fun job. Once it was done, though, Salt was running smoothly again and our monitoring system was clear. I’ll continue to work with the SaltStack team to find a fix. They’re great folks and very committed to both their product and the community, so I am confident it will happen sooner rather than later.

Please leave a comment if you’ve encountered this issue, or have a workaround.
