A few months ago we started to receive Bugsnag notifications of a new error:
Redis::TimeoutError Connection timeout
We use Redis in many ways throughout our application, from background job queues to tracking request throttling state. We run a Redis cluster and use stunnel to provide SSL encryption to the Redis servers from various systems.
When I started troubleshooting this error I found an odd error message in the stunnel log:
2016.10.27 12:59:20 LOG4[1885:139630634207040]: Connection rejected: too many clients (>=500)
After a quick search on Google, I found out that stunnel depends on the resource limits imposed on the shell by the Linux kernel, specifically the total number of open file descriptors in this case.
Increasing limits in a running process to confirm the first hypothesis
My first step was to verify that hypothesis. The first issue I ran into was that I did not want to restart stunnel, because that would interrupt our workers' access to Redis. There is a tool called prlimit that provides a way to change limits on a running process, but for reasons I haven't been able to determine it was removed from the util-linux package in Ubuntu 14.04, the Linux distribution we run. I tried writing directly to /proc/<pid>/limits, as suggested by this tweet, but that didn't work either.
So I decided to build a prlimit package. Fortunately the engineers at Shopify had already done that in a GitHub repository with util-linux backports. I compiled the package and uploaded it to our packagecloud repository.
After deploying the package I tried to change the limits of the running process, but the next day the errors were still there. I was puzzled because the Linux kernel reported the limits as changed. I considered various possibilities: that the "too many clients" error was a red herring, that prlimit was not really working because some odd kernel setting was missing, or even that the Ubuntu kernel did not support it.
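For readers without a prlimit binary at hand, the same check and change can be sketched in Python, whose resource.prlimit function wraps the prlimit system call (Linux only, Python 3.4 or later). This is not the packaged tool described above: read_nofile_limit and set_nofile_soft_limit are hypothetical helper names, and changing a process owned by another user generally requires root.

```python
import os
import resource


def _parse_limit(text):
    """Convert a /proc limits column ('1024' or 'unlimited') to an int."""
    return resource.RLIM_INFINITY if text == "unlimited" else int(text)


def read_nofile_limit(pid):
    """Return (soft, hard) for 'Max open files' as shown in /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                # Columns: three-word name, soft limit, hard limit, units.
                soft, hard = line.split()[3:5]
                return _parse_limit(soft), _parse_limit(hard)
    raise LookupError("'Max open files' not found for pid %d" % pid)


def set_nofile_soft_limit(pid, new_soft):
    """Change the soft open-files limit of a running process, keeping its hard limit."""
    _, hard = resource.prlimit(pid, resource.RLIMIT_NOFILE)
    resource.prlimit(pid, resource.RLIMIT_NOFILE, (new_soft, hard))


if __name__ == "__main__":
    # Inspect this script's own process; substitute the pid of the target process.
    print(read_nofile_limit(os.getpid()))
```

Reading the value back from /proc only confirms what the kernel enforces for that process. It says nothing about values the program computed for itself when it started.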
At this point I still wasn't able to confirm my hypothesis, and the errors were (of course) only appearing in production.
Trying to reproduce the hypothesis by other means
My next step was to try to reproduce the problem in our staging environment. For that task I used redis-benchmark to generate load on the other side of the tunnel and try to reach that limit of 500 connections.
After a bit of trial and error, I was able to reach the connection limit. For that I had to change the limits of the running shell with ulimit, since I was hitting the limit on the side of the tunnel where I was running redis-benchmark (stunnel needs to create tunnels in both directions).
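Adjusting a process's own limit, as ulimit does for a shell, and then holding many connections open can be sketched in Python as below. This is a stand-in for illustration, not redis-benchmark: raise_nofile_soft_limit and hold_connections are hypothetical names, the host and port are placeholders for the local end of a tunnel, and whether idle connections count against stunnel's client limit in the same way as real benchmark traffic is an assumption.

```python
import resource
import socket


def raise_nofile_soft_limit(wanted):
    """Raise this process's soft open-files limit to `wanted`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # Already high enough, nothing to change.
    new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


def hold_connections(host, port, count, timeout=5.0):
    """Open `count` TCP connections and return them still open."""
    connections = []
    try:
        for _ in range(count):
            connections.append(socket.create_connection((host, port), timeout=timeout))
    except OSError:
        # Close what was opened so far before reporting the failure.
        for conn in connections:
            conn.close()
        raise
    return connections


if __name__ == "__main__":
    # Placeholder endpoint: replace with the local port of the tunnel under test.
    raise_nofile_soft_limit(2048)
    held = hold_connections("127.0.0.1", 6379, 600)
    print("holding", len(held), "connections")
```

Each open socket consumes one file descriptor in the client process, which is why the client's own soft limit has to sit above the number of connections it tries to hold.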
This provided the first clue, because I needed to restart stunnel in my tests after changing limits in my shell. I was getting close, but I still could not pin down exactly what was going on.
I went back to the documentation, but the author did not say much about this issue apart from mentioning that the limits have to be changed.
Finding the real source of the problem
At this point I started to dig into the stunnel source code and there I found the cause: Stunnel internally sets the maximum number of clients during its startup by using the soft limit on the number of open file descriptors. This is the specific formula that is used:
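The formula itself is not reproduced in this copy of the post. As a hedged sketch only, assuming the computation has the shape found in stunnel sources of that period (roughly 125/256 of the descriptor limit once it reaches 256), the relationship can be written in Python as below. The function name stunnel_max_clients is hypothetical, and the constants are an assumption that should be checked against the source of the version actually deployed.

```python
def stunnel_max_clients(soft_nofile_limit):
    """Estimate stunnel's client cap from the soft open-files limit.

    ASSUMPTION: the constants mirror the expression recalled from stunnel's
    source; they are not taken from this post.
    """
    # stunnel is assumed to treat 16 descriptors as its working minimum.
    max_fds = max(soft_nofile_limit, 16)
    if max_fds >= 256:
        # Each client needs roughly two descriptors, plus some headroom.
        return max_fds * 125 // 256
    return (max_fds - 6) // 2


if __name__ == "__main__":
    for limit in (1024, 4096, 65536):
        print(limit, "->", stunnel_max_clients(limit))
```

Under that assumption a soft limit of 1024, a common default, gives a cap of 500, which would match the "too many clients (>=500)" message in the log. Because the value is computed once at startup, changing the limit of an already running process would not move it.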
That made complete sense given the ulimit settings on the servers, so I had the feeling that I was close to confirming my original hypothesis. The downside: I had to restart stunnel for that.
Restarting stunnel without affecting service and making limits permanent
At this point I asked my teammate David to help me devise a smart way to restart stunnel without affecting service. The solution was pretty simple: create another tunnel, reconfigure the Sidekiq workers to point to that tunnel, then raise the limits and restart the original tunnel.
To make the limits permanent, the best option was to add the ulimit parameters to the stunnel init script. That way we could be sure that stunnel was always running with the right limits.
Confirming the hypothesis and mitigating the problem
After a few days we were experiencing far fewer errors with stunnel. At this point the problem is still present but is at least mitigated. The number of errors is bearable for now, until we can dig further into why so many Redis connections are being generated.
Summing it up
To conclude, here are four points to consider whenever trying to solve any tricky operational issue:
- Use troubleshooting theory
- Browse the source
- Ask for help if you get lost