Failover with Redis Sentinel
At Vinted, we use Redis, a data structure server, for many things, including Resque, the news feed, the application, and so on. We were not able to restart or upgrade Redis instances without downtime, and high availability is critical for us. Therefore, we decided to try services such as Redis Sentinel and Redis Cluster.
The first thing we did was test Redis Cluster. However, due to the lack of mature client-side software, we decided not to go with this solution. Redis Cluster itself is stable, but its client side is very basic and lacks advanced functionality, such as pipelining, which we use.
Once we had finished testing Redis Cluster, we moved on to Redis Sentinel. Redis Sentinel monitors the master and its slaves and promotes a slave to master once a quorum of Sentinels agrees that the master is down. In our case, we tested it with 3 nodes (quorum=2). We will not go into detail about setting up Redis Sentinel, as the configuration is very simple.
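To make the quorum arithmetic concrete, here is a minimal Ruby sketch. It is not Sentinel's actual code. It assumes the documented Sentinel behaviour that the quorum is used to flag the master as down, while actually running the failover also requires votes from a majority of all Sentinels. SentinelGroup and its method names are made up for illustration.

```ruby
# Illustrative model of a Sentinel deployment; not part of Redis or redis-rb.
SentinelGroup = Struct.new(:total, :quorum) do
  # The master is flagged as down once `quorum` Sentinels report it unreachable.
  def master_down?(down_reports)
    down_reports >= quorum
  end

  # Votes a Sentinel needs before it may run the failover:
  # a majority of all Sentinels, and never fewer than the quorum.
  def votes_needed
    [total / 2 + 1, quorum].max
  end

  def failover_authorized?(votes)
    votes >= votes_needed
  end
end

group = SentinelGroup.new(3, 2)    # the setup tested in this post
puts group.master_down?(2)         # two Sentinels see the master as down
puts group.failover_authorized?(2) # two of three Sentinels vote
puts group.failover_authorized?(1) # only one Sentinel is left to vote
```

With 3 Sentinels and quorum=2, losing one Sentinel still leaves enough votes for a failover, while losing two does not.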
We run multiple mini clusters, each one formed by one master and two slaves. This allows us to run many instances on one server, since each instance listens on a different port.
If we need to launch another cluster, we simply add the role redis-shards-<country> and Chef will automatically spawn what is needed.
The most interesting thing about Sentinel is that it writes its state into its configuration file. As a result, this file must not be overwritten, so Chef generates these files only if they do not already exist.
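The "create only if missing" rule can be sketched in plain Ruby. This illustrates the idea and is not the actual cookbook: write_if_missing is a hypothetical helper, and the master name, addresses, and timeout in the sample config are placeholder values. In a Chef recipe, the create_if_missing action on a file or template resource gives the same behaviour.

```ruby
require "tmpdir"

# Hypothetical helper: seed a config file once, then leave it to Sentinel.
# Returns true if the file was written, false if it already existed.
def write_if_missing(path, contents)
  return false if File.exist?(path)

  File.write(path, contents)
  true
end

# Placeholder values; only the directive names come from Sentinel itself.
INITIAL_CONFIG = <<~CONF
  port 26379
  sentinel monitor mymaster 127.0.0.1 6379 2
  sentinel down-after-milliseconds mymaster 5000
CONF

Dir.mktmpdir do |dir|
  path = File.join(dir, "sentinel.conf")
  puts write_if_missing(path, INITIAL_CONFIG) # first run: file is created

  # Simulate Sentinel rewriting its own config after a failover.
  File.write(path, INITIAL_CONFIG.sub("127.0.0.1 6379", "127.0.0.1 6380"))

  puts write_if_missing(path, INITIAL_CONFIG) # later runs: file is left alone
  puts File.read(path).include?("6380")       # the rewritten state is still there
end
```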
Technical details
Failover
Every time Sentinel starts a failover, it calls sentinelStartFailover(). Sentinels exchange hello messages using Pub/Sub and update the last_pub_time variable.
So, let’s dig deeper into this. Here is the SystemTap snippet we used to probe the user-space process:
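// Fires each time sentinelStartFailover() is entered in redis-server.
probe process("/usr/local/bin/redis-server").function("sentinelStartFailover")
{
  // Milliseconds elapsed since the master's last_pub_time was updated.
  elapsed = gettimeofday_ms() - $master->last_pub_time;
  printf("%d.%03ds\n", (elapsed / 1000), (elapsed % 1000));
}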
Manual failover using redis-cli took 0.835s, while failover with configured timeout took 5.843s.
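For reference, the arithmetic behind these numbers can be reproduced in Ruby. The sketch below mirrors the printf formatting from the probe and computes the gap between the two measurements. format_elapsed is a hypothetical helper name, and the millisecond values are simply the two results quoted above.

```ruby
# Hypothetical helper mirroring the probe's printf("%d.%03ds", ...) formatting.
def format_elapsed(ms)
  format("%d.%03ds", ms / 1000, ms % 1000)
end

manual_ms  = 835  # manual failover, as measured above
timeout_ms = 5843 # failover with the configured timeout, as measured above

puts format_elapsed(manual_ms)
puts format_elapsed(timeout_ms)
puts format_elapsed(timeout_ms - manual_ms) # difference between the two runs
```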
Measuring how quickly a manual failover converges was crucial for us, as we care about latency. Failing fast is also essential, so it is important to tune these timers and to decide whether manual failovers are enough for maintenance or whether it is preferable to rely on the configured timeouts.
Migration process
- Stop all Sentinel instances, to avoid electing a new master;
- Make sure every Redis instance is a master;
- The Sentinel master node replicates from the origin;
- The Sentinel slaves replicate from the Sentinel master;
- After everything is in sync, stop syncing the master from the origin and start the Sentinel instances.
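The steps above can be sketched as a small Ruby script that prints the redis-cli commands for one mini cluster in order. It is a sketch under assumptions: the Node struct, migration_plan, and all host names and ports are hypothetical, and stopping and starting the Sentinel processes is left as a comment line because it depends on how they are supervised. SLAVEOF and SLAVEOF NO ONE are the standard Redis replication commands.

```ruby
# Hypothetical description of one Redis instance.
Node = Struct.new(:host, :port) do
  # Builds a redis-cli invocation against this node.
  def cli(*args)
    (["redis-cli", "-h", host, "-p", port.to_s] + args).join(" ")
  end
end

# Returns the ordered steps for moving one standalone instance (origin)
# to a Sentinel-managed master with slaves.
def migration_plan(origin, master, slaves)
  nodes = [master] + slaves
  plan = ["# stop all Sentinel instances"]
  plan += nodes.map { |n| n.cli("SLAVEOF", "NO", "ONE") }
  plan << master.cli("SLAVEOF", origin.host, origin.port.to_s)
  plan += slaves.map { |s| s.cli("SLAVEOF", master.host, master.port.to_s) }
  plan << "# wait until every node is in sync"
  plan << master.cli("SLAVEOF", "NO", "ONE")
  plan << "# start all Sentinel instances"
  plan
end

# Placeholder hosts and ports.
origin = Node.new("origin.example", 6379)
master = Node.new("shard-a.example", 6380)
slaves = [Node.new("shard-b.example", 6380), Node.new("shard-c.example", 6380)]
puts migration_plan(origin, master, slaves)
```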
Monitoring
We monitor Redis instances using Redistop.rb.
We don’t use the built-in monitoring tool (redis-cli -p <port> monitor), because it is more intrusive (~12%) than our own. In addition, our own tool allows us to monitor how many requests we have per second per instance, sort by latency, sort by count, and see the most used keys and commands.
~$ ruby redistop.rb -R
Probing...Type CTRL+C to stop probing.
PID REQ/S
1794 2345
22463 1025
2068 785
53680 757
1747 519
1841 462
53633 204
Total: 6116 req/s
~$ ruby redistop.rb -F
Probing...Type CTRL+C to stop probing.
PID COUNT LATENCY CMD
1794 925 <0.000023> zrangebyscore
2068 324 <0.000032> zrangebyscore
22463 293 <0.000033> get
53680 255 <0.000014> get
53680 252 <0.000017> hget
1794 249 <0.000015> get
1794 248 <0.000018> hget
22463 230 <0.000039> hget
1747 225 <0.000053> zrangebyscore
2068 179 <0.000018> hget
~$ ruby redistop.rb -K
Probing...Type CTRL+C to stop probing.
COUNT KEY
1320 get
1107 hget
966 zrangebyscore
486 fr:ab_test_ids
462 pl:ab_test_ids
442 de_babies:ab_test_ids
308 cz:ab_test_ids
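The aggregation behind these views is easy to sketch in Ruby. The code below is not the Redistop.rb implementation, which this post does not show. Sample, per_command_stats, and the sample values are invented for illustration. It groups observed commands by process and command, then sorts by count, similar to the -F view.

```ruby
# One observed command: redis-server PID, command name, latency in seconds.
Sample = Struct.new(:pid, :cmd, :latency)

# Groups samples per (pid, cmd) and returns rows sorted by descending count.
# The latency of a row is the mean latency of its samples.
def per_command_stats(samples)
  rows = samples.group_by { |s| [s.pid, s.cmd] }.map do |(pid, cmd), group|
    avg = group.sum(&:latency) / group.size
    { pid: pid, cmd: cmd, count: group.size, latency: avg }
  end
  rows.sort_by { |row| -row[:count] }
end

# Invented samples, only to exercise the grouping.
samples = [
  Sample.new(1794, "zrangebyscore", 0.000023),
  Sample.new(1794, "zrangebyscore", 0.000025),
  Sample.new(1794, "get", 0.000015),
  Sample.new(2068, "hget", 0.000018),
  Sample.new(1794, "zrangebyscore", 0.000021),
  Sample.new(2068, "hget", 0.000020)
]

puts "PID COUNT LATENCY CMD"
per_command_stats(samples).each do |row|
  puts format("%d %d <%.6f> %s", row[:pid], row[:count], row[:latency], row[:cmd])
end
```

Sorting by latency instead of count only needs a different key in the final sort_by.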
Lessons Learned
- Redis Cluster is a very cool service, but due to the immaturity of the client side we decided to postpone using it.
- Redis Sentinel failover works as expected. Manual failover completes in under a second.
- Migration from standalone instance to Redis Sentinel is very simple.
- Monitoring Redis instances became very easy for us as we can inspect the most interesting things.