There are many recipes describing how much resources to allocate for Elasticsearch, but there are none about using non-uniform memory access. This post is about leveraging NUMA for efficient resource usage.
Let’s begin with some stats. Previously we were running 28 data nodes on 14 physical servers. Today we run 44 data nodes on 11 physical servers and we have a faster Elasticsearch cluster with more capacity.
All this was achieved using NUMA processors range allocation with local memory affinity. This requires a NUMA policy aware Kernel and at least 2 CPU sockets. Otherwise, there is no difference, since memory locality is same.
There are some caveats when assigning processor ranges that they are not continuous on each node.
$ numactl --hardware | tail -n 3 node 0 1 0: 10 21 1: 21 10
From the matrix above it can be seen that there is additional overhead when nodes interleave each other, therefore this concludes that CPU ranges are not continuous.
$ lscpu | grep -P 'NUMA node\d CPU\(s\)' NUMA node0 CPU(s): 0-13,28-41 NUMA node1 CPU(s): 14-27,42-55
lscpu output shows that the machine’s 56 cores have ranges that interleave both nodes not continuously (does not start from 0-27 and 28-55).
After running 28 data nodes with 2 ranges it was clear that it is possible to push Elasticsearch further, since CPU usage was low: 24% at peak time. This was a waste of resources, so I assigned 1 range (14 CPU’s) to a single node, which resulted in 44 data nodes by removing 3 physical servers from the cluster. In theory 7 physical servers are enough to reach the identical 28 data node capacity (half of initial physical servers).
Since we run Elasticsearch in 3 data centers, it is important to have shard allocation awareness on both DC and RACK level, because multi node might host primary and replica shards on the same machine or RACK (even DC).
Shard allocation awareness is achieved through
cluster.routing.allocation.awareness.attributes . Moreover, one must force zone (DC) awareness through
cluster.routing.allocation.awareness.force.zone.values  if running multiple data centers.
Furthermore, when running multiple nodes on same machine
cluster.routing.allocation.same_shard.host  must be enabled.
Do not trust Elasticsearch shard allocation - monitor it. Forced shard allocation might not balance shards across datacenters when enabled. To balance shards use Reindex API  or reindex completely.
CheckElasticsearchShardsDcAwareness CRITICAL: Shard 2 of country1-some-items_20170825100323 is not DC aware Shard 0 of country2-some-items_20170304121659 is not DC aware, Shard 2 of country3-some-items_20170302224501 is not DC aware, Shard 0 of country3-some-items_20170302224501 is not DC aware
- Increased CPU resource utilization.
- Better CPU cache hit ratio.
- Fewer context switches as opposed to using virtualization approach.
- Main memory locality.