Using the Power of Real-Time Distributed Search with ElasticSearch. Part 2
Note: This post is the second part of the article about the use of ElasticSearch. Find the first part here.
Big data? Let’s scale the search!
It is very easy to build clusters with ES. If two or more nodes are running on the same or on another server in the same network then by default all of them will automatically discover each other and will form clusters.
Indexes in ElasticSearch are scaling horizontally and scattered on shards (shard is a single Lucene indexes manageable by Elasticsearch). Shards, in turn, have replicas (backups) and all of them are located in nodes and nodes can be grouped in clusters. ES scans network with so-called Zen Discovery mechanism which has IP multicast and unicast methods. Using one of these methods it checks the presence of other nodes uniting them together forming a cluster. Unicast discovery is preferable because a new node is not necessary to know about all others in the cluster, it’s enough to be connected to only one. Then it can directly ask a master node to get information about other nodes in the cluster.
In case of failure of a shard, its role becomes playing appropriate replica so that the user does not notice any substitution since it has the same data as the shard.
By default ES node configured to have 5 shards with 1 replica each. It means that indexes will have 5 primary shards and 5 it’s reserved copies (replicas). What does it give? In case if a cluster has at least 2 nodes and one of them fails then the cluster will still contain the entire index because the second node has copies of shards from the first one. If shards configured to have 2 replicas then ES guarantees data integrity even if 2 nodes fail (of course there should be more than 2 nodes in a cluster) and so on.
ElasticSearch Cluster with 2 nodes having 2 shards each and 1 replica per shard
When the amount of data or requests increases, arises the question about extending cluster. In ES it’s very simple: it is necessary to start one or more nodes and ES will automatically move a few shards into these new nodes within a cluster, thus unloading the old ones.
Also, it should be noted that ES is able to effectively use advantages of multi-core processors. Each node in ES may play of the following roles:
- Workhorse. The node only holds data in one or more shards which are actually Lucene indexes, never becomes a master node. They are responsible for indexing and executing queries.
- Coordinator. Serves as a master: not to store any data and have free resources. Node marked as a master is a potential candidate to become the Master of the cluster. ES automatically selects one of it. When Master node goes down then ES initiates new elections of the Master between all nodes having a master role.
- Load balancer. Node is neither master nor data node but acts as a “search load balancer” (fetching data from nodes, aggregating results, etc.). It also responsible for ES REST interface.
By default, ES node plays all three roles but it can be easily tuned in a configuration file. Since ElasticSearch takes care of load balancing then there is no need to use any external tools for managing a load of clusters.
What is an optimal number of shards and replicas?
The number of shards and number of replicas can be configured for each node separately. But how to know how many shards and replicas in a node do we need for our application? And how many nodes are needed to form an optimal cluster?
Actually, there is no magic formula that always gives 100% correct answer to this question. But there are some general guidelines that can be used when selecting the number of replicas and shards.
- Prepare the same environment that will be used on production
- Create an index and configure a node to have only one shard and no replicas
- Index data into that shard
- Load the shard with the typical queries and typical load
- Measure performance
At some moment querying becomes too slow. It means that the max capacity on that hardware is reached. That’s the maximum shard size. Using it and knowing the size of the index we can calculate the number of shards needed for us by a formula:
Number of shards = index size / max shard size
Also, ElasticSearch provides a general rule of thumb that should be used when configuring of shards and replicas:
Assuming you have enough machines to hold shards and replicas, the rule of thumb is:
- Having more shards enhances the indexing performance and allows to distribute a big index across machines.
- Having more replicas enhances the search performance and improves the cluster availability
ElasticSearch in Cloud
ElsaticSearch can be installed on any cloud and extended to hundreds of instances without any changes in a client application. A good video tutorial about an installation of ElasticSearch with Cloudify is here.
ElasticSearch can be used on Amazon EC2 cloud. Here is a very handy guideline on how to set it up.
ElasticSearch is a very powerful (near)real-time search engine written in Java and based on Apache Lucene. It can be installed on any cloud and easily scaled to hundreds of instances.
For developers, it provides APIs to work from Java and Groovy by using libraries. But it doesn’t set a limitation for other languages or technologies, for that there is REST API with a full set of possibilities.
Features like prompting a word during input, finding a closest restaurant/hotel/cinema, gathering statistics about appearance query word in different documents and so on enhance the usability of modern applications and will attract users to use a software more often. Along with easiness of integration, configuration, scaling, ability to run on a Cloud opens a wide range of ElasticSearch usage opportunities.
Interesting article about defining maximum number of shards
Using the Power of Real-Time Distributed Search with ElasticSearch. Part 1
View a full PDF version of the article here:
Using the Power of Real-Time Distributed Search