Welcome to this introductory series on Elasticsearch and Amazon Elasticsearch Service (Amazon ES). In this and future blog posts, we provide the basic information that you need to get started with Elasticsearch on AWS.
How many shards?
Elasticsearch can take in large amounts of data, split it into smaller units, called shards, and distribute those shards across a dynamically changing set of instances. When you create an Elasticsearch index, you set the shard count for that index. Because you can’t change the shard count of an existing index, you have to make the decision on shard count before sending your first document. To begin, set the shard count based on your calculated index size, using 30 GB as a target size for each shard.
Number of Shards = Index Size / 30GB
To learn how to calculate your index size, see the blog post Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?
As you send data and queries to the cluster, continuously evaluate the resource usage and adjust the shard count based on the performance of the cluster.
What is a shard?
A search engine has two jobs: Create an index from a set of documents, and search that index to compute the best, matching documents. If your index is small enough, a single data structure on a single machine can easily store that index. For larger document sets, in cases where a single machine is not large enough to hold the index, or in cases where a single machine can’t compute your search results, the index is split into pieces. These pieces are called shards in Elasticsearch. Each document is routed to a shard that is calculated, by default, by using a hash of that document’s ID.
A shard is both a unit of storage and a unit of computation. Elasticsearch deploys shards independently to the instances in the cluster to parallelize the storage and processing for the index. And it does this elastically (hence the “elastic” in the name “Elasticsearch”). If you add more instances to a cluster, Amazon Elasticsearch Service automatically rebalances the shards of the cluster, moving them between instances.
As storage, shards are distinct from one another. The document set in one shard doesn’t overlap the document set in any other shard. This approach makes shards independent for storage.
As computational units, shards are also distinct from one another. Each shard is an instance of an Apache Lucene index that computes results on the documents it holds. Because all of the shards comprise the index, they must function together to process each query and update request for that index. To process a query, Elasticsearch routes the query to all shards in the index. Each shard computes its response locally and then these responses are aggregated for the final response. To process a write request (a document addition or an update to an existing document), Elasticsearch routes the request to the appropriate shard.