Elasticsearch is one of the leading solutions for Enterprise search (and not only that). As such, it is worth understanding how it works internally in order to better leverage its capabilities. Let’s follow a short journey through the internals of Elasticsearch.
At the beginning there was only Lucene …
The Apache Lucene library is an open-source library for full-text indexing. It is used by a number of applications to build advanced search capabilities, Elasticsearch being one of them. Lucene is also used by other Enterprise search applications such as Apache Solr. Why would one choose Apache Lucene, or a search application built on top of it, to implement search capabilities? Why not use simple queries on top of, say, a relational database already used by an application? The key to this answer is the basic data structure used by Apache Lucene: the inverted index.
In a nutshell, when we store (index) a text (a document) in Lucene, it is split into tokens. Each distinct token in the inverted index points to the documents that contain it. This makes it possible to implement fast full-text search algorithms covering a wide and complex range of search scenarios. A traditional relational database, in contrast, uses standard indexes based on data structures like a B-tree to improve performance, which leaves fewer options for this kind of optimization. How is an inverted index stored by Apache Lucene internally? It is stored in separate files on disk called Lucene segments.
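To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. This is of course not how Lucene implements it (Lucene uses highly optimized on-disk structures), but it shows the core mapping from tokens to the documents that contain them:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndexSketch {

    // token -> IDs of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int docId, String text) {
        // naive tokenization: lower-case and split on non-word characters
        for (String token : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    Set<Integer> search(String token) {
        return index.getOrDefault(token.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(1, "Elasticsearch is built on top of Lucene");
        idx.add(2, "Lucene stores an inverted index in segments");
        System.out.println(idx.search("lucene")); // documents 1 and 2
    }
}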
Elasticsearch: a web server on top of Lucene …
Yes, Elasticsearch can be considered a web server built on top of the Lucene library, or even a document-oriented database. It provides a number of important capabilities that are missing from the library itself, such as:
- clustering: Elasticsearch provides a robust mechanism to build a cluster of Elasticsearch instances for scalability and high availability
- JSON-based REST API
- caching
- a lot more …
An index in Elasticsearch is distributed in one or more primary shards and zero or more replica shards. In effect, an Elasticsearch shard resides on a node from the Elasticsearch cluster and corresponds to a Lucene index:
An index in Elasticsearch may not have a field mapping (schema) defined explicitly. In that case Elasticsearch tries to deduce one automatically (dynamic mapping). A field may also have multiple types associated with it at the same time (e.g. text and keyword).
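As an illustration, here is how an index with explicit settings and a mapping could be created using the low-level Java REST client (assuming Java 15+ text blocks; the articles index and title field are made up for the example). The title field is mapped as text and, via a multi-field, also as keyword:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CreateIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/articles");
            request.setJsonEntity("""
                {
                  "settings": { "number_of_shards": 1, "number_of_replicas": 1 },
                  "mappings": {
                    "properties": {
                      "title": {
                        "type": "text",
                        "fields": { "raw": { "type": "keyword" } }
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}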
Every document returned by a search is scored to determine how relevant it is to the search query. Earlier versions of Elasticsearch (prior to 5.0) used the tf-idf algorithm to calculate relevance scores, while later versions use the Okapi BM25 algorithm by default.
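In its common form, BM25 scores a document D against a query Q roughly as follows:

score(D, Q) = Σ over query terms q of: IDF(q) * tf(q, D) * (k1 + 1) / (tf(q, D) + k1 * (1 - b + b * |D| / avgdl))

where tf(q, D) is the frequency of term q in D, |D| is the length of the document, avgdl is the average document length in the index, and k1 and b are tunable parameters (1.2 and 0.75 by default in Lucene).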
Elasticsearch is designed with clustering in mind. It tries to balance the number of shards across the nodes in a cluster so that load is distributed evenly. Even replica shards may participate in search queries instead of only providing high availability. The shard for a document is determined by a simple hash function on the document routing key (which is the document ID by default):
shard = hash(routing_key) % number_of_primary_shards
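As a small illustration, the sketch below routes a few document IDs to shards. It uses String.hashCode purely for demonstration (Elasticsearch actually applies a murmur3 hash to the routing key), but the principle is the same:

import java.util.List;

public class ShardRoutingSketch {

    // Illustration only: Elasticsearch uses a murmur3 hash of the routing key,
    // not String.hashCode(), but the routing principle is identical.
    static int shardFor(String routingKey, int numberOfPrimaryShards) {
        return Math.floorMod(routingKey.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        for (String id : List.of("doc-1", "doc-2", "doc-3")) {
            System.out.printf("%s -> shard %d%n", id, shardFor(id, 3));
        }
        // Changing the number of primary shards changes the result of the modulo,
        // which is why it cannot be changed for an existing index without reindexing.
    }
}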
There are two options to add a new node to the cluster: either using a multicast address (available only in older versions) or unicast, with a list of one or more existing nodes in the cluster, which is the mechanism used by recent versions.
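With unicast discovery, recent versions of Elasticsearch are pointed at a list of seed hosts in elasticsearch.yml. A sketch with made-up host names:

# elasticsearch.yml (7.x and later)
discovery.seed_hosts: ["es-node-1:9300", "es-node-2:9300"]
# only needed once, when bootstrapping a brand new cluster
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]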
Potential conflicts between concurrent writes are handled by means of optimistic locking: the request explicitly specifies the version of the document expected to be currently in the index, and if that is not the case the operation fails. Traditional relational databases, in contrast, also implement pessimistic locking, where certain rows or tables can be locked to prevent unexpected modifications. This is not the case with Elasticsearch: we cannot lock an index or parts of it during a write request.
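Recent versions of Elasticsearch express this optimistic concurrency control through sequence numbers and primary terms rather than a plain version number. A sketch, assuming the same RestClient instance (client) as in the earlier example and made-up values:

// The write succeeds only if the current document still has this sequence
// number and primary term; otherwise Elasticsearch responds with 409 Conflict.
Request update = new Request("PUT", "/articles/_doc/1?if_seq_no=5&if_primary_term=1");
update.setJsonEntity("{ \"title\": \"Updated title\" }");
client.performRequest(update);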
Some general recommendations related to the number of shards and the size of an index are:
- too few shards introduce a scalability bottleneck
- too many shards introduce performance and management overhead
- the number of primary shards should be determined based on upfront planning
- putting large amounts of data in a single index should be avoided: if that is required, the index should be split into monthly/weekly/daily indices so that the size of each index ideally stays between 5 and 10 GBs of data
- aliases should be used to reference indices as much as possible (see the sketch below)
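Aliases let applications use a stable name while the underlying indices rotate. A minimal sketch, again assuming the RestClient instance from above and made-up index names:

// Search against the stable "logs" alias while the data lives in time-based indices.
Request alias = new Request("POST", "/_aliases");
alias.setJsonEntity("""
    {
      "actions": [
        { "add": { "index": "logs-2023-01", "alias": "logs" } }
      ]
    }
    """);
client.performRequest(alias);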
How are requests processed in an Elasticsearch cluster?
Let’s first see how an index request is processed:
- the index request is sent to a coordinating node in the cluster
- the coordinating node routes the request to the primary shard that owns the document (based on the routing formula shown earlier)
- by default the shard does not write the document to disk immediately (making it immediately searchable can be forced with a request parameter, see the sketch after this list) but to two areas: the in-memory indexing buffer and the transaction log (translog)
- the indexing buffer is then periodically written out as new Lucene segments and the translog is flushed to disk
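The parameter in question is the refresh parameter of the index API. A sketch, using the same RestClient instance and made-up index name as before:

// refresh=true forces a refresh after the write, so the document becomes
// searchable immediately instead of waiting for the periodic refresh.
Request index = new Request("PUT", "/articles/_doc/1?refresh=true");
index.setJsonEntity("{ \"title\": \"Hello Lucene\" }");
client.performRequest(index);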
Now let’s see how a search request is processed:
- a search request is processed in two phases: query and fetch
- during the query phase the search request is forwarded by the coordinating node to the relevant shards, and each shard returns the IDs and scores of its matching documents
- during the fetch phase the coordinating node retrieves the actual documents for the top hits from those shards, aggregates the results and returns them back to the client (see the example query below)
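For example, a simple full-text query against the hypothetical articles index from the earlier sketches goes through exactly these two phases:

// Uses the same RestClient instance as before, plus org.apache.http.util.EntityUtils.
Request search = new Request("GET", "/articles/_search");
search.setJsonEntity("{ \"query\": { \"match\": { \"title\": \"lucene\" } } }");
Response response = client.performRequest(search);
System.out.println(EntityUtils.toString(response.getEntity()));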
Modules here, modules there, modules everywhere …
An Elasticsearch node is comprised internally of different modules. Earlier versions of Elasticsearch used a modified version of the Google Guice library for dependency injection, although the latest versions are gradually moving away from it. Modules were bound to a Guice binder, effectively enabling them to be injected and used wherever needed:
// b is a Guice binder
modules.add(b -> {
    b.bind(Node.class).toInstance(this);
    b.bind(NodeService.class).toInstance(nodeService);
    b.bind(NamedXContentRegistry.class).toInstance(xContentRegistry);
    b.bind(PluginsService.class).toInstance(pluginsService);
    b.bind(Client.class).toInstance(client);
    b.bind(NodeClient.class).toInstance(client);
    b.bind(Environment.class).toInstance(this.environment);
    b.bind(ThreadPool.class).toInstance(threadPool);
    b.bind(NodeEnvironment.class).toInstance(nodeEnvironment);
    // …
});
Some core modules are:
- discovery and cluster formation: responsible for node discovery and for forming the cluster, including master election
- HTTP: for the HTTP REST API
- plugins: for managing the Elasticsearch plug-ins
- thread pools: thread pools used internally by Elasticsearch
- transport: communication layer for the Elasticsearch nodes
The evolving codebase …
The open source version of Elasticsearch can be cloned from the official GitHub repo. Each version has a corresponding tag (e.g. v8.4.3) and there are also branches for the minor versions (e.g. 8.5). The code is well structured and easy to understand. Here are some of the root folders of the repo:
- client: implementation of the low-level and high-level Java REST clients
- distribution: Gradle build scripts for building the various distributions (e.g. RPMs)
- docs: official Elasticsearch documentation organized in asciidoc format
- server: core Elasticsearch application, contains built-in core modules
- x-pack: implementation of the X-Pack extension
- plugins: additional plug-ins that are part of the Elasticsearch distribution
- modules: additional modules implementing Elasticsearch functionality
To understand how Elasticsearch boots up, you can start from the org.elasticsearch.node.Node#start() method.
Conclusion
We did a brief deep dive into how Elasticsearch works. As with any software project, the truth is in the code, so you can check out the Elasticsearch repo and analyse certain parts of it, e.g. particular modules. This is particularly useful if you find yourself in a situation where you need to understand how something works and it is not quite clear from the documentation, or if you need to write an Elasticsearch plugin and cannot find a good reference example.